Zulip Chat Archive

Stream: lean4

Topic: automatic spelling generation & comparison

Robin Arnez (Mar 14 2025 at 20:22):

I've made a tool that automatically can generate the expected spelling of a theorem (according to the naming convention) and compare it with the existing name. The program can find keywords (= name parts that are only contained in the existing names and not in the generated names), count the amount of exactly correct names, names that are a sublist (List.isSublist) of the generated name and names that are a sublist after disregarding keywords. Then it also collects a list of those that are a subset after disregarding keywords and finally a list of everything that didn't fit.
Here is the code for anyone wondering:
Spelling.lean
And here is the (current) list of suggested spellings:
suggested_spellings.txt
The program doesn't completely capture some details of naming which is why there are actually many that don't actually match even with non-keyword subset match.
But still it produced some interesting results:

Of the 15367 total analyzed theorems, 898 match the generated spelling exactly
Of the remaining, 9489 (the majority!) are a sublist of the generated spelling
3072 of the remaining are a sublist if you ignore keywords
446 of the remaining are a subset after ignoring keywords and
1462 are left without matching the generated spelling
The most common keywords are: self (693 theorems), left (483), right (471), congr (230), pos (208), forall (157, really a technicality), inj (132), comm (112), def (110), cancel (76), assoc (62), trans (61), distrib (60), nonneg (56), ext (47), nonpos (42), refl (40), fallback (30), apply (27), antisymm (27), symm (23), coe (23), total (18), irrefl (18), asymm (18), fun (17)
Some keywords that probably shouldn't be: toUISize, uSize8ToBitVec, equivWithCapacity, forin, isValidUInt32 (maybe), ofBoolUsing, updateAtKey (internal stuff), bnot, plus (maybe), fixFEq, int8toBitVec (maybe), existsOfSubtype, negOne, mul2, bex, ball, boole, geq
Some incorrect names: Bool.and_not_iff_right_iff_imp, Bool.not_or_iff_left_iff_imp, Bool.or_not_iff_right_iff_imp, Bool.not_and_iff_left_iff_imp, Array.beq_empty_iff + List.nil_beq_iff (eq), Array.elem_cons_self (push), Array.filterMap_eq_nil_iff + other stuff with Array.<...>nil<...> (empty), Array.getD_get?_setIfInBounds (getElem?), BitVec.getLsbD_ge + friends (of_ge), BitVec.getLsbD_rotateLeftAux_of_le (lt), BitVec.lt_of_getLsb?_eq_some + BitVec.getLsb?_ge (getElem?), List.exists_mem_nil (not ...), Vector.toList_empty + Vector.toList_zipWith (toArray), Vector.toList_eq_empty_iff (nil), imp.swap (namespace??), Nat.ge_two_pow_implies_high_bit_true, Nat.ne_zero_implies_bit_true
Names in wrong order: Option.guard_isSome, Option.choice_isSome_iff_nonempty, Option.isSome_filter_of_isSome, Nat.one_isPowerOfTwo, Nat.and_pow_two_sub_one_eq_mod + others (two_pow), List.head?_isSome, Decidable.not_and_iff_or_not_not (not_or_not), List.filter_length_eq_length, String.Pos.zero_addString_byteIdx

There are probably more examples but these are a few that I could find by looking through them for a while.

Michael Rothgang (Mar 14 2025 at 21:37):

Just yesterday, I was talking with @Vlad Tsyrklevich at Leaning in's conference dinner that this would be a useful tool. Wow, I'm really excited that this tool exists! Thanks a lot for working on this!

Johan Commelin (Mar 17 2025 at 06:04):

Great work! Ultimately, it would be great if this could become a syntax linter that provides feedback right as you write your declarations.

Michael Rothgang (Mar 17 2025 at 15:50):

In the short term/in addition, I was thinking that it could be nice to

extract the logic of "check a single declaration"
add two commands, #suggest_verbose_name (which creates suggestions as it currently does, with the understanding that the naming convention may allow/prefer slightly different names) and #check_declaration_name which verifies that the current name is a subset of the verbose one

Michael Rothgang (Mar 17 2025 at 15:50):

@Robin Arnez Is that something you'd like to work on?

Michael Rothgang (Mar 17 2025 at 15:52):

I was also wondering: to run this in mathlib, we presumably need to add lots of recommended_spelling: do we have a list of all notation which needs this? How difficult would it be to generate one?

Robin Arnez (Mar 17 2025 at 17:10):

I might work on this more and yeah, suggesting names and checking them are probably the most important use cases. I'm not really sure if this makes as much sense in mathlib because I'm pretty sure the conventions are a bit different there (e.g. it's called add_right_injective instead of injective_add_right). Regarding recommended_spelling: For most operations it's probably fine to use the lower-case declaration name as a notation name (e.g. we have HasSubset.Subset => subset). The main problem is the question of how we order the parameters, e.g. take HasSubset.Subset x y which should be spelled x_subset_y instead of subset_x_y, so that's something that would need to be denoted somewhere when necessary. Additionally, a challenge is special spellings, e.g. singleton for [a]. Ultimately, we might also want a more complex algorithm for correctness than just subset or sublist to make sure we allow everything that is okay according to the naming convention (e.g. bitVecToNat instead of toNat).

Michael Rothgang (Mar 17 2025 at 18:27):

I would love to have a tool for helping with the mathlib naming convention: this would be great for authors and reviewers alike, and could help to gradually fix the (relatively few) inconsistencies mathlib names have.

Michael Rothgang (Mar 17 2025 at 18:28):

If your tool is that or could be tweaked to become such a tool, that's even nicer (no need to duplicate work). My understanding is that core/stdlib naming and mathlib are supposed to be compatible/augmenting each other.

Bryan Gin-ge Chen (Mar 17 2025 at 18:41):

This also looks useful for tactics like to_additive that generate new declarations from old ones! Let's definitely try to get this into mathlib (or batteries maybe?).

Kim Morrison (Mar 18 2025 at 03:21):

Robin Arnez said:

Names in wrong order: Option.guard_isSome, Option.choice_isSome_iff_nonempty, Option.isSome_filter_of_isSome, Nat.one_isPowerOfTwo, Nat.and_pow_two_sub_one_eq_mod + others (two_pow), List.head?_isSome, Decidable.not_and_iff_or_not_not (not_or_not), List.filter_length_eq_length, String.Pos.zero_addString_byteIdx

These are fixed in lean#7541.

Kim Morrison (Mar 18 2025 at 03:46):

Robin Arnez said:

Some incorrect names: Bool.and_not_iff_right_iff_imp, Bool.not_or_iff_left_iff_imp, Bool.or_not_iff_right_iff_imp, Bool.not_and_iff_left_iff_imp, Array.beq_empty_iff + List.nil_beq_iff (eq), Array.elem_cons_self (push), Array.filterMap_eq_nil_iff + other stuff with Array.<...>nil<...> (empty), Array.getD_get?_setIfInBounds (getElem?), BitVec.getLsbD_ge + friends (of_ge), BitVec.getLsbD_rotateLeftAux_of_le (lt), BitVec.lt_of_getLsb?_eq_some + BitVec.getLsb?_ge (getElem?), List.exists_mem_nil (not ...), Vector.toList_empty + Vector.toList_zipWith (toArray), Vector.toList_eq_empty_iff (nil), imp.swap (namespace??), Nat.ge_two_pow_implies_high_bit_true, Nat.ne_zero_implies_bit_true

and these are corrected in lean#7542.

Kim Morrison (Mar 18 2025 at 03:47):

@Robin Arnez, I'd really encourage you to keep working on this tool! The list you provided was entirely correct (i.e. everything was incorrect) and I'd happily have either more manual iterations, or access to an automatic tool or linter.

Jeremy Tan (Mar 18 2025 at 04:25):

But we wouldn't want to replace Std.DTreeMap.Internal.Impl.toListModel_updateCell with the recommended Std.DTreeMap.Internal.Impl.toListModel_impl_updateCell_eq_filter_compare_fst_beq_gt_toListModel_append_toList_inner_findCell_toListModel_compare_append_filter_compare_fst_beq_lt_toListModel_of_ordered_of_balanced, right?

Jeremy Tan (Mar 18 2025 at 04:32):

(I know to use common sense here. I'm saying that the majority 9489 lemmas whose names are pure sublists of the autogenerated name, and perhaps others, don't really need renaming)

Bhavik Mehta (Mar 18 2025 at 04:35):

It looks like the pattern for refl/symm/trans lemmas isn't in there, eg these from your suggestions list

Setoid.refl: equiv
Setoid.symm: equiv_of_equiv
Setoid.trans: equiv_of_equiv_of_equiv

(there's a slight subtlety in that some of these are called comp rather than trans), but trans alone accounts for 20 non-matches

Robin Arnez (Mar 18 2025 at 15:44):

trans is treated as a keyword so it doesn't actually show up as an incorrect spelling (e.g. eq_trans is treated as eq which is a subset of eq_of_eq_of_eq, so that's okay). However, that means these are not checked which is an issue on its own.

Robin Arnez (Mar 18 2025 at 15:48):

Jeremy Tan schrieb:

But we wouldn't want to replace Std.DTreeMap.Internal.Impl.toListModel_updateCell with the recommended Std.DTreeMap.Internal.Impl.toListModel_impl_updateCell_eq_filter_compare_fst_beq_gt_toListModel_append_toList_inner_findCell_toListModel_compare_append_filter_compare_fst_beq_lt_toListModel_of_ordered_of_balanced, right?

Yeah, of course not. These should be more treated as suggestions as actual name candidates which is why we check for sublist/subset instead of exactly the same name.

Robin Arnez (Mar 18 2025 at 16:49):

Also I actually made my own PR: lean4#7499

Kim Morrison (Mar 18 2025 at 22:21):

Robin Arnez said:

Also I actually made my own PR: lean4#7499

Apologies that I'd missed that. So many PRs. :-)

Last updated: May 02 2025 at 03:31 UTC