Zulip Chat Archive

Stream: mathlib4

Topic: Let's catch typos!


Kenny Lau (Jul 08 2025 at 12:56):

I stumbled across a typo which I fixed in #26893, where tmul was misspelled as tuml. Do we have programmers here who can analyse the whole of Mathlib to identify potential typos? Checking against a dictionary won't work, but I thought maybe we could do a frequency analysis, tokenising on each "word" so that both Tmul and tmul are matched (because we have two different case conventions), and looking at, say, the bottom 1% of the frequencies. Is this a good algorithm? I might even write it myself.
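
Something like this rough sketch is what I have in mind (untested; the tokenisation regex and the 1% cutoff are just placeholders):

import re
from collections import Counter
from pathlib import Path

# Sketch of the idea: pool tokens case-insensitively (so Tmul and tmul count
# together), then look only at the rarest ones, where misspellings like
# "tuml" should surface.
counts = Counter(
    token.lower()
    for path in Path("Mathlib").rglob("*.lean")
    for token in re.findall(r"[A-Z]?[a-z]+", path.read_text(encoding="utf-8"))
)
rarest = sorted(counts.items(), key=lambda kv: kv[1])[: max(1, len(counts) // 100)]
for token, n in rarest:
    print(n, token)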

Jireh Loreaux (Jul 08 2025 at 13:17):

I think a better idea is a linter that auto-suggests (or auto-verifies) the naming convention. There's already an implementation, I think.

Damiano Testa (Jul 08 2025 at 13:19):

There is also a suspicious G (vs g) in docs#Ideal.pointwise_smul_toAddSubGroup.

Damiano Testa (Jul 08 2025 at 13:21):

docs#Nat.tensto_primeCounting and docs#Nat.tendsto_primeCounting', where the latter is trying to avoid clashing with the intended name of the former!

Kenny Lau (Jul 08 2025 at 13:22):

Jireh Loreaux said:

I think a better idea is a linter that auto-suggests (or auto-verifies) the naming convention. There's already an implementation, I think.

are there any results?

Damiano Testa (Jul 08 2025 at 13:23):

docs#Submodule.map_iSup_comap_of_sujective. I might stop here, but this is fun!

Kenny Lau (Jul 08 2025 at 13:23):

did you catch this yourself over time, or is this from a program?

Damiano Testa (Jul 08 2025 at 13:23):

I just looked for them after you mentioned this.

Kenny Lau (Jul 08 2025 at 13:24):

but how do you know where to look?

Damiano Testa (Jul 08 2025 at 13:24):

I did a poor man's tokenization:

git ls-files 'Mathlib/*.lean' | xargs grep -o "\(theorem\|lemma\) [^ ]*" | awk '{print $2}' | awk -F_ '{for(i=1;i<=NF;i++) {print $i}}' | awk -F. '{for(i=1;i<=NF;i++) {print $i}}' | sort | uniq -c | sort -n -k1 -r | grep "^ *1 " | less

and scanned through the output (the pipeline extracts every theorem/lemma name, splits it on underscores and dots, counts the resulting pieces, and keeps the ones that occur exactly once).

Bryan Gin-ge Chen (Jul 08 2025 at 13:24):

Not sure if Jireh is also referring to this, but I still think it'd be worth continuing the work started in #lean4 > automatic spelling generation & comparison @ 💬 and making it into a linter.

Kenny Lau (Jul 08 2025 at 13:25):

Damiano Testa said:

I did a poor man's tokenization:

git ls-files 'Mathlib/*.lean' | xargs grep -o "\(theorem\|lemma\) [^ ]*" | awk '{print $2}' | awk 'BEGIN{FS="_"} {for(i=1;i<=NF;i++) {print $i}}' | awk 'BEGIN{FS="."} {for(i=1;i<=NF;i++) {print $i}}' | sort | uniq -c | sort -n -k1 -r | grep "^ *1" | less

and scanned through the output.

oh wow, I was about to spend like an hour writing the program, and you saved me the trouble with a one-liner!

Damiano Testa (Jul 08 2025 at 13:25):

Btw, I agree that a linter would be better (as would an autoformaliser)!!

Damiano Testa (Jul 08 2025 at 13:26):

Kenny Lau said:

oh wow, I was about to spend like an hour writing the program, and you saved me the trouble with a one-liner!

It is a pretty long line...

Damiano Testa (Jul 08 2025 at 13:33):

docs#NumberField.ComplexEmbedding.IsConj.coe_stabilzer_mk

Damiano Testa (Jul 08 2025 at 13:36):

docs#Nat.smmoothNumbers_eq_factoredNumbers_primesBelow

Jireh Loreaux (Jul 08 2025 at 13:36):

Bryan Gin-ge Chen said:

Not sure if Jireh is also referring to this, but I still think it'd be worth continuing the work started in #lean4 > automatic spelling generation & comparison @ 💬 and making it into a linter.

Yep, that's the one I was thinking of.

Damiano Testa (Jul 08 2025 at 13:38):

I remember that a while back I went and fixed typos in authors' names in files, but they have crept back in since.

Damiano Testa (Jul 08 2025 at 13:48):

docs#Lagrange.nodalWeight_eq_eval_nodal_derative
docs#Polynomial.divByMonic_add_X_sub_C_mul_derivate_divByMonic_eq_derivative

Robin Arnez (Jul 08 2025 at 16:38):

Oh, should I repeat lean4#9175 on all of mathlib :-)?

Kenny Lau (Jul 08 2025 at 16:40):

@Robin Arnez yes please :melt: did you write a program to do that? does it use the algorithm I suggested above?

Robin Arnez (Jul 08 2025 at 16:40):

Oh well it's for typos

Robin Arnez (Jul 08 2025 at 16:40):

Something like docs#PresheafOfModules.Derivation.Universal.desc

Robin Arnez (Jul 08 2025 at 16:40):

(doc-string)

Kenny Lau (Jul 08 2025 at 16:40):

yes but how are you searching for them?

Robin Arnez (Jul 08 2025 at 16:41):

I use a program to generate a list of "tokens" (lowercase words)

Robin Arnez (Jul 08 2025 at 16:42):

The code for anyone interested:

import Mathlib

-- Advance past a URL-like token, stopping at ')', a newline, or a space.
partial def endLink (s : String) (p : String.Pos) : String.Pos :=
  if h : s.atEnd p then p
  else
    let c := s.get' p h
    if c = ')' ∨ c = '\n' ∨ c = ' ' then
      p
    else
      endLink s (p + c)

-- Split a string into lowercase word tokens at camelCase boundaries and
-- non-letter characters; URL-like substrings are kept as single tokens.
partial def splitStringCamelCase (s : String) : Array String :=
  go s 0 0 #[]
where
  go (s : String) (p1 p : String.Pos) (a : Array String) : Array String :=
    if h : s.atEnd p then
      let part := (s.extract p1 p).toLower
      if part.isEmpty then a else a.push part
    else
      let c := s.get' p h
      if c.isLower then
        if c = 'h' ∧ s.substrEq (p + c) "ttp" 0 3 then
          let pos := endLink s (p + c)
          go s pos pos (a.push (s.extract p pos))
        else
          let c' := s.get? (p + c)
          if c'.any (·.isUpper) then
            -- boundary
            let part := (s.extract p1 (p + c)).toLower
            go s (p + c) (p + c) (a.push part)
          else
            go s p1 (p + c) a
      else if c.isUpper then
        let c' := s.get? (p + c)
        if c'.any (·.isUpper) && (s.get? (p + c + c'.get!)).any (·.isLower) then
          -- boundary
          let part := (s.extract p1 (p + c)).toLower
          go s (p + c) (p + c) (a.push part)
        else
          go s p1 (p + c) a
      else
        let part := (s.extract p1 p).toLower
        let a := if part.isEmpty then a else a.push part
        go s (p + c) (p + c) a

def extractNameComponents (n : Lean.Name) : Array String :=
  let n := Lean.privateToUserName n
  let n := n.eraseMacroScopes
  go n #[]
where
  go (n : Lean.Name) (a : Array String) : Array String :=
    match n with
    | .anonymous => a
    | .str root s => go root (splitStringCamelCase s ++ a)
    | .num root _ => go root a

/--
info: #["ab", "cd", "ex", "facing", "only", "problems", "without", "solutions", "oh", "this", "works", "too", "and", "yeah",
  "i", "don", "t"]
-/
#guard_msgs in
#eval extractNameComponents `abCDExFacingOnlyPROBLEMSWithoutSOLUTIONS_oh_this_works.tooAndYeah_i_don't

def countTerm (n : String) : StateRefT (Std.HashMap String Nat) IO Unit :=
  modify fun m => m.alter n fun | none => some 1 | some a => some (a + 1)

def countName (n : Lean.Name) : StateRefT (Std.HashMap String Nat) IO Unit :=
  forM (extractNameComponents n) countTerm

-- Count word tokens from every constant name and every doc-string in the environment.
def countEnvironment (env : Lean.Environment) : StateRefT (Std.HashMap String Nat) IO Unit := do
  for (n, _) in env.constants.map₁ do
    countName n
  for i in [0:env.header.modules.size] do
    for (_, n) in Lean.docStringExt.getModuleEntries env i do
      countName (Lean.Name.mkStr1 n)

def findSource (env : Lean.Environment) (s : String) : Lean.Name := Id.run do
  for (n, _) in env.constants.map₁ do
    if s ∈ extractNameComponents n then
      return n
  for i in [0:env.header.modules.size] do
    for (s1, n) in Lean.docStringExt.getModuleEntries env i do
      if s ∈ extractNameComponents (Lean.Name.mkStr1 n) then
        return s1
  return .anonymous

def findUncommonVocabulary : StateRefT (Std.HashMap String Nat) IO (Array (String × Nat)) := do
  let a ← get
  return (a.toArray.qsort fun a b => a.2 < b.2 ∨ (a.2 = b.2 ∧ a.1 < b.1))

def mainEntry : Lean.CoreM (Array (String × Nat)) := do
  let env ← Lean.getEnv
  let act := do
    countEnvironment env
    findUncommonVocabulary
  act.run' (s := ∅)

#eval return repr (← mainEntry)

Kenny Lau (Jul 08 2025 at 16:44):

oh wow, writing a Lean program to do it was definitely not on my todo list, I was planning to use Python XD

Robin Arnez (Jul 08 2025 at 16:45):

The most common tokens are

  1. "of" (170815 times)
  2. "is" (124041 times)
  3. "the" (116252 times)
  4. "a" (94516 times)
  5. "proof" (81658 times)
  6. "eq" (77826 times)
  7. "theory" (77466 times)
  8. "to" (76558 times)
  9. "lean" (73890 times)
  10. "map" (72755 times)
  11. "category" (67646 times)
  12. "simp" (55755 times)
  13. "add" (46543 times)
  14. "inst" (43220 times)
  15. "std" (37493 times)
  16. "iff" (36896 times)
  17. "with" (36894 times)
  18. "on" (35264 times)
  19. "set" (34117 times)
  20. "match" (33812 times)

Damiano Testa (Jul 08 2025 at 16:46):

The lemmas that I mentioned above are in #26904.

Robin Arnez (Jul 08 2025 at 16:46):

Note that many of these are because of internal declarations like _simp_1, eq_1, _proof_1_1, match_1

Robin Arnez (Jul 08 2025 at 16:49):

Oh docs#SimpleGraph.compononentComplFunctor_nonempty_of_infinite

Aaron Liu (Jul 08 2025 at 16:50):

"the" and "a" are somewhat unexpected

Robin Arnez (Jul 08 2025 at 16:50):

including doc-strings

Kenny Lau (Jul 08 2025 at 16:51):

Thanks for the top-20 list; however, I imagine the gold would actually be in the bottom 20 :melt:

Robin Arnez (Jul 08 2025 at 16:54):

Well,

  1. "aaa"
  2. "aabbaa"
  3. "ababacabac"
  4. "abba"
  5. "abbabb"
  6. "abbabba"
  7. "abbabbabba"
  8. "abd"
  9. "abelianizations"
  10. "abi"
  11. "aborted"
  12. "absolyte"
  13. "absorb"
  14. "absorp"
  15. "abstractional"
  16. "abused"
  17. "acb"
  18. "accb"
  19. "accomplish"
  20. "accounts"

All of these occur exactly once

Kenny Lau (Jul 08 2025 at 16:55):

@Damiano Testa wait what's the script to deprecate?

Kenny Lau (Jul 08 2025 at 16:56):

i've been doing it by hand :melt:

Robin Arnez (Jul 08 2025 at 16:56):

I also have the full list for anyone interested

Kenny Lau (Jul 08 2025 at 16:56):

Thanks!

Kenny Lau (Jul 08 2025 at 16:57):

Where is ababacabac? I couldn't find it on Mathlib

Robin Arnez (Jul 08 2025 at 16:57):

Well, it's in Lean core

Robin Arnez (Jul 08 2025 at 16:57):

(remember, I used a Lean program to find these in all declarations)

Kenny Lau (Jul 08 2025 at 16:58):

https://github.com/leanprover/lean4/blob/6ad12525adc69fb402d488895e6d200424d49e4b/tests/lean/string_imp.lean#L20

Kenny Lau (Jul 08 2025 at 16:58):

aha found it XD

Aaron Liu (Jul 08 2025 at 16:58):

test file

Robin Arnez (Jul 08 2025 at 16:58):

Nah, you're looking for https://github.com/leanprover/lean4/blob/6ad12525adc69fb402d488895e6d200424d49e4b/src/Init/Data/String/Basic.lean#L614

Robin Arnez (Jul 08 2025 at 16:59):

docs#String.splitOn

Aaron Liu (Jul 08 2025 at 16:59):

I found two of them with git grep

Robin Arnez (Jul 08 2025 at 16:59):

It doesn't look through tests

Aaron Liu (Jul 08 2025 at 16:59):

exactly those two

Damiano Testa (Jul 08 2025 at 17:02):

Kenny Lau said:

Damiano Testa wait what's the script to deprecate?

It is scripts/add_deprecations.sh: it is somewhat buggy and actually edits your local files, so make sure to run it when you can easily revert the changes. It is more reliable when the diff with master consists only of renames of declarations; it misses progressively more deprecations as the diff grows.

Mario Carneiro (Jul 08 2025 at 17:07):

Robin Arnez said:

The most common tokens are

  1. "of" (170815 times)
  2. "is" (124041 times)
  3. "the" (116252 times)
  4. "a" (94516 times)
  5. "proof" (81658 times)
  6. "eq" (77826 times)
  7. "theory" (77466 times)
  8. "to" (76558 times)

my conclusion is that the most popular area of math is proof theory

Kenny Lau (Jul 08 2025 at 17:59):

I finally wrote a Python program to search through Mathlib.

The program: https://github.com/kckennylau/mathlib4/blob/typos/typos.py

# Put me in mathlib4
# Don't forget to add "typos.py", "typos.txt" and "typos-filename.txt" to .git/info/exclude

import os
import re
import sys

freq = dict()
first_seen = dict()

def record(filename, token):
    global freq, first_seen
    if token in freq:
        freq[token] += 1
    else:
        first_seen[token] = filename
        freq[token] = 1

def process(filename, file):
    tokens = re.findall("[A-Z]?[a-z]+", file)
    for token in tokens:
        token = token.lower()
        record(filename, token)

count = 0
for root, dirs, files in os.walk('Mathlib'):
    for file in files:
        filename = os.path.join(root, file)
        with open(filename, "r", encoding="UTF-8") as auto:
            count += 1
            if count % 100 == 0:
                sys.stdout.write("\r%d files processed."%count)
                sys.stdout.flush()
            process(filename, auto.read())

report ="%d files processed.\n"%count
report+="Number of distinct tokens: %d\n"%len(freq)
report+="Total number of tokens: %d\n\n"%(sum(freq[i] for i in freq))

print("\r"+report)

typos_name = open("typos-filename.txt", "w")
typos = open("typos.txt", "w")

typos_name.write(report)
typos.write(report)

keys = sorted(freq.keys(), key=lambda n:(freq[n],n))
for key in keys:
    typos_name.write("%s, %s, %s\n"%(key, freq[key], first_seen[key]))
    typos.write("(%s, %s),"%(key, freq[key]))

typos_name.close()
typos.close()

Kenny Lau (Jul 08 2025 at 17:59):

The outputs:

Kenny Lau (Jul 08 2025 at 18:01):

6607 files processed.
Number of distinct tokens: 17105
Total number of tokens: 11438980

It searches through every token of the form [A-Z]?[a-z]+
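
For example, here is how that pattern breaks up one of the names mentioned above (just an illustration, run in a Python REPL):

import re

# How the [A-Z]?[a-z]+ pattern splits a declaration name into pieces;
# each piece is then lowercased before counting.
re.findall("[A-Z]?[a-z]+", "Nat.tendsto_primeCounting'")
# -> ['Nat', 'tendsto', 'prime', 'Counting']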

Kenny Lau (Jul 08 2025 at 18:05):

I can spot quite a few typos already, I'll surely have fun correcting them on Thursday.

David Renshaw (Jul 09 2025 at 01:51):

Gemini 2.5 Pro seems to be pretty good at finding typos: mathlib4#26915, mathlib4#26916

Kenny Lau (Jul 26 2025 at 21:20):

#27519

Kenny Lau (Jul 26 2025 at 21:20):

I started experimenting with using GPT to find and correct typos

Kenny Lau (Jul 26 2025 at 21:21):

I'm not sure what the proper procedure is, so I just included the company and the model's name in the commit message

Kenny Lau (Jul 27 2025 at 09:45):

I'll post the PRs in #PR reviews > Catching typos: a continuous thread

Kenny Lau (Jul 27 2025 at 11:22):

image.png

:eyes:

Damiano Testa (Aug 11 2025 at 11:13):

Does docs#Zlattice.FG (vs ZLattice.everything_else) count as a typo?

Kenny Lau (Aug 11 2025 at 11:13):

sure

Damiano Testa (Aug 11 2025 at 11:18):

#28228

Harald Husum (Aug 20 2025 at 09:23):

I've made a bunch of typo-fixing PRs lately and thought I'd share my process for future reference.

I don't have much Lean experience and my day job is as a Python developer, so my go-to IDE is PyCharm. PyCharm has, as part of its code inspection suite, a proofreading tool that points out typos and grammar mistakes. It is this tool I use to find typos. I select a directory to focus on and inspect it repeatedly, each time adding new mathlib- or mathematics-specific words to the dictionary/whitelist, so that in the end I'm left with warnings mostly for true typos. As I process those, I also check across mathlib to see if a typo is repeated elsewhere.

This work has been a bit cumbersome, but has also been surprisingly effective in finding improvements, IMO.

A lesson is that proofreading tools need not be completely tailored to Lean before they can be effective in helping improve mathlib.

Thomas Murrills (Aug 20 2025 at 16:56):

(Aside, but this thread inspired me to PR fixes to a few typos in core I'd been collecting. :) lean4#10009)

Harald Husum (Aug 20 2025 at 19:17):

Not too long ago, I read this Quanta article on "emergent misalignment". That is, an LLM trained to solve specific tasks in harmful ways can become misaligned in unrelated domains, presumably because it somehow relates different harmful concepts to each other.

As a motivation for anyone considering fixing typos in publicly available Lean code, I wonder whether "emergent misalignment" could have a dual in the form of emergent alignment. That is, perhaps ensuring high-quality, nearly typo-free docstrings in Lean code used as training material for LLMs could help ensure that the LLMs associate Lean code with high standards. Then, perhaps the Lean code an LLM generates will be of higher quality than it otherwise would be as a result of this association, even if it isn't code quality itself that we're addressing with typo fixes.

I'm probably hoping for too much here, but I found the idea amusing.


Last updated: Dec 20 2025 at 21:32 UTC