Zulip Chat Archive

Stream: PhysLean

Topic: Spelling in PhysLean

Joseph Tooby-Smith (Oct 23 2025 at 09:43):

I made a low-tech linter for spelling documentation in PhysLean.

I basically made a list of all the words that appear in PhysLean to date, got an AI to remove the ones which where incorrectly spelled or manifestly do not correspond to code etc., thereby creating a list of allowed words. Then checked all the documentation agains this list of allowed words. Naturally any new words would have to be added to the allowed words list, and it doesn't deal with code snippets in documentation in any smart way.

The corresponding PR is PhysLean#781. If anyone has better ways we could 'lint' for spelling I would be very glad to hear them.

Kenny Lau (Oct 23 2025 at 09:46):

interesting, i was thinking about doing something similar for mathlib (also with the help of GPT) but it ended up very slow so i ended up only correcting 300 files; i think you'll just have to do it every month or so; did you make the corrections manually or with LLM? i was also wondering if there were ways to speed up the process

Joseph Tooby-Smith (Oct 23 2025 at 09:52):

I ended up doing the actual corrections manually, it took about 2 hours to do so wasn't so bad. Though I had to keep pasting words in to MS word to get suggestions for the correct spellings which was a bit of a pain.

I think if one keeps on top of it, it should be much quicker to do it in the future. But yeah, I am hoping there is a much quicker way to go about this. I think the main problem is the number of specialized words in maths/physics, so that a normal dictionary would not cover it.

Kenny Lau (Oct 23 2025 at 09:52):

yeah I just fed entire files to GPT so that it has enough context

Kenny Lau (Oct 23 2025 at 09:53):

I suppose you could isolate the errors and then feed like -2/+2 lines to GPT and have it correct the error

Joseph Tooby-Smith (Oct 24 2025 at 05:55):

Yeah - and then maybe a yes or no type thing to allow someone to easily allow or disallow a change :).

Joseph Tooby-Smith (Oct 24 2025 at 05:57):

Based on @Damiano Testa suggestion I have segregated the meta code for the spelling PR away from the 100s of spelling corrections. The meta code is in PhysLean#783.

Rein Zustand (Dec 29 2025 at 19:38):

The standard practice that I know is used in Zulip is via codespell, which detects for common typos. I caught 57 typos running the tool, 14 of which is in this PR: https://github.com/HEPLean/PhysLean/pull/866.

The codespell linter can be enabled via pre-commit (e.g. https://github.com/mesa/mesa/blob/e9054adbb9efb626d0d038fe8c5178d34338e1ab/.pre-commit-config.yaml#L27-L30), which is much faster than running it as GH Actions.

Rein Zustand (Dec 29 2025 at 19:42):

Note: I only included 14 to keep the PR small, not because the rest are false positives.

Joseph Tooby-Smith (Dec 30 2025 at 05:32):

Nice! I approved your PR, and merged it. What sort of configuration does codespell need? Is it in a sense 'ready to go', or does one have to teach it about Lean (say)?

Rein Zustand (Dec 30 2025 at 06:15):

This is the content of the .codespellignore that I used:

digitalize
statics
disjointness
lightYears
tne
hge
Breal
ket
rIn
FRO

codespell focuses on typical typos, instead of programmatically check the edit distance with all the English dictionary terms, its dictionary is in the form of

aaccess->access
aaccessibility->accessibility
aaccession->accession
aache->cache, ache,
aack->ack
aactual->actual
aactually->actually
aadd->add

(https://github.com/codespell-project/codespell/blob/main/codespell_lib/data/dictionary.txt)

And so, I only had to teach few PhysLean-specific variables that sound like typos.

Joseph Tooby-Smith (Dec 30 2025 at 10:36):

Great! This looks really nice, and it seems possible to use this as a get this setup as a GitHub workflow, right?

Rein Zustand (Dec 30 2025 at 10:47):

Yes, by committing a .pre-commit-config.yaml, and registering the repo in https://pre-commit.com/. Mathlib actually uses a hook, although not codespell: https://github.com/leanprover-community/mathlib4/blob/master/.pre-commit-config.yaml.

Joseph Tooby-Smith (Dec 30 2025 at 12:49):

Before merging your PR, I would like to get a better understanding of pre-commit. Do you know how it differs from GithHub workflows, sorry if I'm being a bit slow with regard to this.

Rein Zustand (Dec 30 2025 at 12:51):

It will show up in the list of checks just like with GH Actions (e.g. https://github.com/mesa/mesa/pull/3041 where the CI passed). There won't be a bot that noisily says "pre-commit passed/failed", no worries.

Rein Zustand (Dec 30 2025 at 12:53):

Upon checking Mathlib, they didn't use the pre-commit SaaS (https://github.com/leanprover-community/mathlib4/pull/33404). They seemed to only use pre-commit locally.

Alfredo Moreira-Rosa (Dec 30 2025 at 12:54):

pre-commit can run locally before you commit.
I consider pre-commit to be a bad idea because it blocks the local development workflow.
It's really better in my view to have linters in IDE and in CI/CD, but not pre-commits.

Rein Zustand (Dec 30 2025 at 12:59):

I consider pre-commit to be a bad idea because it blocks the local development workflow.

I personally have used the pre-commit config for CI purposes only in the past. But, that said, some found itthe local version to be useful. I was skimming past threads in this Zulip server by searching for keyword "pre-commit".

Joseph Tooby-Smith (Dec 30 2025 at 19:06):

@Rein Zustand Do you know if there is a similar way that we can do something like codespell without using pre-commit, or are they somehow intrinsically linked? I still don't have a good mental model of pre-commit, but I will try and get one over the next few days.

In the mean time, could we separate the spelling corrections in PhysLean#868 from the initalization of .pre-commit-config.yaml?

Rein Zustand (Dec 30 2025 at 19:34):

No, I meant, yes, codespell can be used with the usual GH Actions, it's just a CLI tool. In the past, in the repositories I maintained (e.g. https://github.com/mesa/mesa), I had looked for optimizing the run time for lightweight CLI tools, and found pre-commit CI to be much faster than GH Actions. No need for container initialization, setting up dependencies, etc.

Rein Zustand (Dec 30 2025 at 19:55):

In case if you wonder if using pre-commit CI might be a niche solution. It is actually used by several major FOSS projects:

Python-only in the list. There are Rust projects using it as well, but not necessarily the foundational projects like the ones listed above.

Alfredo Moreira-Rosa (Dec 30 2025 at 21:08):

I would really encourage the GH actions route. It's not that bad to have the CI take a litle more time for a lint feature and evoid a Python dependency when it's not necessary.

Rein Zustand (Dec 30 2025 at 21:18):

Using GH Actions about as long as the current Python-based linter in the CI (7s of which is due to installing Python). ~15s vs ~5s.

Joseph Tooby-Smith (Dec 31 2025 at 05:18):

Ok, given these times are much smaller compared to the Lean longer, I would think a GH action is probably the way to go for now (can always modify later). @Rein Zustand would you be able to make a PR implementing it as a GH action?

Rein Zustand (Dec 31 2025 at 11:10):

Changed to GH Actions: https://github.com/HEPLean/PhysLean/pull/868.

Last updated: Feb 28 2026 at 14:05 UTC