Zulip Chat Archive
Stream: mathlib4
Topic: Linter for large files?
Michael Rothgang (Jan 17 2024 at 17:57):
As I understand it, there's a general consensus that large files should be split up, as this makes adding to, refactoring and minimisation of files much easier. For instance:
- smaller files allow for better parallelism when building mathlib (in principle; depending on number of cores and possibly specifics of the import graphs)
- splitting files helps keeping imports in check
- improves editing times in a file (this will be helped by checking proofs in parallel)
- lower first build times when loading a file (AIUI, this will not)
- minimisation becomes easier
- large files tend to indicate opportunities for splitting or refactoring into less monolithic structures
- personally, it's easier to find things in a smaller file --- especially if the file is old and the main docstring only covers half the results there :-)
Every file has a tendency to grow over time (as more results are formalised :tada:); it's not clear to me if there are any mechanisms to prevent backsliding.
Should there be a lint for large files, say over 1000 or 1500 lines? We can add the current large files as exception and whittle them down over time. If a PR pushes a file over the limit, the lint can be allowed --- but hopefully, that's a tiny speed bump to nudge reconsidering the structure. (FWIW, this is what Rust also did.)
Michael Rothgang (Jan 17 2024 at 17:58):
This has been discussed before (example), and a style linter suggested. AFAICT, nobody wrote such a linter yet. Can we, as a first step, agree this would be good to have (and perhaps record this in an issue)?
Michael Rothgang (Jan 17 2024 at 18:00):
Some data: as of a316e2b352f84e46f07f318d51397264a29a5018
, mathlib has
- 134 files of 800--999 lines
- 54 files of 1000-1099 lines
- 30 files of 1100-1199 lines
- 27 files of 1200-1299 lines
- 29 files of 1300-1399 lines
- 22 files of 1400-1499 lines
- altogether: 162 files of 1000-1499 lines
- 61 files of 1500-1999 lines
- 29 files of 2000-2499 lines
- 13 files of 2500-2999 lines
- 5 files of 3000-3999 lines; one file larger than 4000 lines
Michael Rothgang (Jan 17 2024 at 18:01):
For reference, here is the list of files with more than 2000 lines.
4457 Mathlib/Data/List/Basic.lean
3992 Mathlib/Data/Finset/Basic.lean
3878 Mathlib/GroupTheory/Subgroup/Basic.lean
3373 Mathlib/Order/Filter/Basic.lean
3196 Mathlib/Data/Multiset/Basic.lean
3041 Mathlib/Data/Set/Basic.lean
2983 Mathlib/Topology/Order/Basic.lean
2865 Mathlib/RingTheory/PowerSeries/Basic.lean
2821 Mathlib/Computability/TuringMachine.lean
2815 Mathlib/Analysis/Normed/Group/Basic.lean
2757 Mathlib/Topology/Algebra/Module/Basic.lean
2728 Mathlib/CategoryTheory/Limits/Shapes/Pullbacks.lean
2716 Mathlib/Data/Real/ENNReal.lean
2715 Mathlib/Data/Matrix/Basic.lean
2675 Mathlib/Algebra/BigOperators/Basic.lean
2653 Mathlib/Combinatorics/SimpleGraph/Connectivity.lean
2593 Mathlib/SetTheory/Ordinal/Arithmetic.lean
2551 Mathlib/SetTheory/Cardinal/Basic.lean
2550 Mathlib/Data/Finset/Pointwise.lean
2410 Mathlib/RingTheory/Ideal/Operations.lean
2409 Mathlib/Data/Set/Lattice.lean
2395 Mathlib/Data/DFinsupp/Basic.lean
2367 Mathlib/Analysis/InnerProductSpace/Basic.lean
2362 Mathlib/MeasureTheory/Constructions/BorelSpace/Basic.lean
2306 Mathlib/Analysis/Asymptotics/Asymptotics.lean
2275 Mathlib/Topology/Separation.lean
2250 Mathlib/MeasureTheory/MeasurableSpace/Basic.lean
2240 Mathlib/Data/Finset/Lattice.lean
2231 Mathlib/Topology/Algebra/Group/Basic.lean
2227 Mathlib/Topology/MetricSpace/PseudoMetric.lean
2191 Mathlib/MeasureTheory/Measure/MeasureSpace.lean
2158 Mathlib/Algebra/MonoidAlgebra/Basic.lean
2148 Mathlib/MeasureTheory/Function/StronglyMeasurable/Basic.lean
2118 Mathlib/CategoryTheory/Limits/Shapes/Biproducts.lean
2111 Mathlib/Analysis/Calculus/ContDiff/Basic.lean
2110 Mathlib/GroupTheory/MonoidLocalization.lean
2096 Mathlib/Order/CompleteLattice.lean
2072 Mathlib/Topology/Basic.lean
2069 Mathlib/Computability/TMToPartrec.lean
2068 Mathlib/Order/Filter/AtTopBot.lean
2066 Mathlib/RingTheory/UniqueFactorizationDomain.lean
2065 Mathlib/Logic/Equiv/Basic.lean
2046 Mathlib/Data/Complex/Exponential.lean
2033 Mathlib/Analysis/NormedSpace/OperatorNorm.lean
2020 Mathlib/Data/Fin/Basic.lean
2019 Mathlib/Order/UpperLower/Basic.lean
2018 Mathlib/Topology/UniformSpace/Basic.lean
2010 Mathlib/MeasureTheory/Integral/Bochner.lean
Ruben Van de Velde (Jan 17 2024 at 18:04):
This seems like a good idea (as long as it's easy to override). Probably we mostly need someone to actually do it :)
Johan Commelin (Jan 17 2024 at 18:26):
I'm inclined to start with a limit of 1500 lines. We can decide later if really we want to push it down to 1000.
Jireh Loreaux (Jan 17 2024 at 18:26):
An "easy" split. ENNReal
should just get its own folder inside Data
Johan Commelin (Jan 17 2024 at 18:26):
1000 could still be an informal rule of thumb, with 1500 a hard limit (modulo prehistoric exceptions).
Michael Rothgang (Jan 17 2024 at 18:58):
Let me try to make stone soup: branch#MR-lint-large-files has a quick implementation of the lint, set to 1500 lines.
Michael Rothgang (Jan 17 2024 at 18:59):
I'd also prefer rewriting that in Lean, but I'd need help for this: where to start, where to look. (I haven't really programmed in Lean before.)
Michael Rothgang (Jan 17 2024 at 19:06):
(The first commit updates the exceptions; this is pre-existing. It seems we're missing the bot updating this file when entries are removed/it hasn't been ported yet.)
Ruben Van de Velde (Jan 17 2024 at 19:12):
If you put the current number of lines in the error, will we need to change the exceptions every time the length changes?
Yury G. Kudryashov (Jan 17 2024 at 19:15):
I'm going to move parts of Filter.Basic
to new files.
Michael Rothgang (Jan 17 2024 at 19:22):
Ruben Van de Velde said:
If you put the current number of lines in the error, will we need to change the exceptions every time the length changes?
Hm, bummer. Can we configure CI to do this? (Does it do this automatically already?)
Michael Rothgang (Jan 17 2024 at 19:23):
I'd prefer printing the number of lines to the user (to give some information) - the way update-style-lints works implies this is also printed to the file. I could remove the number.
Yury G. Kudryashov (Jan 17 2024 at 19:27):
Exceptions can give per-file limits on wc -l
Michael Rothgang (Jan 17 2024 at 19:29):
Submitted #9821 for updating the list of style exceptions (prior to my PR) and #9822 for the new lint.
Michael Rothgang (Jan 17 2024 at 19:29):
Yury G. Kudryashov said:
Exceptions can give per-file limits on
wc -l
I don't understand this sentence. What are you referring to? (I don't know mathlib's CI inside out.)
Yury G. Kudryashov (Jan 17 2024 at 19:35):
I don't know how does your linter work. I was trying to say that it can be implemented so that it verifies that non-exceptional files have ≤1500 lines and exceptional files have LOC ≤ the number in their entry in a csv, yaml, or json file.
Michael Rothgang (Jan 17 2024 at 20:13):
Jireh Loreaux said:
An "easy" split.
ENNReal
should just get its own folder insideData
Thanks for the idea! I just did this, branch#MR-split-ennreal has a split. A helping hand at adding module docs would be very welcome (feel free to just push to the branch directly).
Michael Rothgang (Jan 17 2024 at 20:16):
@Jireh Loreaux :point_up: Do you have some spare moment, by chance?
Jireh Loreaux (Jan 17 2024 at 20:29):
@Michael Rothgang I'll try to look in the next 30 minutes or so.
Michael Rothgang (Jan 17 2024 at 20:36):
Yury G. Kudryashov said:
I don't know how does your linter work. I was trying to say that it can be implemented so that it verifies that non-exceptional files have ≤1500 lines and exceptional files have LOC ≤ the number in their entry in a csv, yaml, or json file.
Good idea - adding a watermark so exceptional files cannot grow too much without acknowledgement. That goes beyond the current script, though.
Ruben Van de Velde (Jan 17 2024 at 20:56):
A simple rule that lands is probably better than a more complicated rule that doesn't. It's not like it's a strict limit
Michael Rothgang (Jan 17 2024 at 23:28):
It turns out there was a loophole in my argument - I found a very mild hack to add a watermark and updated the script. On errors, the script also prints the watermark, which means the update script also works with this. (I've tested this locally and on CI, by merging master and updating the output.)
Michael Rothgang (Jan 17 2024 at 23:28):
In other words: #9822 is ready for review.
Michael Rothgang (Jan 19 2024 at 14:31):
This PR has landed now. Thanks for the fast review!
Last updated: May 02 2025 at 03:31 UTC