Zulip Chat Archive

Stream: mathlib4

Topic: Linter for large files?


Michael Rothgang (Jan 17 2024 at 17:57):

As I understand it, there's a general consensus that large files should be split up, as this makes adding to, refactoring and minimisation of files much easier. For instance:

  • smaller files allow for better parallelism when building mathlib (in principle; depending on number of cores and possibly specifics of the import graphs)
  • splitting files helps keeping imports in check
  • improves editing times in a file (this will be helped by checking proofs in parallel)
  • lower first build times when loading a file (AIUI, this will not)
  • minimisation becomes easier
  • large files tend to indicate opportunities for splitting or refactoring into less monolithic structures
  • personally, it's easier to find things in a smaller file --- especially if the file is old and the main docstring only covers half the results there :-)

Every file has a tendency to grow over time (as more results are formalised :tada:); it's not clear to me if there are any mechanisms to prevent backsliding.

Should there be a lint for large files, say over 1000 or 1500 lines? We can add the current large files as exception and whittle them down over time. If a PR pushes a file over the limit, the lint can be allowed --- but hopefully, that's a tiny speed bump to nudge reconsidering the structure. (FWIW, this is what Rust also did.)

Michael Rothgang (Jan 17 2024 at 17:58):

This has been discussed before (example), and a style linter suggested. AFAICT, nobody wrote such a linter yet. Can we, as a first step, agree this would be good to have (and perhaps record this in an issue)?

Michael Rothgang (Jan 17 2024 at 18:00):

Some data: as of a316e2b352f84e46f07f318d51397264a29a5018, mathlib has

  • 134 files of 800--999 lines
  • 54 files of 1000-1099 lines
  • 30 files of 1100-1199 lines
  • 27 files of 1200-1299 lines
  • 29 files of 1300-1399 lines
  • 22 files of 1400-1499 lines
  • altogether: 162 files of 1000-1499 lines
  • 61 files of 1500-1999 lines
  • 29 files of 2000-2499 lines
  • 13 files of 2500-2999 lines
  • 5 files of 3000-3999 lines; one file larger than 4000 lines

Michael Rothgang (Jan 17 2024 at 18:01):

For reference, here is the list of files with more than 2000 lines.

    4457 Mathlib/Data/List/Basic.lean
    3992 Mathlib/Data/Finset/Basic.lean
    3878 Mathlib/GroupTheory/Subgroup/Basic.lean
    3373 Mathlib/Order/Filter/Basic.lean
    3196 Mathlib/Data/Multiset/Basic.lean
    3041 Mathlib/Data/Set/Basic.lean
    2983 Mathlib/Topology/Order/Basic.lean
    2865 Mathlib/RingTheory/PowerSeries/Basic.lean
    2821 Mathlib/Computability/TuringMachine.lean
    2815 Mathlib/Analysis/Normed/Group/Basic.lean
    2757 Mathlib/Topology/Algebra/Module/Basic.lean
    2728 Mathlib/CategoryTheory/Limits/Shapes/Pullbacks.lean
    2716 Mathlib/Data/Real/ENNReal.lean
    2715 Mathlib/Data/Matrix/Basic.lean
    2675 Mathlib/Algebra/BigOperators/Basic.lean
    2653 Mathlib/Combinatorics/SimpleGraph/Connectivity.lean
    2593 Mathlib/SetTheory/Ordinal/Arithmetic.lean
    2551 Mathlib/SetTheory/Cardinal/Basic.lean
    2550 Mathlib/Data/Finset/Pointwise.lean
    2410 Mathlib/RingTheory/Ideal/Operations.lean
    2409 Mathlib/Data/Set/Lattice.lean
    2395 Mathlib/Data/DFinsupp/Basic.lean
    2367 Mathlib/Analysis/InnerProductSpace/Basic.lean
    2362 Mathlib/MeasureTheory/Constructions/BorelSpace/Basic.lean
    2306 Mathlib/Analysis/Asymptotics/Asymptotics.lean
    2275 Mathlib/Topology/Separation.lean
    2250 Mathlib/MeasureTheory/MeasurableSpace/Basic.lean
    2240 Mathlib/Data/Finset/Lattice.lean
    2231 Mathlib/Topology/Algebra/Group/Basic.lean
    2227 Mathlib/Topology/MetricSpace/PseudoMetric.lean
    2191 Mathlib/MeasureTheory/Measure/MeasureSpace.lean
    2158 Mathlib/Algebra/MonoidAlgebra/Basic.lean
    2148 Mathlib/MeasureTheory/Function/StronglyMeasurable/Basic.lean
    2118 Mathlib/CategoryTheory/Limits/Shapes/Biproducts.lean
    2111 Mathlib/Analysis/Calculus/ContDiff/Basic.lean
    2110 Mathlib/GroupTheory/MonoidLocalization.lean
    2096 Mathlib/Order/CompleteLattice.lean
    2072 Mathlib/Topology/Basic.lean
    2069 Mathlib/Computability/TMToPartrec.lean
    2068 Mathlib/Order/Filter/AtTopBot.lean
    2066 Mathlib/RingTheory/UniqueFactorizationDomain.lean
    2065 Mathlib/Logic/Equiv/Basic.lean
    2046 Mathlib/Data/Complex/Exponential.lean
    2033 Mathlib/Analysis/NormedSpace/OperatorNorm.lean
    2020 Mathlib/Data/Fin/Basic.lean
    2019 Mathlib/Order/UpperLower/Basic.lean
    2018 Mathlib/Topology/UniformSpace/Basic.lean
    2010 Mathlib/MeasureTheory/Integral/Bochner.lean

Ruben Van de Velde (Jan 17 2024 at 18:04):

This seems like a good idea (as long as it's easy to override). Probably we mostly need someone to actually do it :)

Johan Commelin (Jan 17 2024 at 18:26):

I'm inclined to start with a limit of 1500 lines. We can decide later if really we want to push it down to 1000.

Jireh Loreaux (Jan 17 2024 at 18:26):

An "easy" split. ENNReal should just get its own folder inside Data

Johan Commelin (Jan 17 2024 at 18:26):

1000 could still be an informal rule of thumb, with 1500 a hard limit (modulo prehistoric exceptions).

Michael Rothgang (Jan 17 2024 at 18:58):

Let me try to make stone soup: branch#MR-lint-large-files has a quick implementation of the lint, set to 1500 lines.

Michael Rothgang (Jan 17 2024 at 18:59):

I'd also prefer rewriting that in Lean, but I'd need help for this: where to start, where to look. (I haven't really programmed in Lean before.)

Michael Rothgang (Jan 17 2024 at 19:06):

(The first commit updates the exceptions; this is pre-existing. It seems we're missing the bot updating this file when entries are removed/it hasn't been ported yet.)

Ruben Van de Velde (Jan 17 2024 at 19:12):

If you put the current number of lines in the error, will we need to change the exceptions every time the length changes?

Yury G. Kudryashov (Jan 17 2024 at 19:15):

I'm going to move parts of Filter.Basic to new files.

Michael Rothgang (Jan 17 2024 at 19:22):

Ruben Van de Velde said:

If you put the current number of lines in the error, will we need to change the exceptions every time the length changes?

Hm, bummer. Can we configure CI to do this? (Does it do this automatically already?)

Michael Rothgang (Jan 17 2024 at 19:23):

I'd prefer printing the number of lines to the user (to give some information) - the way update-style-lints works implies this is also printed to the file. I could remove the number.

Yury G. Kudryashov (Jan 17 2024 at 19:27):

Exceptions can give per-file limits on wc -l

Michael Rothgang (Jan 17 2024 at 19:29):

Submitted #9821 for updating the list of style exceptions (prior to my PR) and #9822 for the new lint.

Michael Rothgang (Jan 17 2024 at 19:29):

Yury G. Kudryashov said:

Exceptions can give per-file limits on wc -l

I don't understand this sentence. What are you referring to? (I don't know mathlib's CI inside out.)

Yury G. Kudryashov (Jan 17 2024 at 19:35):

I don't know how does your linter work. I was trying to say that it can be implemented so that it verifies that non-exceptional files have ≤1500 lines and exceptional files have LOC ≤ the number in their entry in a csv, yaml, or json file.

Michael Rothgang (Jan 17 2024 at 20:13):

Jireh Loreaux said:

An "easy" split. ENNReal should just get its own folder inside Data

Thanks for the idea! I just did this, branch#MR-split-ennreal has a split. A helping hand at adding module docs would be very welcome (feel free to just push to the branch directly).

Michael Rothgang (Jan 17 2024 at 20:16):

@Jireh Loreaux :point_up: Do you have some spare moment, by chance?

Jireh Loreaux (Jan 17 2024 at 20:29):

@Michael Rothgang I'll try to look in the next 30 minutes or so.

Michael Rothgang (Jan 17 2024 at 20:36):

Yury G. Kudryashov said:

I don't know how does your linter work. I was trying to say that it can be implemented so that it verifies that non-exceptional files have ≤1500 lines and exceptional files have LOC ≤ the number in their entry in a csv, yaml, or json file.

Good idea - adding a watermark so exceptional files cannot grow too much without acknowledgement. That goes beyond the current script, though.

Ruben Van de Velde (Jan 17 2024 at 20:56):

A simple rule that lands is probably better than a more complicated rule that doesn't. It's not like it's a strict limit

Michael Rothgang (Jan 17 2024 at 23:28):

It turns out there was a loophole in my argument - I found a very mild hack to add a watermark and updated the script. On errors, the script also prints the watermark, which means the update script also works with this. (I've tested this locally and on CI, by merging master and updating the output.)

Michael Rothgang (Jan 17 2024 at 23:28):

In other words: #9822 is ready for review.

Michael Rothgang (Jan 19 2024 at 14:31):

This PR has landed now. Thanks for the fast review!


Last updated: May 02 2025 at 03:31 UTC