Zulip Chat Archive

Stream: general

Topic: Oversplitting


Yaël Dillies (Dec 11 2022 at 10:31):

Recently we've been splitting a lot of files. Let me state a few reasons why we should be careful while doing so:

  • Having many files means that similar lemmas will likely be spread across several, which hinders discoverability. One way to fix this is to cross-reference files, but from experience I know that cross-references evolve much slower than the actual content of the files.
  • The more files we split, the more opportunities to split we create. Because each file holds less lemmas, it's much easier (and tempting), to get rid of each. Reciprocally, we will be more and more reluctant to add imports to already-existing files even though those imports make sense from a theory point of view and would have already been there had we had less files.

Yaël Dillies (Dec 11 2022 at 10:31):

@Yuyang Zhao, please read this.

Ruben Van de Velde (Dec 11 2022 at 15:12):

It's always a trade-off - importing too much can also make library_search grind to a halt and you can lose sight of the forest for the trees when looking for a result manually in a huge file. Bigger files also increase the chance of ending up with an import loop. But certainly, not every file that can be split should be split.

Moritz Doll (Dec 12 2022 at 00:52):

I haven't timed it, but I would suspect that library_search timeouts are less likely in Lean 4, but I agree that mindlessly splitting files is not a good idea. If performance is increased by splitting files, then I am generally in favor of them (running Lean 3 on a rather old Laptop can be really painful at times) and splitting huge files (>1000 loc) makes me happy as well.
As for discoverability, I think this is a totally independent issue: arguably most important mathematical results need additional technology to prove them, so they are more likely be in their own file, which means they will not be easily found by library_search or simp. I think the only good answer is a external search tools (Hoogle for Lean, but even better). Finding theorems by browsing the docs is not a pleasant experience if you are not familiar with the structure/level of abstractions in mathlib and we cannot do anything about that. I don't think that splitting files has a major impact there.

Anne Baanen (Dec 12 2022 at 10:56):

I definitely agree your points: splitting files is not an unqualified good. Certainly there are tradeoffs for each point on the spectrum between lumping everything into one huge file and giving every declaration its own file, and the optimal position for mathlib is likely not at one of the extremes. However, I am rather sure that generally mathlib without active splitting efforts tends towards the too-lumpy side of optimal and so splitting efforts generally improve the situation.


Last updated: Dec 20 2023 at 11:08 UTC