Zulip Chat Archive
Stream: general
Topic: Is a monolithic mathlib sustainable?
Filippo A. E. Nuccio (Dec 07 2021 at 14:50):
Johan Commelin said:
At the same time, it's not clear to me that our current monorepo approach will scale to 100x the size. (Which is a lower bound on the endgoal I'm aiming for.) But we can solve that issue when it becomes a real problem.
Since I have been asked this a couple of times, can you speculate a bit on this? What are the problems you foresee? More on the "human side" (having enough maintainers with enough time to review all PRs, or stuff like this) or more "practical" (CI would take two days to compile mathlib), or something else?
Arthur Paulino (Dec 07 2021 at 15:09):
I do think this subject would deserve a thread of its own, just to keep the focus on Kevin's paper
Kevin Buzzard (Dec 07 2021 at 15:10):
I've had tons of feedback on the paper (many thanks to everyone!) and am currently trying to write v1 of the document. I'm quite happy for the conversation to wander, but the reason it should have a thread of its own is not because it could derail this one but because it will be easier to find later on.
Notification Bot (Dec 07 2021 at 15:28):
This topic was moved here from #maths > ICM talk by Anne Baanen
Anne Baanen (Dec 07 2021 at 15:29):
I went ahead and moved a relatively arbitrary segment of the thread to a new discussion. Feel free to complain if your message was wrongly moved or wrongly left out :)
Arthur Paulino (Dec 07 2021 at 15:30):
Funny, I was just doing something similar :smiley:
Johan Commelin (Dec 07 2021 at 15:32):
One of the main reasons that we want a monolithic mathlib is that it makes it a lot easier to refactor big chunks of theories. But if mathlib is 100x the size it is now, I don't see how one would be able to refactor the definition of add_comm_group (say) in practice. At least not with our current technology.
So, if such refactors are no longer done (hopefully because some common core of mathlib has just become very stable), then it might as well be packaged separately.
But recently I posted a quiz about stable files in mathlib, and it turns out that mathlib doesn't have stable files right now. Like I said, we should probably only worry about these problems when they become actual problems.
Yakov Pechersky (Dec 07 2021 at 15:35):
Part of what counts as stable or not gets conflated with whitespace, formatting, and documentation changes. We will need better language tooling to ask temporal questions about API stability, as opposed to the non-API file contents.
Arthur Paulino (Dec 07 2021 at 15:39):
I think @Filippo A. E. Nuccio raised good points. And especially the latter might be worse, because it can make working on mathlib a more annoying task for core maintainers. Is it possible to gather data on the history of CI runs?
Johan Commelin (Dec 07 2021 at 15:42):
@Arthur Paulino You know about https://leanprover.zulipchat.com/#narrow/stream/113538-CI/topic/build.20time.20bot/near/263993147?
Bryan Gin-ge Chen (Dec 07 2021 at 15:42):
You can also grab pretty much whatever data you want on previous runs from GitHub's API: https://docs.github.com/en/rest/reference/actions#workflow-runs
edit: that won't include runs that were done on Travis before we switched to GitHub Actions, but it looks like Travis has an API for that as well: https://developer.travis-ci.com/resource/builds#Builds
Filippo A. E. Nuccio (Dec 07 2021 at 15:43):
Johan Commelin said:
One of the main reasons that we want a monolithic mathlib is that it makes it a lot easier to refactor big chunks of theories. But if mathlib is 100x the size it is now, I don't see how one would be able to refactor the definition of add_comm_group (say) in practice. At least not with our current technology. So, if such refactors are no longer done (hopefully because some common core of mathlib has just become very stable), then it might as well be packaged separately.
But recently I posted a quiz about stable files in mathlib, and it turns out that mathlib doesn't have stable files right now. Like I said, we should probably only worry about these problems when they become actual problems.
And is the hope that refactoring can become somewhat "automated" too big a dream? After all, it is a clever version of a search-and-replace, once one comes up with the "good" refactor, no? It seems to me that often a big issue with refactoring is variables (what is the local name of that-and-that variable, which ones are left implicit/explicit?), and that this is something that could be automated.
Filippo A. E. Nuccio (Dec 07 2021 at 15:44):
I understand that if one adds many new fields, as we recently did for abelian groups by adding an explicit ℤ-action, this becomes a nightmare, but many nightmares have already turned into dreams in Lean 4, no?
Anne Baanen (Dec 07 2021 at 15:45):
Indeed, my hope is that Lean 4's delaborator will allow us to interactively modify Lean files without breaking anything.
Arthur Paulino (Dec 07 2021 at 15:46):
Amazing, thanks!
I asked about CI runs because these charts make me think that the growth rate is increasing, so I wonder how the CI runs are being impacted.
Anne Baanen (Dec 07 2021 at 15:50):
Here are some graphs I made a while ago:
Yaël Dillies (Dec 07 2021 at 15:50):
Sorry what. 40 lines per second ?!?
Anne Baanen (Dec 07 2021 at 15:50):
I took the part of the graph before the 13th of June, to skip the big jump. According to LibreOffice, the growth is roughly cubic up to then, x^2.934 to be precise, with R² ≥ 0.98.
If I take all the data, the growth is x^2.31 with R² ≥ 0.95.
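(Spelling out what that fit asserts, as a sketch: with n the number of significant lines and c, c' constants determined by the regression above, not by any derivation,
$$ T_{\text{build}}(n) \approx c \cdot n^{2.934} \quad (R^2 \ge 0.98, \text{ pre-June data}), \qquad T_{\text{build}}(n) \approx c' \cdot n^{2.31} \quad (R^2 \ge 0.95, \text{ all data}). $$)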
Anne Baanen (Dec 07 2021 at 15:51):
Yaël Dillies said:
Sorry what. 40 lines per second ?!?
We have hundreds of thousands of lines in mathlib in total, and there's about ten thousand seconds in 3 hours, so that makes sense, right?
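(A back-of-the-envelope check, assuming roughly 400,000 significant lines and a 3-hour build; both figures are illustrative:
$$ \frac{4 \times 10^5 \ \text{lines}}{3 \times 3600 \ \text{s}} \approx 37 \ \text{lines per second}, $$
which is indeed about 40 lines per second.)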
Yaël Dillies (Dec 07 2021 at 15:52):
Ah, not 40 new lines per second, but 40 compiled lines per second?
Anne Baanen (Dec 07 2021 at 15:53):
Ah, I see why you were confused! Yes, these graphs are about "significant" lines of code versus compile time according to the build bot.
Arthur Paulino (Dec 07 2021 at 16:03):
Anne Baanen said:
I took the part of the graph before the 13th of June, to skip the big jump. According to LibreOffice, the growth is roughly cubic up to then, x^2.934 to be precise, with R² ≥ 0.98.
If I take all the data, the growth is x^2.31 with R² ≥ 0.95.
So build duration seems to grow at a rate that's something between quadratic and cubic (closer to cubic) w.r.t. the number of lines? Did I understand it correctly?
Anne Baanen (Dec 07 2021 at 16:07):
Yes, according to the data from approximately the first half of this year we had about that correlation. I don't claim that this is actual causation, since Azure seems to show a lot more delays than previously.
Kevin Buzzard (Dec 07 2021 at 16:35):
I don't know much about all this CS stuff but isn't the Linux kernel a gazillion lines of code in a monorepo right now?
Kevin Buzzard (Dec 07 2021 at 16:38):
And as for Isabelle's modularity working, I'm not so sure it's as easy as that. When they did schemes in Isabelle they had to use a new definition of ring, so they just wrote a new definition and ignored the old one, and that's what they use for their sheaves of rings. If they later on want to do e.g. sheaves of modules, then they won't be able to use any theory of Isabelle modules, because it will be for the wrong rings. I suspect the monorepo approach is what has enabled us to get this far, and if we have trouble getting 100 times bigger then maybe all approaches will!
Kevin Buzzard (Dec 07 2021 at 16:41):
The other thing: I remember when core Lean 3 was frozen before the fork, and we would have to work around issues. I could imagine core mathlib being frozen one day, the idea being that "we've tried to do it in several ways and this way seems to work the best overall". Then if people don't like bits of it they can work around it, like we used to work around core Lean issues which we couldn't fix.
Mauricio Collares (Dec 07 2021 at 16:56):
Kevin Buzzard said:
I don't know much about all this CS stuff but isn't the Linux kernel a gazillion lines of code in a monorepo right now?
Speaking of which, https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext is a good read!
Kevin Buzzard (Dec 07 2021 at 17:00):
I guess one thing worth noting is that even though I don't understand much of what is going on behind the scenes with all this bors and Azure and "lena is offline" stuff, what is clear to me is that the system isn't just "store a lot of code on GitHub and it all just works", it's "store a lot of code on GitHub and also use quite a bit of infrastructure to do a lot of the background work for us". Again, I remember when mathlib was much smaller and we didn't need any of this extra infrastructure, but the point I'm trying to make is that in 2021 this infrastructure is there and can be used, so perhaps "don't make a monorepo" is advice which has not aged well?
Arthur Paulino (Dec 07 2021 at 17:13):
I am convinced that a monolith is the most appropriate approach for the development of mathlib (and I haven't ever questioned this self-contained approach). But I'd also ask myself to what extent this is sustainable, or rather, what has to be done in order to keep such a repository maintainable. Is there some low-hanging fruit?
There was a topic about optimizing the lint check, and it was argued that it's hard because even adding an "unrelated" simp lemma can make some proof break. But maybe it can be optimized when no such breaking change is made?
Arthur Paulino (Dec 07 2021 at 17:19):
What are the conditions that would allow artifacts from previous builds to be trusted and reused?
Can we slow down the bull before it hits the wall (can we anticipate/prevent harsh issues)?
Yakov Pechersky (Dec 08 2021 at 03:03):
The GitHub actions Rob wrote are very useful for keeping repos that rely on mathlib up to date. In general, that sort of tooling of "distributed" responsibility, facilitated via "actions", helps improve interacting with big repos. A step mathlib could take is having a pre-commit set of actions that a user should run before pushing. These would reduce the CI load. We already do a fast "lint the files" pass before builds -- a pre-commit hook would allow the user to run that locally in a clearer way.
Huỳnh Trần Khanh (Dec 08 2021 at 04:22):
lol I'm a programmer and here's my 2 cents :joy: big, monolithic repositories aren't bad at all, they are extremely common and the word "monolithic" doesn't have any negative connotation. I know a lot of projects that are monolithic and sustainable. Linux, DefinitelyTyped, Babel and React for example are extremely successful monolithic projects
Arthur Paulino (Dec 08 2021 at 09:44):
Maybe a little bit of survivorship bias :thinking:
Huge repos are harder to manage. This is a common thing among badly managed (usually old) tech companies: their services are hosted on very, very big repositories that are hard to tweak and that have slow (and risky) maintenance/deploy cycles. There is no shadow of a doubt, however, that mathlib maintainers are skilled and knowledgeable enough to keep the ball rolling nicely.
Mario Carneiro (Dec 08 2021 at 11:29):
Yaël Dillies said:
What about Isabelle? They're kind of modular and although I wouldn't want to have to deal with several duplicate libraries, that seems to work.
It's a stretch to call Isabelle modular. Isabelle has very centralized maintenance, mostly Makarius, Tobias Nipkow, Larry Paulson, and their research groups handling the infrastructure. There is the main Isabelle/HOL distribution and the AFP, which are organized roughly like mathlib and the archive: the same group of people makes changes to the distribution and fixes downstream breakage in the AFP, and everyone has to move to the next version every year to stay up to date.
Mario Carneiro (Dec 08 2021 at 11:53):
Arthur Paulino said:
Huge repos are harder to manage.
Huge code is hard to manage, whether it is in one place or several. Splitting a huge singular product into many small pieces only works well if the pieces are loosely coupled. Mathlib is not loosely coupled (cue dependency graph), and this is very much by design.
I think the Google example is enough to show that, as long as enough work is put into infrastructure to keep the ball rolling, there is no actual upper bound on the size of the repo if we want to be a monorepo all the way. If mathlib becomes 100 times the size, I'm sure things will look different, but not fundamentally so. It has already changed quite a bit from when it was just a simple GitHub project worked on by a few people: we have a maintainers group, dedicated CI machines, and tooling like leanproject and elan to make it easy to install and use mathlib on your machine. I expect things like this to happen naturally as we face the consequences of our own success.
Eric Wieser (Dec 08 2021 at 12:04):
cue dependency graph
https://eric-wieser.github.io/mathlib-import-graph/
Rob Lewis (Dec 08 2021 at 15:17):
Let me add: we've had the conversation "how long can we sustain this development pattern?" about mathlib periodically since at least fall 2018, when the repo was an order of magnitude smaller than it is now
Rob Lewis (Dec 08 2021 at 15:17):
This isn't an argument that it will keep scaling, of course
Rob Lewis (Dec 08 2021 at 15:18):
But the scalability has been on people's minds for a long time, which is part of why it's been able to scale up
Mac (Dec 09 2021 at 04:50):
Mario Carneiro said:
Huge code is hard to manage, whether it is in one place or several. Splitting a huge singular product into many small pieces only works well if the pieces are loosely coupled. Mathlib is not loosely coupled (cue dependency graph), and this is very much by design.
Despite having been one of the skeptics of mathlib's monolithic approach, I do think it works fine for it and will continue to do so in the future. The main reason for this is the strong consensus on the kind of mathematics mathlib targets. As long as there exists a single clear vision for a project, monorepos work (and generally work much better than many small projects). Such a vision can come from above (e.g., from a company like Google or Facebook), via community (e.g., mathlib), or even from a single individual (e.g., Linus for Linux, for a long time).
One mistake some CS people (like me) make is that they would not expect there to be so much agreement among mathematicians about how to formalize their field, as such a thing is not common in CS. I think this is further exacerbated by the fact that many CS people would probably disagree with how mathematicians do so, as CS generally takes a constructive approach to mathematics (or sometimes even wackier approaches like those of linear, relevant, or paraconsistent logics), but professional mathematicians are usually classical.
Chris Lovett (Dec 09 2021 at 05:01):
From a pure software engineering perspective, having worked on large code bases myself (Windows, SQL Server, .NET frameworks, and Visual Studio), monolithic code with dependency graphs like this is hard to maintain, especially when the team grows to more than 1000 people. But single repo versus multiple repos is not really the issue until the code exceeds about 20 million lines. So in software engineering we "invested" in modularity technologies (like interfaces, etc.) and forced lower levels to be stable using human processes. We also forced architectural layering where lower layers are not allowed to depend on higher layers; it's a build break if they try to. So one might want to think about defining some "layers" in mathlib and creating tools that verify the layer dependencies are not violated. But figuring out how to do this with a language like Lean might be an interesting research project on its own...
Will Sawin (Dec 09 2021 at 05:08):
One mistake some CS people (like me) make is that they would not expect there to be so much agreement among mathematicians about how to formalize their field, as such a thing is not common in CS.
This is interesting, what would you expect people to disagree about?
Mac (Dec 09 2021 at 05:16):
@Will Sawin I gave some examples to this effect in my post. There are a number of debates (at least in CS, and also in philosophy) as to how formalizations should be grounded: different logics (linear, relevant, intuitionistic, paraconsistent, classical, etc.), different theories (set vs. type vs. category, etc.), different splits between data and proofs, etc. On top of all those semantic disputes, there are also syntactic disputes (e.g., code style) about which CS people tend to disagree vehemently. Generally one would not be able to assemble such a large community that agrees on these things in CS, especially for something as large as formalizing an entire field, and without any clear governance or a benevolent dictator for life.
Mario Carneiro (Dec 09 2021 at 05:26):
Chris Lovett said:
So one might want to think about defining some "layers" in mathlib and creating tools that verify the layer dependencies are not violated. But figuring out how to this with a language like Lean might be an interesting research project on it's own...
There are plenty of "layers" in mathlib already. Definitions are built on more definitions, and there is a clear notion of API boundary between them. If you have to unfold more than one layer of definitions you are doing it wrong; I and others have been preaching this gospel for a long time, and mathlib reflects this. It's not an airtight guarantee by any means, it's not really much more protection than some basic OOP-like design principles, but it's enough to keep the cognitive burden manageable when dealing with definitions at any part of the stack. I'm sure multisets are a dependency of perfectoid spaces, but I'm also sure they essentially never came up, in text or in the minds of the authors.
Mario Carneiro (Dec 09 2021 at 05:30):
Also, re: "We also forced architectural layering where lower layers are not allowed to depend on higher layers, it's a build break if they try to", this is already the case in lean, out of the simple necessity to avoid cyclic proof dependencies. The mathlib import graph is a DAG, which means there is a well defined notion of "lower" and "higher" layers, and lower layers are indeed not allowed to depend on higher layers on pain of breaking the build.
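(A minimal sketch of how the build enforces this; the file names here are hypothetical, and the exact error wording varies between Lean versions.)

```lean
-- foo.lean : the "higher" file, allowed to depend on bar
import bar

-- bar.lean : the "lower" file; adding this line would close a cycle
import foo   -- Lean rejects the build with an import-cycle error here,
             -- so a lower layer can never silently depend on a higher one
```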
Mario Carneiro (Dec 09 2021 at 05:32):
Although, we have not been vigilant enough to enforce which files are low and which are high, which can sometimes lead to unexpected transitive dependencies. Eventually this surfaces as an import cycle and we have to figure out which file is the bad actor.
Mario Carneiro (Dec 09 2021 at 05:44):
But also, keep in mind what those arrows in the dependency graph mean. If the definition of convergence in a topological space requires the notion of finite set or filter or what have you, that's just the way it is, we can't just say "use something else" or "go through the API" because that is the API they are going through, there is nothing wrong with the code. Mathematics is very interconnected, and it should be no surprise that when you formalize it you still get something very interconnected. The main tool we have for untangling the dependency graph is splitting files into more parts such that each part only has a subset of the dependencies, which can decrease spurious transitive dependencies. But if you need the Cantor-Bernstein theorem to prove that cardinals are well ordered, that's not a dependency we can avoid by rejiggering things.
Junyan Xu (Dec 09 2021 at 05:56):
example of import cycle
Mario Carneiro (Dec 09 2021 at 06:08):
Hm, that's not great. Unless faithful.div_comp is an actual dependency of functor.hext (i.e. it is a cycle at theorem granularity), that theorem should be moved such that it can take advantage of functor.hext.
Junyan Xu (Dec 09 2021 at 06:38):
I think the most appropriate solution in this case is to move two lemmas about opposites in the eq_to_hom file into the opposite file. It's desirable to keep related things in one place, but that can't always be achieved...
Junyan Xu (Dec 09 2021 at 06:40):
Hmm, I think it would work (and be better) to move hext to functor, since it's not about eq_to_hom at all...
Junyan Xu (Dec 09 2021 at 06:44):
But there's another lemma in fully_faithful that does require eq_to_hom, so it's better to reduce the dependencies of eq_to_hom.
It's been two years and no one has cared to refactor it, and people are happily building a lot on top of it... (I changed the eq_to_hom file recently and it takes a long time to recompile)
Junyan Xu (Dec 09 2021 at 06:51):
By the way, does anyone know of a gadget similar to eq_to_hom in other proof assistants? The simp lemmas make it manageable to work with equality of objects in categories. (And I wonder what issues people encounter when working with sigma types where there isn't a category around.)
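(To illustrate why the simp lemmas help: a minimal Lean 3 sketch against mathlib. eq_to_hom promotes an equality of objects X = Y to a morphism X ⟶ Y, and eq_to_hom_trans is one of the simp lemmas in question.)

```lean
import category_theory.eq_to_hom

open category_theory

variables {C : Type*} [category C]

-- `eq_to_hom_trans` is a `simp` lemma, so chains of object-equality
-- morphisms collapse automatically instead of piling up in goals.
example {X Y Z : C} (p : X = Y) (q : Y = Z) :
  eq_to_hom p ≫ eq_to_hom q = eq_to_hom (p.trans q) :=
by simp
```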
Chris Lovett (Dec 09 2021 at 06:53):
Mario Carneiro said:
Also, re: "We also forced architectural layering where lower layers are not allowed to depend on higher layers, it's a build break if they try to", this is already the case in lean, out of the simple necessity to avoid cyclic proof dependencies. The mathlib import graph is a DAG, which means there is a well defined notion of "lower" and "higher" layers, and lower layers are indeed not allowed to depend on higher layers on pain of breaking the build.
That's great. Those layers don't seem to be visible in https://eric-wieser.github.io/mathlib-import-graph/, but it must be possible to derive a kind of higher-level dependency graph showing these layers in mathlib, with all arrows pointing downwards from top layers to bottom layers. Has anyone tried to do that?
Johan Commelin (Dec 09 2021 at 07:08):
@Chris Lovett I think there is a lot of "invisible" structure in that graph. But I don't know enough graph algorithms to extract it in a way that is visually pleasant for humans. Default clustering algorithms typically come up with clusters that don't really resemble the "standard textbook presentation" of how mathematics is subdivided.
Trebor Huang (Dec 16 2021 at 03:07):
Just a thought: as long as the definitions and theorems themselves aren't circular, wouldn't circular imports just be fine?
Of course, circular imports are probably very confusing in terms of code structure and such, but I think it's worth discussing whether controlled use of them may be beneficial...?
Mario Carneiro (Dec 16 2021 at 03:09):
Indeed, top-down declaration order being a topological order of the DAG is a mere convenience, but it is useful for keeping the DAG straight in your head. It is no loss of generality to assume that declarations are in topological order, and it is easier to read.
Mario Carneiro (Dec 16 2021 at 03:10):
The main reason to allow forward declaration is explicitly so you can create cycles, and for theorem proving this is not okay, or at least it has to be carefully controlled through e.g. mutual blocks.
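(A minimal Lean 3 sketch of such a controlled cycle: the two definitions refer to each other, but the mutual block lets the equation compiler check that the recursion is well-founded, so no genuine circularity gets in.)

```lean
-- Two mutually recursive definitions, declared together so Lean can
-- verify that every call chain decreases on the `nat` argument.
mutual def is_even, is_odd
with is_even : nat → bool
| 0       := tt
| (n + 1) := is_odd n
with is_odd : nat → bool
| 0       := ff
| (n + 1) := is_even n
```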