Zulip Chat Archive

Stream: general

Topic: mathlib; unintended consequences?


Scott Morrison (May 12 2021 at 07:48):

While I've always been a strong supporter of the "monolithic repository" model for mathlib development (it keeps everything in sync, prevents bitrot, and enables library-scale refactors), I now appreciate a strong argument against it.

As Leo has said recently, it is essential for the future of Lean 4 that it is adopted beyond the interactive theorem proving / mathematics formalisation community, and sees real use as a programming language. Of course, hopefully this is going to happen just on the merits of Lean 4 as an amazingly flexible and fast functional programming language. But I do think it's worth worrying for a moment that mathlib's success may have actually hurt Lean 3's adoption to this point, and to make sure, if this is the case, that we don't repeat it.

The way in which mathlib-the-library may actually have hurt lean3-the-language is precisely that it is the only game in town. For good reasons, it's big, and unified, and gets a lot of attention. But this also means that someone coming to Lean3 who is not motivated primarily by "let's formalise some mathematics / experiment with what formalisation can do / investigate how formalisation can be used in teaching" may well be turned off, when they see (fairly or not) that "this is all we're interested in". I think for people who've been in the community for a few years, it's pretty easy to come up with examples where this may have been in play.

How could we address this?

An interesting example that was recently mentioned on a Lean4 thread was a crypto library. In some sense much of what is in a crypto library (as exists in many programming languages) is fair game for mathlib. The way it would be developed here would be a bit weird for a crypto library (we'd never bother formalising properties of padding schemes, and we might accidentally mark some essential algorithms noncomputable...) but even if you ignore that, it might be completely unusable as a crypto library merely because no one would want to include a 42mb zip file full of proofs about the Hahn-Banach theorem, etc., in order to run Diffie-Hellman.

Of course there's no obstacle to someone writing a "conventional" crypto library in either Lean 3, or more likely in Lean 4, but hopefully this example illustrates the problem that Leo was raising: could the desire for a giant mathlib repo that contains all the maths underlying a crypto library (as well as a zillion other things), actually inhibit potential users/contributors to the language ecosystem from building a more focused library?

Oliver Nash (May 12 2021 at 09:19):

I don't have an answer to this interesting question but I wonder to what extent this issue could be addressed by creating logic that consumes some set of Mathlib lemmas / defs and automagically creates a subset of Mathlib that was sufficient for them.

Oliver Nash (May 12 2021 at 09:20):

I guess lean --make path/to/list_of_lemmas.lean sort of does this but presumably there is enormous scope for improvement on this given the different design intention. One could even imagine logic that iteratively shrank the subset, taking advantage of successive specialisations.

Oliver Nash (May 12 2021 at 09:20):

If we wanted to cater to a particular community (e.g., cryptanalysts) one could even conceive of producing an artefact with a relevant subset as part of the Mathlib CI.

Oliver Nash (May 12 2021 at 09:20):

I have no idea if something along the lines of this very vague outline is possible (or desirable) but I like the idea of leveraging the fact that Mathlib can be understood by a computer to do things that are impossible for non-digital libraries of mathematics.

Oliver Nash (May 12 2021 at 09:22):

Of course I expect there will be problems too. I can imagine someone who wants to compute the genus of a hyperelliptic curve over ℂ using Riemann-Hurwitz except that we might have proved Riemann-Hurwitz using Riemann-Roch which in turn we have proved as a special case of the Index Theorem for the Dolbeault operator, which we have proved as a special case of the version of the IT for families of pseudo-differential elliptic operators etc.

Daniel Fabian (May 12 2021 at 15:45):

I had a brief conversation about this from a slightly different context. Right now, having mathlib is monolithic to an extent because of very complex dependencies. A way how one could potentially remedy that would be interfaces / type classes. You don't necessarily need the whole chain of dependencies when you have an interface plus some of the special results you need inside of, say your linarith tactic.

This btw. doesn't contradict a monorepo approach. It would more like be the idea, that mathlib could potentially be 50 independent packages sitting in one monorepo.

Mario Carneiro (May 12 2021 at 16:12):

I think we should try to design mathlib such that a file-dependency approach like Oliver mentions works well. That will mostly solve the problem of "crypto library has unneeded dependencies".

Regarding inclusion of a crypto library in mathlib, that has always been within scope, at least the way I envision it; it's just that there aren't that many CS folks contributing to mathlib due largely to the way publicity has gone. There are some things like verified hashmaps but lean 3 is slow enough that it's fairly unrealistic to imagine that such libraries will be used in practice. Lean 4 changes the ball game here; fast and verified code is no longer a pipe dream.

Regarding splitting mathlib into projects using typeclasses to abstract over representation: this works only in very specific contexts and generally makes reasoning a lot more complicated, so I would only employ it when necessary. For sure something like linarith could be written in that style, but for example abstracting over types that are the integers seems like a bad idea because it promotes the existence of multiple integer types, which is exactly what the monolithic mathlib design is intended to prevent: we want one int, one real and so on so that things can interoperate. Those types are defined with low dependencies so that they can be used in more contexts. But I don't think that there is any neat separation of mathlib into weakly connected components except "tactics" / "most of mathlib" / "category theory" (and we have the graphs to prove it), so a hypothetical crypto library should be able to ad hoc define its own requirements by depending only on certain files that don't live too high up in the tree.

Mario Carneiro (May 12 2021 at 16:18):

Oliver Nash said:

Of course I expect there will be problems too. I can imagine someone who wants to compute the genus of a hyperelliptic curve over ℂ using Riemann-Hurwitz except that we might have proved Riemann-Hurwitz using Riemann-Roch which in turn we have proved as a special case of the Index Theorem for the Dolbeault operator, which we have proved as a special case of the version of the IT for families of pseudo-differential elliptic operators etc.

In cases like this, it is sometimes advisable to prove the specific theorem directly, assuming that this can reasonably be done without going via the generalization. This is already the case for data.real.basic, which is a direct construction of the reals from cauchy sequences even though we already had a high-powered definition using the completion of a topological ring. This was done specifically to decrease the number of files in the up-set of data.real.basic because it was a performance problem when everyone wants to use the reals as an example and lean has to load a zillion files to do so.

Joe Hendrix (May 12 2021 at 16:30):

Has there been much discussion about Mathlib "releases"?
I think a primary benefit of partitioning a large monorepo into libraries is that there is less coordination needed to release stable versions. If a library has a more well-defined scope and public interface, one could use semantic versioning for improved stability and predictability when upgrading.

Johan Commelin (May 12 2021 at 16:35):

But about every month we have a large scale refactor happening in mathlib (only in the last couple of weeks we already had 3: redefinition of (add_)monoid and another one for (add_)group, refactoring of basis in linear algebra).
Such refactors would need tight coordination between multiple of those sub-libraries.
They also break backwards compatibility, which means 3 major version bumps in the semver release, in a month.

Mario Carneiro (May 12 2021 at 16:52):

If a library has a more well-defined scope and public interface, one could use semantic versioning for improved stability and predictability when upgrading.

Mathlib of course does not have a well defined scope (or rather, it has essentially unbounded scope). The math required for a crypto library (or any particular, fixed target application) could be much more constrained scope, but I really struggle to see how to determine where any particular theory in mathlib is "done", and in particular every part of it grows when new things are added, because we want theorems about how everything interacts with everything else

Mario Carneiro (May 12 2021 at 16:57):

One trick that I am familiar with from set.mm is that the main part of the library has a (substantial) changelog for it, which enumerates every single theorem that was changed or removed, with a short note on replacements, ordered by date. The intent is that a semi-automated procedure can be used to consume the changelog and apply fixes to a dependent library

Mario Carneiro (May 12 2021 at 16:58):

This changelog is updated continuously, on every commit, and for something like lean it would probably be best to have a CI script construct it

Mario Carneiro (May 12 2021 at 16:59):

So it's not "releases" per se but you can still work out a reasonable plan for upgrading if you know the version of the library before and after


Last updated: Dec 20 2023 at 11:08 UTC