Zulip Chat Archive
Stream: lean4
Topic: Lean 4 needs to be optimised by at least 10x. Ideas?
Mr Proof (Jun 22 2024 at 02:14):
Coming from a software developer background rather than an academic math background, I view Lean as a piece of software. In that respect, my observation is that it needs a great deal of optimisation. Open source projects have a habit of building up inefficiencies over time.
For example, the built version of Mathlib is several GB on disk, takes several GB of RAM, and takes several seconds to load. That is not the fault of Mathlib itself, but of the software and data structures it is built upon. They were probably fine to begin with, but with huge libraries I think they need updating.
When Mathlib and other libraries get bigger this is only going to get worse.
Can we throw some ideas around to see if we can think of ways to make it more memory efficient?
Some ideas I have are (perhaps some are already implemented)
- JIT compilation of libraries. Libraries are compiled on the fly only when they are needed. (E.g. when you import Mathlib, it could just load a list of every function; then when you call a particular function, it could compile, or download, the particular module it needs, Algebra say, and cache it for later, evicting ones that haven't been used for a while.)
- Olean files only moved to RAM when needed (e.g. when you import Mathlib, don't instantly move it all to RAM).
- Improve the internal data-structures to be more efficient
- Improve repetition in files by referencing external indexes
- Improve cases where tactics like "simp" are expanded inefficiently.
- Move some of Lean's processing to the GPU(?). Probably difficult, as proof checking isn't well suited to that kind of parallelism.
- Olean files stored in a compressed format and decompressed when needed. (I get 3x memory saving just by storing the olean files as ZIP on disk)
Any other ideas? Are there any "low hanging fruit" that could cut the footprint in half?
Asei Inoue (Jun 22 2024 at 03:34):
Currently you have to require the whole Mathlib, even if you only need Mathlib.Tactic. It would be good to have the ability to select a subdirectory.
Sorry if this isn't the right place to talk about it.
Kim Morrison (Jun 22 2024 at 03:48):
The original question seems to presuppose that no one is thinking about and working on all these things. Perhaps read the FRO roadmap?
Kim Morrison (Jun 22 2024 at 03:50):
Regarding require and Mathlib: the subdirectories do not at all reflect the import structure, and I expect there are very few subsets of the directories (besides subsets close to all of Mathlib) which would work without the complement.
Kim Morrison (Jun 22 2024 at 03:51):
If you want to work on that, there is a massive amount of work to be done simplifying and streamlining the import structure of Mathlib (perhaps one day enabling a partial require), and assistance with this is appreciated.
Kim Morrison (Jun 22 2024 at 03:54):
It's also worth being aware that making changes to fundamental aspects of Lean is a challenging process, and you need substantial experience with the ecosystem to be able to make a positive contribution. Work on technical debt in Mathlib, and contributing high quality documentation, are probably better things for new contributors to be thinking about.
Mr Proof (Jun 22 2024 at 03:56):
Yes, indeed the FRO roadmap lists Scalability Enhancements, but I think it's good to have a discussion where people can all throw in their ideas. Perhaps the people working on this will find something they hadn't thought about. :smile: The more ideas the better, IMO.
(I did a search and didn't find any similar discussions)
BTW, is Lean all made by volunteers or is there an in-house team at Microsoft(?) who work on the kernel compiler?
Asei Inoue (Jun 22 2024 at 05:15):
The latest Lean code includes an Amazon license header.
Asei Inoue (Jun 22 2024 at 05:16):
The Lean FRO must be funded by some foundation.
Asei Inoue (Jun 22 2024 at 05:17):
By the way, I am also interested in Lean's support for GPUs and parallel processing - GPUs may not be useful for proofs, but they are useful for scientific computing.
Mauricio Collares (Jun 22 2024 at 07:48):
About point 2, oleans aren't ever moved to RAM, they're just mmapped. And mathlib already uses olean compression on the wire, but they get decompressed on disk because otherwise you can't mmap them.
Henrik Böving (Jun 22 2024 at 08:24):
Asei Inoue said:
The latest Lean code includes an Amazon license header.
That's because Leo switched from MSR to Amazon, the rest of the FRO is not Amazon employees.
Mr Proof said:
BTW, is Lean all made by volunteers or is there an in-house team at Microsoft(?) who work on the kernel?
To my knowledge there is nobody at Microsoft making large contributions to the Lean compiler itself anymore. There is also basically nobody working actively on the kernel; the kernel is finished and has barely been touched for a long time. As to who is working on the Lean compiler: that's the FRO.
Asei Inoue said:
The Lean FRO must be funded by some foundation.
You can read on our website which foundations etc are funding the FRO.
A. (Jun 22 2024 at 08:34):
To continue with Kim's thought, while valiant efforts are sporadically made to simplify the dependency structure of Mathlib, as I understand it they tend to be ad hoc and to work chiefly at file level. It could be a very nice computer-sciencey project to extract the theorem-level dependency graph and attack it with the latest in network algorithms.
Ruben Van de Velde (Jun 22 2024 at 08:44):
File-level dependencies are how Lean works right now, and I'm not aware of plans to change that. It would be interesting to have some kind of automated analysis that could flag when heavy dependencies are only used by a small fraction of the declarations in a file, for example.
Henrik Böving (Jun 22 2024 at 08:46):
Extracting the theorem-level dependency graph is definitely possible with a metaprogram that just looks at all theorem statements and proofs and figures out dependencies from those. One thing to keep in mind, however, is that fully elaborated Lean terms no longer contain references to the syntax declarations (such as math notation or, more importantly, tactics) that were used to produce them. This means that while a tactic can produce a proof term that looks very simple, it might have required some part of a library to do so, and you no longer see that library.
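The metaprogram Henrik describes can be sketched in a few lines. Here is a minimal, hypothetical #deps command (the command name is made up, and this is an untested sketch, not an existing tool):

```lean
import Lean
open Lean Elab Command

/-- Hypothetical `#deps` command: print the constants that a declaration's
statement and proof term refer to. As noted above, this only sees the
elaborated term, so the tactics and notation used along the way are
invisible to it. -/
elab "#deps " id:ident : command => do
  let name ← resolveGlobalConstNoOverload id
  let some ci := (← getEnv).find? name
    | throwError "unknown constant '{name}'"
  -- constants appearing in the statement ...
  let fromType := ci.type.getUsedConstants
  -- ... and in the proof term / definition body, if any
  let fromValue := (ci.value?.map (·.getUsedConstants)).getD #[]
  logInfo m!"{(fromType ++ fromValue).toList.eraseDups}"

#deps Nat.add_comm
```

Running something like this over every theorem in the environment would give the theorem-level edges, modulo the hidden syntax dependencies discussed here.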
Ruben Van de Velde (Jun 22 2024 at 08:50):
Is that something lean would be interested in tracking in the future, you think, Henrik?
Damiano Testa (Jun 22 2024 at 08:52):
In fact, the issue of "oleans for syntax" has appeared in several discussions.
Henrik Böving (Jun 22 2024 at 08:52):
I don't know about that. I do know that there is some demand for this in other features like some Mathlib linters but I'm aware of no movement to change this currently.
Mario Carneiro (Jun 22 2024 at 18:06):
We already have data on the 'hidden edges' caused by notations and tactics though, that's noshake.json
Eric Wieser (Jun 22 2024 at 18:12):
Isn't notation information visible in the infotree?
Eric Wieser (Jun 22 2024 at 18:13):
I would expect it to only be things like positivity extensions (which have no syntax at the caller) referring to lemmas in an earlier file that are neither in the infotree nor the proof term
Eric Wieser (Jun 22 2024 at 18:13):
(And I think the right thing to do there is record the used extension names in the proof in some way)
Mario Carneiro (Jun 22 2024 at 18:18):
Eric Wieser said:
Isn't notation information visible in the infotree?
Yes, but the infotree is not stored
Mario Carneiro (Jun 22 2024 at 18:18):
also, even in a live elaboration session lean isn't very good at recording what parts of the state are actually used. The most accurate method is the obvious one: comment out imports until something breaks
Kim Morrison (Jun 23 2024 at 04:10):
I think a useful tool would be a variant of shake that reported if an import was not used in the first N lines (e.g. N=500) of a file. These would be obvious targets for splitting files, and you could crank down N (or use a fraction of the file length).
Mr Proof (Jun 23 2024 at 04:51):
I was thinking of having a go at writing a little C command-line tool that just reads all the Lean files and searches for the word "import" to build a list of the libraries used in each file. That could run pretty quickly, I imagine, and produce some sort of dependency tree, which you could feed into something like Mathematica for a graphical view.
Also, it could search for words like "def", "lemma", "theorem" etc. to make a list of every definition in a file. This could all be done without parsing anything.
While not as accurate as writing this in Lean code, I imagine it would run a lot faster(?)
Or would it be "necessary to break out Emacs and modify that Perl script"?
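For comparison, even the naive string-scanning approach fits in a few lines of Lean itself. A rough, untested sketch (importsOf is a made-up name, and the header detection is deliberately crude):

```lean
/-- Hypothetical sketch: list the modules imported by a `.lean` file using
plain string matching, no parsing. Imports must appear in the file header,
so scanning stops at the first line that is neither an import, a comment
opener, nor blank (crude: a multi-line comment body in the header would
end the scan early). -/
def importsOf (path : System.FilePath) : IO (Array String) := do
  let mut result := #[]
  for line in (← IO.FS.lines path) do
    let line := line.trim
    if line.startsWith "import " then
      result := result.push ((line.drop "import ".length).trim)
    else if !(line.isEmpty || line.startsWith "--" || line.startsWith "/-") then
      break  -- past the import header; stop scanning
  return result
```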
Yury G. Kudryashov (Jun 23 2024 at 06:12):
We already have import-graph, written in Lean, which can export to dot and hence generate .pdf.
Yury G. Kudryashov (Jun 23 2024 at 06:14):
It runs fast enough (large invocations with pdf
output are slow because of dot
choosing locations for nodes, not because of Lean is slow in generating the graph) and is 100% accurate.
Rowan Williams (Aug 08 2024 at 00:41):
Hello! I'm new here, but I was just looking for a place to ask whether or not LEAN was supposed to be taking upwards of 2.5 GB of RAM, and it seems like it is!
This may be a silly solution, but based on what I understand it seems reasonable.
Once a formal proof has been found and verified, on our machines we could replace the actual proof itself with sorry, or change all proven theorems into axioms. These could be automated changes that trigger on certain GitHub actions. The idea is that a full repository of formalized proofs is still kept, but individuals on small computers like mine could still use what's already been formalized to formalize new things, without the system being bogged down trying to "reverify" on our own computers everything that's already been proven tens of thousands of times.
Again, not sure if this is actually how it works, and it definitely feels like a lazy solution, but at the same time, "I don't see why not."
Speaking of which, I know there's some sort of save tactic that supposedly prevents Lean from rerunning the whole file every time you edit a line. Not sure if it's relevant.
Xiyu Zhai (Aug 08 2024 at 01:02):
I'm wondering if Mathlib is bigger than the Rust dependencies of my 200k LOC personal project. I believe the source code is not optimised for incremental computation. At the least, the roadmap mentions that the Lean compiler needs module signatures. Modern languages like Rust and Go have developed a bunch of techniques for dealing with things like this. I'm optimistic things will improve dramatically in the coming years.
Xiyu Zhai (Aug 08 2024 at 01:11):
It seems most of the compiler is written in Lean itself. Although Lean tries to be a fast language, it's still not as convenient to optimise as a true systems language like Go, Rust, or C++. Personally I would stick to a fixed systems-level language for the compiler itself, without doing too much bootstrapping. But that's just a personal opinion; I can imagine the compiler is written in a bootstrapped way for safety reasons. Anyway, that looks like too big a change to discuss.
Currently the incremental mechanism is hard to design right, because Lean just seems too powerful. My reading of the repo tells me that all the features interact in a convoluted way, such that it's hard to make things incremental at a very granular level.
Yury G. Kudryashov (Aug 08 2024 at 01:13):
Rowan Williams said:
Once a formal proof has been found and verified, on our machines we could replace the actual proof itself with sorry
Kim Morrison (Aug 08 2024 at 01:13):
Lean has had incremental compilation of tactics in interactive mode for the last few months. If you're running on a recent version you should see that when you edit a proof the yellow progress bar only starts at or near your edit.
Kim Morrison (Aug 08 2024 at 01:14):
We're also working (longer term) on separating oleans into two components, one with as many proofs as possible stripped out.
Kim Morrison (Aug 08 2024 at 01:15):
Intra-file parallelism is also on the near term horizon!
Sebastian Ullrich (Aug 08 2024 at 09:36):
Rowan Williams said:
whether or not LEAN was supposed to be taking upwards of 2.5 GB of RAM
It depends on what you're doing with it and how you measure that. On Linux, importing all of Mathlib takes about 36MB of RAM (RssAnon in /proc/$PID/status) and another 38MB of data read lazily from .oleans (RssFile), though that part can be shared across processes.
Sebastian Ullrich (Aug 08 2024 at 09:40):
The most expensive file in Mathlib at its peak uses 3.7GB, which likely means it's just generating a lot of (temporary) terms while running. It's been some time, but we looked at such an outlier file before and with a few adjustments, memory use can usually be lowered drastically.
Last updated: May 02 2025 at 03:31 UTC