Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Micro Mathlib


Gerben Koopman (Nov 19 2025 at 13:20):

Maybe this is a silly question, given that I have relatively little Lean/Mathlib experience compared to AI experience, but here goes.

While debugging ML agents that use the Mathlib library for premises and as a dataset (e.g. LeanDojo Benchmark 4), this often has to be done on an HPC cluster, as many machines can not fit the traced Mathlib in memory. Of course there are unit tests you could write, but that is quite an exercise. Ideally I would like to use a 'toy version' of Mathlib. Something that is large enough for meaningful testing, but small enough to only take ~16GB RAM tops. Does such an ML toy example of Mathlib exist? I considered going back to a very early commit, but then you get Lean version issues of course.

I could try to excise large portions of files for a custom MathlibTest, or write my own, but both seem quite involved and easier said than done. I can imagine I'm missing something, such as perhaps I don't need to load all of Mathlib. If so, please let me know (and how) :slight_smile:
Thanks in advance!

David Renshaw (Nov 19 2025 at 13:37):

Looks like you got the markdown syntax backwards on the URL that you posted.

Gerben Koopman (Nov 19 2025 at 13:37):

Oh thanks for the heads-up @David Renshaw, I pasted and clicked the link icon, but indeed it's backwards.

David Renshaw (Nov 19 2025 at 13:39):

The file at that URL is 68.1 MB. How is that taking up more than 16 GB of memory?

Gerben Koopman (Nov 19 2025 at 13:44):

So the file only contains the pointers to the further Mathlib environment. It has extracted theorems and proofs, but for the proof tree and premise retrieval you need Mathlib itself. An example line of the jsonl looks like this:

{"url": "https://github.com/leanprover-community/mathlib4", "commit": "29dcec074de168ac2bf835a77ef68bbe069194c5", "file_path": "Mathlib/AlgebraicGeometry/PrimeSpectrum/Basic.lean", "full_name": "PrimeSpectrum.range_comap_of_surjective", "start": [750, 1], "end": [754, 21], "traced_tactics": [{"tactic": "rw [\u2190 Set.image_univ]", "annotated_tactic": ["rw [\u2190 <a>Set.image_univ</a>]", [{"full_name": "Set.image_univ", "def_path": "Mathlib/Data/Set/Image.lean", "def_pos": [701, 9], "def_end_pos": [701, 19]}]]}]}

Note the start and end values, as well as def_pos. These reference lines and columns in the Mathlib files.

Gerben Koopman (Nov 19 2025 at 13:46):

For supervised learning I guess only this 68MB file is sufficient, as you can backprop over the used tactics from Mathlib proofs, but for premise retrieval or reinforcement learning you probably want to have access to Mathlib itself. Hence the need for a 'toy version' of the dataset during testing (HPC hours are a budget, time and environmental waste).

Eric Wieser (Nov 19 2025 at 14:48):

Are you asking for a smaller mathlib environment or a smaller mathlib training dataset?

Gerben Koopman (Nov 19 2025 at 14:49):

The first, a smaller mathlib environment. Or, if it exists, a subgraph of mathlib's proof DAG to import.

Alex Meiburg (Nov 19 2025 at 16:34):

A natural candidate would be to just use core Lean + Batteries. They still have nontrivial proofs, e.g. in Batteries/Data/List/Lemmas.lean. Curating a "tiny Mathlib" would necessarily be extremely subjective, and it would be hard to say what is "worth keeping" or not. You'll end up with plenty of cases where a 'basic' concept in math, like real numbers, ends up importing a good chunk of 'advanced' math (topology, lattice theory). If you have test problems you want to import, it will be tricky to decide what the right minimal set to include is, except for manually doing a #min_imports type thing.

But just for testing that your model can load an environment, learn something, interact... Lean+Batteries should be a fine testbed

Gerben Koopman (Nov 19 2025 at 16:39):

Thanks @Alex Meiburg, sounds good! This sounds like exactly what I was looking for.

Gerben Koopman (Nov 19 2025 at 16:40):

I agree that any subset would be subjective and up to interpretation of minimal, but indeed load-learn-interact seems sufficient to me.

Kim Morrison (Nov 19 2025 at 22:52):

The Mathlib import graph is necessarily a DAG, so you can always just slice it off across any arbitrary cut to get a library that is "small enough".

Kim Morrison (Nov 19 2025 at 22:52):

e.g. remove a random leaf file until it's small enough.

Gerben Koopman (Nov 19 2025 at 22:59):

I guess this is a bit of a different question, but is there an existing way to find or visualize this DAG of the current Mathlib? Not just for this, but for other applications too. How do I know which files or folders are leafs without manually looking for them in the documentation.

Gerben Koopman (Nov 19 2025 at 23:00):

I could write a program to do that but I assume that this exists and is actively maintained in a more efficient/idiomatic manner, given how central the DAG is to Mathlib.

Eric Wieser (Nov 19 2025 at 23:04):

https://leanprover-community.github.io/mathlib4_docs/mathlib.html is a visualization

Gerben Koopman (Nov 19 2025 at 23:05):

Thanks for the help everyone

Eric Wieser (Nov 19 2025 at 23:05):

Or you can use the ImportGraph package to directly write out a gexf XML file


Last updated: Dec 20 2025 at 21:32 UTC