Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Paper: MLFMF

Jason Rute (Oct 28 2023 at 14:27):

There is a new paper MLFMF: Data Sets for Machine Learning for Mathematical Formalization by @Andrej Bauer, Matej Petković, and Ljupčo Todorovski.

Jason Rute (Oct 28 2023 at 14:27):

This is interesting, as it might be the first machine learning paper by @Andrej Bauer (who is big in the proof assistant world, especially as a proof assistant evangelist).

Jason Rute (Oct 28 2023 at 14:27):

I hope the authors will share more, but the paper seems to present a dataset of Lean and Agda libraries in graph form. It is probably the first Agda dataset for machine learning (at least the first I know of). I'm a bit confused about how they hope people will use this dataset. I think it is basically the internal graph structure of Lean (as found, say, in the environment or in the olean files). They call the task "link prediction", but if I'm not mistaken (and I probably am) it is asking whether a theorem/definition/axiom X is used in another theorem/definition/axiom Y.

If we restrict this to theorems, then this is classical premise selection, which is very important for other applications. Hammers use premise selection to select theorems to send to the ATP. Similarly, MagnusHammer uses premise selection and powerful Isabelle tactics to solve theorems in one tactic. ReProver uses premise selection to decide which arguments to suggest to the tactic-generation model, and it also provides a nice suggestion tool, which I think @Adam Topaz was playing with.

Indeed, premise selection is one of the oldest forms of machine learning for theorem proving, and the HOLStep dataset was an important premise selection benchmark for early neural theorem proving. Compared to HOLStep (which isn't used in research anymore), this dataset and paper do seem about 5 years too late, but I still think there are a lot of important and unanswered questions in this space.
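To make that framing concrete, here is a toy illustration (my own, not from the paper) of premise selection as link prediction on the dependency graph: nodes are declarations, an edge X → Y means Y uses X, and even a simple neighbourhood-overlap heuristic (Jaccard similarity of dependency sets) already yields a ranking of candidate premises. All declaration names below are made up.

```python
# Toy sketch of premise selection as link prediction on a dependency
# graph.  The declarations and their dependency sets are hypothetical.
deps = {
    "add_comm":  {"nat", "add"},
    "mul_comm":  {"nat", "mul"},
    "add_assoc": {"nat", "add", "succ"},
}

def jaccard(a: set, b: set) -> float:
    """Neighbourhood-overlap score, a classic link-prediction heuristic."""
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_premises(goal: str, candidates: list[str]) -> list[str]:
    """Rank candidate premises for `goal` by dependency-set similarity."""
    return sorted(candidates,
                  key=lambda c: jaccard(deps[goal], deps[c]),
                  reverse=True)

print(rank_premises("add_assoc", ["mul_comm", "add_comm"]))
# → ['add_comm', 'mul_comm']
```

Real systems replace the heuristic with a learned scorer, but the task shape (rank library declarations as likely edges to the current goal) is the same.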

Jason Rute (Oct 28 2023 at 14:27):

This dataset seems to be based around graphs and s-expressions (again, as found in say the olean files or the Exprs in the environment), and their baselines use really simple machine learning models. It would be interesting to compare how well their models do against transformer-based approaches, like those in MagnusHammer, ReProver, or @Adam Topaz's experiments with OpenAI embeddings, just on this link prediction task. Even if transformers are slower, that might be okay, since you only need to run the transformer a few times in a typical premise selection task (assuming one has already run and cached the results of the transformer over the whole library). Or it may be that these simpler approaches are just as good, and worth it for the huge time savings they would create.
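For comparison, the run-once-and-cache pipeline would look roughly like this sketch (the encoder and embedding vectors here are made up; MagnusHammer and ReProver each use their own trained encoders): embed every library declaration once, cache the vectors, and at query time embed only the goal and rank premises by cosine similarity.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical cached embeddings, computed once over the whole library
# by some pretrained text encoder (stand-in values, not real output).
cached = {
    "add_comm": [0.9, 0.1],
    "mul_comm": [0.1, 0.9],
}

def select_premises(goal_vec: list[float], k: int = 1) -> list[str]:
    """Rank cached library declarations against the goal's embedding."""
    ranked = sorted(cached,
                    key=lambda name: cosine(goal_vec, cached[name]),
                    reverse=True)
    return ranked[:k]

print(select_premises([1.0, 0.0]))  # goal embedding closest to add_comm
# → ['add_comm']
```

The expensive transformer pass happens only at indexing time; each query costs one encoder call plus a cheap nearest-neighbour scan, which is why the speed gap with simpler models may not matter much in practice.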

Jason Rute (Oct 28 2023 at 14:27):

Note: to convert this dataset and the models trained on it into a practical tool, I think they would need (in Lean) to extract the corresponding graph data from the environment, but that doesn't seem unreasonable to do in a tactic. It should all be there in the same form in the environment.

Jason Rute (Oct 28 2023 at 14:28):

I still need to look deeper into the models they used and how they used this graph data.

Jason Rute (Oct 28 2023 at 14:31):

Also a comment to the authors (e.g. @Andrej Bauer) about the abstract:

With more than 250 000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format.

One has to be careful with this sort of claim. For example, CoqGym has training data covering 121,644 files (and I think it could also be used for similar prediction tasks, although I don't think it has been).

Jason Rute (Oct 28 2023 at 14:48):

And the dataset used for MagnusHammer contained 433K Isabelle premises.

Scott Morrison (Oct 30 2023 at 02:29):

See also https://github.com/semorrison/lean-training-data#premises, which extracts the term-level dependencies between declarations.

Jason Rute (Oct 30 2023 at 03:49):

And @Jesse Michael Han's https://github.com/jesse-michael-han/lean-step-public was a similar dataset for Lean 3. I don't recall how explicit the dependencies were in that dataset, but they were there. Indeed, we used them to create a premise selection task; it was one of the auxiliary tasks in the PACT paper.

Last updated: Dec 20 2023 at 11:08 UTC