Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: corpus of all rewrites in mathlib4

Scott Morrison (Mar 17 2023 at 01:00):

I've been experimenting a little with extracting tactic invocations in Lean4, and have some moderate success. In particular, I've been able to extract the goal state and rewrite lemma for (almost) every rewrite used in the mathlib4 library. It should be possible to extract other tactics (i.e. besides rw) in a similar fashion.

Here's a random sample of the data:
Screen-Shot-2023-03-17-at-10.59.06-am-1.png

I also have the full goal state (i.e. the hypotheses) available as well. There are approximately 25000 pairs at present (once we restrict to rewriting on the goal, and rewriting by an atomic term rather than some complicated expression), and I'd expect this to roughly double once all of mathlib is ported to mathlib4.

Would anyone like to help me do something with this data? I'd love to have an effective tool for selecting promising rewrite lemmas, given a goal.

Left to my own devices, I would first try cosine-similarity nearest neighbours in the tf-idf vectors for the goals (tokenizing on whitespaces and parentheses?). But I don't think this will do a very good job! I'm sure people here have much cleverer things to do with it, and know all the relevant tooling.

Jason Rute (Mar 17 2023 at 02:03):

You may also want to look at the tooling Daniel Selsam wrote for mining Lean 4 proof data: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/What.20is.20the.20AMI.20project's.20vision.20for.20AI.20in.20Lean.3F/near/290848202

Jason Rute (Mar 17 2023 at 02:05):

And of here is a previous conversation about building rewrite search tools using machine learning: #Machine Learning for Theorem Proving > rw_hint

Jason Rute (Mar 17 2023 at 02:14):

Before you go too far, how well does the built-in Lean 3 tactic suggestion tool (based on meta's HTPS paper) work if you force it to suggest rw tactics? I guess the question is if there is a lot of value in making a specific rewrite suggestion tool over just a general tactic suggestion tool. I could see it go either way. Rewriting is a particularly special setting with a lot of advantages (like the fact that equality proofs are reversible and don't branch, so you can rewrite both the LHS or the RHS when doing a proof search).

Jason Rute (Mar 17 2023 at 02:18):

Also, I've always though rewriting was very similar to solving a Rubik's cube (but harder and more general). I think these two papers are very interesting in that regard:

The main idea is that instead of predicting the next move, you predict the state you will be at N moves later (for a small value of N like 3 or 4). Then you search for that. It does much better. This is hard to implement for theorem proving in general, but for rewriting it should be reasonably straight forward (maybe).

Scott Morrison (Mar 17 2023 at 02:59):

I haven't forgotten the previous conversation about rw_hint and rw_search. :-) I'd like them to one day happen!

Scott Morrison (Mar 17 2023 at 03:00):

I feel like the "built-in Lean 3 tactic suggestion tool" is a bit of a dead-end. It's far from actually built-in, requires remote queries (so is slow and limited) and I want tools that can be used by other tactics.

Jason Rute (Mar 17 2023 at 03:33):

There are a lot of models out there with vastly different speeds to generate features for things like goals and theorem statements. For example, Tactician uses something like TF-IDF, but based on the syntax tree, so it is really fast and cheap to run. (IBM's TRAIL is another similar approach.) Other models use neural networks like TreeLSTMs, graph neural networks, and Transformers (of vastly different sizes). Some are faster than others. The larger (and hence slower) models are possibly better for situations where you don't have to call them as frequently, but sometimes a large model which can make really good suggestions is more useful over a fast model which makes poor solutions. If you want it to be runnable by other tactics, neural models may not be an (easy) way to go (unless you want to interface Lean and a neural network library or an API).

Jason Rute (Mar 17 2023 at 03:33):

The ability to assign every theorem a cached embedding and to similarly assign an embedding to a goal in real time, is immensely useful. It would facilitate all sorts of other tools like library search, Hammer-like tools (e.g. MagnusHammer), tactic suggestions, proof search, library search, etc that other tactics could use. I think this would be more valuable than any one application, especially if the interface was flexible enough to support different embedding methods. Simple ones could be written in pure Lean, and more complex ones could be gotten from external tools or APIs.

Jason Rute (Mar 17 2023 at 03:34):

But I also think the world is moving in the direction of larger models for better or worse. Such models seem to interface well with widgets and editor plugins, so one might also consider that as well.

Jason Rute (Mar 17 2023 at 03:40):

I thought I heard that Lean 4 has some ability to assign metadata to declarations in the environment, but I might be mistaken.

Bartosz Piotrowski (Mar 17 2023 at 11:14):

Hi @Scott Morrison , you may want to check out our recent work we submitted to ITP with @Ramon Fernández Mir and @Edward Ayers . Machine-Learned Premise Selection for Lean: preprint, code with simple demo. (The preprint will appear also on arXiv on Monday).

The assumption of the project was to build a lightweight, easy-to-use and easy-to-modify premise selection tool implemented directly in Lean 4. So we avoid large neural nets and calls to a server, but rather rely on simple, custom-built ML models trained on combinations of syntactic features. (And pretrained models are available for downloading.)

The suggestions are provided via suggest_premises tactic; we also provide a widget that tries to combine the suggested lemmas with simple tactics and shows which ones advance the current proof state.

We focus on premise selection in general, not only rw, and we train on all pairs (theorem statement and its hypotheses, lemmas used in the proof) we could extract from mathlib. But the tool may be adapted specifically to the rw tactic.

(The tool at the current stage may not be super precise, but it was meant to provide a baseline solution, and it may be improved in many ways.)

Scott Morrison (Mar 17 2023 at 12:37):

This sounds great, I'll have a look.

Scott Morrison (Mar 17 2023 at 12:38):

What are the prospects for making it usable in practice? (My bar for usable is pretty high: essentially integration into mathlib4.)

Scott Morrison (Mar 17 2023 at 12:39):

(e.g. a tactic built-in to mathlib, which, if it can't detect a pretrained model locally, offers to curl it for you)

Bartosz Piotrowski (Mar 17 2023 at 13:48):

I think it may be rather easily made usable in practice in that sense. Extracting predictions form a pretrained model is fast, and the model itself is not super heavy (< 1 GB).

Bartosz Piotrowski (Mar 17 2023 at 13:49):

Go ahead, see if you can run it and play with it. Any suggestions for improvements are welcome!

Scott Morrison (Apr 12 2023 at 11:55):

Any progress on making this available in mathlib, @Bartosz Piotrowski?

David Renshaw (Apr 12 2023 at 11:57):

@Scott Morrison how does your rw extraction work? Do you modify term elaboration in Core?

Scott Morrison (Apr 12 2023 at 11:59):

I just dump all the infotrees, modified to include slightly more information in their printed format, and parse them afterwards. It was a very low effort experiment. :-)

Bartosz Piotrowski (Apr 17 2023 at 09:24):

Scott Morrison said:

Any progress on making this available in mathlib, Bartosz Piotrowski?

Not yet! Lately I was busy with other stuff, but soon I'll focus on this. I'll let you know!

David Renshaw (May 24 2023 at 20:18):

Scott Morrison said:

I just dump all the infotrees, modified to include slightly more information in their printed format, and parse them afterwards. It was a very low effort experiment. :-)

Do you have an example showing how I can get my hands on infotrees for this kind of thing? I'm trying Lean.Elab.getInfoTrees, but it's returning an empty array.

Oh... maybe you are doing set_option trace.Elab.info true to get a textual dump?

David Renshaw (May 24 2023 at 20:33):

I can add this to my lakefile:

package mypackage where
  moreLeanArgs := #["-D", "trace.Elab.info=true"]

but then I need to parse the textual output. I'm looking for a cleaner way to extract goal information for all locations in a library.

Scott Morrison (May 25 2023 at 00:24):

I'll send you a link to my very-WIP branch.

Last updated: May 02 2025 at 03:31 UTC