Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: rw_hint

Patrick Massot (Mar 03 2020 at 08:10):

Here is a naive idea from someone who knows nothing about any kind of machine learning. We have Jason's extraction of tactic states from mathlib. Presumably we can then ask his database to list all uses of rw in mathlib. After some more extraction work if needed, we can split up all uses of rw [fact1, fact2,...] into rw fact1, rw fact2, .... Now we have https://github.com/leanprover-community/mathlib/pull/2030. For each use of rw in mathlib, we should now have access to:

what was the current goal
what it became after the rw step in mathlib
the list of everything that rw_hint suggested, together with the goal after applying this suggestion

Now my naive suggestion is: can we train some machine learning stuff to rank the propositions of rw_hint? There are extra credits if such a trained machine could actually be used while writing Lean interactively (i.e. actually improving the output of rw_hint). Presumably this last step would need Lean4 magic to be fast enough.

Jason Rute (Mar 03 2020 at 14:26):

@Patrick Massot Indeed this is exactly the sort of application for which I think Machine Learning will bring a lot of value to ITP systems! The task you describe is known in the literature as "premise selection", "lemma selection", or "clause selection" (in the setting of FOL provers). (Actually, I also recently proposed that machine learned premise selection be added to the new Lean Hammer project.) Let me take the time to flesh out your idea more and write up a specific proposal on how one could accomplish it, including discussion on if it would just be a toy prototype or if it could be a usable tool in Lean 3.

Scott Morrison (Mar 03 2020 at 17:48):

@Jason Rute, we should talk, either here or offline, about rewrite_search, an existing effort in this direction (that goes far beyond what has been PR'd to mathlib in rw_hint).

Jason Rute (Mar 05 2020 at 23:10):

Sorry for the long delay. I’ve been pretty busy with work and planning my upcoming wedding.

Jason Rute (Mar 05 2020 at 23:10):

@Scott Morrison Yes, it would be good to talk. I know nothing about that project. Is Rewrite heuristics and pair exploration for automated theorem proving a good place to learn more? And is the lean-rewrite-search repo the correct repo? I took a quick glance at the paper, but I still have a lot of questions, especially about how the distance estimate is calculated and used.

Jason Rute (Mar 05 2020 at 23:10):

I think that @Patrick Massot's question is a bit orthogonal to the rewrite_search project. Patrick’s question is about selecting or ranking lemmas which can be used for rewriting (even in the case where the goal isn’t an equality or where rewriting alone isn’t going to solve the goal). I think premise selection also goes way beyond the rewrite tactic (and indeed there are a lot of papers on the subject). I don’t think there currently is any premise selection in the rewrite_search tactic (at least according to the paper), but it might be a good addition to prune the search space at the beginning of the search. I don’t want to step on any toes, but I hope that some open discussion about applications of machine learning (in particular premise selection) would be good for the Lean community.

Jason Rute (Mar 05 2020 at 23:10):

In the end what I think Patrick is looking for is a policy function $f$ which takes as input both the current goal, and all the available lemmas and hypotheses which can be used for rewriting (this could be before or after they are selected as applicable for rewriting) and ranks or scores the lemmas according to how “useful” they are. (An alternate or complimentary approach would be to have a value function which takes the goals after rewriting and assigns each of them a value of how “good” they are. A value function would be quite valuable in search algorithms like rewrite_search.) While it is possible to hand-engineer the policy function, I assume we want it to be (at least in part) learned automatically.

Jason Rute (Mar 05 2020 at 23:10):

For this sort of problem, there are typically two main methods (which are compatible with each other) to gather data to train such a policy function. The first is supervised learning where we use labeled training data, probably from human proofs. The second is reinforcement learning which (loosely speaking) generates it own labeled data by exploring the environment and then uses that for training. Let's for now just consider the supervised learning case.

Jason Rute (Mar 05 2020 at 23:11):

For the supervised learning we need positive and negative examples of premises, since this is essentially a binary classification problem. I think Patrick is correct in his suggestion of one way to get this data. We can use my dataset and the script which generates it as a starting point, modifying it as Patrick suggested. There are some minor details to consider, however, I agree that with minimal work, we could get a clean dataset of both positive and negative examples. This dataset could even be useful to machine learning researchers on its own.

Jason Rute (Mar 05 2020 at 23:11):

Before I get into how to build the premise selection function and the machine learning algorithm, I would suggest for a problem like this, to first do it in all in Python and not Lean (except for maybe some of the hand-engineered feature extraction which might be easiest to do in Lean). Since you will already have a dataset of goals with positive and negative examples, just treat this like a standard machine learning dataset. This way you can do rapid prototyping and leverage standard machine learning Python packages. You can even get help from an ML researcher or someone in the Lean community with more ML experience.

Jason Rute (Mar 05 2020 at 23:11):

Also, with this dataset, it is very important to break it into testing and training (or train, validate, test) so as to avoid your algorithm memorizing the correct lemma to use from the dataset. The question on how to split this up for theorem proving is an interesting and important question in its own right. (Does one split randomly by goal instance, by proof, by file, or by subject area?)

Jason Rute (Mar 05 2020 at 23:11):

Then later, you can incorporate your algorithm into Lean either directly or via a FFI. Note that a lot of standard ML algorithms (logistic regression, naive Bayes, simple feed-forward dense neural networks, linear SVMs, gradient boosted trees) can be coded directly in Lean (assuming Lean can work with floating point numbers and standard floating point operations). (Training these models is another story, but once you have trained parameters, it would be easy-ish to directly code these parameters in Lean and then see if the speed is sufficiently fast.)

Jason Rute (Mar 05 2020 at 23:11):

So in summary, my suggested steps are:

Create the dataset
Split into training and testing datasets
Use Python and standard machine learning packages to rapidly prototype ideas. Start simple. You don’t need state-of-the-art, just good enough (and fast enough). Maybe bring in someone with machine learning experience to help.
Code up your pre-trained algorithm in Lean, either directly or via FFI. Note that you don’t need to have any of the training libraries in Lean, just the learned parameters.

Jason Rute (Mar 05 2020 at 23:11):

Now, for how to design the premise selector:

One typical way to design such a premise selection function $f$ would be to break it into two parts. The first part encodes the goal and premises into vectors of fixed length. This is known as an embedding (since it embeds a large space, the space of all goals and formulas, into a smaller space $\mathbb{R}^d$ for a fixed dimension $d$ . Each coordinate of this vector is known as a feature (and the process is called feature extraction, since each coordinate describes a particular feature of the formula (e.g. the number of variables in the formulas). I'll talk more about feature exaction in a bit, but there are a lot of options here, both learned and hand-engineered. The next part is a learned function which takes in the vector embeddings (i.e. features) for each formula, goal pair and assigned a value to that pair. The type of function and value depends on the machine learning method. In many cases it would be a value between 0 and 1 where 1 means the premise is useful and 0 means it’s not. (You can think of this value as the probability that the premise is useful). (If using neural approaches, there is also the option to have a probability distribution over all the premise choices, using the softmax function. This would be good if, say, we want to only consider the case where we rewrite with one lemma at a time.)

Jason Rute (Mar 05 2020 at 23:12):

For the embeddings, there are a lot of options, but they typically fall into two camps:

Hand engineered embeddings are written by a human and can be coded directly in Lean. These often have the advantage of being symbol agnostic, which is really good if you want to apply these embeddings to new problems. It is probably good to throw in a few features you know will be important, such as if the lemma is from the local context or not. While one can put a lot of work into making these features, there are also some good standard options which are pretty general. One of the simplest is a “bag-of-words” approach where each symbol is it’s own feature and we just count the number of that symbol in the term. TF-IDF is a variation on this approach which weights words according to how unique they are to that formula. Josef Urban et al’s work (e.g. the ENIGMA project) often uses another approach called term-walks, and IBM research has come up with TRAIL patterns, which also are very good for this use case. Both of these latter approaches try to capture patterns in the formulas that go beyond the vocabulary. Now, you may notice that if we are trying to create a fixed length vector embedding that it is a problem to have an arbitrary number of words as features. One standard way to fix this is to hash the word/symbol/pattern and then use that hash as the index of the feature.
Machine learned embeddings are usually created by a graph or tree neural network which is able to traverse the graph of the term. I don’t see an easy way to code this in Lean and one would need to use some interface to call outside of Lean to compute this. However, sometimes graph neural networks produce great results, but it is not clear yet to me how well they generalize to new vocabularies. I think this depends a lot of the specifics of the architecture. Also graph neural networks require that the entire policy function be made of of composed neural networks too. This paper has the state of the art for premise selection using graph neural networks, but simpler approaches would also work.

Jason Rute (Mar 05 2020 at 23:12):

My advice would be to try to use one of the hand-engineered approaches first. It would involve more upfront coding, but one wouldn’t have to deal with graph neural networks which present their own difficulties. One might still want to use the FFI since then one can cache the embeddings since you are going to be asking for the same embeddings over and over again. (I assume Lean 3 doesn’t have a way to cache function calls.)

Jason Rute (Mar 05 2020 at 23:12):

See here for a discussion of a paper which talks about these various different embedding approaches. One thing to note is that in the setting of that paper, the policy function is used inside a search algorithm so it is called many many times and that benefits the faster hand-engineered features over the slower neural network features, even if the slower features produce better results. For this case I think we wouldn’t be calling the embedding function nearly as often and we could afford more expensive algorithms if they show vastly superior results.

Jason Rute (Mar 05 2020 at 23:12):

Now, as for the machine learning part, simple neural networks, XGBoost, linear SVMs, logistic regression, or naive Bayes are all reasonable first approaches. Again, if you follow the rapid prototyping advice above, I would start with the simpler ones first, like logistic regression, naive Bayes, and linear SVMs since they are fast, easy to train, and easy to code into Lean. But it doesn’t hurt to try the others too to see how they improve the results.

Jason Rute (Mar 05 2020 at 23:12):

It sounds like a fun project. I hesitate to commit to working on it however, especially over the next 5 weeks, for which I will be very busy (or on my honeymoon).

Jason Rute (Mar 05 2020 at 23:39):

@Scott Morrison I read through that paper more closely, and your use of machine learning is a nice trick! You are not pretrainjng the classifier per-say but instead training it during the search algorithm. This is interesting since it could extend to almost any path finding problem.

Jason Rute (Mar 05 2020 at 23:43):

...if you can work from both ends of the path.

Jason Rute (Mar 05 2020 at 23:47):

It also reminds me a bit of meta learning, whereby one pretrains a model and then updates that model in real time according to the task at hand. In your case, you aren’t pretraining, but you are updating your model in real time.

Jason Rute (Mar 06 2020 at 00:01):

It also reminds me of this self learned AI algorithm to solve the Rubik’s cube and similar problems. https://doi.org/10.1038/s42256-019-0070-z I think your problem is more difficult since the goal in the Rubik’s cube never changes and they exploited that in their training algorithm.

Patrick Massot (Mar 07 2020 at 15:53):

Thanks Jason! It's hard to predict whether I'll have any time to work on this, but it would be a fun project to start learning stuff about machine learning. Of course the standard issue is I already have a huge pile of project, either ongoing or only dreamed of. Let's see if I can remove a couple of things from my TODO list.

Jason Rute (Mar 07 2020 at 16:39):

Also, feel free to reach out in ~5 weeks. I think I could probably at least help you create that training dataset. It seems pretty straight-forward now that I've already started working on similar things.

Jason Rute (Mar 20 2020 at 00:56):

My wedding got postponed because of coronavirus :sad:, so I'm actually much more free now. I'm going to continue working on my extraction tool and I can start thinking more about this rw_hint training dataset that @Patrick Massot had in mind. Also, @Scott Morrison, I'd be happy to discuss rw_search more.

Johan Commelin (Mar 20 2020 at 03:59):

I'm sorry to hear that your wedding got postponed. Good luck and stay safe!

Patrick Massot (Mar 20 2020 at 13:11):

Yeah, this is really sad. Please take care of you and your fiancée before playing with rw_hint!

Jason Rute (Apr 08 2020 at 12:39):

I'm to the point that I'd like to start "playing with rw_hint". Can one of you remind me how to use a branch of mathlib? Is it as simple as modifying the leanpkg.toml file?

Patrick Massot (Apr 08 2020 at 12:39):

leanproject get mathlib:rw_hint

Patrick Massot (Apr 08 2020 at 12:40):

will download mathlib on the correct branch with compiled oleans, in mathlib_rw_hint

Jason Rute (Apr 08 2020 at 12:43):

Thanks! I do this from inside an existing lean project right?

Patrick Massot (Apr 08 2020 at 12:45):

This is when you want to work on mathlib itself.

Patrick Massot (Apr 08 2020 at 12:46):

Do you want this for mathlib as a dependency?

Stanislas Polu (Aug 21 2020 at 17:19):

Out of curiosity where can I learn more about rw_hint ?

Stanislas Polu (Aug 21 2020 at 17:21):

Ah I found PR #2030 on mathlib. Is rw_hint integrated in the VSCode extension today?

Bryan Gin-ge Chen (Aug 21 2020 at 18:26):

No, the PR was closed without being merged.

Bryan Gin-ge Chen (Aug 21 2020 at 18:26):

ping: @Scott Morrison

Scott Morrison (Aug 21 2020 at 22:59):

Should we restore this? It's a pretty small PR.

The ideal usage would be to integrate it with the goal inspector, so that when you zoom in on some subexpression there is a button you can click to run rw_hint just on that subexpression. This would make it hugely more useful, because then you are using your knowledge of what you want to rewrite to direct the rewriter.

Jalex Stark (Aug 22 2020 at 18:10):

can we autogenerate conv mode scripts from clicking on subexpressions?

Last updated: May 02 2025 at 03:31 UTC