Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Data sets: where to put them?

Jason Rute (Apr 24 2020 at 23:58):

I'm in the process of creating some clean Lean datasets ready for machine learning prediction, and I'm trying to figure out where to put them. Any suggestions?

Jason Rute (Apr 24 2020 at 23:58):

One option is GitHub? Another is Kaggle. Or I could put it on GitHub and link to it from Kaggle. I am sure there are many other places. Any one with familiarity with this sort of thing? Does anyone else what to see, play with the datasets?

Jason Rute (Apr 24 2020 at 23:59):

One of the datasets will be all (well...most) instances of tactic keywords and the state when they are applied. (I'm a bit of a perfectionist and there are a number of small issues in this dataset, but I think it is worth releasing anyway to start playing with. I want to learn more about formula embeddings and I think it will be a great dataset to start with. I can always create another version later. (This isn't a state of the art dataset. HOLStep and the Mizar datasets have been around for a long time, so I don't think it needs to be perfect at the beginning.)

Jason Rute (Apr 25 2020 at 00:00):

Another dataset will be all of the instances of the rw tactic, the exact rewrite formula applied, and the state just before the rewrite is applied. If it is easy, I might also include all the outputs of rw_hint so that it is a true prediction problem (pick which of these rewrites is the one the human used).

Jason Rute (Apr 25 2020 at 00:01):

I'm also still working on other methods of "proof recording" so I can create more and better datasets.

Alex J. Best (Apr 25 2020 at 00:03):

This is great! Can't wait to take a look. I'd say github unless you are over the file size limit (100mb per file).

Jason Rute (Apr 25 2020 at 00:12):

I think I can stay under that. I can always break up the files.

Jalex Stark (Apr 27 2020 at 12:42):

Kaggle is a good place to get some ml folks accidentally interested in ITPs :)

Stanislas Polu (Jul 16 2020 at 07:48):

@Jason Rute hi! sorry for the long radio silence. Where can I find the latest about your datasets?

Jason Rute (Jul 19 2020 at 00:26):

@Stanislas Polu Also sorry for taking a few days to get back to you. I moved a few weeks ago, and haven't got back into the COVID lockdown groove I was in before. Also, I have some stuff in my personal life which is making me busier right now. I haven't put any datasets online yet, but let me tell you what sort of things I have been working on (and also convince you why the datasets are a mess right now).

I figured out how to run the lean server info command on every character. It takes about half a day on a c4 EC2 instance if I recall. You can find the tool here: Lean Info Scraper. While I didn't put the raw data online, I did put a visual version of it here: Annotated Lean. As discussed here, there are a lot of peculiarities in the results. This has made it hard to turn them into a clean dataset. I recently realized that some (but certainly not all) of the peculiarities are due to the way I was calling the server. If one calls the info command twice before the first info command finishes, weird behaviors happen. That might have been fixed in the most recent version of lean, and it would also wouldn't be an issue if I used asynchronous programming to call the lean server. That is what Patrick's lean-client for python does. (I've been working to clean that project up so I can use it.)
After the discussion here, on making a machine learned ranking tool for rewrites, I've gathered some data on rewriting in lean. Instead of using the messy lean-server info command, I just modified lean to record every application of the rw (and similar tactics) during compilation. I have the details for each rw tactic what was used to rewrite, and what was the goal before the rewrite. It took a few hours to run. Also in order to run it I have to manually edit the lean code, but this isn't bad since it is just .lean files. I don't have to touch the C++ files. I was hoping to clean up the data a bit before posting it. I was also possibly hoping to integrate with the rw_hint. That way I would have positive and negative examples. Right now I just have positive examples (which using hard negative mining and the like would still make it ok data). Also, I'm very uncertain right now on exactly how one should present goals and expressions for machine learning. There are a number of approaches: (1) just take the pretty-printed goal (however, no one does ML from the PP goal), (2) take a parsed version of the expression (the problem here is that there are a number of different small issues including the handling of variables and types), (3) run the expression through a standard hard-coded feature extractor for logical expressions (there are a number of standard ones and this is good since no Lean tactic is going to start using GNNs, LSTMs, or Transformers anytime soon). Of course it is possible that I'm just overthinking this and should just dump what I have. (The code lives on a branch here: https://github.com/leanprover-community/lean/compare/master...jasonrute:v3.9.0-rw-recording . Anyone could run it. Ask me if you want instructions.)
I could also apply something similar to all the standard tactics and get information about every tactic application, including the goal, the tactic, and the parameters used. Again it involves modifying the source lean files (but not the source C++).
My big project is a new lean-python interface which doesn't go through the lean server. Instead, it lets you communicate back and forth between Lean and Python inside a tactic. This lets one create fast-ish Gym environments for Lean inside Python (where Lean applies the tactics that Python tells it to and keeps tract of the results, including a branching state tree) and where Python runs the machine learning. Conversely it can also be used to create prototype tactics which run inside Lean but communicate with Python. Again, Lean tests out various tactics (as Python tells it to do) and keeps track of the results. I'm close to done with my prototype, but over the last 6 weeks I have got caught up with other things. After I finish my prototype, I'm planning no soliciting advice from you and others.
My (too ambitious) goal was to finish the new Lean-Python communication tool and then make a version of TacticToe for Lean. I will record tactic proofs (as explained above) and then plug them into k-nearest neighbors. Using my interface, I could create a tactictoe tactic which performs a tree search guided by this k-NN model. With minimal work, a user, like Kevin Buzzard, could set it up and try it out and let me know what he thinks. It might also be possible to install it on a web server. However, I'm just one person without a lot of free time, so we will see...

Besides the above issues (especially my lack of free time), there are also some others:

Lean 3 (both the library, the application, and the server) is changing really quickly, and the library is being massively refactored, which makes it hard to build stable datasets (unless I just fix a previous version of Lean like Lean 3.10).
Lean 4 will theoretically come out in two months. We will see if they keep this timeline, but again, it makes it unclear if all this data gathering and code will go to waste.
I don't think there is a lot of interest in Lean and machine learning right now. I'm hoping after I build some of the above tools, I can create interest, but we'll see.

Stanislas Polu (Jul 20 2020 at 09:10):

@Jason Rute thank you so much for your very detailed response. I really enjoyed the annotated Lean files!! This is really great. We'll dig into the lean-client for python as the demo there seems to include most of what we need to get started with doing exploratory ML work on Lean. Of the issues you mention, I think we can help with the third one obviously :) For the two others I feel like we can simply freeze the version of Lean 3 we use and I'm not too worried that Lean 4 will provide reasonable interface we can adapt to?

Last updated: May 02 2025 at 03:31 UTC