Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: What is the AMI project's vision for AI in Lean?


Jason Rute (Jul 23 2022 at 21:12):

(If you would like some food for thought, here are some thoughts I had a year ago about interfaces. Some follow-up discussion is in #Machine Learning for Theorem Proving > HOList or Lean?. Also, I'm sure my thoughts have changed on this a decent amount over the past year.)

Leonardo de Moura (Jul 24 2022 at 12:51):

Thanks for asking the question. I am not an AI/ML expert, but I realize the impact this kind of technology may have on proof assistants and formal mathematics in the future. For example, @Zhangir Azerbayev's demo last week was terrific. I was trying to convey on the slide that we are committed to making Lean the best platform for research on AI math assistants. I am particularly interested in projects that have the potential to benefit many Lean users.
Note that Lean 4 is implemented in Lean and has many more APIs available to users than Lean 3. We are happy to add additional APIs that may help AI/ML research projects.

Zhangir Azerbayev (Jul 25 2022 at 15:39):

Here are some of my thoughts on what kind of APIs might be useful.

For machine learning researchers, it would be fantastic if there were officially maintained and supported versions of lean-step (which is used to extract pre-training data) and lean-gym (which is used for proof-search). I spoke to @Bartosz Piotrowski recently, and he may be interested in porting these to lean 4. Ideally, these libraries would be well-documented enough that a researcher whose needs differ from the default behavior could write a fork without too much difficulty.

For building applications that can be put in the hands of users, providing APIs for other programming languages to interface with Lean would be helpful. For example, right now the lean-chat app, which is written in javascript, only returns raw text to the user. The app would be more usable if we could syntax highlight and type check the outputs.

Jason mentioned in his write-up about interfaces that text-to-text models aren't really aware of imports and namespaces. For example, they might predict rw [add_comm] when the nat namespace isn't open. A very useful feature for post-processing LM outputs would be a way to figure out which imports/namespaces/locales a given piece of code requires. I don't know if this is easy or hard to implement.
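
Just to make the idea concrete, here is a rough sketch of what such a post-processing step could look like. This is purely illustrative: the index file and helper names are made up for the example, not an existing Lean or mathlib API.

```python
# Illustrative sketch: given an identifier the model produced without its
# namespace (e.g. `add_comm`), look it up in a precomputed list of fully
# qualified names and suggest namespaces to `open`.

def load_index(path="mathlib_decl_names.txt"):
    # One fully qualified declaration name per line, e.g. "nat.add_comm".
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def suggest_opens(unknown_ident, index):
    # Return every namespace whose `open` would bring `unknown_ident` into scope.
    suggestions = set()
    for full_name in index:
        parts = full_name.split(".")
        for i in range(1, len(parts)):
            if ".".join(parts[i:]) == unknown_ident:
                suggestions.add(".".join(parts[:i]))
    return sorted(suggestions)

# suggest_opens("add_comm", ["nat.add_comm", "int.add_comm", "add_comm"])
#   == ["int", "nat"]
```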

@Edward Ayers Do you have any more thoughts from your experience working on lean-chat?

Jason Rute (Jul 26 2022 at 04:24):

@Leonardo de Moura and @Zhangir Azerbayev, thanks for the responses! As for supporting projects "that have the potential to benefit many Lean users," I think it helps to make it easy for researchers to put in the extra effort to make their systems usable in Lean. Providing well-documented examples of similar systems goes a long way.

Jason Rute (Jul 26 2022 at 04:24):

Note, here is one common situation that comes up: For reinforcement learning and bulk evaluation of an AI system, it is usually most useful to call Lean from inside Python (or another language), as is done in lean-gym. However, to use the AI system in practice, it is usually most useful to call the AI model (either through a server or through a child process) from within a Lean tactic (or widget, or editor plugin). This is what is done in lean-gptf. If a research team were only going to build one interface, it would probably be the former, since that is what they need to train and evaluate their model. So making it easy to do both with a similar API would go a long way toward turning a research AI model into a user-facing AI model with little effort (well...that and making it easy to install or serve the model to the user's system). Also, this would make it easier to debug a model by being able to play with it inside Lean itself.
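
To make the Python-side half of this concrete, here is a very rough sketch of what a gym-style wrapper tends to look like. The executable name and the command/field names below are illustrative assumptions, not the actual lean-gym protocol.

```python
# Sketch of a gym-style wrapper: Python drives a Lean child process over a
# line-based JSON protocol.  Names like "lean-gym", "init_search", and
# "run_tac" are placeholders for this sketch.
import json
import subprocess

class LeanGym:
    def __init__(self, cmd=("lean-gym",)):
        self.proc = subprocess.Popen(
            list(cmd), stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def _call(self, request):
        self.proc.stdin.write(json.dumps(request) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def init_search(self, decl_name):
        # Start a proof search for `decl_name`; the reply contains the id of
        # the root proof state.
        return self._call({"cmd": "init_search", "decl": decl_name})

    def run_tac(self, search_id, state_id, tactic):
        # Apply a tactic string to *any* previously returned state, so the
        # Python side fully controls the order of the tree search.
        return self._call({"cmd": "run_tac", "search": search_id,
                           "state": state_id, "tactic": tactic})
```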

Jason Rute (Jul 26 2022 at 04:25):

@Zhangir Azerbayev said:

For machine learning researchers, it would be fantastic if there were officially maintained and supported versions of lean-step (which is used to extract pre-training data) and lean-gym (which is used for proof-search).

Speaking of such projects, I love that Lean and the Lean community have a tradition of centrally maintaining the most important Lean tools so that they stay up to date and useful to the widest audience. Here are the AI for Lean repos that could in principle be centralized, documented more clearly, and generalized to the needs of a wide audience.

Jason Rute (Jul 26 2022 at 04:25):

For Lean 3

  • lean_proof_recording: This is my repo used in the PACT paper and cleaned up by Stan Polu for some future work at OpenAI. The purpose is to record the Lean tactic data, including proof states and tactic commands. This information isn't stored in the Lean environment, and I didn't know how to hack the Lean 3 C++ code, so I did something very Rube-Goldberg-esque (I'm both very proud and very horrified by what I did to make it work). Because it is such a hack, it has a tendency to break when Lean is upgraded. I don't maintain it anymore (and for certain reasons I can't). If I were going to do it again for Lean 3, I would build it on top of Mario's #general > AST export for lean 3 used in the Lean port, which is now built into the Lean 3 community edition. Also, while the most important information I gathered (and really the only information used from this repo in the PACT paper) was the text strings, I also gathered a lot more information that could be useful down the road.
  • lean-step-public: This is Jesse's repo used in the PACT paper to get all the term-level information. That information is stored in the Lean environment, so this is mostly Lean meta code.
  • lean-step-tpe: This is Jesse's repo containing the evaluation code that we used in the PACT paper. In particular, this code runs the model from within Lean (using a web server IIRC). Lean even handles the best-first tree search, which was written in Lean meta code. Since the model was text-based and served over a web API, it is a pretty general paradigm. I think an interface to fairseq (another language-model library) was also provided.
  • lean-gptf: This is the user-facing tactic coming out of the PACT work. See #lean-gptf. The tactic basically just queries a web server with the goal state and gets back tactic suggestions. Again, the code is quite general and would work for any text-to-text model (a toy sketch of this kind of suggestion server follows after this list). I can't remember if OpenAI is still supporting this.
  • lean-gym: This is a library for running Lean from Python on a standard set of problems, including mathlib theorems and miniF2F. It is a gym-like environment (if you are familiar with the OpenAI Gym or similar environments for reinforcement learning research). Python can send tactics to Lean in plain text and get responses. The gym also lets Python jump between different proof states in the same proof search tree, giving it complete control over the order of the tree search. I don't know if OpenAI used this tool in their recent papers on AI for Lean, but I do think OpenAI still maintains this. See #Machine Learning for Theorem Proving > Getting started with lean-gym
  • I don't know if OpenAI or Meta has released, or will release, the code for their recent papers. (I don't think Meta used lean-gym in their recent work, since Meta's proof search breaks up the proof state into its individual goals to handle each goal separately. lean-gym keeps the whole proof state together.)
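
For illustration, here is roughly what the model side of that text-to-text paradigm could look like as a toy server. The endpoint, JSON fields, and the generate_tactics stub are all assumptions made for this sketch; it is not the actual lean-gptf server.

```python
# Toy "goal state in, tactic suggestions out" web service illustrating the
# text-to-text serving paradigm described above.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_tactics(goal_state, n=8):
    # Stand-in for a real language model; returns canned suggestions here.
    return ["intros", "simp", "rw [add_comm]"][:n]

class SuggestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        suggestions = generate_tactics(request["goal"], request.get("n", 8))
        body = json.dumps({"tactics": suggestions}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # A tactic like lean-gptf would POST the pretty-printed goal here and show
    # the returned strings as suggestions in the editor.
    HTTPServer(("127.0.0.1", 8000), SuggestHandler).serve_forever()
```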

Jason Rute (Jul 26 2022 at 04:25):

Lean 4

  • lean-gym for Lean 4 prototype: This is a prototype for a lean-gym in Lean 4. When Stan introduced lean-gym, Daniel Selsam made a prototype Lean 4 version. I don't know if it was just his expertise in Lean 4, but he made it look really easy, writing it in a day or two if I recall. He also did an experiment with tactic tracing in Lean 4 here. It is remarkably simple, nothing like my hacky lean_proof_recording. If one is willing to wait for Lean 4, I'd just start with Daniel's code; Lean 4 makes this all much easier. Also, Lean 4 has a different tactic style: tactics are more compositional and nested, and this can make the lean-gym and lean-gptf paradigm, where a single action is a whole tactic, break down. Instead, Daniel showed it is easy to break up a nested tactic into its parts automatically (based on the macro system) and then predict only one part at a time. See #Machine Learning for Theorem Proving > Lean gym for Lean 4 (and its backlinks to previous topics) for more discussion on this.

Edward Ayers (Jul 26 2022 at 13:19):

Lean 4 is already a large improvement over Lean 3 in this regard. Two things that spring to mind that would be useful as a Python API are:

  • a server mode where Lean accepts individual commands or declarations for checking (maybe something like a notebook-style workflow)
  • import-suggesting: given an unrecognised identifier, suggest import and open statements that are needed.

Timothee Lacroix (Sep 08 2022 at 08:32):

Hey, I don't know how late I am to this party. I worked with Gabriel on the Lean API for the Meta paper on proof search.

The current code we have isn't super usable by third parties, but I'd be happy to share it if the plan is to get the same capabilities maintained in Lean 4 (most of the hard parts were written by Gabriel anyways :p )

Here is what our code currently allows. It's similar in interface to lean-gym but has some useful differences.

  • Create a "proof session" to prove a declaration. For declarations in mathlib, we load the exact parser state at the declaration location (meaning we have the correct locales, and for a simp lemma, the lemma itself hasn't been added to simp yet :) ). This was done rather brutally by keeping full parser states for mathlib in RAM, which is quite heavy and required either 45 minutes of parsing mathlib at startup or the use of an internal C++ checkpointing tool ...

  • Apply a tactic to a given "node" in a proof session (a rough client-side sketch follows after this list):

    • Allows a timeout and limits on a bunch of metrics (we restricted adding hypotheses with the same name, and we limited the term size, the number of metavars, and the number of subgoals).
    • The resulting tactic state is split into "independent" (ie no shared metavariables) states.
    • Using the "fast" type-checker for tactics is not enough, so Gabriel wrote something that allows invoking the kernel type-checker. This is critical, as it gives a direct signal on whether a tactic worked or not, rather than waiting for the final kernel evaluation only to realise some goal magically disappeared :/
    • New "nodes" are added to the session if no previous alpha-equivalent node exists in the session; otherwise, the previously existing node is returned. We do this to merge nodes in our search trees, but it makes reconstructing the final proof quite difficult (I do it from Python). I believe it would be much more robust to have something in the API that reconstructs a proof from a given proof tree in the session (since the Lean parser would be available then).
  • There are currently lots of Lean segfaults, which I've stopped chasing down; I just restart the Lean API. This leads to lots of lost proof-search resources.

  • Load a new user-given theorem or have:

    • This is useful for research in conjecturing, autoformalization and so on. For now we first load an existing decl to set the context (imported files, open namespaces, etc...), then change the goal to something user specified.
    • We were somewhat hindered by not being able to "round-trip" between pretty-printed tactic states and this. (Using the full print is not convenient).
  • Run an arbitrary Lean command from a file and get its output (we use this for automated proof cleaning).
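
To give a flavor of how a client drives this, here is a rough sketch. Every command and field name below is invented for illustration; the real (unreleased) API differs.

```python
# Sketch of driving the proof-session API described above; `api` stands for
# any request function (JSON over a pipe, a socket, ...).
def new_session(api, decl):
    # Loads the parser state at `decl` (correct opens; for a simp lemma, the
    # lemma itself is not yet in the simp set) and returns the root node id.
    return api(cmd="new_session", decl=decl)

def expand_node(api, session_id, node_id, tactic):
    reply = api(cmd="apply_tactic", session=session_id, node=node_id,
                tactic=tactic, timeout_ms=2000, max_subgoals=16)
    if reply.get("error"):
        return None                # tactic failed, timed out, or hit a limit
    # The resulting state is kernel-checked and split into independent
    # (no shared metavariables) child nodes; a child that is alpha-equivalent
    # to an existing node comes back with that node's id, so the caller can
    # merge branches of its search tree.
    return reply["nodes"]
```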

To add to the capabilities previously mentioned regarding an officially supported way of extracting datasets, it would be amazing to have access to the Lean tokenizer from Python. lean-step / PACT are amazing data sources; something I'd love to see is an official, versioned dump of lean-step / PACT extractions for mathlib.

Finally, regarding interaction with ML models integrated into VSCode, I've put out a PR on vscode-lean. The idea is to have something general where users can easily plug their favorite "suggestion" API in. Most likely these suggestions won't be amazingly useful to start, but hopefully they'll only get better :)
One capability I've "removed" from this PR (because it required modifying the Lean server and I didn't want people to deal with a Lean fork ...) is evaluation of all suggested tactics by Lean. This was useful to filter out (or at least color-code) model suggestions (to quickly see if a tactic closes the goal, check by hovering what the new subgoals are, or see if the tactic just fails), but was quite CPU-hungry ...
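
The filtering itself is conceptually simple; something like the sketch below, where run_tac stands in for whatever tactic-evaluation call the modified server exposed (it is not an existing API).

```python
# Sketch of the "evaluate every suggestion" idea: each suggested tactic is
# tried against the current goal and tagged, so the UI can color-code it.
def classify_suggestions(run_tac, state_id, suggestions):
    results = []
    for tac in suggestions:
        reply = run_tac(state_id, tac)
        if reply.get("error"):
            status = "fails"
        elif not reply.get("goals"):
            status = "closes the goal"
        else:
            status = f"{len(reply['goals'])} subgoal(s) left"
        results.append((tac, status))
    return results
```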

I'd be super happy to discuss this in more detail / help out in whatever ways I can. The current state of things in our team is that we have our Lean API built around Lean 3.30 and a matching mathlib commit. Playing catch-up with Lean and mathlib is a bit difficult since we're far from being Lean experts, so having everything supported would allow us (and the model we produce) to stay current.

Jason Rute (Sep 08 2022 at 13:11):

Timothee Lacroix said:

To add to the capabilities previously mentioned regarding an officially supported way of extracting datasets, it would be amazing to have access to the Lean tokenizer from Python. lean-step / PACT are amazing data sources; something I'd love to see is an official, versioned dump of lean-step / PACT extractions for mathlib.

In Lean 3, parsing is done character by character, and I don't think there is anything internally like a token, although maybe I am just not thinking about it in the right way, since there is definitely something token-ish about Lean's syntax and @Mario Carneiro wrote a tool to parse Lean 3 into an abstract syntax tree. In Lean 4, I think parsing proceeds via a bunch of intermediate steps which should be able to provide a full and faithful abstract syntax tree of a file. From that, one should be able to get tokens as well as a lot more semantic information. If those ASTs are also combined with internal information from Lean, such as goal states (both for tactics and for sub-terms), that would be a treasure trove of information. For Lean 3, I would probably start with Mario's tool. For Lean 4, I would start with Daniel's tool. (Both linked above.) But I agree it would be best if the Lean devs or the Lean community maintained these tools and handled making versioned dumps of the dataset.

Jason Rute (Sep 08 2022 at 13:12):

Timothee Lacroix said:

Finally, regarding interaction with ML models integrated into VSCode, I've put out a PR on vscode-lean. The idea is to have something general where users can easily plug their favorite "suggestion" API in. Most likely these suggestions won't be amazingly useful to start, but hopefully they'll only get better :)
One capability I've "removed" from this PR (because it required modifying the Lean server and I didn't want people to deal with a Lean fork ...) is evaluation of all suggested tactics by Lean. This was useful to filter out (or at least color-code) model suggestions (to quickly see if a tactic closes the goal, check by hovering what the new subgoals are, or see if the tactic just fails), but was quite CPU-hungry ...

@Edward Ayers @Wojciech Nawrocki @Bartosz Piotrowski This sounds like your area of interest.

Timothee Lacroix (Sep 08 2022 at 13:43):

@Jason Rute regarding tokenization, indeed what the tokens are is up for debate :)
Something natural would be to have identifiers tokenized as {name1} DOT {name2} ...
Then you want symbols like [ { ( or the notation variations to also be considered as tokens ...

We used BPE tokens learned on arXiv + Lean I believe, which is good but not perfect. In particular, one thing we aren't easily able to do is "normalize" tactics: remove extra whitespace, or similar things. We did this with a bunch of regexes, but it's not easy to do without breaking semantics, and we're never sure that someone won't introduce a new notation that'll break our regexes ^^
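
For what it's worth, the normalization is conceptually no more than something like the toy sketch below (not our actual code): it also splits dotted identifiers into {name} DOT {name} pieces as suggested above. Nothing in it is notation- or unicode-aware, so new syntax can easily break it.

```python
# Toy regex-based normalization: split a tactic string into identifiers,
# numbers, "..", and single-character symbols, then re-join with single
# spaces (collapsing extra whitespace along the way).
import re

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_']*|\d+|\.\.|[^\sA-Za-z0-9_']")

def normalize(tactic: str) -> str:
    return " ".join(TOKEN.findall(tactic))

# normalize("rw [nat.add_comm  a b]")  ==  "rw [ nat . add_comm a b ]"
```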

Jason Rute (Sep 08 2022 at 14:52):

Note that notations are where all the issues lie. You can add new binders, delimiters, and infix tokens with notation. But I guess those are all just tokens. As for notations like ∫ (x : ℝ) in a..b, f x ∂μ, I guess every maximal substring which is just part of the notation should be its own token, so ∫, in, .., the comma, and ∂ would be the tokens for this notation, and ∫ ( x : ℝ ) in a .. b , f x ∂ μ would be the full tokenization of the term. It looks like the mathlib docs already know about this sort of thing, since you can click on a token to go to its corresponding definition.

Jason Rute (Sep 08 2022 at 14:53):

For Lean 4, normalization can be done automatically with the syntax AST. For example, in Daniel's tool, he got the syntax AST and used that to generate the data strings (instead of just using the human-written data). That normalizes at the level of syntax at least, e.g. normalizing whitespace and some abbreviations IIRC.


Last updated: Dec 20 2023 at 11:08 UTC