Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Lean 4 gym


Kevin Buzzard (Feb 07 2024 at 14:20):

There was Lean Gym a couple of years ago, but it was Lean 3. Are there any plans for a Lean 4 Gym? Someone at Imperial is interested in this question and asked me, so I thought I'd ask here to see if there are people thinking about gyms for Lean.

Mario Carneiro (Feb 07 2024 at 14:20):

I think this is https://github.com/leanprover-community/repl/
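
A minimal sketch of how one might drive the repl's JSON-lines protocol from Python. The command/response shapes follow the repl's README (`{"cmd": ...}` in, `{"env": ...}` plus any `sorries` with `proofState` ids out); the actual subprocess invocation is only indicated in comments, since it needs a Lean toolchain:

```python
import json

def make_command(code, env=None):
    """Build one JSON command line for the repl, e.g. {"cmd": "...", "env": 0}."""
    msg = {"cmd": code}
    if env is not None:
        msg["env"] = env  # continue from a previously returned environment
    return json.dumps(msg)

def parse_response(line):
    """Parse one JSON response line from the repl."""
    return json.loads(line)

# Driving the repl itself requires a Lean toolchain; roughly:
#   proc = subprocess.Popen(["lake", "exe", "repl"], stdin=..., stdout=...)
#   proc.stdin.write(make_command("theorem foo : 1 + 1 = 2 := by sorry") + "\n\n")
# The response carries an "env" id and, for each `sorry`, a proof-state id
# you can continue from with {"tactic": ..., "proofState": ...} messages.
```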

Kaiyu Yang (Feb 07 2024 at 14:23):

You can also use LeanDojo to interact with Lean from Python: https://github.com/lean-dojo/LeanDojo
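
To give a sense of the gym-style interaction pattern, here is a hedged sketch of a depth-first tactic search written against an abstract `run_tac(state, tactic)` callable. With LeanDojo, `Dojo.run_tac` and its result types would play that role (per the LeanDojo docs); the loop below is generic, and the toy prover at the bottom is invented purely for illustration:

```python
def dfs_prove(init_state, candidates, run_tac, max_depth=8):
    """Depth-first search for a tactic proof.

    `run_tac(state, tac)` returns (new_state, done), with new_state None on
    failure; `candidates(state)` yields tactics to try. Returns the list of
    tactics on success, or None."""
    def go(state, depth):
        if depth >= max_depth:
            return None
        for tac in candidates(state):
            new_state, done = run_tac(state, tac)
            if new_state is None:      # tactic failed
                continue
            if done:                   # proof finished
                return [tac]
            rest = go(new_state, depth + 1)
            if rest is not None:
                return [tac] + rest
        return None
    return go(init_state, 0)

# Toy stand-in: "prove" an integer goal by counting it down to zero.
def toy_run_tac(state, tac):
    if tac == "dec" and state > 0:
        return state - 1, state - 1 == 0
    return None, False

proof = dfs_prove(3, lambda s: ["dec"], toy_run_tac)  # ["dec", "dec", "dec"]
```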

Jason Rute (Feb 07 2024 at 19:59):

I think the short answer is that LeanDojo or the repl is the way to go if you want to control Lean from the outside. If you want to control Lean from the inside (having Lean interact with an external tool, like Lean Copilot does), then Aesop has some good functionality for that, as discussed in #Machine Learning for Theorem Proving > How to run a simple ATP baseline for Lean

Jason Rute (Feb 07 2024 at 20:03):

But I think the best would be for your friend to come on here and talk about what they would like to see in a gym. I feel the field has changed since the Lean 3 gym. There are many more avenues for interaction with Lean now, including seeing the full proof history so far, the other code in the file, previous theorems in the environment, and so on. Also, pretty-printed text is not the only (or even the best) mode of communication about the current state of a proof. The Lean 3 gym had a very simple interface (which, as far as I know, was never used by anyone but OpenAI, who built it).

Jason Rute (Feb 07 2024 at 20:06):

I'm also curious what interface Copra uses. Is it custom? How does it work? @Amitayush Thakur

Jason Rute (Feb 07 2024 at 20:08):

And DT Solver for that matter?

Jason Rute (Feb 07 2024 at 20:10):

I assume Llemma used Lean Dojo's interface for the Formal-to-Formal experiments, but I don't know for sure.

Amitayush Thakur (Feb 07 2024 at 20:21):

Copra uses a custom interface, but the same interface can be used with other languages too. The Lean 3 interface in Copra is implemented using the Lean command line. Unlike LeanDojo, it is lightweight to set up and doesn't need a Docker installation, although it doesn't support the level of annotations you can generate through LeanDojo. You can check out https://github.com/trishullab/copra. We are planning to create a simple Python package providing a cross-language ITP interface that can be used by anyone. We recently added support for Isabelle in addition to Lean 3 and Coq. I would recommend Copra if you want to try multiple ITPs and not just Lean 3.

Amitayush Thakur (Feb 07 2024 at 20:22):

Also, feel free to open a feature request for anything you would want in Copra.

Jason Rute (Feb 07 2024 at 20:36):

So you run Lean on every proof step? Or do you use the Lean language server?

Jason Rute (Feb 07 2024 at 20:42):

A long time ago (pre-pandemic) I made some tools to interact with Lean 3 via the language server, but it was quite slow since the language server had a hard-coded delay of 0.1 seconds. I don't know if that is still the case. (Of course, if you are calling GPT-4, then the LSP is no longer the bottleneck.) Nonetheless, I'm more convinced the language server needs to be part of a working solution: it is the only way for tactics to see and interact with the text in the file.

Amitayush Thakur (Feb 07 2024 at 20:45):

Yes. I tried using the LSP and ran into some nasty concurrency issues when parsing responses from it. It also delimits everything by line and column number, which makes it hard to parse subsequent states or detect state changes. It is also not very well documented; the only documentation I found was for JavaScript (the VS Code plugin for Lean). Also, if I have understood the LeanDojo code correctly, it does the same and doesn't use the LSP.

Jason Rute (Feb 07 2024 at 20:48):

@Amitayush Thakur But you are using the language server, right? So you fixed your issues?

Jason Rute (Feb 07 2024 at 20:49):

I see that part of the code is here: https://github.com/trishullab/copra/blob/ba06409aaf3d261e989a01a5589712a55922df62/src/lean_server/lean_cmd_server.py#L30

Jason Rute (Feb 07 2024 at 20:51):

For how to handle the concurrency issues, see https://github.com/leanprover-community/lean-client-python.

Jason Rute (Feb 07 2024 at 20:53):

Wait, maybe you are just running Lean, parsing the output, and then running Lean again? How is that possibly fast enough for a tree search?

Amitayush Thakur (Feb 07 2024 at 20:56):

It is fast, and probably faster than using the server. This is very similar to how LeanDojo does it (only with a custom Docker image). The server code works in some cases, but the back-and-forth waiting also wastes time and leads to race conditions in some cases. So to avoid all that, we just used the Lean command line.
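
The "run the binary, parse, run again" pattern can be sketched (and its per-call overhead measured) with a stand-in command. Here the Python interpreter plays the role of the `lean` command-line binary, since the overhead under discussion is process start-up plus output parsing, not anything Lean-specific:

```python
import subprocess
import sys
import time

def run_step(source):
    """Run one 'checker' invocation and return its parsed output.
    sys.executable stands in for the lean command-line binary here."""
    out = subprocess.run(
        [sys.executable, "-c", f"print(len({source!r}))"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

start = time.perf_counter()
result = run_step("example : 1 + 1 = 2 := rfl")
elapsed = time.perf_counter() - start
# One full process start-up per proof step; with real Lean the cost also
# includes importing the environment, unless that is cached somehow.
```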

Jason Rute (Feb 07 2024 at 20:58):

Do you know how fast? Do you have numbers on the latency? Either @Amitayush Thakur or @Kaiyu Yang?

Amitayush Thakur (Feb 07 2024 at 21:01):

You can get some of the numbers from the Copra paper. The per-query time includes the time to run Lean; I think it was under 1 s.

Jason Rute (Feb 07 2024 at 21:03):

Ok, so I guess again this is because you are calling GPT-4, so that is what is taking most of the time. But for other approaches, like my recent Graph2Tac paper, 1 s per call would be horrible: it means you can only make 600 model calls in 10 minutes. You would need a really good model for that, and it no longer seems like a good idea to call the model on every step.

Jason Rute (Feb 07 2024 at 21:09):

If you look at the times in the Graph2Tac paper, our graph model is running about 30 tactics per second. The k-NN is running about 300 tactics per second.
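
The budget arithmetic behind this comparison is simple enough to write down; the per-call rates below are the ones quoted in the surrounding messages, and the 10-minute budget is just an illustrative search budget:

```python
# Model calls available in a fixed wall-clock search budget, at the
# per-call rates mentioned in this thread (calls per second).
BUDGET_S = 10 * 60  # a 10-minute search budget

rates = {
    "API model at ~1 s/call": 1.0,
    "Graph2Tac graph model": 30.0,
    "Graph2Tac k-NN": 300.0,
}

calls = {name: int(rate * BUDGET_S) for name, rate in rates.items()}
# At 1 call/s you get 600 calls in 10 minutes, while the k-NN can
# attempt 180,000 tactics in the same budget.
```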

Amitayush Thakur (Feb 07 2024 at 21:09):

The ReProver model is faster; per our experiments it took 0.52 s per query (including Lean execution), and I still think a big chunk of that 0.52 s went to querying the LLM. So just running Lean from the command line is not that bad either.

Amitayush Thakur (Feb 07 2024 at 21:10):

The infra for Coq is very good: it is much faster and preserves history and everything. For our Coq implementation we use the same one as Proverbot.

Jason Rute (Feb 07 2024 at 21:11):

Yes, our transformer baselines do about 5 tactics per second with a GPU, and 0.5 with a CPU. So there the transformer takes so long that the interface speed doesn't matter.

Notification Bot (Feb 08 2024 at 11:02):

Cris SALVI has marked this topic as resolved.

Notification Bot (Feb 08 2024 at 11:02):

Cris SALVI has marked this topic as unresolved.

Cris Salvi (Feb 08 2024 at 11:15):

Is there existing work trying to fine-tune pre-trained LLMs by reinforcement learning (RL) with feedback from Lean 4 instead of RL with human feedback?

Zhangir Azerbayev (Feb 08 2024 at 16:36):

Note that there are WIP python bindings for the lean repl.

An alternative to a gym that works better in many contexts is to write a Lean tactic that queries your ML model via HTTP or the Lean FFI. Then you can benchmark your tactic using something like the 'tactic_benchmark' utility in semorrison/lean-training-data.
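
The HTTP variant of that design splits into a Lean tactic that POSTs the current goal and a model server that returns candidate tactics. A minimal sketch of the server side, with a placeholder `suggest_tactics` standing in for a real model (all names, the port, and the request shape here are invented for illustration):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def suggest_tactics(goal):
    """Placeholder for a real model: return candidate tactics for a goal."""
    if "=" in goal:
        return ["rfl", "simp", "ring"]
    return ["simp", "aesop"]

class SuggestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        goal = json.loads(self.rfile.read(length))["goal"]
        body = json.dumps({"tactics": suggest_tactics(goal)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("localhost", 8765), SuggestHandler).serve_forever()
# The Lean tactic would POST {"goal": "<pretty-printed goal>"} and try each
# returned tactic in order, falling back to the next one on failure.
```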

Jason Rute (Feb 08 2024 at 16:43):

@Cris Salvi If you replace Lean 4 with Lean 3, then yes, there are at least two works: a curriculum version of Lean GPT-f using expert iteration, and HTPS using MCTS. But both LLMs are still tactic-prediction models, not general-purpose LLMs. For Isabelle, I think Baldur learns from Isabelle feedback, but I'm not certain about that.

Jason Rute (Feb 08 2024 at 17:03):

And even older is the HOList family of projects in HOL Light, some of which use RL (and get really impressive results). (But those are graph neural network models, not Transformers.)

namibj (Feb 09 2024 at 02:01):

Cris Salvi said:

Is there existing work trying to fine-tune pre-trained LLMs by reinforcement learning (RL) with feedback from Lean 4 instead of RL with human feedback?

I'd like to collaborate if you're planning to try something there. I'd plan to build on LeanCopilot's RAG with a policy-gradient approach, possibly using some of the spare tokens to train a critic into the tactic-generator LLM, as FB's RL theorem-proving work did (https://arxiv.org/abs/2205.11491), to guide aesop better than just using the cumulative LLM decoder logits along aesop's search tree. That would be a rather simple RL aspect.

Jason Rute (Feb 09 2024 at 02:19):

The paper link to FB’s work is two messages above (HTPS).

Gerard Calvo Bartra (Feb 24 2024 at 16:57):

I might be late to the party, but I am also in if you plan to set up a gymnasium environment for Lean 4. I believe RL can help a lot in ML for Theorem Proving. Please let me know if this goes forward!

Siddhartha Gadgil (Feb 25 2024 at 04:23):

I believe it was already pointed out above that the Lean 4 REPL is a Lean gym and more.

Amitayush Thakur (Feb 10 2025 at 05:52):

If anyone is still looking for a gym environment for theorem proving, our environment itp-interface (https://github.com/trishullab/itp-interface) might come in handy. It is implemented in Python and can support both Lean 4 and Coq.

So what's different?

  • Supports both Lean 4 and Coq. Yes, you only have to write your proof-automation code once and it will work for both ITPs!
  • Supports parallel tactic execution, which helps a lot when automatically searching for proofs with LLMs. It can maintain a pool of proof environments, which makes it easy to run multiple tactics on the same state.
  • Manages memory more efficiently. If you use the Lean 4 REPL, then after a couple of backtracks the REPL process bloats and can take a large amount of memory. We avoid such pitfalls.
  • The Lean 4 REPL's proof mode has some issues with accepting incorrect proofs; we avoid that. (See https://github.com/leanprover-community/repl/issues/44)
  • One can collect proof-step data across multiple repositories in parallel.
  • Supports multiple versions of Coq and Lean 4.
  • The library is designed to scale and run on a Ray cluster for highly parallel proof search.
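
The proof-environment-pool idea above can be sketched generically in Python. This is not itp-interface's actual API, just the pattern it describes: several workers, each applying a different tactic to (a copy of) the same proof state concurrently. The toy `toy_apply` at the bottom is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def try_tactics_in_parallel(state, tactics, apply_tac, workers=4):
    """Evaluate several tactics against the same proof state concurrently.
    `apply_tac(state, tac)` returns the resulting state, or None on failure."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda t: (t, apply_tac(state, t)), tactics))
    # Keep only the tactics that succeeded, in their original order.
    return [(t, s) for t, s in results if s is not None]

# Toy stand-in: a "state" is a set of open goals; a tactic closes one goal.
def toy_apply(state, tac):
    return state - {tac} if tac in state else None

hits = try_tactics_in_parallel({"g1", "g2"}, ["g1", "g3", "g2"], toy_apply)
```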

Setup is super easy:

  1. pip install itp-interface
  2. install-itp-interface
  3. We are done setting up for Lean 4. No more installation steps (not even installing Lean 4) or other hassles.
    (See our python package: https://pypi.org/project/itp-interface/ for more details)

We build our work on top of the Lean 4 REPL (https://github.com/leanprover-community/repl) and Coq Serapy (https://github.com/HazardousPeach/coq_serapy).

I would like to thank my co-contributor, @George Tsoukalas , and the community at Zulip for giving us valuable feedback while building this tool.

This tool can be further supplemented by our proof search tool (https://pypi.org/project/proof-wala/), which can run multilingual proof searches for Coq and Lean. The two can be used together for end-to-end proof generation given a theorem in Lean 4 or Coq. The proof-wala tool also supports visualization of search trees and annotation of proof trees, which can further be used for some form of RL-style training.

Looking forward to any feedback you might have.

Eric Wieser (Feb 10 2025 at 18:21):

@Amitayush Thakur , I recommend creating a new thread, and editing your two messages to just link to that thread. Otherwise you'll get replies in two places.

Amitayush Thakur (Feb 11 2025 at 00:04):

@Eric Wieser Thank you. We have now announced on a topic here: #Machine Learning for Theorem Proving > itp-interface and proof-wala


Last updated: May 02 2025 at 03:31 UTC