Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Which Python <--> Lean interfaces exist?


Vasily Ilin (Jan 28 2025 at 22:43):

TLDR: what are the existing Python <--> Lean interfaces and their competitive advantages over each other?

What are the tools to interface with Lean from Python? I know of three: PyPantograph (I have been battling with its installation for the past 24 hours, with some success, but some issues remain), LeanDojo (I heard it's worse than Pantograph, but I don't remember in what way), and LeanClient (very new, so I guess there will be many rough edges). My use-case is that our Lean lab at UW is starting an autoformalization project, in which we explore how different LLMs without any fine-tuning can close goals. We want to just write Python code without any Lean metaprogramming. The main feature we need is to take the current proof state, apply a tactic to it, and report either the compilation error or the new proof state.

Vasily Ilin (Jan 28 2025 at 22:54):

It seems that LeanClient does not allow a tactic string to be passed, only a whole Lean file, so that won't work. But I came across lean-repl-py. Has anyone experimented with it?

Thomas Zhu (Jan 28 2025 at 22:55):

For the use case you describe, other than the tools you already mentioned, repl should be the best for interacting with Lean during a proof.

Vasily Ilin (Jan 28 2025 at 22:57):

@Thomas Zhu, what is the difference between lean-repl and Pantograph and LeanDojo? I am sure there are some differences, but I don't see them at first glance.

Thomas Zhu (Jan 28 2025 at 23:04):

repl is lightweight and written only in Lean. I don't know the lean-repl-py you mentioned, but at first glance it is a wrapper around repl. We have been using repl for interacting between Python and Lean, and the experience is pretty good. LeanDojo (and probably Pantograph) additionally supports extracting training data for fine-tuning, not just interacting with Lean in real time, but I am not familiar enough with them to give a detailed comparison. I think Pantograph has better support for some tactics and sorry-filling.
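[For readers unfamiliar with repl's wire protocol: it reads JSON requests on stdin and writes JSON replies on stdout, with requests and replies separated by blank lines. A minimal Python wrapper might look like the sketch below; it assumes a working repl build, e.g. launched via `lake exe repl` from a checkout of leanprover-community/repl, and the exact launch command is an assumption.]

```python
import json
import subprocess

class LeanRepl:
    """Minimal sketch of a wrapper around the leanprover-community
    `repl` binary. Requests and replies are JSON objects, each
    terminated by a blank line."""

    def __init__(self, repl_cmd=("lake", "exe", "repl")):
        self.proc = subprocess.Popen(
            repl_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True)

    def send(self, request: dict) -> dict:
        # One JSON object per request, followed by a blank line
        self.proc.stdin.write(json.dumps(request) + "\n\n")
        self.proc.stdin.flush()
        # The reply may span several lines; a blank line ends it
        lines = []
        while (line := self.proc.stdout.readline()).strip():
            lines.append(line)
        return json.loads("".join(lines))

    def run_tactic(self, proof_state: int, tactic: str) -> dict:
        """Apply a tactic to an existing proof state."""
        return self.send({"tactic": tactic, "proofState": proof_state})
```

A whole command would then be `repl.send({"cmd": "example : 1 = 1 := by rfl"})`, and a single tactic step against proof state `n` would be `repl.run_tactic(n, "rfl")`.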

Thomas Zhu (Jan 28 2025 at 23:09):

In their paper they mention support for have, conv, and calc tactics, which I am not sure repl fully supports. They also allow a model to submit a proof draft with sorry as holes, and then work on the holes. (cc @Leni Aniva)

Leni Aniva (Jan 28 2025 at 23:10):

Thomas Zhu said:

In their paper they mention support for have, conv, and calc tactics, which I am not sure repl fully supports. They also allow a model to submit a proof draft with sorry as holes, and then work on the holes. (cc Leni Aniva)

yeah pantograph supports have, conv, and calc. It is also written only in Lean

Jason Rute (Jan 29 2025 at 00:51):

The problem with asking for Python <--> Lean interfaces is that there are a myriad of ways that one could "talk to Lean" from within Python. Are you interested in:

  • Just proving a single theorem with a language model (you could use some sort of REPL, or just print the proof to a file and run lean as a subprocess on the file)
  • Interacting with a whole file, similar to what you do in a Lean editor (for that, it would be good to interact with the Lean LSP language server; see @Oliver Dressler's https://github.com/oOo0oOo/leanclient )
  • Guiding a tree search from within Python (for that you need a tool that can jump from one proof state in a proof to another; I think LeanDojo, Pantograph, or maybe one of the REPLs could help)
  • Extracting data (it depends on what sort of data; any of the above tools might help)

Jason Rute (Jan 29 2025 at 00:51):

Honestly, it sounds like you are just in the first case and could get by with no special application. Use Python to edit your file and to run Lean from the command line as a subprocess. Parse the messages, which aren't too hard to parse. Easy-peasy. :smile:
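[That bare-bones approach can be sketched in a few lines. The sketch below assumes a `lean` binary is on the PATH and that it accepts the `--json` flag, which makes Lean 4 print one JSON diagnostic per line; adjust to your setup.]

```python
import json
import subprocess
import tempfile
from pathlib import Path

def parse_lean_output(stdout: str) -> list[dict]:
    """Parse `lean --json` output: one JSON diagnostic object per line.
    Non-JSON lines are ignored."""
    messages = []
    for line in stdout.splitlines():
        line = line.strip()
        if line.startswith("{"):
            messages.append(json.loads(line))
    return messages

def check_proof(source: str) -> list[dict]:
    """Write a candidate theorem + proof to a temp file, run `lean` on
    it, and return the diagnostics (empty list means no complaints)."""
    with tempfile.TemporaryDirectory() as d:
        src = Path(d) / "Candidate.lean"
        src.write_text(source)
        result = subprocess.run(["lean", "--json", str(src)],
                                capture_output=True, text=True)
    return parse_lean_output(result.stdout)
```

An LLM loop then only needs to check whether `check_proof(candidate)` contains a message with `"severity": "error"`.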

Jason Rute (Jan 29 2025 at 00:51):

I wrote more about the many options for interacting with Lean here: #Machine Learning for Theorem Proving > leanclient: Interact with Lean from Python via the LSP But that was also more about the low-level methods. It is true there are a lot of high-level tools now available, some in Lean, some in Python. Don't discount some of the tools written in Lean. Some are reasonably user-facing and don't involve metaprogramming.

Jason Rute (Jan 29 2025 at 00:52):

@Vasily Ilin Finally, I think it is best if you are very specific about what features you need.

  • Do you need to see goal states?
  • Do you need to interact with existing mathlib files or will all your files be new files with a rigid structure and just a single theorem per file?
  • Do you need any special Lean features (like tooltips, goal states for expressions), or just the ability to run Lean on a proof?
  • Will your proofs be only tactic proofs or also term proofs?
  • Will your AI write separate lemmas (which go before the main theorem)?
  • Do you need the ability to interface with other theorem provers? If so how do you envision this happening?

Jason Rute (Jan 29 2025 at 00:52):

Also, you say you are doing auto-formalization, but then you mention proving theorems. Will you be auto-formalizing statements, definitions, or proofs? Or all three?

Jason Rute (Jan 29 2025 at 00:55):

And if what you are doing is a bit complicated, don't hesitate to at least consult if not partner with others who know metaprogramming.

Jason Rute (Jan 29 2025 at 01:08):

Sorry, I missed the part where you said you need to apply a single tactic to the current proof state and see the result. Then you need a REPL-like tool. But the follow-up question is if you plan to do a tree search (best-first search, Monte Carlo tree search, or iterative deepening)? You need a REPL that supports this. Also note, if you already have a proof-state-to-tactic predictor, you could just plug it into Aesop and have a full proof-search tool. That is what Lean Copilot does for example. (And I think that is what ABEL may also do when released if I understand.) It might require a bit more Lean programming, but not too much if your tactic predicting LLM is hosted over a network connection (from the internet or a Python server).
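[To make the tree-search idea concrete, here is a best-first search skeleton. The `run_tactic` and `suggest` callables are hypothetical stand-ins for a REPL call and an LLM tactic predictor respectively; any of the tools discussed above could fill those roles.]

```python
import heapq

def best_first_search(init_state, run_tactic, suggest, max_expansions=100):
    """Best-first proof search sketch.

    run_tactic(state, tactic) -> (new_state, solved) or None on error
      (stands in for a REPL call);
    suggest(state) -> [(score, tactic), ...]
      (stands in for an LLM tactic predictor).
    Both interfaces are hypothetical. Nodes with the highest cumulative
    score are expanded first."""
    # heapq is a min-heap, so scores are stored negated
    queue = [(0.0, 0, init_state, [])]
    counter = 1  # tiebreaker so states themselves are never compared
    seen = set()
    while queue and max_expansions > 0:
        neg_score, _, state, proof = heapq.heappop(queue)
        if state in seen:
            continue
        seen.add(state)
        max_expansions -= 1
        for score, tactic in suggest(state):
            result = run_tactic(state, tactic)
            if result is None:  # tactic failed to apply
                continue
            new_state, solved = result
            if solved:
                return proof + [tactic]
            heapq.heappush(
                queue,
                (neg_score - score, counter, new_state, proof + [tactic]))
            counter += 1
    return None  # search budget exhausted
```

Swapping the priority function (or the queue itself) turns this into iterative deepening or a simple MCTS variant; the REPL interface stays the same.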

Vasily Ilin (Jan 29 2025 at 01:25):

Jason Rute said:

Also, you say you are doing auto-formalization, but then you mention proving theorems. Will you be auto-formalizing statements, definitions, or proofs? Or all three?

I might be mis-using the term. For me "autoformalization" is "automatic formalization/proof of an already stated theorem", not "first translating from natural language to formal language and then formalizing".

Vasily Ilin (Jan 29 2025 at 01:26):

Jason Rute said:

Sorry, I missed the part where you said you need to apply a single tactic to the current proof state and see the result. Then you need a REPL-like tool. But the follow-up question is if you plan to do a tree search (best-first search, Monte Carlo tree search, or iterative deepening)? You need a REPL that supports this. Also note, if you already have a proof-state-to-tactic predictor, you could just plug it into Aesop and have a full proof-search tool. That is what Lean Copilot does for example. (And I think that is what ABEL may also do when released if I understand.) It might require a bit more Lean programming, but not too much if your tactic predicting LLM is hosted over a network connection (from the internet or a Python server).

Woah! I don't know anything about aesop but I love the idea of "just plugging it into aesop". What's a good resource to learn more?

Jason Rute (Jan 29 2025 at 01:27):

Autoformalization usually implies translation from informal to formal, not automated theorem proving. So if you are not translating something, I would say it isn't autoformalization.

Vasily Ilin (Jan 29 2025 at 01:31):

I guess I am doing autoproving then. Does not roll off the tongue the same way as autoformalization

Jason Rute (Jan 29 2025 at 01:36):

Ok, I've probably oversold Aesop. But the example is here: https://github.com/leanprover-community/aesop/blob/caa2968c7b51ea680605e7d500c6739912aa0446/AesopTest/TacGen.lean#L10

What you would need to supply is a server which takes as input a goal state string and outputs a list of tactic suggestions, each with a floating point number between 0 and 1 specifying how good that tactic is (usually you use the model's reported probability of the prediction, i.e. exp(log_prob), unless you have a value function). It is okay to use JSON for your server communication. Then a person who knows a bit of Lean programming could write you a little function which replaces myTacGen in this example and interfaces with your server. Running aesop will call your model at each step of its tree search. But there are lots of little paper cuts involved. Maybe LeanDojo or Pantograph would be better.
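[As a rough illustration of the Python side, here is a toy suggestion server using only the standard library. The JSON shapes (`goal`, `suggestions`, `tactic`, `score`) are made up for this sketch — the actual wire format is whatever your Lean-side tacGen replacement decides to send — and `suggest_tactics` is a placeholder for the real LLM call.]

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def suggest_tactics(goal_state: str) -> list[tuple[str, float]]:
    """Placeholder for the real LLM call: return (tactic, score) pairs
    with scores in [0, 1], e.g. the model's reported probability."""
    return [("simp", 0.6), ("rfl", 0.3)]

class TacGenHandler(BaseHTTPRequestHandler):
    """Accepts {"goal": "..."} via POST and replies with
    {"suggestions": [{"tactic": ..., "score": ...}, ...]}.
    These field names are illustrative, not Aesop's wire format."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        request = json.loads(self.rfile.read(length))
        suggestions = [{"tactic": t, "score": s}
                       for t, s in suggest_tactics(request["goal"])]
        body = json.dumps({"suggestions": suggestions}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("localhost", 8000), TacGenHandler).serve_forever()
```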

Thomas Zhu (Jan 29 2025 at 01:42):

I think a question that remains is whether you want to benchmark the proving ability of different language models, or to develop a tool someone can use to prove theorems. For the former, you can do the proof search logic (if any) in Python. For the latter, you can look into Aesop + tacGen, and/or LeanCopilot/LLMLean/etc.

Vasily Ilin (Jan 29 2025 at 01:46):

Thomas Zhu said:

I think a question that remains is whether you want to benchmark the proving ability of different language models, or to develop a tool someone can use to prove theorems. For the former, you can do the proof search logic (if any) in Python. For the latter, you can look into Aesop + tacGen, and/or LeanCopilot/LLMLean/etc.

That's a good way to put it. I am very far from developing any kind of useful tool; I just want to learn how to do these sorts of things. Since I like LLMs, I thought I'd combine the two. So yeah, essentially my goal is to benchmark the ability of existing LLMs to perform auto-proving (sounds much worse than auto-formalization...) with a naive scaling-test-time-compute strategy.

Thomas Zhu (Jan 29 2025 at 01:54):

For benchmarking, I strongly recommend our work miniCTX as a modern benchmark for LLMs. In fact, we are currently working on evaluating theorem proving abilities of LLMs and will release a type of benchmark/leaderboard soon!

Jason Rute (Jan 29 2025 at 02:07):

naive scaling-test-time-compute strategy

Note that tree search (which I assume you will have if you are doing individual tactic generation) is one of the original scaling-test-time-compute strategies. Almost all of the original neural theorem provers used it. (And the newer ones like DeepSeek-Prover got better when they added tree search.) You see the same sort of log-linear scaling as you extend the search time (or search tree size) as you do in modern test-time-scaling plots. There are even folk observations (not well tested) that running smaller models in a tree search is just as good as running larger models, as long as you let them search for the same amount of overall compute. (The only two methods I know of which likely scale better than tree search are multiple independent tree searches, as in the HTPS paper, and reinforcement learning plus tree search, as in AlphaProof.)

Vasily Ilin (Jan 29 2025 at 02:18):

Jason Rute said:

naive scaling-test-time-compute strategy

Note that tree search (which I assume you will have if you are doing individual tactic generation) is one of the original scaling-test-time-compute strategies. Almost all of the original neural theorem provers used it. (And the newer ones like DeepSeek-Prover got better when they added tree search.) You see the same sort of log-linear scaling as you extend the search time (or search tree size) as you do in modern test-time-scaling plots. There are even folk observations (not well tested) that running smaller models in a tree search is just as good as running larger models, as long as you let them search for the same amount of overall compute. (The only two methods I know of which likely scale better than tree search are multiple independent tree searches, as in the HTPS paper, and reinforcement learning plus tree search, as in AlphaProof.)

Yes, I am not trying to get a new SotA with this project, just playing around with the ideas :). Maybe once we learn how to do this naive strategy we'll branch out into some RL too. But not now.

Oliver Dressler (Jan 29 2025 at 08:00):

Vasily Ilin said:

It seems that LeanClient does not allow a tactic string to be passed, only a whole Lean file, so that won't work. But I came across lean-repl-py. Has anyone experimented with it?

Just to clarify, maybe I misunderstood: the LSP does allow for incremental updates (see example). But you are going to have to keep track of coordinates (line, char), indentation, etc. yourself.

See limitations of the LSP...

GasStationManager (Jan 30 2025 at 19:33):

Before diving into more complex tree search strategies, I would recommend first implementing the simple inference-time-compute strategy of an "LLM-Lean feedback loop", where LLM-written code is passed to Lean and any error messages are passed back. Especially for general-purpose LLMs without any fine-tuning, this can help smooth over the LLM's imperfect grasp of Lean 4 syntax by allowing the LLM to fix it. My implementation is here.
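[A minimal version of that loop, with the LLM call and the Lean check abstracted behind callables. Both interfaces are hypothetical: `ask_llm` is any function returning Lean source for a prompt, and `check_with_lean` assumes a `lean` binary on the PATH.]

```python
import subprocess
import tempfile
from pathlib import Path

def check_with_lean(code: str) -> tuple[bool, str]:
    """Run `lean` on the candidate source; return (accepted, messages).
    Assumes a `lean` binary is on the PATH."""
    with tempfile.TemporaryDirectory() as d:
        src = Path(d) / "Attempt.lean"
        src.write_text(code)
        result = subprocess.run(["lean", str(src)],
                                capture_output=True, text=True)
    return result.returncode == 0, result.stdout

def feedback_loop(ask_llm, prompt: str, check=check_with_lean,
                  max_rounds: int = 5):
    """Generate-check-repair loop: ask the model for Lean code, verify
    it, and on failure append the error messages to the prompt and
    retry. Returns accepted code, or None if the budget runs out."""
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        ok, errors = check(code)
        if ok:
            return code
        prompt += (f"\n\nYour attempt:\n{code}\n"
                   f"Lean reported:\n{errors}\nPlease fix it.")
    return None
```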

Kim Morrison (Jan 30 2025 at 22:16):

There's also my ancient Sagredo implementation of this same loop.

Simon Sorg (Feb 06 2025 at 20:00):

Vasily Ilin said:

It seems that LeanClient does not allow for a tactic string to be passed but only a whole lean file, so that won't work. But I came across lean-repl-py. Has anyone experimented with it?

I am maintaining lean-repl-py as part of my thesis, but it is not really much more than a convenient wrapper around the Lean repl and the Python subprocess module. I just found myself rewriting the same 150 lines of code every time I started something new, so I separated it out. Let me know if you end up using it and if there's anything that would make it more convenient, though!
Personally, I think Pantograph is most likely the best that's out there.

Amitayush Thakur (Feb 10 2025 at 05:55):

If anyone is still looking for a gym environment for theorem proving, our environment itp-interface (https://github.com/trishullab/itp-interface) might come in handy. It is implemented in Python and can support both Lean 4 and Coq.

So what's different?

  • Supports both Lean 4 and Coq. You only have to write your proof-automation code once, and it will work for both ITPs!
  • Supports parallel tactic execution, which can help a lot while automatically searching for proofs with LLMs. It can maintain a pool of proof environments, which helps in running multiple tactics seamlessly from the same state.
  • Manages memory more efficiently. If you use the Lean 4 REPL, then after a couple of backtracks the JSON-RPC process bloats and can take a large amount of memory. We avoid such pitfalls.
  • The Lean 4 REPL's proof mode has some issues regarding accepting incorrect proofs; we avoid that. (See https://github.com/leanprover-community/repl/issues/44)
  • One can collect proof step data across multiple repositories in parallel.
  • Supports multiple versions of Coq and Lean 4.
  • The library is designed to scale and run on a Ray cluster while doing a highly parallel proof search.

Setup is super easy:

  1. pip install itp-interface
  2. install-itp-interface
  3. We are done setting up for Lean 4. No more installation steps (including the need to install Lean 4) and other hassles.
    (See our python package: https://pypi.org/project/itp-interface/ for more details)

We build our work on top of the Lean 4 REPL (https://github.com/leanprover-community/repl) and coq_serapy (https://github.com/HazardousPeach/coq_serapy).

I would like to thank my co-contributor @George Tsoukalas, and the community at Zulip for giving us valuable feedback while building this tool.

This tool can be further supplemented by our proof search tool (https://pypi.org/project/proof-wala/), which can run multilingual proof searches for Coq and Lean. The two can be used together for end-to-end proof generation given a theorem in Lean 4 or Coq. The proof-wala tool also supports visualization of tree searches and annotation of proof trees, which can be further used for some form of RL-style training.

Looking forward to any feedback you might have.

Amitayush Thakur (Feb 11 2025 at 00:16):

Amitayush Thakur said:

If anyone is still looking for a gym environment for theorem proving, our environment itp-interface (https://github.com/trishullab/itp-interface) might come in handy. It is implemented in Python and can support both Lean 4 and Coq.

Check out more details here: #Machine Learning for Theorem Proving > itp-interface and proof-wala

GasStationManager (Feb 14 2025 at 19:05):

A couple of updates to LeanTool, my LLM-Lean interface, that may be of interest to folks here:

  • Now supports reasoning models, including o3-mini, DeepSeek r1, and gemini-flash-thinking. Now you have the reasoning power of o3-mini / r1 that also outputs valid Lean 4. (Some of these models, like r1 and gemini-flash-thinking, do not support function calling; the new LeanTool update allows them to pass code in plain text instead.)

  • OpenAI-compatible API proxy server: queries get routed to the LLM, which talks to Lean directly, fixes any errors, and returns valid Lean 4 code. Works with coding assistants including Cline and Continue, and chat interfaces like OpenWebUI. I have set up a demo of the proxy server at http://www.codeproofarena.com:8800/v1 . This is bring-your-own-api-key: if you have an API key for Anthropic / OpenAI / Google / Deepseek, you can point your app at the above API base address and pass the API key. You can of course run your own server if you install LeanTool locally; I'm taking @Jason Rute's advice from his recent Lean Together talk to make tools more easily available for users to try.

Kim Morrison (Feb 15 2025 at 00:08):

What do we need to have this usable from within VSCode?

Kim Morrison (Feb 15 2025 at 00:09):

GasStationManager said:

the proxy server at http://www.codeproofarena.com:8800/v1

I get a 404.

GasStationManager (Feb 15 2025 at 00:58):

Both Continue and Cline are VSCode extensions. Once installed, for Cline you'll be prompted to set up a model: put the above URL as the API base URL, set the model type to "OpenAI-compatible", and use model name "sonnet" for Sonnet 3.5, "gpt" for GPT-4o, or "deepseek" for DeepSeek v3. And fill in your API key for the model.
For Continue it is similar; there will be a link to the config file you can click on, and then you add a custom model as in the instructions here: https://docs.continue.dev/customize/model-providers/openai

GasStationManager (Feb 15 2025 at 01:08):

Kim Morrison said:

GasStationManager said:

the proxy server at http://www.codeproofarena.com:8800/v1

I get a 404.

That URL should be fine as the API base URL; the actual request goes to something like .../v1/chat/completions

Kevin Buzzard (Feb 15 2025 at 08:18):

I also get a 404

GasStationManager (Feb 15 2025 at 17:20):

Getting a 404 from the browser is okay / normal, as http://www.codeproofarena.com:8800/v1 is the "API base URL" you would put in the config of your app. To send a raw POST request, you'd send it to http://www.codeproofarena.com:8800/v1/chat/completions
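[For anyone who wants to test the endpoint from a script rather than a chat app, a raw request follows the standard OpenAI chat-completions shape. This sketch uses only the standard library; the base URL and model names are the ones mentioned above.]

```python
import json
import urllib.request

def chat(base_url: str, api_key: str, model: str, content: str) -> str:
    """POST a chat-completion request to an OpenAI-compatible server
    and return the assistant's reply text."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": content}],
        }).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    return reply["choices"][0]["message"]["content"]

# e.g. chat("http://www.codeproofarena.com:8800/v1", key, "sonnet",
#           "Prove in Lean 4: example : 2 + 2 = 4")
```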


Last updated: May 02 2025 at 03:31 UTC