Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Lean on top of language models?


Vadim Fomin (Nov 28 2022 at 17:19):

Hey there! I recently joined this community and I want to play around with Lean. My main background is in natural language processing. It's my impression that you can use neural networks for theorem proving by training a large language model and using Lean for verification after generating candidate proofs (e.g. this). Can anyone point me to how you can actually use Lean for that sort of stuff automatically, by talking to Lean from Python? (A wrapper or a server with an API would be nice...)

Andrés Goens (Nov 28 2022 at 17:37):

the openai project that did that wrote such a wrapper and published it: https://github.com/openai/lean-gym
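
If I remember the README right, you start its REPL with `lean --run src/repl.lean` and talk to it over stdin/stdout with JSON commands, so driving it from Python is just a subprocess. A rough sketch (the command and field names are from memory, so double-check against the repo before relying on them):

```python
import json
import subprocess

# Start the lean-gym REPL (run from the lean-gym project root after building it).
proc = subprocess.Popen(
    ["lean", "--run", "src/repl.lean"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

def send(cmd):
    """Send one JSON command and read back one JSON response line."""
    proc.stdin.write(json.dumps(cmd) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

# Open a proof search for a named declaration, then apply a tactic to it.
state = send(["init_search", ["int.prime.dvd_mul", ""]])
print(state["tactic_state"])
step = send(["run_tac", [state["search_id"], state["tactic_state_id"], "intros"]])
print(step["tactic_state"])
```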

Andrés Goens (Nov 28 2022 at 17:37):

It's my impression that you can use neural networks for theorem proving by training a large language model and using Lean for verification after generating candidate proofs

whether that's true I guess is up for debate

Timothee Lacroix (Nov 28 2022 at 17:41):

Why do you say it's up for debate?
I believe the only debate in the post you linked is about the faithfulness of formal statements in minif2f, not about the use of lean to verify proofs :)

Other useful resources by OpenAI for neural theorem proving and Lean:

https://github.com/jesse-michael-han/lean-step-public
https://github.com/jasonrute/lean_proof_recording

Timothee Lacroix (Nov 28 2022 at 17:43):

And the latest miniF2F: https://github.com/facebookresearch/miniF2F

Andrés Goens (Nov 28 2022 at 17:45):

Timothee Lacroix said:

Why do you say it's up for debate?
I believe the only debate in the post you linked is about the faithfulness of formal statements in minif2f, not about the use of lean to verify proofs :)

oh, sorry, I should have been a bit more precise. I didn't mean that using Lean to verify them was up for debate, but rather how well NLP-based language models transfer over to Lean/formal statements

Andrés Goens (Nov 28 2022 at 17:45):

(thanks for the clarification)

Timothee Lacroix (Nov 28 2022 at 17:50):

Ah yeah :) some work remains to be done, that's for sure :D

Jason Rute (Nov 28 2022 at 17:52):

First, it should be pointed out that Lean (and most other ITPs) are designed as computer languages; use cases like this have been added on top in an ad hoc way and don't feel very natural. For Lean, there are three main routes of communication:

  1. Metaprogramming: Lean has an extensive metaprogramming framework which lets you look at the internals of Lean objects, execute tactics, and in general communicate via IO in a variety of ways. The problem is that you have to know Lean's metaprogramming framework (written in Lean) to get anywhere. The lean-gym mentioned above is also written in Lean; if your use case fits into it, you don't even need to touch Lean's metaprogramming code, but the API is limited. The two datasets mentioned above were also obtained via metaprogramming. Those are the same datasets used by both OpenAI and Meta Research.
  2. C++: Lean 3 is written in C++. (An aside: Lean 3 will probably be the main version of Lean for about another year or less, while the library is ported to Lean 4. Lean 4 is mostly written in Lean 4 and has better programming features. My answer is for Lean 3 since that is where all the proof data is currently located.) Meta Research wrote (proprietary) code to directly hook into the C++ backend. However, a small piece of that is available for everyone to use, namely you can hook up any LLM to Lean's infoview to give suggestions (#general > api suggestion vscode integration). This is the easiest way to use an existing LLM. Again, the API is particular to their use case and format.
  3. Language server: Lean communicates with VS Code, Emacs, etc. via a language server. It is a simple JSON server that can be used directly, but there is also a lightweight Python interface (which is not well maintained) at https://github.com/leanprover-community/lean-client-python. I could imagine a Copilot-like API which directly interacts with the editor or language server. (See the sketch just below.)
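
As a rough sketch of option 3 from Python (field names are from memory of the Lean 3 server's JSON protocol, so treat this as a starting point and check the server docs):

```python
import json
import subprocess

# Talk to the Lean 3 server directly over newline-delimited JSON.
proc = subprocess.Popen(
    ["lean", "--server"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

def request(msg):
    proc.stdin.write(json.dumps(msg) + "\n")
    proc.stdin.flush()

# "sync" loads (virtual) file contents into the server for checking.
request({
    "command": "sync",
    "seq_num": 1,
    "file_name": "candidate.lean",
    "content": "theorem foo : 1 + 1 = 2 := rfl",
})

# The server pushes messages asynchronously; errors (if any) show up
# in the "msgs" list of an "all_messages" response.
while True:
    resp = json.loads(proc.stdout.readline())
    if resp.get("response") == "all_messages":
        print(resp["msgs"])
        break
```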

Jason Rute (Nov 28 2022 at 17:57):

As for how to go about training an LLM, it is fairly easy if you have the data (and the GPUs/skills :smile:). See this talk (https://www.youtube.com/watch?v=EXpmbAfBNnw) or any of the following papers:

Jason Rute (Nov 28 2022 at 17:59):

If you aren't ITP specific, there is also a toy environment called INT, which is easier to work with:

Jason Rute (Nov 28 2022 at 18:02):

And Isabelle also has a publicly available interface for machine learning maintained by @Albert Jiang and others, and used in their papers, including Thor. I think it is https://github.com/albertqjiang/Portal-to-ISAbelle

Jason Rute (Nov 28 2022 at 18:08):

Finally, if you want to use pretrained models like GPT-3, Codex, etc., you could just have them generate whole proofs in one go, with no interaction. Then you could interact with Lean via its command-line interface, asking it to check a proof in a file.
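
A minimal sketch of that last approach (assuming `lean` is on your PATH and the candidate proof only needs core Lean; for mathlib you'd run inside a built Lean project rather than a temp directory):

```python
import pathlib
import subprocess
import tempfile

def check_proof(lean_source: str) -> bool:
    """Write a candidate proof to a file and ask Lean to check it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "candidate.lean"
        path.write_text(lean_source)
        result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
        # Lean prints error messages and should exit non-zero when the file
        # fails to check; scanning the output is a belt-and-braces extra check.
        return result.returncode == 0 and "error" not in result.stdout.lower()

print(check_proof("theorem foo : 2 + 2 = 4 := rfl"))   # True
print(check_proof("theorem bar : 2 + 2 = 5 := rfl"))   # False
```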

Vadim Fomin (Nov 28 2022 at 18:24):

Andrés Goens said:

the openai project that did that wrote such a wrapper and published it: https://github.com/openai/lean-gym

That's specifically what I was looking for, thank you!

Vadim Fomin (Nov 28 2022 at 18:26):

Thank you Jason, very useful!! Will mess around and probably come back with more questions :)

Vadim Fomin (Nov 28 2022 at 19:16):

I think the most reasonable thing for me would be to figure out how to start a server, and dive into its API


Last updated: Dec 20 2023 at 11:08 UTC