Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Proper LLM Lean4 Integration with recursive checks for error
Gridiron Player (Oct 02 2024 at 20:16):
Would love to see somebody in the community working on this.
OpenAI o1 + Lean 4 terminal integration as a source of truth could be powerful.
I myself don't have time to work on this, but I can consult; I have experience creating multi-agent systems.
I propose using Autogen with https://github.com/microsoft/trace
You'd first want to create a prototype with just Autogen, without Trace.
- Have one agent that writes the Lean 4 theorem statement and tries to prove it in Lean 4.
- Agent 2 runs the proof through the Lean terminal. If an error comes up, it passes it back to agent 1 to correct, and so on.
It's very simple to create that prototype; it shouldn't be more than 40-50 hours of work.
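A minimal sketch of that loop in plain Python (before bringing in Autogen at all) might look like the following. `ask_llm` is a hypothetical stub for whatever chat API you use, and the checker assumes a Lake project with `lake` on the PATH; none of these names are from the thread.

```python
import pathlib
import subprocess

def ask_llm(prompt: str) -> str:
    """Hypothetical stub: send `prompt` to your chat model, return Lean 4 source."""
    raise NotImplementedError

def check_lean(source: str, path: str = "Scratch.lean") -> str:
    """Compile `source` inside a Lake project; return compiler output (empty if OK)."""
    pathlib.Path(path).write_text(source)
    result = subprocess.run(["lake", "env", "lean", path],
                            capture_output=True, text=True)
    return result.stdout + result.stderr

def prove(statement: str, max_rounds: int = 5) -> str | None:
    prompt = f"Write a complete Lean 4 proof of:\n{statement}"
    for _ in range(max_rounds):
        attempt = ask_llm(prompt)      # agent 1: propose a proof
        errors = check_lean(attempt)   # agent 2: Lean as the source of truth
        if not errors:
            return attempt             # it compiles: done
        prompt = (f"This Lean 4 proof fails to compile:\n{attempt}\n"
                  f"Compiler errors:\n{errors}\n"
                  f"Fix the proof and return only Lean 4 source.")
    return None                        # give up after max_rounds
```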
Jason Rute (Oct 02 2024 at 21:08):
Many people have run GPT-4 on standard Lean and Coq benchmarks, using various agentic (and less agentic) approaches. You can probably start with some of those benchmarks, but note this could get really expensive. That, I think, is one of the main things holding people back.
Jason Rute (Oct 03 2024 at 02:44):
To flesh out what I said above, I think we are in interesting times where basically two types of people can do similar levels of "original research". (1) A ML researcher could do a carefully controlled experiment on a benchmark using some particular agentic/prompting strategy with a particular frontier model and try to publish the results (either as the main result or as a baseline comparing some other approach). (2) A practitioner could just make a tool using their expert ability to prompt the model. Honestly, I see no reason a person in category (2) couldn't make just as good a tool as one in category (1). As @Gridiron Player said, making a first attempt at these tools would be relatively fast. Of course, those in category (2) might not have the funding to properly benchmark their creation.
Jason Rute (Oct 03 2024 at 02:45):
Here are some existing projects. (I think only Sagredo was category (2) in that it was never published and there were never proper benchmarks for it, probably because it was too costly. Maybe CoqPilot and LLMLean also fit in this category, although CoqPilot has a paper.)
- Baldur
- LLM: Minerva (fine-tuned on Isabelle data)
- ITP: Isabelle
- Similar to @Gridiron Player's suggested approach, where there are rounds of proof correction trying to fix errors.
- Draft-Sketch-Prove and Subgoal Learning
- LLM: Codex (and ChatGPT for Subgoal Learning)
- ITP: Isabelle
- Produce a natural-language proof, and then use that to build a formal proof sketch. The Sledgehammer ATP tool is used to fill in missing details. (No proof correction, but it uses 100 attempts.)
- Sagredo
- LLM: GPT-4
- ITP: Lean
- A real-world project that never took off since it was probably too expensive. It did the traditional approach of applying an LLM at every proof step which is pricey for GPT-4.
- A number of paper baselines: use frontier models to synthesize whole proofs
- LLMs: Code Llama, GPT-4, GPT-4o, DeepSeek-Math
- ITPs: Isabelle, Lean, Coq
- The prompts may differ, but the idea is like code generation: just generate a proof from the theorem statement, often with multiple passes (anywhere from 1 to 128).
- Lyra
- LLM: GPT-4
- ITP: Isabelle
- I think it is an extension of Draft-Sketch-Prove, but with two correction methods to fix failing proofs. The first, Tool Correction, is automatic and doesn't require using an LLM. The second, Conjecture Correction, is similar to @Gridiron Player's suggestion and prompts GPT-4 with the error message to fix the error.
- LEGO-Prover
- LLM: GPT-4
- Builds on Draft-Sketch-Prove by building a library of "skills" (small, reusable lemmas). Retrieves skills to help with proof synthesis.
- Copra
- LLMs: GPT-4 and GPT-4o
- ITPs: Lean and Coq
- A fully agentic solution, using lots of available information, but quite pricey since it prompts the LLM at every proof state (if I recall correctly) in a tree search.
- CoqPilot
- LLMs: GPT-3.5, GPT-4o, LLaMa-2, Claude
- ITP: Coq
- First tries whole-proof synthesis (I think with premise selection to suggest lemmas), and if that fails it does proof correction through a "multi-round communication process with an LLM" (the details are unclear to me).
- DeepSeek-Prover v1.5
- LLM: DeepSeek-Prover (fine-tuned on synthetic Lean data)
- ITP: Lean
- Whole-proof generation, but uses Monte-Carlo tree search to fix errors, branching only on errors in the proofs (instead of branching on all proof states like some search-based approaches). Not that different from what @Gridiron Player is suggesting, but a bit more complicated. (DeepSeek-Prover v1 just did whole-proof generation with thousands of attempts, with the main innovation being in how the model was trained with RL on synthetic Lean data.)
- LLMLean
- LLMs: GPT models, Mistral models, or any model callable through the together.ai API (also support for locally run models)
- ITP: Lean
- Provides an `llmqed` tactic for whole-proof generation. (I don't think it is iterative, but I could be mistaken; I think it just tries multiple attempts.)
- miniCTX
- LLMs: fine-tuned DeepSeek-Coder-1.3b models (and a GPT-4 baseline)
- ITP: Lean
- Experiments using whole files as prompts, both with a GPT-4 baseline and with a fine-tuned DeepSeek-Coder-1.3b (using their novel file-tuning method).
- PALM
- LLMs: LLaMa, GPT-3.5, GPT-4o
- ITP: Coq
- They do whole-proof generation using retrieved lemmas (from premise selection). Then they do multiple rounds of error correction (similar to what @Gridiron Player was suggesting). (Edit: PALM’s error correction is fully symbolic so it doesn’t call the LLM again.)
Jason Rute (Oct 03 2024 at 11:09):
I just fixed and expanded my list above. I tried to stick to solutions using instruction-tuned models (or other models which require general-purpose prompting). After making the summary, the main conclusion I've come to is that whole-proof generation alone is too simple, and tree search branching on tactics is too expensive (for instruction-tuned models, unless you can really take advantage of prompt caching). The happy middle ground used in Baldur, Lyra, CoqPilot, DeepSeek-Prover v1.5, and PALM is to correct errors as they appear.
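To make the contrast concrete, here is a toy sketch of that middle ground, branching on errors rather than on tactic states. It is loosely in the spirit of DeepSeek-Prover v1.5's error-directed search but heavily simplified and not taken from any of the papers above; `sample_proofs` is a hypothetical stub, and `check_lean` is the checker from the earlier sketch.

```python
from collections import deque

def sample_proofs(prompt: str, n: int) -> list[str]:
    """Hypothetical stub: draw `n` independent whole-proof samples for `prompt`."""
    raise NotImplementedError

def search(statement: str, width: int = 4, budget: int = 32) -> str | None:
    queue = deque([f"Prove in Lean 4:\n{statement}"])
    while queue and budget > 0:
        prompt = queue.popleft()
        for attempt in sample_proofs(prompt, width):
            budget -= 1
            errors = check_lean(attempt)   # reuses check_lean from the sketch above
            if not errors:
                return attempt
            # Branch only on failures: each failing attempt (plus its error
            # messages) seeds a new repair prompt, instead of branching on
            # every proof state the way tactic-level tree search does.
            queue.append(f"Failing Lean 4 proof:\n{attempt}\n"
                         f"Errors:\n{errors}\nRepair it.")
    return None
```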
Jason Rute (Oct 03 2024 at 11:56):
Also, reading the tweet again, I realize the OP was probably looking for something in the other direction, where the LLM calls Lean as a tool. A good paper in that direction is Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization. I'd love to see this approach made practical.
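In that direction, giving the model Lean as a callable tool might look roughly like this. This is a sketch using the OpenAI function-calling interface (v1 Python SDK); `check_lean` is the compile helper from the first sketch, and the model name and prompt are placeholders, not anything from the paper above.

```python
import json
from openai import OpenAI  # assumes the v1 OpenAI Python SDK

# Advertise the Lean checker from the first sketch as a callable tool.
tools = [{
    "type": "function",
    "function": {
        "name": "check_lean",
        "description": "Compile Lean 4 source; returns compiler errors (empty if OK).",
        "parameters": {
            "type": "object",
            "properties": {"source": {"type": "string"}},
            "required": ["source"],
        },
    },
}]

client = OpenAI()
messages = [{"role": "user",
             "content": "Prove this in Lean 4, checking your work with the tool: ..."}]

for _ in range(8):  # cap the number of tool-use rounds
    resp = client.chat.completions.create(model="gpt-4o",
                                          messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:
        break  # the model answered in prose instead of calling Lean
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": check_lean(args["source"])})
```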
Gridiron Player (Oct 08 2024 at 14:50):
Thanks for your comment. Yes, I was looking for a system that calls Lean 4 as a tool, correct. I believe that the only way to make LLMs not hallucinate is to give them a source of truth, like Lean. The same problem arises with humans, by the way: humans need peer review too. You can't blindly trust someone's math paper.
Formalizing and testing someone's math paper with Lean 4 would be even better than peer review, right? (Correct me if I'm wrong; I'm not a math researcher.)
I'm trying to take the same approach with LLMs.
I am currently working on an agentic system that automatically creates unit tests for a FinTech application at my company. The verification step runs the unit tests, checks for errors or exceptions, and then self-repairs the tests based on those errors.
My proposed approach with Lean would be similar: if the LLM runs into Lean errors, there must be some sort of mistake, and it can recursively try to fix it. If there are no Lean errors, the proof should be mathematically correct, right?
As I gain experience with the unit-testing system, I might be able to give the Lean thing a go too. But obviously it would also be cool to see other people working on this, because I only have so much time in the day.
With my experience with some of the multi-agent LLM libraries, I'd be able to consult.
Kevin Buzzard (Oct 08 2024 at 18:33):
Yes, formalizing someone's math paper with Lean 4 would be better than peer review. Unfortunately this is out of reach at the minute, because for example formalising definitions is hard and most modern maths papers will use definitions which aren't currently in mathlib. Another issue is that even though this is a translation problem (which LLMs are good at), there's not enough good-quality data to train on.
Gridiron Player (Oct 08 2024 at 18:56):
That's a non-issue. You don't have to fine-tune an LLM for this; in fact, fine-tuning is overrated. You can just hook it up to the Lean 4 documentation or some vector database that maps the natural-language description of a theorem to its respective Lean 4 declaration in mathlib.
But I don't think you even need a vector database. If the mathlib codebase is sufficiently documented with natural-language comments, that would take care of this.
That said, fitting the entire codebase into one prompt might be a challenge lol. So a vector database might still be necessary.
Also, in my experience, most bigger LLMs are already pretty proficient at Lean 4 if you combine them with a compilation step.
Many options here without having to fine-tune.
And then the error messages from the Lean 4 compiler would take care of the rest.
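The retrieval part could be a very small sketch along these lines, assuming a hypothetical `embed` stub standing in for any embedding API and a pre-harvested list of mathlib (name, docstring) pairs (e.g. extracted from doc-gen4 output; the harvesting step is assumed and not shown).

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stub: return a unit-normalized embedding vector for `text`."""
    raise NotImplementedError

# (name, docstring) pairs harvested from mathlib; assumed to be populated.
lemma_docs: list[tuple[str, str]] = []

def build_index() -> np.ndarray:
    """Embed every docstring once, up front."""
    return np.stack([embed(doc) for _, doc in lemma_docs])

def retrieve(index: np.ndarray, query: str, k: int = 8) -> list[str]:
    """Return the k lemma names whose docstrings best match `query`."""
    scores = index @ embed(query)        # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    return [lemma_docs[i][0] for i in top]
```

The retrieved names would then be pasted into the proving prompt, sidestepping fine-tuning entirely.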
Kevin Buzzard (Oct 08 2024 at 18:58):
Well, if you want to knock off Lean proofs of the lemmas which don't have ticks by them in the FLT project, that would be great! Here we have human-readable LaTeX but no Lean yet (although humans are working on it).
Gridiron Player (Oct 08 2024 at 18:59):
@Kevin Buzzard I might work on it at some point, like I said. Just thinking out loud here and presenting options.
Gridiron Player (Oct 08 2024 at 20:39):
Jason Rute said:
Here are some existing projects. […]
The Sagredo video is actually nice because it points out some of the challenges. I don't think they are anything that can't be overcome, but it's helpful nonetheless. @Jason Rute
It is, however, a bit concerning that it struggled so much with a simple proof like the quadratic-equation one. A lot of that, though, seemed to be confusion with Lean 3. Also, OpenAI o1 is a lot better at math than GPT-4, so we'll see!
Notification Bot (Oct 11 2024 at 23:05):
2 messages were moved from this topic to #Machine Learning for Theorem Proving > LeanAgent by Jason Rute.
Alok Singh (Oct 30 2024 at 22:53):
@Jason Rute I wonder about Sagredo now, if one used Gemini Flash or one of the cheaper models, as prices have come down.
Jason Rute (Oct 31 2024 at 02:17):
@Alok Singh Maybe, but Sagredo has never really been tested. Also, I think a better thing than a cheap per-token model would be a model with prompt caching, since that would make it cheaper overall. Thoughts @Kim Morrison?
Alok Singh (Oct 31 2024 at 07:05):
Gemini has caching under the name "context caching", but it seems to be the same thing.
Last updated: May 02 2025 at 03:31 UTC