Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Tools for semi-automated line-by-line formalization


Terence Tao (May 12 2025 at 15:23):

This is a spinoff of a thread #Equational > Alternative proofs of E1689⊢E2 in the Equational Theories Project. In that thread, I posted a recording I made of formalizing a 20-30 line informal proof of one of the implications in the project into Lean in a semi-automated and mostly mechanical "line-by-line" fashion, in which I (with the assistance of Github Copilot) translated each line of the informal proof to a couple lines of Lean (with sorries), and then relied on finishing tactics like exact? and canonical to try to fill each sorry, turning to manual simplification of these subgoals with tactics (most notably convert) if the automated tools did not work. As one can see in the video, the full formalization took 33 minutes, or roughly one and a half minutes per line (I did not attempt to precisely define what a "line" is in this context, so this is an extremely approximate estimate).
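For readers curious about the shape of this workflow, here is a minimal hypothetical Lean sketch (the operation and equations are placeholders, not the actual E1689/E2 statements): each informal line becomes a `have` whose proof is first left as `sorry`, then discharged with a finishing tactic such as `exact?` or `canonical`, falling back to manual tactics like `convert` or `rw` when those fail.

```lean
-- Hypothetical sketch of the line-by-line workflow.  Each informal step
-- becomes a `have ... := by sorry`, and the sorries are then filled one by
-- one with finishing tactics (`exact?`, `canonical`, ...) or manual
-- simplification.  The statements below are placeholders.
theorem toy_implication (G : Type) (op : G → G → G)
    (hyp : ∀ x y : G, op x y = op y x) :
    ∀ x y : G, op (op x y) (op y x) = op (op x y) (op x y) := by
  intro x y
  -- Informal line 1: "by the hypothesis, op y x = op x y"
  have step1 : op y x = op x y := by exact hyp y x  -- found by `exact?`
  -- Informal line 2: "substitute into the left-hand side"
  rw [step1]
```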

I am curious to hear of other tools that might be suitable for this type of formalization task, where a relatively detailed informal proof already exists, to the point where each line can be formalized mostly mechanically, without significant knowledge of the broader proof strategy or context required, and where library search is not the bottleneck (this proof did not use any Mathlib lemmas other than tactics). I chose Github Copilot and canonical because I was already familiar with the former and curious about the latter, and both tools were easily integrated into my IDE (VSCode on Windows), but perhaps there are other workflows (e.g., using other modes of Copilot or similar tools than simply autocompletion of text at the cursor location) that would be worth testing out here.

GasStationManager (May 12 2025 at 19:09):

If you want to go beyond autocomplete and try letting the LLM attempt to prove some of the goals, my LeanTool is worth trying. It exposes the feedback from Lean to the LLM, allowing it to fix its errors interactively. It works with most LLMs and most coding assistants. Some of the current LLMs are worth trying, e.g. Claude Sonnet, Gemini-2.5, and DeepSeek. However, most LLMs don't know about Canonical (or some of the most recent Lean tools), so they will need to be prompted to use it. Then it depends on how well the LLM is able to follow prompt instructions...

Another workflow to try: ask the LLM to read the document containing the informal proof, then ask it to formalize it into Lean, perhaps initially with sorries for the subgoals. I don't know if it will work well for this domain, but it doesn't hurt to try. This could be done in an interactive coding assistant like Cursor with agent mode, or something less interactive like Claude Code. Both support using LeanTool.

Justin Asher (May 12 2025 at 20:33):

@GasStationManager Have you tried benchmarking any LLMs using LeanTool?

GasStationManager (May 12 2025 at 21:32):

@Justin Asher I have not tried it with the math proof benchmarks. But if anyone is interested in how the current LLMs are doing, and has a bit of budget to spend on API costs, I'm happy to collaborate!

Terence Tao (May 13 2025 at 06:01):

I made a second video in which I showed the previous informal and formal proofs to both Claude and o4, and then asked them to formalize a second informal proof of the same result that @Bruno Le Floch had also contributed: https://www.youtube.com/watch?v=zZr54G7ec7A

Justin Asher (May 13 2025 at 06:43):

I found Gemini 2.5 Pro to be extremely helpful at writing Lean metaprogramming code involving data extraction. I know this is not quite the same task, but the underlying theme was that, when provided with the appropriate context, the model is extremely capable of figuring out how to produce the desired code. All it took was a few rounds of feeding the compiler errors back to it before the code worked.

Context was extremely important here: Gemini would often hallucinate if you did not provide it with the source files. Thankfully, it has an extremely large context window, so you can input almost as much as you would like (not considering costs). This is something Noam Shazeer and Jeff Dean have talked about trying to build: being able to put an entire code base into the context window of an LLM (for instance, all of Mathlib + Lean).

Hence, I want to take a model like Gemini and provide it with appropriate search tools along with compiler / Lean feedback (similar to @GasStationManager's LeanTool). Then, the LLM can query the environment and figure out how best to formalize and prove given statements on its own, using the feedback as guidance.

I think this would be able to do most of the tasks you asked Claude and ChatGPT to do.

Jeremy Avigad (May 16 2025 at 02:12):

@Terence Tao, @Josh Clune and I tried Duper on your example. The experiment is here:
https://github.com/avigad/EquationalTest
https://github.com/avigad/EquationalTest/blob/main/EquationalTest/Basic.lean
We are pleased that not only does it easily prove each of the lemmas, but it even proves the theorem easily without the lemmas. (You can delete the definitions of S and f and all the lemmas from the file.) The proofs are inscrutable, but as far as verifying correctness, it does pretty well.
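To illustrate the pattern being described, here is a hedged sketch of invoking Duper directly on the hypothesis, assuming the `duper` tactic from the Duper package is available via `import Duper`. The equations below are only meant to resemble the E1689 ⊢ E2 shape; the actual statements and working proofs are in the linked repository.

```lean
import Duper

-- Hedged sketch: hand Duper the single magma axiom and let it search for a
-- proof of the goal directly, with no intermediate lemmas.  The statements
-- are illustrative stand-ins for the actual E1689 and E2 from the ETP.
example (G : Type) (op : G → G → G)
    (h : ∀ x y z : G, x = op (op y x) (op (op x z) z)) :
    ∀ x y : G, x = y := by
  duper [h]
```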

Terence Tao (May 16 2025 at 03:43):

Nice! In the ETP we did solve some other implications of this sort using Duper, though it took some effort to then convert them to non-Duper proofs: #Equational > FINITE: The Lean+Duper implications

In general saturation techniques worked pretty well for the positive implications in the ETP. For instance Vampire was able to resolve this implication in 0.003 seconds: #Equational > What are the hardest positive implications for an ATP? @ 💬

Oliver Dressler (May 16 2025 at 11:22):

Justin Asher said:

Hence, I want to take a model like Gemini and provide it with appropriate search tools along with compiler / Lean feedback (similar to LeanTool GasStationManager ). Then, the LLM can query the environment and figure out how to best formalize and prove given statements on its own, using the feedback as guidance.

I have created such a toolbox as an MCP server, which you can integrate with VSCode (agent mode) or Cursor:
lean-lsp-mcp

It allows an LLM to interact with Lean, mainly via the language server protocol. E.g.:

  • Query the proof goal or term goal at any position
  • Access hover info (inline documentation) at any position
  • Code completions: Find available identifiers or import suggestions
  • Query leansearch.net using natural language search

I find this kind of context information allows the model to make better code suggestions.

On a side note, I would be happy to include a tool to query leanexplore.com if you provide an API and are OK with it.

Jeremy Avigad (May 16 2025 at 13:32):

Why were the Lean+Duper proofs called conjectures, and why was there a need to convert them to non-Duper proofs? Duper is not as powerful as Vampire, but its selling point is that it constructs Lean proof terms that are checked by Lean's kernel. In that sense, it is like using simp, omega, or linarith. Like those tools, however, the proof terms that Duper produces are large and not meant for human consumption. A strength of canonical is that it produces terms that are much more efficient.

Terence Tao (May 16 2025 at 13:54):

The objective of the Equational Theories Project was to formalize ~22 million implications/anti-implications (of which "Equation 1689 implies Equation 2" was an example) in a common formal framework. Early on, we chose that framework to be Lean + Mathlib with no further packages. It's true that we could have in principle switched over to Lean+Mathlib+Duper, but this might have been logistically complicated since all participants on the formalization side of the project have to use the same framework (also it may have complicated the mathlib bump process, and possibly also created issues with leanchecker). See #Equational > Future of Using ATPs for more discussion.

As discussed in that thread, there was a similar issue involving proofs discovered using egg, which we had temporarily added to the framework but later decided to remove. One could perhaps have replaced the egg or Duper proofs with their Lean proof terms, but I think there were concerns about scaling this to 22 million implications (in practice we only formalized a generating set of ~600,000 of them) while keeping the compilation time reasonable. (As you mention, the proof terms generated by canonical are significantly shorter, so one could in principle have used canonical in the ETP, though perhaps only installed on forks of the project, with the proofs migrated after being converted into proof terms that would compile without canonical.)

As an aside, egg could also handle the 1689=>2 implication, with just one intermediate statement required. We have many proofs of this fact now! 1689=>2 is far from the most difficult implication/anti-implication in the ETP -- that honor goes to 1729 !=> 255 (or the still unresolved 677=>255 for finite magmas) -- but I think it serves as a good toy problem (at our current state of the art) for making apples-to-apples comparisons of various formalization strategies, as it is just complicated enough to be non-trivial, but not so hard that most of the current automated tools simply fail to make progress on it.

In the ETP we needed a Lean keyword (similar to proof_wanted) for results that had not yet been formalized within our chosen framework, but had been established by other means, e.g., Lean+Duper, Vampire, Z3, or an informal human-written proof. We decided to create a keyword conjecture for this (it won out over the competing proposals claim and proof_wanted'). Because of this, we began informally referring to these results as "conjectures" in our project, even though technically most of them had actually met quite high standards of proof. So for us, "conjecture" was an abbreviation for "conjecture relative to the formal framework used for our project".
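For the curious, a keyword like this can be introduced with a short command macro. The sketch below is hypothetical: it treats conjecture as a named sorry, whereas the actual ETP implementation may differ (e.g. recording the statement without emitting a sorry, closer to how proof_wanted behaves).

```lean
/-- Hypothetical sketch of a `conjecture` command: the statement is
elaborated and usable downstream, but its proof is just `sorry`, so Lean
flags it as unproved within the chosen framework.  The real ETP keyword
may be implemented differently. -/
macro "conjecture " n:ident " : " t:term : command =>
  `(theorem $n : $t := by sorry)

-- Example usage of the sketched command:
conjecture toy_claim : ∀ n : Nat, n + 0 = n
```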

ADDED LATER: If you want to try a harder challenge than 1689=>2 to test out Duper, one natural candidate is 650=>448, where our current proof is a long conversion of a Vampire proof (which took 3.6 seconds for at least one choice of Vampire parameters).

Justin Asher (May 16 2025 at 14:58):

@Oliver Dressler API is on the way. I will check out your project in more detail when I get the chance.

Justin Asher (May 16 2025 at 15:09):

The point is to not only provide the model with the ability to search for statements, but to also exploit dependency information to automatically generate a rich and resourceful context around the proof or document that it is working on.

Jeremy Avigad (May 16 2025 at 21:44):

Terry, thanks for the detailed explanation. That makes perfect sense.

Duper can't come close to competing with Vampire, but I took up your challenge anyhow. First, I imported the proof of 650=>448 and replaced every call to superposition and the other specialized tactics with a call to duper:
https://github.com/avigad/EquationalTest/blob/main/EquationalTest/Equation650_implies_Equation448.lean
Then I minimized it by hand, combining and deleting steps. I could shorten the proof somewhat, but not by much, and only by increasing maxHeartbeats.
https://github.com/avigad/EquationalTest/blob/main/EquationalTest/Equation650_implies_Equation448_minimized.lean

Anyhow, Duper is not designed to compete with Vampire, but, rather, to serve the purpose announced in the topic of this thread, i.e. automatically fill in small steps in a proof. We are currently working on a sledgehammer for Lean that will use neural and symbolic methods to select background facts from the library and Lean-auto to instantiate type classes and export the problem to external provers like Zipperposition and cvc5. Duper (among other methods) will be used to reconstruct proofs in Lean. Chase and I are also hopeful that with premise selection and a method to instantiate type classes, Canonical will provide a complementary approach. And we are all excited about getting even further by combining neural and symbolic methods.

Bruno Le Floch (May 17 2025 at 05:56):

A small note: the file name starts with Equation550 which should be 650. I think everywhere else the number is correct.

Jeremy Avigad (May 17 2025 at 17:27):

Thanks! I fixed it.


Last updated: Dec 20 2025 at 21:32 UTC