Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: ATP-related tactics VS AI automation


V S (Mar 18 2025 at 12:44):

While researching various ways of implementing SMT provers in ITPs and the benefits that would provide, I came across a comment about various evaluations of hammers vs AI tools: #Lean Together 2024 > Lean-auto - Yicheng Qian @ 💬. A year ago it was speculated that AI tools were becoming no worse than hammers. What about now? Has AI evolved enough to leave no benefit for other types of automation (which seems possible, since hallucinations, which appear to be AI's only weakness, just lead to mistakes in Lean)? What are the best models that provide evidence of beating their ATP competition?

Jason Rute (Mar 21 2025 at 11:14):

This is hard to evaluate since the AI-for-ATP community and the hammer community don't talk, don't go to the same conferences, and don't address each other's work very well. I think the AI folks would say that, in papers like PISA (Isabelle), Thor (Isabelle), Magnushammer (Isabelle), and Graph2Tac (Coq/Rocq, mine), AI (or AI plus tree search) tools now beat hammers in raw numbers of theorems proved, while some tools like Thor (Isabelle), PALM (Coq/Rocq), and Cobblestone (Coq/Rocq) combine AI and hammers for best effect.

The hammer folks might counter that hammers are designed to be tuned to the situation at hand, unlike AI tools. Hammers are also more practical, since there hasn't been much effort put into making good, practical AI tools for ITPs. Even in Lean, I think the effort to make a Lean hammer has more buy-in than making practical AI tools for Lean.

See my Last Mile talk for more on the challenges here.

Time will tell…

V S (Mar 24 2025 at 08:59):

Jason Rute said:

This is hard to evaluate since the AI-for-ATP community and the hammer community don't talk, don't go to the same conferences, and don't address each other's work very well. I think the AI folks would say that, in papers like PISA (Isabelle), Thor (Isabelle), Magnushammer (Isabelle), and Graph2Tac (Coq/Rocq, mine), AI (or AI plus tree search) tools now beat hammers in raw numbers of theorems proved, while some tools like Thor (Isabelle), PALM (Coq/Rocq), and Cobblestone (Coq/Rocq) combine AI and hammers for best effect.

The hammer folks might counter that hammers are designed to be tuned to the situation at hand, unlike AI tools. Hammers are also more practical, since there hasn't been much effort put into making good, practical AI tools for ITPs. Even in Lean, I think the effort to make a Lean hammer has more buy-in than making practical AI tools for Lean.

See my Last Mile talk for more on the challenges here.

Time will tell…

Thank you! Appreciate the detailed answer!
It's a shame that the AI and hammer communities don't interact more; it really seems like combining the two could lead to promising results.

Auguste Poiroux (Mar 24 2025 at 10:31):

Another paper combining AI and hammers is Draft, Sketch, and Prove.
Another line of research that could be interesting to explore is using AI to automatically develop higher-level tactics. In particular, higher-level tactics could help enhance AI-based proof-search efficiency.
We have LEGO-Prover, which factors out lemmas. A similar first step for higher-level tactics could be to leverage the Mathlib library to find repeated patterns in proofs (in a fuzzy sense). The DreamCoder work is quite close to this idea.
I have no time to work on that though, so, to anyone reading this, feel free to try ^^
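A minimal sketch of that pattern-mining first step, assuming proofs are already available as lists of tactic names (the toy `proofs` data below is invented, and exact tactic n-grams stand in for the fuzzy matching mentioned above):

```python
from collections import Counter
from itertools import islice

# Toy stand-in for proofs extracted from Mathlib: each proof is a list
# of tactic names. A real pipeline would extract these from proof scripts.
proofs = [
    ["intro", "simp", "ring"],
    ["intro", "simp", "ring", "linarith"],
    ["cases", "simp", "ring"],
]

def ngrams(seq, n):
    """All contiguous length-n windows of a sequence."""
    return zip(*(islice(seq, i, None) for i in range(n)))

def frequent_patterns(proofs, n=2, min_count=2):
    """Count tactic n-grams across the corpus and keep the common ones."""
    counts = Counter(g for p in proofs for g in ngrams(p, n))
    return [(g, c) for g, c in counts.most_common() if c >= min_count]

# Frequent pairs are candidates to package as higher-level tactics.
print(frequent_patterns(proofs))
# [(('simp', 'ring'), 3), (('intro', 'simp'), 2)]
```

A fuzzier version would cluster near-identical sequences (e.g. ignoring tactic arguments) rather than counting exact matches.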

Jason Rute (Mar 24 2025 at 10:48):

I may have given a slightly misleading picture of the lack of synergy. Lots of AI papers (for Isabelle and Coq) use hammers in the AI pipeline. Thor uses them like tactics which an AI can call. DSP uses them to fill in an LLM-created proof sketch. Many papers use them to clean up LLM-generated proofs. Magnushammer replaces the hammer, but still uses the hammer's data and its reconstruction tactics.
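As a rough illustration of how such a combination composes, here is a schematic DSP-style loop; `llm_sketch`, `hammer`, and `check_proof` are placeholder names for whatever model, hammer call, and proof checker one actually has, not functions from any of these papers:

```python
def prove(theorem, llm_sketch, hammer, check_proof):
    """Draft-sketch-prove style pipeline: an LLM drafts a structured
    proof sketch with explicit gaps, and a hammer tries to close each gap."""
    sketch = llm_sketch(theorem)              # list of steps, gaps marked "gap:"
    steps = []
    for step in sketch:
        if step.startswith("gap:"):
            filled = hammer(step[len("gap:"):].strip())  # ask the hammer
            if filled is None:
                return None                   # give up if any gap stays open
            steps.append(filled)
        else:
            steps.append(step)                # keep the LLM's own step
    proof = "\n".join(steps)
    return proof if check_proof(theorem, proof) else None

# Trivial stand-ins, just to show the shape of the calls:
sketch_fn = lambda thm: ["have h1 : intermediate_lemma", "gap: h1 -> goal"]
print(prove("goal", sketch_fn, lambda g: "exact by_hammer", lambda t, p: True))
```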

Also, in the other direction, hammers use machine learning to select lemmas (often called premise selection). This actually happens at two levels. The first level selects lemmas to send to the ATP; many hammers use Naive Bayes or other older machine-learning methods here, but the new Lean hammer will use more advanced LLM-based AI, I've heard. The other level is premise selection inside the ATP itself. A number of papers have used neural tools there, including one by Josef Urban and team which focused on Sledgehammer ATPs using data from Isabelle.
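To make that first level concrete, here is a crude sketch of premise selection as symbol overlap between the goal and each candidate lemma; real selectors (Naive Bayes, k-NN, or neural) use much richer features and training data, and the toy `library` below is invented for illustration:

```python
def symbols(statement: str) -> set[str]:
    """Very rough feature extraction: the set of tokens in a statement."""
    return set(statement.replace("(", " ").replace(")", " ").split())

def select_premises(goal: str, library: dict[str, str], k: int = 2) -> list[str]:
    """Rank library lemmas by token overlap with the goal and keep the top k."""
    goal_syms = symbols(goal)
    scored = sorted(library.items(),
                    key=lambda kv: -len(symbols(kv[1]) & goal_syms))
    return [name for name, _ in scored[:k]]

library = {
    "add_comm": "a + b = b + a",
    "mul_comm": "a * b = b * a",
    "zero_add": "0 + a = a",
}
# Only the selected lemmas would be handed to the ATP.
print(select_premises("x + 0 = 0 + x", library))  # ['zero_add', 'add_comm']
```

The point is only the shape of the interface: goal in, ranked lemma list out, with the ranking function being the part that AI can improve.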

The major theme here is that in the end you have a pipeline that can use lots of tools and types of AI, whether symbolic, neural, or LLM-based, and there are many ways to combine them.

But I still stand by the fact that much of this research is not used in practice.

V S (Mar 24 2025 at 11:43):

Jason Rute said:

but the new Lean hammer will use more advanced LLM-based AI, I've heard.

Can you please elaborate? I've skimmed the forum but was unable to find any info about it

Jason Rute (Mar 24 2025 at 11:47):

@V S I don't know if it's been mentioned here. I know this from talking with one of the people involved with building the hammer. I don't think it was so super secret that I shouldn't have said anything, but I also might be misremembering details, or the details might change as they build it, so I don't want to say too much. I'll let them reveal it in their own time.

V S (Mar 24 2025 at 13:11):

Jason Rute said:

V S I don't know if it's been mentioned here. I know this from talking with one of the people involved with building the hammer. I don't think it was so super secret that I shouldn't have said anything, but I also might be misremembering details, or the details might change as they build it, so I don't want to say too much. I'll let them reveal it in their own time.

OK, thank you nonetheless! I hope the development goes smoothly; it would be great to have more tools for Lean.

V S (Mar 24 2025 at 13:12):

Auguste Poiroux said:

Another paper combining AI and hammers is Draft, Sketch, and Prove.
Another line of research that could be interesting to explore is using AI to automatically develop higher-level tactics. In particular, higher-level tactics could help enhance AI-based proof-search efficiency.
We have LEGO-Prover, which factors out lemmas. A similar first step for higher-level tactics could be to leverage the Mathlib library to find repeated patterns in proofs (in a fuzzy sense). The DreamCoder work is quite close to this idea.
I have no time to work on that though, so, to anyone reading this, feel free to try ^^

Thank you! Will keep this in mind


Last updated: May 02 2025 at 03:31 UTC