Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: is the AI for ITP community sleeping on o1?
Jason Rute (Dec 08 2024 at 21:24):
o1 and similar reasoning models like QwQ do really well on informal math reasoning, including math competition problems like AIME, the AIMO Progress Prize 2, and the Putnam exam. But we have never really benchmarked (to my knowledge) any of the o1 variants on Lean/Isabelle/Coq problems, because it is too expensive. But considering how good o1 and o1-style reasoning is, I wouldn’t be surprised if o1 plus a little extra machinery to fix errors (like DSP, Lyra, PALM, etc.) could blow much of the competition out of the water. For example, with DSP we would just be using o1 to solve the problem informally, and then the rest would be autoformalization. (It might do better in Isabelle, since Isabelle has been more stable. Lean’s library and syntax keep changing, and Coq doesn’t have a single standard even for things like the reals.)
GasStationManager (Dec 08 2024 at 22:00):
This is a good point. I agree it's worth trying o1 in e.g. DSP, with o1 acting in the informal layer.
On the other hand, the kind of reasoning/search that o1 does would really benefit from feedback from a verifier, in this case the Lean (or Isabelle, etc.) proof checker.
Last time I checked (when it was o1-preview), these models didn't yet support function calling,
which is how I implemented my Lean feedback mechanism. One could implement some custom code to make o1 talk to Lean; for my part, I'll wait for o1 to get function calling, and hopefully it won't be pro-only...
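(To make the idea concrete, here is a minimal sketch of such a checker-in-the-loop setup, as custom glue code rather than function calling. The `ask_model` callback and the bare `lean` invocation are hypothetical placeholders, not anyone's actual implementation:)

```python
import os
import subprocess
import tempfile

def check_lean(proof_src: str, lean_cmd=("lean",)):
    """Write the candidate proof to a temp file and run the Lean checker on it.
    Assumes a `lean` binary on PATH; a hypothetical helper, not a real setup."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(proof_src)
        path = f.name
    try:
        res = subprocess.run([*lean_cmd, path], capture_output=True, text=True)
        return res.returncode == 0, res.stdout + res.stderr
    finally:
        os.unlink(path)

def repair_loop(proof, ask_model, checker=check_lean, max_rounds=3):
    """Alternate between the checker and the model: on failure, the model sees
    the failing proof plus the checker's error output and proposes a fix."""
    for _ in range(max_rounds):
        ok, errors = checker(proof)
        if ok:
            return proof
        proof = ask_model(proof, errors)
    return None
```

The same loop works unchanged once function calling is available; the checker just becomes the tool the model invokes instead of code wrapped around it.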
Leni Aniva (Dec 09 2024 at 00:15):
GasStationManager said:
This is a good point. I agree it's worth trying o1 in e.g. DSP, with o1 acting in the informal layer.
On the other hand, the kind of reasoning/search that o1 does would really benefit from feedback from a verifier, in this case the Lean (or Isabelle, etc.) proof checker.
Last time I checked (when it was o1-preview), these models didn't yet support function calling,
which is how I implemented my Lean feedback mechanism. One could implement some custom code to make o1 talk to Lean; for my part, I'll wait for o1 to get function calling, and hopefully it won't be pro-only...
I tried that in my Pantograph paper and it doesn't work well
Jason Rute (Dec 09 2024 at 02:05):
@Leni Aniva I forgot about o1 in Pantograph. I guess one should really debug what is going on in Pantograph's DSP versus the original DSP paper. Are the informal o1 proofs worse, or is it just much harder to translate informal proofs into Lean proof sketches than into Isabelle proof sketches? I don’t know about o1, but GPT-4 is notoriously bad at Lean 4 syntax, constantly mixing up Lean 3 and Lean 4. I wonder if it works better with Isabelle? Or if there were subtle tricks in the DSP prompt. (Did they provide some examples of proof sketches?) Also, DSP was pass@100, whereas Pantograph's DSP was pass@1. And it looks like Pantograph was using aesop/linarith/simp as a “hammer”. This is much weaker than the Isabelle hammer and probably a large issue. In particular, the Isabelle hammer has premise selection, while this “hammer” doesn’t (except for the lemmas tagged for those tactics). It would be better to use a proof search like ABLE/Copilot as the hammer, or the real Lean hammer (is it called Lean auto?), but the latter has never been officially tested on benchmarks.
GasStationManager (Dec 09 2024 at 02:42):
From what I can tell, Lean auto doesn't (yet) have premise selection. I'm tempted to try plugging in the premise selector from LeanDojo, though I see the model is a year old now. Has anyone used LeanDojo (or something else) for premise selection?
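(As an illustration of the general retrieval idea, not LeanDojo's actual API: a retrieval-style premise selector ranks library lemma statements by similarity to the current goal. The toy sketch below uses bag-of-words cosine similarity where a real selector, such as LeanDojo's ReProver retriever, would use a trained neural encoder:)

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a trained retriever would encode
    # the statement with a neural model instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_premises(goal, library, k=2):
    """Return the k library statements most similar to the goal."""
    g = embed(goal)
    return sorted(library, key=lambda s: cosine(g, embed(s)), reverse=True)[:k]
```

The selected premises would then be passed to the hammer/tactic call, which is exactly the piece the aesop/linarith/simp "hammer" above is missing.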
Jason Rute (Dec 09 2024 at 03:09):
Lean Copilot (built on LeanDojo) has a premise selection tactic. It is Lean 4, but I don't know if it is up to date.
Jason Rute (Dec 09 2024 at 03:10):
(Also, only the premise selection tactic uses premise selection. Unlike LeanDojo's ReProver, Lean Copilot's proof search does not use premise selection.)
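(For anyone who wants to try it, usage looks roughly like this, assuming the tactic is still called `select_premises` as in Lean Copilot's README; the exact output depends on the installed version of Lean Copilot and Mathlib:)

```lean
import LeanCopilot

example (a b : Nat) : a + b = b + a := by
  select_premises  -- prints a ranked list of candidate premises
  omega
```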
Huajian Xin (Dec 09 2024 at 17:55):
We're working on this track :)
Harald Carlens (Dec 09 2024 at 19:07):
Agreed, this sounds very promising, especially with open models like QwQ, which could be fine-tuned on the latest Mathlib to somewhat mitigate the concerns around Lean's changing library.
Xiyu Zhai (Dec 23 2024 at 11:35):
I’m also working on this.
Xiyu Zhai (Dec 23 2024 at 11:36):
I believe reasoning in natural language is far more promising than reasoning on the Lean side.
Xiyu Zhai (Dec 23 2024 at 11:37):
I’m developing a hybrid AI system for stable autoformalization.
Xiyu Zhai (Dec 23 2024 at 11:37):
This, combined with o1, will make a ton of things trivial.
Last updated: May 02 2025 at 03:31 UTC