Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Best open-source/open-access models for proving Lean thms?


Andreas Gittis (Jan 17 2025 at 02:09):

I was wondering if people who actively use neural provers could give their feedback on what they think the most performant, easiest to use, and easiest to train models currently are in terms of suggesting tactics or entire proofs for Lean 4 theorems/goal states. Right now I'm considering options among InternLM, DeepSeek, GPT-4o, OpenAI o1, Llama 3, Llemma, and TheoremLlama.

From personal experience I've found there are some idiosyncrasies that don't come up in papers, e.g. when I used Llemma 34B I ended up having to engineer around the run-on/repetition issue it inherited from Llama 2 (sketch below). It's also not clear to me how to evaluate choices like DeepSeek-Math vs DeepSeek-Prover or InternLM-Math vs InternLM-Math-Plus. Maybe the differences are slight and each use case just requires experimentation, but I wanted to get a sense of what's going on from people who have experience. For instance, whether o1 has limited tokens but is way better than 4o, whether model X frequently confuses Lean 3 and Lean 4 syntax while model Y doesn't, whether fine-tuning is finicky on model Z, etc. In general, what are the common practices for selecting models, and are there any quirks to be aware of? Thanks!
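
(For the curious, here's roughly the kind of workaround I mean: a minimal sketch using Hugging Face transformers, where the checkpoint is Llemma's published 34B release and the sampling values are illustrative rather than tuned.)

```python
# Sketch: damping Llemma's run-on/repetition behavior at sampling time.
# The repetition_penalty and no_repeat_ngram_size values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/llemma_34b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/llemma_34b", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "theorem add_comm' (a b : Nat) : a + b = b + a := by\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.15,   # discourage re-emitting recent tokens
    no_repeat_ngram_size=6,    # hard-block long verbatim loops
)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```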

GasStationManager (Jan 19 2025 at 01:01):

As far as models go I think Sonnet 3.5 is currently the best for Lean 4. See also the recent talk by @Jason Rute (it was posted on this YouTube channel but is private right now; presumably it will be available soon).

The other models tend to have more trouble outputting correct Lean 4 syntax, although (disclaimer: plugging my personal project) this can be partially addressed by automating the feedback loop of checking the LLM's output in Lean, getting back the error messages, and letting the LLM fix its errors. I have recently gotten reasonable performance out of GPT-4o and DeepSeek combined with this feedback loop, as implemented by LeanTool.
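
(To illustrate the general shape of that loop, not LeanTool's actual API: a minimal sketch where `ask_llm` is a hypothetical stand-in for whatever chat-model call you use, and `run_lean` type-checks a source string by shelling out to `lake env lean` inside a Lean project.)

```python
# Sketch of the check-and-retry loop described above. ask_llm() is a
# hypothetical stand-in for your chat-model call; run_lean() type-checks
# a Lean 4 source string and returns the compiler output ("" on success).
import subprocess, tempfile

def run_lean(code: str) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["lake", "env", "lean", path],
                            capture_output=True, text=True)
    return "" if result.returncode == 0 else result.stdout + result.stderr

def prove_with_feedback(goal: str, max_rounds: int = 5) -> str | None:
    code = ask_llm(f"Write a complete Lean 4 proof of:\n{goal}")
    for _ in range(max_rounds):
        errors = run_lean(code)
        if not errors:
            return code  # Lean accepted the proof
        # Feed the compiler errors back and ask for a corrected attempt.
        code = ask_llm(f"This Lean 4 code:\n{code}\n"
                       f"failed with errors:\n{errors}\n"
                       "Output a corrected version.")
    return None  # gave up after max_rounds
```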

Jason Rute (Jan 19 2025 at 05:29):

@Andreas Gittis Honestly you have maybe played around with these models more than most of us, so you should tell us more about your experience. Seriously. How do you use these models in practice?

Jason Rute (Jan 19 2025 at 05:29):

I think right now there is almost nothing usable in practice. The best are general-purpose coding models like Sonnet 3.5, which you can use in a plugin or editor.

Jason Rute (Jan 19 2025 at 05:29):

The Lean-specific models aren't necessarily bad, but it isn't even that clear how you would use them in practice. You would have to host them yourself. If it is a next-step prediction model, would you manually query it for next steps when you get stuck (see the sketch below)? Or maybe you could hook it up to LLMLean. If it is a whole-proof completion model, that would be easier to use, since you could copy and paste the whole proof. I think, however, that most of the whole-proof models like DeepSeek-Prover are just for competition-style math. Also, older models like Llemma are probably out of date with the current libraries.
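
(For concreteness, manually querying a self-hosted next-step model might look something like this, assuming you serve it behind an OpenAI-compatible completions endpoint such as vLLM's; the model name is a placeholder and the `[GOAL]...[PROOFSTEP]` prompt is one PACT/Llemma-style convention, so check what your particular model was trained on.)

```python
# Sketch: asking a self-hosted next-tactic model for suggestions via an
# OpenAI-compatible completions endpoint (e.g. vLLM). The model name is a
# placeholder, and the [GOAL]/[PROOFSTEP] prompt format is one convention;
# check what your particular model expects.
import requests

def suggest_tactics(goal_state: str, n: int = 8) -> list[str]:
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "my-next-tactic-model",        # hypothetical
            "prompt": f"[GOAL]{goal_state}[PROOFSTEP]",
            "n": n,
            "max_tokens": 64,
            "temperature": 0.8,
            "stop": ["\n"],                          # one tactic per sample
        },
        timeout=60,
    )
    resp.raise_for_status()
    return [c["text"].strip() for c in resp.json()["choices"]]

# Example: paste in the pretty-printed goal from the Lean InfoView.
print(suggest_tactics("⊢ ∀ (n : ℕ), n + 0 = n"))
```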

Andreas Gittis (Jan 20 2025 at 19:36):

GasStationManager said:

As far as models go I think Sonnet 3.5 is currently the best for Lean 4.

Okay thanks I'll check it out.

Jason Rute said:

Andreas Gittis Honestly you have maybe played around with these models more than most of us, so you should tell us more about your experience. Seriously. How do you use these models in practice?

I'm definitely not the person to ask :sweat_smile:. I've used a couple different models just for machine learning projects and I've always been disappointed with their performance.

Jason Rute (Jan 20 2025 at 21:13):

@Andreas Gittis What would you like a good model for? I've lately been encouraging folks to think about what specifically (including MWEs) they would like these models to do but can't. Are there obvious goals they can't meet? Have you found any examples?


Last updated: May 02 2025 at 03:31 UTC