Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Discussion: Goedel-Prover-V2: A new SOTA model
Kelly Davis (Jul 16 2025 at 12:54):
Yong Lin said:
The Goedel team is happy to announce our open-sourced SOTA model Goedel-Prover-V2.
Our flagship model, Goedel-Prover-V2-32B, achieves the new SOTA...
All in 32B! Impressive! :clap:
Are there plans for an arXiv paper?
Notification Bot (Jul 16 2025 at 12:59):
A message was moved here from #announce > Goedel-Prover-V2: A new SOTA model for theorem proving by Matthew Ballard.
Yong Lin (Jul 16 2025 at 12:59):
Kelly Davis said:
Yong Lin said:
The Goedel team is happy to announce our open-sourced SOTA model Goedel-Prover-V2.
Our flagship model, Goedel-Prover-V2-32B, achieves the new SOTA...
All in 32B! Impressive! :clap:
Are there plans for an arXiv paper?
The arXiv paper is coming in 2-3 weeks.
Notification Bot (Jul 16 2025 at 13:00):
A message was moved here from #announce > Goedel-Prover-V2: A new SOTA model for theorem proving by Matthew Ballard.
Kaiyu Yang (Jul 16 2025 at 13:31):
Should we move the discussion to #Machine Learning for Theorem Proving ?
Kelly Davis (Jul 16 2025 at 13:37):
Kaiyu Yang said:
Should we move the discussion to #Machine Learning for Theorem Proving ?
Yes. It seems a more appropriate home.
Kaiyu Yang (Jul 16 2025 at 14:23):
@Matthew Ballard Could you please help us move the discussion? Thanks!
Notification Bot (Jul 16 2025 at 14:31):
This topic was moved here from #general > Discussion: Goedel-Prover-V2: A new SOTA model by Matthew Ballard.
Shreyas Srinivas (Jul 16 2025 at 14:48):
question: Does Pass@32 mean you collected the best of 32 attempts for each theorem?
Anh Nguyễn (Jul 16 2025 at 15:34):
Shreyas Srinivas said:
question: Does Pass@32 mean you collected the best of 32 attempts for each theorem?
I think it is more like generating 32 attempts and seeing if any one of them passes
Jeremy Tan (Jul 17 2025 at 02:05):
I'm trying to run Goedel-Prover-V2 on my laptop (with no GPU) and I'm always getting the error
SafetensorError: File does not contain tensor model.layers.19.input_layernorm.weight
(or similar). How do I fix this?
I want to see whether the new model can prove IMO 1997 Q3 (#27159), which is not in the training dataset, and if so what the generated output is
Jeremy Tan (Jul 17 2025 at 02:11):
I've tried it on Colab and I get the same error
Gavin Zhao (Jul 17 2025 at 03:56):
This looks very promising! I'm particularly curious about the "Verifier-Guided Self-Correction" part and what this means about using this model most effectively. For example, the way I tell Claude Code to write Lean code is:
- Write some Lean code.
- Run the Lean compiler to make sure it compiles.
- If it doesn't compile, read the error message (including context and goal) and come up with a fix. Go back to step 2.
Is Goedel optimized to be used in a similar way? Or are you just doing this feedback loop in training?
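The write-compile-fix loop described above can be sketched as a small driver around the Lean compiler. This is a minimal sketch, not Goedel's or Claude Code's actual pipeline: the helper names are illustrative, `draft_fn` stands in for whatever model call proposes code, and `lake env lean` assumes you run inside a Lake project (e.g. one depending on Mathlib).

```python
import pathlib
import subprocess
import tempfile

def compile_lean(code: str, cmd=("lake", "env", "lean")):
    """Write `code` to a temporary .lean file and run the Lean compiler on it.
    Returns (compiled_ok, compiler_output). Adjust `cmd` for your setup."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([*cmd, path], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr
    finally:
        pathlib.Path(path).unlink()

def write_compile_fix(draft_fn, code: str, max_rounds: int = 3,
                      cmd=("lake", "env", "lean")):
    """The loop from the message above: compile, and on failure feed the
    error output back to `draft_fn` (any model call) for a revised attempt."""
    for _ in range(max_rounds):
        ok, output = compile_lean(code, cmd=cmd)
        if ok:
            return code
        code = draft_fn(code, output)  # model proposes a fix from the errors
    return None
```

The key design point is that the model never sees a silent failure: each retry is conditioned on the concrete compiler message, mirroring the error/context/goal feedback described above.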
Kaiyu Yang (Jul 18 2025 at 05:17):
Jeremy Tan said:
I'm trying to run Goedel-Prover-V2 on my laptop (with no GPU) and I'm always getting the error
SafetensorError: File does not contain tensor model.layers.19.input_layernorm.weight (or similar). How do I fix this?
I want to see whether the new model can prove IMO 1997 Q3 (#27159), which is not in the training dataset, and if so what the generated output is
Can you provide the exact steps to reproduce the error (including code and commands you run)?
Kaiyu Yang (Jul 18 2025 at 05:18):
Gavin Zhao said:
This looks very promising! I'm particularly curious about the "Verifier-Guided Self-Correction" part and what this means about using this model most effectively. For example, the way I tell Claude Code to write Lean code is:
- Write some Lean code.
- Run the Lean compiler to make sure it compiles.
- If it doesn't compile, read the error message (including context and goal) and come up with a fix. Go back to step 2.
Is Goedel optimized to be used in a similar way? Or are you just doing this feedback loop in training?
Something like that at a high level (only in inference). More details will be available in the paper to be released.
Jeremy Tan (Jul 18 2025 at 06:25):
Kaiyu Yang said:
Can you provide the exact steps to reproduce the error (including code and commands you run)?
- `mkdir goedel; cd goedel` in any convenient location.
- `python3 -m venv .venv` to create the virtual environment, then `source .venv/bin/activate` to activate it.
- `python3 -m pip install transformers torch accelerate` to install the dependencies.
- For now I am just using the sample script provided on the HuggingFace page as a sanity check. Copy that script and paste it as `prover.py` in `goedel`.
- Run the script – it takes quite a while to download the tensors (naturally), and then at the line `outputs = model.generate(inputs, max_new_tokens=32768)` the error I mentioned is thrown.
Shreyas Srinivas (Jul 18 2025 at 10:54):
Anh Nguyễn said:
Shreyas Srinivas said:
question: Does Pass@32 mean you collected the best of 32 attempts for each theorem?
I think it is more like generating 32 attempts and seeing if any one of them passes
So in principle it is possible that the model succeeds on problem X in round n and not in round (n + 1), but it would still count as a success in the statistics, right?
Shreyas Srinivas (Jul 18 2025 at 11:12):
Are these passes parallel or sequential?
Ping J (Jul 18 2025 at 13:20):
this model sent a shiver down my spine
Ping J (Jul 18 2025 at 13:21):
never felt this way since gpt4
Ping J (Jul 18 2025 at 13:21):
Shreyas Srinivas said:
Are these passes parallel or sequential?
i guess parallel, each pass containing 3 self-correction iterations
Bas Spitters (Jul 18 2025 at 14:03):
Is self-correction the same as proof repair (the Curry-Howard equivalent of program repair)?
E.g. the work by Talia Ringer: https://homes.cs.washington.edu/~djg/theses/ringer_dissertation.pdf
(deleted) (Jul 18 2025 at 15:42):
The model doesn't always have an accurate picture of how Lean works. When Lean provides the compiler errors, the model can then fix the code.
Shange Tang (Jul 19 2025 at 23:44):
Shreyas Srinivas said:
question: Does Pass@32 mean you collected the best of 32 attempts for each theorem?
Pass@32 means that the prover tries 32 times for a theorem, and as long as one attempt is verified by Lean, we count it as correct.
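To make the counting rule concrete, here is a minimal sketch in plain Python (not the team's evaluation code; the attempt results are hypothetical):

```python
def pass_at_k(attempt_results):
    # Pass@k counting rule: a theorem is scored as solved if ANY of the
    # k independent attempts produces a proof that Lean verifies.
    return any(attempt_results)

def benchmark_score(per_theorem_results):
    # Fraction of theorems solved under the Pass@k rule.
    solved = sum(pass_at_k(r) for r in per_theorem_results)
    return solved / len(per_theorem_results)

# Hypothetical results: theorem A solved only on its 32nd attempt,
# theorem B never solved -> Pass@32 score of 0.5.
results = [[False] * 31 + [True], [False] * 32]
print(benchmark_score(results))  # 0.5
```

Note that under this rule a success in any one attempt counts, regardless of how many other attempts on the same theorem fail.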
Shange Tang (Jul 19 2025 at 23:52):
Ping J said:
Shreyas Srinivas said:
Are these passes parallel or sequential?
i guess parallel, each pass containing 3 self-correction iterations
Yes, our model does the passes in parallel. Each pass consists of an initial attempt at writing the entire proof plus two rounds of self-correction given the Lean feedback. The process ends when a correct proof is found at any stage (or when two rounds of self-correction still fail to find one).
Bohan Lyu (Jul 20 2025 at 00:16):
Hi! Our inference framework supports multiple rounds of modification and allows for multiple corrections for a single error.
When we report "pass@32 with self-correction," it means that we first perform inference 32 times. If none of the outputs are correct, we then apply a single correction to each error individually. If there are still no correct outputs, we continue from the previous attempts and apply another round of corrections. All of this is done within a 40k context length, which means that some errors cannot be sequentially corrected twice.
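The breadth-first schedule described above might be sketched as follows. This is a hypothetical reconstruction, not the actual inference framework (which is in the Goedel-Prover-V2 repo); `generate` and `verify` stand in for the model call and the Lean checker, and the context-length cutoff is omitted.

```python
def pass32_with_self_correction(statement, generate, verify,
                                num_passes=32, correction_rounds=2):
    # Round 0: 32 independent whole-proof attempts.
    attempts = [generate(statement, prev=None, feedback=None)
                for _ in range(num_passes)]
    for _ in range(correction_rounds + 1):
        feedbacks = []
        for proof in attempts:
            ok, feedback = verify(proof)
            if ok:
                return proof          # stop as soon as any attempt verifies
            feedbacks.append(feedback)
        # No attempt verified: revise each one from its own error feedback.
        attempts = [generate(statement, prev=p, feedback=f)
                    for p, f in zip(attempts, feedbacks)]
    return None

# Toy stand-ins: the "model" repairs its proof after one round of feedback.
def toy_generate(statement, prev, feedback):
    return "sorry" if prev is None else "exact rfl"

def toy_verify(proof):
    return proof == "exact rfl", "declaration uses 'sorry'"

print(pass32_with_self_correction("trivial goal", toy_generate, toy_verify))
```

Each attempt is only ever corrected from its own error trace, so the 32 chains stay independent and can run in parallel.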
(deleted) (Jul 21 2025 at 11:47):
What Lean and mathlib version is this trained on?
(deleted) (Jul 21 2025 at 11:48):
I plan to run this model
Ping J (Jul 21 2025 at 17:14):
i thought this takes the cake until i saw https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/ ... i wonder what post-AGI society will be like... i don't want to live on mars....
(deleted) (Jul 21 2025 at 17:24):
While a pro-AI person, I don't think speculating about a post-AGI society is very welcome on this Zulip instance
(deleted) (Jul 21 2025 at 17:25):
Especially when it's off topic for this thread
Ping J (Jul 21 2025 at 17:55):
i just noticed https://www.gemini.com/ is not owned by Google lol. (i know completely off topic but where else can community members chitchat w
Kevin Buzzard (Jul 21 2025 at 18:35):
Please take chitchat to the Lean Discord https://discord.com/invite/WZ9bs9UCvx
Kaiyu Yang (Jul 22 2025 at 01:40):
Jeremy Tan said:
Kaiyu Yang said:
Can you provide the exact steps to reproduce the error (including code and commands you run)?
- `mkdir goedel; cd goedel` in any convenient location.
- `python3 -m venv .venv` to create the virtual environment, then `source .venv/bin/activate` to activate it.
- `python3 -m pip install transformers torch accelerate` to install the dependencies.
- For now I am just using the sample script provided on the HuggingFace page as a sanity check. Copy that script and paste it as `prover.py` in `goedel`.
- Run the script – it takes quite a while to download the tensors (naturally), and then at the line `outputs = model.generate(inputs, max_new_tokens=32768)` the error I mentioned is thrown.
@Jeremy Tan We just released a GitHub repo (https://github.com/Goedel-LM/Goedel-Prover-V2), and the best way to get help is opening an issue there.
(deleted) (Jul 22 2025 at 02:11):
Ok. Looks like all Chinese firms use this exact ancient Lean and Mathlib version.
(deleted) (Jul 22 2025 at 02:12):
I'm working for a Chinese firm too, so that's... compatibility!
(deleted) (Jul 22 2025 at 02:12):
Any chance this model gets supported by inference providers in the future?
(deleted) (Jul 22 2025 at 02:21):
Oh I'm not seeing a recommended prompt
(deleted) (Jul 22 2025 at 02:22):
Found the prompt. https://github.com/Goedel-LM/Goedel-Prover-V2/blob/main/src/utils.py#L339
(deleted) (Jul 22 2025 at 02:23):
To line 377.
(deleted) (Jul 22 2025 at 02:23):
I hope RL doesn't remove its ability to follow human provided proof sketches
Gavin Zhao (Jul 22 2025 at 02:56):
Wonder how much time/money it would take to re-train models like this with more recent Lean+Mathlib versions :eyes:
Pawan sasanka ammanamanchi (Aug 06 2025 at 05:24):
The paper seems to have been released here: https://arxiv.org/abs/2508.03613
(deleted) (Aug 28 2025 at 11:06):
Hello. I used Goedel Prover v2 to prove theorems in this list: https://leanprover.zulipchat.com/#narrow/channel/219941-Machine-Learning-for-Theorem-Proving/topic/AlphaGeometry.20doesn't.20have.20a.20secure.20foundation/near/532238675
I found that even when the theorem statements are expanded into real numbers, Goedel Prover v2 still struggles with them, except for the easiest theorems.
I used ollama on Google Colab to run this model: https://ollama.com/alessandrorumampuk/Goedel-Prover-V2:32b
It could be possible that the version on ollama is not optimized, but I found that even when running the official inference code, the performance on my set of theorems is still poor.
Therefore, I believe that good benchmarking could be important in making strong AI models. I feel MiniF2F performance isn't representative of the model's strength on other high-school math problems.
(deleted) (Aug 30 2025 at 11:24):
Screenshot_20250830_182202_com_android_chrome_SameTaskWebApkActivity.jpg
Very disastrous looping behavior here.
Kelly Davis (Sep 10 2025 at 16:13):
Huỳnh Trần Khanh said:
I found that even when the theorem statements are expanded into real numbers, Goedel Prover v2 still struggles with them...
Which theorems, as stated in PastedText.txt, failed?
(deleted) (Sep 10 2025 at 16:25):
Many. But try u31 first. Goedel Prover v2 fails on u31.
(deleted) (Sep 10 2025 at 16:25):
In fact, Goedel Prover v2 rarely succeeds.
(deleted) (Sep 10 2025 at 16:26):
Thank you for looking into this.
Alessandro Sosso (Nov 27 2025 at 13:15):
@Yong Lin Hi! Do you have any information on how long it took your models to generate a proof on average (both 8B and 32B)?
I am looking for a baseline to compare with my setup.
Bas Spitters (Dec 10 2025 at 12:59):
@Yong Lin Thanks for the nice work on Goedel. Are there plans to support recent versions of Lean?
Last updated: Dec 20 2025 at 21:32 UTC