Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: DeepSeek-Prover V2 TODO List


Jason Rute (May 05 2025 at 00:44):

There have been a lot of discussions recently about the DeepSeek-Prover results and no feedback yet from the team. I was told by @Huajian Xin that they were on holiday until Monday. For when they get back and look at this, here is a summary and recommended TODO list of things that need to be addressed (if the DeepSeek team is reading this):

Jason Rute (May 05 2025 at 00:44):

  • It seems that they only used the Lean REPL (in Lean 4.9.0) to check for correctness (according to @Huajian Xin). A bug was discovered in at least three of their proofs (one proof in MiniF2F and two of the 7B Putnam solutions). The bug was in the front end of Lean: the model learned that if it gave a proof with apply? (which introduces a sorry) along with a type jump in the proof, then Lean 4.9.0 would not report the use of sorry, would not add the theorem to the environment, and would not check the proof with the kernel. (It’s not a bug in the kernel, just in the front end, and it has already been fixed in newer versions of Lean, but we don’t know if anyone previously knew about this bug. Details can be found in #lean4 > apply? might suppress other warnings/errors and in #Machine Learning for Theorem Proving > DeepSeek-Prover V2.) Any “proof” the model finds with apply? is likely wrong (although it is in principle possible to solve a goal with apply? when it works like exact?).
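To make the failure mode concrete, here is a minimal sketch of the normal (non-buggy) behavior of apply?; the theorem is made up, it is not one of their proofs, and it does not reproduce the 4.9.0 bug itself:

```lean
import Mathlib

-- Made-up example: `apply?` searches for lemmas that could close the goal,
-- prints lemma suggestions, and then admits the goal, so the elaborated proof
-- depends on `sorryAx`.  In an unaffected Lean version the declaration is
-- flagged as using `sorry`; the 4.9.0 front-end bug could suppress that
-- report when combined with the type jump described above.
theorem add_comm_demo (a b : ℕ) : a + b = b + a := by
  apply?  -- suggests e.g. `exact Nat.add_comm a b`, but leaves a `sorry` behind
```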

Jason Rute (May 05 2025 at 00:44):

  • It was interesting that their model found this bug, but at the same time, it could have been caught with the following workflow:
    • Run lake build on all the proofs (this would not have caught the bug, but it is still important to do).
    • Use #print axioms to check that the only axioms in each proof are propext, Classical.choice, and Quot.sound (see the sketch after this list). This would have caught it, not because it would have printed a sorryAx axiom (as intended), but because the bug prevented the theorem from being added to the environment (and therefore it was never checked by Lean’s kernel): #print axioms just complains that the theorem does not exist.
    • Extract the proofs and use lean4checker to test them. In general, this is the gold standard, but in your case, this would not have caught the error alone (since the incorrect theorem was never exported). It might catch other errors, however.
    • Use @Rob Lewis's autograder to check the proof against the reference. The autograder was not designed to work with Lean 4.9.0, and there needs to be a better solution going forward (so for DeepSeek-Prover v3, they should please upgrade Lean), but here is an example of how it would work for their project. @Kim Morrison already checked their known bad MiniF2F proof with the autograder, but didn't have time to check all their proofs.
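To illustrate the #print axioms step above, here is a hedged sketch on two toy theorems (the names and statements are made up):

```lean
import Mathlib

-- Toy declarations for illustration only.
theorem good_thm : 1 + 1 = 2 := by norm_num
theorem bad_thm  : 1 + 1 = 2 := by sorry

-- A clean Mathlib-style proof should depend on at most
-- `propext`, `Classical.choice`, and `Quot.sound`:
#print axioms good_thm

-- A proof that still contains a `sorry` additionally reports `sorryAx`:
#print axioms bad_thm

-- In the buggy case above, the theorem was never added to the environment at
-- all, so `#print axioms` instead complains about an unknown constant, which
-- is also a clear signal that something went wrong.
```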

Jason Rute (May 05 2025 at 00:45):

  • They should double-check all their proofs on any claimed benchmark (using the above methods) and report any changes to the benchmark scores. (I think the miniF2F score will only change by one solution, so less than a percentage point, but they should please check everything, not just the known bad proof. We don’t know about PutnamBench and ProverBench, since those solutions have not been released yet, but it is also especially suspicious that they mention that this apply? and Cardinal.toNat trick, which is a way to bump the type level in the proof to induce the bug, is common in the 7B model.)

Jason Rute (May 05 2025 at 00:45):

  • For PutnamBench, they should please have @George Tsoukalas officially verify their solutions (then they can go on the PutnamBench leaderboard). Also, @George Tsoukalas, please confirm that you check #print axioms as that would catch the current bug in some of the published PutnamBench solutions. Ideally you should also use lean4checker in addition to #print axioms.

Jason Rute (May 05 2025 at 00:45):

  • As for ProverBench, there are a number of errors in the benchmark problems (see #Machine Learning for Theorem Proving > ProverBench), making some of the problems either false or vacuously true. They should please fix all the errors and go through their benchmark rigorously before others use it. (This happens every time there is a new benchmark, including MiniF2F, ProofNet, CombiBench, and now ProverBench, but I hope this field can rise to having high quality standards. Without these standards, the benchmarks diverge: some versions have fixed problems and some do not. MiniF2F is one of the worst offenders, unfortunately.)

Jason Rute (May 05 2025 at 00:45):

  • Here are ways they can fix ProverBench:
    • Go through all the problems and identify errors. Is each statement true in English? Is it still true as formalized? (Don’t just fix the few problems we tell you have errors.)
    • Look for missing conditions.
    • Look for issues with edge cases.
    • Look for common issues with subtraction of natural numbers, division by 0, and the base case n = 0 (see the sketch after this list).
    • Be careful of implicit conditions in problems, like distinctness conditions (e.g. in “Let x and y be two numbers”, does the English assume they are distinct?).
    • Follow the approach from the DeepSeek-Prover V1 paper and use their powerful AI to find:
      • False problems
      • Vacuously true problems (where the hypotheses imply False).
    • Consider manually writing private proofs of the trickier problems to make sure they are formalized correctly.
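To illustrate two of these failure modes (truncated natural-number subtraction and vacuous truth), here is a hedged sketch with made-up statements; these are not actual ProverBench problems:

```lean
import Mathlib

-- Made-up example: natural-number subtraction truncates at zero, so a
-- statement that reads like a true algebraic identity in English can be
-- false as formalized.
example : ¬ ∀ n : ℕ, n - 1 + 1 = n := by
  intro h
  -- Instantiating at `n = 0` gives `0 - 1 + 1 = 0`, but `0 - 1 = 0` in ℕ,
  -- so this claims `1 = 0`.
  exact absurd (h 0) (by decide)

-- Made-up example of a vacuously true problem: the hypothesis `n < 0` is
-- impossible for `n : ℕ`, so the statement can be "proved" without engaging
-- with the mathematics at all.
example (n : ℕ) (h : n < 0) : n ^ 2 = 37 :=
  absurd h (Nat.not_lt_zero n)
```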

Jason Rute (May 05 2025 at 00:46):

  • MiniF2F: There are far too many versions of MiniF2F, all with different sets of corrected problems. There is no right solution here, but they should note that one proof in their paper is of a misformalized MiniF2F theorem (and that has caused them some embarrassment on X/Twitter). One option is to continue to use Numina’s version of MiniF2F. It is no more official than any other version, but it is where the state-of-the-art results currently are. Note, however, that Numina's version has many false and vacuously true problems (cc @Jia Li). Another option is to use Harmonic’s version. (I don't know what Lean version they use, cc @Aidan Swope, @Alex Meiburg.) Harmonic’s version is not that official (no one but them uses it), but it is the most cleaned up (I think they privately formalized a proof for every problem), and maybe it should become the official version (or be used by someone to make an official version). The only problem is that the valid-test splits are changed, so you would have to change them back to what literally everyone else uses for validation and test. (MiniF2F is a huge mess. It is bad that this field never cleaned it up months ago, when there started to be multiple Lean 4 versions with different sets of corrected problems. Further, it is possible that by running MiniF2F on a corrected dataset, one could increase their score just because there are fewer false problems. Yes, some problems are vacuously true and DeepSeek-Prover can find some proofs of those, but I imagine more are false than vacuously true.) Whatever they do, please don’t make yet another version of MiniF2F (at least not without building consensus)!

George Tsoukalas (May 05 2025 at 04:27):

Jason Rute said:

  • For PutnamBench, they should please have George Tsoukalas officially verify their solutions (then they can go on the PutnamBench leaderboard). Also, George Tsoukalas, please confirm that you check #print axioms as that would catch the current bug in some of the published PutnamBench solutions. Ideally you should also use lean4checker in addition to #print axioms.

To confirm, they have sent me the proofs. Since there are a fairly large number of them, it will take some time to verify each one.

Justin Asher (May 05 2025 at 05:32):

Jason Rute said:

The only problem is that the valid-test splits are changed, so you would have to change them back to what literally everyone else uses for validation and test.

I just created a version of miniF2F using Harmonic's dataset and the original validation/test splits. Please let me know if you notice any mistakes! I can add the informal proofs and headers from @Kaiyu Yang's dataset if people would find this helpful.

Eric Rodriguez (May 05 2025 at 09:08):

Harmonic's minif2f is for leanprover/lean4:v4.5.0-rc1 @ mathlib commit 8f013c457aea2ea1b0156c0c98043c9ed018529f

GasStationManager (May 05 2025 at 14:03):

A couple comments:

  • It was interesting to see that the model found and exploited this bug, but come to think of it, this will be standard behavior for any model trained with RL: if the model finds an exploit through exploration, it will use it, because as far as it is concerned it is outputting correct proofs.
  • This exploit would have been caught by a (careful) application of #print axioms, but the next one might not be.
  • With that in mind, my recommendation to all labs training prover models with RL is to incorporate the safest possible proof checking into their training pipeline. There would be performance tradeoffs, but that would be better than trying to fix things after a significant amount of training compute has been spent.

Last updated: Dec 20 2025 at 21:32 UTC