Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: ProverBench


Jason Rute (Apr 30 2025 at 21:51):

As part of the DeepSeek-Prover 2 work (#Machine Learning for Theorem Proving > DeepSeek-Prover V2), they released a benchmark called ProverBench made of a combination of AIME '24 '25 problems and textbook problems in Number Theory, Algebra, Analysis, and Probability.

Jason Rute (Apr 30 2025 at 21:51):

They seem (like the recent CombiBench benchmark) to put auxiliary definitions above the theorem statement, which is a natural way to do it.

Jason Rute (Apr 30 2025 at 21:51):

I haven't looked at it yet, but I'm afraid---like the CombiBench benchmark in #general > CombiBench : A benchmark focusing on combinatorics problem (and frankly most other recent benchmarks) that it will be filled with errors, making a significant percent of the problems either impossible or trivially true. The report doesn't say what they did to prevent this. @Huajian Xin, I know you are not the main author, but do you (or another author) know what quality control was done on this benchmark? (The report doesn't address this.)

Jason Rute (Apr 30 2025 at 23:42):

Problem elementary_algebra__p1 is not true even in English:

For a polynomial \(P\) of degree \(n\) where \(P(i)\) equals the remainder of \(i\) modulo 2 for each \(i=0,1,\dots,n\), \(P(n+1)\) is 1 if \(n\) is even and 0 if \(n\) is odd.

Jason Rute (Apr 30 2025 at 23:46):

(Just consider the n=0 or n=1 cases. The formalization is for polynomials over the integers, but I’m not sure what ring the original intended, but it isn’t true for n=0 for any ring.)

Thomas Zhu (May 01 2025 at 00:00):

The formal version of elementary_algebra__p1 also doesn't mention degree at all, so even if the informal is true the translation is also wrong

Thomas Zhu (May 01 2025 at 00:12):

(deleted)

Jason Rute (May 01 2025 at 04:26):

I’m pretty sure calculus__p1 makes no sense and can’t possibly be true. The + C notation in calculus can’t translate into Lean in that way, right?!?!

Jason Rute (May 01 2025 at 04:28):

theorem integral_f_eq_integralF_plus_C (x C : ):  x , f x = integralf x + C :=

Where f and integralf are particular functions from the reals to the reals.

Alex Meiburg (May 01 2025 at 06:18):

It would be kind of hilarious to define notation " +C" that maps a function to its quotient by "these functions are equal up to a constant shift" equivalence class, and then you have mean 'indefinite integration' into that class

Alex Meiburg (May 01 2025 at 06:21):

... but yes since that isn't currently in mathlib, that problem's definitely wrong. C becomes an implicit argument to the theorem that can be anything.

Alex Meiburg (May 01 2025 at 06:26):

This theorem is correct "by accident" (e can be any real nunber):

/-
The derivative of \(f(x)=e^x\) at \(x=e\) is \(e^e\).
-/
open Real

theorem derivative_of_exp_at_e : deriv (fun x => exp x) e = exp e :=

Alex Meiburg (May 01 2025 at 06:27):

There's also many many instances in there of using axiom, and incorrectly. Like this one:

import Mathlib

/-
Let T: Z × Z → Z be a linear function satisfying T(1, 0) = 3 and T(0, 1) = -5. Then for all (x, y) ∈ Z × Z, T(x, y) = 3x - 5y.
-/

open Int AddMonoidHom

variable (T :  ×  →+ )

axiom T10 : T (1, 0) = 3

axiom T01 : T (0, 1) = -5

lemma T_expression :  (x y : ), T (x, y) = 3 * x - 5 * y :=
  sorry

which means that "given any function T : ℤ × ℤ →+ ℤ, it is true that T (1, 0) = 3", which is of course not true

Alex Meiburg (May 01 2025 at 06:29):

Oh man, roughly 1/3rd of the dataset has axiom in it. That's too bad.

Thomas Zhu (May 01 2025 at 07:09):

At least some of the axioms seem to be because they are formalizing problems that have subproblems. For example if the informal problem has subproblems (a) and (b), (a) will be formalized as theorem a : A := by and (b) as axiom a : A; theorem b : B := by

Sabbir Rahman (May 01 2025 at 07:14):

axioms are still unnecessary there, why not just add the fact as a hypothesis

Niels Voss (May 01 2025 at 07:35):

For number_theory__p26,

import Mathlib

/-
For any real number α, there exist infinitely many pairs of positive integers (p, q) such that |α - p/q| < 1/q^2
-/
theorem infinitely_many_positive_integer_pairs_for_real_approximation (α : ) :
   (p q : ) (h : q > 0),  (n : ), n > 0   (p_n q_n : ) (h_n : q_n > 0),
  |α - (p_n : ) / q_n| < 1 / q_n^2 := sorry

there seems to be a few errors. First of all, p and q don't do anything at all. Also, there's nothing stating that there needs to be a distinct pair for each natural number n; the solution could just find a single pair and keep reusing that. But more importantly, the natural language problem itself seems impossible, since if α is -2 then |α - (p_n : ℝ) / q_n| ≥ 2 > 1 / q_n^2, so the problem is false.

Niels Voss (May 01 2025 at 07:45):

For number_theory__p27,

import Mathlib

/-
Prove that 4kxy-1 does not divide the number x^m+y^n for any positive integers x,y,k,m,n.
-/
theorem no_division {x y k m n : PNat}
  (h :  z : PNat, (x : )^(m : ) + (y : )^(n : ) = z) :
  ¬((4 * k * x * y - 1)  z) := sorry

z is being treated as an autoimplicit because it is actually unbound in ¬((4 * k * x * y - 1) ∣ z). This makes the statement false:

example : False := @no_division 3 1 1 1 1 1 ⟨(2 : +), by decide dvd_rfl

Thomas Zhu (May 01 2025 at 07:55):

Everyone should really turn off autoImplicit for formalizing any statement using an LLM, it is such an easy fix for many wrong formalizations! I would also imagine something like disabling the instances Div Nat and Div Int to also fix many errors.

Niels Voss (May 01 2025 at 07:56):

For number_theory__p28,

import Mathlib

/-
Let p be a prime number and N=\prod _{k=1}^{p-1}(k^2+1). The remainder of N upon division by p is 4 if p is congruent to 3 modulo 4, and 0 if p is congruent to 1 modulo 4.
-/

open Finset

theorem prime_remainder_theorem (p : ) (hp : Prime p) :
  let N := (range (p - 1)).prod (λ k => k^2 + 1);
  if p % 4 = 3 then N % p = 4 else N % p = 0 := sorry

The if statement fails to account for p = 2. As stated, this is false:

example : False := by
  have := prime_remainder_theorem 2 (Nat.prime_iff.mp Nat.prime_two)
  contradiction

Niels Voss (May 01 2025 at 08:02):

These were just the first three problems I looked at. Many of these seem like easy fixes, but it's at least a little concerning that a lot of the problems seem to contain mistakes.

Jason Rute (May 01 2025 at 12:02):

I should mention, the ones where they are using axiom incorrectly means that they are introducing contradictory axioms that the model can exploit (maybe easily) to prove anything. It is possible that their model “solved” a number of these vacuously true problems inflating their benchmark score.

Thomas Zhu (May 01 2025 at 12:14):

Jason Rute said:

I should mention, the ones where they are using axiom incorrectly means that they are introducing contradictory axioms that the model can exploit (maybe easily) to prove anything. It is possible that their model “solved” a number of these vacuously true problems inflating their benchmark score.

Regarding this, I found a problem mathd_algebra_275 in miniF2F-test that had a wrong formalization and escaped the attention of both Kimina Prover and DeepSeek teams. Kimina Prover didn't solve it but DeepSeek Prover v2 very apparently exploited the incorrectness of the formalization and used exfalso to prove it.

Jason Rute (May 01 2025 at 12:15):

MiniF2F is a mess

Thomas Zhu (May 01 2025 at 12:22):

Yes, there must be better version control than submitting pull requests to Kaiyu Yang's repo... but my point is that DeepSeek Prover v2 may have learned to recognize and exploit some formalization errors during the RL training.

Jason Rute (May 01 2025 at 12:22):

There are so many errors in MiniF2F. Some are in the original MiniF2F for Lean 3, but I think others are due to the translation to Lean 4. There are a few miniF2F versions which even makes it more complicated. Harmonic cleaned them up (and it is rumored that DeepMind also has a cleaned-up version), but no one wants to use Harmonic's version because they changed the splits. The best would be to use Harmonic's version with the original splits, but that seems unlikely now since so many papers use (presumably) Kaiyu's version.

Jason Rute (May 01 2025 at 12:23):

This is one reason I make such a big deal when new benchmarks come out. We should be doing better as a field!

Jason Rute (May 01 2025 at 12:25):

(I'm a bit surprised since the original DeepSeek-Prover paper had all these tests to check if an autoformalization statement was false or vacuously true (the hypotheses are contradictory). You would think they would use the same techniques to clean up their own benchmark. :smile: )

Mert Ünsal (May 01 2025 at 15:17):

I think the version we used at Kimina includes all the fixes Harmonic made + some more. I am personally not an expert on formalization and we have a small team working on this to maintain a MiniF2F with as little errors as possible. We published it here: https://huggingface.co/datasets/AI-MO/minif2f_test

Would it be possible for anyone to make a list of mistakes in this? @Junqi Liu can update our dataset to maintain a clean version.

Mert Ünsal (May 01 2025 at 15:19):

(You can also directly make a PR on the HF repo - although I agree this is not the best way to maintain a Lean project, it makes it easy for people to run benchmarks.)

Eric Wieser (May 01 2025 at 15:21):

Alex Meiburg said:

variable (T :  ×  →+ )

axiom T10 : T (1, 0) = 3

axiom T01 : T (0, 1) = -5

lemma T_expression :  (x y : ), T (x, y) = 3 * x - 5 * y :=
  sorry

This makes the question void:

import Mathlib

/-
Let T: Z × Z → Z be a linear function satisfying T(1, 0) = 3 and T(0, 1) = -5. Then for all (x, y) ∈ Z × Z, T(x, y) = 3x - 5y.
-/

open Int AddMonoidHom

variable (T :  ×  →+ )

axiom T10 : T (1, 0) = 3

axiom T01 : T (0, 1) = -5

lemma oops : False := by
  cases T10 0

Mert Ünsal (May 01 2025 at 15:25):

Thomas Zhu said:

Jason Rute said:

I should mention, the ones where they are using axiom incorrectly means that they are introducing contradictory axioms that the model can exploit (maybe easily) to prove anything. It is possible that their model “solved” a number of these vacuously true problems inflating their benchmark score.

Regarding this, I found a problem mathd_algebra_275 in miniF2F-test that had a wrong formalization and escaped the attention of both Kimina Prover and DeepSeek teams. Kimina Prover didn't solve it but DeepSeek Prover v2 very apparently exploited the incorrectness of the formalization and used exfalso to prove it.

@Thomas Zhu would you be able to help us with a corrected version?

Jason Rute (May 01 2025 at 15:29):

Is it not fixed in Harmonics’ version?

Mert Ünsal (May 01 2025 at 15:31):

This is the version we have in Kimina:

import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

/-- If $\left(\sqrt[4]{11}\right)^{3x-3}=\frac{1}{5}$, what is the value of $\left(\sqrt[4]{11}\right)^{6x+2}$? Express your answer as a fraction. Show that it is \frac{121}{25}.-/
theorem mathd_algebra_275 (x : ) (h : ((11 : ) ^ (1 / 4)) ^ (3 * x - 3) = 1 / 5) :
((11 : ) ^ (1 / 4)) ^ (6 * x + 2) = 121 / 25 := by sorry

Jason Rute (May 01 2025 at 15:35):

Harmonic's version: https://github.com/harmonic-ai/datasets/blob/d3143cb7b8b5636d4829d5b8b88b48850217e8f8/minif2f/train.json#L1257-L1261

Sébastien Gouëzel (May 01 2025 at 15:36):

Note that in the power 1/4 is a ratio of natural numbers, so it is zero...

Sébastien Gouëzel (May 01 2025 at 15:36):

This is fixed in harmonic's version.

Thomas Zhu (May 01 2025 at 15:37):

This theorem is vacuously true currently: by norm_num at h. DeepSeek-Prover-V2 exploited this while Kimina-Prover didn't find this

Jason Rute (May 01 2025 at 15:37):

The wild west of minif2f is really hurting this field.

Jason Rute (May 01 2025 at 15:38):

Also, to be clear, similarly, don't use ProofNet. Use the fixed version ProofNet# by @Auguste Poiroux .

Eric Wieser (May 01 2025 at 15:52):

Should the "canonical" versions be moved insideleanprover-community to make this kind of thing clear?

Eric Wieser (May 01 2025 at 15:52):

(Leaving the original repo owners of course, but with the power for the community to transfer ownership if the author goes AWOL)

Auguste Poiroux (May 01 2025 at 16:02):

I think having official maintained versions is a great idea. I would be happy to transfer the ownership of ProofNet#. For now, I only made a HuggingFace repo and no Github repo because I wanted to limit potential future data contamination (not sure if it is any better though). But I guess making a Github repo would be better for the following reasons:

  • Porting the dataset to new Lean versions (currently limited to Lean v4.16.0-rc2)
  • Fixing further mistakes

Eric Wieser (May 01 2025 at 16:03):

My impression is that unless the dataset is released via github, the lean community is unlikely to check the theorem statements, and even more unlikely to try and submit a correction.

Eric Wieser (May 01 2025 at 16:06):

So I think you can either say "this is a fixed dataset which may or may not actually contain real AIME/AMC/IMO/Putnam/... questions", or "this is a living dataset which will be updated to try and remain faithful to the true questions", but sitting on the fence isn't really a good option.

Auguste Poiroux (May 01 2025 at 16:10):

I agree. I prefer the second option, but this requires some kind of versioning. A PutnamBench-like leaderboard with an additional "version" column might help tracking that.

Thomas Zhu (May 01 2025 at 16:11):

@Mert Ünsal Meanwhile, I opened a PR for this fix: https://huggingface.co/datasets/AI-MO/minif2f_test/discussions/5. However the "files changed" doesn't really show the change because of LFS. Long-term I would very much support a community-maintained version on GitHub though.

Thomas Zhu (May 01 2025 at 16:12):

(Please also validate my change is ok first: load_dataset("AI-MO/minif2f_test", revision="refs/pr/5"))

Mert Ünsal (May 01 2025 at 16:26):

Thanks a lot @Thomas Zhu - I merged your PR and acknowledged your contribution in the dataset card. Please let us know if you find any other issues - I pinged our internal team to have another look as well!

Deming Xu (May 01 2025 at 17:21):

I think there are also questions that become harder than they are supposed to be

theorem amc12b_2021_p3
(x : )
(h₀ : 2 + 1 / (1 + 1 / (2 + 2 / (3 + x))) = 144 / 53) :
x = 3 / 4 :=

I was trying the Deepseek model and I failed to make it prove this MiniF2F problem. This problem looks the same in the version by Harmonic. The "denominator equals 0" issue seems to make the proof cumbersome... Can anyone prove (or disprove) this manually?

field_simp requires the denominator to be non-zero, so it can't be used here.

I think the theorem should still be correct: since x/0=0 in Lean, there are some more points on the graph of this function f(x) that are originally undefined, but this should not affect the solution of f(x)=144/53.

In general, I think the domain issues of functions like 1/x sqrt(x) tan(x) arcsin(x) in MiniF2F cause many proofs to look longer and harder than the mathematical proofs on paper.

It would be nice if we have better tactics to deal with these issues. These issues are probably uncommon in Mathlib but for these high school equation problems they become pretty serious.

I figured out a proof with by_cases

Deming Xu (May 01 2025 at 17:31):

(deleted)

Mert Ünsal (May 01 2025 at 17:38):

is this MiniF2F-test? if so, it's not great to have the problem solutions in a public forum (a large enough LLM pre-trained on the internet can remember the proof.)

Joseph Myers (May 02 2025 at 00:01):

Regarding domain issues, the convention I use in IMOLean is that any functions with restricted domain in a hypothesis require an additional formal hypothesis to be added that the arguments are within the domain for the informal function (for example, that a division in a hypothesis is not dividing by zero), while any such functions in the conclusions to the problem require such a statement to be added to the formal conclusion (so if a problem asks you to prove something involving division, you implicitly have to prove that this conclusion does not involve dividing by zero, since informal mathematics doesn't use the Lean convention on the result of division by zero). In both cases, the extra hypotheses / conclusions may be omitted from the formal statement if sufficiently obvious.


Last updated: May 02 2025 at 03:31 UTC