Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: DeepSeek-Prover


Huajian Xin (May 24 2024 at 05:31):

We submitted a new paper on arXiv and successfully achieved large-scale synthesis of formal proof data for Lean 4. The trained model proved 52% of theorems on the miniF2F benchmark using the Lean 4 language. http://arxiv.org/abs/2405.14333

Kim Morrison (May 24 2024 at 05:42):

Example b, in section A.3, which I think is one of your claimed IMO problems, the formal statement does not agree at all with the informal statement, and as a consequence a trivial proof is acceptable that doesn't at all reflect the difficulty of the problem.

Kim Morrison (May 24 2024 at 05:42):

The problem confused the Nat.sqrt function with the actual square root function Nat → Real!

Huajian Xin (May 24 2024 at 05:47):

Kim Morrison said:

Example b, in section A.3, which I think is one of your claimed IMO problems, the formal statement does not agree at all with the informal statement, and as a consequence a trivial proof is acceptable that doesn't at all reflect the difficulty of the problem.

Yes, errors still exist in the FIMO benchmark (https://github.com/liuchengwucn/FIMO) and require further inspection.

Kim Morrison (May 24 2024 at 05:47):

Maybe you shouldn't make claims about solving IMO problems then?

Huajian Xin (May 24 2024 at 05:48):

Kim Morrison said:

Maybe you shouldn't make claims about solving IMO problems then?

In addition to the FIMO benchmark, we have also solved some IMO problems in miniF2F. These have received more community inspections. We will add these examples to the paper update. Thank you for your correction!

Kim Morrison (May 24 2024 at 05:50):

I'm confused by example a as well. The problem statement is about positive integers, yet the formal statement extends the domain to zero, and then the proof immediately uses facts about the function at zero. I think this means this proof is also unrelated to the actual problem.

Sébastien Gouëzel (May 24 2024 at 05:50):

There is also a problem in Example a, in section A.3, where the informal statement only allows positive (nonzero) integers, but the formal statement allows zero, and the proof is based on the use of zero.

Kim Morrison (May 24 2024 at 05:52):

Are the IMO problems from miniF2F available?

Huajian Xin (May 24 2024 at 06:19):

Kim Morrison said:

Are the IMO problems from miniF2F available?

Yes, the following are five proofs of IMO problems from miniF2F generated by our model:

theorem imo_1959_p1 (n : ℕ) : Nat.gcd (21 * n + 4) (14 * n + 3) = 1 := by
  simp [(by ring : 21 * n + 4 = (14 * n + 3) * 1 + (7 * n + 1)), (by ring : 14 * n + 3 = (7 * n + 1) * 2 + 1)]

theorem imo_1964_p2 (a b c : ℝ) (h₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h₁ : c < a + b) (h₂ : b < a + c)
    (h₃ : a < b + c) :
    a^2 * (b + c - a) + b^2 * (c + a - b) + c^2 * (a + b - c) ≤ 3 * a * b * c := by
    norm_num
    ring_nf
    nlinarith [mul_self_nonneg (b - a + c - b), mul_self_nonneg (c - a + b - c), mul_self_nonneg (a - b + c - a)]

theorem imo_1964_p1_2 (n : ℕ) : ¬7 ∣ 2 ^ n + 1 := by
  have : 2 ^ n % 7 = 1 ∨ 2 ^ n % 7 = 2 ∨ 2 ^ n % 7 = 4 := by
    revert n
    norm_num
    intro n
    induction' n with n IH
    · simp
    · rcases IH with (IH | IH | IH) <;>
        simp [pow_succ, IH, Nat.mul_mod, Nat.add_mod, Nat.mod_eq_of_lt]
  intro H
  rw [← Nat.mod_add_div (2 ^ n) 7] at H
  revert H
  have : ¬2 ^ n % 7 = 0 := by
    intro h
    rw [h] at this
    simp at this
  omega

theorem imo_1965_p2 (x y z : ℝ) (a : ℕ → ℝ) (h₀ : 0 < a 0 ∧ 0 < a 4 ∧ 0 < a 8)
    (h₁ : a 1 < 0 ∧ a 2 < 0) (h₂ : a 3 < 0 ∧ a 5 < 0) (h₃ : a 6 < 0 ∧ a 7 < 0)
    (h₄ : 0 < a 0 + a 1 + a 2) (h₅ : 0 < a 3 + a 4 + a 5) (h₆ : 0 < a 6 + a 7 + a 8)
    (h₇ : a 0 * x + a 1 * y + a 2 * z = 0) (h₈ : a 3 * x + a 4 * y + a 5 * z = 0)
    (h₉ : a 6 * x + a 7 * y + a 8 * z = 0) : x = 0 ∧ y = 0 ∧ z = 0 := by
  refine' ⟨_, _, _⟩ <;> cases' le_total x y with h h <;>
  cases' le_total x z with h' h' <;>
  cases' le_total y z with h'' h'' <;>
  nlinarith

theorem imo_1983_p6 (a b c : ℝ) (h₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h₁ : c < a + b) (h₂ : b < a + c)
    (h₃ : a < b + c) : 0 ≤ a ^ 2 * b * (a - b) + b ^ 2 * c * (b - c) + c ^ 2 * a * (c - a) := by
  have h₄ : 0 ≤ (a - b) ^ 2 * (a + b - c) := by
    nlinarith [sq_nonneg (a - b), sq_nonneg (a + b - c)]
  have h₅ : 0 ≤ (b - c) ^ 2 * (a + c - b) := by
    nlinarith [sq_nonneg (b - c), sq_nonneg (a + c - b)]
  have h₆ : 0 ≤ (c - a) ^ 2 * (b + a - c) := by
    nlinarith [sq_nonneg (c - a), sq_nonneg (b + a - c)]
  nlinarith

Huajian Xin (May 24 2024 at 06:44):

In fact, many benchmarks currently widely used in the field of AI have errors. For example, we found that a theorem of miniF2F also has errors: the premise of https://github.com/facebookresearch/miniF2F/blob/main/lean/src/test.lean#L946 should be (3 * n) % 11 = 2.

Kim Morrison (May 24 2024 at 07:21):

I think one should be very careful to not claim anything that could be interpreted as "we solve an IMO problem" when in fact you mean "we solve a broken problem from a poorly maintained database".

Moreover, I think papers that are insufficiently careful in this regard should be promptly updated. Otherwise it looks bad for everyone, no?

Huajian Xin (May 24 2024 at 07:25):

Kim Morrison said:

I think one should be very careful to not claim anything that could be interpreted as "we solve an IMO problem" when in fact you mean "we solve a broken problem from a poorly maintained database".

Moreover, I think papers that are insufficiently careful in this regard should be promptly updated. Otherwise it looks bad for everyone, no?

Yes, we will update the paper immediately, thank you for your correction!

Eric Wieser (May 24 2024 at 09:36):

Huajian Xin said:

Yes, errors still exist in the FIMO benchmark (https://github.com/liuchengwucn/FIMO) and require further inspection.

I'm somewhat sad to see that this dataset was done in Lean 3, especially if the GitHub claim that it was built only 5 months ago is to be believed. Are there plans to update it to Lean 4?

Kevin Buzzard (May 24 2024 at 09:36):

My understanding is that formalised datasets with low-quality data are just generally problematic in this field, because you see an issue where a dataset says "this is a formalisation of question X from exam Y" and you look at it and think "well actually it's not, because you tell the computer the answer/have failed to formalise the question accurately/have only formalised part of the question" and then your instinct is to fix the dataset but then you've changed the dataset so you're invalidating all the claims in the literature of the form "we got Z% on this dataset". Furthermore you are probably genuinely invalidating them in the sense that they got the original question right because the AI spotted e.g. that 1/0=0 in Lean making the question trivial, and once you've fixed this the AI would never have solved the problem.

This is particularly problematic when the problem is labelled as "an IMO problem" and then the splash is "we solved an IMO problem". But in general this is something that researchers like @Huajian Xin can do little about, the problem is with the datasets.

My instinct is that, despite the niggles above (and I'll post one or two more below), this is a pretty impressive piece of work. As my earlier comment above indicated, two days ago I was skeptical about getting computers to generate datasets which would be of any use at all in domains like number theory, even though I'm aware of the alphageometry work which seemed to work great for Euclidean geometry; I'm also aware of the fact that Euclidean geometry is a decidable domain and elementary number theory is not, which is some kind of indication that elementary number theory is in some sense much harder. I know this is a very unfashionable viewpoint amongst the AI crowd, but I personally would be very interested in seeing a few random samples from the "dataset of 869,659 high-quality natural language math problems" and their autoformalisations!

My nit is section 5.2, example (b). I'm not entirely sure what this section is supposed to be demonstrating, but I cannot make any sense of the question. Right now (translated a little) this question says: "given a real number D and the condition that D=-a-25b-11c for some real numbers a,b,c, prove that D^2=154". This question is obviously complete nonsense. In the paper it says "the computer formalised this statement in one way, and obviously this is complete nonsense; here is the correct way". But every way to formalise this nonsense question is nonsense.

Eric Wieser (May 24 2024 at 09:49):

The formalization employs the Matrix.det function to compute the determinant,
utilizing the ![...] notation for lists of lists in Lean to represent the matrix rows.

A nit and a complaint: ![x, y, z] is notation for tuples, not lists, and matrices should be written !![a, b; c, d], not ![![a, b], ![c, d]]; the latter has the wrong type and gets the wrong multiplication (though it does end up meaning the right thing in your example).
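
For readers less familiar with Mathlib's notation, here is a minimal Lean sketch of the distinction being pointed out (the numbers are arbitrary):

import Mathlib

-- `!![_, _; _, _]` builds a `Matrix`, whose `*` is matrix multiplication,
-- whereas `![![_, _], ![_, _]]` is a bare function `Fin 2 → Fin 2 → ℤ`,
-- whose `*` (from the Pi instance) is pointwise.
example : Matrix (Fin 2) (Fin 2) ℤ := !![1, 2; 3, 4]
example : Fin 2 → Fin 2 → ℤ := ![![1, 2], ![3, 4]]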

Mario Carneiro (May 24 2024 at 09:50):

Right now (translated a little) this question says: "given a real number D and the condition that D=-a-25b-11c for some real numbers a,b,c, prove that D^2=154".

The statement actually says "for real numbers a,b,c, ..." which I guess is meant to be interpreted as "for all real numbers a,b,c" rather than "there exists real numbers a,b,c" as you have done. And given that the header is "proving under inconsistent assumptions" I guess the point is that this assumption is supposed to be inconsistent and hence D^2=154 is provable. But I guess it is "complete nonsense" if you reason that no one would ever pose a question with inconsistent hypotheses which doesn't ask to prove this unless it was a mistake (i.e. the D^2=154 part is a red herring and would not appear in real mathematics).

Mario Carneiro (May 24 2024 at 09:51):

But the original informal statement is really ambiguous about this quantifier

Joseph Myers (May 24 2024 at 13:05):

This is not the first paper where an AI system found a proof for a claimed IMO or IMO shortlist problem in a benchmark that then proved to be a completely incorrect statement for obvious reasons to do with getting types wrong (a previous one was a miniF2F statement wrongly using natural number subtraction when negative numbers could occur in the original problem). It's hard to get formal statements 100% right without also formalizing human solutions (mistakes in stating a lemma formally often get discovered when you try to formalize a proof), which would be a lot of work, but I'm pretty sure it would be possible to do a lot better than miniF2F (or, from the evidence of this paper, FIMO) simply by having a checklist of rules for how to choose types in formal statements and common pitfalls such as natural number subtraction and division.

Eric Wieser (May 24 2024 at 14:01):

Are there examples of human-formalized statements which were found to be inaccurate due to small details, yet somehow could still be proven with the expected human proof strategy due to making the same mistakes in the proof?

Kevin Buzzard (May 24 2024 at 14:04):

Does #13134 count?

Max Cui (May 24 2024 at 14:10):

Huajian Xin said:

We submitted a new paper on arXiv and successfully achieved large-scale synthesis of formal proof data for Lean 4. The trained model proved 52% of theorems on the miniF2F benchmark using the Lean 4 language. http://arxiv.org/abs/2405.14333

Thanks for the excellent work! I am very interested in this work; please allow me to ask a noob question: I think the proof verification should use a Lean 4 server or something similar, so how do you use Lean 4 from Python?

Eric Wieser (May 24 2024 at 14:15):

Regarding the paper, there is seemingly an error in A.2.b, which says:

the logical structure with existential quantifiers (∃) and implications (→) aligns well with the
problem’s requirements.

I can't make out what the "problem's requirements" are here, but the statement of

example :
    let F := fun k => Nat.choose (k + 2) 2;
    let f := fun k => Nat.choose (k + 1) 1;
    ∀ n : ℕ, 2 ≤ n → ∃ k : ℕ, f k = (n - 1) * n / 2 → (∃ m : ℕ, F m = (n - 1) * n / 2 → m = k)

is surely not what was intended, because I can prove it by choosing m = k = 0 and ignoring all the statements about f. I think this is falling into the ∃ x, 0 < x ∧ P x (there exists a positive x satisfying P) vs ∃ x, 0 < x → P x (meaningless) trap
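
For concreteness, here is a minimal Lean sketch of that trap (P is an arbitrary placeholder predicate, not from the paper): the `→` form is provable for any P at all, by picking a witness that falsifies the guard.

import Mathlib

-- `∃ x, 0 < x → P x` says nothing: the witness x = 0 falsifies `0 < x`,
-- so the implication holds vacuously, whatever P is.
example (P : ℕ → Prop) : ∃ x : ℕ, 0 < x → P x :=
  ⟨0, fun h => absurd h (Nat.lt_irrefl 0)⟩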

Patrick Massot (May 24 2024 at 14:23):

Why don’t those papers use https://dwrensha.github.io/compfiles/ instead of miniF2F? Are there so few problems in David’s database that it cannot be used?

Eric Wieser (May 24 2024 at 14:28):

One fear I've seen (which I think is why the Hoskinson miniF2F Lean 4 dataset is not on GitHub) is that anything on GitHub with solutions quickly becomes ineligible for use as a benchmark, due to LLMs training on GitHub data.

Huajian Xin (May 24 2024 at 14:29):

Patrick Massot said:

Why don’t those papers use https://dwrensha.github.io/compfiles/ instead of miniF2F? Are there so few problems in David’s database that it cannot be used?

Thank you so much for your suggestion! We use miniF2F just because most of the existing research in the field of AI uses this dataset, which makes it convenient to compare our work with previous work. We will try David’s database later.

Huajian Xin (May 24 2024 at 14:35):

Max Cui said:

Huajian Xin said:

We submitted a new paper on arXiv and successfully achieved large-scale synthesis of formal proof data for Lean 4. The trained model proved 52% of theorems on the miniF2F benchmark using the Lean 4 language. http://arxiv.org/abs/2405.14333

Thanks for the excellent work! I am very interested in this work; please allow me to ask a noob question: I think the proof verification should use a Lean 4 server or something similar, so how do you use Lean 4 from Python?

Thanks to Kim Morrison for the great work on the REPL; we found it to be very efficient when doing large-scale verification.

Jason Rute (May 24 2024 at 15:02):

Exciting work. I look forward to understanding it better. I do worry that the MiniF2F theorems are compromised. I'm not sure the LLMs you use take much effort to remove them from their training data, and I know, for example, that @David Renshaw has been writing proofs for most of them in Lean and posting them publicly. Also, since you take a DSP approach, one also has to grapple with the fact that most of these problems have public natural language solutions that have been on the internet for years. I'm not saying there is a good answer, but it is something worth thinking about in order to come up with a better solution.

Jason Rute (May 24 2024 at 15:10):

Also, do you think it is confusing we now have a "DeepSeek-Prover" and a DS-Prover in Lean? At a quick glance, I can't figure out which one "DS-Prover" in figure 1 is referring to. :/

Jason Rute (May 24 2024 at 15:41):

Another thing that I worry about is the vast amount of compute going into inference to get these scores. I really appreciate the details in your column "generation times", but I think there are still a lot of questions to be addressed by the community. It is an important observation that "cumulative" approaches work. Unlike usual ensemble approaches in machine learning, with theorem proving one knows the answer is correct and doesn't have to average solutions, but can take the best score of all models/parameters/runs. This can be taken to the extreme. Josef Urban likes to claim that automated methods can solve say 80% of Mizar theorems. (If someone could remind me where to find this reference that would be great.) But the catch is that he is taking the union of ALL approaches ever tried. One could do the same in MiniF2F, taking the union of all approaches ever tried over all theorem provers. This isn't a silly idea exactly, but it also isn't clear how to make it practical. Is one spending many GPU-days solving a problem? Or, is one just quickly running a solver with many configurations? (In my experience it is better to run 5 solvers for 1 minute each than 1 solver for 5 minutes.) I think we need better metrics for inference compute used and inference time used, or benchmarks (like in my recent paper) where the compute and time are capped at something reasonable.

Eric Wieser (May 24 2024 at 15:49):

Should we move the messages in this thread to https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/DeepSeek-Prover which seems to be the announcement for this paper?

Huajian Xin (May 24 2024 at 16:07):

Jason Rute said:

Also, do you think it is confusing we now have a "DeepSeek-Prover" and a DS-Prover in Lean? At a quick glance, I can't figure out which one "DS-Prover" in figure 1 is referring to. :/

Thanks for your comment! Sorry for the confusion, DS-Prover is our abbreviation for DeepSeek-Prover, and we will correct these problems in an updated version.

Huajian Xin (May 24 2024 at 16:09):

Jason Rute said:

Another thing that I worry about is the vast amount of compute going into inference to get these scores. I really appreciate the details in your column "generation times", but I think there are still a lot of questions to be addressed by the community. It is an important observation that "cumulative" approaches work. Unlike usual ensemble approaches in machine learning, with theorem proving one knows the answer is correct and doesn't have to average solutions, but can take the best score of all models/parameters/runs. This can be taken to the extreme. Josef Urban likes to claim that automated methods can solve say 80% of Mizar theorems. (If someone could remind me where to find this reference that would be great.) But the catch is that he is taking the union of ALL approaches ever tried. One could do the same in MiniF2F, taking the union of all approaches ever tried over all theorem provers. This isn't a silly idea exactly, but it also isn't clear how to make it practical. Is one spending many GPU-days solving a problem? Or, is one just quickly running a solver with many configurations? (In my experience it is better to run 5 solvers for 1 minute each than 1 solver for 5 minutes.) I think we need better metrics for inference compute used and inference time used, or benchmarks (like in my recent paper) where the compute and time are capped at something reasonable.

As for the efficiency of theorem proving, since we use naive sampling to generate the entire proof instead of performing a step-by-step tree search, all model generations are independent of each other and can be parallelized without any time dependence. Therefore, given strong enough parallel computing capacity, the proofs can be generated very quickly on a capable GPU server.
In our experiments, we can complete the evaluation of miniF2F with 32 samples in 34 minutes using two servers, each with 8 40GB Nvidia A100 GPUs and a 350-core CPU, and the average time per question was less than 5 seconds. Even without such large-scale parallel experiments, the time required to prove a single theorem is considerable.

Jason Rute (May 24 2024 at 16:11):

@Huajian Xin How much inference compute is used for the claimed 52% number?

Jason Rute (May 24 2024 at 16:12):

That is more than 32 samples, right?

Jason Rute (May 24 2024 at 16:14):

But as for your 32-samples numbers, 40 GPU-seconds per theorem is quite reasonable compared to a lot of other approaches.

Huajian Xin (May 24 2024 at 16:23):

Jason Rute said:

Huajian Xin How much inference compute is used for the claimed 52% number?

Yes, this result is a combination of all the evaluation experiments we have conducted in this work. However, the pass rate at 128 samples is already good, and more samples do not improve it much relative to the added inference cost. We believe that in general it is still economical to use the 7B model for whole-proof generation.

Joseph Myers (May 24 2024 at 16:57):

Kevin Buzzard said:

My understanding is that formalised datasets with low-quality data are just generally problematic in this field, because you see an issue where a dataset says "this is a formalisation of question X from exam Y" and you look at it and think "well actually it's not, because you tell the computer the answer/have failed to formalise the question accurately/have only formalised part of the question" and then your instinct is to fix the dataset but then you've changed the dataset so you're invalidating all the claims in the literature of the form "we got Z% on this dataset". Furthermore you are probably genuinely invalidating them in the sense that they got the original question right because the AI spotted e.g. that 1/0=0 in Lean making the question trivial, and once you've fixed this the AI would never have solved the problem.

This is particularly problematic when the problem is labelled as "an IMO problem" and then the splash is "we solved an IMO problem". But in general this is something that researchers like Huajian Xin can do little about, the problem is with the datasets.

I think there's a tension here between different uses of such benchmarks.

  • If you want to compare performance of one AI with another, a fixed benchmark may be appropriate.
  • But if you want to make externally meaningful assertions about the achievements of your AI, you want a simple, correct, externally meaningful description of the contents of the benchmark, such as "all IMO shortlist algebra and number theory problems from 2006 to 2021", and so want to fix any mistakes found in the benchmark. Note that FIMO is not all algebra and number theory problems from that period, only a subset (in easy-mode, i.e. with answers to "determine" problems provided) whose statements were successfully autoformalized. It's not clear how the choice of this subset might bias statistics of how many problems get solved by an AI; for externally meaningful statistics, a complete set of all shortlist problems in given categories from a given period would be better (and even better would be a complete set without excluding combinatorics or geometry). This incompleteness also affects use of Compfiles as a benchmark as well.
  • From the mathlib community point of view, these formalizations are most valuable if solutions are formalized as well and all relevant lemmas are added to mathlib (if formalizing is done with an eye to factoring out as much as possible in general form for mathlib) - formalizing such problems serves to test whether relevant parts of mathlib have all the lemmas needed to make it convenient to apply the theory in specific cases; completeness is less important, though personally I'd like to see formal solutions to all 386 past IMO problems in the mathlib archive. Benchmarks might not want solutions online, though in practice all these problems will have informal solutions online in material that LLMs have likely trained on.

Thomas Zhu (May 24 2024 at 18:20):

Very impressive results! Sorry to bring some bad news, but I carefully read the examples of FIMO solved and I don't think either of them are formulated in Lean correctly (not the authors' fault):

In section A.3.2:

fimo_2009_algebra_p3 is stated in natural language with a property of positive integers x y, but the Lean statement assumes x y : Nat. This allows the model to find a very short proof (note the induction is not needed), because you can just take y = 0 and the answer is immediate by linarith.

fimo_2016_algebra_p5_1 is stated in natural language in terms of the square root of n, but it is translated to n.sqrt instead of Real.sqrt n, resulting in a super short proof that takes a = n.sqrt and b = 1 and concludes by simp.

So it turns out both examples of IMO problems solved are false positives. It would have been great if a model could actually solve a P3 problem; it is still very impressive that the model can find these bugs and abuse them.
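
For context, the two functions being conflated have different types and different meanings; a minimal sketch (assuming a Mathlib import):

import Mathlib

-- Nat.sqrt is the floor of the square root (e.g. Nat.sqrt 2 = 1),
-- while Real.sqrt is the genuine real square root.
#check (Nat.sqrt : ℕ → ℕ)
#check (Real.sqrt : ℝ → ℝ)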

Thomas Zhu (May 24 2024 at 18:29):

Ah sorry, I just saw this was already posted somewhere else. So the above is just reposting something from another conversation

Notification Bot (May 24 2024 at 18:52):

26 messages were moved here from #Machine Learning for Theorem Proving > Synthetic dataset generation by Eric Wieser.

Eric Wieser (May 24 2024 at 18:53):

Thomas Zhu said:

Ah sorry, I just saw this was already posted somewhere else. So the above is just reposting something from another conversation

I've combined the two threads

Qian Hong (May 27 2024 at 11:58):

Jason Rute said:

Josef Urban likes to claim that automated methods can solve say 80% of Mizar theorems. (If someone could remind me where to find this reference that would be great.)

https://intelligence.org/2013/12/21/josef-urban-on-machine-learning-and-automated-reasoning/

Mario Carneiro (May 27 2024 at 19:18):

This accords with my recollection: Josef did not say 80% now; he said 60% now and 80% in 10 years (20 years from the date of that interview).

Jason Rute (May 28 2024 at 12:11):

Thanks @Mario Carneiro. I misremembered. But I’m actually referring to the 75% number in that paper, which is also the number on this website. In that number he is combining many different solvers. If we did the same for MiniF2F, Mathlib, or the Isabelle/HOL AFP (through the PISA benchmark), we would also probably see really high pass rates. (Heck, even in our Graph2Tac paper, if we combine all 25 solvers we benchmarked at some point on the test set, we get a combined 50% pass rate, almost twice the pass rate of our best single method. Of course this naive combining requires 25× more compute than the single best method.)

Jason Rute (May 30 2024 at 04:14):

@Huajian Xin Is it accurate, or a typo, that 46.3% appears twice in your Table 1? (I could believe it is correct, but worth checking. :smile: )

Jason Rute (May 30 2024 at 06:27):

Here is my opinionated summary of this paper:

Jason Rute (May 30 2024 at 06:27):

Right now there are two main machine learning approaches to automated theorem proving: (1) neural-guided tree search, as in PACT, ReProver/Lean Copilot, Graph2Tac, etc., and (2) using an LLM to predict the whole proof in one go. The latter approach is ambitious and seems to be gaining steam (and it is easy, because you can just plug the theorem into an LLM like GPT-4). But currently this LLM approach hasn’t worked too well (at least not without also being augmented with another solver, as in Draft-Sketch-Prove).

Jason Rute (May 30 2024 at 06:28):

This paper makes me think maybe the LLM-gives-a-proof-in-one-go approach is not only reasonable, but it could rival the neural-guided search approach (especially if the search uses an LLM at each search step, as that is prohibitively slow). This is a reasonably small model, and they said here on Zulip that they can get pretty good results in 5 seconds on two (good) GPU servers. This is starting to become practical.

Jason Rute (May 30 2024 at 06:28):

To improve the results of the base DeepSeek-Model they use two very well-established approaches to improve any machine-learned theorem prover:

Jason Rute (May 30 2024 at 06:28):

(1) Diversity:
If there is any diversity in your provers—whether it is the same prover with randomly sampled solutions, similar provers trained with slightly different (hyper-)parameters, or entirely different provers—they will prove more theorems. (My experience is that the number of theorems proved is approximately logarithmic in the number of diverse proof attempts.)

Jason Rute (May 30 2024 at 06:28):

The way they accomplish that here is that they don’t just generate one proof candidate (as is typical in ChatGPT or GitHub Copilot), but hundreds (or hundreds of thousands) of proof candidates. (This is called pass@k in the literature.) (Note to developers of tools: this would mean that an LLM integration with an ITP would have to be able to have the ITP check hundreds of generated proofs to know if it needs to attempt another generation. This is unlike GitHub Copilot, which just gives one result.)
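
For reference, the standard unbiased pass@k estimator from the literature (Chen et al., 2021) is 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them verify. A minimal Lean sketch (the function name and numbers are illustrative, not taken from the paper):

import Mathlib

/-- Estimated pass@k from `n` samples of which `c` verified. -/
def passAtK (n c k : ℕ) : Float :=
  1.0 - Float.ofNat ((n - c).choose k) / Float.ofNat (n.choose k)

#eval passAtK 16 3 4  -- ≈ 0.61: chance that at least one of 4 samples verifies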

Jason Rute (May 30 2024 at 06:28):

(2) Reinforcement learning (RL) (sometimes called expert iteration (EI)).
The idea here is that to improve a theorem prover, you can keep trying to prove test theorems and learn from your successful (and unsuccessful) proof attempts to improve your model. This has been repeatedly shown to work quite well. And it has two especially helpful advantages for Lean. First, Lean changes repeatedly so the base LLM model probably confuses Lean 3 with Lean 4 (and even Lean 4 keeps changing in drastic ways every few months). Second, the test set is not actually Mathlib theorems, but MiniF2F theorems which are high-school challenge problems. So having more data similar to MiniF2F helps.

Jason Rute (May 30 2024 at 06:28):

The challenge with RL/EI is to find the best list of theorems to learn from. The available options have been the following up until now:

Jason Rute (May 30 2024 at 06:28):

This paper follows the last approach--but with some new ideas.

Jason Rute (May 30 2024 at 06:29):

  • They collect a very large database of high school math competition problems. (I don’t see many details about the dataset or if they will release it.)

Jason Rute (May 30 2024 at 06:29):

  • They seem to use a dataset of informalized Lean theorems to jump-start the auto-formalizer.

Jason Rute (May 30 2024 at 06:29):

  • Autoformalization often doesn't work. (In the Autoformalization with LLMs approach, only about 15% of formalizations were correct.) They do a lot of things to weed out incorrect auto-formalizations early:
    • They have an LLM score the quality of the translation. They use a fairly detailed prompt (in Appendix A1 of the paper), where they ask the LLM to evaluate if this is a useful translation.
    • They check if the hypotheses of the formalization are inconsistent (by trying to prove the negation from the previous hypotheses). This avoids having a vacuously true translation (see the sketch after this list).
    • Only provable translations are used, but to check that the translation is provable, they simultaneously try to prove and disprove the theorem. (I want to remark that I proposed a similar idea a few years ago, although a bit more involved, which I’ve never had a chance to explore.) This speeds up proof search. (I don’t think they use slim_check, but that could possibly speed it up even further.) I assume (but am not sure) that they learn from these negative proofs (including the proofs that hypotheses are inconsistent).
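
A minimal sketch of that inconsistency check (the hypotheses here are a toy example, not from the paper): if False already follows from the hypotheses, every conclusion is vacuously provable, so the formalization should be filtered out.

import Mathlib

-- The hypotheses contradict each other, so `False` (and hence anything) is provable.
example (x : ℕ) (h₁ : x < 2) (h₂ : 2 < x) : False := by omega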

Jason Rute (May 30 2024 at 06:29):

  • I assume (but am not sure) that they further finetune not only the theorem prover, but also the auto-formalizer with the newly found autoformalizations. So they are using RL to train both a theorem prover and an auto-formalizer. This is the holy grail aim of this field (mentioned in A Promising Path Towards Autoformalization and General Artificial Intelligence). This might be the furthest we have come on this goal (but it certainly isn’t yet the full virtuous runaway loop the field is shooting for).

Jason Rute (May 30 2024 at 06:30):

Overall I think the results are pretty good (this is one of the best results on MiniF2F) and it looks like they plan to release the 712,073 auto-formalizations (hopefully with the natural language statement and the generated proof, but we will see :smiley:).

Jason Rute (May 30 2024 at 06:30):

Of course, the reliance on MiniF2F leaves the usual questions:

Jason Rute (May 30 2024 at 06:30):

  • Does this approach also help with generating Mathlib-like theorems? (Their data is geared to competition problems and auto-formalization of real math would be much harder.)

Jason Rute (May 30 2024 at 06:30):

  • Is MiniF2F still a good benchmark, or is it completely compromised? (It is even harder since, if we make a new benchmark, there are several abandoned projects, like PACT and HTPS, that would be impossible to test on it.)

Jason Rute (May 30 2024 at 06:30):

  • How does this compare per-compute to previous approaches? In particular, they are doing pass@<a huge number> (possibly just so they can claim state of the art :frown: ), but that might be similar compute-wise to a pass@100 run for a different model that uses tree search or a larger LLM.

Jason Rute (May 30 2024 at 06:30):

  • Any benchmark containing old IMO problems is liable to overclaiming. I would recommend talking with someone associated with the IMO (or @Joseph Myers) before ever making a “my AI solved an IMO problem” claim.

Jason Rute (May 30 2024 at 06:30):

  • I’m curious how well both the auto-formalizer and theorem prover work on the 10 public AIMO progress prize problems (if they came out after the data scraping for this project). (Yes, one also needs to find candidate solutions to the problems before formalizing them, but one could use the DeepSeek-Math model for that, which I think is common for that challenge.)

Jason Rute (May 30 2024 at 14:59):

@Huajian Xin Another question I realize I have. What do you do with the formalizations that you can't prove true but also can't prove false? Those seem like they would also be useful data for some purpose. (They could be useful challenge problems for example.)

Jason Rute (May 30 2024 at 22:56):

@Huajian Xin I have another question. I'm still not sure what you mean by "cumulative" in Table 1 and the paper doesn't say anything about it that I can find. Here are a few possible alternatives. Which if any is correct?

  • You are taking the union over all pass@k runs in Table 1.
  • You are taking the union over all iterations in Table 4 where each iteration includes 128 attempts for each problem. (This is my guess.)
  • You are including the MiniF2F problems in the problems you learn from and unioning over all attempts at those problems in some reinforcement learning run with multiple training iterations. (This is what the HTPS paper means by "cumulative", but I suspect this is not what you are doing, and it would be a bit suspect if you did it for MiniF2F-Test.)
  • You are taking the union over all attempts by any version of DeepSeek-Prover in your paper (Table 1, Table 4, Table 5, etc.).

Jason Rute (May 30 2024 at 22:58):

This is somewhat important since this is the number you claim in the paper (and it is higher than any other claimed numbers for MiniF2F that I'm aware of).

Huajian Xin (May 31 2024 at 03:06):

@Jason Rute Thank you very much for your insightful and detailed analysis! I apologize for the typos and any confusion. Here are my clarifications:

  1. The reported 46.3% pass@64 is indeed a typo; the correct result is 45.5%. We regret this error in our abstract and introduction claims. However, our model is rapidly improving, and our best version now outperforms the results presented in the current paper. We will provide a final performance report upon the official release of the model and data.

  2. We chose MiniF2F-like theorems instead of Mathlib-like theorems because Mathlib is a large-scale, complex scientific project. Generating identically distributed data requires careful consideration of dependencies, definitions, and lemmas needed for theorem declaration and proof. This complexity necessitates advanced generation technology like RAG, which we are currently exploring. We plan to publish a detailed paper and open-source the model and data when ready.

  3. I share your concerns about miniF2F and believe that a significant challenge in automatic theorem proving for proof assistants is the lack of comprehensive benchmarks across various fields. Regarding the use of unprovable or unfalsifiable formalized data, we found that they can indeed be used as a benchmark: for any problems our model cannot decide, if other models can prove or falsify them, it indicates their superior theorem-proving capabilities. We will mark these data in the final open-source dataset and hope this benchmark will help measure model performance.

  4. Comparing our results with existing methods is nuanced, as most use tree search rather than direct generation, and model reasoning costs vary by parameter count. The purpose of providing pass@k results from small to large k is to offer multiple options for performance comparison:
    a. Directly comparing different definitions of pass@k, though this may be unfair for whole-proof generation methods that only interact with the verifier k times.
    b. Comparing performance after a fixed number of model generations (e.g., HTPS performing 32k generations) may also be unfair to tree search methods, which require multiple steps to synthesize complete proofs.
    c. Comparing cumulative results as an approximation of pass@infinity. This is why we provide the cumulative result for miniF2F-test, although we did not perform expert iteration on these problems. However, this requires excessive computing costs and imposes too much burden on researchers.
    We apologize for not explaining "cumulative" clearly in the paper. It refers to your third guess: merging all attempts of any DeepSeek-Prover version (Tables 1, 4, 5, etc.). If it is argued that merging all pass@k runs in Table 1 is more reasonable, we will clarify both performance reports in the revised paper.

  5. We only performed expert iterations on miniF2F-valid, resulting in much higher cumulative results than miniF2F-test. Therefore, we did not publish corresponding pass@k results, as they are nearly identical to the cumulative results and offer little comparative value.

Finally, thank you again for your attention and detailed analysis. Community feedback is crucial to us. Our goal is to advance the entire field, which is also DeepSeek's mission for all fields of artificial intelligence.

Lasse Blaauwbroek (May 31 2024 at 15:40):

Jason Rute said:

I misremembered. But I’m actually referring to the 75% number in that paper which is also the number on this website.

Note that this number is obtained by using human-supplied premises. As such, these proofs are not found fully autonomously. A bit further down on the website, it reports a 58% proof rate when using trained premise selectors.

Jason Rute (May 31 2024 at 21:40):

One more comment that I want to make about this work. It should be made clear that the techniques employed are not at all specific to using LLMs as all-in-one-go theorem provers. Indeed, the same approach (figure 1) would work with the DS-Prover being replaced by any machine-learned theorem prover (LLM-based or tree-search-based or other). After the authors release their new dataset of generated formal theorems and proofs, fine-tuning on that dataset should be enough to boost any existing ML theorem prover by a decent margin on MiniF2F (maybe even surpassing the DS-Prover score :smile:). This of course comes with the caveat that using a pretrained LLM was crucial to the autoformalization step, and there might be some synergy if the same LLM is used for both autoformalization and proving. (I'm not sure if this was the case.)

Jason Rute (May 31 2024 at 21:41):

Also, I hope the authors release not only the positive data (successful proofs), but also the negative data (failed proof attempts). This negative data is a big value-add that can only come from experience-based methods like reinforcement learning, expert iteration, etc. For tree-search-based methods, one can use negative information to construct an alpha-go-style value function. For whole-proof LLM-based methods, one can use it for RLHF (but replace human feedback with Lean feedback). It also can be used for proof repair techniques (like in Baldur). @Huajian Xin Did you use any negative information when fine-tuning either the prover or the auto-formalizer? (Also, did you fine-tune the auto-formalizer at all?)

Jason Rute (May 31 2024 at 22:44):

Huajian Xin said:

c. Comparing cumulative results as an approximation of pass@infinity.

I'm not sure what you mean by "pass@infinity". Shouldn't the pass rate in the limit be close to 100%? You are just flipping a (biased) coin until it succeeds. The only way a theorem would never pass is if it has a probability exactly equal to zero by your model (e.g. the shortest proof is beyond the context length or every proof has a token with a floating point log probability of -infty). But this seems unlikely.

Jason Rute (May 31 2024 at 22:46):

Huajian Xin said:

We apologize for not explaining "cumulative" clearly in the paper. It refers to your third guess: merging all attempts of any DeepSeek-Prover version (Tables 1, 4, 5, etc.). If it is argued that merging all pass@k runs in Table 1 is more reasonable, we will clarify both performance reports in the revised paper.

To the extent that you are claiming victory with your score of 52.0, this vast amount of computing does seem a bit excessive. This "arms race" of getting a SoTA score on MiniF2F could devolve into just throwing excess computing at the problem. (By your report above of 5 seconds per 32 iterations, we are talking something like 3 hours per theorem for your 65536-generation score. That seems quite large to me.)
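
A quick back-of-envelope check of that estimate, using the figures quoted earlier in the thread (roughly 5 seconds per batch of 32 samples):

#eval 65536 / 32 * 5         -- 10240 seconds per theorem
#eval 65536 / 32 * 5 / 3600  -- ≈ 2.8 hours (integer division prints 2)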

Nonetheless, it is reasonable to expect that diversity will help. I would expect that if you just take all 15 models in tables 2,3,4,5 and union their 128 generation runs, that union would do better than your 8192 generation run (and certainly better than a 2048 generation run). I wouldn't necessarily consider this cheating. (The HTPS paper also uses a lot of hyper-parameter diversity between successive proof attempts to improve their score.)

Adam Kurkiewicz (Jun 05 2024 at 12:10):

@Jason Rute, would there be scope in the field for incremental crowdsourcing of datasets useful for learning?

I'm thinking something like Amazon Mechanical Turk, but more citizen-science oriented and of course targeted at people with knowledge of Lean. I think pyBOSSA is an open source package that has been widely used for these kinds of projects (although, of course, support for running Lean in the browser window with the task would have to be added):

https://github.com/Scifabric/pybossa

https://blog.okfn.org/2012/06/08/introducing-pybossa-the-open-source-micro-tasking-platform/

I think the ideal workflow would be something like this:

an AI method (like DeepSeek Prover) could be used to generate the most valuable data for training (e.g. theorems that are just outside of its proving capability and cannot be proven by the AI prover at the moment), and add the statements of those theorems to a job queue.

Humans (likely undergraduate mathematicians) pick up the theorems and prove them in the pyBOSSA citizen-science environment. This actually could be a useful way for mathematicians to learn Lean.

The AI method gets re-trained, improves, and proposes better training data.

This could be useful until we hit that runaway loop you've mentioned in your earlier posts.

Fabian Glöckle (Jun 10 2024 at 11:51):

Huajian Xin said:

In fact, many benchmarks currently widely used in the field of AI have errors. For example, we found that a theorem of miniF2F also has errors: the premise of https://github.com/facebookresearch/miniF2F/blob/main/lean/src/test.lean#L946 should be (3 * n) % 11 = 2.

please PR :)

Notification Bot (Jun 11 2024 at 12:04):

A message was moved from this topic to #Machine Learning for Theorem Proving > Aristotle by Jason Rute.


Last updated: May 02 2025 at 03:31 UTC