Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: PutnamBench


George Tsoukalas (Jul 17 2024 at 21:39):

Happy to share PutnamBench, an evaluation benchmark for theorem proving consisting of problems from the William Lowell Putnam Mathematical Competition (a popular collegiate math exam in the US & Canada) that our team at UT Austin has been working on for the past several months. We've gone through the competition exams from 1962-present and manually produced formalizations for 640 problems in Lean 4, Isabelle, and (partially) Coq. We've also included the informal problem statements, with thanks to the MAA.

We're sharing PutnamBench to allow the community to test their methods and give feedback as we continue to iterate on the benchmark. For the foreseeable future we will be actively maintaining the benchmark (including adding problems from future years' exams), and hope that it can serve as a useful measure of performance for systems aimed at competition-math problems.

A majority of Putnam problems do not just require proving a theorem, but also finding a "solution" or "answer" and then proving its correctness. For such problems, our formalizations also factor out the solution from the theorem statement to enable two tasks: finding the solution (where applicable) and proving it correct, or proving the theorem directly.
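Concretely, the factored format looks roughly like the following minimal sketch (the toy statement, names, and use of def here are illustrative assumptions, not an actual benchmark entry):

import Mathlib

-- The "answer" is factored out into its own definition with a sorry body...
def putnam_toy_solution : ℕ := sorry
-- (the ground-truth answer for this toy statement would be 4)

-- ...and the theorem refers to it, with a second sorry for the proof. Task 1 is
-- to supply the value of putnam_toy_solution; task 2 is to prove the theorem
-- with the ground-truth value substituted in.
theorem putnam_toy : IsLeast {n : ℕ | 10 < n ^ 2} putnam_toy_solution := sorry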

George Tsoukalas (Jul 17 2024 at 21:40):

We've cross-checked the formalizations to help minimize the number of errors. However, we acknowledge that the benchmark is likely imperfect and hope that the community can help patch any mistakes. We have not included any proofs, and plan to keep it that way, to minimize potential test-set leakage into LLM-based approaches (but we are open to your thoughts!).

An evaluation of several recent approaches on PutnamBench finds that the benchmark is quite hard: in Lean, only 1 of the 640 problems was solved by any of the methods. The situation is similar in the other languages, with no more than a handful of problems currently solvable. We see PutnamBench as a hard open challenge for the research community, aligned with the IMO Grand Challenge/AIMO Prize (in recent years, top performers on the Putnam have been IMO medalists).

George Tsoukalas (Jul 17 2024 at 21:42):

Noting that many recent approaches have used the valid split of miniF2F outside of pure evaluation, we think it makes sense to similarly split PutnamBench. We've been thinking of a (train or valid)/test split by year: the validation/potential training split would be problems from 1962 up to a certain year, and the remainder would serve as a true test split. We're interested in making a choice of split that the community can agree upon, so usage of the benchmark can remain consistent across new approaches. We would love to hear your thoughts on this! Ultimately the chosen split will form the basis for a public leaderboard which we are hosting, and we are likely to only accept new evaluation results which conform to the community agreed-upon split.

We'd also love to discuss any questions or thoughts you have about the benchmark and (specifically) the best way to ensure its longevity as a useful benchmark. If you have feedback about particular formalizations in the benchmark, feel free to raise an issue at the github repository and we will be sure to address it!

George Tsoukalas (Jul 17 2024 at 21:43):

We've also recently released a paper on arXiv which presents PutnamBench in some more detail. This work would not have been possible without the full effort of my very talented collaborators: Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and my advisor Swarat Chaudhuri.

For folks who happen to be attending ICML, some of us will be there and we would love to chat about PutnamBench and all things AI4Math!

Eric Wieser (Jul 17 2024 at 21:51):

The very first question I clicked on points to an obvious hole in mathlib: we still don't know what the volume of a triangle is!

Jason Rute (Jul 17 2024 at 22:01):

Maybe future models need to not only solve the problem, but PR all the needed background material to Mathlib. :smile:

David Renshaw (Jul 17 2024 at 22:02):

putnam_2008_b5, which shows up in Figure 2 of the paper, uses optParams for fqsat and fsat. I suspect the theorem does not mean what it is intended to mean.

Eric Wieser (Jul 17 2024 at 22:04):

Would it be possible to put a LICENSE file in the repo? The norm for lean code is usually the Apache 2 license, but obviously the Coq and Isabelle communities may have different conventions. Nevermind, I can't read

Jason Rute (Jul 17 2024 at 22:06):

I look forward to seeing real measurable progress on this benchmark, hopefully in a generalizable way which equally applies to solving real world problems and lowering the difficulty of ITP use!

Adam Topaz (Jul 17 2024 at 22:12):

David Renshaw said:

putnam_2008_b5, which shows up in Figure 2 of the paper, uses optParams for fqsat and fsat. I suspect the theorem does not mean what it is intended to mean.

That's not the only one... looking through just a few more problems, I found another optParam here: https://github.com/trishullab/PutnamBench/blob/main/lean4/src/putnam_1988_a2.lean

Adam Topaz (Jul 17 2024 at 22:13):

Here's the problem itself (1988 A-2):
Screenshot-2024-07-17-at-16-13-22-1988.pdf.png

David Renshaw (Jul 17 2024 at 22:17):

I'll open an issue on the repo.

Adam Topaz (Jul 17 2024 at 22:18):

It's also suspicious to formulate problems of the form "determine whether this is true" by making a sorry'ed prop, and asking for an iff assertion.

Adam Topaz (Jul 17 2024 at 22:19):

you could just formulate the same exact prop, and the iff is then trivial.

Jason Rute (Jul 17 2024 at 22:19):

How are you currently testing current models? Are you plugging the answer in directly?

Jason Rute (Jul 17 2024 at 22:22):

In particular, are you leaving solution in the proposition? (Or are you replacing solution with the solution?)

David Renshaw (Jul 17 2024 at 22:23):

https://github.com/trishullab/PutnamBench/issues/166

George Tsoukalas (Jul 17 2024 at 22:39):

Eric Wieser said:

The very first question I clicked on points to an obvious hole in mathlib: we still don't know what the volume of a triangle is!

Yeah, although we have done our best to state the problems correctly, there may not be support in Mathlib to easily prove them. It may also be that some problems are stated correctly but are difficult to prove because of the nature of the formalization, not the difficulty of the problem. In the case of this problem, we state the area of a triangle as the volume of the convex hull of three points (in that one we assert noncollinearity, but in general we might not include nondegeneracy conditions if the truth value of the statement remains unchanged). We haven't written proofs for the theorems, so we can't be sure the formalizations represent the most natural way to state the problems.

George Tsoukalas (Jul 17 2024 at 22:47):

Adam Topaz said:

It's also suspicious to formulate problems of the form "determine whether this is true" by making a sorry'ed prop, and asking for an iff assertion.

In the case of a solution of type Prop, one can easily enforce the solution value to be True or False. In general, it is unclear how to specify the boundary between values we deem closed-form and those we don't as the solution to a problem. This can vary quite a lot depending on the problem and output type. Of course, if a method writes in the same statement as the solution, the theorem is true via reflexivity.

At this point, we've not tried to specify what a closed-form solution is in general. I think this could be a nice topic for the community to discuss. We have not tested any models which try to do the solving and the proving in tandem (and afaik no open-source ones exist).

George Tsoukalas (Jul 17 2024 at 22:50):

Jason Rute said:

In particular, are you leaving solution in the proposition? (Or are you replacing solution with the solution?)

We replace putnam_{year}_{num}_solution with whatever the ground truth solution is, and then run the baselines on those formalizations. We haven't tested any methods which try to generate a solution and then prove its correctness (I'm not sure any currently exist). A simple method one could try is to use the formalization to derive a program/equation to solve, use the output of the executed program/computer algebra solver to fill in the theorem statement, and then prove the result via some established method. Since the baselines we tried aren't able to prove many problems even with the correct solution written in, we haven't done this; future approaches should probably intertwine the two steps much more.
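For instance (reusing the illustrative toy names from above, not an actual benchmark entry), the two-sorry file

def putnam_toy_solution : ℕ := sorry
theorem putnam_toy : IsLeast {n : ℕ | 10 < n ^ 2} putnam_toy_solution := sorry

becomes, for the proving-only baselines,

def putnam_toy_solution : ℕ := 4
theorem putnam_toy : IsLeast {n : ℕ | 10 < n ^ 2} putnam_toy_solution := sorry

and the model is asked to replace the remaining sorry with a proof.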

Eric Wieser (Jul 17 2024 at 22:59):

George Tsoukalas said:

Eric Wieser said:

The very first question I clicked on points to an obvious hole in mathlib: we still don't know what the volume of a triangle is!

Yeah, although we have done our best to state the problems correctly, there may not be support in Mathlib to easily prove them.

To be clear, this was a criticism of mathlib not your dataset!

Eric Wieser (Jul 17 2024 at 23:02):

Adam Topaz said:

It's also suspicious to formulate problems of the form "determine whether this is true" by making a sorry'ed prop, and asking for an iff assertion.

I don't think providing boolean answers in this way is meaningfully different from the issues that "determine the set of all things that satisfy P"-style questions have; and it's reasonable to declare solving this out of scope for the benchmark

Adam Topaz (Jul 17 2024 at 23:05):

I mostly agree, but I think questions of the form "determine if this is true or false" should not be out of scope! E.g. you could formulate it in the form def foo : Bool := sorry ; def problem : P \iff foo or something like this (exploiting the fact that we can interpret bools as props)

Adam Topaz (Jul 17 2024 at 23:07):

Oh, Nevermind. Of course docs#decPropToBool will make the two equivalent.

Adam Topaz (Jul 17 2024 at 23:10):

And even if you somehow enforce the prop to be either True or False, that still won't fix the issue since you can use iff_true, iff_false and em

Mario Carneiro (Jul 18 2024 at 00:03):

no, "prove or disprove this statement" is strictly stronger than "prove this statement is true or false"

Mario Carneiro (Jul 18 2024 at 00:04):

assuming the normal form is forced to be True or False, proving RiemannHypothesis <-> solution will indeed net you a million dollars

Mario Carneiro (Jul 18 2024 at 00:06):

the key point is that the solution expression must externally have a normal form, it can't simply be reducible to a normal form and it certainly can't be an arbitrary well typed proposition

Adam Topaz (Jul 18 2024 at 00:08):

I guess you can enforce the normal form to be True or False using metaprogramming (so at the Expr level). Is that what you mean?
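For concreteness, a minimal sketch of the kind of Expr-level check that would involve, assuming we only want to accept the literal constants True or False with no reduction allowed (the function name is made up):

import Lean
open Lean

-- Accept a candidate solution expression only if it is syntactically the
-- constant True or False; anything that merely reduces to one of them
-- (e.g. RiemannHypothesis ∨ ¬ RiemannHypothesis) is rejected.
def isTrueOrFalseLiteral (e : Expr) : Bool :=
  e.isConstOf ``True || e.isConstOf ``False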

Adam Topaz (Jul 18 2024 at 00:10):

Right now the setup is def solution : Prop := sorry and if I understand correctly the AI can just fill in anything in place of the sorry

Adam Topaz (Jul 18 2024 at 00:34):

All I'm saying is that (even if solution : Bool) you can do things like this:

import Mathlib

noncomputable section
open Classical

def solution : Bool := RiemannHypothesis

lemma million_bucks_please : RiemannHypothesis ↔ solution := by
  simp [solution]

Mario Carneiro (Jul 18 2024 at 01:11):

Adam Topaz said:

Right now the setup is def solution : Prop := sorry and if I understand correctly the AI can just fill in anything in place of the sorry

My understanding is that this is incorrect. The format of the benchmark includes two sorries, but the way in which this is turned into a solving task is not specified, and in particular the approach you are assuming - which is that the solver is supposed to replace both sorrys with expressions/proofs making the file typecheck - is apparently not under consideration, and instead the task is split into two parts, one for guessing the first sorry and one for solving the problem given a solution for the first sorry.

Mario Carneiro (Jul 18 2024 at 01:13):

It seems the "correct" overall task is to supply a value for the first sorry matching some ill-defined normalization condition (which could be programmatically verified by reading Exprs as you suggest), and then supplying any value for the second sorry making the resulting file typecheck.

Jason Rute (Jul 18 2024 at 14:24):

@George Tsoukalas You may want to share this benchmark directly with the Coq and Isabelle folks. The Coq Zulip has a similar machine learning channel for example. That way, you will get full breadth and also you will have experts looking over your translations. (I think Coq is especially tricky with the many possible ways to express things. I think this issue really held up the translation of MiniF2F to Coq.)

George Tsoukalas (Jul 18 2024 at 15:10):

Jason Rute said:

George Tsoukalas You may want to share this benchmark directly with the Coq and Isabelle folks. The Coq Zulip has a similar machine learning channel for example. That way, you will get full breadth and also you will have experts looking over your translations. (I think Coq is especially tricky with the many possible ways to express things. I think this issue really held up the translation of MiniF2F to Coq.)

Some of us are travelling this week and won't have the capacity to review feedback for all languages in a timely manner, so we're going to hold off on sharing for the time being. We are also still doing some reviews of newly added problems. But, feel free to share it if you please.

Joseph Myers (Jul 18 2024 at 20:47):

George Tsoukalas said:

Eric Wieser said:

The very first question I clicked on points to an obvious hole in mathlib: we still don't know what the volume of a triangle is!

Yeah, although we have done our best to state the problems correctly, there may not be support in Mathlib to easily prove them. It may also be that some problems are stated correctly but are difficult to prove because of the nature of the formalization, not the difficulty of the problem. In the case of this problem, we state the area of a triangle as the volume of the convex hull of three points (in that one we assert noncollinearity, but in general we might not include nondegeneracy conditions if the truth value of the statement remains unchanged). We haven't written proofs for the theorems, so we can't be sure the formalizations represent the most natural way to state the problems.

Volume of convex hull is fine for the volume of a simplex, but to talk about it in proper generality (i.e. Euclidean affine spaces not just Euclidean vector spaces) we need the convexity refactor that has been talked about for a long time.

Joseph Myers (Jul 18 2024 at 20:48):

(I don't know what the proper generality for this particular problem is - some competition geometry problems do talk about an origin / coordinates, in which case a more concrete Euclidean vector space is appropriate - but for problems that don't do that, a Euclidean affine space is more appropriate.)

Eric Wieser (Jul 18 2024 at 23:40):

I mean, if you really want to make things hard for yourself you can talk about the docs#MeasureTheory.Measure.hausdorffMeasure of order 2 of the convex hull, μH[2] (convexHull ℝ {A, B, C}), so that you can work with triangles in 3D too
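As a hedged illustration of that spelling (assuming import Mathlib and the scoped μH notation; the definition name is made up):

import Mathlib

open MeasureTheory

-- "Area" of the triangle with vertices A, B, C in Euclidean 3-space, as the
-- 2-dimensional Hausdorff measure of its convex hull; the value lives in ℝ≥0∞.
noncomputable def triangleArea (A B C : EuclideanSpace ℝ (Fin 3)) : ENNReal :=
  μH[2] (convexHull ℝ {A, B, C})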

Eric Wieser (Jul 18 2024 at 23:44):

In fact, due to not using this more complex spelling, this problem incorrectly defines the areas as zero.

Jason Rute (Jul 27 2024 at 06:45):

@Eric Wieser I think we would all love to see the AlphaProof paper report results on PutnamBench (and MiniF2F). But of course the worry is that if you spend 3 days per problem, that is a lot of time, and it doesn't capture whether the solution is coming just from massive scale in time or from a good algorithm. This topic came up at the AI for Math workshop yesterday.

Jason Rute (Jul 27 2024 at 06:50):

I’m being a bit polemic when I say 3 days per problem. I assume you would not do that here, but I think you get the point. Our own experiments in Graph2Tac show that pass rate is something like logarithmic in the time used, so it is hard to compare approaches using vastly different time scales (or other resources) unless you show the scaling plot.

Lasse Blaauwbroek (Jul 27 2024 at 09:09):

@Jason Rute Not only is it three days per problem, but it was also run on a massive number of TPUs, as I understand it...

Andy Jiang (Jul 27 2024 at 12:04):

On the other hand I am of the opinion that it's a good thing they have an algorithm that can do more given 3 days and huge amounts of compute. I think it indicates it is more likely to improve and scale with the addition of even more compute--which will inevitably come in the future.

Jason Rute (Jul 27 2024 at 14:02):

Yeah, in Graph2Tac we saw log-linear scaling, so with even more compute it would do better, and here it looks like, through RL and good engineering, they (hopefully) maximized the log-linear scaling rate even more. I think over long time horizons RL is really efficient as a search algorithm, and it looks like they actually put that to the test here. What would be best is if, even at lower compute budgets, one could see this log-linear trend and know it will extrapolate to the 3-day regime. It would be great if we could compare algorithms by their log-linear scaling rate instead of raw theorems proved.

Will Sawin (Jul 27 2024 at 18:27):

The scaling certainly can't be log-linear forever, since that would imply that with enough compute you would prove more than 100% of the theorems. So it might be problematic to say one algorithm is better because its scaling rate is faster: the log-linear curve of this algorithm might cross over the log-linear curve of another algorithm beyond the point where the scaling stops being log-linear. As long as you test two algorithms at the same compute budget, you can compare their performance with equal amounts of compute.

Geoffrey Irving (Jul 27 2024 at 18:33):

The replacement for log linear with a cap is inverse-sigmoid linear, incidentally.

Jason Rute (Jul 27 2024 at 18:36):

Yes, I mostly agree. It isn't perfectly log-linear and the math wouldn't allow it (although I think ideally one should also take into account that the problems themselves are sampled according to some distribution, so the cap in that regard is not fixed. Edit: I'm being silly, it is still 100%.). So one can't be certain that an algorithm will scale without checking, but showing those numbers (which are trivial to make a graph of) really helps to tell a story. I don't understand why more papers don't plot this sort of thing. You don't even have to run any new experiments!

Jason Rute (Jul 27 2024 at 18:50):

Actually, here is a mathematical question. I think this "log linear" behavior has also been observed in the setting of n independent samples. Say sampling independent proofs from DeepSeek-Prover shows that the percentage of test problems solved is proportional to log(n). But since the samples are independent, we actually should know the math. Do we know what that scaling law is? Is it inverse-sigmoid linear? (The issue is that it isn't one test problem but a set (or distribution) of them, so I think this is sampling from a binary exchangeable sequence rather than an i.i.d. sequence: first you select the problem randomly from the test set and then sample independently until success.)

Will Sawin (Jul 28 2024 at 01:03):

I think you can get a bunch of inequalities that the function has to satisfy, but no identities. The success probability as a function of time will be a convex combination of functions of the form 1 - c^(-x), which is equivalent to being < 1 with positive first derivative, negative second derivative, positive third derivative, ...
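Spelling out the mixture view behind this, under the simplifying assumption that each test problem i has a fixed per-sample success probability p_i and the n samples are independent:

pass@n = (1/m) * Σ_i (1 - (1 - p_i)^n) = (1/m) * Σ_i (1 - c_i^(-n)),   where c_i = 1/(1 - p_i),

i.e. exactly a convex combination of terms of the form 1 - c^(-x); any apparent log-linearity presumably comes from how the p_i are spread across the problem set rather than from any single term.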

Joseph Myers (Aug 09 2024 at 12:57):

Eric Wieser said:

I mean, if you really want to make things hard for yourself you can talk about the docs#MeasureTheory.Measure.hausdorffMeasure of order 2 of the convex hull, μH[2] (convexHull ℝ {A, B, C}), so that you can work with triangles in 3D too

I think you'll probably need to scale the Hausdorff measure to get the right notion of area in a Euclidean space. docs#MeasureTheory.hausdorffMeasure_pi_real is specifically about n-dimensional measures in an n-dimensional space with sup metric; changing to the Euclidean metric will scale the Hausdorff measure. The PutnamBench problem you found trying to use areas in three dimensions is using Fin 3 → ℝ (i.e. sup metric), and 2-dimensional Hausdorff measure with 3-dimensional sup metric has its own problems; you probably get the sup of the areas of the projections of the triangle onto the spaces spanned by any two of the coordinate axes, which is also not the standard notion of area of a triangle in three dimensions.

I guess this all goes to illustrate how it's hard to be confident in the correctness of a formal problem statement without formalizing a solution, and hard to be confident in the correctness of a formal definition (say, of the area of a triangle) if you don't add enough API lemmas to show it behaves as expected. And we could certainly do with an appropriate API for volumes of arbitrary sets in Euclidean space (using scaled Hausdorff measure, say) - but it does need API, not just a definition we hope gets the scaling right.

Eric Wieser (Aug 14 2024 at 11:43):

David Renshaw said:

I'll open an issue on the repo.

I've already opened a draft PR about it (trishullab/PutnamBench#202), but you've run into a very common Lean footgun here: set_option autoImplicit false is not the default, and as a result a handful of the formalizations have nonsensical auto-inserted quantifiers.
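A minimal, self-contained illustration of the footgun (hypothetical declaration, assuming the default autoImplicit true):

-- P and Q were never declared, so Lean auto-binds them as implicit variables;
-- the statement elaborates without any error or warning, but it now quantifies
-- over names that were never meant to be variables.
theorem looks_fine (h : P ∧ Q) : Q := h.2

#check @looks_fine  -- ∀ {P Q : Prop}, P ∧ Q → Q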

George Tsoukalas (Mar 11 2025 at 21:30):

Howdy folks! Sharing some updates for PutnamBench:

  1. Since release we have had a lot of community support in adding new problems and patching formalization errors - big thanks to @Oliver Nash, @Eric Wieser, @Edward Lockhart, @Moritz Firsching, @Simon Sorg, @Paul Lezeau, @Salvatore Mercuri, and @Fabian Glöckle. In particular, the problems from 2024's Putnam Competition have been added.
  2. We have also received some leaderboard entries - the best performing model on PutnamBench presently is Self-Play Theorem Prover (https://github.com/kfdong/STP), which solves 8 of the 657 problems in pass@4096. Afterwards we have Goedel-Prover SFT (pass@512) & ABEL (pass@596) which prove 7 of the 657 problems.
  3. We have now run evaluations with several leading models in generating one proof attempt per problem (pass@1): o3-mini, Claude 3.7, Gemini 2.0 Flash Thinking Experimental, Deepseek R1. Here are the results, as well as the costs of these experiments:
  • OpenAI o3-mini: 0/657. Cost: ~$20.
  • Claude Sonnet 3.7: 0/657. Cost: ~$21.
  • Gemini 2.0 Flash Thinking Experimental (1/21): 0/657. Cost: ~Free through AIStudio API.
  • DeepSeek R1: 0/657. Cost: ~$3 (ran through API)

While the final numbers are poor, the models do produce proofs which look correct; usually the proofs fail for syntactic reasons. My general feeling is that the models try to produce very human-readable proofs, which are often extremely verbose and spell out many steps as opposed to a simple linarith. Some of the models confuse Lean 3 and Lean 4 syntax, which is a well-known issue. Most of the problems which have been solved before by other methods rely heavily on such automated tactics, and the proofs themselves usually aren't very long. Given the syntax-adherence issues, I think it's critical that LLM + theorem-proving approaches use Lean error information in an agentic loop. For instance, I expect that an agent using Sonnet 3.7 that can revise its proof over N iterations could take the top spot on the leaderboard. Another important angle is effective retrieval from Mathlib.

You can also run these experiments yourself! See the code at https://github.com/GeorgeTsoukalas/FTPEvals/ - feel free to use it for your own evaluations! The Lean interaction relies on @Amitayush Thakur's itp-interface (https://github.com/trishullab/itp-interface/), which has an extremely easy setup (assuming Lean is installed, the whole project can be set up with a simple setup script).

George Tsoukalas (Mar 11 2025 at 21:35):

Here is an example of the generated proofs of the "easiest" problem in PutnamBench:

o3-mini
-- Error: need to do use x,y,z not use x, use y, use z (and split will fail)
theorem putnam_1988_b1
: ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ, x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1 := by
intros a ha b hb
let x := a - 1
let y := b - 1
let z := (1 : ℤ)
use x, use y, use z
split
  linarith
split
  linarith
split
  norm_num
ring

Gemini Thinking:
-- Error: split not applicable
theorem putnam_1988_b1
: ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ, x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1 := by
intro a ha b hb
use a - 1, b - 1, 1
split
· exact Nat.sub_pos_of_lt ha
split
· exact Nat.sub_pos_of_lt hb
split
· exact zero_lt_one
· ring

Sonnet 3.7:
-- Error: Calc block fails
theorem putnam_1988_b1
: ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ, x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1 := by
intros a ha b hb
use 1, a-1, b-1
have ha' : a - 1 > 0 := by
  have : a ≥ 2 := ha
  linarith
have hb' : b - 1 > 0 := by
  have : b ≥ 2 := hb
  linarith
constructor
· simp
constructor
· exact ha'
constructor
· exact hb'
· calc
    1 * (a - 1) + 1 * (b - 1) + (a - 1) * (b - 1) + 1
    = (a - 1) + (b - 1) + (a - 1) * (b - 1) + 1 := by simp
    = a - 1 + b - 1 + (a * b - a - b + 1) + 1 := by
      congr
      ring
    = a * b := by ring

Deepseek R1:
-- Error: did not intro the hypotheses a ≥ 2, b ≥ 2
theorem putnam_1988_b1
: ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ, x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1 := by
intro a
intro b
use 1, a - 1, b - 1
constructor <;> norm_num <;>
  nlinarith [Nat.mul_le_mul_left a (by linarith : b ≥ 2)]

GasStationManager (Mar 12 2025 at 16:40):

In my experience these frontier models generally do better when given the opportunity to fix their code in a feedback loop with the Lean executable. (I saw that you mentioned this direction in your post as well.) E.g. here's a correct solution from Sonnet 3.7 with LeanTool:

--theorem putnam_1988_b1
--: ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ, x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1


import Mathlib

theorem problem_statement : ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ,
  x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1 := by
  -- We'll prove this by constructing specific values for x, y, and z
  intro a ha
  intro b hb

  -- Let's try x = a - 1, y = 1, z = b - 1
  use a - 1, 1, b - 1

  -- Now we need to prove the four conditions
  constructor
  -- Prove x > 0
  have hx : a - 1 > 0 := by
    have : a ≥ 2 := ha
    linarith
  exact hx

  constructor
  -- Prove y > 0
  have hy : (1 : ℤ) > 0 := by
    norm_num
  exact hy

  constructor
  -- Prove z > 0
  have hz : b - 1 > 0 := by
    have : b ≥ 2 := hb
    linarith
  exact hz

  -- Prove a * b = x * y + x * z + y * z + 1
  -- Substitute x = a - 1, y = 1, z = b - 1
  calc
    a * b = a * b := by rfl
    _ = (a - 1) * 1 + (a - 1) * (b - 1) + 1 * (b - 1) + 1 := by
      ring

GasStationManager (Mar 12 2025 at 19:40):

Interestingly the non-reasoning versions of the models seems to be a bit better at diligently checking the code with Lean, and responding to the error messages, compared to their reasoning counterparts. Here's GPT 4o with LeanTool:

import Mathlib

theorem putnam_1988_b1: ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ, x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1 := by
  intros a ha b hb
  use a - 1, b - 1, 1
  constructor
  { linarith }
  constructor
  { linarith }
  constructor
  { linarith }
  calc
    a * b = (a - 1) * (b - 1) + (a - 1) * 1 + (b - 1) * 1 + 1 := by ring

GasStationManager (Mar 12 2025 at 19:42):

For harder problems it may be interesting to let a reasoning model to come up with a proof sketch, then finish up with a non-reasoning model.

Andy Jiang (Mar 13 2025 at 01:05):

question for those on the ML side: how much work/cost would it be to RL post-train DeepSeek R1 against a Lean 4 environment to mitigate these syntactic issues (and hopefully also improve math skills with more training)? Let's say, for argument's sake, you have already gathered 100k Lean 4 problem statements roughly in the distribution of this test set. Is it something we could hope to achieve outside of industry-level funding? Is there hope to get this sort of funding from academic sources and to do it in a more open/collaborative way?

Zheng Yuan (Mar 13 2025 at 03:07):

Andy Jiang said:

question for those on the ML side: how much work/cost would it be to RL post-train DeepSeek R1 against a Lean 4 environment to mitigate these syntactic issues (and hopefully also improve math skills with more training)? Let's say, for argument's sake, you have already gathered 100k Lean 4 problem statements roughly in the distribution of this test set. Is it something we could hope to achieve outside of industry-level funding? Is there hope to get this sort of funding from academic sources and to do it in a more open/collaborative way?

I think augmenting to 100k Lean 4 problems will leave many of them unprovable and make RL very hard to get working.

Andy Jiang (Mar 13 2025 at 03:52):

Do you mean like they will be false or too high in difficulty? I'm more asking about how doable the rest of process is after getting a curated 100k dataset. That being said I don't think human-curating 100k problem statements is an astronomical task with a small team

Zheng Yuan (Mar 13 2025 at 07:21):

They will be false and hard at the same time, and we have no way to verify whether a given problem is too hard or incorrectly formalized. I think this part could be harder than the actual RL. Also, if you only focus on RL, there exist some datasets which can be used, including Lean Workbook, DeepSeek-Prover-V1, and STP.

Andy Jiang (Mar 13 2025 at 10:24):

So are you saying that you don't trust a group of human curators to do a good enough job of producing/choosing a huge dataset of formal problems unless we also ask them to provide formal proofs? Let me accept your premise for a second--how bad is it for RL if say 50% of statements are false? It will just increase compute right? It's not like they will be rewarded for wrong behaviors

Lenny Taelman (Mar 13 2025 at 11:41):

Such questions could also be approached experimentally: generate (synthetically) just 100 problems, and have a team of humans inspect them to count how many are: trivially true, trivially false, true and doable to prove, false and doable to disprove, too hard to decide either way...

Zheng Yuan (Mar 13 2025 at 12:54):

Andy Jiang said:

So are you saying that you don't trust a group of human curators to do a good enough job of producing/choosing a huge dataset of formal problems unless we also ask them to provide formal proofs? Let me accept your premise for a second--how bad is it for RL if say 50% of statements are false? It will just increase compute right? It's not like they will be rewarded for wrong behaviors

I was assuming the 100k would be obtained from an LLM. If you mean an expert-curated 100k, I believe RL would just work.

Andy Jiang (Mar 13 2025 at 23:47):

But do you have an estimate of how much the RL part would cost--as in do you expect it could be done by the lean community if such an open dataset appeared? Or would it have to be done in a big AI lab? Do you think if such a dataset appeared one of the medium/large labs would post train R1 on it and release the weights after training?

Joseph Myers (Mar 13 2025 at 23:53):

I think current PutnamBench still has the 2013 A5 problem discussed above (2-dimensional Hausdorff measure in Fin 3 → ℝ does not correspond to any standard notion of area). If you used EuclideanSpace ℝ (Fin 3) you still wouldn't have area (we still need a definition / API in mathlib for a version of Hausdorff measure that's appropriately scaled for the Euclidean metric), but you would at least have area scaled by some positive constant (so be clearly-to-humans equivalent to the original problem in this case, where scaling the area doesn't actually matter to the statement).

Joseph Myers (Mar 13 2025 at 23:57):

The difficulty in getting the right formal statement without writing a formal proof is evident from experience in many contexts (not just benchmarks or competition problems, but also a range of formalization projects). But we do know from AlphaProof that RL (for learning how to prove things in Lean) works given a collection of formal problem statements, some of which are wrong (and trying both proving and disproving the statements), since proving anything can be useful training for an AI even if the formal statement proved doesn't correspond to the intended informal statement.

Eric Wieser (Mar 14 2025 at 00:04):

Of course, even if a benchmark full of misformalizations is still usable for training an AI, it's not great as an evaluation intended to imply comparisons to humans, as the humans almost always get the correct questions which are usually harder.

Eric Wieser (Mar 14 2025 at 00:08):

I think correcting such evaluation benchmarks as mistakes are found is the right thing to do, though; while this can invalidate old evaluations, the same could be said of bumping the Lean/Mathlib version, which can certainly affect problem difficulty. The alternative of everyone evaluating their models only on a frozen benchmark on Lean/Mathlib 4.7.0 (say) forevermore would result in models that excel at the benchmark but are far less useful to real users on the bleeding edge.

Joseph Myers (Mar 14 2025 at 00:08):

Some types of mistakes in writing formal statements could be addressed by a checklist - both standard Lean translations of informal concepts (like in my proposed conventions for formal IMO problem statements), and common pitfalls (which include such things as the metric on Fin 3 → ℝ, and natural number subtraction, for example) - as well as more general things such as "check each condition in the informal problem is mentioned in the formal statement (and not just in an auxiliary definition you forgot to use in the actual statement)" and "check the types in the formal statement are consistent with the informal one". But I don't know what level of accuracy is likely to be achieved even with a careful checklist. (The next step from such a checklist is maybe "multiple people independently write formal statements for the same problem, verify their own and each other's against the checklist, and then discuss the statements with each other and decide which single statement is the best one to use for the benchmark / competition for AI entrants / ....)

Joseph Myers (Mar 14 2025 at 00:16):

Although humans do almost always get the correct questions, to some extent that depends on the same sort of checks for matching types etc. between human languages as are important in checking formal problem statements. When draft translations of the problems are displayed at an international olympiad for leaders to review each others' translations, getting a type wrong (e.g. saying "integer" instead of "real") or missing a condition (e.g. forgetting "positive"), or having a typo resulting in the wrong variables being used in an equation, are fairly common mistakes found in the drafts and fixed before they go to contestants. (Those are also things you can often spot without knowing the language in question.) On the rarer occasions where such a mistake gets through to contestants, it helps that most contestants will have requested the (correct) English version in addition to their native language.

Joseph Myers (Mar 14 2025 at 00:23):

For comparison to humans, as well as correcting mistakes in the statements being a good idea, updating Lean and mathlib is a good idea, not just to encourage practical usefulness in an AI, but also because newer mathlib means more definitions so more problems can be cleanly stated at all. (Does PutnamBench have any list somewhere of the missing mathlib definitions that were the reasons for each omitted Putnam problem in the given date range not being included - an analogue of my checklist at the bottom of https://github.com/jsm28/IMOLean of geometrical definitions that should be added to mathlib in support of being able to cleanly state more past and probably more future IMO problems?)

George Tsoukalas (Mar 30 2025 at 17:14):

There's no list like this right now, but I think it would be a good idea to add one. There are some contributors adding a fairly large set of new problems; once that's all in, I can go through and annotate the remaining ones. I also think a checklist of common formalization bugs would be good to have, and it could be compiled from PutnamBench PRs. I have been meaning to do this because I'm interested in having one of the new LLMs do an automated sweep through PutnamBench for more bugs, but haven't had the bandwidth to carry it out. If anyone wants to be involved in a small project to do that, I would be very happy to support it :)

George Tsoukalas (Mar 30 2025 at 17:22):

Also, there was a slight bug in our evaluation code (written on top of itp-interface, not in the Lean tool itself) that caused a small number of correct proofs to be reported as incorrect. This has been fixed; the numbers for Claude 3.7 & o3-mini are still both 0 solved. However, R1 did solve 1 problem, as did Gemini 2.0 Flash Thinking.

We also ran the recently released Gemini 2.5 Pro Experimental with pass@1, same prompt as the other models, and it solved 3/657 problems. Like before, I think wrapping 3.7/R1/Gemini models in an agent loop, with an optimized prompt, is going to solve many more. Still no eval for Grok 3 because I have not gotten API access yet.

David Renshaw (Mar 30 2025 at 22:57):

What do the hearts (":green_heart:", ":blue_heart:") mean on the leaderboard?

Bolton Bailey (Mar 30 2025 at 23:35):

David Renshaw said:

What do the hearts (":green_heart:", ":blue_heart:") mean on the leaderboard?

According to this, it seems it means that the model is fully (:green_heart:) or partially (:blue_heart:) open sourced (it would indeed be good if the webpage itself explained this) (EDIT: Now it does, kudos!)

Mario Carneiro (Mar 30 2025 at 23:45):

why is lean green and the others red?

Alex Meiburg (Mar 31 2025 at 14:56):

Why is "Self-play Theorem Prover" not considered open sourced?

George Tsoukalas (Mar 31 2025 at 15:47):

Ah, I forgot to update that when they open sourced the model. Will add info about the hearts to the page, and make the border colors uniform, not sure how I never noticed that before. Also, I've run an eval with DeepSeek-V3-0324 and it solves no problems.

Ralph Furman (May 01 2025 at 18:40):

--theorem putnam_1988_b1
--: ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ, x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1

I think this formalization is slightly easier than the original (which just states that the LHS is composite), since it sets up the algebraic substitution, which in this problem works for any non-trivial factorization.
Then the key trick is making the substitution. The substitution is non-obvious (it requires some thought and trial), so I would suspect any non-reasoning model that solves it at pass@1 has seen this problem in training.
The rest is wrapping up the algebra, which is where the models seem to be struggling.
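For reference, here is the substitution written out: with x = a - 1, y = 1, z = b - 1 (all positive since a, b ≥ 2),

x*y + x*z + y*z + 1 = (a-1)*1 + (a-1)*(b-1) + 1*(b-1) + 1 = (a-1) + (a*b - a - b + 1) + (b-1) + 1 = a*b.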

Ralph Furman (May 01 2025 at 22:17):

Actually, testing it out, the performance is pretty similar. With the original statement all LLMs get the trick quickly, though only Qwen3 explicitly reasons its way there (see here).
Rephrasing it in terms of n not being prime (and removing putnam_1988_b1 from the name), they mostly all reason through it and get it right (Llama 4 fails).


Last updated: May 02 2025 at 03:31 UTC