Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: FrontierMath, a benchmark for evaluating advanced mathema...


Ivan Eric (Nov 09 2024 at 20:15):

https://epochai.org/frontiermath

"We collaborated with 60+ leading mathematicians to create hundreds of original, exceptionally challenging math problems, of which current AI systems solve less than 2%."

Ivan Eric (Nov 09 2024 at 20:16):

Looking at the example problems, I can't do any of them. Even 2% is very impressive here, if those problems are representative. It seems less a benchmark and more a challenge: an AI that can get a very high score on this is like an AI that can beat top humans at Go.

Jason Rute (Nov 09 2024 at 20:28):

Awesome! Just to clarify, these are natural language problems right (as opposed to say formal Lean problems)? Also, how confident are you that there are no mistakes in the answers? (I still need to read the paper of course.)

Oisin McGuinness (Nov 09 2024 at 20:37):

Direct link to ArXiv: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
The authors have had some interesting reviewers/contributors: "We thank Terence Tao, Timothy Gowers, and Richard Borcherds for their insightful interviews and thank Terence Tao for contributing several problems to this benchmark." (Does @Terence Tao ever sleep? :-) )

The sample problems in the paper (appendix, 5 given) are quite challenging! Formalizing the problems and solutions might be an interesting challenge in itself!

Jason Rute (Nov 09 2024 at 20:59):

@Ivan Eric Silly questions that I can't seem to figure out from a quick reading of the report:

  • How many problems are there?
  • Are they public or private? If public, what do you plan to do to avoid them becoming part of LLM training data? If private, how can others evaluate on this benchmark?

Jason Rute (Nov 09 2024 at 21:00):

Also, I'd love for you to do an evaluation of the top AIMO Progress Prize 1 winning solutions (and the solutions for Progress Prize 2, unless the problems are publicly released and there is a serious worry that the Kaggle contestants trained on these problems). I think the problem format should be roughly similar enough to directly compare the models.

Jason Rute (Nov 09 2024 at 21:00):

Finally, what is EpochAI, and what is their motivation for creating this benchmark? Also, what steps have they put in place to maintain this benchmark in the future? (I still think about MiniF2F and how the creators left OpenAI and the benchmark has become splintered and poorly maintained since then.)

Ivan Eric (Nov 09 2024 at 21:04):

Jason Rute

Responses from some of the authors to these questions:
"No, we have not formalized the problems. We have thought about it. Some problems are difficult to formalize because e.g. Lean does not support the relevant math."
https://x.com/tamaybes/status/1855102789858517163

"Note that the dataset is not publicly shared except for some representative examples! We take data contamination very seriously."

https://x.com/Jsevillamol/status/1855010482882482655

Oisin McGuinness (Nov 09 2024 at 21:05):

Jason Rute said:

Also, how confident are you that there are no mistakes in the answers?

Section 2.3 (page 5-6) of the paper discusses their verification and review process for the problems.

Jason Rute (Nov 09 2024 at 21:06):

Oh, sorry @Ivan Eric, I thought you were one of the authors. I guess I'll have to reach out on X. :scared: Thanks for the pointers @Ivan Eric and @Oisin McGuinness!

Heather Macbeth (Nov 09 2024 at 21:16):

I skimmed the article earlier today. From our community's point of view, the key quote about format is probably:

we often structured problems to have integer solutions .... We also included solutions that could be represented as arbitrary SymPy objects — including symbolic expressions, matrices, and more.

So (as Jason already noted) it's a similar format to the current AIMO challenges on Kaggle: cleverly-designed problems which are hard but still admit integer answers (or answers from some other countable type).

The benchmark's authors correctly point out in their article that this format restriction prevents them from covering certain areas of mathematics (analysis, geometry, etc):

Number theory and combinatorics are most prominently represented, collectively accounting for approximately 34% of all MSC2020 tags. This prominence reflects ... these domains’ natural amenability to problems with numerical solutions ....

Joseph Myers (Nov 09 2024 at 22:23):

Ivan Eric said:

Jason Rute

Responses from some of the authors to these questions:
"No, we have not formalized the problems. We have thought about it. Some problems are difficult to formalize because e.g. Lean does not support the relevant math."
https://x.com/tamaybes/status/1855102789858517163

I've previously argued that the idea of the IMO Grand Challenge should be considered to include a mathlib side of things (teach mathlib the entire (unwritten) IMO syllabus so that all IMO problems can be stated and proved in a reasonably idiomatic way in Lean) as well as an AI side of things (develop an AI that can generate those solutions). Empirically, however, the work on those two things has been largely disjoint (and the AI side has developed faster than mathlib coverage; more people are interested in building AIs that work on such problems than in increasing mathlib coverage of relevant mathematics).

You could consider the same for such a benchmark as this: a challenge to expand mathlib to cover the mathematics needed to (a) state and (b) solve the problems. But there are also obvious major difficulties that don't apply so much to expanding coverage for IMO problems: to add the mathematics for a problem in this benchmark you'd need someone who has access to the problem (in confidence), and has (or can gain) specialized expertise in the relevant mathematics, and has Lean expertise, so they can work out the best way to formalize the relevant definitions and theory (and this is all assuming that the choice of what they contribute to mathlib doesn't leak too much about the contents of the problems).

Jason Rute (Nov 10 2024 at 05:01):

If this is what I think it is, the definitions should be relatively common, or they are given in the problem. If so, then mathlib’s formalization could still proceed as normal, formalizing standard math concepts.

Jason Rute (Nov 10 2024 at 05:04):

Also, there is an easy solution to the problem of AI for Lean versus building Mathlib: AI for building Mathlib, i.e. auto-formalization at scale! :joking: (I jest a bit, but I’m also somewhat serious. I don’t exactly know what is holding us back from much stronger auto-formalization on a large and systematic scale.)

Auguste Poiroux (Nov 10 2024 at 16:22):

Very interesting work! In my opinion the reported results represent a relatively "cheap" inference setup: 0-shot, with only 1 attempt per problem. I think this might not accurately represent what is achievable with current SOTA methods. They even state in the paper that "When re-evaluating problems that were solved at least once by any model, o1-preview demonstrated the strongest performance across repeated trials (see Section B.2).", so self-consistency would likely improve the results.

Auguste Poiroux (Nov 10 2024 at 16:22):

However, my concern is more about the token limit. The token limit is set to 10'000 tokens. But o1 models are known to generate CoT longer than that, especially on hard problems. So I guess part of the bad performance of o1 models is caused by this limit. Official OpenAI recommendation: "OpenAI recommends reserving at least 25,000 tokens for reasoning and outputs when you start experimenting with these models."
Additionally, on page 9 of the FrontierMath paper, it is stated that Claude, Grok, and GPT-4o reach the 10'000-token limit in more than 45% of their answers. Interestingly, they don't report this statistic for o1 models. Since even "non-thinking" models already use more than 10'000 tokens, increasing this token limit would likely improve performance for all models, and I expect even more so for o1 models.

Jason Rute (Nov 10 2024 at 16:49):

Good catch. (Note I don’t think the authors are aware of this discussion, so it might be interesting to point this out on Twitter.)

Jason Rute (Nov 10 2024 at 16:49):

I think if they are expecting LLMs with strict token limits to solve research-level math problems, they are a bit crazy. Yes, this field will improve, but token limits are an arbitrary and harmful restriction. And if they are setting token limits, are they also ignoring tool use (Python, Lean, etc.)? That is what helped in the AIMO competition.

Jason Rute (Nov 10 2024 at 16:49):

Overall, I think evaluation on stuff like this is really hard. The benchmark authors don’t have the money to test SoTA models, but they also don’t want to release the problems. The same goes for AIMO problems. And with AlphaProof, DeepMind used a lot of compute to solve the IMO problems. Can that be improved over time? Sure, but it might be that test time compute is especially important for solving problems like this and it is harmful to set such tight bounds.

Auguste Poiroux (Nov 10 2024 at 17:35):

I agree this is arbitrary and harmful, although I understand it might be expensive to let models generate more tokens. In FrontierMath, they let models interact with a Python interpreter for their experiments :+1:

Kim Morrison (Nov 10 2024 at 23:37):

I think it would be nice to write Lean formalisations of the FrontierMath sample problems (just the statements for now).

Kim Morrison (Nov 10 2024 at 23:39):

Here's an easy one:

import Mathlib.FieldTheory.Finite.GaloisField
import Mathlib.LinearAlgebra.Projectivization.Basic
import Mathlib.Tactic.NormNum.Prime

open FiniteField Polynomial LinearAlgebra.Projectivization

def fact_five_prime : Fact (Nat.Prime 5) := by constructor; norm_num

attribute [local instance] fact_five_prime

abbrev F5_18 : Type := GaloisField 5 18

def curve_equation (x y z : F5_18) : Prop :=
  x ^ 3 * y + y ^ 3 * z + z ^ 3 * x = 0

def projective_points_on_curve : Set (ℙ F5_18 (Fin 3 → F5_18)) :=
  { P | ∃ (x y z : F5_18) (h : ![x, y, z] ≠ 0), P = .mk _ ![x, y, z] h ∧ curve_equation x y z }

noncomputable def count_nonzero_points_on_curve : ℕ :=
  Nat.card projective_points_on_curve

Although this makes me sad in several ways:

  • attribute [local instance] fact_five_prime is clunky! (a lighter spelling is sketched after this list)
  • where is my term elaborator MVPolynomial.of% fun x y z => x ^ 3 * y + y ^ 3 * z + z ^ 3 * x?
  • How do I take the zeros of a homogeneous MVPolynomial as subset of projective space?
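
On the first point, one marginally lighter spelling is sketched below (just a sketch: Fact is a one-field structure, so the anonymous constructor works, and declaring it as a bare instance trades the attribute [local instance] step for a global instance, which may or may not be what you want here):

import Mathlib

-- `Fact (Nat.Prime 5)` via the anonymous constructor; declaring it as an
-- `instance` rather than a `def` plus `attribute [local instance]` makes it
-- available without the extra attribute line (but globally, not locally).
instance : Fact (Nat.Prime 5) := ⟨by norm_num⟩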

Kim Morrison (Nov 10 2024 at 23:41):

We could

  • collect these somewhere
  • if someone wants to try asking for access to write Lean formalisations of the whole set, having the sample problems done is a good start
  • formalising the sample problems will already reveal holes in Mathlib

Heather Macbeth (Nov 10 2024 at 23:57):

Kim Morrison said:

I think it would be nice to write Lean formalisations of the FrontierMath sample problems (just the statements for now).

I think this is a worthwhile task, but another place to direct the formalisation community's energy is in collecting challenge problems of a similar level which don't admit integer/polynomial/integer-matrix/... answers. That is, problems where "the proof is the point", and there is no "shortcut" (like an otherwise-unguessable integer) for quickly convincing an evaluator that the problem has been solved.
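
To make the contrast concrete, here is a toy sketch of my own (not an example from the paper; submittedAnswer and referenceAnswer are hypothetical names):

import Mathlib

-- Answer-keyed format: the evaluator only compares a submitted value against a
-- hidden reference key, so nothing about the reasoning is ever inspected.
def submittedAnswer : ℕ := 42
def referenceAnswer : ℕ := 42   -- stand-in for the benchmark's hidden key
#eval submittedAnswer == referenceAnswer  -- true

-- "Proof is the point" format: the statement is cheap to write down, but there
-- is no numeric shortcut; the only convincing artifact is the proof itself.
example : ∀ n : ℕ, ∃ p, n ≤ p ∧ p.Prime := Nat.exists_infinite_primes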

Quinn (Nov 12 2024 at 05:03):

Epoch has shipped some great helper tools for AI forecasting, e.g. https://epochai.org/data/machine-learning-hardware -- I don't actually know why this benchmark is within their mandate, haha. I run into the first author here in Berkeley from time to time, so I'll ask him.

Siddhartha Gadgil (Nov 13 2024 at 04:03):

Jason Rute said:

Also, there is an easy solution to the problem of AI for Lean versus building Mathlib: AI for building Mathlib, i.e. auto-formalization at scale! :joking: (I jest a bit, but I’m also somewhat serious. I don’t exactly know what is holding us back from much stronger auto-formalization on a large and systematic scale.)

As someone working on this, I feel there are (at least) three auxiliary tools we need:

  1. Proof Automation: to complete the "obvious" steps in a proof, which will be omitted in a human proof.
  2. Search: when an informal proof invokes a theorem, the corresponding theorem in Mathlib should be found and passed on to the proof automation.
  3. Disproving: Some errors are inevitable in translation. So we need to filter, feedback and try again.

We do have aesop, duper etc for proof automation and both leansearch and moogle (with APIs :smile: ) for search. Hopefully such tools will get better over time.

For disproving we have plausible (formerly slim_check) but I feel this is the weakest link at present.

There is another potential weak link: more advanced mathematics skips more details. But in my experience LLMs can fill in details pretty well.
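
As a toy illustration of the three items (my own sketch, nothing specific to FrontierMath; the example goals are placeholders):

import Mathlib

-- Item 1 (proof automation): an "obvious" step an informal proof would omit,
-- closed here by the linear-arithmetic decision procedure `omega`.
example (n : ℕ) : n < n + 1 := by omega

-- Item 2 (search): `exact?` searches Mathlib for a lemma matching the goal and
-- suggests something like `Nat.lt_succ_self n`.
example (n : ℕ) : n < n + 1 := by exact?

-- Item 3 (disproving): on a mistranslated statement such as
--   example (a b : ℕ) : a * b ≤ a + b := by plausible
-- `plausible` searches for small counterexamples (here e.g. a = 2, b = 3) and
-- reports them, so bad translations can be filtered out before proof search.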

Jason Rute (Nov 13 2024 at 04:06):

I think both DeepSeek Prover and AlphaProof had disproving capabilities they used for RL. It is a pity this sort of thing isn’t more upstreamed.

Jason Rute (Nov 13 2024 at 04:08):

Actually, all those capabilities you mentioned are much stronger and more fleshed out in papers. It’s now time to build them in practice.

Matthew Ballard (Dec 20 2024 at 20:09):

It looks like o3 moved the ball forward here. https://x.com/OpenAI/status/1870186518230511844

Matthew Ballard (Dec 20 2024 at 20:57):

Graphic on the results

Matthew Ballard (Dec 20 2024 at 21:46):

https://openai.com/12-days/ up now

Jason Rute (Dec 20 2024 at 22:05):

Yeah, I'm impressed! Scaling laws are real. If you are willing to spend o3/alphaproof level of inference compute, you can do impressive things (with good algorithms and sufficient training of course).

Jason Rute (Dec 20 2024 at 22:09):

Just to be clear, by "impressive things" I don't necessarily mean solving the RH. But solving easier unsolved math problems? The kind a math graduate student solves for their first paper? Maybe soon? Any thoughts from those more familiar with FrontierMath?

Jason Rute (Dec 20 2024 at 22:15):

And also to be clear, I think o3 (run in this configuration) is currently costing thousands of dollars per benchmark problem. (Or at least that was the case for the best o3 scores on ARC-AGI. Does anyone have a reference either way?)

Mario Carneiro (Dec 21 2024 at 12:25):

Jason Rute said:

And also to be clear, I think o3 (run in this configuration) is currently costing thousands of dollars per benchmark problem. (Or at least that was the case for the best o3 scores on ARC-AGI. Does anyone have a reference either way?)

Scaling laws are real :face_with_diagonal_mouth:

Jason Rute (Dec 21 2024 at 12:28):

It’s just become clearer that the FrontierMath problems are not quite as difficult as the quotes from Tao and Gowers made them seem. There are three levels of difficulty, and the lowest level is about IMO level (but still not requiring proofs like the IMO, just numerical solutions); this tier makes up about 25% of the problems in FrontierMath. The Tao/Gowers quotes are about the hardest tier of problems in FrontierMath. https://x.com/elliotglazer/status/1870235655714025817?s=12

Albert Jiang (Dec 21 2024 at 12:52):

It'll get cheaper. I'm very hopeful an AIME or IMO question can be solved for less than $1 in one year.

Chris Bailey (Dec 21 2024 at 13:04):

Somewhat tangential but not wholly irrelevant: is there an industry-standard procedure for controlling who gets to see what when running a nonpublic model on a nonpublic benchmark like FrontierMath, or is it negotiated on a case-by-case basis?

Auguste Poiroux (Dec 21 2024 at 13:25):

Going from 2% to 25% on FrontierMath sounds impressive. However, if you look at how that 2% was obtained (a 10,000-token limit per problem, 0 examples in the prompt, 1 attempt per problem if I remember correctly), it is not fair to compare o3's 25% with it. These are not the same compute budget class. To me, the 2% result was never representative of SOTA performance in LLM reasoning, given this cheap inference setting. My controversial opinion is that this sounds like a marketing setup by OpenAI to create a "wow" effect for their o3 model. Let's not forget the amount of money that is at stake here :wink:
However, don't get me wrong: I think o3 is really impressive and that this is real progress. I'm just a bit critical about the marketing around it, which makes it look bigger than it really is.

Jason Rute (Dec 21 2024 at 13:27):

Albert Jiang said:

It'll get cheaper. I'm very hopeful an AIME or IMO question can be solved for less than $1 in one year.

Especially now that I realize they are using majority voting with a lot of samples (one of the most inefficient search algorithms), I have no doubt the price will go way down. The better these models get, the less they need to search (either in their “thinking time” or in majority voting), and that is not even factoring in the ability to distill to smaller models and efficiencies in hardware.
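
For concreteness, "majority voting" here just means something like the sketch below (a minimal Lean rendering of the idea, not OpenAI's actual pipeline; majorityVote is my own name):

import Mathlib

/-- Majority voting over sampled final answers for one problem:
return the answer that occurs most often, if any. -/
def majorityVote (samples : List ℤ) : Option ℤ :=
  let counted : List (ℤ × ℕ) := samples.map fun a => (a, samples.count a)
  let best : Option (ℤ × ℕ) := counted.foldl
    (fun acc cur =>
      match acc with
      | none => some cur
      | some b => if cur.2 > b.2 then some cur else some b)
    none
  best.map Prod.fst

#eval majorityVote [17, 42, 17, 17, 3]  -- some 17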

Jason Rute (Dec 21 2024 at 13:30):

Auguste Poiroux said:

Going from 2% to 25% on FrontierMath sounds impressive. However, if you look at how that 2% was obtained (a 10,000-token limit per problem, 0 examples in the prompt, 1 attempt per problem if I remember correctly), it is not fair to compare o3's 25% with it. These are not the same compute budget class. To me, the 2% result was never representative of SOTA performance in LLM reasoning, given this cheap inference setting. My controversial opinion is that this sounds like a marketing setup by OpenAI to create a "wow" effect for their o3 model. Let's not forget the amount of money that is at stake here :wink:
However, don't get me wrong: I think o3 is really impressive and that this is real progress. I'm just a bit critical about the marketing around it, which makes it look bigger than it really is.

We really need to stop talking about benchmark problems being solved or not solved. It really depends on the amount of compute going in. We see that in formal math too, with MiniF2F and the IMO. Log-linear scaling is interesting and cool, but we should look at the plot, not a single number.
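
For readers who haven't seen the plots: "log-linear scaling" here is the observation that the solve rate grows roughly linearly in the logarithm of the test-time compute spent, i.e. something like (my paraphrase, not a formula taken from any of the papers discussed)

$$\text{solve rate}(C) \approx \alpha + \beta \log C,$$

so a single headline number is hard to interpret without the compute budget C it was measured at.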

Jason Rute (Dec 21 2024 at 13:38):

It will be interesting when we have an algorithm which we believe scales up to solving RH. How much would we pay? $1 million, $1 billion, $1 trillion? (I don’t think AlphaProof or o3 would scale to RH. I think for problems of that type the algorithms would plateau. But this is the difficulty with scaling. It’s hard to measure the upper bound.)

Mario Carneiro (Dec 21 2024 at 13:39):

heh, 1 million is apropos for that problem

Mario Carneiro (Dec 21 2024 at 13:40):

it is somehow ironic that one could end up spending the entire prize money on the compute budget

Mario Carneiro (Dec 21 2024 at 13:42):

I was already thinking the budget was a bit iffy at 1 cent per problem, back when this was something that was supposed to be competing with sledgehammer

Mario Carneiro (Dec 21 2024 at 13:42):

this is now way outside anything I consider a reasonable budget for a formalization assistant

Xiyu Zhai (Dec 21 2024 at 23:56):

Auguste Poiroux said:

Going from 2% to 25% on FrontierMath sounds impressive. However, if you look at how that 2% was obtained (a 10,000-token limit per problem, 0 examples in the prompt, 1 attempt per problem if I remember correctly), it is not fair to compare o3's 25% with it. These are not the same compute budget class. To me, the 2% result was never representative of SOTA performance in LLM reasoning, given this cheap inference setting. My controversial opinion is that this sounds like a marketing setup by OpenAI to create a "wow" effect for their o3 model. Let's not forget the amount of money that is at stake here :wink:
However, don't get me wrong: I think o3 is really impressive and that this is real progress. I'm just a bit critical about the marketing around it, which makes it look bigger than it really is.

I agree with you. Sometimes I think OpenAI is a necessary evil to get more money into AI research.

Alex Meiburg (Dec 23 2024 at 19:11):

I think $1 million for the RH would be better than just a reasonable amount to pay for a computer-generated proof. In terms of the mathematical progress that would surely be spurred by studying the proof - and the ideas and fields it uncovers - I think it would be totally on par with what, say, $1 million buys the NSF in terms of mathematical output.

That is to say: governments (and to a degree, private institutions) already pay in some units of ($ / mathematical ingenuity), and I think $1 million for a proof of the RH beats market rate. :)

Alex Meiburg (Dec 23 2024 at 19:12):

Of course we can't know exactly how much math will follow, and that is partially because we don't know what the proof will look like. But all grant money / exploratory research comes with that caveat, that you don't know exactly how hard it will be or how big the benefits will be in the end.

Alex Meiburg (Dec 23 2024 at 19:18):

On the other hand ... $1 trillion seems too high. To put it in perspective, that's (very roughly) equal to the combined salaries of all professors and graduate students in the US, across all disciplines, for four years. So spending $1 trillion could be financially equivalent to halting all research at US universities for four years.

(Of course this is not quite true, because you can't just spend 2x the salaries to summon 2x the talent; and there's cost of support staff, and equipment, etc... But, on the margin, and to within an order of magnitude, something like this is true.)

Jason Rute (Dec 23 2024 at 19:20):

And I think we really need to consider AI models that don't only solve problems but also make partial progress. If we could have AI which generates solutions and partial progress on a range of problems in the same area, then we could get a much better sense of the value of the output per dollar spent on a model. Then we could make reasonable decisions about whether it is worth continuing to spend money on this. (Also, see my tweet here: https://x.com/JasonRute/status/1870831707844124766.)

Jason Rute (Dec 23 2024 at 19:22):

And I hope none of us are serious about rushing to RH without first trying it on much simpler unsolved problems, like those left open in recent papers or MathOverflow posts.

Alex Meiburg (Dec 23 2024 at 19:25):

For sure. On both points. If I spent $1M on an AI trying to prove RH, and I didn't get a proof out but it made me a cool new theory relating arithmetic functions and their essential singularities over C and their homology classes, I would be pretty satisfied with my purchase :P


Last updated: May 02 2025 at 03:31 UTC