Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Open open benchmarks
Lars Ericson (Jan 26 2025 at 12:36):
I just learned about the FrontierMath issue, where OpenAI funded and got early access to training data for what was supposed to be an open benchmark. Michael Harris discusses this here: https://siliconreckoner.substack.com/p/the-frontier-math-scandal
People in this forum might want to devise open math AI benchmarks that constructively avoid NDAs, non-competes, and the like. There's no money in it, but that would be the point.
Jason Rute (Jan 26 2025 at 13:31):
- Kevin Buzzard also has a similar post here https://xenaproject.wordpress.com/2025/01/20/think-of-a-number/ where he mentions the creation of his own benchmark.
- See the discussion on Zulip starting here about Kevin's post, including the discussion on whether having mathematicians make a benchmark for free is good or exploitive.
- As for Epoch AI's full explanation, see https://x.com/tamaybes/status/1882566849197703284 (if you don't have X, see https://threadreaderapp.com/thread/1882566849197703284.html).
- As for OpenAI's explanation, see https://x.com/__nmca__/status/1882563755806281986 (if not X, see https://threadreaderapp.com/thread/1882563755806281986.html).
- Also, note that the AIMO project funded by XTX has produced similar benchmarks paying mathematicians.
- XTX also has a 9 million dollar grant fund for benchmarks like this.
Jason Rute (Jan 26 2025 at 13:31):
Ok, now that I've got the background out of the way, let me give my thoughts. I think folks are being way too hard on Epoch AI and the FrontierMath team. Yes, they made a huge mistake. But until quite recently, benchmarks weren't that well scoped. They were public, meaning everyone could train on them if they wanted to (and LLM companies have employees who work to make sure the standard benchmarks are removed from the training data, but even then it isn't clear that the data isn't getting through). The MiniF2F benchmark in interactive theorem proving was made by OpenAI. Also, many common mathematics benchmarks by AI researchers are not the best benchmarks. Some have a lot of mistakes. Others have copyright concerns. Also, Google AI for example has two in-house hidden benchmarks that they don't show to anyone and use for evaluating math models. Payment is also a tricky matter. Some think it brings weird incentives. Some think it is exploitive not to pay for it. (I can't get anyone to translate MiniF2F to Coq, and the last discussion I had on the Coq Zulip was someone saying that a team should get paid to do it right and support it. They may be right, since the Lean 4 translation of MiniF2F is full of mistakes. But I also doubt anyone will pay for it, so MiniF2F will never be translated to Coq.) I think the XTX benchmarks are pretty good, even though they are paying for them, and I think a lot of open benchmarks will have more issues because they are not as well thought out and well designed. Nonetheless, more benchmarks are better and I support anyone trying to make them (open, closed, secret, etc.). (You probably don't want to ask me for too much advice, since I am also an AI researcher with conflicts of interest. :smile:)
Jason Rute (Jan 26 2025 at 13:31):
As for FrontierMath, I don't know if the team will ever fully recover from this. But I think people are assuming it is much worse than it is. We have loosely trusted OpenAI and DeepMind/Google for years now when they say they attempt not to train directly on other benchmarks like MATH, 2024 AIME, HiddenMath, and the 2024 IMO. Why, all of a sudden, do we not trust OpenAI to have avoided training on FrontierMath just because they had early access? (I personally think the results will stand more or less under scrutiny.) Moreover, Epoch AI realized their mistake and I think they are sincerely trying to make it right going forward. (For example, they still have a holdout set of problems OpenAI has never seen, and they have also committed to being open about funding going forward.)
Jason Rute (Jan 26 2025 at 13:31):
I'd love to hear what others are most offended by in the FrontierMath scandal. Is it the secrecy? That OpenAI has too much power? That they think OpenAI cheated and the o3 score on FrontierMath is inflated? Or was it also FrontierMath's misleading advertising of how difficult (or not) this benchmark was? (They made it sound like it was all research level or above, and I think Kevin rightly feels tricked into giving them a quote about the difficulty, when it turns out about a quarter of the problems are at IMO or undergraduate level.) Or was it that they didn't reveal their funding, and how the dataset would be used by OpenAI, to the mathematicians paid to make the problems?
Kevin Buzzard (Jan 26 2025 at 13:52):
My personal opinion: what sickens me is that I was tricked into saying "if AI can get anywhere on this dataset then it's game over for mathematicians" in Science, and then OpenAI announce they can get 25% on it, and now it turns out that the entire affair was just a publicity stunt.
Jason Rute (Jan 26 2025 at 13:58):
@Kevin Buzzard I assume they have never (publicly or privately) apologized to you for this?
Kevin Buzzard (Jan 26 2025 at 16:54):
Oh I'm not looking for an apology, I just think it's kind of funny. It is all part of a learning curve for me. What I have realised is that right now there is this insane war between big tech companies over their language models, and all these outbursts by people like Altman and Musk are just calculated to drive up share prices and continue to inflate the hype train, so in particular are not really based on facts (such as "language models can't actually do maths beyond undergraduate level with any reasonable degree of accuracy", something which they would rather keep quiet). Note also that the entire premise that "guessing a number" is the same as "doing maths" is completely fallacious. I'm just riled at the general area; the problem is that honest testing of current models is extremely difficult because we don't want "guess a number", we want "prove a theorem", and the problem with proving a theorem informally is that a human then has to read it carefully and root out the crap. I am also slightly terrified of the near future where crap "proofs" of important problems will start to flood journals and it will be the job of humans to refute those proofs. This is why, in my mind, the only sensible approach going forward is to wire up theorem provers to language models and only take notice of autonomously-generated proofs if they're formal proofs. My experience with FLT is that the literature is full of errors (most, but not all, minor) and I have no reason to believe that a model trained on the literature won't also be making errors. I am just a bit exasperated with what is actually happening and how far it is from what I want to be happening (which is that these systems should be using Lean or something like it). I lost faith in human mathematicians some time ago and I don't see any reason to believe that language models not backed up by formal proofs will be any better.
Joseph Myers (Jan 26 2025 at 18:22):
There are many reasons people might be predisposed to be suspicious of claims coming from AI companies.
What we're doing in building mathlib is some kind of hybrid of open source software development, mathematics research and mathematics teaching. This leads to community norms strongly in tension with those of the AI industry.
- All those fields are much more open and collaborative than the AI industry.
- There's also the more general open / reproducible science movement in recent years, seeking to ensure that code and data required to reproduce the results of a paper get released with that paper. But AI is if anything moving in the opposite direction - releasing papers that do not come with the code and data required to reproduce the results, or even sufficient details of how a model was trained to enable someone else to write their own code and do their own training to produce a similar model. (And OpenAI in particular is known for originally promising openness and then becoming less open over time - whether or not you agree with the safety concerns stated as a reason for this.)
- A further development in that direction is the habit of announcing AI results only by blog post, video, social media or similar with no proper paper about the work at all. (I think we're still awaiting an AlphaProof paper six months on from the blog post. And there are plenty of other cases where much less information was released and there was no suggestion that a proper paper would ever follow at all.) To be fair, there are various cases in mathematics of announced results where no paper or preprint has appeared, but in mathematics such cases are considered somehow suspect in a way that they don't seem to be for claimed achievements in AI.
- Indeed good practice in mathematics is that you should be wary of asserting something to be proved based purely on a preprint without caveats about "awaiting peer review" or similar. How something becomes socially accepted as a theorem is complex (and peer review hardly foolproof), but the standards are certainly higher than for AI claims.
- As well as being open and collaborative in building up mathematics and libraries of formal mathematics, it's normal for research grant funding to be public - both in the sense that you can look up the grants in public databases (e.g. Kevin's FLT grant) and in the sense that papers acknowledge the sources of research funding (not because of conflicts of interest, but because the funders want to be acknowledged).
- There is a long history of AI claims of results on benchmarks that seem overblown on further investigation (e.g. formal problem statements that are incorrect, simplified, not as described or where informal statements may have appeared in the training material).
- AI companies training large models are notorious for legally, ethically and socially questionable practices in collecting training material. This includes training on collections of material placed online without permission from copyright holders. It includes training on users' private data with at most an opt-out that doesn't help remove data from already-trained models. It includes ignoring robots.txt when crawling websites from thousands of IP addresses to try to make it hard for those sites to block the crawlers causing excess load (a problem currently being experienced by lots of sites). Whether or not any particular company actually engages in a particular one of those practices at present for training a particular model, there have been enough instances of them in the industry in the past few years that it's natural to suspect that if an AI company has access to some data, legally or illegally, it might be used in training.
Given all those ways in which AI industry norms differ from those of open and collaborative science, it's natural to favour benchmarking approaches such as AIMO - running models without internet access, on problems that are only used once and don't get given to the people building the AIs. Second best is that the problems are new and only used once (e.g. testing on IMO problems immediately after humans sit them), but the AI builders run their own AIs on the problems. And a third thing that would also be of interest (but not really a benchmark) is an AI solving (with formal proof) actual unsolved problems - even if there is a significant amount of human assistance along the way, and even if the details aren't disclosed of exactly what was done by humans and what was done by AI, any substantial AI involvement in coming up with a solution to a known unsolved mathematical problem would be an important development.
Joseph Myers (Jan 26 2025 at 18:31):
It's also doubtful whether the AI companies actually get much benefit from the secrecy (see the famous "We Have No Moat, And Neither Does OpenAI" memo). That applies even more to things only tangentially related to the AI work (e.g. modifications to Lean to make it easier for AIs to explore proof attempts by interacting with Lean) - open collaborative community norms would encourage doing as much as possible in public, in collaboration with others working in the field, for such things.
Last updated: May 02 2025 at 03:31 UTC