Zulip Chat Archive
Stream: IMO-grand-challenge
Topic: AIME 2025 and data contamination
Jason Rute (Feb 09 2025 at 16:16):
AIME 2025 just came out and folks are using it as an AI benchmark. (It is a good benchmark because it is hard and because it can be graded automatically.) But there is also an interesting X post by Dimitris Papail about possible data contamination in AIME 2025, even though it is brand new: similar problems have been on the internet for years. It brings up interesting questions about data contamination vs. generalization. Is it even possible to come up with "completely new" competition math problems (for AIME, IMO, Putnam, etc.), or is there truly "nothing new under the sun"? What does it mean for other AI benchmarks like AIMO, FrontierMath, or @Kevin Buzzard's benchmark?
Joseph Myers (Feb 09 2025 at 16:26):
Even if an AI had no original ideas at all and was incapable of constructing proofs, simply having an encyclopedic knowledge of the entire mathematics literature and being able to point out an old, obscure paper with a similar result could make it a useful research tool!
Joseph Myers (Feb 09 2025 at 16:28):
(Not to be confused, of course, with the kind of LLM that happily hallucinates a reference to a paper that doesn't exist with a result that doesn't exist in a journal that doesn't exist but that sounds like the sort of thing the hallucinated author might have written about.)
Joseph Myers (Feb 09 2025 at 16:34):
At least the IMO PSC tries to eliminate problems that are known (whether from the research literature or from previous competitions), or too close to something known, or trivialized by something known (where what counts as trivialized may depend on how hard the problem is in the first place without the known result). On the 2019 PSC we eliminated some problems for having appeared in journals in the 1930s and 1940s. And then the Jury may eliminate more problems as known.
Jason Rute (Feb 09 2025 at 16:35):
The post gives a method for finding similar problems on the internet. Of course, the method itself involves entering the problem into an AI model.
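(For illustration only, and not the method from the post: a minimal sketch of the general near-duplicate-search idea, using a locally run sentence-embedding model so the problems are never sent to an external service. The file names, model choice, and 0.8 threshold are my own assumptions.)

```python
# Sketch: flag new competition problems that look like near-duplicates of past
# problems, using a locally hosted embedding model (no external API calls).
from sentence_transformers import SentenceTransformer, util

# One problem statement per line; these file names are illustrative assumptions.
past_problems = open("past_problems.txt").read().splitlines()
new_problems = open("aime_2025.txt").read().splitlines()

model = SentenceTransformer("all-MiniLM-L6-v2")  # runs locally after a one-time download
past_emb = model.encode(past_problems, convert_to_tensor=True)
new_emb = model.encode(new_problems, convert_to_tensor=True)

scores = util.cos_sim(new_emb, past_emb)  # shape: (num_new, num_past)
for i, problem in enumerate(new_problems):
    best = scores[i].argmax().item()
    if scores[i][best] > 0.8:  # arbitrary similarity cutoff
        print(f"Possible near-duplicate:\n  new:  {problem}\n  past: {past_problems[best]}")
```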
Joseph Myers (Feb 09 2025 at 16:43):
Some level of similarity to known problems is inevitable - especially at the easier levels (such as AIME). And a lot of problem solving is spotting similarities to ideas seen before and figuring out how to put them together. (Even if AlphaProof, when solving IMO 2024 P2, couldn't have literally spotted the relevance of an idea from IMO 2005 P4 the way a human might have - or, if it did, had come up with that idea itself previously - since it didn't train on any data that included past IMO solutions.)
Joseph Myers (Feb 09 2025 at 16:44):
Entering problems you don't want to leak into an AI model is probably not a good idea, at least unless it's a self-hosted model and you have control over any queries it sends out.
Joseph Myers (Feb 09 2025 at 16:47):
IMO 1994 P2 was an "if and only if" problem, where one direction was submitted by Armenia, the other by Australia, and the PSC saw that the submitters had independently come up with the two directions of the same problem and combined them into a single problem.
Joseph Myers (Feb 09 2025 at 16:54):
Sometimes a problem gets through that turns out later to be known (e.g. IMO 2004 P3, IMO 2007 P6, IMO 2018 P3). If a human contestant already knows a solution, they can write it out and get a quick 7 for the problem. In practice, if the PSC and Jury failed to identify the problem as known, few if any contestants are likely to do so either.
Joseph Myers (Feb 09 2025 at 16:56):
So an AI solving a problem through having seen it or something very similar (that was public at the time of the competition) before isn't at any unfair advantage compared to a human contestant who might do so.
Joseph Myers (Feb 09 2025 at 17:02):
As for benchmarks, this probably illustrates that larger benchmarks are better than smaller ones: they help average out the variation in different abilities (including spotting similarities to known problems and applying them to the benchmark problems) that makes results on a small set of problems largely a matter of chance. (Indeed, the AIMO discussions on Kaggle suggest significant variation in results for the same 50 problems just from repeatedly running the same AI on them.) Larger benchmarks are of course a lot more work to write, and once the benchmark problems become public they may rapidly enter training data.
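(A rough back-of-the-envelope illustration of why 50 problems is noisy, assuming independently scored problems and a true pass rate of $p = 0.5$:)

$$ \operatorname{SE}(\hat p) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.25}{50}} \approx 0.07, $$

so two statistically identical systems can easily differ by ten or more percentage points on a 50-problem benchmark, before even accounting for run-to-run sampling randomness in the model itself.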
Joseph Myers (Feb 09 2025 at 17:05):
Or you could pick a set of the many national and regional mathematics competitions that take place each year (with problems that are intended to be new) and plan in advance to use their problems as a benchmark, rather than just using one competition or writing lots of new problems. But then it takes most of a year to accumulate results across those many competitions, and the results are confounded both by the rapid development of AI over the course of the year and by the widely varying difficulty levels of such competitions.
Andy Jiang (Feb 10 2025 at 06:54):
Dimitris's post is pretty funny actually -- it seems he randomly checked three of the questions and they all turned out to already exist online (or did he say one of them was a variation?). I guess it would be useful to know how many of the remaining problems are repeats haha
Alex Meiburg (Feb 25 2025 at 18:40):
As another example of a "repeat", IMO 1983 Problem 5 just asks "is r_3(10^5) at least 1983?", where the "r" function is the one in Roth's theorem / Szemerédi's theorem. A "solution" to this question (at least, enough to make the IMO problem trivial) was already in published papers by 1936, if my understanding of the literature is right...
Certainly this is a case where an AI with knowledge of all research math could solve it easily without any new ideas -- if we'd had LLMs + powerful search engines 40 years ago.
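(For readers who haven't seen the notation: a minimal sketch, assuming the standard definition of $r_3$ from the Roth/Szemerédi literature, together with the classical base-3 construction, which is presumably the sort of result the old papers Alex mentions contain.)

$$ r_3(N) = \max\bigl\{\,|A| : A \subseteq \{1,\dots,N\},\ A \text{ contains no 3-term arithmetic progression}\,\bigr\}. $$

Take $A = \{\,n \ge 1 : n < 3^{11} \text{ and every base-3 digit of } n \text{ is } 0 \text{ or } 1\,\}$. Then $\max A = (3^{11}-1)/2 = 88573 \le 10^5$ and $|A| = 2^{11}-1 = 2047$. If $a + c = 2b$ with $a, b, c \in A$, the addition $a + c$ has no carries in base 3 and every digit of $2b$ is $0$ or $2$, which forces each digit of $a$ to equal the corresponding digit of $c$, i.e. $a = c$. So $A$ has no 3-term arithmetic progression, and $r_3(10^5) \ge 2047 \ge 1983$.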
Last updated: May 02 2025 at 03:31 UTC