Zulip Chat Archive

Stream: IMO-grand-challenge

Topic: IMO 2025 problem statements


Joseph Myers (Jul 16 2025 at 03:40):

I've added my unofficial Lean versions of the IMO 2025 problem statements to IMOLean: https://github.com/jsm28/IMOLean (note that all five non-geometry problems are "determine" problems).

If more than one AI reports formal-to-formal results in Lean for IMO 2025, and they fail to use a common version shared by all such AIs - either this version or some other version coordinated between all such AIs and following conventions agreed in public in the Lean community in advance to be appropriate - then they've earned criticism for evading comparability of their results.

(That doesn't apply to any AIs that only use Lean internally as part of an informal-to-informal solver, though I'd say there that doing good science means also evaluating how the Lean component of such a solver does on an independent, neutral Lean statement of the problems. So if a paper is submitted without such an evaluation, the referees ought to request that one be added.)

My formal statements work both with the Lean and mathlib versions in IMOLean and the slightly older v4.21.0 tagged version. Any AI requiring something older than that is failing at practical usability, where staying up to date with mathlib is clearly going to be important.

Ralph Furman (Jul 16 2025 at 03:49):

That's a great resource to have. I'd be curious to see how much the formalization matters for Problem 1 in particular, and whether an inherently finite or natural-number-based formulation is easier to solve than one in R^2.

Jakob von Raumer (Jul 17 2025 at 07:40):

Is there any news from any of the contesting AIs?

Jason Rute (Jul 17 2025 at 11:09):

@Jakob von Raumer only Numina (@Jia Li) has publicly indicated that they will “compete” in any sense of the word compete. So I look forward to what they have to announce. I doubt they will use @Joseph Myers's statements, for many obvious reasons. There are other good formal models out there like Goedel-Prover v2, but at least @Kaiyu Yang told me that he isn’t aware of any plans for them to “compete”.

Jason Rute (Jul 17 2025 at 11:10):

Nonetheless, one could run these open source models themselves and report the results. It can get expensive and complicated so I don’t think anyone will for the formal math models. (And if they do, I hope they run the models correctly with the right system prompt, the right looping strategy, the right Lean version, proper testing, etc.)

Jason Rute (Jul 17 2025 at 11:10):

As for the informal models, there is some commentary and some experiments with off-the-shelf LLMs on @Ralph Furman’s https://sugaku.net/content/imo-2025-problems/ . I find the commentary really interesting even if not very systematic. For example, I don’t see the better AI models represented (the ones which did better on USAMO), like the newest Gemini model, Grok 4, or o4 (in their heavy compute modes). Also, I am suspicious of prompts which don’t mention that these are IMO problems (or otherwise indicate that they are hard). (Some IMO problems look easy, but because they are on the IMO you know they are very challenging. For example, I knew I had made a mistake on P6 when I tried it, since my solution was too easy for an IMO problem.)

Jason Rute (Jul 17 2025 at 11:10):

I like the analysis on @Mislav Balunović's MathArena.ai. It seems very systematic and well thought through. I hope they keep up the tradition by benchmarking the IMO.

Jakob von Raumer (Jul 17 2025 at 11:18):

Thanks for the summary!

Jason Rute (Jul 17 2025 at 12:12):

Correction to something I said above. I was mistaken: Sugaku’s analysis does include o4-mini, Gemini 2.5 Pro, and Grok 4 (not sure what settings or how they were run). Sometimes the answers are blank. Is that because they returned no response?

Andy Jiang (Jul 17 2025 at 15:14):

Jason Rute said:

Nonetheless, one could run these open source models themselves and report the results. It can get expensive and complicated so I don’t think anyone will for the formal math models. (And if they do, I hope they run the models correctly with the right system prompt, the right looping strategy, the right Lean version, proper testing, etc.)

How expensive are we talking about? Goedel-Prover v2 seems like a pretty small model (nowadays), so I guess the expense comes from running it N times for some large N? Do you think it costs roughly N$ to do pass@N? More? Just asking because my intuition was that it shouldn't be super expensive (like <<N$ for pass@N), especially if you have access to some GPUs (32B seems like a really friendly size actually).

Although the first thought that came to me was that it's unclear you can get Goedel-Prover to "guess" the answer, and having it prove the statement given the "guess" makes the problem easier -- but it would still be interesting to see whether it can do it!

Andy Jiang (Jul 17 2025 at 15:15):

Like a funny thing someone could try would be to get the top frontier models to crowdsource guesses to feed into Goedel-Prover and see what happens? I don't think it would break the bank to try this, even for an individual, tbh.

Andy Jiang (Jul 17 2025 at 15:56):

OK, I just saw that MathArena evaluated the top informal models -- https://x.com/j_dekoninck/status/1945848711466160349 [in line with USAMO evaluations (and lower than I expected)]

Andy Jiang (Jul 17 2025 at 15:59):

Gemini's P3 performance seems quite good -- maybe it suggests/hints there was indeed some synthetic data coming from old AlphaProof into Gemini 2.5 Pro. Funny, because I also thought that would be kind of a nice test of Goedel-Prover, since the formalization seems quite short -- caveat: I haven't looked at any of the problems in detail myself.

Andy Jiang (Jul 17 2025 at 16:18):

On the flip side -- if one were to be overly generous -- it does seem that on at least 5 of the 6 problems (all except P6) the informal LLMs are getting the answer correct with some probability, which means that if the formal provers are strong enough (a big if, of course) a combined effort could have solved 5/6. So I think it's really important we also test how strong the formal prover is (say, given the correct guess).

Ralph Furman (Jul 17 2025 at 16:21):

The MathArena analysis looks great!

One reason I rushed to get the models tested the moment the problems leaked is that LLMs are really good right now at web search and I don't know a good way to make sure they are solving the problem without cheating.

The MathArena article says the following about Grok-4:

Many of its initial responses were extremely short, often consisting only of a final answer without explanation

If you ask it on their app and look at its reasoning traces, you can see it explicitly searching for the problem statement and solution online:
[attached screenshot: Gv7tS6bWcAAHfmN.jpeg]

Oktai Tatanov (Jul 18 2025 at 10:14):

Hello!

I’m new to Lean, but I’m very interested in the IMO Grand Challenge and everything surrounding it. I have a question about the exact format in which you expect solutions for the competition—and what kind of solution would count as passing the challenge.

Below I spell out where my uncertainty lies, framed as a few specific questions.

What I understand about Lean
In Lean code a sorry marks a spot that a prover is expected to replace with a correct term, while leaving the rest of the code untouched. The Lean compiler must then confirm that everything type-checks; if it does, the prover has solved the problem.

In the formal‑to‑formal IMO problems published in the IMOLean repository, each "determine" file contains two sorrys: one for the answer term and one for the proof of the theorem. Thus a “canonical” Lean solution looks like this:

  1. Specify the answer (for problems that require an explicit answer);
  2. Prove—or perhaps better, verify—within Lean that this answer satisfies the problem’s conditions.

If the problem is of the form “prove that …,” step 1 is absent.
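For concreteness, here is a minimal toy sketch of that two-sorry layout (something I made up to check my understanding; the real IMOLean files are longer and may use different names and conventions):

```lean
import Mathlib

/-- The value to be determined; the first `sorry` a solver must replace
(toy example, not an actual IMOLean statement). -/
def answer : ℕ := sorry

/-- The statement characterising the answer; the second `sorry` is the proof.
A complete solution replaces both `sorry`s and must still compile. -/
theorem answer_spec : answer ^ 2 = 49 ∧ 0 < answer := sorry
```

A solver would then, say, set `answer := 7` and replace the second `sorry` with an actual proof (here something like `by decide` would do).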

Question 1:
Is it true that a winning AI system must discover both terms—the answer and the proof—on its own?

From the structure of the IMOLean problems and what is stated in the README, I assume the answer is yes.

Current state of AI provers
Recent AI provers (DeepSeek‑Prover‑V2, Kimina‑Prover‑72B, Gödel‑Prover‑V2, etc.) are usually benchmarked on miniF2F, PutnamBench, and the new MathOlympiadBench (largely past IMO and IMO‑shortlist problems). In these benchmarks the theorem already includes the answer term, so the prover has to replace only one sorry. In effect, it performs only step 2, with the answer supplied as part of the input—crucial for problems that are not “prove that …” statements.
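To make the contrast concrete, a "with answer" benchmark version of the same toy statement from above would already contain the value, leaving only the proof sorry (again a toy I made up, not an actual benchmark entry):

```lean
import Mathlib

-- "With answer" style: the answer (here 7) is baked into the statement,
-- so the prover only has to supply the proof term.
theorem answer_is_seven : (7 : ℕ) ^ 2 = 49 ∧ 0 < (7 : ℕ) := sorry
```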

PutnamBench even separates results into two tabs: “with answer” and “no answer.” The first tab lists many models; the second is empty: https://trishullab.github.io/PutnamBench/leaderboard.html

Question 2:
If the answer to Question 1 is indeed “yes,” does that mean there are no known open‑source AI systems currently capable of even attempting the IMO Grand Challenge, because all public benchmarks assume the answer is pre‑substituted?

Question 3:
Am I correct that at least the following provers—DeepSeek‑Prover‑V2, Kimina‑Prover‑72B, and Gödel‑Prover‑V2—cannot (or have not been trained to) generate the answer term and instead assume the theorem already contains it?

I am aware of the closed 2024 system AlphaProof, whose authors state that their agent generates both the answer and the proof entirely on its own (i.e. without any human substitution):

“while the problem statements were formalized into Lean by hand, the answers within the problem statements were generated and formalized by the agent.” (https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/index.html)

To my knowledge, no other system can currently meet the IMO Grand Challenge requirements.

Thank you for creating this challenge. I look forward to any feedback that will help me better understand the state of the art.

Jason Rute (Jul 18 2025 at 11:29):

@Oktai Tatanov That is a good question! First, there are actually benchmarks like this with two sorries, one for the solution and one for the proof. (Sometimes we call this "hard mode", although others like to point out it is the "standard mode" for a competition problem.) CombiBench comes to mind, and unlike PutnamBench, there is a CombiBench leaderboard with attempts at this, including off-the-shelf LLMs like Gemini, but also Lean-specific models like Kimina-Prover-Preview.

Jason Rute (Jul 18 2025 at 11:29):

I have to remind myself how Kimina-Prover Preview filled in the solutions, but for AlphaProof, I think they said they asked Gemini for candidate solutions. Then they autoformalized those candidates into Lean. Then they had the model try to prove or disprove each candidate. The disproofs came in pretty fast, so it was easy to reject bad candidates and focus only on the good ones. This approach is pretty flexible. Any of the standard Lean AIs could follow this method using an external reasoning model (they are good at both proving and disproving, since much of the automatically generated data they see during RL can be false). (Again, I need to check whether Kimina-Prover-Preview and Kimina-Prover use an external model for this or their own model.)
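To illustrate the prove-or-disprove screening on a made-up toy statement (just the shape of the idea, not AlphaProof's actual setup):

```lean
import Mathlib

-- Toy candidate check: for each candidate answer, attempt both a proof and a
-- disproof; wrong candidates are usually refuted quickly.
abbrev P (n : ℕ) : Prop := n ^ 2 = 49 ∧ 0 < n

-- A wrong candidate is rejected by proving its negation ...
example : ¬ P 6 := by decide

-- ... while the surviving candidate admits a proof.
example : P 7 := by decide
```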

Jason Rute (Jul 18 2025 at 11:33):

Also, "winning AI system" is a strange phrase. There aren't exactly rules here (although there is the AIMO prize). But AlphaProof did set a standard of the AI coming up with the answer, and I think people would be quick to criticize any system that didn't do this automatically.

Jason Rute (Jul 18 2025 at 11:50):

I think for Kimina-Prover-Preview it seems they did use the model directly (it is an LLM, so it is capable of fill-in-the-blank stuff like this), but they also point out in the CombiBench paper that:

It is important to note that Kimina-Prover Preview has not been specifically trained on datasets pertaining to combinatorics or fill-in-the-blank question types.

Jason Rute (Jul 18 2025 at 11:50):

For Kimina-Prover, they specifically train the model to do this. From their report:

Non-proof Problem Solving. Many problems are presented in the form of requiring a final answer, which cannot be naturally framed as a traditional proof task. Inspired by the CombiBench[10] evaluation methodology, we introduced a capability for non-proof problem solving to unify answer generation with proof construction. In this setting, the model is given a problem with the final answer field left blank. It performs a two-stage process within a single inference: it first deduces the correct answer and then generates a complete formal proof to justify that answer. This approach ensures the model’s reasoning is explicitly conditioned on reaching the correct solution.

Notification Bot (Jul 24 2025 at 12:14):

6 messages were moved from this topic to #IMO-grand-challenge > IMO 2025 P1 by Jason Rute.


Last updated: Dec 20 2025 at 21:32 UTC