Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Blind Speculation about IMO 2025


Andy Jiang (Jun 28 2025 at 06:36):

Sorry in advance for this self-indulgent post with blatant speculations about performance of various AI systems in this year's IMO (happening in about two weeks).

1-7 points:
I think there are probably upwards of 30 models which can score a non-zero number of points on the IMO. Well-known ones I expect to be in this range include Qwen3, Deepseek R1 (original version) and Claude Opus 4. My prediction here is more optimistic than MathArena scores on USAMO suggest. I am assuming that the widely available LLMs get minimal scaffolding (basically, grade them like humans).

8-21 points:
I think currently the best general LLMs will score in this range (this is again a little more optimistic than MathArena suggests--though on the whole I think USAMO's easiest questions are usually a little harder than the easiest on the IMO [not sure]). I believe the best current general models are Gemini 2.5 Pro (Deep Think?) and o3-pro (maybe + distributed/parallel inference?), and maybe Deepseek R1 0528 deserves to be here too, but I'm not sure.

30+ points (probably closer to 42):
This is probably where we expect Google's internal model to land in this year's IMO, given their close-to-gold performance last year.

Having said the above, there are quite a few unknowns. It's not clear how well the best formal models (apart from Google's AlphaProof) can perform on the IMO under reasonable resource constraints. Formal models have the advantage that you can employ search and throw massive compute at generation, because of perfect verification. For example, it's not clear to me where Deepseek Prover v2 falls, or where the internal models at Harmonic do. Finally, there's a high chance that various organizations are building IMO-specific AI models for this year's IMO which are not yet announced. For example, I personally think it's likely to be happening at OpenAI, Deepseek, and xAI--though it's unclear whether the results will be publicly announced if they do not achieve a high score.

Happy to hear other people's thoughts on my rambling above, and if people agree or disagree with my take.

Kelly Davis (Jun 28 2025 at 12:54):

DeepMind's AlphaProof team has publicly stated that they will publish an article on AlphaProof in July.

So from a purely PR point of view it would make sense for them to (1) compete in the IMO, (2) achieve a solid (gold) performance, and (3) publish directly thereafter, say in Nature.

Jason Rute (Jun 28 2025 at 13:18):

Where have they publicly stated this?

Patrick Massot (Jun 28 2025 at 13:51):

They say they are not interested in competing this year because either they will get a gold medal and nobody will be surprised or they won’t and it will be seen as a failure.

Patrick Massot (Jun 28 2025 at 13:51):

But it could be bluff.

Jason Rute (Jun 28 2025 at 13:53):

Is this from private discussions? Or from public talks that the AlphaProof team has been giving? (I haven't watched their Big Proof talks yet.)

Patrick Massot (Jun 28 2025 at 14:25):

I don't remember whether it was mentioned in the talk.

Kelly Davis (Jun 28 2025 at 14:26):

Jason Rute said:

Where have they publically stated this?

AFAIK it was in the Harvard CMSA AlphaProof lecture by Thomas Hubert, hosted by Michael R Douglas. I heard about it in a personal message from Michael.

Ralph Furman (Jun 28 2025 at 14:30):

Morally I’d consider last year’s performance to be gold, since any student who solved 1, 2, 4, and 6 would definitely get at least 1 partial mark on the other 2 (as would LLM slop)

Kelly Davis (Jun 28 2025 at 14:33):

Just looked back at the personal message from Michael and he said the paper is supposed to come out "before the next IMO".

Jason Rute (Jun 28 2025 at 14:48):

@Kelly Davis wasn’t that CMSA talk a number of months ago? Couldn’t the schedule have changed since then (due to company politics, referee feedback, other things)? Or is your intel more recent?

Kelly Davis (Jun 28 2025 at 14:50):

Jason Rute said:

Kelly Davis wasn’t that CMSA talk a number of months ago? Couldn’t the schedule have changed since then (due to company politics, referee feedback, other things)? Or is your intel more recent?

Yes, things could have changed. The conversation I quote is from June 4, 2025.

Jason Rute (Jun 28 2025 at 14:58):

I hope someone knowledgeable in IMO scoring (whether MathArena.ai or someone else) measures the current LLMs on the IMO right after the contest.

Joseph Myers (Jun 28 2025 at 15:03):

Ralph Furman said:

Morally I’d consider last year’s performance to be gold, since any student who solved 1, 2, 4, and 6 would definitely get at least 1 partial mark on the other 2 (as would LLM slop)

Leaders are good at asking for points for things in a contestant's script that could sort of be interpreted as vaguely similar to something worth points in the mark scheme if you reinterpret it and look at it sideways. Coordinators are good at ensuring that such dubious claims don't actually get marks and that marking is consistent across all scripts about what is considered good enough to get a particular mark (including pointing out things to leaders that they missed but are worth marks, in some cases).

The same applies to LLM slop; AI researchers might try to argue for interpreting it as being worth something, but that doesn't mean it's actually worth anything. Partial credit on problems 3 and 6 is generally quite hard to get; the most obvious ideas may not get any marks.

A proper protocol for marking informal AI attempts at IMO problems involves them being assessed by coordinators who assessed human work on those problems at that IMO, not just the AI developers marking their own work. If the AI makes multiple attempts at a problem, it also involves the AI deciding for itself (with human-like length limits on the amount of work for the coordinators to consider) which attempts are most worthy of consideration (rather than an AI producing 100 attempts at a problem and humans selecting the best).

Joseph Myers (Jun 28 2025 at 15:09):

As for formal-to-formal (F2F) attempts, as long as you're not looking for partial credit but only scores of 0 or 7, marking is easier (modulo the "determine" issue), provided the formal problem statements are correct (though we've seen lately how verifying that a Lean file really does add a valid proof of the correct type to the environment is more complicated than one might hope, and that AIs are capable of discovering bugs in whatever process they use for verifying proofs while training).
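(For concreteness, a minimal sketch of the kind of check meant here; the theorem name and toy statement below are placeholders, not an actual IMO problem. The submitted file has to elaborate, the theorem has to have exactly the agreed statement, and something like `#print axioms` should show nothing beyond the standard axioms, and in particular no `sorryAx`.)

```lean
import Mathlib

-- Placeholder standing in for an agreed formal problem statement.
theorem submitted_solution (n : ℕ) : n ≤ n + 1 := by
  omega

-- Lists every axiom the proof depends on. Seeing `sorryAx`, or anything
-- beyond the usual propext / Classical.choice / Quot.sound, means the
-- file does not actually prove the agreed statement.
#print axioms submitted_solution
```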

Jason Rute (Jun 28 2025 at 15:13):

@Joseph Myers do you have any thoughts on the marking of USAMO by MathArena.ai ? (https://arxiv.org/abs/2505.23281). Would you consider it reasonably fair?

Joseph Myers (Jun 28 2025 at 15:14):

I am not aware of any plan to produce a single standard Lean version of the problem statements, although I intend to write my own for IMOLean and think that having a single standard version is appropriate to get proper comparability between different AIs attempting the problems in Lean. Of course stating some geometrical problems in Lean might well still require working around missing mathlib definitions, although more such definitions are present than were present a year ago.

Joseph Myers (Jun 28 2025 at 15:20):

I'm not familiar with USAMO mark schemes and don't know how similar or not the ones used by MathArena.ai are to those used for human scripts. My assumption is that proper comparability with human scores means using the same mark schemes as for human scripts (which aren't public, hence AI developers wanting a proper independent assessment of informal IMO 2025 attempts should have been talking to the IMO organizers some time ago to seek arrangements for some coordinators to mark their scripts after marking human ones on the same problems).

Kevin Buzzard (Jun 28 2025 at 16:11):

I did email the IMO 2025 organisers quite some time ago suggesting that an "official" Lean translation of the problems would be helpful, but they did not get back to me about this.

Kevin Buzzard (Jun 28 2025 at 16:15):

Right now the entire "how well does a computer do on the IMO" question is muddied by several issues, which have been documented by Joseph above, but just to recap.

Some issues with formal submissions:

  • formalisation of "what is this number" questions (see the sketch below);
  • the fact that the AI teams can choose how to formalise the question (and may even try several variants internally until they find one where the machine can make progress);
  • the fact that the entire process is going on behind closed doors (science is supposed to be reproducible).

And some issues with informal submissions:

  • the AI should submit one solution, not "100 and then the AI team chooses the best one";
  • the submission needs to be marked by coordinators rather than by the team itself.
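To make the "what is this number" point concrete, here is a hypothetical sketch (not any actual IMO statement) of how a determine-style problem is typically formalised: the answer is written down as a standalone definition and the theorem characterises the problem against it, so how that answer is packaged is itself a formalisation choice.

```lean
import Mathlib

-- The value the contestant is asked to determine, stated separately so
-- that the theorem below pins it down. A different but still "faithful"
-- packaging of the answer can make the statement easier or harder for a
-- prover to work with.
def answer : ℕ := 4

/-- Toy "determine" problem: the smallest natural number strictly
greater than 3. -/
theorem determine_example :
    3 < answer ∧ ∀ n : ℕ, 3 < n → answer ≤ n := by
  refine ⟨by norm_num [answer], fun n hn => ?_⟩
  simp only [answer]
  omega
```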

Kevin Buzzard (Jun 28 2025 at 16:17):

All this, on top of the fact that we have no idea who will submit and who won't announce their results if they don't do very well, makes me a bit nervous about the possible chaos which might unfold in July. Not least because people might have lots of money riding on my opinion about whether the IMO grand challenge was achieved in 2025.

Joseph Myers (Jun 28 2025 at 16:28):

And also any AI teams in a hurry to announce results could well be putting out announcements while a significant proportion of the people who might comment are either busy at the IMO itself or out of contact while travelling home from the IMO in Australia.

Justin Asher (Jun 28 2025 at 17:14):

Kevin Buzzard said:

I did email the IMO 2025 organisers quite some time ago suggesting that an "official" Lean translation of the problems would be helpful, but they did not get back to me about this.

This seems like the most reasonable solution to most of these issues. For one, I know that the choice of formalization can affect model performance, which does not seem like a particularly good practice.

Justin Asher (Jun 28 2025 at 17:15):

Ideally, we want the human to be completely out of the loop. I do not see why we cannot have a standard for how to achieve this.

Andy Jiang (Jun 29 2025 at 01:04):

I guess that some teams (including Google) are using an older version of mathlib, or a fork of mathlib where they have added more to the library, so the official translation may need to be slightly modified for them; but I do think it's a good idea to have an official translation. Maybe someone from the AIMO Prize could make a semi-official translation for the open-source competition.

Kevin Buzzard (Jun 29 2025 at 07:38):

AIMO prize has nothing to do with IMO 2025

Andy Jiang (Jun 29 2025 at 09:12):

True--maybe I should've said unofficial. I only suggested them because they are the only ones right now who seem committed to validating whether open-source efforts do in fact reach IMO gold medal ability (given that they still pledge 5M USD on their website), but they also seem quite quiet on this, and it's not clear whether they will accept formal-to-formal solutions.

Maybe the Lean community here could agree on a translation after the IMO to test the publicly available models?

Andy Jiang (Jul 19 2025 at 09:03):

Resurrecting this thread to say that, out of the general-purpose AIs, we saw performance on the lower end of my predictions above and in line with USAMO results (as reported by MathArena). We're starting to see announcements of other efforts https://x.com/alexwei_/status/1946477742855532918 now made public. Still hoping for Google and maybe some more teams to announce their results (no inside info this is just speculation)

Kelly Davis (Jul 19 2025 at 09:56):

Andy Jiang said:

We're starting to see announcements of other efforts https://x.com/alexwei_/status/1946477742855532918 now made public. Still hoping for Google and maybe some more teams to announce their results (no inside info this is just speculation)

Alexander Wei also made their proofs public on GitHub

Junyan Xu (Jul 19 2025 at 10:29):

https://imo2025.au/wp-content/uploads/2025/07/IMO-2025_ClosingDayStatement-19072025.pdf

As well as upholding traditions, this year’s IMO has also made space for cutting-edge developments. We are delighted to welcome the AI Mathematical Olympiad Prize, a $10mn competition to encourage the creation of open-source AI models. The AIMO Prize will give an update on its progress later this morning, including a video presentation from this year’s winner, NemoSkills.

Additionally, for the first time, a selection of AI companies were invited to join a fringe event at the IMO, in which their representatives presented their latest developments to students. These companies also privately tested closed-source AI models on this year’s problems and we are sure their results will be of great interest to mathematicians, technologists and the wider public.

We are proud that the IMO is highly regarded as a benchmark for mathematical prowess, and that this year’s event has engaged with both open- and closed-source AI models. However, as Gregor Dolinar, President of the IMO, stated: “It is very exciting to see progress in the mathematical capabilities of AI models, but we would like to be clear that the IMO cannot validate the methods, including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced. What we can say is that correct mathematical proofs, whether produced by the brightest students or AI models, are valid.”

Junyan Xu (Jul 19 2025 at 10:52):

https://x.com/zjasper666/status/1946335438228361372
Translation of the screenshot:

Congratulations to the Chinese team :+1: :+1: :+1: :tada: :tada: :tada:
Congratulations to the Chinese team :+1: :+1: :+1: :tada: :tada: :tada:
So DeepMind is also gold?
Yes
There's a big pile of 35s this year

(The last comment is probably about human contestants, as the gold threshold is 35 this year, reached by 67 contestants, compared to 29 last year, reached by 58.)

Andy Jiang (Jul 19 2025 at 11:18):

Kelly Davis said:

Andy Jiang said:

We're starting to see announcements of other efforts https://x.com/alexwei_/status/1946477742855532918 now made public. Still hoping for Google and maybe some more teams to announce their results (no inside info this is just speculation)

Alexander Wei also made their proofs public on GitHub

I skimmed some of them. They're written in a really nice style (very efficient, almost like someone verbally explaining the proof, with a lot of unnecessary grammar omitted, etc.), but still quite clear from a math perspective (I didn't check the proofs).

Joseph Myers (Jul 19 2025 at 11:24):

I talked to IMO Secretary General Ria van Huffel at the IMO 2025 closing party about the OpenAI announcement. While I can't speak for the Board or the IMO (and didn't get a chance to talk about this with IMO President Gregor Dolinar, and I doubt the Board are readily in a position to meet for the next few days while traveling home), Ria was happy for me to say that it was the general sense of the Jury and Coordinators at IMO 2025 that it's rude and inappropriate for AI developers to make announcements about their IMO performances too close to the IMO (such as before the closing party, in this case; the general coordinator view is that such announcements should wait at least a week after the closing ceremony), when the focus should be on the achievements of the actual human IMO contestants and reports from AIs serve to distract from that.

I don't think OpenAI was one of the AI companies that agreed to cooperate with the IMO on testing their models and don't think any of the 91 coordinators on the Sunshine Coast were involved in assessing their scripts.

Junyan Xu (Jul 19 2025 at 11:49):

Andy Jiang said:

I skimmed some of them. They're written in a really nice style (very efficient, almost like someone verbally explaining the proof, with a lot of unnecessary grammar omitted, etc.), but still quite clear from a math perspective (I didn't check the proofs).

https://x.com/algobaker/status/1946517196097687590

It is more likely it speaks that way because they did some long RL runs with verifiable reward, maybe downweighting longer responses, and that garbled its ability to write coherently. This tends to happen with long RL training.

I think RL will also encourage the model not to waste too many tokens on thinking and to drop tokens not useful for thinking (assuming they still use natural language to think rather than special thinking tokens).

Justin Asher (Jul 19 2025 at 13:17):

Joseph Myers said:

when the focus should be on the achievements of the actual human IMO contestants and reports from AIs serve to distract from that.

Seems reflective of what is to come. Though I presume that by the next IMO, AI models will be so good that this will not be a problem, as solving all the problems will be trivial for them. No one will care.

Joseph Myers (Jul 19 2025 at 13:30):

It's also illustrative of the common attitude among AI companies of being entitled to take whatever they want that they can get their hands on, without any concern for courtesy or impacts on the rest of the world. (See also e.g. the various AI scrapers ignoring robots.txt and overloading websites with an effective DDoS, using borderline malware to make the scrapers hard to block - such scrapers were apparently responsible for major load problems on the system used to enter scores during coordination at the IMO, and forced the public live scoreboard to be taken down because of the load they caused when it was public. Or the assumptions by AI companies that it's OK for them to train on any material they can download or scan, regardless of the views of the rights holders, the ethics of the intended use of the material, or indeed whether the source from which it was acquired was actually doing large-scale illegal copying.)

(deleted) (Jul 19 2025 at 13:58):

My main concern is that the OpenAI proofs are very unreadable. I can't understand them. So the score that the AI gets depends on how charitable the grader is.

Andy Jiang (Jul 19 2025 at 14:16):

I also wonder, as these models are thinking in parallel for hours, how long it will be until we have access to this sort of capability as consumers (1-2 years?). I do think that general access to this would already be helpful for mathematicians doing research (maybe Kevin Buzzard won't agree, though).

Kevin Buzzard (Jul 19 2025 at 14:36):

I certainly don't agree that machines which can solve IMO problems will be useful for mathematicians doing research, in the same way that when I arrived in Cambridge UK as an undergraduate clutching my IMO gold medal I was in no position to help any of the research mathematicians there. I think that currently it is still entirely unclear whether things will scale from machines being able to do mathematics which can be solved using high school techniques to machines being able to help with mathematics which can only be solved by having a deep understanding of modern research ideas (and I have no opinions on whether things will scale, I feel like this is a big open question right now).

Andy Jiang (Jul 19 2025 at 15:01):

I feel like I disagree somewhat with this, depending on how convenient such a system is to use. Maybe it's different from your experience, but I do feel like sometimes, for me, a toy example of whether some idea could work, or some general confusion I have about a question, could be reduced to a standalone question at an IMO level--and in those cases I would benefit from such a system. I do agree that in general IMO gold medalists cannot immediately do research math, but I still think that in the hypothetical situation where a gold medalist were on call 24/7 to some mathematician who didn't care about wasting their time or helping them develop as a mathematician, there could be some situations where they would come in handy. But I have much less experience in research math than you, so I definitely think your opinion carries more weight here.

Justin Asher (Jul 19 2025 at 15:10):

Kevin Buzzard said:

to do mathematics which can be solved using high school techniques to machines being able to help with mathematics which can only be solved by having a deep understanding of modern research ideas

This seems similar to the OpenAI announcement, though: A general reasoning system that can score well on IMO can presumably start to be able to write research papers. They just want a good test for raw reasoning ability, which the IMO gives, and then research should follow from more general abilities.

Chris Birkbeck (Jul 19 2025 at 15:29):

I'd like to see them spend this compute on the formal conjectures repo and see how well they do there.

Jason Rute (Jul 19 2025 at 15:35):

Chris Birkbeck said:

I'd like to see them spend this compute on the formal conjectures repo and see how well they do there.

What if it returned a claimed solution, but in English and LaTeX instead of Lean? Would you read it? I think that is the next test for natural language math AI. Either it needs to be right so often that it is worth reading it, or it needs to self-filter any wrong answers.

Chris Birkbeck (Jul 19 2025 at 15:38):

I'd want it to be in Lean really. I don't think I trust natural language AI answers enough to spend more than, say, 10 mins reading a proof (of something I really care about).

Kelly Davis (Jul 19 2025 at 15:47):

Jason Rute said:

What if it returned a claimed solution, but in English and LaTeX instead of Lean? Would you read it? I think that is the next test for natural language math AI. Either it needs to be right so often that it is worth reading it, or it needs to self-filter any wrong answers.

Ideally it would return both. No?

Lean would be the "canonical" answer and English + LaTeX would be the "non-canonical" answer for consumption of non-Lean experts.

Andy Jiang (Jul 19 2025 at 15:57):

I'd be down to read at least the first solution it submits. If it is wrong idk if I'll read the second...

Justin Asher (Jul 19 2025 at 15:59):

I would be surprised if a model can solve a hard conjecture and simultaneously be incapable of writing Lean code and formalizing that proof.

Andy Jiang (Jul 19 2025 at 16:09):

Why? There are plenty of humans who can solve hard conjectures but can't formalize them in Lean.

Shreyas Srinivas (Jul 19 2025 at 16:09):

Because an AI is a tool

Shreyas Srinivas (Jul 19 2025 at 16:09):

And tools should produce well defined and verifiable outputs

Shreyas Srinivas (Jul 19 2025 at 16:09):

Not force humans to do the work for them

Justin Asher (Jul 19 2025 at 16:14):

Andy Jiang said:

Why? There are plenty of humans who can solve hard conjectures but can't formalize them in Lean.

Most of the time, saying a human cannot do "X" is because the human has not taken the time to study "X". One possible future of AI is having a model which trains on a lot of tasks / jobs via a large RL pipeline. These major companies want to train on Lean because they want their code to be verifiable.

Also, a lot of the strong researchers at these companies have papers on doing math via Lean. It is only natural that it will work its way into the training pipeline, and it will be very obvious when this happens.

Shreyas Srinivas (Jul 19 2025 at 16:18):

That (well-defined, verifiable output), ethical sourcing of data, and transparency ought to be cornerstones of good scientific research. But apparently all of that doesn't seem to apply to AI companies and their precious profits, as @Joseph Myers points out.

Martin Dvořák (Jul 19 2025 at 20:07):

Jason Rute said:

What if it returned a claimed solution, but in English and LaTeX instead of Lean? Would you read it?

No.

Junyan Xu (Jul 19 2025 at 20:18):

Generally I have high trust in proofs produced by Gemini (2.5 Pro); I haven't used other models to produce proofs much. If it makes a mistake, it's probably an honest mistake and can be corrected. You can see how it thinks, after all.

Kevin Buzzard (Jul 19 2025 at 20:39):

Justin Asher said:

Most of the time, saying a human cannot do "X" is because the human has not taken the time to study "X".

That statement might be "generically true" for humans doing things, but we are not talking generally here, we are specifically talking about machines doing research mathematics. And most of the time an expert mathematician saying they cannot do X (a problem in their area) is because they are an expert in the area and are well aware that the techniques known to humanity are not strong enough to make progress on X. In this setting, a machine which has been trained on all human knowledge might not be in a better position than an expert human (who may also know all relevant human knowledge) to make progress and in fact might currently be in a worse position. I'm not saying it's impossible for a machine to make progress in this situation, but I am saying it's an important open problem, and we've seen absolutely no evidence of it yet.

Chris Birkbeck (Jul 19 2025 at 20:57):

Junyan Xu said:

Generally I have high trust in proofs produced by Gemini (2.5 Pro); I haven't used other models to produce proofs much. If it makes a mistake, it's probably an honest mistake and can be corrected. You can see how it thinks, after all.

I generally trust Gemini, but not at research-level maths. I've asked it about open questions and it makes up references to plausible-sounding theorems which are ultimately wrong.

Joseph Myers (Jul 19 2025 at 21:48):

Andy Jiang said:

I feel like I disagree somewhat with this, depending on how convenient such a system is to use. Maybe it's different from your experience, but I do feel like sometimes, for me, a toy example of whether some idea could work, or some general confusion I have about a question, could be reduced to a standalone question at an IMO level--and in those cases I would benefit from such a system.

For the case of "IMO-type lemma arising in research mathematics", there's the additional complication that you don't necessarily want a proof of a precisely defined lemma (if true), but rather the statement of what you want proved may be a bit fuzzy; something a bit different is OK if it can be used in the larger argument, and that larger argument and candidate lemma statements may evolve together until you have a lemma that you can both prove and use in the larger argument.

Thomas Zhu (Jul 19 2025 at 22:27):

(By the way, maybe move the discussions about OpenAI to a separate thread?)

TyDHC (Jul 20 2025 at 07:51):

I think the original purpose of Lean 4 was mathematical proof rather than integration with AI. The combination of mathematics and AI is very promising, but Lean 4 may not be the best implementation, and there might be a better one that hasn't been discovered yet.

Jason Rute (Jul 20 2025 at 12:41):

GasStationManager said:

To go towards slightly more on topic:

  • any entries to IMO this year that use Lean? I heard vague rumors of DeepMind participation.
  • how difficult would it be to autoformalize one of these OpenAI solutions into Lean?

I think it has come out that there is a sort of embargo period, where the IMO requested that AI companies wait until the children have had their time to shine (and OpenAI didn't follow that). I think we will hear more in the next week or so.

Kelly Davis (Nov 12 2025 at 16:55):

Kelly Davis said:

Jason Rute said:

Kelly Davis wasn’t that CMSA talk a number of months ago? Couldn’t the schedule have changed since then (due to company politics, referee feedback, other things)? Or is your intel more recent?

Yes, things could have changed. The conversation I quote is from June 4, 2025.

Looks like the paper is finally here: #Machine Learning for Theorem Proving > AlphaProof Paper (discussion)


Last updated: Dec 20 2025 at 21:32 UTC