Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: OpenAI IMO Gold Medal


Justin Asher (Jul 20 2025 at 01:48):

Recently, OpenAI announced that their LLM, which does not use Lean, achieved a gold medal at the IMO. This thread is to discuss this topic and declutter the #Machine Learning for Theorem Proving > Blind Speculation about IMO 2025 thread.

Some notable details about this model are...

  • It was not focused on the IMO, meaning the model was trained on a large variety of tasks and not largely trained / fine-tuned for the IMO.
  • The model appears to be able to check its own work / thoughts much more precisely than pervious models, indicating a better algorithm / less sparse reward signal.
  • It did not use any tools, coding, internet, and, in particular, Lean.
  • The model was allowed to think for a similar amount of time as students, albeit it is not clear how many tokens-per-second / how much compute is being used.

Super impressive achievement, and indicative of the turning tide around AI being applied to mathematics. Importantly, their method being completely general means that both folks learning high school mathematics and those working on the far reaches of the Langlands program will be able to use these tools to learn and prove new statements.

Justin Asher (Jul 20 2025 at 01:51):

You can see Alexander Wei's tweet here: https://x.com/alexwei_/status/1946477742855532918

Here is a GitHub repository containing the solutions: https://github.com/aw31/openai-imo-2025-proofs/

MathArena contains information about how commercially available LLMs performed (not great): https://matharena.ai/

Andy Jiang (Jul 20 2025 at 03:07):

I find it difficult to believe (without clear evidence) that this model was not tuned to IMO. AFAIK the twitter thread says it was training with general RL + scaling test time compute techniques. But training with general techniques doesn't mean the model itself is not specialized. For example (in this example the issue not very severe but just to illustrate), alphago-zero was trained with general techniques but is specialized to go. In fact I guess all modern deep learning training uses somewhat general techniques. Reading their tweets generously suggests that the model is general-purpose (not precluding tuned to IMO) and the scaffolding on IMO in particular is not overwhelming (though certainly there's some scaffolding/prompting techniques which can be indirectly seen from proofs). My personal reading of that statement was more in the spirit that they believe the techniques should also work in other verifiable but difficult to verify domains. But it's unclear to me how easy that would be to do for cutting-edge research math at the moment--if it were truly easy I would expect results [proving math conjectures] to come in coming months from OpenAI and I don't see how else we can verify that claim

Justin Asher (Jul 20 2025 at 03:38):

For reference, here is the tweet about little IMO work from a VP of Research at OpenAI: https://x.com/MillionInt/status/1946551400365994077

And yeah, if the data reported is accurate this could be a big breakthrough.

Andy Jiang (Jul 20 2025 at 03:41):

I mean Alex Wei/Noam's tweets were a bit more conservative. If OpenAI is reading: I would happily believe them that it's not IMO-specific if they put out a paper on methods ;)

David Michael Roberts (Jul 20 2025 at 04:34):

What I'm curious about is how much energy was used to get the solutions. I mean you can say only 9 hours were spent computing, but if you are burning 1.21GW while doing it, then ... well.

Patrick Massot (Jul 20 2025 at 06:29):

Could we please stay on topic? It seems this thread has nothing to do with Lean or even formal methods.

Justin Asher (Jul 20 2025 at 06:35):

I think the interplay between automated theorem proving informally and formally is highly relevant to machine learning for theorem proving in Lean. It changes the way researchers see applying AI to Lean and formal methods.

Patrick Massot (Jul 20 2025 at 07:11):

I don't see any interplay in this thread.

David Michael Roberts (Jul 20 2025 at 09:13):

I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition.
—Terry Tao

GasStationManager (Jul 20 2025 at 12:21):

To go towards slightly more on topic:

  • any entries to IMO this year that use Lean? I heard vague rumors of DeepMind participation.
  • how difficult would it be to autoformalize one of these OpenAI solutions into Lean?

Eric Wieser (Jul 20 2025 at 12:25):

I think that's more on topic for #Machine Learning for Theorem Proving > Blind Speculation about IMO 2025

Jason Rute (Jul 20 2025 at 12:35):

As for if this thread is on topic, I think for at least the imo grand challenge stream (sorry on mobile so can’t use # links) , this sort of thing is common and there is a lot of precident for it. And even for this stream it is not uncommon. In the context of AI for the IMO we are likely to see a number of approaches leveraging formal and informal math, so to talk about each of them and their differences is important. (A huge selling point of Lean last year was that it was the first and at the time only way to do well on the IMO. That is starting to change this year.) Finally the recent papers for Lean, such as DeepSeek-Prover v2, Kimina-Prover, and Goedel-Prover v2 have shown it one can take general LLM methods like these (especially if they do well on reasoning tasks like math and code) and almost directly apply them to Lean.

Shreyas Srinivas (Jul 20 2025 at 13:02):

Jason Rute said:

(A huge selling point of Lean last year was that it was the first and at the time only way to do well on the IMO. That is starting to change this year.)

How so?

Justin Asher (Jul 20 2025 at 13:04):

Jason Rute said:

Finally the recent papers for Lean, such as DeepSeek-Prover v2, Kimina-Prover, and Goedel-Prover v2 have shown it one can take general LLM methods like these (especially if they do well on reasoning tasks like math and code) and almost directly apply them to Lean.

This especially is why their result is so significant. Going informal to Lean is not hard if you incorporate Lean / Mathlib into the training regimen.

I do feel like we are standing at the edge of something great, and a lot of people are nitpicking details even though the results are seemingly ahead of what most people would have anticipated. It is so hard to see where things might be in two years. From my perspective, it is not unlikely that the current algorithmic progress continues, and AI capable far beyond that of the best human mathematicians arise in the next year or two.

Justin Asher (Jul 20 2025 at 13:10):

Shreyas Srinivas said:

How so?

Previously, companies / research groups would synthetically generate a bunch of Lean data. Since you can type check whether a Lean statement is correct, you can autoformalize a lot of individual problems and then use techniques like MCTS to train on your autoformalized problems. This is more in line with how Go was tackled by AlphaZero.

However, now research groups are taking LLMs and simply teaching them how to write Lean code, much more similar to how we do and think about mathematics. The LLM gets the added benefit of checking each of its steps individually while, at the same time, being free to think and work as it pleases (at least if you give it access to the terminal, Lean LSP, etc. like Claude Code does).

OpenAI's success was that not only did they do well on the IMO without using any sort of formal environment, they also further used generalized training algorithms without verifiable rewards, from what I can tell. Once you take an extremely intelligent model like that and hook it up to a Lean environment, you essentially can let the AI start doing math research on its own. This is all going to happen in the near future.

Jason Rute (Jul 20 2025 at 13:10):

(Also you can see a lot more context for this thread in the thread @Eric Wieser linked to, including more discussion about Lean, and also @Thomas Zhu asking for a new thread to be created just for this topic.)

Jason Rute (Jul 20 2025 at 13:13):

And to get back to Lean, the major selling point of this work is that it works with hard to verify domains like natural language proofs. While Lean proofs are the opposite (they are easy to verify automatically), other related areas of AI for theorem proving are notoriously hard to verify automatically:

  • autoformalization
  • correcting formalization mistakes (especially when formalizing open problems or specifications which don’t yet have proofs)
  • decomposing proofs into lemmas

Alex Meiburg (Jul 20 2025 at 17:12):

It looks like OpenAI RL'ed itself deep into a very peculiar way of writing. I've seen some people saying it's great, efficient communication and unambiguous. I've also seen people saying they hate it and they would want to take points off a student proof written this way because it's so confusing.

I certainly prefer it over LLMs BS'ing though, where they cite "by a standard result" or "a more careful analysis" or "a careful inspection shows that" when they want to gloss over an actually invalid step.

I'm sure that style issues will quickly be comparatively easy to improve, but there might be a frustrating period of time where LLM proofs are a dilemma between "elegant but trying to pull wool over your eyes" vs "detail-oriented and explicit, but necessarily terse and awkward in order to make (natural language) verification easier". Ew!

Alex Meiburg (Jul 20 2025 at 17:13):

When a human writes a proof, especially in a primarily instructive setting like a textbook, they can be more or less handwavy over some details; but I trust them that they have a fully detailed proof in their heads and that (assuming my brain's working okay) I could fill in these gaps if I wanted to. I don't trust the LLMs, so I don't want them to ever do that.

Alex Meiburg (Jul 20 2025 at 17:16):

I wonder if there would be room for LLMs to generate natural language proofs in an "explorable tree" like https://kmill.github.io/informalization/rudin.html -- but without the Lean backing. So if I'm skeptical I could always recurse and get more info.

Alex Meiburg (Jul 20 2025 at 17:17):

I mean, I'm sure that would be an easy thing to make, with current LLMs providing some on-demand fill-in-the-blank ability. I've seen things like that. But I mean, OpenAI probably had some kind of recursive self-debate to generate these proofs, and I wonder if it had some internal data structures that could be more readably exposed in an interface like that. Mmm. Just talking to myself at this point.

Adam Kurkiewicz (Jul 20 2025 at 18:40):

Patrick Massot any particular reason why you are voicing objections to this being discussed? Is it because you think staying “on topic” is important? Or is it because you are very unhappy with openAI for violating the news embargo other AI companies are so far all gracefully complying with? I’m genuinely interested in understanding your thought process here.

Kevin Buzzard (Jul 20 2025 at 20:25):

This Zulip is extremely good at remaining on topic. Essentially every post on this Zulip is about Lean. This thread is a glaring exception and usually the administrators do not tolerate such threads. I would also far prefer it if we stick to Lean on the Lean Zulip. There are plenty of other places to talk about informal attempts at IMO questions.

Jason Rute (Jul 20 2025 at 20:54):

@Patrick Massot I think you might have missed some of that context in #Machine Learning for Theorem Proving > Blind Speculation about IMO 2025 that led to the creation of this separate topic. I personally feel it arose naturally. (I maybe would have preferred moving all relevant messages over here as we often do, but it is not a big deal to me either way.)

Jason Rute (Jul 20 2025 at 20:54):

Also, as the one who created #Machine Learning for Theorem Proving, I humbly and respectfully want to express that I feel this is in line with what we have historically discussed, but, of course, I will agree to abide by the decisions of the Lean Zulip administrators. (If you like, I can express my thoughts in private to the administrators or make another topic to discuss what is allowed on this stream.)

Adam Kurkiewicz (Jul 20 2025 at 22:07):

I’m not a heavy user of this forum, but I have been around for many years, I do try to stay on top of things and will occasionally discuss, and I have learnt a lot from the community over the years. Some of the most valuable discussions for me were often on topics only tangentially related to lean (e.g. I once had an expert explain to me how SNARKs work. I would have struggled to understand it as quickly myself, and most likely I wouldn’t have even thought of reading up on those independently). Given these experiences, a vision of the forum as a group of people brought together by Lean but free to discuss intellectually stimulating topics of their choosing would be appealing to me.

Regardless, and more importantly, the specific reason I decided to speak out is that Justin is an early career researcher doing valuable work on AI tooling for Lean. I don’t think it’s very supportive of the community to make him feel like he’s doing something wrong by discussing a topic that’s very relevant to his research, especially in a situation where the community is already starting to benefit from his research (with more benefit yet to come).

I may be missing some nuance here, given that both Kevin and Patrick seem to have a strong preference to keep this Zulip chat more focused.

Jason Rute (Jul 20 2025 at 22:18):

Oh, I somehow responded to the wrong person. I read @Kevin Buzzard‘s message as coming from Patrick and responded to him. Sorry for that confusion!

Kevin Buzzard (Jul 20 2025 at 22:29):

I guess the reason this particular thread is so grim is that (a) it really has nothing to do with Lean at all and (b) it's quite distasteful that OpenAI have been quick to announce when everyone else is waiting until the IMO is over, and this thread is just amplifying the hype.

Jason Rute (Jul 20 2025 at 22:36):

Ok, for you it has more to do with the surrounding ethics of all this. For example if Google comes out with an IMO model this year which doesn’t use Lean (say “Gemini super plus”) but for which they respect the one week wait before announcing, you would feel better about a thread on that on this Zulip, even if you would feel more comfortable about an AlphaProof v2 thread.

Eric Wieser (Jul 20 2025 at 23:32):

Note that the discussions in the other thread have been picked up by Twitter as an expression of this distastefulness

Justin Asher (Jul 20 2025 at 23:45):

Kevin Buzzard said:

(a) it really has nothing to do with Lean at all

It does have a lot to do with Lean. They managed to train a model without formal methods, indicating that current research in Lean theorem proving is perhaps not properly extracting enough of a learning signal. I think knowing that the OpenAI approach works will give a lot of researchers the confidence to push ahead into new algorithmic advances outside the top labs. Open source is important to all of us.

I still believe formal methods are going to play a significant role, but this result also, to some extent, shifts the paradigm to me between training Lean specific models and giving general purpose models tools to interact with Lean. As my research has been more or less in the latter camp, I found this result extremely exciting.

I do want to point out that Terry said that the current models and methodologies were not sufficient to achieve such a result. Terry is a brilliant researcher, and I largely agreed with him from the publically available data. However, quite clearly a lot of people were wrong (including myself), and I think it would be good if people in mathematics / Lean started having deeper discussions about these rapidly advancing models are going to impact their work.

Jason Rute (Jul 21 2025 at 00:17):

@Justin Asher and I talked about this privately and we think it is best to hold off discussing this OpenAI model and related matters until the other IMO AI results are announced (in a week or so) out of respect for the kids at the IMO. Would everyone here be okay with that?

Shreyas Srinivas (Jul 21 2025 at 00:20):

There are some big problems with your claims and request @Justin :

  1. OpenAI hasn’t actually released any formally verified proofs of their solutions. They have released natural language proofs that none of the IMO organisers have vetted and made any official announcements about.

  2. OpenAI has acted with breathtaking arrogance by claiming these results at this point. So far you have only addressed this at one point, wherein you said this is reflective of things to come. No progress, self-proclaimed or otherwise justifies openAI acting in this manner.

  3. As Terence Tao pointed out in his mathstodon thread, the comparison with human competitors and AI models is utterly meaningless. He points out the methodological flaws in his thread on mathstodon. Given the lack of transparency in the exact experimental protocol followed, the comparison with student competitors is even more meaningless. So on what basis should we presume that these AIs are capable of performing or assisting in mathematical research? Any sound discussion on this topic must be preceded by a full disclosure and publication of open AIs exact experiment. No hidden details. How much computation did they use? What exactly did they “teach”? How much back and forth was involved before the final proofs were reached? How much energy was consumed? Currently we have tweets and proof outputs on GitHub and no paper explaining what happened. Just endless speculation and extraordinarily optimistic claims.

  4. As has been pointed out several times before, the connection between competence at IMO level math and research math is tenuous. Given this and point 3, I don’t see on what basis mathematicians could be expected to have any discussion about using these AIs in research math. Do we know how this AI acts? Is it simply going to call sage math or Mathematica on a tough integral or is it doing something more? Is it producing genuinely new mathematical insights? Why can’t we achieve the same result with a combination of human insight and existing symbolic computational tools? Again, how is it being “taught” whatever mathematics it is claimed to do? It is crucial to understand the strengths and limitations of tools to have a sensible discussion on using it.

A great research advance in AI techniques doesn’t necessarily mean much for end users without clear evidence.

Jason Rute (Jul 21 2025 at 00:21):

Jason Rute said:

Justin Asher and I talked about this privately and we think it is best to hold off discussing this OpenAI model and related matters until the other IMO AI results are announced (in a week or so) out of respect for the kids at the IMO. Would everyone here be okay with that?

Shreyas Srinivas (Jul 21 2025 at 00:22):

Jason Rute said:

Jason Rute said:

Justin Asher and I talked about this privately and we think it is best to hold off discussing this OpenAI model and related matters until the other IMO AI results are announced (in a week or so) out of respect for the kids at the IMO. Would everyone here be okay with that?

This message just appeared on my Zulip client. Thanks.

Jared green (Jul 21 2025 at 02:18):

just how much of that performance improvement over previous attempts remains valid when the lean backing is brought back to it. im with you all about the early announcement, it means nothing until we know how they accomplished it. so in the meantime there is nothing to talk about.

Andy Jiang (Jul 21 2025 at 16:51):

I can delete this if it's violating the embargo (or it's also deemed off-topic), but Deepmind just announced their efforts which are also end-to-end natural language and 5/6 problems (I assume the same ones)

Thomas Zhu (Jul 21 2025 at 16:53):

(deleted)

Andy Jiang (Jul 21 2025 at 16:55):

well I meant the embargo in this thread from Jason and Justin (and I also saw some comments saying this whole topic is off-topic)

Ralph Furman (Jul 21 2025 at 16:58):

DeepMind's post: https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/

Thomas Zhu (Jul 21 2025 at 16:59):

Lean-related part of the announcement:

While our approach this year was based purely on natural language with Gemini, we also continue making progress on our formal systems, AlphaGeometry and AlphaProof. We believe agents that combine natural language fluency with rigorous reasoning - including verified reasoning in formal languages - will become invaluable tools for mathematicians, scientists, engineers, and researchers, helping us advance human knowledge on the path to AGI.

(deleted) (Jul 21 2025 at 17:03):

Is it safe to declare the embargo over?

Ping J (Jul 21 2025 at 17:12):

truly impressed... clear win to google... any (beginnings of) thoughts on their method?

(deleted) (Jul 21 2025 at 17:20):

Unfortunately they revealed little about their method

Opt (Jul 21 2025 at 17:32):

AlphaProof folks are still in the acknowledgements. Maybe they used Lean to train the NL verifier?

Justin Asher (Jul 21 2025 at 17:44):

Andy Jiang said:

well I meant the embargo in this thread from Jason and Justin (and I also saw some comments saying this whole topic is off-topic)

I also wanted to stop having technical discussions until next week when all the results are out…Harmonic and some other groups still have not announced.

Justin Asher (Jul 21 2025 at 17:44):

https://x.com/HarmonicMath/status/1947023450578763991


Last updated: Dec 20 2025 at 21:32 UTC