Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Seed Prover Achieves Silver-Level Score at IMO 2025


Junyan Xu (Jul 23 2025 at 01:16):

Official score: 2/7/7/7/7/0
P2 is solved using subsystem Seed Geometry
Announcement: https://mp.weixin.qq.com/s/MWo8dIg4bqQXzB_W9UZ7Ow (Chinese)
Solutions: https://github.com/ByteDance-Seed/Seed-Prover/tree/main/SeedProver/imo2025

Junyan Xu (Jul 23 2025 at 01:20):

Congrats to @Huajian Xin and the team

Thomas Zhu (Jul 23 2025 at 01:33):

Link to announcement: #announce > Seed Prover Achieves Silver-Level Score at IMO 2025

Heather Macbeth (Jul 23 2025 at 02:09):

331 solved on PutnamBench, that's impressive! They have posted their solutions publicly though -- I had the impression that the PutnamBench maintainers prefer people not to do this? cc @George Tsoukalas

Heather Macbeth (Jul 23 2025 at 02:11):

There also seem to be apply?s in a lot of the solutions, I wonder whether any of them exploit the bug DeepSeek found.

Thomas Zhu (Jul 23 2025 at 02:12):

Heather Macbeth said:

331 solved on PutnamBench, that's impressive! They have posted their solutions publicly though -- I had the impression that the PutnamBench maintainers prefer people not to do this? cc George Tsoukalas

Hi @Heather Macbeth , I will relay this to the repo manager. Thank you!

Justin Asher (Jul 23 2025 at 02:13):

Retracted.

Thomas Zhu (Jul 23 2025 at 02:13):

Heather Macbeth said:

There also seem to be apply?s in a lot of the solutions, I wonder whether any of them exploit the bug DeepSeek found.

We are aware of the bug, and I can confirm it is not exploited.

Jason Rute (Jul 23 2025 at 02:15):

@Justin Asher Can you point to the thread on soundness issues of the Lean REPL (if there is one) or make a new thread about it? (Edit: I see you redacted your statement. Should I delete this?)

Justin Asher (Jul 23 2025 at 02:16):

@Jason Rute (Leni said this, but Auguste said they were patched. I do not know too much. I figured this discussion is off-topic.)

Huajian Xin (Jul 23 2025 at 02:17):

Jason Rute said:

Justin Asher Can you point to the thread on soundness issues of the Lean REPL (if there is one) or make a new thread about it?

We may release our Lean verification infra later :)

Justin Asher (Jul 23 2025 at 02:17):

@Huajian Xin Would be super useful! Seems like everyone has their own…

Jason Rute (Jul 23 2025 at 02:18):

As for the repo, at least zip them. I appreciate the desire to have others check that they are correct, but too many people train on github projects without thought.

Huajian Xin (Jul 23 2025 at 02:20):

Jason Rute said:

As for the repo, at least zip them. I appreciate the desire to have others check that they are correct, but too many people train on github projects without thought.

Sure, we are fixing this issue

Thomas Zhu (Jul 23 2025 at 02:23):

We do not use REPL for checking solutions, and we use a #print axioms check at the end. We are using Lean v4.14.0, and AFAIK DeepSeek's bug does not work in this version. (FYI: I discovered the mechanism behind DeepSeek's apply? bug with the community here; I now work on the verification infrastructure and inference-time strategies at ByteDance Seed and also checked a subset of the solutions manually).
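For anyone unfamiliar with that check, here is a minimal toy example (not taken from the Seed Prover output) of what `#print axioms` reports:

```lean
-- Toy example: `#print axioms` lists every axiom a compiled proof depends on.
-- A complete proof should report at most propext, Classical.choice and
-- Quot.sound; a proof containing `sorry` would instead report `sorryAx`,
-- which is what a final check like this screens for.
theorem two_add_two : 2 + 2 = 4 := rfl

#print axioms two_add_two
-- 'two_add_two' does not depend on any axioms
```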

Justin Asher (Jul 23 2025 at 02:27):

Here is an English translation, generated by Gemini 2.5 Pro, of the announcement that was originally in Chinese. My apologies if anything is mistranslated.

Seed_Prover_Announcement.pdf

Thomas Zhu (Jul 23 2025 at 02:29):

@Junyan Xu I have taken the liberty to change the title of the topic to "Seed Prover Achieves Silver-Level Score at IMO 2025", because we haven't actually achieved the silver medal! The latter requires being a human :slight_smile:

Huajian Xin (Jul 23 2025 at 02:30):

Justin Asher said:

Here is an English translation, generated by Gemini 2.5 Pro, of the announcement that was originally in Chinese. My apologies if anything is mistranslated.

Seed_Prover_Announcement.pdf

Thank you! As for the text in the table, I can also show you the English version for reference :slight_smile: :

image.png

Justin Asher (Jul 23 2025 at 02:36):

@Huajian Xin Thanks! Fixed it.

Justin Asher (Jul 23 2025 at 02:49):

The PutnamBench result is super exciting. Putnam-AXIOM already demonstrated that LLMs are capable of solving a relatively similar percentage of Putnam problems. Now we see that, with a sufficiently strong Lean programming system, we can achieve this score in practice. Awesome!

It also seems like these LLMs love using "have" expressions. I think this is due to the way they are pretrained / how they interact with the Lean environment. You can see a somewhat similar pattern in Morph Labs' autoformalization. I do not think this is optimal (Mathlib is certainly not written this way).
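As a toy illustration of the contrast (my own sketch, not taken from any prover's output), here is the same trivial fact in a `have`-heavy style, where each intermediate goal is spelled out, versus a terse one-liner closer to how Mathlib is usually written:

```lean
import Mathlib

-- `have`-heavy style: every intermediate goal is stated explicitly, so the
-- prover (or a reader) sees exactly what remains after each step.
example (a b : Nat) : (a + b) + (a + b) = 2 * a + 2 * b := by
  have h1 : (a + b) + (a + b) = 2 * (a + b) := by ring
  have h2 : 2 * (a + b) = 2 * a + 2 * b := by ring
  rw [h1, h2]

-- Terse style, closer to typical Mathlib proofs.
example (a b : Nat) : (a + b) + (a + b) = 2 * a + 2 * b := by ring
```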

Zheng Yuan (Jul 23 2025 at 03:45):

Justin Asher said:

The PutnamBench result is super exciting. Putnam-AXIOM already demonstrated that LLMs are capable of solving a relatively similar percentage of Putnam problems. Now we see that, with a sufficiently strong Lean programming system, we can achieve this score in practice. Awesome!

It also seems like these LLMs love using "have" expressions. I think this is due to the way they are pretrained / how they interact with the Lean environment. You can see a somewhat similar pattern in Morph Labs' autoformalization. I do not think this is optimal (Mathlib is certainly not written this way).

I think LLMs prefer using have mainly because it shows what we need to prove next, as a hint.

Justin Asher (Jul 23 2025 at 03:48):

Do you think the RL system is producing all the "have" expressions, since you are using compiler feedback? In particular, from the translation:

Our team discovered that by using feedback from the Lean compiler combined with the model’s summarization capabilities to iteratively refine proofs, we could overcome the token budget limitations of a single inference pass and significantly enhance proving capabilities.

It seems like this would incentivize the model to break the proof into as small of steps as possible, so that the LLM can get feedback on each little step. I think this is fairly typical and expected as we go

entire proof -> proof with retries -> diff requests

Setting up the training process to provide real-time LSP feedback, and letting the model "type" while looking at the current LSP just like we do, would presumably resolve this. Albeit, I do not know how much this would slow down the underlying token-generation speed.

Justin Asher (Jul 23 2025 at 03:49):

Zheng Yuan said:

I think LLMs prefer using have mainly because it shows what we need to prove next, as a hint.

This seems right, when considering what information the diagnostics would provide.

Huajian Xin (Jul 23 2025 at 03:55):

Justin Asher said:

Do you think the RL system is producing all the "have" expressions, since you are using compiler feedback? In particular, from the translation:

Our team discovered that by using feedback from the Lean compiler combined with the model’s summarization capabilities to iteratively refine proofs, we could overcome the token budget limitations of a single inference pass and significantly enhance proving capabilities.

It seems like this would incentivize the model to break the proof into as small of steps as possible, so that the LLM can get feedback on each little step. I think this is fairly typical and expected as we go

entire proof -> proof with retries -> diff requests

Setting up the training process to provide real-time LSP feedback, and letting the model "type" while looking at the current LSP just like we do, would presumably resolve this. Albeit, I do not know how much this would slow down the underlying token-generation speed.

An example is Isa-Proof-Shell, which contains only statement declarations; all supporting proofs are generated automatically.

Justin Asher (Jul 23 2025 at 04:10):

Justin Asher said:

Albeit, I do not know how much this would slow down the underlying token-generation speed.

This should not be a problem with efficient LLM cache management, I believe? Correct me if I am wrong, because I have not written much code to manage the cache myself before.

Opt (Jul 23 2025 at 04:20):

@Huajian Xin @Thomas Zhu Are you planning to release the tech report and the model?

(deleted) (Jul 23 2025 at 04:27):

I feel it's not RL causing the have statements, but rather the SFT stage before RL

Justin Asher (Jul 23 2025 at 04:28):

Huỳnh Trần Khanh said:

but rather the SFT stage before RL

Would be happy to hear more on why you say this?

Zheng Yuan (Jul 23 2025 at 04:31):

Opt said:

Huajian Xin Thomas Zhu Are you planning to release the tech report and the model?

Tech report is on the way. No plan to release model now.

(deleted) (Jul 23 2025 at 04:36):

Justin Asher said:

Huỳnh Trần Khanh said:

but rather the SFT stage before RL

Would be happy to hear more on why you say this?

For Kimina Prover, the have statements are already in the chain of thought, not just in the final proof.

https://github.com/MoonshotAI/Kimina-Prover-Preview/tree/master

Distinct Reasoning Style: We carefully design a reasoning style that we call Formal Reasoning Pattern that bridges the gap between formal verification and informal mathematical intuition.

Quite obvious that this is intentional.

(deleted) (Jul 23 2025 at 04:39):

17532455506305169576064469475789.png

(deleted) (Jul 23 2025 at 04:39):

Each reasoning block is a have statement.

Justin Asher (Jul 23 2025 at 04:48):

@Huỳnh Trần Khanh Thanks! Fantastic reference w.r.t. the Kimina paper. Still think direct LSP access and more integrated systems would be helpful.

Opt (Jul 23 2025 at 04:58):

Huỳnh Trần Khanh said:

Justin Asher said:

Huỳnh Trần Khanh said:

but rather the SFT stage before RL

Would be happy to hear more on why you say this?

For Kimina Prover, the have statements are already in the chain of thought, not just in the final proof.

https://github.com/MoonshotAI/Kimina-Prover-Preview/tree/master

Distinct Reasoning Style: We carefully design a reasoning style that we call Formal Reasoning Pattern that bridges the gap between formal verification and informal mathematical intuition.

Quite obvious that this is intentional.

Interesting. Well it does seem to have the side effect of perhaps allowing the model to decompose the proofs in a nice manner. The source of the SFT data is Claude which the Kimi team instructed to align the informal and formal proofs into their format. Maybe then the formal proofs they used had a lot of have statements?

(deleted) (Jul 23 2025 at 05:00):

It must be some sort of synthetic data, the data we provided isn't that neatly decomposed into have statements

Andy Jiang (Jul 23 2025 at 05:04):

really impressive! will wait for tech report but is it following https://arxiv.org/pdf/2502.03438 and calling some policy LLM to generate each tactic one at a time? and doing some search over tactics? if so what does the pass@ number mean? like parallel instances of search? (this doesn't quite make sense to me bc wouldn't it just overlap in its work) or is it calling the LLM in an autoregressive entire-proof generation and then you do pass@?

would be interesting to see like model size and some numbers on compute needed :) I know I'll never get them from GDM and OAI so here's hoping for Bytedance!

Zheng Yuan (Jul 23 2025 at 05:06):

Andy Jiang said:

really impressive! will wait for tech report but is it following https://arxiv.org/pdf/2502.03438 and calling some policy LLM to generate each tactic one at a time? and doing some search over tactics? if so what does the pass@ number mean? like parallel instances of search? (this doesn't quite make sense to me bc wouldn't it just overlap in its work) or is it calling the LLM in an autoregressive entire-proof generation and then you do pass@?

would be interesting to see like model size and some numbers on compute needed :) I know I'll never get them from GDM and OAI so here's hoping for Bytedance!

Our approach is different from BFS-Prover. We will release a tech report with some details we can share.

Opt (Jul 23 2025 at 05:07):

Huỳnh Trần Khanh said:

It must be some sort of synthetic data, the data we provided isn't that neatly decomposed into have statements

Not sure I follow. If it's the SFT stage then it must be the data you gave right? Or do you mean you're building off a model that might have this in their SFT?

(deleted) (Jul 23 2025 at 05:14):

OK so the thing is it is pure speculation based on what we were told... and what the paper says. The Kimina prover team made us prove a lot of theorems. But our proofs are very messy. They are not clean. And they don't have that many have statements. Initially they wanted our proofs to be clean and well commented, but then as the project progressed they practically just let us do anything as long as the proof compiles. So maybe our proofs are fed into the model to teach it reasoning patterns, but this alone doesn't explain the abundance of have statements.

At the same time, the abundance of have statements is not emergent behavior. There are many of these statements in the chain of thought, and the chain of thought is explicitly engineered to have these have statements. They might use something else to train the model to create this chain of thought. The paper says Claude is used, but the paper might not tell the full story.

(deleted) (Jul 23 2025 at 05:14):

So I'm just a lowly data annotator and I have no say over how the model is trained. Well paid, but still.

Opt (Jul 23 2025 at 05:16):

Huỳnh Trần Khanh said:

OK so the thing is it is pure speculation based on what we were told... and what the paper says. The Kimina prover team made us prove a lot of theorems. But our proofs are very messy. They are not clean. And they don't have that many have statements. Initially they wanted our proofs to be clean and well commented, but then as the project progressed they practically just let us do anything as long as the proof compiles. So maybe our proofs are fed into the model to teach it reasoning patterns, but this alone doesn't explain the abundance of have statements.

At the same time, the abundance of have statements is not emergent behavior. There are many of these statements in the chain of thought, and the chain of thought is explicitly engineered to have these have statements. They might use something else to train the model to create this chain of thought. The paper says Claude is used, but the paper might not tell the full story.

Thanks a lot for the context!

Thomas Zhu (Jul 23 2025 at 06:01):

Jason Rute said:

As for the repo, at least zip them. I appreciate the desire to have others check that they are correct, but too many people train on github projects without thought.

We have zipped the results and deleted the commit history on GitHub, and removed the PutnamBench results. Thank you for the scrutiny, we appreciate it very much!

George Tsoukalas (Jul 23 2025 at 07:37):

Thanks Thomas! If you would like to have these results added to the leaderboard, please send the zipped proofs to me at george.tsoukalas@utexas.edu so I can confirm the result.

Auguste Poiroux (Jul 23 2025 at 08:53):

Very impressive! I wonder if you plan to try your model on DeepMind's formal-conjectures repo. Maybe your model will manage to solve an open problem there :eyes:

Zheng Yuan (Jul 23 2025 at 08:57):

Auguste Poiroux said:

Very impressive! I wonder if you plan to try your model on DeepMind's formal-conjectures repo. Maybe your model will manage to solve an open problem there :eyes:

This is our next step.

Floris van Doorn (Jul 23 2025 at 09:05):

Thomas Zhu said:

Junyan Xu I have taken the liberty to change the title of the topic to "Seed Prover Achieves Silver-Level Score at IMO 2025", because we haven't actually achieved the silver medal! The latter requires being a human :slight_smile:

This is a very impressive achievement!
One nitpick: am I correct that within the official time limit (2 * 4.5 hours), Seed Prover only scores 7 points, since all problems other than problem 2 took a day or more to complete?

Auguste Poiroux (Jul 23 2025 at 09:17):

Also, are there any plans to open-source the model/code? Maybe I missed the info

Zheng Yuan (Jul 23 2025 at 09:24):

Floris van Doorn said:

Thomas Zhu said:

Junyan Xu I have taken the liberty to change the title of the topic to "Seed Prover Achieves Silver-Level Score at IMO 2025", because we haven't actually achieved the silver medal! The latter requires being a human :slight_smile:

This is a very impressive achievement!
One nitpick: am I correct that within the official time limit (2 * 4.5 hours), Seed Prover only scores 7 points, since all problems other than problem 2 took a day or more to complete?

Within 4.5 hours, many useful lemmas had been proved. It is hard to say what score it could have obtained.

Joseph Myers (Jul 23 2025 at 10:24):

It's stated (in the English-translated announcement) that the problem statements were first formalized by humans. (a) Can we see the statements (including P6 if you prepared one of those) in the form in which they were presented to the AI (including in particular how "determine" problems were represented)? (b) Did you coordinate with any other AIs starting with human-prepared Lean statements to ensure that a single common version of the problems was used? (c) Was there a particular reason for preparing your own Lean statements rather than having the IMO provide them?

Joseph Myers (Jul 23 2025 at 10:31):

As of the start of the IMO, Ivan (Chief Coordinator) didn't know whether they might want me to provide Lean versions for use by AIs working with Lean, but it was concluded within a few days that all the AI developers working with the IMO would just take the informal statements from the IMO and start from there. (I was going to prepare Lean statements anyway regardless of whether they had been requested. The possibility of a request for officially provided statements was why I made sure the statements worked with Lean and mathlib v4.21.0 not just a more recent version. General principles that useful AIs for mathematics in Lean need to keep up with mathlib meant I wasn't concerned with whether statements would work with anything older than the most recent tagged release.)

Zheng Yuan (Jul 23 2025 at 11:04):

Joseph Myers said:

It's stated (in the English-translated announcement) that the problem statements were first formalized by humans. (a) Can we see the statements (including P6 if you prepared one of those) in the form in which they were presented to the AI (including in particular how "determine" problems were represented)? (b) Did you coordinate with any other AIs starting with human-prepared Lean statements to ensure that a single common version of the problems was used? (c) Was there a particular reason for preparing your own Lean statements rather than having the IMO provide them?

(a) We first prompt Seed-1.6 (i.e. ByteDance's flagship reasoning model) to obtain the answer, and a human then uses this answer to formalize the problem. (b) No, we did not coordinate with any other AI companies. Actually, we have no idea who is joining the competition and who is going to use Lean. (c) The IMO doesn't provide a formalized version to us. They only provide an English-version PDF with a LaTeX file each day.

Joseph Myers (Jul 23 2025 at 12:38):

It seems increasingly clear to me that secrecy has impaired the process of evaluating different AIs on IMO 2025 problems. We have one company offered a Lean version that they declined because of not using Lean, and another using Lean that could have been provided with a Lean version by the IMO but wasn't. Even at the start of the IMO, the Chief Coordinator had very limited information about what the evaluation process for AIs might look like, and detailed arrangements evolved during the IMO. Although I'd expressed my willingness to prepare Lean versions, there was no actual suggestion from the IMO to me that such a thing would be of use until after the IMO started (meaning mathlib doesn't have all the geometrical definitions that could potentially have been required - a lot more notice would have been needed to ensure all that API was filled out in advance).

A better process would have involved all concerned parties discussing things in a public forum starting at least several months ahead of the IMO, with all interested AI developers giving enough information about the form of input and output used by their AIs that we could tell whether an IMO-provided Lean version would make sense or not (and if an IMO-provided Lean version made sense on that basis, it would then have been possible to arrange for more than one person to be present at the IMO to prepare and check that version).

Jason Rute (Jul 23 2025 at 12:54):

Organization of AI for the IMO or even the Putnam (since that is next) is an interesting and important topic, but should probably be its own thread.

(deleted) (Jul 23 2025 at 13:00):

And I'm treating all claims of AI solving the IMO with skepticism.

Martin Dvořák (Jul 23 2025 at 15:38):

Did you also find verbatim copies of some code you wrote yourself in their repo?

Martin Dvořák (Jul 23 2025 at 15:45):

But back to the topic; how can I compile p1.lean for example? I cannot find any lakefile.

Thomas Zhu (Jul 23 2025 at 17:18):

Martin Dvořák said:

But back to the topic; how can I compile p1.lean for example? I cannot find any lakefile.

We use Lean v4.14.0, and Mathlib v4.14.0 as a dependency.

Martin Dvořák (Jul 23 2025 at 17:19):

Should I create a new project on Lean v4.14.0 and add p1.lean as the only file? Is this the intended way your solutions should be browsed?

Thomas Zhu (Jul 23 2025 at 17:24):

@Martin Dvořák currently that is the intended way they should be browsed. We are working on a web-based interactive view of our solutions.
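For anyone who wants to try this locally, a minimal sketch of such a project (the toolchain follows the Lean v4.14.0 mentioned above; the matching Mathlib tag is an assumption and may need adjusting to whatever revision actually builds with that toolchain):

```lean
-- lean-toolchain (a separate one-line file at the project root):
--   leanprover/lean4:v4.14.0

-- lakefile.lean
import Lake
open Lake DSL

package «imo2025»

-- Assumed tag: pin whichever Mathlib revision builds with Lean v4.14.0.
require mathlib from git
  "https://github.com/leanprover-community/mathlib4.git" @ "v4.14.0"

-- Copy p1.lean (etc.) into the Imo2025/ directory, run `lake exe cache get`,
-- then `lake build`.
@[default_target]
lean_lib «Imo2025»
```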

Kevin Buzzard (Jul 23 2025 at 17:28):

Jason Rute said:

Organization of AI for the IMO or even the Putnam (since that is next) is an interesting and important topic, but should probably be its own thread.

I wonder whether the upshot of the events of this year is that the tech companies now will lose interest in future IMO participation completely because "IMO gold was achieved in 2025" (and all the subtleties around this claim will be forgotten). Perhaps potential investors in tech companies will find it hard to understand why "IMO gold achieved in 2026" is interesting.

Justin Asher (Jul 23 2025 at 17:37):

@Kevin Buzzard Yep, was thinking about this. Now, solving an open conjecture is the new benchmark.

Justin Asher (Jul 23 2025 at 17:40):

Probably will happen before the end of the year...

Kevin Buzzard (Jul 23 2025 at 17:58):

It's easy to solve an "open conjecture" -- just make up a boring problem which is easy but has never been asked before. AI solved the Robbins conjecture (https://xenaproject.wordpress.com/2019/07/06/a-computer-generated-proof-that-nobody-understands/) which was open for decades but it's not clear that anyone was working on it. What's hard is to solve an open conjecture which mathematicians would generally agree was interesting and difficult, and that smart humans had tried and failed to solve. I am not sure that this will happen within the next 5 years, let alone before the end of this year.

Adam Kurkiewicz (Jul 23 2025 at 18:00):

@Kevin Buzzard I doubt that big AI labs are as focused on “what investors think” as you think they are. Especially under the US model shareholders are quite weak because of how fragmented they are. This in practice leaves quite a lot of power in the hands of the CEO (the board can only really hire and fire the CEO; in practice they can do little else). And if the rumours about Demis Hassabis being groomed as the next CEO of Google are true, this should put even more wind in the sails of Google DeepMind. So they should be able to get a lot done. I hope this will include maths, although it does seem like the Lean effort in particular is no longer the core focus.

But your/Justin’s conclusion seems correct. Shockingly it does seem like the IMO got nearly saturated as a benchmark. It will have to now be open conjectures (as formidable a challenge as this sounds).

Justin Asher (Jul 23 2025 at 18:11):

Kevin Buzzard said:

just make up a boring problem which is easy but has never been asked before.

I mean a sufficiently difficult conjecture that would impress mathematicians. Of course, an REU problem is not going to impress. The DeepThink solutions show that the models can prove relatively difficult individual statements. Putting a model like this into an agentic framework where they can work on a paper over the course of many days would likely produce good results. I can write an open source framework for this if people are interested.

Thomas Zhu (Jul 23 2025 at 18:16):

I agree with Kevin's message: It is unclear whether AIs that excel at IMO can transfer to important open problems where hard-to-find approaches, vast knowledge of previous work, and ability beyond proving a single, isolated, fixed theorem are called for. At ByteDance, we also tried to evaluate non-high-school-math performance by evaluating on PutnamBench and miniCTX. The 80% score on miniCTX means that a random theorem taken from an open source Lean repo has an 80% chance of being solved in our lightweight setting. However, this ability is still minuscule compared to true research math ability; I think important open questions will be the ultimate benchmark!

Justin Asher (Jul 23 2025 at 18:24):

@Thomas Zhu The problem with evaluating on existing statements is that the models are often only trained on high school mathematics, or they lack strong agentic capabilities. (I do not know how this applies to your work.) This is why I am interested in search (cf. LeanExplore) combined with autoformalization. I had an agentic loop with o4-mini which should be able to prove arbitrary statements in existing repositories, albeit it was super slow, so I started working on other things.

Justin Asher (Jul 23 2025 at 18:26):

The 80% score is promising, especially since it is your light-weight setting!

Jason Rute (Jul 23 2025 at 18:36):

To combine both topics, since you have the best Lean AI (save maybe AlphaProof), have you tried the problems in https://github.com/google-deepmind/formal-conjectures? :slight_smile:

Thomas Zhu (Jul 23 2025 at 19:01):

We have not tried formal-conjectures yet.

Joseph Myers (Jul 23 2025 at 19:14):

Every so often someone solves an open conjecture with a fairly simple solution based on known techniques that no-one had previously spotted could be applied to solve that problem. I fully expect that sort of thing is within reach of current AIs. It might however take millions of dollars of compute applied to a thousand problems to find one such solution, so the question then is whether the AI company would disclose the details of the compute used and all the problems the AI didn't solve, or just show off the one it did solve.

(Did I mention I dislike the general secretiveness of the AI industry?)

Phillip Harris (Jul 23 2025 at 19:24):

Have they released any more detail about the architecture? Like the models involved and their sizes.

Joseph Myers (Jul 23 2025 at 19:27):

It was suggested at IMO 2025 that IMO 2026 wouldn't be of interest to AI companies because it would be too easy for them to solve all six problems then. Apart from whether that's actually a good extrapolation from current data (going from "gold boundary minus one point" to "exactly on the gold boundary" is comparatively small progress between 2024 and 2025), there are plenty of things that could reasonably be assessed based on AI attempts at IMO 2026 if you're concerned with getting scientifically meaningful data rather than just something that makes a good press release for the AI companies (and happy to work with those AIs interested in contributing to such data even if some of the big AI companies aren't interested).

  • As above, can the AIs solve all six problems, which is surely the remaining AI goal for the IMO beyond achieving the gold threshold?
  • What can be achieved with different AI techniques? Results with a technique that didn't previously do so well would arguably be more interesting than repeating success with a technique that's already known to do well. Just because "AI" has been demonstrated to do something doesn't mean we're anywhere close to knowing the potential of different ideas in the general space of AI, or what approaches or combination of approaches will end up being the best.
  • What can be achieved with given resource limits? The less resources AIs need to solve problems, the more potentially useful they are.
  • What can be done with open-weight AI, or, better, with fully open-source AI (everything about the training process open, including all training data and all code used to produce synthetic training data)?
  • For AIs working with Lean or other formal languages: can they work usefully with the latest released version and a single standard version of the problems, even if training occurred with older versions? Keeping up to date with mathlib is important for practical use of such AIs. (The related issue, "can they work usefully with definitions newly added to mathlib?", isn't so readily testable through competition problems, though very relevant for working with research mathematics.)
  • Suppose you run the same AI many times on the same problems (if resource consumption permits). What's the distribution of different solution approaches found? Do different AIs prefer different solution approaches? Can AIs be asked to look for genuinely different solutions and so help explore the solution space that way? Learning about the space of possible solutions to a problem is valuable for writing mark schemes for that problem; the PSC tries to include different approaches in the shortlist booklet and then more may be found during the IMO.

Justin Asher (Jul 23 2025 at 19:40):

Why not just train an RL system which goes through papers on the arXiv and, each time it sees an open problem, tries to solve it? I think DeepThink could already start solving some of these, and since these models are now able to perform self-verification, it should be able to become increasingly better at mathematics.

Justin Asher (Jul 23 2025 at 19:43):

Joseph Myers said:

  • For AIs working with Lean or other formal languages: can they work usefully with the latest released version and a single standard version of the problems, even if training occurred with older versions? Keeping up to date with mathlib is important for practical use of such AIs.

I was thinking we should create a Lean ML library for this purpose, which would include tools for data indexing, search, verification, and so on. This would build on top of existing repositories like the REPL, Pantograph, SafeVerify, LeanInteract, and so on. I think there are a lot of competing standards right now (at least companies internally seem to have their own tools), and that having one central standard which is up-to-date and used by everyone would be good—at least everyone working on open-source. A lot of existing libraries seem to lag behind the current Lean version, so making a bot which submits PRs to Mathlib is not even really possible.

Joseph Myers (Jul 23 2025 at 20:04):

A lot of open problems on the arXiv are probably of the form "the authors thought about it for five minutes and didn't immediately spot how to solve it" (and probably don't have many people other than the authors interested in a solution either). So solving one of them is a fairly weak signal of AI abilities (and attempting the maybe millions of such problems there could be in total on the arXiv in bulk, regardless of their value, seems very much like something that would make more sense to do later, once the cost of such AIs has gone down a lot - and any solution found needs to be of enough interest to someone for them to check it, or at least to check the definitions and statement in the case of a formal solution), compared to problems that have stood the test of time and lots of people have shown an interest in (which there are still plenty of available).

Justin Asher (Jul 23 2025 at 20:05):

That's fair. I was more so just thinking about
(a) providing some immediate value to the community and
(b) getting a dataset which is (largely) uncontaminated.

Joseph Myers (Jul 23 2025 at 20:06):

Solving a conjecture from formal-conjectures is more likely to be of wide interest than a conjecture from a random arXiv paper.

Junyan Xu (Jul 23 2025 at 21:21):

Don't forget about MathOverflow; AI can earn reputation by solving questions posted there. Also if the AI finds a solution to a casual question in an arXiv paper, it can write an email to the author(s), and maybe it will become a collaborator.

Junyan Xu (Jul 23 2025 at 21:28):

It impresses me that the heavyweight Seed Prover is building a lemma pool with problems it tries to prove or refute. To me this is a sign that Seed Prover has become a developing mathematician, maybe not a good one yet, but still.

David Michael Roberts (Jul 23 2025 at 21:33):

https://www.reddit.com/r/math/comments/1m6sooc/a_brief_perspective_from_an_imo_coordinator/

David Michael Roberts (Jul 23 2025 at 21:52):

Junyan Xu said:

Don't forget about MathOverflow; AI can earn reputation by solving questions posted there. Also if the AI finds a solution to a casual question in an arXiv paper, it can write an email to the author(s), and maybe it will become a collaborator.

I am a moderator on MathOverflow, and the answer is a solid NO. This is NOT welcome there.

David Michael Roberts (Jul 23 2025 at 21:53):

Making the analogy to IMO, AIs and their owners/operators would not be allowed at the venue, they would get booted, and get no official help at all.

Bo Qin (Jul 23 2025 at 22:37):

Seems like it's just doing clustering

Bo Qin (Jul 23 2025 at 22:39):

I doubt LLMs can solve novel problems where the format hasn't been seen before

Justin Asher (Jul 23 2025 at 22:39):

Bo Qin said:

Seems like it's just doing clustering

I am curious what you mean by this?

Bo Qin (Jul 23 2025 at 22:40):

https://en.wikipedia.org/wiki/Cluster_analysis

Bo Qin (Jul 23 2025 at 22:40):

Chain of thought is just breaking the bigger problem into smaller ones where the learner can easily cluster those smaller ones with small proof formats it has already seen

Bo Qin (Jul 23 2025 at 22:48):

Coming up with a novel proof format has more to do with combinatorics

Bo Qin (Jul 23 2025 at 22:49):

aka brute forcing all the methods you know

Opt (Jul 24 2025 at 00:48):

Justin Asher said:

Why not just train an RL system which goes through papers on the arXiv and, each time it sees an open problem, tries to solve it? I think DeepThink could already start solving some of these, and since these models are now able to perform self-verification, it should be able to become increasingly better at mathematics.

We don't know how far out of distribution the verification ability extends, though I'm sure OAI and DM are already trying to do so, not just on open problems but even on solved theorems on arXiv as part of their RL.

Andy Jiang (Jul 24 2025 at 02:30):

Joseph Myers said:

Solving a conjecture from formal-conjectures is more likely to be of wide interest than a conjecture from a random arXiv paper.

There are some relative merits to solving arXiv conjectures, no? Like it shows the underlying system has some grasp of a large breadth of knowledge if it's capable in almost all areas. I don't know what's in the formal-conjectures repo, but somehow I doubt it's as representative/comprehensive in breadth, and probably the difficulty will also not allow us to judge the capability spread of the systems.

Andy Jiang (Jul 24 2025 at 02:33):

David Michael Roberts said:

Junyan Xu said:

Don't forget about MathOverflow; AI can earn reputation by solving questions posted there. Also if the AI finds a solution to a casual question in an arXiv paper, it can write an email to the author(s), and maybe it will become a collaborator.

I am a moderator on MathOverflow, and the answer is a solid NO. This is NOT welcome there.

I don't understand this animosity (presumably it's targeted against current systems, and also if one wants to ask AI I guess one often has the option to do it directly). But don't you think that being able to answer a large portion of MO questions also shows some ability of the AIs (they could show this without active participation)?

Justin Asher (Jul 24 2025 at 02:34):

Just to clarify, the point about arXiv was not that random conjectures are useful, but that the model practicing mathematics on questions people are already asking and then self-improving via this process is useful.

Andy Jiang (Jul 24 2025 at 02:43):

By the way, even if you didn't intend it, I do think solving random conjectures on arxiv is useful (at the very least to the people who wrote them in the paper). But also in aggregate to our collective understanding of the field.

David Michael Roberts (Jul 24 2025 at 02:44):

@Andy Jiang It's about site culture. If someone uses an AI system to answer a question offline, and understands the answer and can vouch for it, and the answer is written in a way indistinguishable from how a professional mathematician can write, then clearly we can't tell. But the amount of AI-generated junk we have to delete posted by people who do not have enough mathematical understanding to check if it's even reasonable, let alone right (and, for that matter, check the formatting syntax is sensible and consistent, and not just TeX output pasted on MO) - you can understand that this drives away high-value users who don't want to waste their time on what might turn into an API for an LLM that they could already consult themselves.

David Michael Roberts (Jul 24 2025 at 02:46):

The AI doesn't need MO rep, by the way, to somehow justify its ability. That is only used to slowly accrue moderation-like powers by osmosis in the site culture, as to what fits, how things work and so on. On MO reputation points are deliberately hidden from users by default.

There's nothing wrong with taking an MO question, running an AI over it, and storing the output on one's own blog/website/GitHub repo. Just don't paste it in the answer box, please. It's against site policy.

Andy Jiang (Jul 24 2025 at 02:47):

A question to the seed team: of course this is not a critique--since all the AI teams announced so far have failed on P6 not just you guys--but do you view the limitations of the system as mostly coming from limitations of the Seed LLM that you're calling to guess the solution? Especially if finding the solution is the bulk of the work--it does seem like it's not able to recognize that and devote most of its compute to that (like it's not clear if you call an LLM a million times whether it guesses a hard answer). Do you have some plans to integrate the prover part with the answering part (maybe share some resources in both directions and to have an RL system train the LLM to guess the solution given partial progress on the prover side etc)

Thomas Zhu (Jul 24 2025 at 02:52):

but do you view the limitations of the system as mostly coming from limitations of the Seed LLM that you're calling to guess the solution?

Yes, I have yet to see any LLM even guess the correct answer or come close to the construction.

Thomas Zhu (Jul 24 2025 at 02:59):

Do you have some plans to integrate the prover part with the answering part

I think this question is more general than fill-in-the-blank competition questions: for humans in general, one constantly modifies the statement one originally wanted to prove. For example, if you realize mid-proof that a condition is needed, you may strengthen the conditions; or you may start from an undetermined bound and only derive its precise value after a proof. This is certainly an interesting ability to explore.

LiChenggang (Jul 24 2025 at 03:42):

Andy Jiang said:

A question to the seed team: of course this is not a critique--since all the AI teams announced so far have failed on P6 not just you guys--but do you view the limitations of the system as mostly coming from limitations of the Seed LLM that you're calling to guess the solution? Especially if finding the solution is the bulk of the work--it does seem like it's not able to recognize that and devote most of its compute to that (like it's not clear if you call an LLM a million times whether it guesses a hard answer). Do you have some plans to integrate the prover part with the answering part (maybe share some resources in both directions and to have an RL system train the LLM to guess the solution given partial progress on the prover side etc)

That's a very good question, and will be one of our key research focuses in the future.

Justin Asher (Jul 24 2025 at 05:17):

@Thomas Zhu How well have you seen LLMs do on P6? I am getting some pretty good bounds on the answer which seem reasonable, but need to set up a more complete agentic system with branching to explore ideas more quickly.

Thomas Zhu (Jul 24 2025 at 05:50):

@Justin Asher I have not seen them make any partial progress at all (except for trivial bounds like ≤ 4048). Do you have an example answer?

Yan Yablonovskiy 🇺🇦 (Jul 24 2025 at 05:58):

Junyan Xu said:

Don't forget about MathOverflow; AI can earn reputation by solving questions posted there. Also if the AI finds a solution to a casual question in an arXiv paper, it can write an email to the author(s), and maybe it will become a collaborator.

It's a cool concept, but it does remove a lot of attainable work for researchers.

In my view, those are often the source of consistent research output; I would be hesitant to let AI just take all of those and then have to hope there is funding for known open problems that have been 'demonstrably hard.'

I can't imagine it not having a negative impact on short-term post-doc positions, in my opinion.

Martin Dvořák (Jul 24 2025 at 07:39):

Which theorem in p4.lean provides the final answer to the question (the set of all possible values of $a_1$ is ...)?

Zheng Yuan (Jul 24 2025 at 08:20):

Martin Dvořák said:

Which theorem in p4.lean provides the final answer to the question (the set of all possible values of $a_1$ is ...)?

imo2025_p4_left, imo2025_p4_right

Martin Dvořák (Jul 24 2025 at 08:39):

I don't understand the solution. Is Odd (padicValNat 2 (a 0)) ∧ padicValNat 2 (a 0) < 2 * padicValNat 3 (a 0) ∧ padicValNat 5 (a 0) = 0 supposed to be the answer to the question that the IMO problem was asking? Where is the formalization of the statement of the problem?

Zheng Yuan (Jul 24 2025 at 09:01):

Martin Dvořák said:

I don't understand the solution. Is Odd (padicValNat 2 (a 0)) ∧ padicValNat 2 (a 0) < 2 * padicValNat 3 (a 0) ∧ padicValNat 5 (a 0) = 0 supposed to be the answer to the question that the IMO problem was asking? Where is the formalization of the statement of the problem?

The answer is any number n whose 2-adic valuation is odd, whose 3-adic valuation is larger than half of the 2-adic valuation, and whose 5-adic valuation is zero (i.e. 2^k||n with k odd, 3^m||n with 2m>k, and n not divisible by 5), which is the same as 6 * 12^m * p (where p is coprime to 10).
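To spell out the equivalence between the two descriptions (my own derivation, writing $t$ for the exponent in the second form to avoid clashing with the $m$ above): let $k = v_2(n)$ and $m = v_3(n)$. If $k = 2t+1$ is odd and $2m > k$, then $m \ge t+1$, so $n = 2^{2t+1} \cdot 3^m \cdot q = 6 \cdot 12^t \cdot (3^{m-t-1} q)$ with $q$ coprime to $6$, and since $v_5(n) = 0$ the last factor is coprime to $10$. Conversely, $v_2(6 \cdot 12^t \cdot p) = 2t+1$ is odd, $v_3 \ge t+1 > (2t+1)/2$, and $v_5 = 0$ when $p$ is coprime to $10$.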

Martin Dvořák (Jul 24 2025 at 09:06):

All right, but this is something nontrivial that the reader must do in their head. Why is there no iff theorem, or even better, a theorem about equality of sets (the set from the statement and the set that represents the solution)?

Martin Dvořák (Jul 24 2025 at 09:07):

The file looks as if you cut the solution right before the final answer.

Thomas Zhu (Jul 24 2025 at 09:08):

@Martin Dvořák We have a fixed strategy (fixed before the IMO) to decompose any conjunction problem (including iff/equality of sets/IsLeast) into left and right parts and prove them separately, simply to facilitate our conjecturing pipeline.
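For readers following along, a toy sketch (hypothetical names and a made-up property, not the actual contents of p4.lean) of the shape such a left/right decomposition takes, together with the combined statement being asked about:

```lean
import Mathlib

-- Toy stand-ins: `4 ∣ n` plays the role of the problem's conditions, and
-- `answerSet` plays the role of the proposed closed-form answer.
def answerSet : Set ℕ := {n | 2 ∣ n ∧ 4 ∣ n}

-- "left" direction: anything satisfying the conditions lies in the answer set.
theorem toy_left (n : ℕ) (h : 4 ∣ n) : n ∈ answerSet := by
  simp only [answerSet, Set.mem_setOf_eq]
  exact ⟨dvd_trans (show (2 : ℕ) ∣ 4 by norm_num) h, h⟩

-- "right" direction: anything in the answer set satisfies the conditions.
theorem toy_right (n : ℕ) (h : n ∈ answerSet) : 4 ∣ n := by
  simp only [answerSet, Set.mem_setOf_eq] at h
  exact h.2

-- The single combined characterization that the two halves amount to.
theorem toy_iff (n : ℕ) : 4 ∣ n ↔ n ∈ answerSet :=
  ⟨toy_left n, toy_right n⟩
```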

Martin Dvořák (Jul 24 2025 at 09:12):

Yes, that would be reasonable if the statements of the "->" and "<-" implications were definitionally equivalent up to the direction (or even better, syntactically equivalent), but not if they are (nontrivially) propositionally equivalent.

Joseph Myers (Jul 24 2025 at 11:55):

P4 does provide a very good example of a "determine" problem where there are many different reasonable ways to express the answer (and so automated verification of whether a particular version of the answer is legitimate would be hard).

Martin Dvořák (Jul 24 2025 at 13:26):

You are right that there are several ways to express the answer in Lean. However, if your solution of "characterize the set $S$ described in the problem statement" consists of a proof $S \subseteq T$ and a proof $R \subseteq S$, you should also prove that $T \subseteq R$. Without the last step, the solution is incomplete. Assuming that the AI knows that $T = R$ and hence considers the solution to be finished, I don't know why the AI keeps the proof of $T = R$ to itself. It should be a part of the solution.

Bo Qin (Jul 24 2025 at 14:42):

David Michael Roberts said:

the amount of AI-generated junk we have to delete posted by people who do not have enough mathematical understanding to check if it's even reasonable, let alone right

That just means those people aren't qualified to use AI and post the AI answers. You always need a qualified human to verify the results at the end of the aggregate function.

Justin Asher (Jul 24 2025 at 16:55):

Thomas Zhu said:

Justin Asher I have not seen them make any partial progress at all (except for trivial bounds like ≤ 4048). Do you have an example answer?

No, I just wrote a basic script and told it to start deriving general theory. I had a lot of success previously on Putnam problems using basic models (e.g., Gemini 2.0 Flash) and having them write out a bunch of ideas in a paper-like format. What this means is that each time the model wrote a section to add to the paper, reviewers would check that it was correct, and then this data would be added. This allows the model to build up theory recursively while knowing that it is correct.

The big issue with this particular problem is that the model, even after extensive prompting, seems unable to identify that it needs to use the Erdős–Szekeres Theorem. I think this is a combination of (a) the problem is indeed very difficult and (b) the intuition for the problem is visual. Seeing how much models struggle on tasks like the ARC-AGI challenge, this is likely more due to how the model is trained.

I think reformulating the problem into something which is less visual would probably help the model significantly.

Andy Jiang (Jul 24 2025 at 17:34):

sorry actually I have a stupid question. In this case, all the Seed (non-prover) LLM needs to do is guess the minimum number right? So actually in this case the main difficulty is actually on the prover side not the guesser part? because if the prover were good, then you could just like binary search to get the optimal bound if you know what I mean. like say you know that <4048 or whatever, just try to prove both > and <2024 etc until you narrow it down. so basically the main question for this particular problem is how much compute you need to prove it (that there exists a construction for n) for the optimal bound n

Andy Jiang (Jul 24 2025 at 17:42):

I actually wonder if for the prover system, this erdos theorem is that hard(?)--I assume it's used to establish the lower bound--to figure out that it needs to be applied? [sorry I didn't try to solve/read the solution of this problem so maybe i'm saying complete bs] Or it's actually generating this example at that n. For the latter, it kind of reminded me of the equational theories thing that Terry did where some SAT solvers gave easy arguments for where examples can't exist, and then for the rest you have this hard problem of actually finding an example where theory can't rule it out easily. Rambling a little, but maybe it's the case that generating such problems (for training data) isn't that hard, especially if you have a robust system (like in the SAT solver case) where ruling out examples is easy. Then you just like generate a lot of (nice?) examples of something, and then you try to use your solver to find some characteristics of your example which is extremal. And then that's a training sample for your model. Obviously very speculative but...

Andy Jiang (Jul 24 2025 at 18:05):

Come to think of it, now that the goal of competing in the IMO like a human and getting gold is achieved, it would be nice to see how well a system that also has code execution at its disposal does? presumably with access to Python, the models could find the construction for P6?

Joseph Myers (Jul 24 2025 at 23:51):

Certainly I think a "determine" problem should have a single answer provided in any claimed formal solution, rather than separate answer for each direction.

Joseph Myers (Jul 24 2025 at 23:55):

On P6, there is a motivated way to find the construction: considering the edges that the Xs (1x1 squares not in any tile) share with the tiles, by double counting you see you want almost all the edges of tiles to be shared with Xs, which leads to the well-known tiling by squares of two different sizes. (It's also been remarked that this tiling can be seen in the floor at Sunshine Coast Airport, so contestants not already aware of it had a chance to learn about it on their way to the IMO.)

Joseph Myers (Jul 25 2025 at 00:04):

If you want a "less visual" statement, my formal statement expresses P6 in terms of intervals in the order on a product type Fin 2025 × Fin 2025. But a "less visual" statement may not help when the intuition behind solutions (there are many different solutions) is generally visual.

Opt (Jul 25 2025 at 00:09):

I asked o4-mini-high for n=9. It came up with the wrong solution. Then I told it the right one and asked it to identify the theorem needed to prove the bound. It was able to correctly identify the Erdős–Szekeres Theorem.

Justin Asher (Jul 25 2025 at 00:11):

Huh, neat. Gemini 2.5 Pro could not identify Erdős–Szekeres even after significant prodding and asking for many potential theorems.

Joseph Myers (Jul 25 2025 at 00:15):

Erdős–Szekeres is only one of many different approaches to proving the bound for this problem.

Justin Asher (Jul 25 2025 at 00:19):

That's fair. I did not read through the solutions thoroughly. Just surprised me the model never mentioned that theorem.

Opt (Jul 25 2025 at 00:24):

Justin Asher said:

Huh, neat. Gemini 2.5 Pro could not identify Erdős–Szekeres even after significant prodding and asking for many potential theorems.

Gemini with a similar hint identifies Dilworth's theorem, which ChatGPT tells me is the same as the Erdős–Szekeres theorem, just in poset language.

Notification Bot (Jul 25 2025 at 12:07):

6 messages were moved from this topic to #Machine Learning for Theorem Proving > Delta Prover by Jason Rute.

Moritz Firsching (Jul 30 2025 at 13:49):

Joseph Myers said:

Solving a conjecture from formal-conjectures is more likely to be of wide interest than a conjecture from a random arXiv paper.

Currently there might be a bias in formal-conjectures towards harder and more famous problems, and also problems from https://www.erdosproblems.com/. However, conjectures from random arXiv or published papers are very welcome and definitely in scope! So if you see a suitable conjecture in a random arXiv paper (or in your paper),
feel invited to open an issue.
Currently we only have 9 conjectures from the arXiv: https://github.com/google-deepmind/formal-conjectures/tree/main/FormalConjectures/Arxiv
and arguably some of them are somewhat famous and not random (but not famous enough to have a Wikipedia article).
If you feel like formalizing a random arXiv paper, look for https://github.com/google-deepmind/formal-conjectures/issues?q=is%3Aissue%20state%3Aopen%20label%3Aarxiv (currently there is only one issue there: https://github.com/google-deepmind/formal-conjectures/issues/227 ).

Chris Hughes (Jul 30 2025 at 15:41):

Is it fair to say that claiming only a silver medal is a relatively modest claim? It seems like 5 problems were solved, but problem 1 took too long to solve, so the Seed Prover team didn't count it (source). Were the other systems playing by those rules, where taking too long discounted the solution?

Thomas Zhu (Jul 30 2025 at 18:54):

Chris Hughes said:

Is it fair to say that claiming only a silver medal is a relatively modest claim? It seems like 5 problems were solved, but problem 1 took too long to solve, so the Seed Prover team didn't count it (source). Were the other systems playing by those rules, where taking too long discounted the solution?

In order for IMO to grade our results, they required submission by a specific time on July 18 (about 3 days after we got the problems). Seed-Prover solved 4 problems by the deadline and 1 other problem after the deadline. Perhaps one might interpret it as "getting gold-level score in 7 days" but this score would only be verified by Lean and not the IMO (hence the "IMO-certified score of 30 points" in our news release). We are working on significantly reducing our computational time.

Thomas Zhu (Jul 30 2025 at 19:00):

By the way, Huawei claimed 34/42 (backed by a quote from IMO). I could only find relevant news in Chinese. They claimed to use "formal proofs" in the news release, but the final PDFs were in natural language, and I assume they won't have a tech report, so I don't know the technical details.

Thomas Zhu (Aug 01 2025 at 00:57):

Our tech report is now available at: https://arxiv.org/pdf/2507.23726

Joseph Myers (Aug 01 2025 at 01:23):

Could you release the dataset mentioned of 155 past IMO problems (under a suitable license for reuse) and the details of which ones you solved? Any statements not currently in Compfiles might well be of interest for that repository; I don't know what the Compfiles attitude is to taking AI-generated proofs, but there's plenty of scope for starting from such a proof and then cleaning it up heavily.

That the dataset mainly taken from existing repositories has many fewer combinatorics problems than algebra or number theory certainly indicates a bias in what statements people have chosen to formalize. To be fair, combinatorics statements take a lot longer to formalize than algebra or number theory statements (but they don't tend to run into missing definitions the way geometry statements still do, except for some problems expressed geometrically).

Thomas Zhu (Aug 01 2025 at 01:27):

@Joseph Myers We have released our solutions to the past IMO problems here: https://github.com/ByteDance-Seed/Seed-Prover/blob/main/SeedProver/imo.zip. We will release all problem statements, including ones not solved, shortly.

Joseph Myers (Aug 01 2025 at 01:34):

Thanks!

I'm interested myself in high-quality statements formalized following uniform conventions and covering all problems from given years without selection bias on which problems someone chose to formalize the statement of (I hope to extend IMOLean to earlier years at some point, after filling in more of the gaps in geometrical definitions in mathlib), but less uniform statements with selection bias are still certainly useful for evaluating solvers (mathematicians writing formal statements of research problems to ask AIs for help on those problems won't use uniform conventions) and also useful for comparison when writing one's own formal statements (ideas from someone else's formal statement can help write a better statement following one's own preferred conventions).

Zheng Yuan (Aug 01 2025 at 06:12):

Joseph Myers said:

Thanks!

I'm interested myself in high-quality statements formalized following uniform conventions and covering all problems from given years without selection bias on which problems someone chose to formalize the statement of (I hope to extend IMOLean to earlier years at some point, after filling in more of the gaps in geometrical definitions in mathlib), but less uniform statements with selection bias are still certainly useful for evaluating solvers (mathematicians writing formal statements of research problems to ask AIs for help on those problems won't use uniform conventions) and also useful for comparison when writing one's own formal statements (ideas from someone else's formal statement can help write a better statement following one's own preferred conventions).

https://github.com/ByteDance-Seed/Seed-Prover/blob/main/SeedProver/imo_all.lean

We have multiple versions of IMO2023P1, please just ignore them.

Eric Wieser (Aug 01 2025 at 09:06):

Is imo_all.lean exhaustive coverage of all non-geometry past IMO problems?

Zheng Yuan (Aug 01 2025 at 12:16):

Eric Wieser said:

Is imo_all.lean exhaustive coverage of all non-geometry past IMO problems?

No, mainly from compfiles. We only added 30+ formal statements ourselves.

Justin Asher (Aug 01 2025 at 13:56):

Justin Asher said:

Probably will happen before the end of the year...

I know this is not the Riemann hypothesis or anything, but it seems like we are heading in the right direction to start solving some good conjectures by the end of the year. The DeepThink demo shows it solving a small conjecture:
https://www.youtube.com/watch?v=QoXRfTb7ves

I think if you hooked this up to a larger agentic system you could start writing some decent research papers, but that would be terribly expensive.

I am super excited to see how systems like Seed Prover and DeepThink contribute to this.

Junyan Xu (Aug 01 2025 at 17:57):

gemini-2-5-deep-think-solves-previously-unproven-v0-fbr5jhuneegf1.webp

Junyan Xu (Aug 01 2025 at 17:57):

The conjecture that was solved, from https://www.reddit.com/r/singularity/comments/1metslk/comment/n6c3dhm/

Junyan Xu (Aug 01 2025 at 18:08):

image.png
Comments from YouTube; it looks like the conjecture had already been solved in 2023.

Jason Rute (Aug 01 2025 at 18:10):

I wonder if AI ever solves RH, if it will have just taken the proof from some supposed crank who was right all along. :slight_smile:

Anh Nguyễn (Aug 01 2025 at 18:41):

Junyan Xu said:

gemini-2-5-deep-think-solves-previously-unproven-v0-fbr5jhuneegf1.webp

I know that comparing it to FrontierMath is not a sensible thing to do (because FM is about finding a numerical answer). But which is more difficult?

Bhavik Mehta (Aug 01 2025 at 19:46):

Junyan Xu said:

Comments from YouTube; it looks like the conjecture had already been solved in 2023.

Note that the first speaker (Michel) of the video is also an author of the paper which solved the conjecture. It seems to me that DeepThink came up with its own solution, along different lines from the existing proof, but the problem wasn't previously unsolved.

Jason Rute (Aug 01 2025 at 20:10):

That was confusing. Thanks for clearing it up!

Bhavik Mehta (Aug 01 2025 at 20:27):

The video certainly didn't make this clear; I had an advantage in that I know the speaker, so I recognised both his face and his surname from the YouTube comment reference :)

Joseph Myers (Aug 01 2025 at 22:44):

I think finding alternative solutions and exploring the solution space of a problem is an interesting thing to do; indeed, I mentioned it above as one way in which evaluating AIs on future IMOs seems worthwhile even if "can the AI solve the problems?" isn't an interesting question at that point. (Problems do need a certain minimum difficulty level to have an interesting solution space, so I'm not sure you'd get much out of such an evaluation on the easier parts of miniF2F, for example. But national olympiad and IMO-easy problems can certainly have interesting solution spaces.)

Finding alternative solutions to a solved problem (with or without knowledge of solutions you're looking for an alternative to) is, however, very different from solving an unsolved problem.

To keep this more on topic as far as Lean is concerned: formalizing alternative proofs of something is also of value if they result in contributing different lemmas to mathlib, or in insights into how to formalize the different concepts involved in the different proofs.

Opt (Aug 03 2025 at 04:47):

Thomas Zhu said:

Our tech report is now available at: https://arxiv.org/pdf/2507.23726

Thanks, @Thomas Zhu. If it's something you can talk about: how do you generate the natural language hints and proofs? Do you have a separate LLM which you solicit for that, or does your Lean LLM generate that automatically as part of its chain of thought?

David Michael Roberts (Aug 03 2025 at 11:54):

@Bhavik Mehta do you mean the video was intentionally edited so that he says it came up with a different proof but never mentions that the result was already proved? I'm shocked... shocked!

Zheng Yuan (Aug 03 2025 at 13:00):

Opt said:

Thomas Zhu said:

Our tech report is now available at: https://arxiv.org/pdf/2507.23726

Thanks, Thomas Zhu. If it's something you can talk about: how do you generate the natural language hints and proofs? Do you have a separate LLM which you solicit for that, or does your Lean LLM generate that automatically as part of its chain of thought?

same model here

Ayush Agrawal (Aug 03 2025 at 22:22):

@Thomas Zhu The paper talks about a cool interface called LooKeng that would help with a lot of theorem proving tasks. Is this interface open-sourced? Also, can we know the size of Seed-Prover?

Zheng Yuan (Aug 04 2025 at 02:25):

Ayush Agrawal said:

Thomas Zhu The paper talks about a cool interface called LooKeng that would help with a lot of theorem proving tasks. Is this interface open-sourced? Also, can we know the size of Seed-Prover?

Lookeng will be open-sourced.

Ayush Agrawal (Aug 04 2025 at 20:21):

Thanks @Thomas Zhu! One more question: Is the lean4 compiler feedback (incorporation of errors etc. into the prompt) happening online during RL training? It was not totally clear from the paper. Feel free to point to me specific parts of the paper if I missed something.

Jared green (Aug 04 2025 at 21:13):

when can we use seedprover?

Simon Sorg (Aug 05 2025 at 08:53):

Ayush Agrawal said:

One more question: Is the lean4 compiler feedback (incorporation of errors etc. into the prompt) happening online during RL training? It was not totally clear from the paper. Feel free to point to me specific parts of the paper if I missed something.

Similar problem. I'm still not entirely sure what was done during training and what is part of the (very impressive) test-time inference. So you propose easier conjectures during training, and have some multi-task learning objective that trains to generate lemmas and use them at the same time but also to prove starting with already existing lemmas, right?
But library learning is inference only? And self-summarization, is that part of training?

Zheng Yuan (Aug 06 2025 at 01:33):

Jared green said:

when can we use seedprover?

We are not ready for other people to try this due to our complex test-time strategy.

Zheng Yuan (Aug 06 2025 at 01:34):

Simon Sorg said:

Ayush Agrawal said:

One more question: Is the lean4 compiler feedback (incorporation of errors etc. into the prompt) happening online during RL training? It was not totally clear from the paper. Feel free to point to me specific parts of the paper if I missed something.

Similar problem. I'm still not entirely sure what was done during training and what is part of the (very impressive) test-time inference. So you propose easier conjectures during training, and have some multi-task learning objective that trains to generate lemmas and use them at the same time but also to prove starting with already existing lemmas, right?
But library learning is inference only? And self-summarization, is that part of training?

Library learning and self-summarization are test-time only.

Zheng Yuan (Aug 21 2025 at 01:03):

The performance of Seed-Prover on PutnamBench has been verified at 329/657. Two problems are not counted because they used the wrong version of the formalizations.

Kim Morrison (Sep 17 2025 at 02:37):

Has anyone actually run SeedProver's IMO2025 proofs? Their repository doesn't have a lean-toolchain or lakefile.toml, and the proofs don't work on the latest Mathlib master.

Zheng Yuan (Sep 17 2025 at 02:39):

Kim Morrison said:

Has anyone actually run SeedProver's IMO2025 proofs? Their repository doesn't have a lean-toolchain or lakefile.toml, and the proofs don't work on the latest Mathlib master.

Hi, SeedProver's proofs should be compiled under Lean v4.14.0

Zheng Yuan (Sep 17 2025 at 02:41):

The proofs break in the latest Lean version mainly due to the ∑ i in → ∑ i ∈ notation change
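
For anyone re-checking the files on a newer toolchain, a minimal illustration of that change (assuming a recent Mathlib where the old spelling has been removed):

```lean
import Mathlib

-- Around the Lean v4.14.0 era, Mathlib still accepted big-operator binders
-- written with `in`, e.g. `∑ i in Finset.range n, f i` (possibly with a
-- deprecation warning); recent Mathlib only parses the `∈` form below, so
-- proofs written with the old spelling no longer parse.
example (n : ℕ) : ∑ i ∈ Finset.range n, (1 : ℕ) = n := by
  simp
```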

Kim Morrison (Sep 17 2025 at 02:42):

Is it possible to set up the repository so that lake build will work? This seems like step 0 of verification.

Zheng Yuan (Sep 17 2025 at 02:47):

Kim Morrison said:

Is it possible to set up the repository so that lake build will work? This seems like step 0 of verification.

Since https://github.com/ByteDance-Seed/Seed-Prover is used to store multiple research projects and benchmark results (minif2f, imo, putnam, combibench...), it is a little hard for us to maintain a mono-repo with a default Lean setting, since the projects rely on different Lean versions. The easiest way to verify our files is to use the Lean web editor and simply copy-paste one proof. Any repo with Lean v4.14.0 + Mathlib should also work to verify IMO2025.
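
For anyone who prefers a local lake build over the web editor, a minimal pinned project might look like the sketch below. This is an illustration only: the Mathlib tag v4.14.0 is assumed to match the Lean v4.14.0 toolchain, and the package and library names are placeholders rather than anything taken from the Seed-Prover repository.

```lean
-- lakefile.lean for a minimal project to re-check the IMO 2025 proofs,
-- paired with a lean-toolchain file containing: leanprover/lean4:v4.14.0
-- (the Mathlib revision below is an assumption; use whichever revision the
-- proofs were actually built against, if that becomes known).
import Lake
open Lake DSL

package imo2025

require mathlib from git
  "https://github.com/leanprover-community/mathlib4" @ "v4.14.0"

@[default_target]
lean_lib Imo2025
```

With that in place, lake exe cache get followed by lake build should fetch Mathlib's prebuilt cache and then compile the library's modules.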

Lawrence Wu (llllvvuu) (Sep 17 2025 at 02:53):

It can probably be done with one lake project per directory (such as under imo2025/)

Zheng Yuan (Sep 17 2025 at 02:54):

Lawrence Wu (llllvvuu) said:

It can probably be done with one lake project per directory (such as under imo2025/)

We will update our GitHub repo for imo2025 with a lake project :working_on_it:

Kim Morrison (Sep 17 2025 at 04:17):

Thank you!

Zheng Yuan (Sep 19 2025 at 07:13):

Kim Morrison said:

Thank you!

Updated


Last updated: Dec 20 2025 at 21:32 UTC