Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Seed Prover 1.5 proves 11 / 12 Putnam 2025 in 9 hours
Zeyu Zheng (Dec 19 2025 at 06:41):
Link to announcement: #announce > Seed Prover 1.5 proves 11 / 12 Putnam 2025 in 9 hours
Kevin Buzzard (Dec 19 2025 at 09:41):
Screenshot from 2025-12-19 09-37-16.png
The last sentence in this quote (from p9) is a reminder to the mathematicians in this community here that there is still plenty of work to do. I think that we are just about at the point where I can state the main result of my PhD thesis using only things in mathlib but we're still a long way from being able to state the main intermediate results involved in the proof, and this thesis was written 30 years ago; things will only have gotten worse since then.
Bas Spitters (Dec 19 2025 at 14:12):
@Zeyu Zheng I did not find the model on Hugging Face. Do you provide it somewhere else?
Zheng Yuan (Dec 19 2025 at 15:23):
Bas Spitters said:
Zeyu Zheng I did not find the model on Hugging Face. Do you provide it somewhere else?
This is not an open-source model.
Joseph Myers (Dec 19 2025 at 15:44):
Some general observations / questions prompted by this report:
Joseph Myers (Dec 19 2025 at 15:46):
At the current level of AI results on PutnamBench, I wonder whether AIs failing to solve a particular problem is a signal that there might well be a misformalization of that problem making it harder or impossible to solve. I don't know if there's public information on which PutnamBench problems have not been solved by any AI (or more generally, on which problems were solved by which AIs), but have the PutnamBench maintainers used information on the problems not solved by AIs to search for misformalized statements there?
Joseph Myers (Dec 19 2025 at 15:55):
Taking 16.5 hours to solve IMO 2025 P1 certainly suggests something about the difficulty of some kinds of combinatorics problems in Lean, whether in general or specifically for AIs. I'm not sure exactly what the lessons are to take there (and in particular, whether there are things to do better in Lean and mathlib to make such formalizations easier), but it does seem that combinatorics is a key area of difficulty for AIs on competition-type problems at present. (Though I note that three of the five IMO 2025 problems for which a solution is reported in this report have a time to solve greater than the 4.5 hours available each day to a human contestant for all three of that paper's problems.)
Similarly, CombiBench results are given with a note "However, we discovered significant formalization issues within this dataset. We list the performance here for reference.". It's true that early CombiBench versions had major issues, as discussed here on Zulip, and the report doesn't seem to say which CombiBench commit was used. Were the significant issues found reported to the CombiBench maintainers and have they now been fixed, or are they in addition to the many issues that were fixed fairly early? Given current strengths and weaknesses of AIs in solving competition problems, we could do with good ways to assess their abilities in combinatorics, whether or not CombiBench itself is useful as such a benchmark.
Kelly Davis (Dec 19 2025 at 16:04):
@Zeyu Zheng Can you provide more info on how the system utilized "Python as a tool"?
- What type of problems did the system use it for?
- What subproblems did the system use it for?
- How was the Python tool presented/described to the system?
- ...
The article didn't provide much detail on this topic.
Joseph Myers (Dec 19 2025 at 16:07):
The stated IMO 2025 results say "(P2 is solved by Seed-Geometry)", which I understand as meaning it's not actually a Lean solution at all and not directly comparable with anything solving in Lean.
Now, it's perfectly reasonable in principle for an AI to call out to tools where helpful, whether those tools are Lean, Python, a SAT solver, a computer algebra system or a geometry solver. But to be relevant to Lean solving, a geometry solver should only be used as part of generating a Lean solution to the whole problem, stated in Lean using mathlib definitions and with the AI responsible for doing all translations to and from the format needed by the geometry solver.
And in modern AI, general (scalable) methods are expected to work better in the long run than special-casing things based on detailed human knowledge of a given kind of problem (the "bitter lesson"). So given an AI goal of contributing usefully to mathematics in general, I think the natural challenge for elementary Euclidean geometry is to get general methods to the point where they can tackle such problems without needing a custom geometry solver (unless the AI notices for itself the use of such a solver and also writes the solver itself without human intervention) - not because such a solver is bad, but because there are very many niche areas of mathematics that might sometimes benefit from custom solvers but where AI can't reasonably always expect solvers to be available (unless it writes them itself as needed) or to have huge amounts of training material on how to use them if they are available.
I.e., rather than thinking of Euclidean geometry as "this is outside the modern mathematical mainstream, deal with the competition problems as a special case", think of it as one of many examples of a niche mathematical area with little formal material available (but much more informal material available for training), where the ability to tackle such problems by general methods should be equally applicable to other niche areas of mathematics. Of course this means not just the AIs attempting to solve the problems in Lean by general methods - it means the AIs looking for more generally applicable lemmas and doing their own PRs of those to mathlib in mathlib-suitable form, just as is needed in any area of mathematics where you can expect that a nontrivial formalization will show up lots of missing lemmas in mathlib.
Kevin Buzzard (Dec 19 2025 at 16:14):
Oh yeah -- writing something that can translate these geometry tool proofs into Lean would be awesome! There is always a mismatch in my talks: AlphaProof says things like "we solved 4/6 problems on the 2024 IMO", I say "AlphaProof solved 3/6 problems on the 2024 IMO in Lean", and the difference is the AlphaGeometry proof. But thanks to Joseph's sterling efforts we are now pretty much at the point where Lean can understand statements and proofs of many IMO geometry problems, so why not make such a system?
Wang Jingting (Dec 19 2025 at 16:27):
Kevin Buzzard said:
Oh yeah -- writing something that can translate these geometry tool proofs into Lean would be awesome! There is always a mismatch in my talks: AlphaProof says things like "we solved 4/6 problems on the 2024 IMO", I say "AlphaProof solved 3/6 problems on the 2024 IMO in Lean", and the difference is the AlphaGeometry proof. But thanks to Joseph's sterling efforts we are now pretty much at the point where Lean can understand statements and proofs of many IMO geometry problems, so why not make such a system?
I think the problem is that, although the proof given by AlphaGeometry is rigorous in some sense, it's certainly not a concrete proof that you can formalize directly in Mathlib. To be more precise, AlphaGeometry's proof actually gives the result "if the geometric relationships between the points (e.g. A, B are on the same side of line CD) are the same as in the internally generated diagram (which is numerically computed), then the result holds". But the proofs generated by these systems don't really address the problem of establishing those geometric relationships, despite the fact that their deductions (e.g. angle chasing) might rely heavily on them.
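For illustration, here is a minimal Python sketch of the kind of numeric "same side" fact such an internally generated, numerically computed diagram provides; this is only a guess at the general idea, not code from AlphaGeometry.
```python
# Minimal sketch: numerically deciding whether points A and B lie on the same
# side of line CD, via the sign of 2D cross products. Illustrative only; this
# is not taken from AlphaGeometry.

def cross(o, p, q):
    """z-component of (p - o) x (q - o); positive if o->p->q turns left."""
    return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])

def same_side(a, b, c, d, eps=1e-9):
    """True if a and b lie strictly on the same side of line cd."""
    return cross(c, d, a) * cross(c, d, b) > eps

# With C=(0,0), D=(1,0), both A=(0.2,1) and B=(0.7,2) lie above line CD.
print(same_side((0.2, 1.0), (0.7, 2.0), (0.0, 0.0), (1.0, 0.0)))  # True
```
A numeric observation like this ends up as an unproved hypothesis of the formal deduction, which is exactly the gap described above.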
Joseph Myers (Dec 19 2025 at 17:50):
I don't think proving "same side" and "same sign" properties should be any harder for an AI than any of the other gaps you have to fill in when going from an informal proof to a formal one. (You could no doubt write a custom solver that attempts to do so - or a custom Lean tactic. But a reasonable challenge for AIs is to do as much as possible without hardcoding custom solvers for particular parts of mathematics, just using general methods that could apply elsewhere in mathematics, indeed without needing a custom geometry solver at all.)
Zheng Yuan (Dec 19 2025 at 22:50):
Joseph Myers said:
At the current level of AI results on PutnamBench, I wonder whether AIs failing to solve a particular problem is a signal that there might well be a misformalization of that problem making it harder or impossible to solve. I don't know if there's public information on which PutnamBench problems have not been solved by any AI (or more generally, on which problems were solved by which AIs), but have the PutnamBench maintainers used information on the problems not solved by AIs to search for misformalized statements there?
We have an LLM-based autoformalization judge which flags some misformalized statements in PutnamBench and CombiBench (both at the latest commit). But it is a model judge; we didn't put much human effort into double-checking it. For combinatorics problems, I think the models are not familiar enough with them, since creating a lot of high-quality combinatorics training data is really hard and it's easy to misformalize the problems. Also I think there are some API issues for combinatorics questions in mathlib.
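For concreteness, here is a rough sketch of what such a model-based judge could look like; `call_llm` is a hypothetical stand-in for whatever chat-completion client is actually used, and none of this is taken from the report.
```python
# Hypothetical sketch of an LLM-based formalization judge: given an informal
# problem statement and its Lean formalization, ask a model whether they match.
# `call_llm` is a placeholder for an arbitrary chat-completion client.

JUDGE_PROMPT = """You are checking a Lean 4 formalization against its informal statement.

Informal statement:
{informal}

Lean statement:
{formal}

Answer with exactly one word, MATCH or MISMATCH, then one sentence of justification."""

def judge_formalization(informal: str, formal: str, call_llm) -> bool:
    """Return True if the model judges the Lean statement to be faithful."""
    reply = call_llm(JUDGE_PROMPT.format(informal=informal, formal=formal))
    return reply.strip().upper().startswith("MATCH")
```
A judge like this can only flag candidates for human review, which matches the caveat above that the results were not double-checked by hand.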
Zheng Yuan (Dec 19 2025 at 22:53):
Kelly Davis said:
Zeyu Zheng Can you provide more info on how the system utilized "Python as a tool"?
- What type of problems did the system use it for?
- What subproblems did the system use it for?
- How was the Python tool presented/described to the system?
- ...
The article didn't provide much detail on this topic.
I think the system doesn't use Python very much; it mainly uses it to ensure the correctness of some calculations. The Python tool is described as a simple function: you pass in Python code and the tool responds with the final printed output.
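A minimal sketch of that kind of interface, assuming a simple subprocess-based runner (the actual Seed Prover tool is not public):
```python
# Minimal sketch of a "run Python code, return what it prints" tool, assuming a
# subprocess-based sandbox; the real Seed Prover tool is not public.
import subprocess

def run_python(code: str, timeout_s: float = 10.0) -> str:
    """Execute the given Python source and return its stdout (or the error text)."""
    try:
        result = subprocess.run(
            ["python3", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "error: execution timed out"

# Example: the model could double-check an arithmetic identity before proving it in Lean.
print(run_python("print(sum(k**3 for k in range(1, 11)) == (10 * 11 // 2) ** 2)"))  # True
```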
Kelly Davis (Dec 20 2025 at 06:51):
Zheng Yuan said:
I think the system doesn't use Python very much; it mainly uses it to ensure the correctness of some calculations. The Python tool is described as a simple function: you pass in Python code and the tool responds with the final printed output.
Thanks for the response. However, it still leaves me with a few questions:
- Lean 4 ensures correctness of calculations. So I'm not sure exactly what you mean when you say it is mainly used "to ensure the correctness of some calculations".
- Generally, in what type of calculations is it used? (For example, is it used for geometry questions, combinatorics questions, ...)
- Can you give a bit more detail beyond the tool being described as "a simple function: you pass in Python code and the tool responds with the final printed output"? Traditionally such tools are described to the system by a prompt indicating how the tool is to be called, the parameters it takes, what it returns... something like
## Tools Available
### movie_showtime_db(movie_name, date, time_preference, theater_location)
Search for available movie showtimes at theaters.
**How to call it:**
```python
# Basic - just movie name (required)
movie_showtime_db("Interstellar")
# With date filter
movie_showtime_db("Interstellar", date="2025-10-15")
# With time preference
movie_showtime_db("Interstellar", time_preference="evening")
# With theater location
movie_showtime_db("Interstellar", theater_location="downtown")
# With multiple filters
movie_showtime_db("Interstellar", date="2025-10-15", time_preference="evening")
```
**Parameters:**
- `movie_name` (required): Name of the movie (case-insensitive)
- `date` (optional): Specific date in YYYY-MM-DD format
- `time_preference` (optional): "morning", "afternoon", "evening", or "late"
- `theater_location` (optional): Theater area like "downtown", "mall", "northside"
**Returns:** List of showtimes with theater info, times, dates, available seats, and pricing.
...
As you can see, the description in the prompt of a tool can give strong priors to the system as to how the tool can/should be used.
So in the case of Seed Prover 1.5 such a prompt could be used to steer the system to use Python for, say, combinatorics questions, geometry questions... Hence my curiosity about how the Python tool was presented/described to the system.
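For instance, a deliberately minimal write-up of a hypothetical `run_python` tool in the same format as above would already set a prior towards symbolic/numeric checking (this is not Seed Prover's actual prompt):
### run_python(code)
Execute a Python snippet and return whatever it prints.
**How to call it:**
```python
# Hypothetical call: simplify an expression before attempting the Lean proof
run_python("import sympy; print(sympy.simplify('(x**2 - 1)/(x - 1)'))")
```
**Parameters:**
- `code` (required): Python source code; libraries such as `sympy` and `numpy` are assumed available.
**Returns:** The final printed output of the snippet (or an error message).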
Zheng Yuan (Dec 20 2025 at 08:18):
Kelly Davis said:
So in the case of Seed Prover 1.5 such a prompt could be used to steer the system to use Python for, say, combinatorics questions, geometry questions... Hence my curiosity about how the Python tool was presented/described to the system.
The system is prompted to use Python libraries like sympy, numpy, … and some other Python calculation tools.
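As an illustrative example (not taken from the report), this is the kind of side calculation that could be delegated to sympy before attempting the corresponding Lean proof:
```python
# Illustrative only: a side calculation a prover might delegate to sympy before
# committing to a Lean proof of the corresponding identity.
import sympy as sp

n, k = sp.symbols("n k", positive=True, integer=True)

# Check a candidate closed form for sum_{k=1}^{n} k*(k+1).
lhs = sp.summation(k * (k + 1), (k, 1, n))
rhs = n * (n + 1) * (n + 2) / 3
print(sp.simplify(lhs - rhs))  # 0, so the closed form is plausible
```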
Last updated: Dec 20 2025 at 21:32 UTC