Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Anthropic in IMO/Putnam/PutnamBench/Erdos ?
TongKe Xue (Jan 28 2026 at 18:25):
Are there any results from Anthropic on IMO/Putnam/PutnamBench/Erdos?
I know about the DeepMind/OpenAI/xAI and Harmonic.fun/AxiomMath/LogicalIntelligence results.
I'm surprised I'm not finding Anthropic results on IMO/Putnam/PutnamBench/Erdos.
Do these results exist (and I'm just missing them), or is this an area Anthropic has not chosen to invest time in?
Thanks!
Jason Rute (Jan 28 2026 at 18:42):
Wait, what are the xAI and OpenAI results? (Unless you mean pre-chat GPT OpenAI results or results where others call the model with their own scaffolding.)
Jason Rute (Jan 28 2026 at 18:57):
Oh, you mean the non-Lean results?
Justin Asher (Jan 28 2026 at 18:59):
My read on this was that they did not view the Putnam as a good benchmark. This is reflected by how many of the organizations that did release results (Axiom, Project Numina, etc.) saturated it.
FrontierMath is perhaps more relevant.
Joseph Myers (Jan 28 2026 at 19:37):
I've previously noted that saturating on some kind of problem doesn't prevent doing interesting experimental science on AI capabilities for such problems. You can give the problems many times to the AIs (appropriately set up so they don't learn from one attempt on a given problem to the next) and look at the distribution of results in various areas. That includes how the time and other resources used to solve a problem vary across the attempts. It also includes the different approaches taken to solve a given problem, on different attempts by the same AI or attempts by different AIs.
Looking at different approaches for solving the same problem (which I think is interesting for both human and AI attempts) does depend on working with proof-based problems, not final-answer ones. It also helps a lot to work with formal solutions (to avoid humans needing to check large numbers of informal claimed solutions), and it would probably also help to have a pre-existing taxonomy of known solution approaches to the problems found by humans attempting them. And AI might help in initially classifying many AI solutions according to such a taxonomy and identifying those that look like odd ones out needing further study.
(Doing such experiments would need the problems to be hard enough to have interestingly different solution approaches available; most of miniF2F is probably too easy.)
TongKe Xue (Jan 28 2026 at 19:55):
Jason Rute said:
Wait, what are the xAI and OpenAI results? (Unless you mean pre-chat GPT OpenAI results or results where others call the model with their own scaffolding.)
You're right, I grouped all the IMO / Putnam / PutnamBench / Erdos results together, even if they were not Lean4; sorry for the confusion.
TongKe Xue (Jan 28 2026 at 19:58):
Somewhat related to this: on PutnamBench, https://trishullab.github.io/PutnamBench/leaderboard.html , what does pass@N mean, given these problems have Lean4/Isabelle formalizations?
More precisely: why would any "prover" submit anything that did not pass the Lean4/Isabelle verifier?
Jason Rute (Jan 29 2026 at 10:53):
Pass@n was originally for the situation where you have an LLM and n separate (totally independent) runs writing code, and you want to know how many succeed. It came from AI for code, where you are trying to pass hidden tests (and hence it is only an approximation of the agent's usefulness); here, by contrast, it is a perfectly valid theorem-proving method (as you have also concluded).
But quickly the term got co-opted for agents, like tree-search agents. At first pass@n meant running the tree search n times more or less independently, but each tree search could involve thousands of small LLM calls. Then, as we started using LLMs directly to write proofs as in DeepSeek Prover v1, it reverted to its original meaning. However, as we started adopting agentic frameworks on top of LLMs, researchers have split in their conventions. Some write things like pass@64*1000 to mean something like 64 independent calls of an agent with 1000 (average?) model calls per agent. And as agents get more hierarchical, more terms of that product get added. Others just follow the original convention and write pass@n where n is the number of independent runs, even if each run is an agent making thousands of model calls. That is why Hilbert and AlephProver, using similar approaches (a decomposition agent calling frontier LLM APIs) and similar computation budgets (at least both are expensive), report vastly different pass@n numbers on the PutnamBench leaderboard.
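For reference, the original (pre-agentic) pass@k metric from AI-for-code evaluations has a standard unbiased estimator: given n independent samples of which c succeed, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch (the function name is illustrative, not from any particular library):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one
    of k samples drawn without replacement from n attempts, of which
    c succeeded, is a success."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 attempts, 5 of them successful: estimate pass@10
print(pass_at_k(100, 5, 10))
```

Naively reporting (successes in k samples > 0) for a single batch of k would be a biased estimate; this formula averages over all size-k subsets of the n attempts.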
Jason Rute (Jan 29 2026 at 11:00):
Also it should be noted that as LLM outputs get longer (with thinking models), comparing pass@n values (where n is the total number of model calls) gets tricky, since some models use many more tokens (and hence more compute/time/money) than others. The newer results are just starting to report cost (although there are many factors, like economies of scale, that go into that too).
Thomas Zhu (Jan 29 2026 at 18:09):
I think pass@64*1000 means maximum 64 runs each with maximum 1000 calls, not average.
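Under that reading, a pass@X*Y evaluation can be sketched as a nested budget loop. Everything here is hypothetical scaffolding (the stubbed `run_prover` stands in for a real agent plus verifier):

```python
import random

def run_prover(seed: int, max_llm_calls: int) -> bool:
    """Hypothetical prover agent: makes up to max_llm_calls model queries
    and returns True if it produced a verifier-accepted proof.
    Stubbed here with a small per-call success probability."""
    rng = random.Random(seed)
    for _ in range(max_llm_calls):
        if rng.random() < 0.001:  # stand-in for "proof verified by Lean"
            return True
    return False

def pass_at_x_times_y(x: int, y: int) -> bool:
    """pass@X*Y: up to X independent runs, each allowed up to Y LLM calls;
    the problem counts as solved if any run succeeds."""
    return any(run_prover(seed, y) for seed in range(x))

print(pass_at_x_times_y(64, 1000))
```

Note that "maximum" matters: a run may stop early once it finds a proof, so the nominal X*Y budget is an upper bound, not the compute actually spent.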
TongKe Xue (Jan 29 2026 at 22:53):
Interesting, so pass@X*Y means:
the program "solver" was run X times, with different seeds;
each execution of "solver" was allowed Y queries to some preferred LLM;
and out of the X runs, at least one generated an acceptable Lean4 proof?
Thomas Zhu (Jan 29 2026 at 22:58):
Yes. I should note that historically, "pass@k" long predates the "agentic" capabilities of LLMs, so there was no notion of a "prover" "submitting" anything. Back then, an LLM would typically take the problem statement as a prompt and output some string, and we would manually extract the ```lean ... ``` blocks from that output and give them to Lean. That was when pass@k made sense. Today LLMs have general tool-calling capabilities and can treat the Lean compiler as a tool: they can try to compile, receive error feedback, refine or decide to submit, and so on, so the "k" of pass@k is not really a well-defined number anymore in this framework.
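The pre-agentic workflow described above (pull the ```lean fenced blocks out of a raw completion, then hand them to the Lean compiler) can be sketched as follows; the regex and function name are illustrative, not from any real harness:

```python
import re

FENCE = "`" * 3  # literal triple backtick, built up to avoid nesting fences here

# Matches fenced blocks of the form: ```lean <newline> ... ```
LEAN_BLOCK = re.compile(FENCE + r"lean\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_lean_blocks(llm_output: str) -> list[str]:
    """Pull every lean-fenced block out of a raw LLM completion,
    ready to be written to a .lean file and checked by the compiler."""
    return [m.strip() for m in LEAN_BLOCK.findall(llm_output)]

sample = (
    "Proof attempt:\n"
    + FENCE + "lean\ntheorem foo : 1 + 1 = 2 := rfl\n" + FENCE
    + "\nDone."
)
print(extract_lean_blocks(sample))
```

Each extracted string would then be compiled, and the attempt counts as a pass only if Lean accepts it; the "k" in pass@k is simply the number of such independent completions.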
TongKe Xue (Jan 29 2026 at 23:14):
A bit off-topic from my original question:
Is Seed-Prover basically the currently strongest open-source (code) + open-weight (any LLM models used) Lean4 prover, as measured on IMO / PutnamBench? I.e., an "open source" alternative to Harmonic/Axiom?
Thomas Zhu (Jan 29 2026 at 23:17):
Seed-Prover is currently not open source/weight. There is no centralized benchmark for IMO, but for PutnamBench scores you can look at the official leaderboard: https://trishullab.github.io/PutnamBench/leaderboard.html. I believe the strongest open code+weight option is currently Hilbert.
TongKe Xue (Jan 29 2026 at 23:18):
Ah, is https://github.com/ByteDance-Seed/Seed-Prover only the output and not the actual code?
Thomas Zhu (Jan 29 2026 at 23:18):
Yes
Last updated: Feb 28 2026 at 14:05 UTC