Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: benchmarking Gemini 2.5 pro
Andy Jiang (Mar 29 2025 at 15:20):
Has anyone benchmarked the new Gemini model on generating formal proofs with Lean yet? I find its math reasoning abilities really phenomenal and wonder how they transfer to this task, though I assume it still suffers from the usual problems of not understanding Lean 4/Mathlib syntax or the details of Mathlib library code (especially since that is also in flux). In general it would be nice to see a few Lean benchmarks evaluated on these newer models (Grok 3, Claude 3.7, and Gemini 2.5 Pro), as they are supposed to be a step up in reasoning capabilities. [I guess miniF2F is evaluated by these companies?] Sorry, I'm mostly out of the loop, so hopefully someone can fill me in.
Kevin Buzzard (Mar 29 2025 at 16:01):
I think that currently the distance between what LLMs can do informally (e.g. tackling questions from the FrontierMath dataset, which are often at beginning graduate or hard undergraduate level although the answers are numbers rather than proofs) and what they can do formally is absolutely vast. If the lemma you ask it to prove is not already in mathlib and the proof is more than 5 lines then I would be very pessimistic that any LLM would be able to solve it autonomously. However I am generally an LLM-pessimist and would love to be proved wrong.
Kevin Buzzard (Mar 29 2025 at 16:08):
My suspicion as to why this is true is that problems of FrontierMath type (where the answer is a number but the level is hard) are solved by machines because the machines "guess" the answer and have no real idea how to back things up. Many times now I have seen a machine get a "what is this number?" question correct but fail spectacularly at the follow-up question "why?". There was an example in Wil Sawin's comment on a recent blog post of mine, where I asked a question about modular forms for which the answer is 0, and the LLMs were consistently getting the answer right but writing plausible-looking rubbish as justification (it looks like the right idea, but on closer inspection doesn't even typecheck).
Kevin Buzzard (Mar 29 2025 at 16:10):
In short, these informal systems are not as smart as people think, because we are not asking them the right questions; we are asking them questions for which the answer is easy for a computer to check (i.e. a number rather than a proof), and getting the right answer turns out to be far, far easier than getting the right answer together with a correct justification. You're asking the systems to write proofs, and this seems to me to be much, much more difficult right now.
Ralph Furman (Mar 29 2025 at 16:12):
LLMs struggle not just with formal math, but also with informal math that differs from the pattern of questions on benchmarks. You can see it in this question, where most models gave the wrong answer and a faulty construction. I just tried it on Gemini 2.5 and it gives the correct answer and a really long proof, although the meat of the proof is a line in the middle where it essentially claims the whole result is a theorem of Erdős (I don't know whether it actually is, or whether the attribution is hallucinated). I wrote a bit more about it in this post.
Justin Asher (Mar 30 2025 at 16:28):
Hi, I just wanted to chime in and say that I think it is completely possible to use the new Gemini models, especially 2.5 Pro, to autoformalize large amounts of mathematics. I think one of the keys here is to write a program that can formalize mathematics from scratch, without using Mathlib, as that allows better structuring of file storage and a clear bijective map between informal human mathematics and formal Lean. I have already started working on this, and I would be happy for any help.
Regarding Kevin's comment and blog post, I agree that LLMs are not inherently great at creating new research at the moment. My overwhelming suspicion is that a model like Gemini 2.5 Pro could potentially resolve this issue through a clever prompting algorithm: I was already getting 2.0 Flash close to solving A6 on last year's Putnam by having it write down the relevant literature, form strategies, attempt proofs, and review its work. However, it simply was not smart enough to carry out the proof strategies, even when it found the right ideas.
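For concreteness, here is a minimal sketch of the kind of staged loop I have in mind. Everything in it is hypothetical scaffolding: `ask_llm` is a placeholder for whatever model API you use, and the stopping test is deliberately naive.

```python
# Sketch of a staged prompting loop (literature -> strategies ->
# proof attempts -> review). `ask_llm` is a hypothetical placeholder
# for a real chat-model API call.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your model API")

def staged_attempt(problem: str, max_rounds: int = 3) -> str:
    # Step 1: have the model write down relevant known results.
    literature = ask_llm(f"List known results relevant to:\n{problem}")
    # Step 2: form a strategy based on that literature.
    strategy = ask_llm(
        f"Given these results:\n{literature}\n"
        f"Propose a proof strategy for:\n{problem}"
    )
    proof = ""
    for _ in range(max_rounds):
        # Step 3: attempt a proof following the strategy.
        proof = ask_llm(
            f"Following this strategy:\n{strategy}\n"
            f"Write a complete proof of:\n{problem}"
        )
        # Step 4: have the model review its own work.
        review = ask_llm(f"Review this proof for gaps or errors:\n{proof}")
        if "no gaps" in review.lower():
            break
        # Feed the review back in and try again.
        strategy += f"\n\nReviewer feedback to address:\n{review}"
    return proof
```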
I think one of the big advantages here is that LLMs can take the Grothendieck approach to solving problems: they are capable of building large bodies of theory, so instead of trying to solve a problem in one go, they can build up a knowledge base of mathematics in very small steps that can later be referenced.
I also want to point out that LLMs currently excel at recalling known information and at programming. Hence, from my observations, writing an algorithm that has an LLM write down a known proof and then formalize it is much, much easier than writing one that does entirely novel mathematics. My gut tells me that an algorithm that can formalize a decent amount of mathematics (say, FLT) should be achievable within a year or two as these models become cheaper, faster, and smarter.
Again, if anyone wants to help me on this, I would be more than appreciative. There are a lot of little things that I need help optimizing. My email is justinchadwickasher@gmail.com
George Tsoukalas (Mar 30 2025 at 17:24):
A pass@1 eval of Gemini 2.5 Pro solves 3 of 657 problems on PutnamBench. This eval doesn't really reflect how a Lean user might interact with an LLM, though; I think it would likely be much stronger if you put it in an agentic loop.
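(By "agentic loop" I mean something like the sketch below, where compiler errors are fed back to the model for repair; `ask_llm` and `check_with_lean` are hypothetical stand-ins for a model API and a Lean build/REPL invocation.)

```python
# Hypothetical agentic proving loop: a Lean attempt is type-checked
# and any compiler errors are fed back to the model for repair.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your model API")

def check_with_lean(source: str) -> tuple[bool, str]:
    # Placeholder: in practice, write `source` to a file and run the
    # Lean compiler (or a REPL), returning (success, error_log).
    raise NotImplementedError

def prove_agentically(statement: str, max_iters: int = 8) -> str | None:
    attempt = ask_llm(f"Prove this in Lean 4 with Mathlib:\n{statement}")
    for _ in range(max_iters):
        ok, errors = check_with_lean(attempt)
        if ok:
            return attempt
        attempt = ask_llm(
            f"This Lean proof fails:\n{attempt}\n"
            f"Compiler errors:\n{errors}\nPlease fix it."
        )
    return None  # give up after max_iters repair rounds
```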
Jason Rute (Mar 30 2025 at 17:26):
What is the SoTA now? (I think your leaderboard is broken, at least on my phone.)
George Tsoukalas (Mar 30 2025 at 17:46):
SoTA is the Self-Play Theorem Prover (https://arxiv.org/abs/2502.00212), which solves 8 problems at pass@3200, trained using an iterative conjecturing/proving loop.
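(For readers unfamiliar with the metric: pass@$k$ gives the model $k$ sampled attempts per problem, and a problem counts as solved if any attempt checks. The commonly used unbiased estimator, from the HumanEval paper, draws $n \ge k$ samples of which $c$ succeed and reports
$$\widehat{\text{pass@}k} = 1 - \binom{n-c}{k} \Big/ \binom{n}{k},$$
averaged over problems. So the pass@1 figure above allows a single attempt per problem, while STP's allows 3200.)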
Joseph Myers (Mar 30 2025 at 20:20):
Some previous discussions have noted that a key part of formalizing existing mathematics in a useful way (useful as a basis for future work, rather than as the quickest path to simply having a proof of something in Lean at all) is not producing a one-to-one conversion from LaTeX to Lean, but working out the proper generality of the results in formalized form, and the overall interface design for integrating them into a coherent library (mathlib), possibly trying out several different approaches to see what works best.
Formalizing without mathlib doesn't sound like a very good idea for integrated library building; what works well for formal mathematics isn't in a clear bijection with informal mathematics. I suspect that uses of LLMs for autoformalization ought to involve a dialogue, so that you (or the reviewer of the mathlib PR) can tell the LLM what you'd like it to do differently, and it might go through multiple attempts at formalizing the same material based on such feedback.
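To make the generality point concrete, here is a toy (entirely hypothetical) example: a line-by-line translation of a LaTeX lemma about real numbers is usually not the statement mathlib wants; the library form is stated at the right level of generality, and often already exists under an established name.

```lean
import Mathlib

-- Literal translation of a hypothetical LaTeX lemma stated over ℝ:
theorem binomial_sq_real (a b : ℝ) :
    (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by ring

-- The library-appropriate form: the same identity in any commutative
-- ring, which mathlib already provides as `add_sq`.
example {R : Type*} [CommRing R] (a b : R) :
    (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := add_sq a b
```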
Justin Asher (Apr 01 2025 at 01:31):
@Joseph Myers Thanks for the information. I understand where you are coming from, and I will keep that in mind.
Siddhartha Gadgil (Apr 01 2025 at 02:56):
My own informal benchmark suggests Gemini 2.5 is rather good. Here are 3+1 questions (the last the most significant). It may be instructive to catch a nearby student and test them too (in the last case, a student in topology/geometric group theory). These are not so easy for mathematicians either.
Basic Analysis
The following question seems to be hard for LLMs: for instance, GPT-4o gave rubbish answers. At least that suggests it is not easy to find online (DeepSeek-R1 got it only after its mistakes were pointed out over 3-4 rounds).
Problem: Let $f: \mathbb{R} \to \mathbb{R}$ be a smooth function, and suppose $f''(x) > 0$ for all $x \in \mathbb{R}$. Show that $f$ is not bounded above.
Gemini-2.5 answered perfectly. Among students I take this as the kind of problem that students who genuinely understand basic analysis can do quite easily (and conversely).
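For the Lean-curious, here is one possible (untested) formalization of the statement; the name and exact hypotheses are my own choices, and the `sorry` marks the actual mathematical content.

```lean
import Mathlib

-- Hypothetical formalization: a smooth function on ℝ with everywhere
-- positive second derivative is not bounded above.
theorem not_bddAbove_of_pos_second_deriv
    (f : ℝ → ℝ) (hf : ContDiff ℝ ⊤ f)
    (hf'' : ∀ x : ℝ, 0 < deriv (deriv f) x) :
    ¬ BddAbove (Set.range f) := by
  sorry
```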
Group theory
Problem: Show that a group $G$ is finite if and only if it has finitely many subgroups.
Again, a perfect answer. This was a problem I read about someone giving to o1 (from the final exam of their graduate algebra course) and being impressed by the answer.
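As above, a possible (untested) formalization; note that `Subgroup G`, as a type, packages all subgroups of `G`, so finiteness of that type says exactly that there are finitely many subgroups.

```lean
import Mathlib

-- Hypothetical formalization: a group is finite iff it has only
-- finitely many subgroups.
theorem finite_iff_finitely_many_subgroups (G : Type*) [Group G] :
    Finite G ↔ Finite (Subgroup G) := by
  sorry
```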
Probability Theory
Problem: Suppose $X$, $Y$ and $Z$ are independent random variables such that $X \sim N(1,1)$, $Y \sim \mathrm{Beta}(2,2)$ and $Z \sim \mathrm{Uniform}[0,3]$. Find $P(X + Y > Z)$.
Answered correctly, though not quite perfectly (for instance, it claimed $X + Y - Z$ has a density without justification, but then did not use it).
This was from a colleague (as part of an internal debate on LLMs like DeepSeek-R1 versus our best students).
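(A sketch of what is presumably the intended argument: each variable is symmetric about its mean, so $W = X + Y - Z$ is symmetric about $\mathbb{E}[W] = 1 + \tfrac{1}{2} - \tfrac{3}{2} = 0$. Since $X$ is continuous, $W$ has no atom at $0$, and therefore
$$P(X + Y > Z) = P(W > 0) = \tfrac{1}{2}.$$
This is exactly where the density of $W$ earns its keep: it rules out mass at the symmetry point.)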
Geometric Group Theory + Probability ("Research level")
As part of an approach to a research problem, I made the following naively optimistic conjecture. The reason for my hope was that this was true in the "Gromov-hyperbolic" and Euclidean cases for different reasons, so perhaps it was always true for a combination of these reasons.
Conjecture: Given any bounded harmonic function $f$ on the Cayley graph of a finitely generated group and a quasigeodesic $c$ in the Cayley graph, the limit of $f(c(t))$ as $t$ tends to infinity always exists (or almost surely exists).
The answers of Gemini-2.5 had mistakes, but after fifteen minutes of back and forth the model gave a correct counterexample. Better still, it was an illuminating counterexample, not only showing that the statement was false but suggesting how to try to modify the approach.
To clarify, Gemini-2.5's answer was assembled from a bunch of known results and constructions (and if I had spent enough time with the key relevant book, I ought to have figured this out myself). But a lot of mathematics is putting together known results and conjectures.
Siddhartha Gadgil (Apr 01 2025 at 03:08):
Maybe to be more precise: I feel this model has reached the "research student" level (that of the better research students at my institute, or the typical ones at a top institute).
Kevin Buzzard (Apr 01 2025 at 05:13):
I would be interested in seeing the "perfect answer" by Gemini 2.5 to the first question. Did it give you a proof? Isn't hyperbolic tangent a counterexample or am I missing something?
Siddhartha Gadgil (Apr 01 2025 at 09:06):
Below is the answer, @Kevin Buzzard. There is no counterexample: this is the 1-dimensional case of the result that bounded subharmonic functions are constant, but the proof is easier in dimension 1.
I was thinking of asking for more details, but saw that its "thinking" had references to the Mean Value Theorem (the detail I expected).
Proof from Gemini-2.5 (worth trying a bit)
Kevin Buzzard (Apr 01 2025 at 12:39):
Ha ha I didn't have my glasses on, I thought it was f' > 0 :-) Sorry for the noise!
Siddhartha Gadgil (Apr 01 2025 at 12:40):
(deleted)
Last updated: May 02 2025 at 03:31 UTC