Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Proof or Bluff
Floris van Doorn (Apr 01 2025 at 19:40):
I just found the paper Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad.
Floris van Doorn (Apr 01 2025 at 19:42):
It is written by a group of researchers who used the latest public LLMs (reasoning models) to try to solve the 2025 USAMO, and the LLMs did terribly (<2 points out of 42 on average).
Floris van Doorn (Apr 01 2025 at 19:47):
This supports the argument (that I've made) that to do complicated mathematical reasoning, you need a trusted verifier (e.g. Lean).
Not sure how strong this argument is. A counterargument is that it's easy to make an LLM output a bad answer, but maybe with some new techniques/insights you can get significantly better results.
Kevin Buzzard (Apr 01 2025 at 22:52):
The LLM answers I saw to Putnam problems in December were also abysmal. In contrast, DeepMind's achievement of a respectable score on the IMO was with Lean. So I am also a proponent of the argument that a trusted verifier is a necessary tool to move these things beyond where they are now.
I think the video at https://epoch.ai/frontiermath/tier-4 is quite illuminating too. Daniel Litt points out that the reason LLMs can sometimes get hard "what is this number" questions right is that it can sometimes be much easier to guess the answer than to prove that your guess is correct. The "FrontierMath philosophy" of asking questions for which you necessarily have to do some hard mathematics correctly and yet the answer is a number, is like asking mathematicians to come up with some kind of amazing compression algorithm, where a very small amount of data (the final answer, a number) is supposed to be evidence that the model has done a large amount of deep reasoning accurately. This is a hard problem to solve (which is why Epoch AI are offering $7500 to anyone who can give an example, I guess).
Another trick I saw recently coming from the "guess the right number" proponents is a test which was run at "pass@3200" which apparently is code for "let the model guess the number 3200 times and it is deemed to get the answer right if it gets it right at least once". (I retract this claim: I suspect I was thinking about https://arxiv.org/pdf/2502.00212 which is not about "guess a number")
However there are problems with getting models to write Lean code, for example there is not enough data to train on. It would be nice if we had more details about what DeepMind did, but there is still no paper on AlphaProof; instead we have articles like https://on.ft.com/4i6C8Gc , which does not bode well from the point of view of an academic. Again, in the Epoch AI video, Sergei Gukov stresses the importance of autoformalization; if we can make great progress here then perhaps the idea of getting LLMs to write Lean code will become closer to a reality, as there will be far more data to train on.
Jason Rute (Apr 01 2025 at 22:56):
Where did you see “pass@3200”?
Jason Rute (Apr 01 2025 at 22:56):
Also what is this $7500 reward?
Kevin Buzzard (Apr 01 2025 at 23:11):
The $7500 fee per question is mentioned on the Epoch AI Tier 4 page https://epoch.ai/frontiermath/tier-4 . It might be hard for me to find the pass@3200 reference, the tab is long closed. It made me laugh out loud.
Another paper where LLMs with no formal abilities perform abysmally, this time on a relatively straightforward mathematical task, is documented here https://arxiv.org/abs/2502.07087 .
Kevin Buzzard (Apr 01 2025 at 23:13):
3200: Oh maybe it was this? https://arxiv.org/abs/2502.00212 although now looking at it, this is asking for formal proofs not numbers, so pass@3200 is far more sensible. I have retracted my claim above.
Jason Rute (Apr 01 2025 at 23:16):
Yeah, it is unfortunate that pass@n means very different things (at least morally) in formal proofs vs. almost anything else (python code, natural language math, etc.)
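For reference, the usual pass@k metric in the literature (this is the standard definition from the Codex paper; whether the test Kevin saw computed it exactly this way is not stated in the thread) is estimated as

\[ \text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right], \]

where n is the number of samples drawn per problem and c is the number of correct ones; a problem counts as solved if at least one of the k samples is correct.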
Joseph Myers (Apr 02 2025 at 00:14):
At the recent marking weekend for intermediate (ages 14-16) olympiads in the UK, one of the markers tried a ChatGPT reasoning model (whatever you can get without paying for a subscription) on problem 1 from this paper (so the easiest problem on a paper for 14-year-olds). It insisted that only digits 0 and 1 are valid in two-digit positive integers (despite understanding that the numbers were in decimal), and doubled down when told that it was wrong.
Justin Asher (Apr 02 2025 at 01:06):
@Kevin Buzzard While we do not have a paper on AlphaProof, there is one on AlphaGeometry2. I am not sure whether the approach of building specialized symbolic engines by hand is particularly effective when general LLMs are becoming increasingly intelligent. However, having LLMs automate building symbolic engines would be neat.
Justin Asher (Apr 02 2025 at 01:09):
@Joseph Myers While I understand the concern, I cannot imagine newer free reasoning models like Gemini 2.5 Pro having this issue. I remember playing around with ChatGPT a year or two ago, having a lot of problems like this, but now they have mostly subsided. I expect the error rate to go down close to zero, but never completely to zero, which is why we need formal methods.
Jireh Loreaux (Apr 02 2025 at 03:58):
I always ask LLMs to prove that a sequentially compact metric space is compact. They have always failed me, which I find hilarious because this has to be present in the training data.
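For concreteness, the statement Jireh asks for is a one-liner to state in Lean with Mathlib, which is what makes it gradable by a trusted verifier; a minimal sketch (the closing lemma name IsSeqCompact.isCompact is quoted from memory and may differ in current Mathlib):

import Mathlib

-- In a metric space, a sequentially compact set is compact.
-- The lemma name below is an assumption from memory; only the statement matters here.
example {X : Type*} [MetricSpace X] {s : Set X} (hs : IsSeqCompact s) : IsCompact s :=
  hs.isCompact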
Kim Morrison (Apr 02 2025 at 05:29):
Ouch, yeah, I'm surprised they are still so bad at that one. Claude 3.7 just told me "since each point is contained in only finitely many open sets in a metric space"...
Kim Morrison (Apr 02 2025 at 05:38):
ChatGPT-o1 is similarly bad. It doesn't make an outright false statement, it is just wrong: at some point it says "In particular, if U_{i^*} was among the previously chosen sets, eventually you should not find infinitely many in ." but there is no reason for U_{i^*} to have been used in the construction it gave. I suspect you could salvage a proof that sequentially compact implies countably compact from the argument it gives.
Kim Morrison (Apr 02 2025 at 05:46):
Gemini 2.5 at first appears to produce a correct proof. It appeals to the (true!) fact that a metric space is compact if and only if it is complete and totally bounded. However, when asked to prove that, it then relies on the claim that a sequentially compact metric space is compact, and doesn't notice the circularity.
Oliver Nash (Apr 02 2025 at 09:05):
I guess it won't be surprising to hear that I too believe our best chances for automatic theorem proving include using a proof checker, such as Lean.
However I don't think we should entirely dismiss the alternatives such as LLM (or LLM + tool use). I am very pleased to see this paper, which reminds us of reality, but I think we should bear in mind the USAMO problems are sufficiently hard that there is not much room for a signal. This is also true for humans: if you take a bunch of mathematicians who haven't trained in Olympiad math, many will perform dismally on USAMO problems, even though they are quite capable mathematicians.
It's clear that performance on easier benchmarks like AIME is ludicrously inflated (partly by unintentional / unavoidable leakage), but these models are clearly able to do some sort of reasoning. The constant hype is annoying but there is also progress being made.
Andy Jiang (Apr 02 2025 at 10:05):
Glad to see more evaluations on proofs (vs numbers) in harder math questions for LLMs. Personally I feel the most representative test of true ability would be to create maybe ~1000 proof-type questions from all areas of math which are not incredibly standard but fairly natural (which is very difficult to do with numerical-answer questions imo) and hand grade them for each model [as they have done here afaict]. I feel that knowledge of Lean, the outdatedness of Mathlib knowledge, and general coding ability translated into Lean syntax confound evaluations of pure formal math too much as measures of mathematical ability.
Andy Jiang (Apr 02 2025 at 14:51):
By the way, it seems that Gemini 2.5 Pro has gotten a 25% score on the USAMO benchmark from this thread.
Kevin Buzzard (Apr 02 2025 at 15:21):
It will be quite difficult (and very expensive) to consistently mark the results of several LLMs on 1000 questions. Humans really need to do the marking here.
Andy Jiang (Apr 02 2025 at 15:37):
I guess it'll be expensive, but it depends on how important we think this data is. One way could be to form an informal group of a few hundred mathematicians who each come up with a private problem or two on their own, and their entire (voluntary) commitment would be to benchmark any new SOTA model on that one question. I think the value of this would be to generate data measuring how helpful newer models are for professional mathematicians across different disciplines (maybe we can find some areas where they are much more helpful than others).
Siddhartha Gadgil (Apr 02 2025 at 16:23):
Looks to me as if Gemini-2.5 got it correct (did not check very carefully though): https://g.co/gemini/share/15e2ef3107b4
Siddhartha Gadgil (Apr 02 2025 at 16:29):
Andy Jiang said:
I guess it'll be expensive, but it depends on how important we think this data is. One way could be to form an informal group of a few hundred mathematicians who each come up with a private problem or two on their own, and their entire (voluntary) commitment would be to benchmark any new SOTA model on that one question. I think the value of this would be to generate data measuring how helpful newer models are for professional mathematicians across different disciplines (maybe we can find some areas where they are much more helpful than others).
Many of us have our own benchmarks, and with mine Gemini-2.5-pro is pretty good (though it can be better). Honestly, I don't think any benchmark will convince all people. Keeping an updated collection of problems that any good mathematician should be able to solve but the current best LLMs cannot lets one keep checking for progress. As was pointed out, olympiad problems are not such problems.
Jireh Loreaux (Apr 02 2025 at 16:43):
Siddhartha Gadgil said:
Looks to me as if Gemini-2.5 got it correct (did not check very carefully though): https://g.co/gemini/share/15e2ef3107b4
That's not bad, but it appealed to the Lebesgue number lemma, which also has a nontrivial proof from sequential compactness. (Perhaps it could prove that though, if asked; the proof is on Wikipedia after all :laughing:)
Jireh Loreaux (Apr 02 2025 at 16:44):
In any case, that's the closest I've seen an LLM come to achieving this.
Andy Jiang (Apr 02 2025 at 18:11):
Siddhartha Gadgil said:
Andy Jiang said:
I guess it'll be expensive, but it depends on how important we think this data is. One way could be to form an informal group of a few hundred mathematicians who each come up with a private problem or two on their own, and their entire (voluntary) commitment would be to benchmark any new SOTA model on that one question. I think the value of this would be to generate data measuring how helpful newer models are for professional mathematicians across different disciplines (maybe we can find some areas where they are much more helpful than others).
Many of us have our own benchmarks, and with mine Gemini-2.5-pro is pretty good (though it can be better). Honestly, I don't think any benchmark will convince all people. Keeping an updated collection of problems that any good mathematician should be able to solve but the current best LLMs cannot lets one keep checking for progress. As was pointed out, olympiad problems are not such problems.
In my opinion the purpose would not really be to convince people. I don't think it's controversial that the models are improving at math. I think it would be good to provide a more robust measure of the progress, so at least we can agree on what level of problems they can solve on average.
Thomas Zhu (Apr 02 2025 at 18:13):
Jireh Loreaux said:
In any case, that's the closest I've seen an LLM come to achieving this.
I don't think an LLM reciting a proof of sequential compactness => compactness (which probably occurred in its training data hundreds or thousands of times) is really indicative of its reasoning ability. If anything it could just be the model pre-training giving a higher weight to Wikipedia data; I would also suspect it can be easily solved with the "Search" function turned on for ChatGPT.
Jireh Loreaux (Apr 02 2025 at 19:10):
As I indicated before, it's certainly present in the training data many times over.
Jireh Loreaux said:
I always ask LLMs to prove that a sequentially compact metric space is compact. They have always failed me, which I find hilarious because this has to be present in the training data.
But the point is this: if it can't successfully recite a (valid!) proof that it's seen thousands of times, then it's certainly not reasoning successfully, because if it were reasoning it would be able to look at that proof and realize it's correct. So, successfully reciting the proof isn't a sufficient condition for reasoning, but it is a necessary one.
And I've never seen an LLM even manage that (although the Gemini 2.5 link above is the closest I've seen).
Thomas Zhu (Apr 02 2025 at 19:47):
Sorry, I didn't want to contradict what you said! I agree not being able to prove it certainly indicates a problem. I just think Gemini 2.5 being closer to a proof doesn't necessarily mean it's better at reasoning.
Jireh Loreaux (Apr 02 2025 at 20:51):
Agreed.
Joseph Myers (Apr 03 2025 at 00:00):
Kevin Buzzard said:
It will be quite difficult (and very expensive) to consistently mark the results of several LLMs on 1000 questions. Humans really need to do the marking here.
I think this is pretty much the same difficulty as checking the result of a potential future AI's long claimed (informal) solution to some unsolved problem: a large amount of human checking to know whether what the AI has produced is correct or not. (An AI might eventually get a reputation for producing reliable proofs, but a lot of human checking would be needed to get there. And a good reputation for one version of an AI wouldn't necessarily carry over to the next version of that AI.)
Getting the AIs to produce formal proofs avoids that difficulty. But producing 1000 correct formal statements from all areas of mathematics, to be able to better benchmark AI prover performance without hand grading, is also hard (especially if you validate the statements by producing formal proofs of them all)! It would, however, have the value of adding a lot of material to mathlib (if genuinely covering "all areas") that's of value independent of any utility of benchmarks.
Justin Asher (Apr 03 2025 at 00:49):
@Kevin Buzzard
And a good reputation for one version of an AI wouldn't necessarily carry over to the next version of that AI.
Once we have one AI that can do math well, future models will likely be trained with that first AI checking their work, particularly in mathematical areas where we know it is reliable. Because of this, I have a hard time imagining a scenario where models regress. It seems more likely that the boundary of what these models are capable of will quickly expand until we reach a point where we can no longer easily check their work. (I personally do not enjoy reading thousand-page proofs, much less hundred-thousand-page proofs.) This is where formalization comes in as a necessary step.
Siddhartha Gadgil (Apr 03 2025 at 01:29):
My experience with Gemini-2.5 and other models (and a colleague's with DeepSeek-R1) is that if we go a couple of levels beyond their capabilities they still have useful output, but with errors creeping in. So one needs to check and have back and forth.
This is why I feel (auto)formalization will always be worthwhile. Whatever the level the best model reaches at any stage, with feedback from formalization it can go some levels higher.
Yury G. Kudryashov (Apr 03 2025 at 01:44):
I also like asking questions about circle dynamics. E.g., "Is it true that a circle homeomorphism is always conjugate to a rotation?" or "Is it true that a circle homeomorphism with an irrational rotation number is always conjugate to a rotation?"
Yury G. Kudryashov (Apr 03 2025 at 01:44):
(in fact, we also need it to be at least C^2)
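For reference, the classical results behind these questions (standard statements, given here for context rather than quoted from the thread): a circle homeomorphism $f$ with irrational rotation number $\rho(f)$ is semi-conjugate to the rotation $R_{\rho(f)}$ (Poincaré). If moreover $f$ is a $C^1$ diffeomorphism whose derivative has bounded variation (in particular, if $f$ is $C^2$), then the semi-conjugacy can be taken to be a genuine conjugacy (Denjoy):

\[ \exists\, h \in \mathrm{Homeo}^+(S^1), \quad h \circ f \circ h^{-1} = R_{\rho(f)}. \]

Denjoy counterexamples show this can fail for mere homeomorphisms, and even for $C^1$ diffeomorphisms, with irrational rotation number.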
Andy Jiang (Apr 03 2025 at 07:25):
Joseph Myers said:
Getting the AIs to produce formal proofs avoids that difficulty. But producing 1000 correct formal statements from all areas of mathematics, to be able to better benchmark AI prover performance without hand grading, is also hard (especially if you validate the statements by producing formal proofs of them all)! It would, however, have the value of adding a lot of material to mathlib (if genuinely covering "all areas") that's of value independent of any utility of benchmarks.
I don't think it's so hard to produce such problems. You don't even need to know the answer: just produce some statements which seem difficult after a few hours of thought and ask the AI to either prove or disprove each one. You don't need to provide correct statements, because finding counterexamples is also a good thing to test.
Andy Jiang (Apr 03 2025 at 07:27):
In contrast, the FrontierMath-type questions (which tbf are probably by far the best benchmark for research math for AIs right now) are very difficult to generate.
Andy Jiang (Apr 03 2025 at 07:28):
I guess the main difficulty would be adding enough to mathlib to state those problems in lean
Kevin Buzzard (Apr 03 2025 at 08:57):
The FrontierMath-type questions are not a very good benchmark IMO, because they have revealed that coming up with an educated guess for a number is far easier than working out a number rigorously. This is why we need to check proofs. The FrontierMath approach overinflates the ability of LLMs, and the Proof or Bluff paper gives a far more accurate idea of where LLMs actually are when it comes to mathematics, if by mathematics you mean "what mathematicians actually do".
Kevin Buzzard (Apr 03 2025 at 08:58):
Even the Proof or Bluff paper isn't perfect, because it's testing "Olympiad math" rather than real math, but I think we really need to move away from "what is this number" questions, which are even less representative.
Siddhartha Gadgil (Apr 03 2025 at 09:05):
Many of us make up questions to ask our students, or in selection interviews for research students. I personally just ask these to the LLMs, or ask similar ones my colleagues suggest.
I had a collection that tracked the models' limits over the years, but Gemini-2.5-pro saturated it. I then moved to questions arising out of research, and it hit its limit (as I mentioned, often still saying useful things but often failing).
It would be good if those on the thread who are not convinced made up similar questions: essentially ones which, if a research student (say in their second year or beyond) failed to answer them, we would be disappointed in the student. These can be tried on the LLMs.
Andy Jiang (Apr 03 2025 at 11:03):
By the way, I think a better benchmark methodology on the USAMO would be to also report what happens if you do naive compute scaling: say, let Gemini 2.5 Pro do ~500 generations, self-report a confidence score, and grade the highest-confidence answer. It is also worth measuring, when it fails to produce a proof, how often it recognizes that it cannot prove the statement versus believing an incorrect proof. If the model cannot easily fool itself, then it will probably benefit more from compute scaling.
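A minimal sketch of that protocol in Python, assuming a hypothetical generate(problem) call that returns a candidate solution together with a self-reported confidence (the function and its return format are made up for illustration, not an existing API):

# Naive compute scaling: sample many candidates and keep the one the model is most
# confident in. `generate` is a hypothetical wrapper around an LLM API call and is
# assumed to return (solution_text, confidence), with confidence a float in [0, 1].
def best_of_n(problem, generate, n=500):
    best_solution, best_conf = None, float("-inf")
    for _ in range(n):
        solution, confidence = generate(problem)
        if confidence > best_conf:
            best_solution, best_conf = solution, confidence
    return best_solution, best_conf

# The highest-confidence answer would then be hand graded, as in the Proof or Bluff setup.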
Luigi Massacci (Apr 03 2025 at 16:14):
Jireh Loreaux said:
I always ask LLMs to prove that a sequentially compact metric space is compact. They have always failed me, which I find hilarious because this has to be present in the training data.
I tried it with Perplexity (with the "pro" mode too) and it gave me a transparently circular proof as well. I would have bet money they would all be capable of regurgitating such a classic exercise; so much for AGI by 2027...
Andy Jiang (Apr 03 2025 at 17:30):
Jireh Loreaux said:
But the point is this: if it can't successfully recite a (valid!) proof that it's seen thousands of times, then it's certainly not reasoning successfully because it would just be able to look at that proof and realize it's correct. So, this isn't a sufficient condition for reasoning, but a necessary one.
I definitely don't think that only quizzing well-known proofs is a good benchmark for research ability (though it probably should form part of such a benchmark). But I would like to gently disagree with this statement here. I think LLMs responding to a question like this behave more like a person trying to "remember" the proof than a machine that is able to search within its training data for the proof and verify its correctness on the spot. This failure is probably more of a memory/recall failure than a reasoning one. A better reasoning test of its ability to judge whether a proof is correct would be to take problems of a similar level, give various arguments, some of which are faulty, and see if it can identify the faulty ones.
Luigi Massacci (Apr 04 2025 at 15:15):
Andy Jiang said:
I think LLMs responding to a question like this behave more like a person trying to "remember" the proof than a machine that is able to search within its training data for the proof and verify its correctness on the spot.
More like remembering the text of a proof, which is not the same thing.
Deming Xu (Apr 17 2025 at 20:03):
Jireh Loreaux said:
I always ask LLMs to prove that a sequentially compact metric space is compact. They have always failed me, which I find hilarious because this has to be present in the training data.
It seems like the new ChatGPT o4-mini got it right without using any additional complicated theorems.
https://chatgpt.com/share/68015d21-f030-800e-8b9c-c6647e21cad6
Jireh Loreaux (Apr 17 2025 at 22:10):
Indeed, I just had a friend try that for me last night.
Luigi Massacci (Apr 18 2025 at 10:30):
On the other hand:
Screenshot from 2025-04-18 12-29-56.png
Luigi Massacci (Apr 18 2025 at 10:31):
Which does not seem very good to me...
Last updated: May 02 2025 at 03:31 UTC