Zulip Chat Archive
Stream: IMO-grand-challenge
Topic: Putnam 2024
Jason Rute (Dec 08 2024 at 19:48):
It seems that OpenAI’s o1 model can do really well on the Putnam 2024 exam. https://x.com/genericname2134/status/1865534730139050023?s=46 I hope we can see some more systematic experiments soon. I know @Brando Miranda said he would do some (https://x.com/brandohablando/status/1864047245374734635?s=46). And this is informal reasoning. Imagine what we could do with an informal/formal mix!
Jason Rute (Dec 08 2024 at 20:12):
Of course nothing is official. It isn’t like we have the grading rubric (and the tweet author didn’t post the AI answers for others to check). But considering this isn’t the first announcement that o1 does well on the Putnam, it could be very real. And now I wonder how o1 would do on the IMO? (I can’t remember which exam is harder.)
David Renshaw (Dec 08 2024 at 22:05):
Dan Hendrycks posted some o1-pro-generated solutions here: https://x.com/DanHendrycks/status/1865855151564849220
David Renshaw (Dec 08 2024 at 22:06):
I started reading the A1 solution: https://chatgpt.com/share/6755f3ca-7ec0-800c-9fe2-c90492b1b8f0
David Renshaw (Dec 08 2024 at 22:06):
I don't think I buy this part:
Check the combinations to achieve 2a^n + 3b^n ≡ 0 (mod 4):
* If both a and b are odd, their n-th powers are either 1 or 3 mod 4. Testing all combinations shows you cannot get a sum divisible by 4.
* If one is even and the other odd, the sum won't vanish mod 4 either.
* The only way to get a sum divisible by 4 for n ≥ 2 is if both a and b are even, ensuring a^n ≡ b^n ≡ 0 (mod 4).
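A quick brute-force residue check of the case analysis above (a sketch, assuming the A1 equation 2a^n + 3b^n = 4c^n):

```python
# Which parities of a and b allow 2*a^n + 3*b^n to be divisible by 4?
# Residues mod 4 depend only on a, b mod 4, so scanning 0..3 suffices.
for n in range(2, 8):
    ok_parities = set()
    for a in range(4):
        for b in range(4):
            if (2 * a**n + 3 * b**n) % 4 == 0:
                ok_parities.add((a % 2, b % 2))
    print(n, ok_parities)  # {(0, 0)} for every n >= 2: both a and b must be even
```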
David Renshaw (Dec 08 2024 at 22:10):
wait, nevermind, I now think it's ok.
Joseph Myers (Dec 08 2024 at 22:43):
What about the descent argument going from 2a^n + 3b^n = 4c^n to the new equation you get after substituting a = 2a_1, b = 2b_1 if a and b are even? That's not literally the same equation with smaller numbers, so it shouldn't be called a descent without further details. And indeed the new equation can be satisfied mod 4 with all of the variables odd; you need to go to mod 8 to conclude that the variables are all even and get a genuine descent for the new equation.
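A quick residue scan makes the mod 4 vs mod 8 point concrete (taking the reduced equation in the n = 2 case to be 2x^2 + 3y^2 = z^2, which is my reading of the substitution rather than a quote from the solution):

```python
# All residues are determined mod m, so scanning 0..m-1 is enough.
def solutions_mod(m):
    """Triples (x, y, z) mod m with 2*x^2 + 3*y^2 == z^2 (mod m)."""
    return [(x, y, z)
            for x in range(m) for y in range(m) for z in range(m)
            if (2 * x * x + 3 * y * y - z * z) % m == 0]

# Mod 4 there are solutions with x, y, z all odd ...
print(any(x % 2 and y % 2 and z % 2 for x, y, z in solutions_mod(4)))  # True
# ... but mod 8 every solution has x, y, z all even, which is what gives a
# genuine descent for the new equation.
print(all(x % 2 == 0 and y % 2 == 0 and z % 2 == 0
          for x, y, z in solutions_mod(8)))                            # True
```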
Joseph Myers (Dec 08 2024 at 22:48):
(This is still a lot better than single-shot attempts at IMO problems I've seen from older (unspecified) ChatGPT versions. From memory, the ChatGPT attempt at IMO 2024 P3 posted on the coordinators' Discord during the IMO (so hopefully before any training data contamination with human solutions) ended with a claim along the lines of "the answer is 1" - that problem isn't a "determine" problem, it made no sense whatever for ChatGPT to assert a numerical answer.)
Joseph Myers (Dec 08 2024 at 22:51):
Asserting that case X is the same as or similar to case Y when in fact there is some difference that matters is of course a common mistake in informal human reasoning (that formalization helps eliminate).
Brando Miranda (Dec 09 2024 at 00:39):
We've generated plenty of solutions for multiple models (including Grok, Gemini, Claude, Llama 405B and more). We plan to make a rubric, human-grade the solutions, and also run the automatic evaluations we proposed in our Putnam-AXIOM benchmark, which is nearly ready to be released too, with (we believe) all Putnam questions in existence. We'll release everything, along with a technical report, as soon as we can, hopefully by the end of this week or worst case next week.
https://huggingface.co/Putnam-AXIOM
Jason Rute (Dec 09 2024 at 02:49):
I am curious about all these results and how the best models did on the Putnam. I may have gotten a bit too excited. :blushing: I'd love a sober and honest assessment of how o1-pro and the other models performed. Unfortunately, I feel these evaluations are going to be noisy. Whether a model like o1-pro gets a given problem right or not will be stochastic (unless one performs a lot of generations). (If one is willing to pay for the test-time compute, I think there are much better evaluation strategies that are less stochastic and would perform better overall, but I think we may have to wait for the future to get those. It probably depends on whether well-funded groups have been secretly preparing an AI for the Putnam exam which they will announce right before MATH-AI next week, like AlphaProof did with the IMO over the summer.)
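As a rough illustration of the noise, with k independent generations per problem the estimated per-problem solve rate has a binomial standard error; the solve-rate value below is made up purely for illustration:

```python
import math

# Standard error of an estimated solve rate p_hat from k independent samples.
def solve_rate_std_err(p_hat: float, k: int) -> float:
    return math.sqrt(p_hat * (1 - p_hat) / k)

# Illustrative only: a problem the model solves about half the time is still
# very uncertain after one generation and only settles down with many samples.
for k in (1, 5, 20, 100):
    print(k, round(solve_rate_std_err(0.5, k), 3))  # 0.5, 0.224, 0.112, 0.05
```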
Kevin Buzzard (Dec 09 2024 at 03:48):
I had a look at the solution to A6 and as far as I can see the AI got the answer totally wrong and their method was vague at best. The point where the wheels fell off was "this generating function has a square root in Z[[x]] so probably the coefficients satisfy a second-order linear recurrence relation so it's probably this", and one can easily check that it's not. They then made what could well have been correct deductions about the wrong sequence and got A(1), A(2) right and A(n) wrong for all n >= 3. I've not seen any official solutions and I'll have egg on my face if I'm wrong here, but I think A(n) = 10^(n(n-1)/2).
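For the "one can easily check" step, here is one mechanical way to test a second-order linear recurrence claim against the first several terms; the terms in the example are just the conjectured closed form 10^(n(n-1)/2) for n = 1..5, used to have concrete numbers rather than because they are known to be the coefficients the model was reasoning about:

```python
from fractions import Fraction

def second_order_recurrence(terms):
    """Return (s, t) if terms[n] = s*terms[n-1] + t*terms[n-2] holds for all
    given terms, else None.  Needs at least 4 terms."""
    a0, a1, a2, a3 = map(Fraction, terms[:4])
    det = a1 * a1 - a0 * a2
    if det == 0:
        return None  # degenerate 2x2 system; not handled in this sketch
    # Solve a2 = s*a1 + t*a0 and a3 = s*a2 + t*a1 for s and t.
    s = (a2 * a1 - a3 * a0) / det
    t = (a1 * a3 - a2 * a2) / det
    for i in range(2, len(terms)):
        if Fraction(terms[i]) != s * terms[i - 1] + t * terms[i - 2]:
            return None
    return s, t

# 10^(n(n-1)/2) for n = 1..5; no second-order recurrence fits, so this prints None.
print(second_order_recurrence([1, 10, 1000, 10**6, 10**10]))
```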
Kevin Buzzard (Dec 09 2024 at 03:55):
I also took a look at A5 and the argument is not at all rigorous. I don't know what the answer to this question is (it's "find all primes p >= 7 with some property"), but the machine found that p = 7 worked and just said that it seemed unlikely that any higher primes would work because the property was weird, so the answer was p = 7 only. I would imagine that this answer was worth 1/10 max.
Jason Rute (Dec 09 2024 at 05:17):
Were there any it got right (or mostly right)? I'm feeling a bit silly here. :smile:
Jason Rute (Dec 09 2024 at 19:24):
Dan Hendrycks redid his Putnam solutions using a new prompt which mentions that it's a Putnam question and stresses the importance of being rigorous. o1 thinks longer. I haven't yet really checked if the solutions are any better: https://x.com/danhendrycks/status/1866191952531845547?s=46
Jason Rute (Dec 09 2024 at 19:30):
This thread scores the answers both before and after the new prompt. I think the new prompt helped a little but not much. https://x.com/isaac1124102676/status/1866096185049522614?s=46
Last updated: May 02 2025 at 03:31 UTC