Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: IMO results from Google DeepMind
Eric Wieser (Jul 25 2024 at 15:31):
For this year's IMO, we paired a new version of AlphaGeometry with "AlphaProof", a new system for finding answers to mathematical questions, and proving them rigorously correct in Lean.
Check out our blog post for the details and the proofs we found for the IMO!
Kevin Buzzard (Jul 25 2024 at 15:34):
Can you claim a silver medal if it took you three days to solve one of the problems? :-) No doubt each of the students would have also got far more problems solved if they'd been allowed three days :-)
Kevin Buzzard (Jul 25 2024 at 15:39):
Of course, on the plus side this hammers SOTA. Congratulations!
Eric Wieser (Jul 25 2024 at 15:40):
I suspect there are other requirements for actually claiming a silver medal, such as being human and under 20 years old (Google is 25)!
Eric Wieser (Jul 25 2024 at 15:42):
Check out the acknowledgements list for some other names you may recognize from this community (and a shoutout to Lean/Mathlib) :smile:
zera (Jul 25 2024 at 15:46):
Their lean solutions are here (from the blog post) https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/P6/index.html
Eric Wieser (Jul 25 2024 at 15:49):
zera said:
Their lean solutions are here (from the blog post) https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/P6/index.html
(Huge thanks to David Thrane Christiansen for making that site possible, both for building Verso and for giving me support with learning to use it from his vacation!)
Julian Schrittwieser (Jul 25 2024 at 15:51):
Huge thank you also to everyone for the great work building Lean and Mathlib!
Chris Birkbeck (Jul 25 2024 at 15:51):
It says some problems were solved in minutes and the others in 3 days, can you say which was the quick one and which was the slowest one?
Andy Jiang (Jul 25 2024 at 15:55):
First of all holy shit congrats!!
Kevin Buzzard (Jul 25 2024 at 15:58):
The geometry one was solved in 19 seconds using the non-Lean system.
Heather Macbeth (Jul 25 2024 at 16:00):
Wow, very cool!
I see three of the problems solved require data in their solutions:
- Determine all real numbers α such that ...
- Determine all pairs (a,b) of positive integers for which ...
- ... find the smallest possible value of c.
Does the system itself determine the answer (e.g., for the first one, "the set of even integers")? Or were you doing the "easy mode" version where you provide the answer? Looking at the blog post,
AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct.
(emphasis mine) I might guess it was the "hard mode" version?
Kevin Buzzard (Jul 25 2024 at 16:02):
Yes, it was the hard mode version. Otherwise I would have been ripping the claim to shreds :-)
Kevin Lacker (Jul 25 2024 at 16:06):
are the comments in the proof generated by the AI, or are those separately added by a human?
Jireh Loreaux (Jul 25 2024 at 16:06):
Outstanding! thanks for the post!
llllvvuu (Jul 25 2024 at 16:06):
Kevin Lacker said:
are the comments in the proof generated by the AI, or are those separately added by a human?
We did those. The proofs can be a bit annoying to understand otherwise.
Andy Jiang (Jul 25 2024 at 16:07):
I see that the unsolved ones are in Combinatorics. sorry for commenting without reading the blog post in full (it's a very exciting development for me). But I would've thought that AI can take advantage of a lot of inherent machine advantages for combinatorics problems
1) for small-scale combinatorics problems it is obviously much better at brute-force counting
2) for games: it can run simulations
3) for some combinatorics problems it can run small-number examples by brute force and try to generalize the pattern
4) for strategy based game problems it can try out strategies
I just would've thought that there's a lot of room for "cheating" methods for AI in combinatorics. Eric (I gather you are on the team), can you comment on how much these things were tried? Did you spend less time working on combinatorics-type problems? Sorry, I will read the blog post in full and probably realize I've made a fool of myself here.
Eric Wieser (Jul 25 2024 at 16:07):
Kevin Lacker said:
are the comments in the proof generated by the AI, or are those separately added by a human?
These are added by the lean experts acknowledged in the blog post to aid in public understanding, which indeed includes @llllvvuu :smile:
Andy Jiang (Jul 25 2024 at 16:10):
The complexity of the proofs found by AlphaProof probably means at least that it can be used for formalization efforts here in Lean for existing math, and also potentially (depending on the generality of the techniques) can already be tried in a range of research math domains? Is this correct?
Kevin Buzzard (Jul 25 2024 at 16:12):
Research math: No. IMO is child's play compared to research mathematics, where it can take months or years for a human to make an advance rather than hours.
Kevin Lacker (Jul 25 2024 at 16:14):
the blog post describes the "solver network" that searches for a proof, using MCTS over some directed-graph-like structure. what is one "step" of the graph for the solver network - is it a transformer generating a candidate for the tactic to use for the current goal?
Eric Wieser (Jul 25 2024 at 16:15):
is it a transformer generating a candidate for the tactic to use for the current goal?
I think that probably falls under the "will release more technical details soon" part of the blog post, I'm afraid
Jireh Loreaux (Jul 25 2024 at 16:19):
I'm impressed by the complexity of some of the tactic calls. For example, calls to nlinarith with lemmas supplied, some of which are proved by norm_cast. That's quite a complicated step.
I'm also just interested in proportions of various tactics used. I thought the repeated use of suffices was interesting. Presumably it helps significantly because it gives Lean more information and can make elaboration easier. My thought process being: the AI decides "I wonder if trying to prove X would be helpful", by supplying that to Lean in the form of suffices, it not only breaks the problem into two goals, but in the one case, Lean has more elaboration information which presumably makes the goal easier to solve.
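To illustrate the shape I mean, here is a toy example I just made up (not taken from the AlphaProof output), with exact_mod_cast standing in for the norm_cast-style side proofs:

import Mathlib

-- an `nlinarith` call whose hint term is itself proved inline by a small cast tactic
example (n : ℕ) (x : ℝ) (hx : x ≤ (n : ℝ)) : x * (n : ℝ) ≤ (n : ℝ) ^ 2 := by
  nlinarith [show (0 : ℝ) ≤ (n : ℝ) by exact_mod_cast Nat.zero_le n]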
Jireh Loreaux (Jul 25 2024 at 16:20):
It's also using syntax I didn't know was valid (e.g., (x:))
llllvvuu (Jul 25 2024 at 16:21):
Andy Jiang said:
I see that the unsolved ones are in Combinatorics. sorry for commenting without reading the blog post in full (it's a very exciting development for me). But I would've thought that AI can take advantage of a lot of inherent machine advantages for combinatorics problems
1) for small-scale combinatorics problems it is obviously much better at brute-force counting
2) for games: it can run simulations
3) for some combinatorics problems it can run small-number examples by brute force and try to generalize the pattern
4) for strategy based game problems it can try out strategies
I just would've thought that there's a lot of room for "cheating" methods for AI in combinatorics. Eric (I gather you are on the team), can you comment on how much these things were tried? Did you spend less time working on combinatorics-type problems? Sorry, I will read the blog post in full and probably realize I've made a fool of myself here.
There's probably not much @Eric Wieser or anyone else can say about this here, but note that combo problems and solutions are also very hard for humans to formalize!
Andy Jiang (Jul 25 2024 at 16:21):
(deleted)
Floris van Doorn (Jul 25 2024 at 16:22):
This is really amazing! 3 formalized proofs of IMO problems, including IMO-6. That is really impressive. I think this is really cool!
Eric Wieser (Jul 25 2024 at 16:22):
Andy Jiang said:
I see that the unsolved ones are in Combinatorics [...]
One thing to note is that while simulating a game can give you guesses for strategies, it only works as a proof if you simulate every possible opponent strategy (which probably translates to the whole proof being by decide in the first place!)
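As a toy illustration of that kind of exhaustion (nothing to do with any IMO problem), the statement below is closed purely by having Lean evaluate all finitely many cases:

import Mathlib

-- `decide` exhaustively checks all 3 * 3 = 9 cases by evaluation
example : ∀ a b : Fin 3, a.val + b.val ≤ 4 := by decide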
Floris van Doorn (Jul 25 2024 at 16:23):
One question: what information was provided to AlphaProof? Something like
theorem imo_2024_p1 :
{(α : ℝ) | ∀ (n : ℕ), 0 < n → (n : ℤ) ∣ (∑ i in Finset.Icc 1 n, ⌊i * α⌋)}
= _ := by
sorry
(plus maybe the English problem description)?
Andy Jiang (Jul 25 2024 at 16:23):
Eric Wieser said:
One thing to note is that while simulating a game can give you guesses for strategies, it only works as a proof if you simulate every possible opponent strategy (which probably translates to the whole proof being by decide in the first place!)
But shouldn't tool use (as a general principle) be leveraged as a huge help? Like if I were doing combinatorics problems with Python available, I would expect my performance to be massively boosted.
Kevin Lacker (Jul 25 2024 at 16:24):
Jireh Loreaux said:
I'm impressed by the complexity of some of the tactic calls. For example, calls to nlinarith with lemmas supplied, some of which are proved by norm_cast. That's quite a complicated step.
I'm also just interested in proportions of various tactics used. I thought the repeated use of suffices was interesting. Presumably it helps significantly because it gives Lean more information and can make elaboration easier. My thought process being: the AI decides "I wonder if trying to prove X would be helpful", by supplying that to Lean in the form of suffices, it not only breaks the problem into two goals, but in the one case, Lean has more elaboration information which presumably makes the goal easier to solve.
for the suffices thing, my guess is that it works similarly to HTPS, where one "step" is picking a tactic. and then it creates the proof in the order of the tactics it used. so if a proof of A -> E works like A -> B -> C -> D -> E, you could start with any of "suffices B", "suffices C", etc. A human might then rearrange it so that it reads like A -> B -> C -> D -> E, but the AI might just leave the tactics in the order it found them.
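a made-up toy version of that shape (not from the actual output):

example (x y : ℕ) (h : x = y) : x * x = y * y := by
  -- each `suffices` defers the "earlier" step of the informal argument
  suffices h1 : x * x = x * y by rw [h1, h]
  suffices h2 : x = y by rw [h2]
  exact h

reading top to bottom it runs backwards from the goal, in the order the steps might have been found, rather than forwards from the hypothesis.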
Andy Jiang (Jul 25 2024 at 16:25):
Sorry, so is the proof search in formal language, or informal first and then formalized? Because if this is raw formal-language proof search it sounds literally insane.
Eric Wieser (Jul 25 2024 at 16:26):
Floris van Doorn said:
One question: what information was provided to AlphaProof [...]
Yes, with some elaborator tricks to prevent AlphaProof filling the _ with something that would change the type of the equality.
Eric Wieser (Jul 25 2024 at 16:28):
I don't think "literally insane" is a way I am supposed to describe this work, but yes, the search is in formal language
Andy Jiang (Jul 25 2024 at 16:30):
I saw in the blog post
"We trained AlphaProof for the IMO by proving or disproving millions of problems, covering a wide range of difficulties and mathematical topic areas over a period of weeks leading up to the competition."
can you disclose if the problems are mostly synthetic or existing?
Andy Jiang (Jul 25 2024 at 16:31):
I guess if autoformalization is weakly solved (EDIT!!: for high-school competition problems) (which it sounds like) then you probably have access to millions of existing problems--but idk if that was the better option for you guys
Kevin Buzzard (Jul 25 2024 at 16:32):
Given that AlphaProof seems to want Lean inputs and was trained on millions of problems, I am assuming that they must have been synthetic? I'm not convinced that autoformalization is weakly solved yet.
Andy Jiang (Jul 25 2024 at 16:33):
sorry! I meant obviously for IMO-like questions!
Kevin Lacker (Jul 25 2024 at 16:33):
note this part - "The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found." not only is it training on synthetic data, it generates new synthetic data per problem to train on.
Andy Jiang (Jul 25 2024 at 16:36):
that's wildly exciting that it's able to produce these novel synthetic problems of good quality. Personally I think (I know there are disagreements) that it's already time to see if these synthetic-problem-generation techniques produce good results in math that's closer to research level. I guess we can agree that we can see how well it does in undergrad level math first.
Eric Wieser (Jul 25 2024 at 16:37):
Andy Jiang said:
I saw in the blog post
"We trained AlphaProof for the IMO by proving or disproving millions of problems" [...]
Does the diagram below answer this question? AlphaProof is trained on formal problem statements that were autoformalized from informal ones.
Andy Jiang (Jul 25 2024 at 16:41):
Eric Wieser said:
Andy Jiang said:
I saw in the blog post
"We trained AlphaProof for the IMO by proving or disproving millions of problems" [...]Does the diagram below answer this question? AlphaProof is trained on formal problem statements that were autoformalized from informal ones.
yup! sorry I was reading too fast! so it's based on large-scale autoformalization of existing problems as the backbone.
Floris van Doorn (Jul 25 2024 at 16:42):
Those formalized proofs are so incredibly ugly, it's amazing.
Of course it doesn't have much sensible indentation, but then there are single proof steps where I have no idea what it's even doing: useλy . x=>y.rec λS p=>?_. What does a dot even mean inside a lambda-abstraction?
And then there are nonsense mathematical steps. The solution of problem 2 starts with induction, before introducing any variables. It applies induction to the number 12. And it writes 12 as (10)+2. Then it proceeds to do the whole proof in the base case of the induction, and notices that the induction step is trivial, since the goal is the same as the induction hypothesis (but instead of the assumption tactic it uses congr 26).
I've only read the first few lines of these formalizations. I'm amazed that while doing such stupid steps, it still managed to reason in the direction of an actual solution and solved it. I'm curious how much compute DeepMind threw at this.
Eric Wieser (Jul 25 2024 at 16:42):
What does a dot even mean inside a lambda-abstraction?
If you look through my zulip posts recently, you will find that I asked precisely this question!
Floris van Doorn (Jul 25 2024 at 16:47):
Oh yes, I remember that now. I somehow always forget that . and \. often mean the same thing.
Kevin Lacker (Jul 25 2024 at 16:48):
the nonsense mathematical steps remind me of how alphago in a winning position would happily make completely random moves. because from the AI's point of view, it's equally winning if it wastes five moves and then wins by 1 point, as opposed to immediately winning by 20 points. similarly, a proof is equally valid if you throw in a few pointless steps
Adam Topaz (Jul 25 2024 at 16:49):
Shouldn't the RL model get additional rewards for shorter proofs?
Kevin Lacker (Jul 25 2024 at 16:51):
I dunno, the IMO doesn't give you additional points for shorter proofs. so maybe not? I guess with this and many other questions we must await the technical details.
Floris van Doorn (Jul 25 2024 at 16:55):
Now I'm wondering if a human contestant gets full points for a proof like this. They start their proof with "We start by doing induction on (10)+2.", then solve the problem in the base case and notice that the induction step is trivial.
Probably they would get full marks, since grading at the IMO is very forgiving about writing wrong stuff, as long as the right stuff is also there.
Bhavik Mehta (Jul 25 2024 at 17:03):
https://www.reddit.com/r/math/comments/1ea7m3o/how_coordination_went_for_imo_2024_problem_3/ Some unformalised attempts were just as nonsense!
Floris van Doorn (Jul 25 2024 at 17:05):
Here is another one: it writes (u<|@@↑(( (-a )))) instead of u (-a). Note that no coercion is inserted, a is a rational number, so the @ doesn't do anything, and the second @ definitely doesn't do anything.
(Side note: in Lean 2 @@ actually was a single token with a special meaning (making only higher-order arguments explicit IIRC), but was rarely used.)
Kevin Lacker (Jul 25 2024 at 17:09):
Floris van Doorn said:
Now I'm wondering if a human contestant gets full points for a proof like this.
It's not even wrong, it's just weird. The proof is completely valid, we just find it aesthetically displeasing to use needlessly complicated steps when there is an obvious simple one that does the same thing. so, I do think a human contestant would deserve full points. but, as a tool for practical mathematics it might make sense to optimize these models for something other than "correctness at any cost". or to have some other mode of simplifying the proofs.
Andy Jiang (Jul 25 2024 at 17:11):
Probably you can train a separate net to improve proofs; I don't think that's as hard as finding the proofs
Tyler Josephson ⚛️ (Jul 25 2024 at 17:21):
Congrats on the article in NYTimes!
Ning DY (Jul 25 2024 at 17:28):
I don't know what Poincaré would say to this machine. I think even if it wins a gold medal it would miss some important aspects of human mathematicians. Remember what he said? "No theorem can be new unless a new axiom intervenes in its demonstration". But here we just keep the axioms fixed. Some IMO problems have something to do with this ability, especially in combinatorics and number theory, for example the windmill problem.
Kim Morrison (Jul 25 2024 at 19:53):
My take on the insane syntax is that this is showing us that AlphaProof is more "AlphaZero doing self play against Lean" and less "Gemini reading human proofs".
Super (!) exciting that this works, and of course a little worrying as to whether future AI and human mathematicians will find each other mutually comprehensible!
Kevin Buzzard (Jul 25 2024 at 20:05):
Andy Jiang said:
The complexity of the proofs found by AlphaProof probably means at least that it can be used for formalization efforts here in Lean for existing math, and also potentially (depending on the generality of the techniques) can already be tried in a range of research math domains? Is this correct?
Whilst I quickly answered "no" to this in the above chat, it's worth noting that Gowers is much more optimistic in the NYT article: his quote is “It’s a fairly safe bet that if Google DeepMind can solve at least some hard I.M.O. problems, then a useful research tool can’t be all that far away”. This might reflect the fact that he and I work in very different areas -- he recently proved PFR, which is a statement which I (and Lean) can easily understand, but in my area even the statements of the interesting conjectures/theorems are well beyond Lean (indeed this is part of the motivation of the FLT project -- it will force us to formalise a bunch more high-powered definitions so we can state what's going on in the Langlands program). Or it might just reflect the fact that he's more optimistic than me :-)
llllvvuu (Jul 25 2024 at 20:15):
FWIW Gowers seems to be thinking more broadly about whether DeepMind / the AI community at large will figure out research mathematics, as opposed to whether AlphaProof specifically will figure it out (also what he considers a "useful research tool" could mean many things)
Kevin Buzzard (Jul 25 2024 at 20:16):
Right. And the reason I'm pessimistic is that IMO problems can all be solved using a small and well-defined list of techniques (for example you never need to use calculus or group theory), whereas research mathematics is a very different kettle of fish.
Andy Jiang (Jul 25 2024 at 20:57):
Random thoughts on the divide of synthetic data vs human data. One can imagine that in the true AlphaZero spirit you could give a system the raw Lean codebase and tell it "go forth and prove theorems", and it creates some system which is good at proving theorems which are easily stateable with the existing code base. This is not what the current approach does--it takes as input the distribution of competition problems (which is reasonably well populated by existing problems), and on this distribution it trains the system. This second approach is probably easier but less robust to generalization. Though idk if the former one is possible at all. Though AlphaGeometry kind of does the former.
Andy Jiang (Jul 25 2024 at 21:07):
Another possibility is that (in my opinion) in some areas of research, despite relying on many layers of abstractions, there are often formal proofs which don't need to break these abstractions. And even in these situations, having these proofs be done automatically could be helpful. It wouldn't require things to be formalized down to existing axioms--but instead to axioms internal to the theory in question. Not sure how practical this would be.
Ralf Stephan (Jul 25 2024 at 21:18):
Kevin Buzzard said:
Right. And the reason I'm pessimistic is that IMO problems can all be solved using a small and well-defined list of techniques (for example you never need to use calculus or group theory), whereas research mathematics is a very different kettle of fish.
Doesn't this call for a new kind of competition with problems that need other techniques?
Kevin Buzzard (Jul 25 2024 at 21:21):
I feel like my beliefs have some kind of catch 22 internal inconsistency. I can give you a competition: we have a statement of the Riemann Hypothesis in mathlib and the competition is to prove it. But everyone knows this is ridiculous. On the other hand if I ask you to prove something which is already known then I can then complain that probably you saw the proof already, or, if it is an unseen problem like IMO (or Putnam) then probably you saw all the techniques already. So clearly I'll never be happy!
Kim Morrison (Jul 25 2024 at 21:42):
I think we are much closer to research relevance than Kevin makes out.
The initial application will be in the phase of research where one is thinking about a bigger problem, and realises that some smaller statement both looks plausibly true (but you really don't know), and would likely (but again, with no certainty) make progress.
Being able to say: "hey, inhuman weird self-trained incomprehensible AI, can you prove my lemma?", and sometimes getting back an answer "Yes", is incredibly useful. Even if it's not worth the time to understand its proof (or even if it is impossible to), it is still very useful validation of the intuition that the smaller statement may be helpful.
At least in my experience of research problems this is a very common experience. I think if you look at these tools through this lens ("when would it be helpful to have a magic signal that a minor conjecture is true?") then applications look closer.
(This all said, I've been writing lots of boring lemmas about List recently, and I am frequently surprised at how good Copilot is already at these. Sometimes it is writing 5-10 line proofs in one shot, and getting them right. It's certainly saving me time.)
Kevin Buzzard (Jul 25 2024 at 21:42):
I guess, reflecting on this, it's funny that you're asking for a competition. The CS papers I read about AI are full of benchmarks and SOTA etc. This stuff is completely absent in, say, number theory; people prove theorems which help explain how things fit together. Ok so if you want to interpret the IMO game as a problem which will be solved by this time next year, what should the new game look like? The next step in most successful IMO candidates' lives is an undergraduate degree, and this is very easy for a machine because most of the questions in the exams are "use techniques explained in lectures to do these problems, some of which are straightforward because we don't want everyone to fail". I would imagine that this is an easier target than IMO, which requires more out-of-the-box thinking. Many undergrad or Masters theses in maths are 100% unoriginal in terms of material, typically in my area at least the goal is to summarise a topic (eg "survey of how FLT was proved"). But then after that it's PhD and now you have to find an interesting research problem and this already is very difficult.
Kevin Buzzard (Jul 25 2024 at 21:44):
I totally agree with Kim that a machine which says "hey you want to prove A=>C; how about proving A=>B and B=>C for this statement B which I just pulled out of a hat" would really be a powerful tool.
Kim Morrison (Jul 25 2024 at 21:45):
@Kevin Buzzard, that was the exact opposite of what I was suggesting!
Kim Morrison (Jul 25 2024 at 21:46):
I was proposing that often you have an intuition about a B, such that at least one of A=>B and B=>C is "not too hard", and you would like the machine to go and prove it.
Kevin Buzzard (Jul 25 2024 at 21:46):
Ha ha I misread: you want the human to find B and then the computer to check it
Kevin Buzzard (Jul 25 2024 at 21:46):
But that code we're looking at is often arguing forwards
Kim Morrison (Jul 25 2024 at 21:48):
There's no code here.
My point is that when thinking about A -> C, and you come up with a candidate intermediate B, even if you know B->C will remain really hard, but hope A->B is not too hard, you next have to go prove A->B. Sometimes this works, and you're happy. Sometimes it doesn't, so you reject B and try to come up with B'.
But getting faster signal about whether A->B is in fact true would be really helpful.
Luigi Massacci (Jul 25 2024 at 21:49):
Kim Morrison said:
Being able to say: "hey, inhuman weird self-trained incomprehensible AI, can you prove my lemma?", and sometimes getting back an answer "Yes", is incredibly useful. Even if it's not worth the time to understand its proof (or even if it is impossible to), it is still very useful validation of the intuition that the smaller statement may be helpful.
I think one should also keep in mind that the amount of computation Google can do in N hours is not the same as what the average mathematician can do with their laptop. If the "yes" comes after retirement it is not as useful (at least for the poor average mathematician).
Luigi Massacci (Jul 25 2024 at 21:50):
Then again it depends on how much one believes in Moore's law
Kim Morrison (Jul 25 2024 at 21:50):
Point taken, but also: Moore's law :woman_shrugging:
Luigi Massacci (Jul 25 2024 at 21:50):
For now :laughing:
Kim Morrison (Jul 25 2024 at 21:51):
Our AI overlords will start doing the chip design, don't worry. :-)
Chris Birkbeck (Jul 25 2024 at 21:54):
Kevin Buzzard said:
I feel like my beliefs have some kind of catch 22 internal inconsistency. I can give you a competition: we have a statement of the Riemann Hypothesis in mathlib and the competition is to prove it. But everyone knows this is ridiculous. On the other hand if I ask you to prove something which is already known then I can then complain that probably you saw the proof already, or, if it is an unseen problem like IMO (or Putnam) then probably you saw all the techniques already. So clearly I'll never be happy!
Maybe we should make a list of open problems (below Millennium Prize level) to throw at these AIs.
Andy Jiang (Jul 25 2024 at 21:55):
Sorry, how can you rank the difficulty of unsolved problems? Are you assuming some are ones where someone basically knows how to solve them but is too lazy to do it?
Chris Birkbeck (Jul 25 2024 at 21:56):
Well, we do this often. For example, when thinking of problems to give PhD students: something where we think we know how it should go but the details need checking.
Chris Birkbeck (Jul 25 2024 at 21:59):
Things like the non-Archimedean Scottish Book, for example, have problems that boil down to "find some ring with some property", which is maybe more reasonable for an AI to look at (that said, I know very little about how these AIs look for solutions). Edit: that said, some of these you probably wouldn't give a PhD student, since we don't know what to do.
Kevin Buzzard (Jul 25 2024 at 22:15):
Kim Morrison said:
There's no code here.
I mean the Lean solutions to the IMO problems.
Joseph Myers (Jul 26 2024 at 00:26):
One journalist asked me when Millennium Prize Problems might be within reach of AI. I think it would be interesting to try it on unsolved problems (with elementary statements), not because it's likely to solve any of them at present (though if you list 1000 unsolved problems, there are probably a few that do in fact have a simple solution that an AI might have some chance of finding where humans have missed it), or even make useful new progress, but because seeing which known facts about such unsolved problems the AI found, and which known facts it didn't find, would give more information about its capabilities.
Jacob (Jul 26 2024 at 02:24):
If this work leads to actual tools to be used then this will probably be heavily improved upon. Leela (Open source chess AlphaZero, which is now much stronger than AlphaZero) had a similar issue early on and now it is effectively dealt with. In that case I believe they introduced a Moves Left head which was used to sharpen the policy among other methods.
Floris van Doorn said:
Here is another one: it writes (u<|@@↑(( (-a )))) instead of u (-a). Note that no coercion is inserted, a is a rational number, so the @ doesn't do anything, and the second @ definitely doesn't do anything.
(Side note: in Lean 2 @@ actually was a single token with a special meaning (making only higher-order arguments explicit IIRC), but was rarely used.)
Utensil Song (Jul 26 2024 at 02:50):
Kim Morrison said:
My take on the insane syntax is that this is showing us that AlphaProof is more "AlphaZero doing self play against Lean" and less "Gemini reading human proofs".
Super (!) exciting that this works, and of course a little worrying as to whether future AI and human mathematicians will find each other mutually comprehensible!
Indeed, I suspect DeepMind has used all legal Lean 4 syntax as the action space for the self-play (EDIT: except for premise selection and type matching), which is massive, and the weird syntax combinations in the code are the result (because some combinations are formed almost at token level and are unlikely to have been prepared by human experts).
It's reasonable to guess that the action space can be significantly reduced in future optimization; the algorithm would then be more efficient and the code would become much more comprehensible.
Jason Rute (Jul 26 2024 at 05:55):
Re: comprehensibility, I think cleaning up search-generated code would be in the realm of current technology. LLMs are really good at "style", and refactoring a correct proof is likely much easier than making a new proof.
Jason Rute (Jul 26 2024 at 06:02):
I'm struck by how the main pipeline is the standard pipeline using (1) auto-formalized problems, (2) search, and (3) reinforcement learning. If they really use pure AlphaZero, that is even more interesting. (Although even in AlphaZero one has lots of choices as to what is a state and an action, as well as choices of model architecture and data.) It really seems then that massive compute and a massive team (which can make sure all the technical details are right) can go a long way. I'm excited to read the full details, but I also hope this inspires new ideas in this field and not just a desire for more massive amounts of compute. Congrats to the team @Eric Wieser!
Jason Rute (Jul 26 2024 at 06:32):
It is also not at all clear to me what one is doing over 3 days. Is it just MCTS over a giant tree (or many independent trees)? Is there RL in there (for a single problem)? Or maybe "AlphaZero" is just branding and there are lots of new ideas going on very different from the AlphaZero paper and the Hyper-tree proof search paper? (I hope the latter is the case.)
Jason Rute (Jul 26 2024 at 07:55):
I said "a giant tree", but to be honest depending on the models used, it might be that every step of this search is fairly slow and this search tree might not be too large in the end. One of my main comments about Copra is how slow and expensive each step is, since it calls GPT-4 with a long context at each proof state. It could be that AlphaProof does something similar, but has the budget to see such an approach through to the end. I even saw a code generation paper at ICML this week that generated whole functions at each step of the tree search, so that would even take longer. In that setting each action would be a whole code repair.
Siddhartha Gadgil (Jul 26 2024 at 08:54):
Kim Morrison said:
There's no code here.
My point is that when thinking about A -> C, and you come up with a candidate intermediate B, even if you know B->C will remain really hard, but hope A->B is not too hard, you next have to go prove A->B. Sometimes this works, and you're happy. Sometimes it doesn't, so you reject B and try to come up with B'.
But getting faster signal about whether A->B is in fact true would be really helpful.
And in many cases, automatic experimentation showing that A->B is actually false.
Simone Castellan (Jul 26 2024 at 09:45):
Eric Wieser said:
Andy Jiang said:
I saw in the blog post
"We trained AlphaProof for the IMO by proving or disproving millions of problems" [...]Does the diagram below answer this question? AlphaProof is trained on formal problem statements that were autoformalized from informal ones.
How far do you feel the formalizer network is from being able to formalize some actual mathematics? (Say, on the level of mathlib)
Andy Jiang (Jul 26 2024 at 10:01):
Jason Rute said:
I said “a giant tree”, but to be honest depending on the models used, it might be that every step of this search is fairly slow and this search tree might not be too large in the end.
Sorry, stupid question, but the tree must be enormous, right? Look at the length of the P6 solution. Unless you think their policy network is outputting like 3 lines at each step. I interpreted them as doing search at token level, honestly (maybe that's wrong).
Also to the members of the team here: I (like Jason) am also curious if you can share if it's doing RL at test-time or just MCTS search.
Jason Rute (Jul 26 2024 at 10:23):
It is a good point about P6. I assume they also handle independent subgoals correctly (namely doing the searches for each subgoal separately). That would help a lot in keeping the tree size down.
Jason Rute (Jul 26 2024 at 10:28):
@Andy Jiang why do you assume they branch on tokens? Tokens are detached from the proof states and goals.
Andy Jiang (Jul 26 2024 at 10:35):
Jason Rute said:
Andy Jiang why do you assume they branch on tokens? Tokens are detached from the proof states and goals.
Oh good point. I guess proof steps are more reasonable. But then at each step you have basically an infinite action space in your MCTS. which i guess is technically ok...
Andy Jiang (Jul 26 2024 at 10:36):
But maybe you have to do this if your value net takes the proof state as input
Luca Sartori (Jul 26 2024 at 10:43):
About all the questions on being comprehensible: could it be possible to train an already existing LLM to translate the code into natural language? (Maybe even trying to comment on it afterwards, so as to make it somehow "reason".) And then another AI to retranslate from natural language back to Lean to verify, and so auto-train (or even train to write the proof given what the other AI thinks is the main idea). My main thought is that this new technology seems to have produced lots of good-quality mathematical proofs, with the caveat of being only in code. This method could give lots of natural-language examples of new proofs, and that could train a new natural-language AI.
The writing in code, I think, is not that practical alone and restrains its "thought" process; for example, solving P3 and P5 could be much easier if some steps are taken in natural language and then formalized (I think of a mixed approach with an AI that tries many things to prove, as with AlphaGeometry, and another that just produces a more heuristic approach, like proposing lemmas to prove).
Jason Rute (Jul 26 2024 at 10:53):
Andy Jiang said:
Jason Rute said:
Andy Jiang why do you assume they branch on tokens? Tokens are detached from the proof states and goals.
Oh good point. I guess proof steps are more reasonable. But then at each step you have basically an infinite action space in your MCTS. which i guess is technically ok...
MCTS for formal theorem proving is not new. Many papers do it. The HTPS paper by Meta does it for example. There are two approaches to the infinite action space. The simplest is just to make the first (or top) N generations of your model the N actions. It isn’t complete, but that is fine. The other is that DeepMind had a new version of MCTS that IIRC allows infinite action spaces by letting you sample new actions at a node instead of following the path to a leaf.
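(For reference, the standard AlphaZero-style selection rule is to pick, at a node $s$, the action $$a^* = \arg\max_a \Big(Q(s,a) + c_{\mathrm{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}\Big),$$ where $P$ is the policy prior, $N$ the visit count and $Q$ the running value estimate; the sampled variant roughly replaces the prior with an empirical distribution over a handful of sampled actions. Whether AlphaProof's search uses exactly this form is of course not public.)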
Andy Jiang (Jul 26 2024 at 11:04):
@Luca Sartori my uninformed guess is that Google would definitely do this to help train the next Gemini model. Seems like an obvious source of good reasoning data.
Notification Bot (Jul 26 2024 at 12:49):
boboquack has marked this topic as resolved.
Notification Bot (Jul 26 2024 at 12:50):
boboquack has marked this topic as unresolved.
boboquack (Jul 26 2024 at 12:50):
(oops, not sure what I clicked there, sorry!)
David Renshaw (Jul 26 2024 at 13:17):
I tried running my proof animation tool on AlphaProof's solution to problem 6: https://x.com/dwrensha/status/1816824331629998328
Andy Jiang (Jul 26 2024 at 14:15):
Jason Rute said:
Andy Jiang said:
Jason Rute said:
Andy Jiang why do you assume they branch on tokens? Tokens are detached from the proof states and goals.
Oh good point. I guess proof steps are more reasonable. But then at each step you have basically an infinite action space in your MCTS. which i guess is technically ok...
MCTS for formal theorem proving is not new. Many papers do it. The HTPS paper by Meta does it for example. There are two approaches to the infinite action space. The simplest is just to make the first (or top) N generations of your model the N actions. It isn’t complete, but that is fine. The other is that DeepMind had a new version of MCTS that IIRC allows infinite action spaces by letting you sample new actions at a node instead of following the path to a leaf.
Can you explain to me what the value function is capturing here? (I'm a little thrown off [for only psychological reasons] by the fact that you never really get further away from proving a goal by proving useless things.) Is it the probability of finding a correct proof? An estimate of the shortest proof? Sorry, I really ought to read the literature instead, but maybe it's easier to ask.
Jason Gross (Jul 26 2024 at 14:24):
Andy Jiang said:
Probably you can train a separate net to improve proofs; I don't think that's as hard as finding the proofs
Why wouldn't arbitrary proof improvement be as hard as finding the proof in the first place? e.g., consider the brute-force proof of Fermat's last theorem restricted to numbers <= 10^100. Surely improving this proof to the length of the actual proof of Fermat's last theorem is comparably hard to proving Fermat's last theorem in the first place.
Jason Rute (Jul 26 2024 at 14:29):
@Jason Gross if Andy is referring to what I was also referring to above, I meant cleaning up the proof to be more idiomatic and understandable. Not coming up with a new and completely different proof (but I’m having trouble following the link on my phone).
Jason Gross (Jul 26 2024 at 14:32):
Makes sense. It seems like targeting idiomatic and understandable requires basing it on human-proofs, while having the system target "shortest possible proof" (I guess a slightly better version is "cheapest proof term to typecheck") should enable it to learn to generate better proofs unsupervised.
Jason Rute (Jul 26 2024 at 14:33):
@Andy Jiang Obviously I don't know what DM used, but there are a few good valuation options (and probably none are ideal). (1) Probability that your agent can complete the proof. (2) Estimated remaining proof length (or the exponential of that). If no RL: (3) Arbitrary buckets estimating the current proof state quality. (4) The log probability of the proof so far. (5) Other.
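(For (2), one plausible shape would be $V(s) = \gamma^{\hat{L}(s)}$ for some discount $\gamma \in (0,1)$, with $\hat{L}(s)$ the estimated number of proof steps remaining. That is just my guess at what "the exponential of that" could look like, not something DM has said.)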
Andy Jiang (Jul 26 2024 at 14:37):
I looked at HTPS; it seems there they do probability of success. Which makes some sense, but I'm not sure why it doesn't become a worthless signal saying that any action is ok when the agent is close to the goal. (But I guess this is a classical problem of AlphaZero when you put it in winning positions, and it has known mitigations.)
btw wanted to share this quote from Stanislas Polu on Twitter on MCTS
"All that being said, I tried hard (and I'm not the only one) to get juice out of MCTS. Never convinced myself that it's any better than direct sampling of the proofs + retraining, aka simple Expert Iteration. Is the reference to AlphaZero just marketing or a core element of the solution? My bet would be that this all work nicely with simple Expert Iteration. Unless AlphaZero/MCTS really is the way to backtrack the signal of a trained VF + deterministic preferences (eg shortness or CPU complexity) into the model. With Expert Iteration the only thing you can do is to replace your best proof in your training set by your new best proof. It's less direct."
I have no personal opinions (I don't really know this stuff at all), but I would be really interested in seeing some under-the-hood things on what the policy/value models output at different points of the proof search process.
Jason Rute (Jul 26 2024 at 14:43):
Re: valuation near the end, you still have a policy model, so you don't have to just hill-climb to the next higher value. Also if you make a mistake, like a tactic which makes the goal false, then you will (hopefully) get a signal from your value.
Jason Rute (Jul 26 2024 at 14:46):
When you compare MCTS vs Expert Iteration, you miss that MCTS is two things: a tree search algorithm and an RL algorithm. It isn’t clear to me it is the best tree search algorithm. As an RL algorithm, it is nice since it works well with tree search, and it learns much faster than expert iteration (see the appendix of HTPS), but again I don’t know there isn’t something better.
Will Sawin (Jul 26 2024 at 14:46):
Kevin Buzzard said:
I feel like my beliefs have some kind of catch 22 internal inconsistency. I can give you a competition: we have a statement of the Riemann Hypothesis in mathlib and the competition is to prove it. But everyone knows this is ridiculous. On the other hand if I ask you to prove something which is already known then I can then complain that probably you saw the proof already, or, if it is an unseen problem like IMO (or Putnam) then probably you saw all the techniques already. So clearly I'll never be happy!
I mean here is a list of open research-level problems which currently contains 573 members. The average difficulty of the problems is low enough and the size of the list is large enough that new ones are solved frequently, so one can't say that the challenge of solving one of them is too hard as a test of whether AI can do research-level mathematics. While it's possible that known techniques will suffice for some of them, finding which techniques suffice for which problems would itself be an achievement. The statements of the problems are simple and elementary enough that I think for most of them it wouldn't be too hard to formalize the statement in Lean.
Ralf Stephan (Jul 26 2024 at 17:56):
The Erdős problem page above has links to the OEIS, which is full of conjectures. I came up with this paper by eyeballing the numbers from the OEIS in bulk after some transformations, and writing down conjectures, completely ignorant of what was known in the respective field. Some of them are still unsolved. So, conjectures in discrete math are cheap.
Jason Rute (Jul 26 2024 at 18:13):
Stan Polu has some nice takes on this work: https://x.com/spolu/status/1816809478894886941
Kevin Lacker (Jul 26 2024 at 19:03):
Andy Jiang said:
Also to the members of the team here: I (like Jason) am also curious if you can share if it's doing RL at test-time or just MCTS search.
The blog post claims the "training loop" was applied at test-time, so it's not just MCTS search.
The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
Andy Jiang (Jul 26 2024 at 19:38):
But the only signal it gets at test time is the reward model right? It's interesting they can do RL on just that. I feel like I'm missing something
David Renshaw (Jul 26 2024 at 19:43):
David Renshaw said:
I tried running my proof animation tool on AlphaProof's solution to problem 6: https://x.com/dwrensha/status/1816824331629998328
I animated the solutions for Problem 1 and Problem 2 also.
Joseph Myers (Jul 26 2024 at 20:12):
I think an AI benchmark of 573 unsolved problems would be of most interest if at least one of the following conditions were met:
- The benchmark is routinely used and results reported by people working on AI theorem proving, even when their AIs solve no problems, and including lots of raw data about what the AI tried and what it did or didn't discover along the way, that people could then mine for information about AI progress as well as for ideas for further human work on the problems. Unfortunately, this sort of open collaboration and publication of work-in-progress and negative results seems rare in AI.
- The benchmark genuinely covers all problems from the source material, each expressed in a fully idiomatic way in Lean and with all extra definitions / theory needed for clean and idiomatic expressions of the problems being added to mathlib along the way. For benchmarks based on competition problems, it so far seems that the people writing benchmarks are largely disjoint from those building mathlib, so prefer to release a benchmark based on a partial set of problems that could readily be stated in Lean, and a paper noting the incompleteness of mathlib as a reason for less coverage of some kinds of problems, rather than going for maximal contributions to mathlib to make the benchmark completely cover the original problems.
Jason Rute (Jul 27 2024 at 06:22):
@Joseph Myers As for the fact that the benchmark writers aren't heavy ITP users, I have been saying for a while now that the heavy ITP users like mathlib's maintainers should be making their own benchmarks. As we see, researchers just build systems targeted toward benchmarks, so why not make the benchmarks accurately reflect the concerns of the Lean (or Coq/Isabelle) community.
Jason Rute (Jul 27 2024 at 06:38):
Andy Jiang said:
But the only signal it gets at test time is the reward model right? It's interesting they can do RL on just that. I feel like I'm missing something
This is a good question. One paper which does RL on single proofs is https://arxiv.org/abs/2102.09756 by @Minchao Wu.
As for signal, the two most obvious signals are when a tactic fails (negative signal) and when one closes a subgoal (positive signal). Since subgoals are usually independent and different proof attempts get you to different subgoals, that could be a good way to learn more about the theorem. Also it is a fuzzy way of caching goal states you have already solved. It is possible in different proof attempts and different parts of the same proof that you will run into subgoals which you have already solved (or goals very similar to those you have already solved).
Another signal they might use is the updated value at different nodes of the search tree. This way, between proof attempts, they could memorize more promising and less promising paths. For less promising paths especially, this could break symmetries. For example, an LLM will often suggest intro n, intro m, etc., but this is just duplication of effort. If such a tactic gets a low backpropagated value through MCTS, then fine-tuning on that value will teach the model not to also try to do intro n' on that proof state.
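Concretely, the two toy scripts below differ only in a bound-variable name, so they land in identical proof states, and exploring both is pure duplication of effort:

example : ∀ n : ℕ, n ≤ n + 1 := by
  intro n
  exact Nat.le_add_right n 1

example : ∀ n : ℕ, n ≤ n + 1 := by
  intro m
  exact Nat.le_add_right m 1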
Finally, that line says they learn from “self generated variants of the contest problems”, so that provides even more diverse signals whatever “self-generated variants” means.
Sid (Jul 27 2024 at 08:18):
Jason Rute said:
Andy Jiang said:
Jason Rute said:
Andy Jiang why do you assume they branch on tokens? Tokens are detached from the proof states and goals.
Oh good point. I guess proof steps are more reasonable. But then at each step you have basically an infinite action space in your MCTS. which i guess is technically ok...
MCTS for formal theorem proving is not new. Many papers do it. The HTPS paper by Meta does it for example. There are two approaches to the infinite action space. The simplest is just to make the first (or top) N generations of your model the N actions. It isn’t complete, but that is fine. The other is that DeepMind had a new version of MCTS that IIRC allows infinite action spaces by letting you sample new actions at a node instead of following the path to a leaf.
Do you have a link to the DM paper that does MCTS with infinite action spaces?
Jason Rute (Jul 27 2024 at 09:25):
@Sid The paper is here: Learning and Planning in Complex Action Spaces
Jason Rute (Jul 27 2024 at 09:30):
In the AlphaTensor paper they call it “Sampled AlphaZero” and I assume they use some variation of that.
Joseph Myers (Jul 27 2024 at 11:25):
Floris van Doorn said:
Those formalized proofs are so incredibly ugly, it's amazing.
Of course it doesn't have much sensible indentation, but then there are single proof steps where I have no idea what it's even doing: useλy . x=>y.rec λS p=>?_. What does a dot even mean inside a lambda-abstraction?
And then there are nonsense mathematical steps. The solution of problem 2 starts with induction, before introducing any variables. It applies induction to the number 12. And it writes 12 as (10)+2. Then it proceeds to do the whole proof in the base case of the induction, and notices that the induction step is trivial, since the goal is the same as the induction hypothesis (but instead of the assumption tactic it uses congr 26).
I've only read the first few lines of these formalizations. I'm amazed that while doing such stupid steps, it still managed to reason in the direction of an actual solution and solved it. I'm curious how much compute DeepMind threw at this.
Here's another possibility that's occurred to me regarding the nonsense steps. It's a known (if incompletely understood) phenomenon that generative AI can produce better results by including extra tokens in its output to give itself more thinking time - not just when those tokens are an explicit chain of thought, but even if they are just "... thinking ... thinking" or similar. I don't know how that would interact with the kind of learning / search involved in finding these proofs, but maybe that's still what's happening here: the AI can get better results by including useless but harmless steps in its output while it "thinks" about what meaningful steps to do next, and so it's learned to insert such nonsense in proportion to the amount of thinking time it needs? If so, then rather than laughing at these steps as being nonsense or showing how stupid the AI is, maybe they are actually a source of information on which steps the AI expects to need more thinking time for and so are themselves worthy of study to understand the AI better?
Andy Jiang (Jul 27 2024 at 11:59):
Jason Rute said:
Finally, that line says they learn from “self generated variants of the contest problems”, so that provides even more diverse signals whatever “self-generated variants” means.
To me it seems like this must be a key reason they can do RL at test time--some sort of local synthetic problem generation (local meaning in some neighborhood of the given problem). Idk if it also generates potential subgoals as part of the synthetic data, but it could be reasonable. Stanislas Polu was wondering why there's a 100x blowup from the 1M problems they have to the 100M formal problems they formalize. He was suggesting it was mostly coming from intermediate goals in the formalizations of solutions, but I wasn't sure if most of the problems they found have full solutions (and indeed if they did, that would also make the problems less hard and probably produce a much more overfit agent). Thanks to Kevin for pointing this out.
By the way, in one of Tim Gowers's more recent Twitter replies (https://x.com/wtgowers/status/1816919893742366883) he points out that the way the program chooses potential answers to prove (like the even integers in P1) is by generating a lot and eliminating a lot with easy counterexamples--that could also feed into this synthetic data somehow.
Will Sawin (Jul 27 2024 at 18:28):
It seems hard to change the culture of people not posting partial results. The only solution I can think of is to come up with an easier benchmark, so that more people can get results on the benchmark that are good enough to be worth posting. Certainly the Putnam benchmark should be much, much easier than the Erdös benchmark, and well, the IMO benchmark isn't even solved yet. So all of this might be a little premature.
To me it seems like the best way to get people to write a benchmark and contribute to mathlib along the way to support it is to get people who are already contributors to mathlib interested in writing a benchmark, rather than convincing benchmark writers to contribute to mathlib. But I don't know what sort of benchmark mathlib contributors would be interested in creating.
That said, with regard to getting every problem from the list into the benchmark, some of the problems on the Erdős problem list I suggested are not fully formal prove-or-disprove problems, and so some choice would have to be made about how to formulate them as formal problems, and some might even have to be dropped.
Bhavik Mehta (Jul 27 2024 at 18:31):
Joseph Myers said:
I think an AI benchmark of 573 unsolved problems would be of most interest if at least one of the following conditions were met:
- The benchmark is routinely used and results reported by people working on AI theorem proving, even when their AIs solve no problems, and including lots of raw data about what the AI tried and what it did or didn't discover along the way, that people could then mine for information about AI progress as well as for ideas for further human work on the problems. Unfortunately, this sort of open collaboration and publication of work-in-progress and negative results seems rare in AI.
- The benchmark genuinely covers all problems from the source material, each expressed in a fully idiomatic way in Lean and with all extra definitions / theory needed for clean and idiomatic expressions of the problems being added to mathlib along the way. For benchmarks based on competition problems, it so far seems that the people writing benchmarks are largely disjoint from those building mathlib, so prefer to release a benchmark based on a partial set of problems that could readily be stated in Lean, and a paper noting the incompleteness of mathlib as a reason for less coverage of some kinds of problems, rather than going for maximal contributions to mathlib to make the benchmark completely cover the original problems.
I have (scattered across repos) versions of all the definitions needed for the 808 Erdős problems in an idiomatic Lean style, as well as formal proofs of solutions for at least four of them. Some, however, have difficulties similar to those in IMO problems, with the question not being completely specified. For instance, https://www.erdosproblems.com/581 asks to "Determine f(m)", and the solution given there is not a complete description of the function (although it does still give really cool bounds!). https://www.erdosproblems.com/29 asks for an "explicit construction", which is made precise in that paper but is not particularly nice to define. Questions like https://www.erdosproblems.com/6 also don't specify their answer (indeed in this specific case, a disproof was more valuable than a proof!). The problem https://www.erdosproblems.com/192 is a more extreme example of this; it's basically the same as an IMO "find all" style problem. Each of these cases is a solved problem, but the same properties can be observed in unsolved ones; consider https://www.erdosproblems.com/117.
Will Sawin (Jul 27 2024 at 18:45):
It's really cool to know you have been thinking about this!
Problem 192 is actually not so bad. It's clear from the statement that if it's true for d then it's also true for d-1. So one can formulate the problem as: either give a positive integer n and prove it's true for d < n and false for d at least n, or prove it's true for all d. This is not much worse than a "prove or disprove" problem.
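In Lean the statement shape might look roughly like this (purely illustrative: P is a placeholder for the actual property in problem 192, which would of course have to be formalized for real):

example (P : ℕ → Prop) :
    (∃ n : ℕ, 0 < n ∧ (∀ d < n, P d) ∧ (∀ d, n ≤ d → ¬ P d)) ∨ (∀ d, P d) := by
  sorry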
Bhavik Mehta (Jul 27 2024 at 18:50):
Ah good point about 192, it is indeed not quite as bad as I thought on a first read
Joseph Myers (Jul 27 2024 at 22:32):
Bhavik Mehta said:
I have (scattered across repos) versions of all the definitions needed for the 808 erdos problems in an idiomatic Lean style, as well as formal proofs of solutions for at least four of them.
So if these were to get into mathlib, rather than only across different repos, then we'd arguably pass the benchmark of "able to state all Erdős problems".
(The benchmarks most closely related to mathlib seem to be such benchmarks for formal libraries, rather than benchmarks for AI. As another example, the 100-theorems list is such a benchmark. To make the two kinds of benchmarks overlap, imagine formalizing the contents of a graduate-level textbook. Getting all definitions and proofs into mathlib completes one kind of benchmark, and enables writing the other kind: write formal statements of all the exercises in the textbook, and ask an AI to solve them. Similarly, you could have one benchmark, "complete the IMO syllabus in mathlib", leading to another, formal versions of all 392 IMO problems for an AI to solve. This sort of thing suggests maybe good benchmarks for AI need to be a collaboration between someone expanding API support in mathlib and someone using that API support to pose challenges for the AI.)
Ralf Stephan (Jul 28 2024 at 07:27):
Will formal statements of IMO problems without proof be accepted in mathlib's Archive/?
Kim Morrison (Jul 28 2024 at 11:14):
Ralf Stephan said:
Will formal statements of IMO problems without proof be accepted in mathlib's Archive/?
I'd prefer not. How about someone creates a repository downstream of Mathlib for it? If it gets some critical mass we can move it to leanprover-community.
Eric Wieser (Jul 28 2024 at 11:18):
Compfiles seems like a good place for them
Eric Wieser (Jul 28 2024 at 11:22):
(and has many such statements already)
Joseph Myers (Jul 28 2024 at 13:54):
I think there are at least three different plausible approaches for formal statements and proofs of IMO (and other competition) problems:
- The problems are examples of Lean theorems and proofs, with both statements and proofs aiming to be reasonably clean and idiomatic, reviewed according to mathlib standards and with all relevant API that might reasonably go in mathlib proper doing so. This is the approach used in the mathlib Archive/ (and since I think expanding mathlib API provides much of the benefit of such formalizations, I'd very much welcome more getting added there to get the mathlib review and API additions).
- The problems are stated explicitly as problems, indicating exactly what the solver has to fill in, with or without proofs (see the sketch after this message). This is what compfiles does. It also seems a reasonable place for a wider range of competitions (if the mathlib archive has the IMO and maybe a few other representative examples of competitions, e.g. something at undergraduate level, compfiles can be more like a formal version of the AoPS contests collections).
- The problems follow well-defined and consistent formalization conventions for how informal statements are translated into formal ones (how to choose the types in a formal statement, what formal definitions are used to translate various informal terminology, when to use auxiliary definitions versus putting everything directly in the theorem statement, etc.). If you hold actual competitions for AIs to solve formal problems, such conventions are of value so that AI writers know how to set up their own test and training problems to be maximally consistent with what the AI will encounter in the competition. (And there is clearly plenty of scope for AI work on such problem solving beyond AlphaProof - improving results beyond IMO silver, getting results from open source AI, getting results with time and compute limits, covering geometry uniformly with everything else rather than needing a separate input language, etc.) Although when doing this you don't actually need proofs, they might help a lot in getting confidence in correctness of the formal statements. (Note that I do not suggest concealing any formal proofs generated; it's up to AI writers how they exclude existing formal and informal proofs from training data if they wish to attempt to use past IMO problems for a benchmark.)
To a large extent, formal proofs could be shared between all three approaches (even when different choices are made about types etc., adapting a formal proof or formalizing the equivalence of two statements shouldn't be hard). Though some approaches involve higher review standards than others, and it can also be of value to formalize more than one solution approach to the same problem.
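As a sketch of the second approach (hedged: this is a made-up toy statement, not an actual competition problem, and compfiles has its own machinery for marking what must be filled in; plain `sorry`s stand in for that here):

```lean
import Mathlib

-- Hedged toy sketch of the "stated explicitly as a problem" style: the solver
-- must replace both `sorry`s, first supplying the data (the answer set) and
-- then proving that it is correct.
def solution : Set ℕ := sorry

theorem toyProblem : {n : ℕ | 2 ∣ n ∧ 3 ∣ n} = solution := sorry
```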
Joseph Myers (Jul 28 2024 at 13:58):
For the last approach, one limiting factor in translating from English to Lean according to well-defined conventions is getting the original English versions of older problem statements - many sources have a habit of paraphrasing problems rather than using the original wording, and this includes sources from which imo-official took some of the pre-2006 statements (from 2006 onwards the papers on imo-official are as actually sat by contestants). Paraphrases are hopefully mathematically equivalent, but different English choices may correspond to different Lean choices.
Thomas Bloom (Jul 28 2024 at 14:03):
Just to chime in (as the creator of erdosproblems.com) to say that I of course would be interested in any use of the site as a source of benchmark problems, and am happy to hear any suggestions about how to change/expand the site to help that effort.
In particular, I'm always happy to hear proposed alternative statements of any of the problems. I often tried to closely match what Erdős wrote, but his phrasing was frequently quite vague, since his goal was to inspire interest in a topic rather than to create precisely defined formal conjectures.
Also, the current tally of 808 problems is certainly not the final count, and there are many more Erdős problems still to be added...
Finally, there are some problems that have received recent solutions short enough to be described completely on the site itself (e.g. 645). These may be useful as test cases, comparing the AI solution to the known human solution. (Of course questions with such simple solutions are probably in practice the same as IMO problems.)
Martin Dvořák (Aug 07 2024 at 09:55):
Has anybody created a version with better (human-oriented) formatting?
Things like spaces in the usual places and Unicode instead of ASCII art...
Eric Wieser (Aug 07 2024 at 09:57):
@Joseph Myers has PR'd human-written versions of all the problems AlphaProof solved to Mathlib (though I believe he did so without any reference to our solutions). docs#Imo2024Q1.result docs#Imo2024Q2.result docs#imo2024q6
Martin Dvořák (Aug 07 2024 at 09:58):
Ah, sorry. I meant, I would like to read the authentic proofs that AlphaProof made, only reformatted without changes to the syntactic tree.
Eric Wieser (Aug 07 2024 at 09:59):
In that case I believe you could create a syntax object using the raw code we published, and then use ppCommand or whatever the function is called to pretty-print it.
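For example, a rough sketch of that pipeline might look like the following (hedged: the exact API names and signatures used here, `runParserCategory` and `ppCategory`, are from memory and may differ across toolchain versions; the idea is just "parse the published source into `Syntax`, then pretty-print it"):

```lean
import Lean

open Lean PrettyPrinter

-- Hedged sketch: parse a string containing a single command in the given
-- environment, then pretty-print the resulting syntax tree without changing it.
def reformatCommand (env : Environment) (src : String) : CoreM Format := do
  match Parser.runParserCategory env `command src with
  | .ok stx => ppCategory `command stx
  | .error err => throwError "parse error: {err}"
```

One could then run this on the published AlphaProof source inside a `#eval` of a `CoreM` action, taking the environment from `getEnv` (again hedged; the details depend on the toolchain).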
Joseph Myers (Aug 07 2024 at 22:07):
My formalizations are indeed directly following the official informal solutions. (In the case of P2 I was following a simpler solution than the ones in the shortlist, taken from Ben Green's mark scheme - the first published version of the solutions document just describes that in a Comment, in the revised version of the solutions document I've sent for publication tonight it's Solution 3.) I prepared formal statements of all nongeometry shortlist problems in advance of the IMO (plus one problem that we (the PSC) ended up removing quite late from the draft shortlist as we found it to be too close to something known), with formal solutions for a few that included the problems that ended up as P1 and P6 (and did the formal solution for P2 during the IMO). I didn't follow any particularly consistent conventions for how to translate informal statements to formal ones (indeed, I found that the formal statements for problems where I also did a formal solution ended up looking different to those for problems where I didn't do a formal solution, because the process of writing a formal solution often included factoring conditions from the problem statement out into separate definitions to make it more convenient to state and use lemmas about those conditions) - but the process of writing lots of formal statements certainly helped make obvious some of the ways in which choices need to be made when converting an informal statement to a formal one, and in which a more literal translation to Lean may sometimes be less idiomatic as a Lean statement.
Eric Wieser (Aug 07 2024 at 22:17):
Joseph Myers said:
I prepared formal statements of all nongeometry shortlist problems in advance of the IMO
I assume these are embargoed in some way until next year? I guess there's also the question of whether IMO shortlist is fair game for the archive, but it would be cool for it to contain a full set of formal IMO shortlist questions formalized by someone on the PSC!
Joseph Myers (Aug 07 2024 at 22:28):
Shortlist problems are embargoed until next year (with some slight fuzziness about whether the embargo is until "end of IMO 2025 closing ceremony" or "end of IMO 2025 second exam"), yes, and anyone wanting embargoed shortlist problems for AI use is to be directed to the IMO Board (officially stated at the third joint Board/Jury meeting of IMO 2024).
Joseph Myers (Aug 07 2024 at 22:40):
I might formalize more solutions for shortlist problems between now and IMO 2025, but my current formalization priority is AperiodicMonotilesLean, which is of relevance to exactly one, or maybe two, past IMO problem out of 392. (For the so-obvious-to-humans-that-there's-no-need-to-mention-it part of IMO 2004 P3, showing that all hooks are aligned to the obvious division of the rectangle into unit squares, the sorts of arguments and results in Appendix A of "An aperiodic monotile" are likely to be useful - the case of squares is easier, but the extra generality of Appendix A is more appropriate for mathlib. Note that in the AperiodicMonotilesLean README I mention that Appendix A could be as much work to formalize as everything else in that project put together. Having definitions for tilings in a geometrical context would also be relevant for stating IMO 1976 P3.)
I may release what I have when the embargo ends (if I formalize solutions to IMO P3 or P5, those would be released when done), but formal statements without solutions do have a significant chance of containing mistakes that would have shown up when formalizing a solution.
Joseph Myers (Aug 07 2024 at 22:45):
I'd like to see all 392 past IMO problems in the mathlib archive (including those that currently have solutions in compfiles only); maybe compfiles is more appropriate for the much larger set of past shortlist problems.
Joseph Myers (Aug 07 2024 at 22:49):
Getting many past IMO problems into the mathlib archive is definitely the sort of project better done by "lots of people do a few problems each" rather than "one person does 392 problems". Formalizing a few algebra / number theory problems (or for more of a challenge, combinatorics problems), and PRing the results to mathlib, should be a suitable exercise for a past olympiad student learning Lean and interested in getting to grips with the mathlib coding standards and becoming a mathlib contributor.
Joseph Myers (Aug 07 2024 at 22:53):
(Defining uniform formalization conventions for how to translate problems to be used in competitions for AIs might be more of a small-group thing, but isn't very relevant for the mathlib archive, where being idiomatic Lean is of more significance than uniform translation from English to Lean.)
Nikolai Bobenko (Aug 09 2024 at 11:10):
Hey, please excuse the somewhat naive question, as I am new to this community and don't really have a great overview of what is happening where. I didn't see this mentioned here.
Is there an open-source effort underway to reproduce this IMO result, similar to what leela-zero did with AlphaZero?
Ralf Stephan (Aug 09 2024 at 11:37):
I think a lot of people are waiting for the paper right now.
Jason Rute (Aug 09 2024 at 14:16):
@Nikolai Bobenko Besides the fact that the paper isn’t out yet, another important question is what one would hope to get out of a reproduction of this project. Is the goal to make a solver which takes 3 days on a heavy amount of compute hardware to solve formalized IMO problems? Or is it to apply similar techniques (and compute budgets) to solve open math problems? Or is it to do ML research on easier benchmarks like MiniF2F or miniCodeProps? Or is it to make a practical tool for Lean users? I think while we will learn a lot from this project, it may not necessarily lead to practical tools right away. Indeed there already is a large amount of research in this space, and many open source implementations (but not enough). Many of those projects have similarities to AlphaProof, but likely don’t scale to solving IMO problems even with 3 days of compute. Also most current research projects, even the open source ones, aren’t polished enough to be practical tools for Lean (or Isabelle or Coq) users.
Andy Jiang (Aug 09 2024 at 15:35):
I also read somewhere that people at Wisconsin-Madison replicated AlphaGeometry? Not sure if this is true or whether they are open-sourcing it. I could be completely off-base, though.
Albert Jiang (Aug 09 2024 at 21:31):
Jason Rute said:
Nikolai Bobenko Besides the fact that the paper isn’t out yet, another important question is what one would hope to get out of a reproduction of this project. Is the goal to make a solver which takes 3 days on a heavy amount of compute hardware to solve formalized IMO problems? Or is it to apply similar techniques (and compute budgets) to solve open math problems? Or is it to do ML research on easier benchmarks like MiniF2F or miniCodeProps? Or is it to make a practical tool for Lean users? I think while we will learn a lot from this project, it may not necessarily lead to practical tools right away. Indeed there already is a large amount of research in this space, and many open source implementations (but not enough). Many of those projects have similarities to AlphaProof, but likely don’t scale to solving IMO problems even with 3 days of compute. Also most current research projects, even the open source ones, aren’t polished enough to be practical tools for Lean (or Isabelle or Coq) users.
Totally agreed with Jason here. I'm thinking of writing an essay about AI for formal maths research in the post-AlphaProof time. The main thing I'd want to promote is that current AI capabilities are ripe enough to provide normal users of Lean a good productivity boost if the tools are built properly. There ought to be replications and efforts to make AlphaProof smaller, more efficient, and more usable. But at the same time, I'd really want to advocate for closing the gap between research and applications and make sure the advances actually benefit people.
Kim Morrison (Aug 10 2024 at 01:36):
I'd love to see work on tooling for Lean users. Even before the DeepMind results there's a huge gap between "demonstrated in a research setting (possibly with a demo tool)" and actual user tools (where, to be honest, there is very little ready).
The engineering work is hard to get done in academia, and my guess is that the big labs are going straight for general code synthesis, and aren't going to provide any intermediate tools for lean users today, even if in principle they have models that could power such tools.
The FRO model seems pretty good... :-) We like doing the "boring" engineering work that can't get credit in academia, and want to give people tools today to bootstrap adoption of lean. We just need twice as much money, an excellent technical program manager, and then twice as many people. :-)
Jason Gross (Aug 10 2024 at 02:09):
I'd also be super interested to find out how far we are from autoformalization / automating the part of math peer review that is checking results for mistakes. It seems like, with access to AlphaProof + enough compute, we could more or less straightforwardly build a system where the interaction mode is "feed in arXiv paper" => "get proposed formalization of theorem statements" => "either okay the formalization or manually tweak it" => "get formalized proof that is closest to having no open holes" (and possibly get a best effort explanation / summary of "why it can't solve what remains")
llllvvuu (Aug 10 2024 at 02:29):
That sounds like it would involve a number of other skills such as API design, Blueprint design, and managing large contexts / knowledge bases. I'm sure that is a north star for many AI4Math projects though. And certainly topics of interest even in the broader codegen space.
David Renshaw (Aug 19 2024 at 02:17):
I published a video where I walk through (and animate!) a heavily cleaned-up version of AlphaProof's solution to problem 2: https://youtu.be/5IARsdn78xE
Sid (Sep 25 2024 at 22:17):
Are there any formalizations of Problems 3 & 5 which AlphaProof was not able to solve?
Jason Rute (Sep 26 2024 at 12:19):
@Sid It looks like @David Renshaw formalized the theorem statement (no proof) for Problem 3 at Compfiles.
Joseph Myers (Sep 26 2024 at 20:46):
I also have independent formal statements (no proofs) for problems 3 and 5. Note that problem 5 is certainly the kind of problem for which it would be easy to make a mistake in the formal statement if not formalizing a solution at the same time, and there are many different approaches possible to formalizing the statement (I like including lots of auxiliary definitions in such statements, whereas the Google DeepMind approach involves putting everything in hypotheses to a single theorem statement rather than having auxiliary definitions).
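To illustrate the contrast on a toy statement (hedged: this is deliberately unrelated to the actual P5 statement, and all names below are made up):

```lean
-- Style A: factor the condition into an auxiliary definition and state the
-- theorem in terms of it; convenient once you want to state and reuse lemmas
-- about the condition.
def IsStrictlyIncreasing (f : Nat → Nat) : Prop := ∀ m n, m < n → f m < f n

theorem withAuxDef (f : Nat → Nat) (hf : IsStrictlyIncreasing f) : f 0 < f 1 :=
  hf 0 1 (by decide)

-- Style B: no auxiliary names; everything appears as hypotheses of a single
-- theorem statement (closer in spirit to the published AlphaProof statements).
theorem allInHypotheses (f : Nat → Nat) (hf : ∀ m n, m < n → f m < f n) : f 0 < f 1 :=
  hf 0 1 (by decide)
```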
Sid (Sep 26 2024 at 21:04):
Joseph Myers said:
I also have independent formal statements (no proofs) for problems 3 and 5. Note that problem 5 is certainly the kind of problem for which it would be easy to make a mistake in the formal statement if not formalizing a solution at the same time, and there are many different approaches possible to formalizing the statement (I like including lots of auxiliary definitions in such statements, whereas the Google DeepMind approach involves putting everything in hypotheses to a single theorem statement rather than having auxiliary definitions).
Do you have them available publicly anywhere or have plans to put them up?
Eric Wieser (Sep 26 2024 at 22:16):
Sid said:
Are there any formalizations of Problems 3 & 5 which AlphaProof was not able to solve?
Such formalizations exist, but we did not end up releasing them. I suspect they don't offer anything over Joseph and David's formalizations though, so there isn't much point in us going through the process of publishing them.
Joseph Myers (Sep 27 2024 at 06:54):
Here is my problem 5 statement (see above caveats about not being verified by formalizing a solution). Although there are various differences of detail between how I stated problem 3 and how it's stated in compfiles (e.g. I used Finset.card rather than Set.ncard), they aren't particularly interesting unless you're trying to establish uniform conventions for translating English problem statements to Lean (and I don't think my formalizations of IMO 2024 problem statements consistently follow any such conventions), and certainly problem 3 is easy enough to write a formal statement for with whatever conventions you wish.
Imo2024Q5.lean
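For a sense of the kind of difference being described (hedged: a toy condition, not taken from either actual statement), the same "has exactly three elements" requirement can be phrased either way:

```lean
import Mathlib

-- Finset-based phrasing versus Set-based phrasing of the same cardinality condition.
example (s : Finset ℕ) : Prop := s.card = 3
example (s : Set ℕ) : Prop := s.ncard = 3
```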
Sid (Sep 27 2024 at 10:43):
Joseph Myers said:
Here is my problem 5 statement (see above caveats about not being verified by formalizing a solution). Although there are various differences of detail between how I stated problem 3 and how it's stated in compfiles (e.g. I used Finset.card rather than Set.ncard), they aren't particularly interesting unless you're trying to establish uniform conventions for translating English problem statements to Lean (and I don't think my formalizations of IMO 2024 problem statements consistently follow any such conventions), and certainly problem 3 is easy enough to write a formal statement for with whatever conventions you wish.
Imo2024Q5.lean
Thanks!
Andy Jiang (Oct 09 2024 at 14:16):
Jason Rute said:
Andy Jiang said:
Jason Rute said:
Andy Jiang why do you assume they branch on tokens? Tokens are detached from the proof states and goals.
Oh good point. I guess proof steps are more reasonable. But then at each step you have basically an infinite action space in your MCTS. which i guess is technically ok...
MCTS for formal theorem proving is not new. Many papers do it. The HTPS paper by Meta does it for example. There are two approaches to the infinite action space. The simplest is just to make the first (or top) N generations of your model the N actions. It isn’t complete, but that is fine. The other is that DeepMind had a new version of MCTS that IIRC allows infinite action spaces by letting you sample new actions at a node instead of following the path to a leaf.
I think from David Silver's talk it's clear that it's tactic-level search (which makes much more sense anyway)
llllvvuu (Oct 09 2024 at 14:50):
I think that's mostly orthogonal to the style of node expansion. Other MCTS papers like HTPS are also tactic level search. HTPS IIRC does node expansion with fixed branching factor of 2.
In a traditional MCTS setting you can use first-play urgency (FPU) to explore a large action space. Of course there is no public info on what DeepMind does for node expansion.
David Renshaw (Oct 09 2024 at 14:54):
I think from David Silver's talk...
Was there a public presentation about this project? As far as I've heard, the blog post is all that's been released.
Sid (Oct 09 2024 at 15:13):
David Renshaw said:
I think from David Silver's talk...
Was there a public presentation about this project? As far as I've heard, the blog post is all that's been released.
https://www.youtube.com/watch?v=pkpJMNjvgXw - AlphaProof is around the 35:30 mark.
Jason Rute (Oct 09 2024 at 17:00):
Even if it is tactic-level prediction, it is possible that they used the full proof history in the prompt, and so it isn't that different from full-proof prediction. (This is even more true if they used efficient prompt caching, since then it is almost the same as full-proof prediction.) (I haven't watched the above video yet.)
Jason Rute (Oct 09 2024 at 17:00):
I don't think HTPS is a great example to draw upon, since it is quite old and operated at much smaller time scales. Other more recent papers have done MCTS filling in entire functions (for code generation) or entire proofs (for automated theorem proving).
Jason Rute (Oct 09 2024 at 17:01):
Also, I'm curious whether the choice of which simpler versions of the problem to prove was also controlled by MCTS or by another mechanism. (Again, I haven't looked to see if this is in the video.)
Sid (Oct 09 2024 at 17:41):
Jason Rute said:
Even if it is tactic-level prediction, it is possible that they used the full proof history in the prompt, and so it isn't that different from full-proof prediction. (This is even more true if they used efficient prompt caching, since then it is almost the same as full-proof prediction.) (I haven't watched the above video yet.)
FWIW, I think InternLM does step prediction and DeepSeek (trained on a much larger dataset) does whole-proof generation, and I think the difference is about 10 percentage points on miniF2F (50% vs 60%). So it's not clear, in my opinion, that it makes a big difference.
Jason Rute (Oct 10 2024 at 12:59):
I watched that video. It is a good explanation of the AlphaProof system, even if there isn't much new information beyond the blog post (or the discussion above). New-ish information:
- they gather a large number (around 1 million) of math statements to autoformalize into Lean (they formalize each in many ways, and are happy whether the formalization turns out true or false, because it is a valid problem either way).
- the AlphaZero training loop tries to prove/disprove each of these statements
- the state of the search is the goal state and the action is the tactic (although I don’t know if this is a bit of a simplification. Maybe there is more used in the state representation.)
- they have a diagram of the inference-time RL loop which provides a bit of clarity (see below)
- during inference, a variant generator creates many variants of the problem, and AlphaZero tries to prove/disprove each one, doing RL along the way
- the problems are in "hard mode" where the AI has to give an answer. This is done by Gemini, which proposes solutions that the AI then tries to prove or disprove. (The relation between the variant generator and the Gemini solution proposer isn't clear to me. Which is run first?)
(image attachment: diagram of the inference-time RL loop)
Jason Rute (Oct 10 2024 at 12:59):
It seems that all the novel (or semi-novel) ideas are LLM-based:
- autoformalization
- variant generation
- solution proposal
But the workhorse is still just AlphaZero-style MCTS and RL.
Joseph Myers (Nov 06 2024 at 05:02):
I've now PRed a full formalized solution to IMO 2024 P5 in #18685. This is longer than any of the existing IMO formalizations in the mathlib archive or compfiles, and depends on two other open PRs (a third PR developed in the course of this formalization, #18366, was merged last week). I didn't find any mistakes in the formal statement previously posted in this thread, though I made some changes to it for stylistic consistency or where slight changes to the formal statement proved nicer for formalization of the solution.
Joseph Myers (Dec 01 2024 at 23:18):
I've now PRed a complete formalized solution to IMO 2024 P3 as well, in #19671 (currently even longer than that for P5).
Harald Carlens (Feb 07 2025 at 10:00):
AlphaGeometry2 paper is out: https://arxiv.org/abs/2502.03544
There were a few paragraphs on this in the AlphaProof blog post back in July but I think this is the first time we're seeing more technical details.
Floris van Doorn (Feb 07 2025 at 15:06):
Cool! Now let's hope that we don't have to wait much longer on the AlphaProof paper :smile:
Joseph Myers (Feb 09 2025 at 14:33):
This paper notes (for possible future work):
First, our domain language does not allow talking about variable number of points, non-linear equations and problems involving inequalities, which must be addressed in order to fully “solve geometry”.
I think that some of the (more combinatorial) problems shown as not attempted (2002 P6, 2006 P6 and 2020 P6, for example) actually illustrate how geometry is connected to the rest of mathematics. Rather than trying to expand the domain-specific language to cover more and more specific cases that are arbitrarily classified as being on the "geometry" side of the line dividing such problems from combinatorics (or number theory, e.g. 2016 P3), an approach which would surely fail to cover some future problem linking geometry to a mathematical idea not previously linked to geometry in an IMO problem, it would be more natural (although Lean is mentioned nowhere in the paper) to make AlphaGeometry into an agent that could be used by AlphaProof on geometrical problems and subproblems.
This seems similar to how people often want to keep expanding the scope of a Lean tactic to cover more and more cases that informally feel like the cases it does cover, but are outside its more precisely defined scope - and where, rather than having the scope of tactics grow ever larger to cover anything vaguely related to a rough idea of what the tactic is meant to cover, we encourage people to combine multiple tactics where appropriate rather than expecting a single tactic to cover as much as possible.