Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Do AI Models Struggle with Combinatorics?


Alex Habeen Chang (Mar 05 2025 at 08:06):

To those following news about machine learning performance on math olympiad problems, it seems notable that the two problems Google's AlphaProof could not solve at IMO 2024 were both "combinatorial" (as described in the Google DeepMind blog post).

My first question is whether it is a "known" trend that models underperform on combinatorial competition-style problems, or, going further, whether models underperform on combinatorics in general compared to theorems in other areas of math. I've heard in conversation that this is the case, but is that generalizing too quickly from one model's performance? Do we even have enough performance data across models to draw such a conclusion?

If this is indeed the case, then why? Is the number of combinatorics theorems available in Mathlib smaller than for other fields of math? Even if that is true, would it have much of an effect on model training, or on applying the model to combinatorial olympiad problems, which might not use many "deep" theorems at all? It's also been mentioned that combinatorial competition-style problems are often stated in an informal fashion that might be difficult for a model to formalize; I agree with this, but in AlphaProof's case the IMO 2024 problem statements were manually translated anyway.

I hope this generates some space for discussion, as opposed to seeming like I'm looking for a definite answer. This is a potential research direction for me, so I want to interrogate the starting questions themselves.

Kevin Buzzard (Mar 05 2025 at 08:19):

I don't know the answer to this question, but the thought I had at the time was that for e.g. number theory questions there is usually a unique natural/idiomatic way to state the question in Lean, whereas for a question which begins "consider a frog trying to cross an N*N board with some hidden bombs in it" you already have some design decisions to make, which might make certain proofs far more inconvenient to formalise.
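
To make that contrast concrete, here is a minimal Lean sketch (both statements and all names are invented for illustration, and the proof is omitted): the number-theoretic claim has essentially one natural phrasing, while merely setting up the board problem already forces encoding choices.

```lean
import Mathlib

-- Hypothetical number-theory statement: the Lean phrasing is essentially forced.
-- (Proof omitted; only the shape of the statement matters here.)
theorem pow_three_mod_eight (n : ℕ) (hn : 0 < n) :
    3 ^ n % 8 = 3 ∨ 3 ^ n % 8 = 1 := by
  sorry

-- Hypothetical "frog on an N*N board with bombs" setup: the statement itself
-- already embodies design decisions (bombs as a Finset? the path as a List,
-- or as a function from ℕ to positions? squares as Fin N × Fin N, or ℕ × ℕ?).
structure FrogProblem (N : ℕ) where
  bombs : Finset (Fin N × Fin N)   -- one of several possible encodings
  path  : List (Fin N × Fin N)     -- could equally be a function or a relation
```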

Perhaps another reason is that for a number theory question you might imagine there are theorems in the library which might already help (e.g. theorems about powers modulo n), but in the question about bombs on a board it seems that the prover might be expected to come up with every idea themselves, in some sense; you could imagine formalising that question in core Lean, with Mathlib being of no use to you at all, whereas Mathlib is definitely of some use when you're trying to work out gcds.

David Wärn (Mar 05 2025 at 10:07):

My non-expert view is that the search space for intermediate lemmas and definitions is much less constrained in combinatorics. Compare problems 6 (algebra) and 3 (combinatorics) from IMO 2024. P6 is difficult for people because the proofs are pretty long and non-intuitive, but the search space is highly constrained: all intermediate steps are something like "for all x, y, if A = B then C = D", where A, B, C, D are expressions involving x, y, f, +, and -. For P3, on the other hand, the idea is to define some notion of "state" such that the sequence moves from state to state in a deterministic way and only visits finitely many states. This more or less makes sense to people, but from an autoformalisation POV you have to at some point synthesize the type of states, as well as the broad outline of the proof. So your search space now includes something like "all types", and you have to do a lot of work to see whether you synthesized the right type etc.
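
A toy Lean sketch of the two search spaces (the statements, names, and the choice of ℤ × ℤ are purely illustrative): in the algebra case every candidate step lives in a small fixed grammar of equations, while in the combinatorics case a type of states has to be invented before anything can even be stated.

```lean
import Mathlib

-- Hypothetical P6-style intermediate step: every candidate lemma is an
-- equation built only from x, y, f, +, and -.
example (f : ℚ → ℚ) (h : ∀ x y, f (x + y) = f x + f y) :
    ∀ x, f (x + x) = f x + f x :=
  fun x => h x x

-- Hypothetical P3-style idea: before any lemma can be stated, the prover has
-- to invent a type of "states" and a transition on it, and only then prove
-- something like "the walk visits finitely many states".
def State : Type := ℤ × ℤ               -- the right choice is problem-specific
def step : State → State := fun s => s  -- placeholder transition function
```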

Joseph Myers (Mar 05 2025 at 12:44):

The design decision to include the whole problem statement in the hypotheses of a single statement to be proved, without any auxiliary definitions, certainly makes stating combinatorics problems harder (consider how much use is made of auxiliary definitions in the version of IMO 2024 P5 in the Mathlib archive, for example, or in various other combinatorics statements in my examples of formal statements following my suggested conventions). It also probably reduces the utility of the AI solver outside constrained domains, given how often in mathematics you want to create new definitions and then prove things involving those definitions.
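
As a toy Lean illustration of the two styles (the property and all names are invented and far simpler than any IMO statement): the same trivial claim, first with everything inlined into one theorem, then with an auxiliary definition split out as a human formalizer typically would.

```lean
import Mathlib

-- Single-statement style (roughly as in competition benchmarks): the whole
-- setup is packed into the hypotheses of one theorem, with no auxiliary defs.
theorem toy_all_in_one (n : ℕ) (colors : Fin n → Bool)
    (h : ∀ i, colors i = true) : ∀ i, colors i ≠ false :=
  fun i => by simp [h i]

-- Idiomatic human style: name the key notion first, then state the theorem in
-- terms of it (and reuse the definition across many lemmas).
def AllTrue (n : ℕ) (colors : Fin n → Bool) : Prop :=
  ∀ i, colors i = true

theorem toy_with_defs (n : ℕ) (colors : Fin n → Bool)
    (h : AllTrue n colors) : ∀ i, colors i ≠ false :=
  fun i => by simp [h i]
```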

I'm less sure how it impacts the difficulty of solving the problems. (My theory is that test-time RL compensates to some extent for the effects of trying to fit the entire formal solution into a single `by` block rather than splitting out lots of lemmas as is done in idiomatic human formalizations.)


Last updated: May 02 2025 at 03:31 UTC