Zulip Chat Archive

Stream: IMO-grand-challenge

Topic: AI Math Olympiad Prize


David Renshaw (Nov 27 2023 at 13:53):

https://aimoprize.com/

$10 million prize fund! And I appreciate that models must be "publicly-shared"

Adam Topaz (Nov 27 2023 at 15:46):

My university's network is blocking that site as potentially malicious... How serious is this?

David Renshaw (Nov 27 2023 at 15:50):

Looks pretty legit to me. https://aimoprize.com/supporters has blurbs from Terence Tao, Leonardo de Moura, and Kevin Buzzard.

David Renshaw (Nov 27 2023 at 15:50):

A big difference from the IMO Grand Challenge is that this one is framed as informal-to-informal.

Adam Topaz (Nov 27 2023 at 15:51):

Gotcha. Ok, I'll spin up my VPN :)

David Renshaw (Nov 27 2023 at 15:51):

(But I still think the best way to get there is through formal-to-formal.)

Jason Rute (Nov 27 2023 at 16:35):

They are taking this really seriously. $10M, paid roles, and prizes for partial progress. That is great to hear! I think they are missing a lot of details, though, like: what is the "AI-MO public sharing protocol"? What are the partial progress goals? What are the entry criteria? How many resources can one use?

David Renshaw (Nov 27 2023 at 16:36):

presumably the folks in the paid roles will eventually fill in those details

Jason Rute (Nov 27 2023 at 16:38):

Overall I hope the criteria are not too onerous. Otherwise we could have a situation where some tech company produces a working solution which gets a gold medal at the IMO and is published in Nature, but it doesn't count since it relies on proprietary components, so a bunch of teams rush over the next year to publicly reproduce the results and get the $10M. (Or maybe this is fine too. It is the progress that matters.)

Jason Rute (Nov 27 2023 at 16:38):

As for legitimacy, let's ask @Kevin Buzzard.

David Renshaw (Nov 27 2023 at 16:39):

but it doesn't count since it relies on proprietary components, so a bunch of teams rush over the next year to publicly reproduce the results and get the $10M.

that sounds like a good outcome to me

Jason Rute (Nov 27 2023 at 16:40):

Also, this is more money than the Millennium prizes combined. Are there any other comparably sized prizes in AI? I do worry that this might be too much money. Hopefully it doesn't end up like the Netflix prize, where the solutions were just a hodgepodge of models and techniques with no overall pattern or generalizability (or so I remember hearing; it was before I was in AI).

Miroslav Olšák (Nov 27 2023 at 16:42):

To me, it is not clear what they are going to do about lengthy calculations / proofs. Human beings are not capable of writing proofs as long as computer programs can in 4.5 hours, and according to standard IMO rules, a solution of any length should be accepted as long as it is correct.

David Renshaw (Nov 27 2023 at 16:42):

I think the Netflix prize was of the form "whoever decreases the loss the most wins". Here we have a concrete endpoint we're aiming for.

David Renshaw (Nov 27 2023 at 16:49):

AI models must consume problems in the same format as human contestants and must produce human readable solutions that can be graded by an expert panel, using standard Olympiad criteria.

A human grader would not accept an enormous proof.

Jason Rute (Nov 27 2023 at 16:58):

I think the Netflix prize was of the form "whoever decreases the loss the most wins". Here we have a concrete endpoint we're aiming for.

I'm not sure what you mean here. Maybe I don't know enough about the Netflix prize. How is this different? (Is it because the test set is hidden?) Couldn't we easily end up in the situation where there are 100 sub-models? Each model does better on a different type of IMO problem. Teams, realizing they are not quite there, start to combine into super teams where they aggregate all their models. There are a lot of hard-coded solutions for particular common types of problems. There is very complicated and ad hoc logic to decide which model to use in each situation and which solution to send to the reviewers. Of course complexity can be fine. AlphaFold 2 is quite complex. But I think high prize money could lead to a kitchen-sink approach where everything is thrown in.

Jason Rute (Nov 27 2023 at 16:58):

Alternately, another possible issue with extreme prize money is that teams not wanting to get scooped on $10M stop sharing their partial results. (Are we going to stop getting models like LLEMMA now?)

Jason Rute (Nov 27 2023 at 16:58):

Of course I might just be too jaded. The attention will be good, and most AI labs will still be looking to the big picture even if they aren't first (and $10M isn't that much money).

Miroslav Olšák (Nov 27 2023 at 17:02):

A human grader would not accept an enormous proof.

At the IMO, I am pretty confident they would (although a truly enormous proof couldn't appear). If there is an exception in this challenge (which I find pretty reasonable), I would consider it fair to define "enormous".

David Renshaw (Nov 27 2023 at 17:02):

Netflix Prize = "win computer chess tournament"
AI-MO Prize = "beat Garry Kasparov"

David Renshaw (Nov 27 2023 at 17:03):

Like, there are a lot of people saying that it's just never going to happen. Part of the point is to prove those people wrong.

Adam Topaz (Nov 27 2023 at 17:03):

Still can't connect from my university network :-/ Is there an expected timeline for this, or is it completely open-ended?

Jason Rute (Nov 27 2023 at 17:04):

David Renshaw said:

(But I still think the best way to get there is through formal-to-formal.)

I think formal-to-formal could still be an important part of the winning solution, especially if it is the best way to get there. Imagine something like draft-sketch-prove-sketch-draft.

Jason Rute (Nov 27 2023 at 17:05):

Adam Topaz said:

Still can't connect from my university network :-/ Is there an expected timeline for this, or is it completely open-ended?

From the website:

The first AI-MO approved competitions will open to participants in early 2024. There will be a presentation of progress at the 65th IMO, which will be held in Bath, England in July 2024

Jason Rute (Nov 27 2023 at 17:05):

Overall, they don't give many concrete details.

Kevin Buzzard (Nov 27 2023 at 17:12):

yeah this is definitely a thing, XTX Markets are behind it. My understanding from talking to them is that they (i.e. XTX) get to decide what "partial progress" means.

As for this vs the Netflix prize: here there is a very clear goal. XTX are working in collaboration with the IMO organisers so will probably manage to persuade them to mark a couple of computer entries, if there are any this year.

Re sharing: I was talking to a bunch of DeepMind and ex-DeepMind people over the weekend and they say that this (suppression of results) is already happening a lot anyway.

Jason Rute (Nov 27 2023 at 17:35):

I do especially like the partial progress part. My big complaint about the original IMO Grand Challenge is that there is no mechanism for measuring, rewarding, or acknowledging partial progress.

Jason Rute (Nov 27 2023 at 17:42):

Ok, I now understand @David Renshaw's point. Since one doesn't have to beat other models, just humans, there isn't as much need for extreme optimization as seen in some Kaggle competitions and the Netflix challenge. So it might (hopefully) be that a model of reasonable complexity does fine on all types of problems, or that maybe there is a model for only a few select types of problems, but not the kitchen sink aggregate approach I am worried is needed to squeeze out that remaining 1% of improvement.

Kevin Buzzard (Nov 27 2023 at 17:43):

My big complaint about the original IMO grand challenge is that if you look at the 2023 IMO day 1, two of the questions aren't "prove this", they are "determine this set", and I have never figured out how to formalise this. This is why I think the natural language approach is a better idea -- it makes it much easier to define the question.

Miroslav Olšák (Nov 27 2023 at 18:55):

My big complaint about the original IMO grand challenge is that if you look at the 2023 IMO day 1, two of the questions aren't "prove this", they are "determine this set", and I have never figured out how to formalise this. This is why I think the natural language approach is a better idea -- it makes it much easier to define the question.

One option already mentioned here (I think) was to have a human-checked solution to the "determine" part but a computer-verified proof, which goes a bit towards the AI-MO Prize, just without people having to check the proofs (which is the more annoying part).

Another (more formal) option is to restrict, per problem, which constructing operations are allowed. For example, in the case of IMO 2023 Problem 1, it would be a set defined by a free variable x, and besides this variable, the allowed functions would be boolean connectives, equality, integer less-than, arithmetic operations, prime decomposition, and basic functions for manipulating the prime decomposition (length of a list, indexing into a list, ...).

Kevin Buzzard (Nov 27 2023 at 19:06):

yeah, like I said, we could either contort ourselves into a convoluted position or just ask for a natural language solution.

Miroslav Olšák (Nov 27 2023 at 19:18):

my problem with human-checking is that I wouldn't be very motivated to check 20 pages of geometry calculations that are likely correct (in the case where they came from a computer algebra system, and not ChatGPT :laughing: )

Joseph Myers (Nov 28 2023 at 03:35):

After (briefly) discussing this prize with XTX a few weeks ago I noted a few considerations that didn't come up in that discussion but would be relevant for establishing detailed rules, one of which was the matter of establishing a length limit on solutions to avoid coordinators needing to read 1000-page claimed proofs (which have a legitimate place in mathematics, but not in solving IMO problems, since that's far longer than an IMO contestant could reasonably write in 4.5 hours).

Joseph Myers (Nov 28 2023 at 03:39):

If a human IMO contestant produces a solution with 20 pages of algebra (which is within what might be written in reality by such a contestant, occasionally, albeit 20 handwritten pages for the human contestant), the coordinators need to read it, even if boring. I think motivation in the AI-MO Prize case might be provided by paying the coordinators to check the AI solutions. In both cases, coordinators are free to use computer algebra systems themselves to help them in checking.

Andy Jiang (Nov 28 2023 at 05:34):

How exactly is this a problem? If they came from a computer algebra system, can't we just verify them with a computer algebra system?

By the way, can someone correct my understanding on this? I actually thought most plane geometry problems can be efficiently solved by computer algebra. But I gather this isn't true?

Siddhartha Gadgil (Nov 28 2023 at 06:03):

As I understand it, if you trust the computer algebra systems then you can do so, but with proved algorithms it is not so easy. My experiments on this are at https://siddhartha-gadgil.github.io/automating-mathematics/posts/solving-pappus/ (following on https://siddhartha-gadgil.github.io/automating-mathematics/posts/proving-by-solving/).

If someone knows about checked algorithms that work I would be grateful for the information.

Joseph Myers (Nov 29 2023 at 03:02):

If the problem reduces to showing some polynomial equations in the coordinates imply another polynomial equation in the coordinates, then it can be solved efficiently (in practice) by Gröbner bases. Otherwise (if there are inequalities involved), you need quantifier elimination, which is much less efficient.

A large proportion of IMO geometry problems do in fact involve configuration or nondegeneracy information that prevents them reducing to Gröbner bases. For example, IMO 2023 problem 2 asks to prove that two lines meet on an internal angle bisector (the polynomials in question won't distinguish internal from external bisectors, and indeed a similar issue applies to the definition of the point S). IMO 2023 problem 6 asks to prove that circles have two common points, which polynomials won't distinguish from having a single common point.

Sometimes the relevant information in the conclusion, beyond what can be deduced with Gröbner bases, may be geometrically obvious and not need remarking on in a human solution, but that's not always the case. In IMO 2023 problem 6 you could lose points for not justifying various nondegeneracy conditions needed in a solution. In IMO 2002 problem 2 you could lose a point for failing to explain why an angle condition meant you had the incentre rather than an excentre.

Even if you don't have such configuration information in the conclusion, you might need to use such information from the hypotheses. Maybe the polynomials generate an ideal that contains the polynomial PQ, but actually you have to prove something equivalent to P = 0, and excluding the possibility of Q = 0 with P ≠ 0 requires using nondegeneracy conditions from the problem statement, for example. Again, it depends on the problem whether this is something that needs explicit justification in a solution, but if you're doing an algebraic solution you probably need at least to interpret Q = 0 geometrically before saying it can't occur.
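To make the ideal-membership step concrete, here is a minimal sketch using SymPy's groebner (the concurrency-of-altitudes example and all variable names are illustrative, not taken from the discussion): the hypotheses and the conclusion are encoded as polynomials in the coordinates, and the conclusion follows if it lies in the ideal generated by the hypotheses.

from sympy import symbols, groebner

u1, u2, u3, x, y = symbols('u1 u2 u3 x y')

# Triangle A=(0,0), B=(u1,0), C=(u2,u3); H=(x,y) is where two altitudes meet.
h1 = x - u2                    # H lies on the altitude from C (the vertical line x = u2)
h2 = x*(u2 - u1) + y*u3        # H lies on the altitude from A (AH . BC = 0)

# Conclusion: H also lies on the altitude from B (BH . AC = 0).
g = (x - u1)*u2 + y*u3

G = groebner([h1, h2], u1, u2, u3, x, y, order='grevlex')
print(G.contains(g))           # True: the conclusion lies in the ideal

Here no nondegeneracy condition is even needed (in fact g = h2 + u1*h1); the point of the message above is that for many IMO geometry problems the interesting work is precisely in the cases where such a clean reduction is not available.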

Siddhartha Gadgil (Nov 29 2023 at 05:48):

Joseph Myers said:

If the problem reduces to showing some polynomial equations in the coordinates imply another polynomial equation in the coordinates, then it can be solved efficiently (in practice) by Gröbner bases. Otherwise (if there are inequalities involved), you need quantifier elimination, which is much less efficient.

Are Gröbner bases efficient in practice for Euclidean geometry problems? My experience was that they are not (and I believe they are doubly exponential).

Joseph Myers (Nov 30 2023 at 01:00):

Both Gröbner bases and quantifier elimination are doubly exponential in worst cases, but I think Gröbner bases are much more likely to work with reasonable resources in practice. A previous discussion https://leanprover.zulipchat.com/#narrow/stream/208328-IMO-grand-challenge/topic/geometry.20problems says something about what problems could be proved with Gröbner bases plus handling nondegeneracy conditions manually.

Miroslav Olšák (Dec 01 2023 at 18:09):

Recently, I have noticed that the AIMO Prize website spins up my CPUs. What is it doing? Are they using my hardware to solve IMO problems :-D? Am I the only one, or is it happening to someone else too? I am on Firefox 118.0.2 (Linux).

David Renshaw (Dec 01 2023 at 18:12):

Yeah me too, with Firefox and Chrome on Linux.

David Renshaw (Dec 01 2023 at 18:13):

For me on Firefox it looks like it spins up like 6 or 7 threads.

Eric Wieser (Dec 01 2023 at 18:24):

It's this SVG file: https://aimoprize.com/pattern-tile.svg

Kevin Buzzard (Dec 01 2023 at 20:19):

I've reported the issue to XTX. Thanks.

Jason Rute (Jan 17 2024 at 17:25):

Cross-posting this: #Machine Learning for Theorem Proving > AlphaGeometry Solves 25 of 30 recent olympiad geometry problems.

Nilesh (Feb 20 2024 at 07:52):

I'm looking to connect with people from India who want to put together a team to participate in the AIMO prize 2024 as well as the IMO-Grand-Challenge. I met @Siddhartha Gadgil this Saturday at IISc, Bangalore. :pray:

We have a whatsapp group. Please DM me if anyone discovers this thread and wants to join. :-)

David Afonso Valente (Feb 22 2024 at 11:08):

I'm interested in collaborating with any open groups/teams aimed at tackling the AIMO prize 2024. If anyone is open, please DM me, thank you :)

David Renshaw (Feb 23 2024 at 18:15):

preliminary AIMO "progress prize" rules were announced just now:
https://www.reddit.com/r/AIMOprize/comments/1ay4aiu/progress_prize_pre_announcement/

David Renshaw (Feb 23 2024 at 18:16):

We will have 100 completely new problems created. Problems will be in style of AIME: https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions.

In particular, the answer to each problem is a number from 0 to 999.

Complexity of problems will be a bit easier than AIME and targeted between AIME and AMC'12 level: https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions

Processing multi modal inputs (specifically diagrams) is not required, all problems are text only (although geometry problems are allowed).

Total prize fund is $1,048,576.

Terence Tao (Apr 02 2024 at 15:36):

The first progress prize competition is now live and accepting submissions: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/overview

David Renshaw (Apr 02 2024 at 17:18):

  • submission deadline: 27 June 2024
  • scored on 50 new problems, difficulty between AMC and AIME, stated in English/LaTeX, answers between 0 and 999
  • submissions get a total of 9 hours of compute time on a machine with 4 CPUs, 29GB of memory, and either a P100 GPU or 2xT4 GPUs (Kaggle also has a TPU option, but I think it's not allowed for submissions in this contest?)
  • max team size of 5
  • "Freely & publicly available external data is allowed, including pre-trained models"
  • no live internet access allowed during evaluation

Floris van Doorn (Apr 02 2024 at 17:52):

Also the way they deal with the problems is interesting. 110 problems were created for this contest.

  • 10 problems are "training problems", which you can read if you enter the competition
  • 50 problems are "public problems", which are not publicly available, but which are used over the next three months to assess the quality of submissions on the leaderboard: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/leaderboard
  • 50 problems are "private problems" and are actually used to determine the competition ranking.

Mario Carneiro (Apr 02 2024 at 18:17):

too bad zero of them are actually public for non-contestants

David Renshaw (Apr 02 2024 at 18:22):

To sign up, I needed to agree to the terms:
"You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data."

Mirek Olšák (Apr 02 2024 at 18:28):

I haven't managed to get in -- "unable to phone verify". On the other hand, so far I am looking at it more out of curiosity; I'm not sure if I would compete in this.

Kevin Buzzard (Apr 02 2024 at 19:38):

Mario Carneiro said:

too bad zero of them are actually public for non-contestants

It is highly nontrivial to make 110 original questions of this nature.

Mirek Olšák (Apr 04 2024 at 11:22):

Actually, when I searched enough, I was able to find some examples of the training problems publicly available on the Kaggle website...

Floris van Doorn (Apr 04 2024 at 11:44):

Yeah, the 10 training problems + (unofficial) solutions are now visible here: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/490640
There was apparently a mistake/typo in one of the training problems.

Adam Topaz (Apr 04 2024 at 11:59):

Maybe I'm confused about the rules... the models tab lists things like Mixtral, which I can't imagine would run on the VMs provided for this competition, as well as Gemini Pro, which is certainly not open-source and is only available behind an API. What do the models listed here actually refer to?

David Renshaw (Apr 04 2024 at 12:00):

Mixtral with 4-bit quantization just barely fits into RAM.

David Renshaw (Apr 04 2024 at 12:00):

not sure why Gemini is listed there

Adam Topaz (Apr 04 2024 at 12:00):

oh wow. is that with the 2xT4 or the P100? P100 has something like ~18G ram right?

David Renshaw (Apr 04 2024 at 12:01):

I mean CPU RAM

Adam Topaz (Apr 04 2024 at 12:02):

aha I see. But in this case I think the 9 hour limit would be an issue if you're not using the GPU

David Renshaw (Apr 04 2024 at 12:02):

but also looks like each GPU in the T4x2 configuration has 15GB of memory, so I think it fits there too?

Adam Topaz (Apr 04 2024 at 12:07):

Trying to run models on more than one GPU is intimidating for me... does the transformers library do the hard work for you? Edit: it seems that a simple device_map = 'auto' might just do the trick
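For reference, a minimal sketch of that approach with the Hugging Face transformers library (assuming accelerate and bitsandbytes are installed; the model id and prompt are just placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # shard the layers across all visible GPUs (e.g. 2xT4)
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit weights
    torch_dtype=torch.float16,
)

prompt = "What is the remainder when 2^10 is divided by 7?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))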

David Renshaw (Apr 04 2024 at 12:12):

The starter notebook I'm looking at uses llama_cpp (through a python wrapper), which seems to be pretty good at spreading things out:

llama_new_context_with_model: total VRAM used: 25329.59 MiB (model: 25145.55 MiB, context: 184.04 MiB)
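For comparison, the llama-cpp-python route looks roughly like this (the GGUF path, context size, and 50/50 split are placeholder values, not taken from the notebook):

from llama_cpp import Llama

llm = Llama(
    model_path="/kaggle/input/some-model/model-q4_k_m.gguf",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=-1,          # offload every layer to the GPU(s)
    tensor_split=[0.5, 0.5],  # split the weights roughly evenly across 2xT4
)

out = llm("What is the remainder when 2^10 is divided by 7?", max_tokens=256)
print(out["choices"][0]["text"])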

Adam Topaz (Apr 04 2024 at 12:14):

Is this starter notebook available publicly? Do you mind sharing?

David Renshaw (Apr 04 2024 at 12:15):

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/490734

Floris van Doorn (Apr 08 2024 at 17:16):

The submissions so far are able to solve 17 of the 50 public problems, which is quite an impressive score!

Adam Topaz (Apr 08 2024 at 17:26):

FWIW, there's a public notebook that gets 13/50 (and supposedly just by setting a higher temp it's possible to get a few more)

Adam Topaz (Apr 08 2024 at 17:27):

All this public notebook does is essentially self-consistency chain of thought together with straightforward python code execution, along with a particular choice of open-source model.
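Schematically, that strategy amounts to something like the following (a hedged sketch, not the notebook's actual code; generate and extract_answer are placeholders for sampling a chain of thought and for parsing or executing it):

from collections import Counter

def self_consistency_answer(generate, extract_answer, problem, n_samples=24):
    # Sample several chains of thought at nonzero temperature and majority-vote
    # the extracted integer answers (AIMO answers are integers in 0..999).
    votes = []
    for _ in range(n_samples):
        completion = generate(problem)
        answer = extract_answer(completion)
        if answer is not None:
            votes.append(answer % 1000)
    return Counter(votes).most_common(1)[0][0] if votes else 0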

Oliver Nash (Apr 08 2024 at 17:48):

Is there an easy link to this notebook for those of us too lazy to go hunting?

Adam Topaz (Apr 08 2024 at 17:50):

I think this is it: https://www.kaggle.com/code/olyatsimboy/aimo-zero-shot-sc-mmos-deepseekmath

Kevin Buzzard (Apr 08 2024 at 17:51):

Am I right in thinking that one can cheat in the following way: look at the training data, solve the problems manually, and then write some kind of obfuscated code which just prints out the correct solutions? And then you'd get 0 on the test data (so wouldn't win any $$) but you could still be top of the Kaggle leaderboard? Or have I misunderstood how things work?

Adam Topaz (Apr 08 2024 at 17:52):

I think submitted notebooks don't have internet access, and we don't know the problems that they run on.

Adam Topaz (Apr 08 2024 at 17:54):

Adam Topaz said:

I think this is it: https://www.kaggle.com/code/olyatsimboy/aimo-zero-shot-sc-mmos-deepseekmath

Note that this gets 0/10 on the test set, but if you change the n_repetitions to 5 on the test set it does get more than 0.

Floris van Doorn (Apr 08 2024 at 17:55):

The problems are not publicly available. Even the so-called public problems are unknown, the only thing known is the leaderboard of scores that programs get on the public problems.

Adam Topaz (Apr 08 2024 at 17:55):

right, and since there's no internet access you can't make a submission that just uploads the problems somewhere

Floris van Doorn (Apr 08 2024 at 17:55):

the 10 test problems that we can read are disjoint from the public or private set.

Kevin Buzzard (Apr 08 2024 at 18:03):

So the 50 "public" problems are not known to the people writing the AIs? I had misunderstood. (I knew they weren't available to the general public)

Eric Wieser (Apr 08 2024 at 18:03):

The problem texts aren't public, but the success rate on them is

Kevin Buzzard (Apr 08 2024 at 18:41):

Then what is the point of splitting the 100 secret questions into two sets of 50? Sorry to ask such basic questions.

Adam Topaz (Apr 08 2024 at 18:44):

so that you can't optimize your submissions to the actual problems that will determine the prize

Adam Topaz (Apr 08 2024 at 18:44):

submissions close in June, and then the submissions run on the SUPER SECRET 50 questions that no submission has seen before.

Kevin Buzzard (Apr 08 2024 at 18:44):

But I thought we'd just established that we can't see the problems?

Adam Topaz (Apr 08 2024 at 18:45):

yeah but you can still tune your model by making submissions each day and seeing what score you get

Adam Topaz (Apr 10 2024 at 18:52):

This was posted two days ago: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/492178

Also, there is currently a public notebook that supposedly gets 16/20 here: https://www.kaggle.com/code/quan0095/more-diversity-in-output-improve-score

But the highest score is still 17/20 :-/

Mauricio Collares (Apr 11 2024 at 07:55):

"16/50", if I understand correctly

fzyzcjy (Apr 11 2024 at 08:33):

Curious whether anyone has attempted https://arxiv.org/abs/2404.06405 as an accompanying tool to LLMs?

solves 21 out of 30 problems by just using a CPU-only laptop with a time limit of 5 minutes per problem. Essentially, this classic method solves just 4 problems less than AlphaGeometry

Adam Topaz (Apr 11 2024 at 11:53):

Mauricio Collares said:

"16/50", if I understand correctly

Yeah, sorry both 20's were a typo

Adam Topaz (Apr 19 2024 at 11:50):

This notebook has now got 20/50: https://www.kaggle.com/code/piotrzamisnii/aimo-zero-shot-sc-mmos-deepseekmath-e3248b

Note that it is essentially identical to the ones mentioned above, which got 13 and eventually 17 by adjusting the temperature and the number of repetitions. I find it a bit disappointing that getting to 20/50 was more or less a lottery.

Adam Topaz (Apr 19 2024 at 11:58):

version 2 of this notebook had the following suspicious code:

for a, b in zip(total_answers, total_results):
    a = np.array(a)
    b = np.array(b)
    a[a < 0] = b[a < 0]
    pred = Counter(a.tolist()).most_common(2)
    enumerates = df.enumerates.values[enumerate_i]
    if enumerates in [3, 4, 5]:
        ans = pred[1][0]
    elif enumerates in [30, 36]:
        ans = pred[1][0] if not pred[1][0] < 0 else pred[0][0]
    else:
        ans = pred[0][0] if not pred[0][0] < 0 else pred[1][0]
    enumerate_i+= 1
    final_answers.append(ans)
    print(ans)

Kevin Buzzard (Apr 19 2024 at 12:55):

What does that code do?

Kevin Buzzard (Apr 19 2024 at 12:56):

Disclosure: as of recently I am one of the people involved with the organisation of this prize

David Renshaw (Apr 19 2024 at 13:04):

total_answers and total_results hold solution values that were computed in different ways. (I think the former is from running generated Python code and the latter is from parsing what's in a generated \boxed{}.) The above code implements voting, so that the most common solution gets chosen, with some priority given between the two styles of solution. The weird thing is the hard-coding of the priority for problems 3, 4, 5, 30, and 36.

Kevin Buzzard (Apr 19 2024 at 14:05):

Thanks for the explanation!

George Tsoukalas (Apr 19 2024 at 15:51):

Huh, I would have figured submissions are run on a random ordering of the 50 problems to prevent any probing.

Floris van Doorn (Apr 19 2024 at 17:15):

The final competition is on 50 private problems that will not be used at all while the submissions are open, so it is impossible to probe the private problems.

Kevin Buzzard (Apr 19 2024 at 20:33):

https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/495133

Kevin Buzzard (Apr 21 2024 at 08:36):

Another submission is claiming 20/50 without probing: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/492178#2763188

Junyan Xu (Jul 02 2024 at 22:09):

The competition for Progress Prize 1 ended 5 days ago, and apparently the highest score is 29/50 on the private leaderboard, pending verification. (Update: the scores appear to be still on the public test set, see below.)

Jason Rute (Jul 02 2024 at 22:19):

Is that the final score? I’m trying to understand this post including the comment that the person currently in first place is still refreshing to see if the final results have come in. https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize/discussion/515516

Kevin Buzzard (Jul 02 2024 at 22:35):

My understanding is that the scores on the super-secret 50 questions are yet to be publicly released.

Junyan Xu (Jul 02 2024 at 22:39):

This is what I see, from which I understand that the submissions have been run on the private set but the results have not been verified. The verification looks like it might not be automated? Otherwise it would be strange to have a long delay. I think it makes sense that each submission is run on the private set when it's submitted, just the result is not published. If these are not results from the private test set, why call it "preliminary" "private leaderboard" rather than "public leaderboard"? But maybe I'm wrong and not all submissions have been run on the private set yet.
image.png

(And thanks to @Jason Rute for posting a link to the thread, I was not previously aware of it.)

Kevin Buzzard (Jul 02 2024 at 22:40):

My understanding is that they were hoping that everything would be done by now but it's not done yet.

Kevin Buzzard (Jul 02 2024 at 22:57):

I've heard that you shouldn't be hitting refresh every 5 minutes and that the results from the private leaderboard might not appear for another day or two.

Floris van Doorn (Jul 03 2024 at 10:19):

Junyan Xu said:

The competition for Progress Prize 1 ended 5 days ago, and apparently the highest score is 29/50 on the private leaderboard, pending verification.

You call this "the private leaderboard", but isn't this just the scores on the "public" set of problems?

Jason Rute (Jul 03 2024 at 11:55):

It does seem that the scores are just the previous ones from the public test set. The confusion seems to be that Kaggle itself calls this the “private leaderboard”. And the note at the top of the leaderboard is confusing. It implies that the private leaderboard has been run, but just needs verification. I think that is just some generic kaggle text that doesn’t match the status of this competition. (Also, there have been a few superficial changes from the public scores seen at the end of the competition. Namely, the second-place Meta-math team was disqualified, and among the many teams with a score of 26/50, their order has changed. This makes it look superficially like the leaderboard is different from the old public one, which isn’t available anymore.)

David Renshaw (Jul 03 2024 at 12:00):

For what it's worth, here's a csv snapshot of the public leaderboard that I downloaded at the end of the submission period last week:
ai-mathematical-olympiad-prize-publicleaderboard-2024-06-27T104202.csv
At a quick glance, it looks to me like this lines up with what's currently being displayed.

Jason Rute (Jul 03 2024 at 12:01):

The reordering might be due to the fact that I think these “new” scores are only calculated on the versions of the notebook that were submitted for final scoring. In particular, I believe ties are broken by the notebook that was submitted first.

Kevin Buzzard (Jul 03 2024 at 20:12):

OK the scores now represent the scores on the super-secret 50. Note that the top team, Numina, scored 29/50 on both the "public" and the "private" questions.

Jason Rute (Jul 03 2024 at 20:12):

It seems that the private leaderboard is finally updated. Most of the private scores are much lower than the corresponding public scores, except for the winning team, which also won the public leaderboard and still got 29/50 correct. I look forward to learning about their method. I am suspicious of the 2nd, 4th and 5th results. Was that pure luck? They didn't even score in the top 100 on the public board as far as I can see, right? (There were only 50 problems, so randomness was very possibly a factor.) Also, it seems that a number (edit: one) of the top public leaderboard submissions crashed when run on the private leaderboard. I feel sorry for them. It is also sad that the second-place public leaderboard team was disqualified. I would have liked to see their method.

Jason Rute (Jul 03 2024 at 20:29):

Here is public to private for the top 5 public and top five private teams:

Numina        1st public (29/50) -> 1st private (29/50)
MetaMath-AIMO 2nd public (28/50) -> Disqualified?
after exams   3rd public (27/50) -> 3rd private (22/50)
[Deleted]     4th public (27/50) -> Disqualified?
Zoltan        5th public (28/50) -> 7th private (20/50)
...
Zhiqing SUN 176th public (22/50) -> 2nd private (22/50)
codeinter    90th public (23/50) -> 4th private (21/50)
Conor #2    771st public (19/50) -> 5th private (20/50)

Adam Topaz (Jul 03 2024 at 20:31):

The notebooks of the winner(s) will be made public, right?

Junyan Xu (Jul 05 2024 at 03:45):

According to this tweet the champion team consists of the following members:

Edward Beeching
Albert Jiang (on Zulip)
Jia Li
Roman Soletskyi (on Zulip, never posted)
Lewis Tunstall

and founding members:

Hélène Evain
Yann Fleureau
Guillaume Lample (on Zulip)
Stanislas Polu (on Zulip)

and sponsors:

Hugging Face
Mistral AI
Answer AI
General Catalyst

Congratulations to all!

See also:
https://x.com/_lewtun/status/1808917534646718793
https://x.com/AlbertQJiang/status/1808903324625744162

Junyan Xu (Jul 05 2024 at 03:54):

A preview of the methodology:

Lewis Tunstall:

fine-tuning open LLMs ... Our Numina Math 7B model ... Stay tuned - we will release the model, dataset, and methodology very soon!

Edward Beeching:

We will be sharing the details of our method over the coming weeks. This will include open source models, training code and evaluation pipelines.
Tree of Thoughts algorithm that interleaved generation with code execution and correction. ... an optimized and elegant solution to scale up to majority voting with 48 candidate solutions per problem.

GRqpX4mbMAYSdC5.jpeg

Jason Rute (Jul 05 2024 at 22:26):

Interesting! I was telling someone recently that I doubted this Kaggle competition would attract serious research efforts because of all the onerous requirements (strict team limit of 5, can’t share code/data with anyone outside your team, limits on the models you can use, and strict open source requirements). I thought it would only be a good measure of current methods. So it is great to see that I am wrong and a serious research effort (with hopefully exciting new ideas) made it to the top place! Congratulations! I look forward to reading the solution!

Kevin Buzzard (Jul 12 2024 at 16:04):

https://huggingface.co/blog/winning-aimo-progress-prize

Johan Commelin (Jul 12 2024 at 19:08):

If the computer is a bicycle for the mind, artificial intelligence is its engine

I'm really old-fashioned... my bicycle doesn't have an engine :bicycle: :rofl:

Adam Kurkiewicz (Jul 14 2024 at 18:11):

One disadvantage of the scoring chosen for this competition is that it encourages a model to output a solution even if it's wrong. In other words, the scoring system does not promote checking the correctness of the solution. For a human problem-solver it's usually easy to say: "this I can do, but this is too difficult for me". If negative scoring were introduced for wrong answers (but no negative score given to an "I don't know" answer), we could see competing teams adjust to this way of scoring, e.g. by including an additional machine learning model to check the correctness of the reasoning (or maybe even translating the solution to Lean!). If the model is not convinced it produced a correct solution, it could just output the "I don't know" answer (e.g. negative one, or something like this) to avoid losing points.
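As a rough illustration with a hypothetical rule of +1 for a correct answer, -1 for a wrong one, and 0 for "I don't know": a model answering with subjective confidence p expects a score of p - (1 - p) = 2p - 1, so abstaining is the better policy exactly when p < 1/2, which is what would push teams toward building the kind of correctness checker described above.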

Adam Kurkiewicz (Jul 14 2024 at 18:12):

But of course, huge success by the winning team! 29/50 problems is really big!!! A few months ago my best guess would have been that the top model would barely get a few answers right, if any at all!

Jason Rute (Jul 14 2024 at 18:37):

I think it is a bit difficult to know how hard the test problems were. They said (somewhere on Kaggle) that they were more diverse than the 10 public ones (which were on the harder end). But still, 29/50 does seem good given the baselines and other scores. Also, it is fun to directly play with their model here: https://huggingface.co/spaces/AI-MO/math-olympiad-solver It's a nice way to probe how good/bad it is without having to interpret the scores.

Qian Hong (Jul 17 2024 at 09:29):

Adam Kurkiewicz said:

One disadvantage of the scoring chosen for this competition is that it encourages a model to output a solution even if it's wrong. In other words, the scoring system does not promote checking the correctness of the solution. For a human problem-solver it's usually easy to say: "this I can do, but this is too difficult for me". If negative scoring were introduced for wrong answers (but no negative score given to an "I don't know" answer), we could see competing teams adjust to this way of scoring, e.g. by including an additional machine learning model to check the correctness of the reasoning (or maybe even translating the solution to Lean!). If the model is not convinced it produced a correct solution, it could just output the "I don't know" answer (e.g. negative one, or something like this) to avoid losing points.

Great point!

This reminds me of the book "How to Measure Anything." In chapter 5, "Calibrated Estimates: How Much Do You Know Now?", it introduces a subjective confidence level self-test. The author points out that most people are systematically overconfident, some are systematically underconfident, and very few people are naturally calibrated estimators. However, most people can be calibrated after mere hours of training. (I did the test myself and got a surprisingly poor result, then improved my estimation following the author's advice on calibration training.)

When we encourage a model to output a solution even if it's wrong, we are indeed encouraging overconfident behavior in the model.

If we can explicitly or implicitly extract the confidence level of the AI model, then we will be able to draw a 2D scatter chart with one dimension for the subjective confidence level of the model and the other dimension for the actual average score, grouped by subjective confidence level. If the model is perfectly calibrated, then we should see the 2D scatter chart resemble a perfect diagonal line. This means when the model believes it has 100% confidence, it should answer 100% of questions correctly, and when it believes it has 50% confidence, it should answer about 50% of questions correctly.

Whether the confidence level of a model can be calibrated/fine-tuned, similar to how humans can be calibrated as shown in prior psychology research, would be an interesting research topic in itself. If a model's confidence level could be accurately calibrated, then in theory, we could use that information as a hint for expert routing, using the confidence level to identify the right expert for the right problem domain.

confidence-level.png
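A minimal sketch of the calibration plot described above (assuming per-question confidences in [0, 1] and 0/1 correctness flags are available; all names are illustrative):

import numpy as np

def calibration_points(confidences, correct, n_bins=10):
    # Bin predictions by stated confidence and compare with empirical accuracy.
    # A perfectly calibrated model's points lie on the diagonal y = x.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in 0..n_bins-1 (1.0 falls in the last bin).
    bin_ids = np.minimum(np.digitize(confidences, edges) - 1, n_bins - 1)
    points = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            points.append((confidences[mask].mean(), correct[mask].mean()))
    return points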

Andy Jiang (Jul 18 2024 at 02:59):

Now that this year's IMO is taking place/finished--is there a list anywhere of how many AI models entered into the challenge this year? I presume that grading will be later but would be good to see if/how many people threw their hat in the ring

Jason Rute (Jul 18 2024 at 03:20):

I'm not aware of anything public. It is possible Google/DeepMind or OpenAI has a secret thing going. It is also interesting that Mathstral (by Mistral) was released right before the IMO.

Jason Rute (Jul 18 2024 at 04:13):

But to be clear, I’m not at IMO and not privy to any inside information. Maybe those there know something I don’t.

Albert Jiang (Jul 20 2024 at 13:26):

(not at the IMO) From what I hear, Deepmind is presenting AlphaGeometry at the moment. There'll be a presentation by Project Numina on the 1st progress-prize winning solution. It's not livestreamed afaik but hopefully recorded.

Sid (Jul 21 2024 at 00:58):

Andy Jiang said:

Now that this year's IMO is taking place/finished--is there a list anywhere of how many AI models entered into the challenge this year? I presume that grading will be later but would be good to see if/how many people threw their hat in the ring

The questions are already out so the various teams can run their models on the questions on their own (and I bet many are and not just Google/OAI...).

Andy Jiang (Jul 21 2024 at 15:47):

Sid said:

Andy Jiang said:

Now that this year's IMO is taking place/finished--is there a list anywhere of how many AI models entered into the challenge this year? I presume that grading will be later but would be good to see if/how many people threw their hat in the ring

The questions are already out so the various teams can run their models on the questions on their own (and I bet many are and not just Google/OAI...).

I agree with this -- just wondering if there was anything official this year, or whether only the progress prize was available. Is it correct to say that the main prize is not open yet this year? (Otherwise I would guess some people would submit models for the judges to determine the score.)

Kevin Buzzard (Jul 21 2024 at 17:45):

There was talk of some of the tech companies having a go this year but I'm pretty sure there was nothing submitted to AI-MO yet (indeed, there is nothing to submit to right now).

Andy Jiang (Jul 21 2024 at 18:06):

Out of curiosity -- what's the process for someone to submit to the actual AIMO? There doesn't seem to be an official method on the website. Is it just that if you're a big company you can reach out, and otherwise you can't attempt it?

Kevin Buzzard (Jul 21 2024 at 18:43):

The AIMO announces competitions. For example, they announced the progress prize which has just been won. The fund wants to encourage the development of open-source models which can win gold at the IMO. This seems a bit optimistic right now, especially because big companies won't open-source their models. There will no doubt be future progress prizes announced by the AIMO. [Disclaimer: I'm on the advisory committee]

Andy Jiang (Jul 21 2024 at 19:55):

Given that some good open-source models have been released recently or are coming soon (DeepSeek, Llama 3), and given the recent success of the winning AIMO progress prize team (whose Numina-72B model is claiming 12/30 on the 2024 AIME), I don't think it's impossible that by next year even the open-source models (especially those based in China with large funding, like DeepSeek) could make a serious push for IMO gold.

Kevin Buzzard (Jul 21 2024 at 21:15):

I can certainly relay this to the people in charge. They are certainly going to be running other progress prizes but there are no details yet decided about what future prizes will look like.

Albert Jiang (Jul 22 2024 at 06:48):

One big complaint I have about the setup of the progress prize is the severe limitation on computational resources (the GPU limit and time limit). This prevents people from using the larger, more capable models and essentially encourages hacking and overfitting the validation set. Clear evidence of this is that there were lots of submissions scoring > 25 on the public validation set that dropped to < 20 on the private test set. It would be great if the people in charge of setting up the future progress prizes can take this into account.

Siddhartha Gadgil (Jul 22 2024 at 09:03):

Albert Jiang said:

One big complaint I have about the setup of the progress prize is the severe limitation on computational resources (the GPU limit and time limit). This prevents people from using the larger, more capable models and essentially encourages hacking and overfitting the validation set. Clear evidence of this is that there were lots of submissions scoring > 25 on the public validation set that dropped to < 20 on the private test set. It would be great if the people in charge of setting up the future progress prizes can take this into account.

This is a very important point. The difference between being able to solve in an hour versus a minute is far less fundamental than between being able to solve in an hour (or a day) versus not at all. If I understand correctly, competitive programming is all about fast solutions to problems where a solution in principle is not hard, which is not the case at all here.

Andy Jiang (Jul 25 2024 at 15:39):

https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/

Eric Wieser (Jul 25 2024 at 15:42):

I've made a thread about that here


Last updated: May 02 2025 at 03:31 UTC