Zulip Chat Archive
Stream: IMO-grand-challenge
Topic: 2nd AIMO Progress Prize
Jason Rute (Sep 27 2024 at 10:47):
It looks like AIMO just announced the second progress prize.
This second progress prize will increase the level of challenge for participants to National Olympiad level and will include 100 new maths problems to solve. The new challenges will be ‘AI hard’, meaning they will require mathematical reasoning, rather than brute force calculations or guesswork.
No information on when it starts or any other changes to the format.
Jason Rute (Oct 17 2024 at 19:32):
It looks like the 2nd prize is up: https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/overview
RexWang (Oct 18 2024 at 08:06):
They published the ten reference questions:
Question 1: Three airline companies operate flights from Dodola island. Each company has a different schedule of departures. The first company departs every 100 days, the second every 120 days and the third every 150 days. What is the greatest positive integer for which it is true that there will be consecutive days without a flight from Dodola island, regardless of the departure times of the various airlines?
Question 2: Fred and George take part in a tennis tournament with 4046 other players. In each round, the players are paired into 2024 matches. How many ways are there to arrange the first round such that Fred and George do not have to play each other? (Two arrangements for the first round are different if there is a player with a different opponent in the two arrangements.)
Question 3: Triangle has side length and circumradius . Let be the foot of the perpendicular from to the line . What is the greatest possible length of segment ?
Question 4: Find the three-digit number such that writing any other three-digit number times in a row and times in a row results in two numbers divisible by .
Question 5: We call a sequence of non-negative integers delightful if there exists a positive integer such that for all , , and for all , counts the number of multiples of in . How many delightful sequences of non-negative integers are there?
Question 6: Let be a triangle with , , and . Point lies on segment such that bisects . Let be the circumcircle of triangle . Let be a point on different from such that . Line meets at . The length of the segment can be written as , where and are coprime positive integers. Find .
Question 7: For a positive integer , let denote the sum of the digits of in base 10. Compute with .
Question 8: For positive integers define to be the sum of their pairwise greatest common divisors. We say that an integer is artificial if there exist different positive integers such that
Find the sum of all artificial integers in the range .
Question 9: The Fibonacci numbers are defined as follows: , , and for . There are positive integers strictly less than such that is a multiple of 5 but is not. How many prime factors does have, counted with multiplicity?
Question 10: Alice writes all positive integers from to on the board for some positive integer . Bob then erases ten of them. The mean of the remaining numbers is . The sum of the numbers Bob erased is . What is the remainder when is divided by 997?
Answers: 79, 250, 180, 143, 3, 751, 891, 810, 201, 902
RexWang (Oct 18 2024 at 08:08):
Like the first AIMO, the problems are similar in style to the AIME (not F2F; answers are integers in the range 0–999, i.e. taken mod 1000).
As stated, the difficulty of the problems is expected to be roughly at the level of a national Olympiad.
RexWang (Oct 18 2024 at 08:47):
I'm curious about Numina's performance on these ten problems, but the HuggingFace demo has been showing a 404 error for some time.
https://huggingface.co/spaces/AI-MO/math-olympiad-solver
Andy Jiang (Oct 18 2024 at 12:06):
Stupid question: what does it mean that the questions are at the level of a national Olympiad? Does it mean someone who can compete in, say, USAMO should expect to get a similar percentage score on these problems as they get in USAMO? Because USAMO (and, I assume, other Olympiads) are proof-based, right?
Alex Meiburg (Oct 18 2024 at 15:41):
If their goal is to make these "AI hard", then questions 1 and 4 seem poorly designed. Simply iterating over all 100*120 possible offsets (relative to the 150 day plane) works for question 1, and for question 4 it's perfectly feasible to just try all three-digit numbers. (A computer can take the GCD of some 2024-digit numbers in under a millisecond.)
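A minimal sketch of that brute force (only the relative offsets matter, so we can fix the 150-day airline at offset 0; the combined schedule repeats every lcm(100, 120, 150) = 600 days):

```python
from math import lcm

PERIOD = lcm(100, 120, 150)  # 600 days; every combined schedule repeats with this period

def longest_gap(flight_days, period):
    """Longest run of consecutive days with no flight, treating the schedule as cyclic."""
    days = sorted(set(flight_days))
    gaps = [days[i + 1] - days[i] - 1 for i in range(len(days) - 1)]
    gaps.append(period - days[-1] + days[0] - 1)  # wrap-around gap
    return max(gaps)

best = PERIOD
for a in range(100):        # offset of the 100-day airline
    for b in range(120):    # offset of the 120-day airline
        flights = (
            [a + 100 * k for k in range(PERIOD // 100)]
            + [b + 120 * k for k in range(PERIOD // 120)]
            + [150 * k for k in range(PERIOD // 150)]
        )
        best = min(best, longest_gap(flights, PERIOD))

# `best` is the longest flight-free stretch guaranteed to occur no matter how the
# airlines pick their departure days; the published answer is 79.
print(best)
```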
Alex Meiburg (Oct 18 2024 at 15:42):
The rest look solidly "calculator resistant" though, as in, you need some kind of mathematical insight before writing your for loop.
Terence Tao (Oct 18 2024 at 16:39):
I think the numbers in Question 4 have about 10^2024 digits, not 2024 digits. I agree though that the 100 * 120 * lcm(100,120,150) ~ 7 x 10^6 calculations needed to brute force Question 1 are feasible.
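(For the record: lcm(100, 120, 150) = 600, so 100 * 120 * 600 = 7.2 x 10^6.)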
Alex Meiburg (Oct 18 2024 at 18:07):
Oops, yes, I misread that question. :) Somehow changed it to "... up to 10^2024" or something similar in my head, haha
RexWang (Oct 19 2024 at 11:53):
I think the real challenge in the IMO is producing a solid proof, not just finding the right answer. Even though the problems are labeled as national-Olympiad difficulty, answer-only scoring doesn't fully measure what LMs can do with them.
Since LMs are still not great at F2F, and NL proofs are hard to check because of hallucinations, the AIME-style format is probably the best approach for now.
For example, OpenAI's solution to IMO 2024 P1 often skips key steps or includes nonsense. The proof is almost correct but misses some tricks (considering min/max cases). Despite this it gets the right answer, while most pure language models like DeepSeek and GPT-4o don't.
Kevin Buzzard (Oct 19 2024 at 16:46):
For proof questions, "getting the right answer" may not be worth much; I will happily give 0 points to someone who answers a question "figure out x, with proof" with "[value of x asserted with no proof]", especially if x is small (or an element of the set {true, false}!)
However with AIMO the idea is that even getting the right answer is supposed to be hard. So here, getting the right answer is worth something.
Kevin Buzzard (Oct 20 2024 at 11:29):
Wow, according to https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-2/leaderboard someone has already scored 7/50.
Ash Blanc (Nov 09 2024 at 01:26):
Hi, I'm new here and was curious about the IMO Grand Challenge. Is it just a group, or a challenge/competition like AIMO or other Kaggle ones?
Kevin Buzzard (Nov 09 2024 at 01:32):
You mean this https://imo-grand-challenge.github.io/ ? It's basically an inactive idea from many years ago.
Kevin Buzzard (Nov 09 2024 at 01:33):
It was being driven by Daniel Selsam but he then lost interest because he became convinced that informal was the way forwards.
Ash Blanc (Nov 09 2024 at 01:47):
Informal was the way forwards, really? If it could get affiliated with the IMO this could indeed be a great initiative; even AIMO currently lacks theorem-proving challenges.
Jason Rute (Nov 09 2024 at 02:05):
@Ash Blanc The IMO Grand Challenge first referred to Selsam's challenge to win the IMO with an AI inside Lean. He formulated rules and a committee, and worked a bit on it himself, but it never went anywhere. It also more generally refers to the goal of using AI to get a gold medal in the International Math Olympiad.

Some, like the original challenge, want to do this with formal math. The MiniF2F benchmarks contain IMO problems, and DeepMind's AlphaProof used AI and Lean to solve 3 IMO problems from 2024 (with three days of computer time). Others want to solve the IMO with informal reasoning AI systems. Google has a private benchmark of IMO-level problems they use to test Gemini, for example, and OpenAI's o1 showed proficiency with math competition problems.

Finally, the AIMO is the closest thing to a real competition, with prize money and rules. The stated goal is to award $5 million to the first open-source AI which can reach human gold level in the IMO. (DeepMind's AlphaProof wouldn't count since it isn't open source.) More concretely, they have been running a series of progress-prize Kaggle competitions to solve difficult math problems; they are currently on the second. The problems are all in English (with LaTeX math formulas) and the AI needs to answer with a number between 0 and 9999 (no proof required). The winning solutions from the first competition used LLMs to write Python programs to find the answers.
Jason Rute (Nov 09 2024 at 02:13):
I would say Daniel Selsam’s ideas, even if they didn’t go anywhere, inspired all the other AI for IMO projects I mentioned above.
Ash Blanc (Nov 09 2024 at 02:14):
Yes indeed he pioneered indirectly
Kevin Buzzard (Nov 09 2024 at 10:46):
(re AIMO bounds, I believe it's 0-999 for the solutions, not 0-9999)
Mirek Olšák (Nov 09 2024 at 13:02):
Ash Blanc said:
Yes indeed he pioneered indirectly
Then I pioneered indirectly even before :-) https://aitp-conference.org/2018/slides/MO.pdf
Joseph Myers (Nov 09 2024 at 13:38):
Some of us suspect that both formal and informal are important to AI for mathematics: formal because it allows the AI to check its work reliably (so producing a game that AIs can learn to play following all the previous work on AI learning to play games) and informal because there's much more existing training material covering a much wider range of mathematics and reasoning at a higher level. If so, then formal-to-formal solvers and informal-to-formal autoformalizers for both statements and proofs (plus autoinformalization for formal proofs) could be useful components of a system that takes informal statements as inputs and produces informal proofs as output (together with supplementary output of formal statements and proofs to help with checking its work).
Joseph Myers (Nov 09 2024 at 13:42):
If we ever get an AI producing a thousand-page informal solution to a major open problem, we really don't want it to depend on the correctness of an informal reasoning system that, once in 500 pages, tends to hallucinate a logical step that seems plausible but is actually subtly and nonobviously wrong.
Jason Rute (Nov 09 2024 at 13:46):
Another important aspect of formal-in-the-middle approaches is that they make it possible to stop searching. Otherwise one ends up with majority voting systems which generate N independent solutions and take the most common answer. That is very inefficient. I’d be curious if anyone can make a useful informal-to-formal component based on theorem provers in the current AIMO Kaggle competition. (The time limits might be too strict.) Indeed, the winning solutions already use Python tool use for a weaker notion of informal-to-formal.
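A rough sketch of that majority-voting baseline; `solve_once` here is a hypothetical stand-in for one run of whatever LLM pipeline a team uses:

```python
from collections import Counter

def majority_vote(solve_once, problem, n_samples=32):
    """Sample N independent candidate answers and return the most common one.

    `solve_once` is a hypothetical callable: it runs the LLM pipeline once on
    `problem` and returns an integer answer, or None if the run fails.
    """
    answers = [solve_once(problem) for _ in range(n_samples)]
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None
```

All `n_samples` runs happen regardless of how early a good answer shows up; a formal check in the middle would let the search stop as soon as one candidate is verified.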
Ash Blanc (Nov 09 2024 at 16:11):
Jason Rute said:
Another important aspect of formal-in-the-middle approaches is that they make it possible to stop searching. Otherwise one ends up with majority voting systems which generate N independent solutions and take the most common answer. That is very inefficient. I’d be curious if anyone can make a useful informal-to-formal component based on theorem provers in the current AIMO Kaggle competition. (The time limits might be too strict.) Indeed, the winning solutions already use Python tool use for a weaker notion of informal-to-formal.
I'm not sure what informal-to-formal exactly means here, but I was also wondering how Lean could be used to advantage alongside a fine-tuned LLM, and how to incorporate/use Lean via a Kaggle notebook.
Jason Rute (Nov 09 2024 at 17:00):
@Ash Blanc I simply mean starting with an informal math question and then using formal tools in the middle as part of solving it. For that to happen, something like an LLM would have to translate part of the problem to a formal language. Examples:
- Most of the winning solutions to the first AIMO progress prize used tool-integrated reasoning, where an LLM writes Python programs that do calculations or searches to find a numerical answer.
- An alternative could be to use an LLM to translate the problem from English to Lean, including guessing the numerical answer, and then use a proof solver to try to fill in the proof (see the toy Lean sketch after this list). If that succeeds, you have high confidence you have the right answer.
- AlphaProof basically did the above for the IMO, although the problems were translated to Lean by humans, not an LLM. But they still used LLMs to guess the solutions and to suggest easier alternatives to solve first.
- Draft-Sketch-Prove (DSP) used an LLM both to write informal proofs and to translate those into a proof sketch in Isabelle. It then filled in the proof sketch with Isabelle's Sledgehammer. There are also many extensions.
- Don't Trust: Verify (DTV) did something similar to DSP but for question-answer problems like GSM8k. I think something like DTV would be the most relevant to the AIMO Progress Prizes. I don't know if anyone has tried it.
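As a toy illustration of the Lean-translation idea in the second bullet (the question and guessed answer here are made up for illustration): an English question like "What is the remainder when 2^10 is divided by 7?" could be rendered as a Lean statement with the guessed answer baked in, so a prover only has to certify the guess:

```lean
-- Toy "guess the answer, then formalize" example: the guessed answer (2)
-- appears in the statement, and a decision procedure certifies it.
example : 2 ^ 10 % 7 = 2 := by decide
```

If the prover closes the goal you have a verified answer; if it can't, the guess is discarded or retried.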
RexWang (Nov 29 2024 at 12:05):
The leaderboard had been stuck for weeks until today's update. The top score has increased from 10 to 18, so it is getting close to 20 now.
This improvement might be due to the new model: KirillR/QwQ-32B-Preview-AWQ, as discussed here. Similar to Numina, I guess it involves extensive training on high-quality competition data.
Jason Rute (Nov 29 2024 at 12:23):
It looks like they are doing something different this time: the rules have been amended to allow case-by-case whitelisting of models, to account for open-source models that are released after the cutoff date or that have open-source-ish licenses with restrictions kicking in at, say, a high number of users. There are no exact criteria for whitelisting; it is up to the organizers, but since there are still months to go, I think they are happy to add new models to the mix.
Jason Rute (Nov 29 2024 at 12:25):
So unlike Numina’s model, this QwQ model wasn’t trained for this competition. It is instead an open-source competitor (clone?) of OpenAI’s o1, released on Friday. (There are a few others from this past week too, DeepSeek-r1, Marco-o1, and Llava-o1, but I don’t know if they are as good or as useful for this competition.)
Jason Rute (Nov 29 2024 at 15:07):
Although, since AIME seems to be the new benchmark of choice for reasoning models (after o1’s impressive results), I think all these o1-like models have extensively trained on competition math problems and such.
Kevin Buzzard (Nov 30 2024 at 14:28):
20 has been hit, so we're in Early Sharing Prize territory.
Jason Rute (Nov 30 2024 at 18:54):
From what I can tell, the early sharing prize winner will likely be this notebook, which uses the QwQ model mentioned above (an o1-like chain-of-thought model) plus some good parameters/prompts and a bit of random luck to get up to 20. It seems the notebook left in the Numina Python interpreter stuff, but the prompts don’t mention it, so I don’t know if code execution matters. In some ways this points to the usual R&D cycle of large labs being the main driver of success in this competition so far, more than any tricks/ideas specific to this competition. But there are still 4 months to go.
RexWang (Dec 01 2024 at 04:42):
There appears to be fluctuation when submitting the same notebook: for instance, a notebook with a score of 10 can fluctuate to 13-15. Averaging the evaluation results over multiple submissions might be fairer, but it would consume more resources. Alternatively, such averaging could be applied only to the top few entries in the final submission phase.
Jason Rute (Dec 01 2024 at 11:52):
RexWang said:
Averaging the evaluation results over multiple submissions might be fairer
As someone pointed out in the first competition, that still doesn't prevent someone from "playing the lottery" to win. If someone wanted to play the lottery, they could still just set a seed (and reset it for each problem) to have a consistent output across multiple submissions. Then maybe if their seed is lucky, they would win. (One of the top 5 winners in the last competition just submitted the early sharing prize notebook unchanged and lucked out into the top 5. They were gracious enough to not claim the prize.)
friederrr (Dec 12 2024 at 16:44):
We have left Early Sharing Prize territory (re Kevin Buzzard's remark above).
We just officially announced the winner after we vetted his notebook thoroughly. Congrats to Md Boktiar Mahbub Murad if you're reading this here :)
In the process of vetting it, I have worked with him to set up his solution on HuggingFace so anyone can test and play around with the winning submission without having to write a single line of code (though in that case you do have to pay HuggingFace a few dollars to run the model for you, the price of convenience): https://huggingface.co/MBMMurad/QwQ-32B-preview-AWQ-AIMO-earlysharing
Jason Rute (Dec 13 2024 at 03:45):
I sort of wonder if all the major progress in this contest will be from major labs releasing open source reasoning models, the organizers whitelisting them, and users making small but creative adjustments to make them fit to AIMO (quantization, appropriate prompts, etc). Or will there be new ideas, newly fine-tuned models, new datasets, etc. directly from the contestants?
Ash Blanc (Dec 13 2024 at 03:48):
friederrr said:
We have left Early Sharing Prize territory (re Kevin Buzzard's remark above).
We just officially announced the winner after we vetted his notebook thoroughly. Congrats to Md Boktiar Mahbub Murad if you're reading this here :)
In the process of vetting it, I have worked with him to set up his solution on HuggingFace so anyone can test and play around with the winning submission without having to write a single line of code (though in that case you do have to pay HuggingFace a few dollars to run the model for you, the price of convenience): https://huggingface.co/MBMMurad/QwQ-32B-preview-AWQ-AIMO-earlysharing
Btw, does that mean AIMO is over?
Jason Rute (Dec 13 2024 at 03:52):
@Ash Blanc No, not even close.
Jason Rute (Jan 20 2025 at 13:16):
Another lab open sourced their reasoning model (DeepSeek-R1), and they also distilled it into smaller models. I wouldn’t be surprised if this leads to another small jump in performance on AIMO Progress Prize 2 across the board.
Alok Singh (Jan 21 2025 at 04:49):
[2] We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
Maybe we could use that to tune Lean
Jason Rute (Jan 26 2025 at 13:51):
As I predicted, there has been another increase in AIMO Progress Prize 2 scores. (Unless someone comes up with new ideas, I imagine that every time a new open-source reasoning model comes out there will be a step-function jump in scores. :smile:)
Harald Carlens (Jan 27 2025 at 11:44):
Yes, I think before R1 was whitelisted the highest public leaderboard score was 25, and now that has gone up to 27. Quite a few of the other teams seem to have had a score bump as well. Annoyingly, I didn't download the full leaderboard state at any time last week and only noted the 1st and 12th place scores, so it's hard to see the impact more systematically. Did anyone happen to download it last week?
Last updated: May 02 2025 at 03:31 UTC