Zulip Chat Archive

Stream: IMO-grand-challenge

Topic: The Aim and Timeline of the IMO Grand Challenge


Jason Rute (May 21 2021 at 23:21):

Daniel Selsam said:

Jason Rute said:

Also the IMO grand challenge is in Lean 4 which attracts folks, but at the same time that is more of a showcase of Microsoft Research than a competition or benchmark.

I can't speak for everybody involved, but I aim to develop the IMO-GC into the ImageNet of the 2020s, i.e. the premiere AI benchmark that helps shape the global AI research agenda.

Continuing this conversation in another thread. First, I apologize for misunderstanding the goal of the IMO Grand Challenge. Nonetheless, I still have a lot of questions about what it is going to look like over the coming years. I know you, Daniel, and your collaborators are working on this by developing your nondeterministic tactic framework. I wonder, however, what role other researchers will have in this challenge.

Jason Rute (May 21 2021 at 23:22):

Since it is called a challenge, and the IMO happens only once a year, do you foresee this being something like a yearly Kaggle competition where teams compete once a year? (Or CASP, if you want a more academic example.) If so, are we just waiting until mathlib is ported to Lean 4 to start the competition? When do you think that would happen? Next year? Will you provide examples ahead of time showing how the problems will be structured?

Jason Rute (May 21 2021 at 23:22):

Or is it more like a benchmark (you mentioned ImageNet), where one publishes papers showing their progress on the benchmark? Researchers are told not to look at the most recent IMO problems until they test their model, and then, to double check, the best models of that year will likely be tested again on the newest IMO problems as they come out. If this is the approach, how do you think it will get started? Will you release a bunch of examples and guidelines and maybe some baselines using your nondeterministic tactic framework?

Jason Rute (May 23 2021 at 01:30):

Update: Ok, forgive my obnoxious line of questioning. I've just come to the realization that I'm thinking about this the wrong way. I was thinking about the IMO Grand Challenge as this particular thing that was announced a year and a half ago and will eventually take place in Lean 4 with rules and guidelines and specific examples of how problems will be encoded. Maybe that is still true, but I've now come to the conclusion that the main heart of the IMO grand challenge is just to pose the question:

Can a computer get a gold medal in the IMO, and if so what would it take?

Jason Rute (May 23 2021 at 01:30):

Any researcher is free to work on that question right now, publish results, and start a dialogue with the community about them. For example, there are already a number of IMO problems in Lean 3 in mathlib/archive and similar math competition problems in MiniF2F for both Lean 3 and Metamath (with other provers possibly in the future). So it is already possible to test current systems like the GPT-f family of neural provers on math competition problems right now. (Also, if someone is curious how their favorite automatic theorem prover does on the 2021 IMO problems, it isn't hard to code up the six theorems in any ITP.)
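
For concreteness, here is a rough sketch of what stating such a problem looks like, assuming Lean 4 syntax (this particular problem, IMO 1959 Q1, is already formalized in mathlib's Lean 3 archive; the names below are illustrative, not the actual mathlib ones):

```lean
-- Sketch of a formal statement of IMO 1959 Problem 1: the fraction
-- (21n + 4)/(14n + 3) is irreducible for every natural number n,
-- i.e. the numerator and denominator are coprime.
theorem imo1959_q1 (n : Nat) : Nat.gcd (21 * n + 4) (14 * n + 3) = 1 := by
  sorry -- a prover "solves" the problem by supplying a proof in place of `sorry`
```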

Jason Rute (May 23 2021 at 01:30):

Moreover, if a research team has their own idea of how to solve the IMO and they succeed, it won't really matter what the details are. If they (miraculously) succeed in generating pure natural language solutions, that would easily be a cover story in Nature or Science. Or if a team solves the IMO in Isabelle or Coq instead of Lean, no one can fault their work if it is good. If a team uses excessive amounts of compute, say a two-billion-USD neural model or a powerful SMT/SAT solver run for a month, then we can have a discussion as a community (as is usual for research) about whether this really counts as "earning an IMO gold medal", but it will still be potentially good, publishable research. Also, if there is any question about whether an agent is overfit to the particular test problems, it is easy-ish to ask an expert to come up with six new IMO-level problems to feed to the agent (or one can just wait until next year's IMO to get fresh problems).

Jason Rute (May 23 2021 at 01:30):

While it is nice to have standard benchmarks to point to and use as yardsticks, they aren't necessary for research, and I now feel more at ease about the ambiguity surrounding this challenge.

Scott Morrison (May 23 2021 at 02:14):

It may also be worth setting the bar a little lower at first --- I will be extremely impressed if any system can convincingly score 1/42 on an IMO exam, let alone get a medal.

Jason Rute (May 23 2021 at 02:29):

I don't know how the IMO scoring works (I went to a small high school and almost completely missed out on math competitions), but I think the MiniF2F dataset will be a good formal math benchmark in that regard since it will have a wide range of problems. So we will get more signal than just "it couldn't solve any of the IMO problems this year".

Daniel Selsam (May 24 2021 at 15:27):

@Jason Rute Thanks for following up.

I've now come to the conclusion that the main heart of the IMO grand challenge is just to pose the question: Can a computer get a gold medal in the IMO, and if so what would it take?

Yes, that is the heart of the IMO-GC. Well said.

I still have a lot of questions about what it is going to look like over the coming years.

There are a lot of unknowns and unknowables still. The most concrete component of the IMO-GC so far is the global collaboration to make olympiads (and mathematical problem solving more generally) a viable research platform for both the ML and AR communities. Note that we are still far from this: even ignoring the upcoming port, we are a long way from even being able to state geometry problems.

The second component is the quest to win, and this is where the dynamics are hard to predict. I see five complications:

  • how to deal with the fact that the IMO is too hard to measure progress on at the beginning?
    • proposal: measure on a range of olympiads (there are dozens per year and some are much easier)
  • how to prevent test-set hacking?
    • proposal: keep a running leaderboard, measuring on olympiad problems/shortlists as they are released
  • to the extent that the winner builds off community data/tactics/contributions, how to ensure the community receives its share of the glory?
    • proposal: try to maintain two storylines for now (collaboration/competition) and adjust focus depending on the technical approaches that are winning
    • examples: an AlphaGo Zero approach would deserve all the glory, whereas one that trained on mountains of community data and dispatched to community-written tactics should share it
  • how to deal with the fact that it may be impossible to package some agents so they can be checksummed and run in a cloud sandbox?
    • proposal: TBD, but we may need to resort to an honor code for the leaderboard rather than accepting submissions
  • related: how to ensure the winning technology advances science as much as possible (i.e. is reproducible)?
    • the original rules (requiring the entire system to be open-source and easily reproducible) are ideal but might not be feasible, e.g. if the NN is an org's singular $1B model, so we'll have to adjust the rules depending on the technical approaches taken

I know you, Daniel, and your collaborators are working on this by developing your nondeterministic tactic framework. I wonder, however, what role other researchers will have in this challenge.

So far, all proposed approaches depend heavily on tactics. When gpt-f emits simp, it is building on thousands of lines of code, ten thousand user-annotated simp lemmas, and an ongoing effort by the mathlib community to structure the library so that simp is effective. If the community does start writing nondeterministic tactics down the road, it would likely help all approaches as well.
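
As a minimal illustration of that dependence (a toy sketch with made-up names, not actual mathlib lemmas): simp only rewrites with equations that users have tagged, so its power scales with the community's annotations.

```lean
-- A toy definition plus a user-annotated rewrite rule for `simp`.
def double (n : Nat) : Nat := n + n

@[simp] theorem double_def (n : Nat) : double n = n + n := rfl

-- `simp` closes this goal only because `double_def` was tagged `@[simp]`;
-- a model that emits `simp` is implicitly leaning on every such annotation.
example (n : Nat) : double (double n) = (n + n) + (n + n) := by
  simp
```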

If they (miraculously) succeed in generating pure natural language solutions, that would easily be a cover story on Nature or Science.

If a team wins I2I (informal-to-informal) without indirecting through F2F (formal-to-formal), that would be absolutely amazing, and I bet the IMO-GC committee will agree that the grand challenge has been won.

Also, if there is any question about whether an agent is overfit to the particular test problems, it is easy-ish to ask an expert to come up with six new IMO-level problems to feed to the agent (or one can just wait until next year's IMO to get fresh problems).

Doing well on past IMO problems may be a useful metric at times, but ultimately it is closer to 0 than to 1. Please, anyone reading this, do not over-claim until you do well on a fresh IMO.

Jason Rute (May 24 2021 at 22:34):

Thanks @Daniel Selsam! This helps clarify a lot for me (and hopefully for others).


Last updated: Dec 20 2023 at 11:08 UTC