Zulip Chat Archive

Stream: IMO-grand-challenge

Topic: Current status?


Kevin Lacker (Sep 21 2020 at 21:31):

Hey, I saw today's Quanta article about the IMO grand challenge. It sounds cool. I was wondering what the current status is - are there any teams working on an IMO problem solver with some work in progress? I'd be interested in helping out if there was some group open to outside contributors

Kevin Buzzard (Sep 21 2020 at 21:54):

We are currently very interested in getting a nice collection of IMO problems formalised, both statements and proofs

Kevin Lacker (Sep 21 2020 at 21:55):

is there a repo of this stuff somewhere

Kevin Buzzard (Sep 21 2020 at 21:55):

There are a couple of Olympiad problems formalised in mathlib

Kevin Lacker (Sep 21 2020 at 21:57):

is there a big list or is it like, the combinatorics ones are in the combinatorics directory, etc

Kevin Buzzard (Sep 21 2020 at 21:57):

https://github.com/leanprover-community/mathlib/blob/master/archive/imo1988_q6.lean

Kevin Buzzard (Sep 21 2020 at 21:57):

There are very few

Kevin Buzzard (Sep 21 2020 at 21:57):

It would be nice to start organising them like that

Kevin Buzzard (Sep 21 2020 at 21:57):

They will all live in the archive directory

Kevin Buzzard (Sep 21 2020 at 21:58):

src is for the standard maths library, this is entertainment

Kevin Lacker (Sep 21 2020 at 22:01):

i wonder what the very easiest IMO problem to formalize is...

Stanislas Polu (Sep 22 2020 at 06:40):

IMO 1972 b2 is quite easy and we chatted about its formalization in general a few weeks ago. I'll make a PR to add it to archive/

Stanislas Polu (Sep 22 2020 at 07:08):

Done here: https://github.com/leanprover-community/mathlib/pull/4209

Quang Dao (Sep 22 2020 at 10:56):

For some problems that ask you to find a certain object (say IMO 2019 P1 & P4), how would you write the statement in Lean?

Johan Commelin (Sep 22 2020 at 10:57):

There has been quite some discussion about that (on this stream) without a satisfactory answer so far

Johan Commelin (Sep 22 2020 at 10:59):

@Quang Dao see https://leanprover.zulipchat.com/#narrow/stream/208328-IMO-grand-challenge/topic/How.20to.20help.3F/near/175254384

Johan Commelin (Sep 22 2020 at 10:59):

(took some time to find that discussion)

Quang Dao (Sep 22 2020 at 11:06):

thanks! this link from the discussion is now unusable: https://github.com/IMO-grand-challenge/formal-encoding/blob/master/design/determine.lean

Quang Dao (Sep 22 2020 at 11:06):

do you know where it has moved to?

Quang Dao (Sep 22 2020 at 11:13):

anyway, I'm trying to find simpler instances in Lean where the problem is "find something". Could you solve a polynomial equation (say a quadratic) in Lean?
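Something like the following rough Lean sketch is the sort of thing I mean (toy numbers, with the answer baked into the statement; nothing here is from mathlib):

```lean
import Mathlib

-- "Solve x² - 5x + 6 = 0 over ℝ": the answer {2, 3} is written into the
-- statement, and the proof shows the equivalence in both directions.
example (x : ℝ) : x ^ 2 - 5 * x + 6 = 0 ↔ x = 2 ∨ x = 3 := by
  constructor
  · intro h
    have h' : (x - 2) * (x - 3) = 0 := by linear_combination h
    rcases mul_eq_zero.mp h' with h2 | h3
    · exact Or.inl (by linarith)
    · exact Or.inr (by linarith)
  · rintro (rfl | rfl) <;> norm_num
```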

Miroslav Olšák (Sep 22 2020 at 11:20):

Reposting my list of "find something" answers of shortlisted problems 2006 -- 2018: http://www.olsak.net/mirek/determine-answers.txt

Daniel Selsam (Sep 22 2020 at 13:13):

Quang Dao said:

thanks! this link from the discussion is now unusable: https://github.com/IMO-grand-challenge/formal-encoding/blob/master/design/determine.lean

https://gist.github.com/dselsam/da69f929de2d623b4a8b5d8ef8a278f9

Lucas Teixeira (Jan 10 2023 at 14:11):

What's the status of the IMO challenge??? What are some of the noteworthy attempts at solving it and what benchmarks have they set??

Lucas Teixeira (Jan 10 2023 at 14:13):

Nvm, I think this topic is more appropriate for #Machine Learning for Theorem Proving

Jason Rute (Jan 10 2023 at 19:09):

The IMO grand challenge has always seemed to me to be more of an idea than a concrete plan. I suspect it won't start happening until a group has an AI which they think can do well on it. Although there is a committee so maybe it is active and I'm just not in the loop. Nonetheless, there has been a lot of progress on the MiniF2F benchmark which can be thought of as an easier version of the IMO grand challenge.

Jason Rute (Jan 10 2023 at 19:27):

In another thread @Joseph Myers (who is not on the IMO Grand Challenge committee but seems very interested in it) was trying to get it going.

Junyan Xu (Jan 10 2023 at 19:27):

Just a thought: is it possible to set up something like the SuperGLUE benchmark/leaderboard ...

Jason Rute (Jan 10 2023 at 19:47):

I should also point out that a major premise of the IMO grand challenge is that for a computer to do well on the IMO (and for a human to trust the computer's solution) it has to give a formal proof. Models like Minerva (and even ChatGPT) are starting to cast doubt on that assumption. (Although models like Draft Sketch Prove (#Machine Learning for Theorem Proving > More papers on autoformalization) also show that the two approaches are quite compatible, and that systems which generate informal proofs and convert them to formal proofs are a nice symbiosis.)

Johan Commelin (Jan 10 2023 at 19:50):

Why do you say that even ChatGPT starts to cast doubt on that assumption?
If anything, it has strengthened my belief that I'll only believe a computer-generated proof if it is formally verified. ChatGPT is way too good at bullshitting.

Jason Rute (Jan 10 2023 at 19:59):

Before Minerva, I was 99% sure the first computer solutions to the IMO grand challenge would be formal. After, I'm 80% sure. As for BS, the question is if that is a fundamental property of large language model applications, or if it can be fixed (without having to formalize everything in a formal system).

Jason Rute (Jan 10 2023 at 20:01):

(To be clear, I don't think ChatGPT itself, especially the current version, would do well.)

Jason Rute (Jan 10 2023 at 20:15):

(I thought I was posting in #IMO-grand-challenge > Current status? in response to a question asked there. Sorry to resurrect this old thread.)

Notification Bot (Jan 10 2023 at 20:32):

5 messages were moved here from #Machine Learning for Theorem Proving > Meta IMO result by Johan Commelin.

David Renshaw (Jan 10 2023 at 21:19):

Pretty soon we should be able to translate most of minif2f to Lean 4.

Joseph Myers (Jan 10 2023 at 23:44):

A key feature of the IMO Grand Challenge should be that it keeps entrants honest: (a) the problems are new, and entrants required to have finished their coding and training before the IMO, so they can't have learned from reading an informal solution to the same problem; (b) the marking criteria are as far as possible objective, public and such that marks for a solution can be independently verified without inside knowledge (they don't depend on the (non-public, not collected anywhere) mark schemes used for human solutions at the IMO, nor on having a common understanding of what somewhat vague natural language attempts do or do not count as sufficient for a complete solution worth full marks); (c) entrants make a public declaration in advance that they are entering, to avoid publication bias from only announcing results that sound good; (d) the formal statements are independently agreed before the competition to be an accurate translation of the informal statements.

I don't think existing public claims of AI to have solved olympiad-like problems satisfy any of (a), (c) or (d), while it's not clear how informal-to-informal AI would manage (b) (at least it would need independent assessment of any claimed solutions by someone who was a coordinator on the relevant problem at the IMO).

Jason Rute (Jan 11 2023 at 01:53):

I think something in the spirit of (b) is possible for informal-to-informal. By the way, one reason that Daniel Selsam, the creator of the IMO grand challenge, didn't address the informal-to-informal part is that he thought this would be so hard that he said he would resign if it happened (see section ML-topia in his blog post). I don't know if he still thinks that, but I'd hate to think that a natural language solution is rejected just because it is in natural language. (If, like 2019 Daniel Selsam, one thinks that it is so impossible it is not worth planning for, I can respect that opinion.)

Kevin Buzzard (Jan 11 2023 at 08:55):

I'm on the committee and as far as I know it's not active. I've been taking a great interest in Joseph's posts on how the technical details should work

Daniel Selsam (Jan 12 2023 at 02:56):

@Jason Rute time will tell but fwiw I think formal is now the underdog -- and I did resign! I just got a new job :)

David Renshaw (Jan 12 2023 at 03:33):

I think formal is now the underdog

:open_mouth:

Johan Commelin (Jan 12 2023 at 07:28):

@Daniel Selsam Congrats on the new job! (Mind telling us where/what?)

What made you switch positions so that you now think formal is the underdog? Is there a big breakthrough that I missed?

fzyzcjy (Jan 12 2023 at 11:46):

+1 pretty curious why formal is not a good approach?

Joseph Myers (Jan 12 2023 at 12:28):

You could do an autoformalization challenge on IMO problems: give the AI formal statements (easy-mode, i.e. answers for any "determine" problem included) for the problems of an IMO from after that AI was coded and trained and informal statements and solutions, and ask it to produce formal proofs of the provided formal statements based on the given informal proofs. My impression of the current state of AI is that this is a lot closer to working than producing full proofs of recent IMO problems from scratch (and that as discussed above, if an AI could produce a valid informal solution then it could be connected to an AI that produces a formal version from that).

David Renshaw (Jan 12 2023 at 12:46):

The IMO Grand Challenge as originally stated, i.e. formal to formal, still strongly resonates with me. If that approach is now the underdog (or perhaps more charitably: is receiving less hype these days) well, then, all the more glory there will be when it arrives at the destination first. :P

Daniel Selsam (Jan 12 2023 at 15:45):

Johan Commelin said:

Is there a big breakthrough that I missed?

No recent breakthroughs, but that is the whole point: the current models are the same models as a few years ago (which couldn't even produce coherent sentences) just bigger.

  • scaling reliably lowers perplexity
  • lowering perplexity reliably improves capabilities
  • tons of runway left in compute & data, + low-hanging algorithmic fruit everywhere

I think formal is the underdog now for IMO because there is a lot of hard work humans would need to do and they are not strongly incentivized to do it, whereas informal gets to ride the scaling wave and will (I think) eventually fall as a side-effect. I agree with you that current models are nowhere near and think that if one did not know their history one would be right to dismiss them entirely -- however, I think this annoying extrapolation argument is actually very strong. We'll see.

Bolton Bailey (Jan 12 2023 at 15:56):

@Daniel Selsam What do you think is the median year in which the challenge will be completed, if you don't mind my asking?

Mario Carneiro (Jan 12 2023 at 16:39):

@Daniel Selsam said:

I agree with you that current models are nowhere near and think that if one did not know their history one would be right to dismiss them entirely -- however, I think this annoying extrapolation argument is actually very strong. We'll see.

There is a limit to the extrapolation argument. Right now the models are so big that only large companies can run the models and everyone else has to buy time on them. If they keep growing they will become national level artifacts. At some point it won't be worth their time to run some silly math problem some researchers thought up. This doesn't seem to be scaling to me in the sense that we aren't empowering everyone to do formal methods, we are centralizing and gatekeeping the tools because of the spectacular costs involved. It's not really a question of whether it is possible (well it is to some extent, but once we prove it is possible that won't be enough to keep the funding coming) - it is whether it is economical, and while you can handwave to some extent and say tech gets better and cheaper over time, the growth of ML models is far outpacing that growth.

fzyzcjy (Jan 12 2023 at 23:50):

Daniel Selsam said:

I think formal is the underdog now for IMO because there is a lot of hard work humans would need to do and they are not strongly incentivized to do it

I would appreciate knowing what that hard work is - do you mean deep learning algorithms and ideas, or the time-consuming work of formalizing the whole math world?

Siddhartha Gadgil (Jan 13 2023 at 02:06):

Mario Carneiro said:

Daniel Selsam said:

I agree with you that current models are nowhere near and think that if one did not know their history one would be right to dismiss them entirely -- however, I think this annoying extrapolation argument is actually very strong. We'll see.

There is a limit to the extrapolation argument. Right now the models are so big that only large companies can run the models and everyone else has to buy time on them. If they keep growing they will become national level artifacts. At some point it won't be worth their time to run some silly math problem some researchers thought up. This doesn't seem to be scaling to me in the sense that we aren't empowering everyone to do formal methods, we are centralizing and gatekeeping the tools because of the spectacular costs involved. It's not really a question of whether it is possible (well it is to some extent, but once we prove it is possible that won't be enough to keep the funding coming) - it is whether it is economical, and while you can handwave to some extent and say tech gets better and cheaper over time, the growth of ML models is far outpacing that growth.

The way I think of this issue is that large models are becoming utilities, like power plants. Very few companies/people will generate electricity, but many will make electrical appliances and most people will use them.

The above holds provided "prompt engineering" and "post processing" will be enough to get good results from the huge models.

Junyan Xu (Feb 14 2023 at 06:28):

@Mario Carneiro I think models like Toolformer, which offload/outsource some capabilities to external (symbolic) tools, will drive down at least the inference cost, if not the training cost as well. The paper reports competitive zero-shot performance with much lower parameter count. (Langchain's MRKL systems are also mentioned by many people in connection with it.) RETRO performs well on knowledge-intensive tasks with much lower parameter count, by retrieving from a database rather than a toolbox. And there are architecture innovations like RWKV-LM that are less resource-intensive than transformers but may prove to be as powerful (they just released an RNN with 14B parameters).

Next-generation models may be trained to not just use tools, but also create tools and organize them into a codebase (think of long-term memory, or crystallized intelligence), and they're gonna read books and papers and grow mathlib for us automatically (auto-formalization). Being able to execute programs and observe outputs serves as a strong form of grounding (Yoav Goldberg, 5th paragraph), similar to formalization of mathematics. Drori et al.'s PNAS paper and Program-aided Language Models from CMU already do program synthesis in Python, but do not store the programs and build upon them; as we know, to prove a big theorem, it's often necessary to have lots of intermediate lemmas in place.

As the model continually learns, it might gradually optimize programs in its codebase (think of AlphaTensor), including other models it creates and stores there, so more efficient models could eventually emerge in the codebase and the original model could offload most of its capability to them, and we could just use the most efficient one. The original model probably needs to use adaptive computation time, though, since otherwise offloading can't possibly save time even though it may improve accuracy. Data center cooling and chip design are other places where a positive feedback loop for AI efficiency could be formed.

ChatGPT has shown that reinforcement learning can be a very effective way of aligning language models to our purposes, and a codebase has a naturally associated dependency graph, which could potentially help with credit assignment, so actions that wrote the most popular functions/theorems in the codebase/library could be rewarded. As a bonus, PageRank is easy on a DAG.

I wrote down some thoughts in this tweet thread a few weeks ago but apparently I was talking into thin air :laughter_tears: Are my ideas too crazy, or too well-known and already actively pursued in major labs @Geoffrey Irving @Timothee Lacroix? Or maybe I was just shadow-banned by Twitter. (Unfortunately it seems OpenAI's Lean paper authors have all left ...)

I think these ideas overlap with those of memorizing transformers from @Yuhuai Tony Wu, @Christian Szegedy, et al. which use non-differentiable memory as well, but memory access is via kNN lookup of vector embeddings. Since LMs are supposed to be good at languages, I think it's natural to let them establish and follow naming conventions to name the new concepts they invented, and interact with their memory via tokens, but I don't have the experience to judge which approach is more practical.

As for democratization of access, I'm generally optimistic: I'm pretty impressed that I can now run Stable Diffusion (SD) to create 512x512 images with just 4GB VRAM on my laptop. And it may not be difficult to adapt LMs to our particular purpose: for example, it now suffices to fine-tune only 3MB of the SD model's parameters on custom data, thanks to LoRA. Open Assistant is now crowd-sourcing training data to create a ChatGPT clone, and this paradigm of distributed data generation combined with centralized training can be traced back to Leela Zero / Leela Chess Zero. It's worth noting that people are building distributed training infrastructure as well, but it may take time to mature.

Volunteer computing has a long history and people have donated massive amounts of compute to projects like GIMPS (Mersenne prime search), or Fishtest, and through BOINC. So it's just a matter of popularizing mathematics to get more computing resources directed to mathlib. When AI is capable of creating new mathematics, it will draw even more volunteer resources: it's much more exciting to see long-standing conjectures being solved than to see Elo ratings improve monotonically (pun intended). Considering how fast LMs can turn natural language into code and vice versa (already useful in reverse engineering), I think the formal-informal gap could be fairly small for machines.

Time is on our side. Conversational LMs will revolutionize education, and people around the world will get equal access to the most sophisticated mathematical ideas. Advanced mathematics will come out of the brains of geniuses and the ivory towers and reach more people than ever. (mathlib is already democratizing access to every detail of every proof and definition in it, which you don't normally get in textbooks, and LMs will be able to parse and summarize the Lean code and explain the big picture.) As automation progresses and people work fewer hours, they will have more spare time to devote to mathematics, among other means of entertainment. Gladly, Microsoft's CEO seems to share the same vision:

Do we need to learn math anymore? Why learn math?
Lemme tell you, I'm an electrical engineer who never understood the Maxwell's equations. So now I finally get, thanks to ChatGPT, a better understanding of it. So I think one of the things is we will all enjoy a lot more math because we will have that personalized tutor who will, in fact, be able to intervene at the crucial time when you're making a conceptual mistake and help you learn. So just imagine that, just Matt, what if there was a fantastic tutor teacher for every student learning math going forward, that is now possible.

Exciting times are ahead!

(Mandatory disclaimer: ChatGPT did not engage in the writing of this essay.)

Junyan Xu (Mar 26 2023 at 07:45):

The above post is being updated here with the latest developments. Many tools look helpful to mathlib development, but I haven't formed an actionable plan to combine them in the best way. Meanwhile, @Scott Morrison has already built something like ReAct over at the GPT thread.

Junyan Xu (Jun 12 2023 at 04:04):

https://www.lesswrong.com/posts/sHcHriEoJAGcRt3oC/i-bet-usd500-on-ai-winning-the-imo-gold-medal-by-2026

David Renshaw (Jun 12 2023 at 13:54):

https://twitter.com/azsantosk/status/1655998142544846865

(excluding open-source, I don't want to encourage that)

Wow. It feels disorienting to hear this kind of attitude.

Joseph Myers (Jun 13 2023 at 00:24):

Is that person who doesn't want open-source AI coming from the existential risk school of AI alignment?

Scott Morrison (Jun 13 2023 at 00:33):

That was my understanding of the statement.

Joseph Myers (Jul 15 2023 at 23:36):

One of the coordinators at IMO 2023 asked ChatGPT to solve problem 1 (i.e., informal-to-informal) and posted the results on the internal coordinator Discord. ChatGPT produced a load of nonsense, not even guessing the right answer (prime powers). I suspect that's more representative of what current AI can actually do on IMO problems than the misleading hype we sometimes see.

Jason Rute (Jul 16 2023 at 02:11):

On one hand, I’m not sure ChatGPT used naively (especially without a systematic feedback and filtering strategy), is really the state of the art on IMO problems. On the other hand, I’m not sure that any state of the art system would do much better at the moment (especially since no one has taken up the challenge so far).

Zhangir Azerbayev (Jul 18 2023 at 15:33):

Jason Rute said:

On one hand, I’m not sure ChatGPT used naively (especially without a systematic feedback and filtering strategy), is really the state of the art on IMO problems. On the other hand, I’m not sure that any state of the art system would do much better at the moment (especially since no one has taken up the challenge so far).

I suspect the method in "Let's Verify Step by Step" would generate something that isn't nonsense some of the time and would be able to catch nonsense solutions.

I also suspect oai might've nerfed the results in the paper for publication, as the industry labs are rumored to be doing.

Mario Carneiro (Jul 18 2023 at 15:49):

Zhangir Azerbayev said:

I also suspect oai might've nerfed the results in the paper for publication, as the industry labs are rumored to be doing.

Could you explain what you mean by this? What does it mean to nerf the results, and what is the incentive to do so?

Jason Rute (Jul 18 2023 at 15:53):

By "the paper for publication" you mean LVSbS?

Zhangir Azerbayev (Jul 19 2023 at 15:54):

Mario Carneiro said:

Zhangir Azerbayev said:

I also suspect oai might've nerfed the results in the paper for publication, as the industry labs are rumored to be doing.

Could you explain what you mean by this? What does it mean to nerf the results, and what is the incentive to do so?

The industry labs don't publish their best innovations anymore for competitive and safety reasons. Given this fact, it's not implausible that some of the publications that do come out don't include all the researchers' findings.

There are two things that jump out at me as natural ways to improve LVSbS: namely finetuning the model for more than just 1.5B tokens of mathematical text and doing reinforcement learning using the verifier. But there's also a good chance I'm just being paranoid and these two things simply didn't work or the team didn't have time to implement them.

Zhangir Azerbayev (Jul 19 2023 at 15:55):

Jason Rute said:

By "the paper for publication" you mean LVSbS?

Yes.

Aren Güralp (Aug 06 2023 at 20:45):

Is anybody here working on this at the moment?

Jason Rute (Aug 06 2023 at 22:44):

I’d be surprised if anyone admits as much. A number of people are working on AI in both formal and informal theorem proving, including the MiniF2F benchmark which contains old Olympiad problems, but I don’t know of anyone publicly working on preparing for the 2024 IMO (and I don’t think the IMO Grand Challenge committee has done anything in years). I know some people want to force there to be a competition sooner rather than later even if all the entrants do terribly, but at the same time I can’t imagine anyone stepping forward to compete unless they think they can get a non-zero score.

Jason Rute (Aug 06 2023 at 22:44):

I think a different competition filled with a range of problem difficulties would be better but no one is talking about organizing that, and I don’t claim it would be easy to organize since one still has to come up with problems, rules, and entrants.

Kevin Buzzard (Aug 06 2023 at 22:46):

I can confirm that the IMO Grand Challenge committee has not done anything in years. Is there anything in particular that you want us to do? The only thing I thought of was summarising some of Joseph Myers' excellent posts about the practicalities of the grand challenge, but I half-wondered whether less is more.

David Renshaw (Aug 06 2023 at 23:03):

I still have a long-term interest in the challenge, though somewhat backburnered at the moment.

Scott Morrison (Aug 07 2023 at 01:52):

Could the committee encourage or be involved in producing "mathematician endorsed" benchmarks that would allow reporting on intermediate progress towards IMO automation?

Aren Güralp (Aug 07 2023 at 02:05):

Another question to ask is how an attempt would even take place - it's not as if every IMO comes with formalized versions of its problems. Could the committee contribute to progress on the challenge by producing "official" formalizations of IMO problems/solutions for past years and each passing year?

Scott Morrison (Aug 07 2023 at 03:33):

I would presume that the committee is not inclined to produce official solutions. There are many formalised solutions in the Archive/IMO/ directory of the Mathlib repository.

Scott Morrison (Aug 07 2023 at 03:34):

"Official" formalised statements for at least the last few years is an excellent question, however!

Scott Morrison (Aug 07 2023 at 03:35):

And formulating a plan to generate formalisations of the new problems each year as quickly as possible after the IMO might be useful too!

Jason Rute (Aug 09 2023 at 01:01):

I started a new thread about automated theorem proving competitions in general since I think the discussion here shows some interest in this topic: #Machine Learning for Theorem Proving > Automatic Theorem Proving Competitions. I don't have any concrete answers but it would be nice to think more about this topic in general, not just in the context of the IMO Grand Challenge.

Yutaka Nagashima (Aug 30 2023 at 22:14):

Dear IMO Grand Challenge committee,

I'm contemplating entering the IMO 2024 challenge but with Isabelle/HOL. Would the committee be open to the following approach?

  1. I'll submit my enhanced Isabelle extension tailored for IMO problems to the committee before the start of IMO2024.
  2. The IMO2024 challenge takes place.
  3. After the challenge, I'll formalize the year's IMO problems using Isabelle/HOL.
  4. Next, I'll evaluate how effectively my tool addresses these problems.
  5. If successful, I'll share the results with the committee for validation.
  6. It's important to emphasize that I'd like to formalize the problems myself. The complexity in the proof search can shift dramatically depending on their framing in Isabelle/HOL.

For the record, I don't expect to lean heavily on extensive computational power or unique hardware for the proof search. I assume this won't pose any concerns.

Regards,
Yutaka

Kevin Buzzard (Aug 30 2023 at 22:49):

It would be nice to know your full name but as far as I'm concerned anything goes right now when it comes to how this will actually work.

David Renshaw (Aug 30 2023 at 22:56):

@Yutaka that sounds great. I imagine that the way things will get jump started on the Lean side is when someone does basically those same steps, but for Lean.

David Renshaw (Aug 30 2023 at 22:57):

If successful, I'll share the results with the committee for validation.

sharing unsuccessful results is also good and interesting!

David Renshaw (Aug 30 2023 at 23:03):

(I am not on any official committee, but my impression is that there has not been much central coordination for the IMO Grand Challenge since its inception, except for what can be found in this Zulip channel. )

Miroslav Olšák (Aug 30 2023 at 23:03):

Hi, @Yutaka , have you tried your system on any previous IMO problems? I am curious what your system can do. The induction tactics I have seen (if you are the Yutaka I know) were quite impressive but I would be afraid that it is still rather far from IMO level...

Yutaka Nagashima (Aug 30 2023 at 23:08):

Kevin Buzzard said:

It would be nice to know your full name but as far as I'm concerned anything goes right now when it comes to how this will actually work.

Hi, sorry, that was my first or second post to Zulipchat. And I am still learning how to use this.
I think now my profile shows my full name.

Yutaka Nagashima (Aug 30 2023 at 23:10):

David Renshaw said:

(I am not on any official committee, but my impression is that there has not been much central coordination for the IMO Grand Challenge since its inception, except for what can be found in this Zulip channel. )

Perhaps things will change when someone solves a problem, making the prospect of winning a gold medal seem more attainable.

Joseph Myers (Aug 30 2023 at 23:28):

You should publish your proposed formalization conventions for how IMO problem statements are translated from English to Isabelle/HOL before the IMO. I expect that would include things such as choices of types (IMO problem statements are happy to e.g. use subtraction or division on natural numbers when a formal statement might need to be more explicit about the type of the result of such an operation) and exactly how particular phrases in an IMO problem are interpreted as implying or not implying nondegeneracy conditions (that's mainly for geometry, but also e.g. if a problem involves division it should probably be taken as including a requirement that there is no division by zero), and how the statement presented to your solver should indicate the result to be proved and any data to be provided for "determine" problems.

To avoid publication bias, you should also commit to announce your results whether or not your solver manages to solve any problems.

Yutaka Nagashima (Aug 30 2023 at 23:28):

Miroslav Olšák said:

Hi, @Yutaka , have you tried your system on any previous IMO problems? I am curious what your system can do. The induction tactics I have seen (if you are the Yutaka I know) were quite impressive but I would be afraid that it is still rather far from IMO level...

Yes, I'm Yutaka from Prague, among other places.

I've yet to test my system against IMO problems, and I anticipate that there's room for improvement in the initial results.

I've recently upgraded the induction prover using an abduction graph (https://youtu.be/d7IXk0vB2p0?t=120) enabling it to autonomously identify beneficial conjectures.

In the coming months, I'm set on infusing it with more intelligence and introducing domain-specific heuristics crafted for IMO challenges.
My hope is that by the summer of 2024, solving one of the IMO problems of that year will become a feasible goal.

Kevin Buzzard (Aug 30 2023 at 23:33):

Oh hi you are the Yutaka I was expecting/hoping :-) Yeah I think that right now IMO is too hard for any system which is why I'm not too fussed about protocol but I think Joseph makes some good points. Do you have any thoughts about how to translate questions of the form "work out the number x"?

Jason Rute (Aug 31 2023 at 13:28):

Hi @Yutaka Nagashima ! I'm familiar with some of your previous work. You started doing neural theorem proving for Isabelle, before it was fashionable. :smile: However, I'm curious if you know about the PISA and MiniF2F benchmarks. PISA is a benchmark of Isabelle/HOL theorems. MiniF2F is a benchmark of competition problems ranging from easy to IMO problems (in many languages including Isabelle/HOL). I think you should really consider testing your system on MiniF2F (and PISA if possible) to see how well it does. That might give you a sense of how close your system is to being able to compete in the IMO (and solve a problem or two). Also, benchmarks in general are a good way to compare progress. For example, do you think your system represents a lot of progress over other Isabelle systems like SledgeHammer, Thor, Draft-Sketch-Prove, Baldur, MagnusHammer, "Decomposing the Enigma", and others? @Albert Jiang and @Wenda Li (also at Cambridge) would be good people to talk to about setting up a benchmark.

Jason Rute (Aug 31 2023 at 13:28):

Along those lines does the system in your video have a paper associated with it?

Yutaka Nagashima (Aug 31 2023 at 21:07):

Joseph Myers said:

You should publish your proposed formalization conventions for how IMO problem statements are translated from English to Isabelle/HOL before the IMO. I expect that would include things such as choices of types (IMO problem statements are happy to e.g. use subtraction or division on natural numbers when a formal statement might need to be more explicit about the type of the result of such an operation) and exactly how particular phrases in an IMO problem are interpreted as implying or not implying nondegeneracy conditions (that's mainly for geometry, but also e.g. if a problem involves division it should probably be taken as including a requirement that there is no division by zero), and how the statement presented to your solver should indicate the result to be proved and any data to be provided for "determine" problems.

Yes, I can declare which types I use in advance. I will likely use the standard types provided by Isabelle for numerical problems. I don't think using the types in the standard library would cause serious concerns regarding the integrity of the formalization. If there were, mathematicians wouldn't use Isabelle.

For my 2024 projects, I am planning to skip geometry. At least for me, the geometry problems seem harder to formalize in Isabelle.

It would be nice to have a list of the typical controversial English expressions used in IMO problems that cause confusion when formalizing them.

To avoid publication bias, you should also commit to announce your results whether or not your solver manages to solve any problems.

Yes. That’s right.

Yutaka Nagashima (Aug 31 2023 at 21:26):

Kevin Buzzard said:

Oh hi you are the Yutaka I was expecting/hoping :-) Yeah I think that right now IMO is too hard for any system which is why I'm not too fussed about protocol but I think Joseph makes some good points. Do you have any thoughts about how to translate questions of the form "work out the number x"?

I believe “work out the number x” means “find the number x (that satisfies certain conditions specified in the problem)” in English.

I think that is equivalent to proving “there exists x, such that …”?

Then, to prove the theorem, a prover must at least internally know what “x” is.

My proof would likely refer explicitly to the value of “x” unless finding that value is trivial for existing tools.

For example, Isabelle can prove "∃x. x" using the “auto” tactic. So, my proof wouldn't explicitly state “x=True”, but Isabelle's proof checker should recognize that it proves this theorem by instantiating “x” as True.

For more challenging problems, we have to provide instances explicitly, and we can probably see them in the proof scripts.

Yury G. Kudryashov (Aug 31 2023 at 21:31):

No, "find xx" and "prove that xx exists" are two different problems.

Yury G. Kudryashov (Aug 31 2023 at 21:31):

E.g., "how many roots has the following equation?"

Yury G. Kudryashov (Aug 31 2023 at 21:32):

Or "how many weighings does one need to find the bad coin?"

Yury G. Kudryashov (Aug 31 2023 at 21:33):

To be more specific, what do you expect as input and output for https://artofproblemsolving.com/wiki/index.php/2023_IMO_Problems/Problem_1 ?

Scott Morrison (Aug 31 2023 at 22:11):

Surely the type of the output for that problem needs to be a sigma type, with the first element being a function (n : ℕ) → composite n → Bool, and the second an iff proof relating that condition to the condition in the statement.

Scoring well is not just an inhabitation of this type (which one could do trivially because the condition in the statement is decidable, and hence has a coercion to Bool), but an inhabitation for which the first component is "obviously" "n = p^x for some prime p and x ≥ 2".

As far as I understand, the GC committee have never specified whether "obviously" here consists of a human inspecting the term after "submission" or some allowed step of automation (aesop?) to compare with the standard answer.

(I feel like everyone has already acknowledged that IMO problems do not reduce to inhabitations of specified types, but keeps avoiding this issue?)
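For illustration, here is a minimal Lean sketch of a Prop-valued variant of that shape, with a toy condition standing in for the real one (everything in it is made up):

```lean
import Mathlib

-- Toy stand-in: the "condition in the statement" is `Even n`, and the answer
-- is an explicit predicate together with a proof that it matches the condition.
example : Σ' p : ℕ → Prop, ∀ n, p n ↔ Even n :=
  ⟨fun n => n % 2 = 0, fun n => Nat.even_iff.symm⟩
```

The hard part, as above, is pinning down formally what makes that first component "sufficiently explicit".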

Yutaka Nagashima (Aug 31 2023 at 22:16):

Yury G. Kudryashov said:

No, "find xx" and "prove that xx exists" are two different problems.

Thanks. I struggle with variables that are not bound explicitly.
I should not have written "is equivalent to".
But how about "implies"?

Does proving "there exists x, such that..." imply "finding x such that..."?

Scott Morrison (Aug 31 2023 at 22:17):

No, this is not enough.

Joseph Myers (Aug 31 2023 at 22:18):

Yutaka Nagashima said:

Yes, I can declare which types I use in advance. I will likely use the standard types provided by Isabelle for numerical problems. I don't think using the types in the standard library would cause serious concerns regarding the integrity of the formalization. If there were, mathematicians wouldn't use Isabelle.

Using standard types is entirely appropriate, but in Lean, for example, if a problem says something is a positive integer, there is a choice to be made between ℕ and ℕ+, which is not a distinction that needs considering in informal mathematics, and if the choice made is ℕ, then you need to convert to ℤ for subtraction (or at least for subtraction that might have a negative result) since IMO problems definitely aren't discussing truncating subtraction, and much the same applies for division. I don't know how Isabelle handles these sorts of issues, but looking at past problems may help you establish rules for choices of types in expressions like that.
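A small Lean illustration of the truncation issue (toy statements only):

```lean
import Mathlib

-- Subtraction on ℕ truncates at zero, so the same expression means different
-- things depending on the chosen type:
example : (2 - 5 : ℕ) = 0 := rfl
example : (2 - 5 : ℤ) = -3 := by decide

-- A faithful reading of "b - a is positive" for naturals a < b typically
-- casts to ℤ first (or avoids the subtraction altogether):
example (a b : ℕ) (h : a < b) : (0 : ℤ) < (b : ℤ) - (a : ℤ) := by omega
```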

Yutaka Nagashima (Aug 31 2023 at 22:19):

Yury G. Kudryashov said:

E.g., "how many roots has the following equation?"

I assume this implies that the answer is a single number.
So, how about proving "∃! n. n is the number of roots of the equation"?

Scott Morrison (Aug 31 2023 at 22:20):

IMO problems often expect the participant to actually write down an explicit numeral to get full points. Lean's type system (merely by allowing substitution --- so this applies to many theorem proving languages) doesn't really allow us to insist on this.

Scott Morrison (Aug 31 2023 at 22:21):

You can add small requirements outside the type system: the solver might be expected to return a s : { x : ℕ // P x }, but with the extra requirement that #eval s.1 prints a numeral in finite time.
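For instance, a toy of exactly that shape (the condition n * n = 144 and the name toyAnswer are made up for illustration):

```lean
import Mathlib

-- The first component is an explicit numeral; the second is its certificate.
def toyAnswer : { n : ℕ // n * n = 144 } := ⟨12, rfl⟩

#eval toyAnswer.1   -- prints 12, so the "answer" really is a numeral
```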

Scott Morrison (Aug 31 2023 at 22:23):

Imo2023Q1 linked to above is a more extreme version of this, where the solver is expected to return "an explicit predicate", so we need either human inspection or an agreed upon automation step to check that the explicit predicate is sufficiently explicit (and equivalent to the required answer).

Miroslav Olšák (Aug 31 2023 at 22:25):

Yutaka Nagashima said:

E.g., "how many roots has the following equation?"
I assume this implies that the answer is a single number.
So, how about proving "∃! n. n is the number of roots of the equation"?

That is not a good idea, you just instantiate n = "the number of roots" :-)
But really -- this discussion again? Why don't we just allow, for a start, specifying the constant we are looking for in the problem statement? The main part of any IMO problem is the proof anyway.
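A toy Lean example of why that instantiation is vacuous (the equation and range are made up):

```lean
import Mathlib

-- The unique-existence statement is provable by plugging in a witness that
-- merely restates the problem, so no informative answer is produced.
example : ∃! n : ℕ, n = ((Finset.range 100).filter fun x => x ^ 2 = 4).card :=
  ⟨((Finset.range 100).filter fun x => x ^ 2 = 4).card, rfl, fun _ h => h⟩
```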

Joseph Myers (Aug 31 2023 at 22:25):

My expectation is that a human checks the term in such cases. Certainly there are problems with multiple reasonable forms for expressing the term; consider IMO 2023 problem 5 where ⌊log₂ n⌋ + 1 or ⌈log₂(n + 1)⌉ are both reasonable informal representations.

Kevin Buzzard (Aug 31 2023 at 22:26):

@Yutaka Nagashima I don't think you have understood the issue here. If the question says "which number works" then the answer cannot be "there exists a number which works". Indeed the existence might be obvious and all the work might be finding the right number

Joseph Myers (Aug 31 2023 at 22:30):

I did previously suggest an "easy mode" of the IMO Grand Challenge, where the solver is given a formal statement to prove that includes the answer for such "determine" problems. But any claims of solutions in easy mode for such a problem do need to make clear that it's an easy-mode problem that was solved rather than claiming to have solved an actual IMO problem.
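In Lean terms, an easy-mode statement might look like this toy (numbers made up; the answer, here 1, is already part of the statement, so only the proof is asked for):

```lean
import Mathlib

-- Easy mode: the answer is given in the statement; the solver just proves it.
example : ((Finset.range 10).filter fun x => x ^ 2 = 4).card = 1 := by decide
```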

Kevin Buzzard (Aug 31 2023 at 22:30):

The issue is that there are IMO problems which do not translate directly into a proposition where the goal is to prove the proposition. The translation into a proposition is sometimes the hard part.

Scott Morrison (Aug 31 2023 at 22:47):

Miroslav Olšák said:

But really -- this discussion again?

because the GC committee still hasn't made a decision, rather spoiling the point of having the GC? :-) (Or maybe spoiling the point of having the committee? :stuck_out_tongue_wink:)

It's clear from this thread that lacking such a decision/announcement is still a problem.

Scott Morrison (Aug 31 2023 at 22:48):

It's sort of hard for mathematicians to complain about ML teams claiming results on the IMO without actually specifying the standards! But there's certainly been this sort of (well-justified) complaining.

Yutaka Nagashima (Sep 01 2023 at 00:50):

Jason Rute said:

Hi Yutaka Nagashima ! I'm familiar with some of your previous work.

Hi Jason, I think I saw you on YouTube. :smile:

You started doing neural theorem proving for Isabelle, before it was fashionable. :smile: However, I'm curious if you know about the PISA and MiniF2F benchmarks. PISA is a benchmark of Isabelle/HOL theorems.

I had a look at MiniF2F and that was why I prefer to specify problems by myself. The way MiniF2F formalizes the problems makes it hard for the Abduction Prover to solve them (for technical reasons mostly specific to Isabelle and my tool).

MiniF2F is a benchmark of competition problems ranging from easy to IMO problems (in many languages including Isabelle/HOL). I think you should really consider testing your system on MiniF2F (and PISA if possible) to see how well it does.

Yes. I will do so (at least for MiniF2F) at some point.

But for the current implementation, I am fairly certain that the result of the Abduction Prover against MiniF2F is not going to be great: the Abduction Prover assumes that it is facing inductive problems without library support (specialized tactics or auxiliary lemmas). The Abduction Prover also assumes that it cannot know the types used in the problems in advance.

These assumptions are an unnecessary burden when we have concrete problem domains in mind, as we do for math competitions. The way the Abduction Prover produces conjectures is not suitable for IMO problems either, I suppose.

That might give you a sense of how close your system is to being able to compete in the IMO (and solve a problem or two). Also, benchmarks in general are a good way to compare progress.

Yes. I agree.
At the same time, I'm curious about the measures taken to ensure test sets aren't used during LLM training. I also wonder how to interpret evaluations that use varying levels of computational resources but are based on the same benchmark.
Perhaps these concerns have been discussed elsewhere?

For example, do you think your system represents a lot of progress over other Isabelle systems like SledgeHammer, Thor, Draft-Sketch-Prove, Baldur, MagnusHammer, "Decomposing the Enigma", and others?

For Sledgehammer, I am confident, since I use it as my subtool and conducted experiments for a predecessor of the Abduction Prover.

Albert Jiang and Wenda Li (also at Cambridge) would be good people to talk to about setting up a benchmark.

Thanks. I met them.

Yutaka Nagashima (Sep 01 2023 at 01:13):

Jason Rute said:

Along those lines does the system in your video have a paper associated with it?

The original idea of the Abduction Prover is presented in this short paper: Goal-Oriented Conjecturing for Isabelle/HOL (or at arXiv). The current implementation is quite different, though

Jason Rute (Sep 01 2023 at 14:16):

Yutaka Nagashima said:

At the same time, I'm curious about the measures taken to ensure test sets aren't used during LLM training. I also wonder how to interpret evaluations that use varying levels of computational resources but are based on the same benchmark.
Perhaps these concerns have been discussed elsewhere?

Those are two really good questions which deserve more answers.

For the first question, my impression is that researchers fall into two camps. (1) Those who think it is not a problem, and (2) those who think it invalidates all the results so far. I fall in the middle. I think data contamination is a problem, but it is subtle because it also depends a lot on the use cases. This is discussed in a fair amount of detail in our appendix to the PACT paper. It is also a frequent topic on #Machine Learning for Theorem Proving. The best solution is just to test on new theorems which come after the training data. The IMO Grand Challenge is an extreme version of this approach, but I think it is too extreme, since it is hard to get a good signal from so few and such hard problems.

As for computational resources, that is also an important and often discussed thing. One nice approach is to plot performance as a function of resources. The MagnusHammer paper does this really well. And it is easy to do for both time (theorems proven versus search time) and search steps (if it is a search based algorithm). This isn't perfect and it is again subtle since neural approaches want GPUs, classical ATPs want lots of cores, reinforcement learning algorithms want days instead of minutes, and so on. It also depends a lot on the use case. A system which can solve IMO problems with a couple GPUs and lots of cores in a few hours seems fine for the IMO grand challenge, but for a typical ITP user, they may want something which just runs on a laptop (or over an inexpensive API) in seconds. Again, I think having benchmarks and competitions which take into account computational resources would be good, but I also don't want to say powerful techniques which currently take days on many GPUs are worthless, since it is very possible to bring both the cost and resources of those approaches down in the near future (and conversely, some approaches specifically work because they scale well with resources, whereas some other approaches can't scale well even if given a lot more resources).

Joseph Myers (Sep 01 2023 at 22:13):

Yutaka Nagashima said:

I had a look at MiniF2F and that was why I prefer to specify problems by myself. The way MiniF2F formalizes the problems makes it hard for the Abduction Prover to solve them (for technical reasons mostly specific to Isabelle and my tool).

The conventions for formalizing problem statements should be reasonably idiomatic for each ITP, but not biased to what a particular tool wants to see as input. In particular, if there is more than one entrant using Isabelle/HOL, all those entrants should agree on the conventions used so that only a single version of the statements in Isabelle/HOL is needed.

Joseph Myers (Sep 01 2023 at 22:17):

Jason Rute said:

The best solution is just to test on new theorems which come after the training data. The IMO Grand Challenge is an extreme version of this approach, but I think it is too extreme, since it is hard to get a good signal from so few and so hard of problems.

You could easily enough pick ten national and regional olympiads held every year with original problems and hold competitions with all of those, as long as there's some agreement between entrants about which olympiads to use. Or use the IMO shortlist (minus the competition problems) when it's released a year later, though that does run the risk of people having posted national selection tests (that may use the shortlist problems during the year that the shortlist remains confidential) online during that time (and also there would be the question of whether to exclude any shortlist problems that the Jury eliminated as being already known, or too close to something already known).

Yutaka Nagashima (Sep 01 2023 at 23:39):

Yury G. Kudryashov said:

To be more specific, what do you expect as input and output for https://artofproblemsolving.com/wiki/index.php/2023_IMO_Problems/Problem_1 ?

Thank you for focusing on a concrete problem.
I will give it a try to formalize this problem (probably when the prover is ready for math problems).
It might take a while, but I assume there's no urgency at the moment.

Yutaka Nagashima (Sep 01 2023 at 23:51):

Miroslav Olšák said:

That is not a good idea, you just instantiate n = "the number of roots" :-)
But really -- this discussion again? Why don't we just allow, for a start, specifying the constant we are looking for in the problem statement? The main part of any IMO problem is the proof anyway.

It sounds like a similar conversation happened in the past. Anyway, let’s continue this in France next week.

Yutaka Nagashima (Sep 02 2023 at 00:58):

Kevin Buzzard said:

Yutaka Nagashima I don't think you have understood the issue here. If the question says "which number works" then the answer cannot be "there exists a number which works". Indeed the existence might be obvious and all the work might be finding the right number

It seems that if we require a proof to be in the form of “n works; therefore, there exists a number which works”, that should suffice, since the first part (“n works”) answers the question of “which number works”.

By the way, I am also considering using what's known as "schematic variables" in Isabelle for "work out" problems. However, I haven't had the chance to use schematic variables directly, so I'm unsure about their effectiveness.

In any case, I'll revisit this topic once I've had some hands-on experience with formalizing IMO problems on my own. Up to now, I've mainly explored problems that were already formalized in MiniF2F. I might gain more insight into the concern you mentioned as I delve deeper into specific IMO problems.

Yutaka Nagashima (Sep 02 2023 at 01:42):

Jason Rute said:

Yutaka Nagashima said:

At the same time, I'm curious about the measures taken to ensure test sets aren't used during LLM training. I also wonder how to interpret evaluations that use varying levels of computational resources but are based on the same benchmark.
Perhaps these concerns have been discussed elsewhere?

Those are two really good questions which deserve more answers.

For the first question, my impression is that researchers fall into two camps. (1) Those who think it is not a problem, and (2) those who think it invalidates all the results so far. I fall in the middle. I think data contamination is a problem, but it is subtle because it also depends a lot on the use cases. This is discussed in a fair amount of detail in our appendix to the PACT paper. It is also a frequent topic on #Machine Learning for Theorem Proving. The best solution is just to test on new theorems which come after the training data. The IMO Grand Challenge is an extreme version of this approach, but I think it is too extreme, since it is hard to get a good signal from so few and such hard problems.

As for computational resources, that is also an important and often discussed thing. One nice approach is to plot performance as a function of resources. The MagnusHammer paper does this really well. And it is easy to do for both time (theorems proven versus search time) and search steps (if it is a search based algorithm). This isn't perfect and it is again subtle since neural approaches want GPUs, classical ATPs want lots of cores, reinforcement learning algorithms want days instead of minutes, and so on. It also depends a lot on the use case. A system which can solve IMO problems with a couple GPUs and lots of cores in a few hours seems fine for the IMO grand challenge, but for a typical ITP user, they may want something which just runs on a laptop (or over an inexpensive API) in seconds. Again, I think having benchmarks and competitions which take into account computational resources would be good, but I also don't want to say powerful techniques which currently take days on many GPUs are worthless, since it is very possible to bring both the cost and resources of those approaches down in the near future (and conversely, some approaches specifically work because they scale well with resources, whereas some other approaches can't scale well even if given a lot more resources).

I lean more towards (2) myself, but I can also see value in good results obtained from data contamination. It's possible that such results might reflect a system's efficacy as a search engine.

Presentation is crucial to me. While I have reservations about associating good results from data contamination with robust reasoning capabilities, I do think they might often translate to a better user experience.

Regarding computational resources, I'm largely in agreement with you. My only slight concern is that the term “search steps” might be interpreted in various ways by different individuals.

A practical solution might be to request authors to clearly declare the computational resources they've utilized. This would allow for a basic comparison, even if, as you rightly pointed out, the exact nature or quantity of these resources can't be distilled into a singular value.


Last updated: Dec 20 2023 at 11:08 UTC