Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Meta IMO result


Sid (Nov 24 2022 at 07:11):

Meta's blog post on their IMO result mentions far stronger results than their paper on HyperTree Proof Search -- 10/20 IMO problems solved and a miniF2F solve rate of 60%, vs 2/20 IMO problems solved and a solve rate of ~40%. What's been added with the new result? Just more compute, or a new technique?

Kevin Buzzard (Nov 24 2022 at 07:26):

One thing which has been added is more hype -- when I questioned them about the result at the Paris conference they told me that "10 IMO problems" did not mean "10 problems which appeared on IMOs", in contrast to the usual usage of the phrase, and were even vague about whether the problems were on the shortlist or long list, as well as being unable to tell me which problems they were. I was told "the details are in the blog post and paper" which was not true.

Kevin Buzzard (Nov 24 2022 at 07:29):

There is furthermore the huge issue of translation of the informal text into formal, for example the problem "find the smallest N for which (some statement involving 7th powers) is true" was replaced by "show that N>=19" and the computer-discovered proof just did the brute force check to show that 1<=N<=18 didn't work. One can certainly argue that the question solved is not the full strength of the IMO question here. In particular there was no proof that 19 worked.
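To make the gap concrete, here is a minimal Lean 3 sketch with a placeholder predicate `P` standing in for "(some statement involving 7th powers)" -- this is not the actual miniF2F statement, just an illustration of the weakening:

```lean
-- Placeholder predicate; `P n` stands in for the 7th-powers condition on n.
variables (P : ℕ → Prop)

-- Full-strength "determine" version: 19 satisfies P and nothing smaller does.
example : Prop := P 19 ∧ ∀ n, P n → 19 ≤ n

-- Weakened version actually posed: a lower bound only, provable by brute-force
-- checking that P fails for n ≤ 18, with no proof that 19 itself works.
example : Prop := ∀ n, P n → 19 ≤ n
```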

Sid (Nov 24 2022 at 07:45):

Kevin Buzzard said:

One thing which has been added is more hype -- when I questioned them about the result at the Paris conference they told me that "10 IMO problems" did not mean "10 problems which appeared on IMOs", in contrast to the usual usage of the phrase, and were even vague about whether the problems were on the shortlist or long list, as well as being unable to tell me which problems they were. I was told "the details are in the blog post and paper" which was not true.

Interesting. I assumed that by IMO they meant the problems designated as IMO problems in miniF2F-test (https://github.com/openai/miniF2F/blob/main/lean/src/test.lean). They also claim 60% on miniF2F however which is a big improvement over SOTA.

Sid (Nov 24 2022 at 07:46):

Kevin Buzzard said:

There is furthermore the huge issue of translation of the informal text into formal, for example the problem "find the smallest N for which (some statement involving 7th powers) is true" was replaced by "show that N>=19" and the computer-discovered proof just did the brute force check to show that 1<=N<=18 didn't work. One can certainly argue that the question solved is not the full strength of the IMO question here. In particular there was no proof that 19 worked.

You mean this is an issue with miniF2F?

Tom Chen (Nov 24 2022 at 07:52):

Sid said:

Kevin Buzzard said:

There is furthermore the huge issue of translation of the informal text into formal, for example the problem "find the smallest N for which (some statement involving 7th powers) is true" was replaced by "show that N>=19" and the computer-discovered proof just did the brute force check to show that 1<=N<=18 didn't work. One can certainly argue that the question solved is not the full strength of the IMO question here. In particular there was no proof that 19 worked.

You mean this is an issue with miniF2F?

IMHO, miniF2F mentioned in their paper that an equivalent formalization is not always possible. For example, when reading Formal Mathematics Statement Curriculum Learning today, I noticed they say that their version of imo_1961_p1 is indeed a weaker version as well.

Tom Chen (Nov 24 2022 at 07:52):

(image attachment: image.png)

Kevin Buzzard (Nov 24 2022 at 07:58):

All I'm saying is that most people who know what the acronym "IMO" means in this context will assume that "IMO problem" means "problem which appeared on the IMO" rather than any of the other interpretations which people might want to give it, eg "problem which is simpler than an IMO problem" or "problem which did not appear on the IMO".

Siddhartha Gadgil (Nov 24 2022 at 08:42):

Tom Chen said:

Sid said:

Kevin Buzzard said:

There is furthermore the huge issue of translation of the informal text into formal, for example the problem "find the smallest N for which (some statement involving 7th powers) is true" was replaced by "show that N>=19" and the computer-discovered proof just did the brute force check to show that 1<=N<=18 didn't work. One can certainly argue that the question solved is not the full strength of the IMO question here. In particular there was no proof that 19 worked.

You mean this is an issue with miniF2F?

IMHO, miniF2F mentioned in their paper that an equivalent formalization is not always possible. For example, when reading Formal Mathematics Statement Curriculum Learning today, I noticed they say that their version of imo_1961_p1 is indeed a weaker version as well.

Not to stir up controversy, but while an "equivalent formalization is not always possible", excluding problems without an equivalent formalization is easy.

Tom Chen (Nov 24 2022 at 08:43):

@Siddhartha Gadgil Totally agree. I also feel that claiming someone solved X IMO questions, when what was solved is in fact a much weaker version, is not good. I personally thought, wrongly, that it meant the real IMO when I first read those papers, and it is really misleading.

Joseph Myers (Nov 24 2022 at 12:44):

Sid said:

Interesting. I assumed that by IMO they meant the problems designated as IMO problems in miniF2F-test (https://github.com/openai/miniF2F/blob/main/lean/src/test.lean). They also claim 60% on miniF2F however which is a big improvement over SOTA.

Looking at that miniF2F file it appears to have the same issue with problems being misidentified as IMO problems as mathzoo does (see my remarks in a comment on #15621). imo_2007_p6 there is not IMO 2007 problem 6, it's shortlist problem A6.

I think that when IMO Grand Challenge rules are established, they should include:
(a) detailed conventions for writing formal statements of problems;
(b) independent review of those formal statements, including but not limited to agreement among all entrants using the same theorem proving system;
(c) if it turns out afterwards that the formal statement is not a correct translation of the informal one, or that it gives more information than it should, any entrant using that formal statement automatically scores 0 on that problem;
(d) if a problem cannot be stated reasonably idiomatically using definitions already present in the standard libraries of the theorem proving system being used before the start of that IMO, no scores can be awarded on that problem;
(e) entrants must make a public declaration before the start of the IMO that they intend to enter that year's IMO Grand Challenge and that they will promptly publish their results (even if all scores are 0).

Detailed conventions and review could be applied to statements from past IMOs being used outside the IMO Grand Challenge (but for pre-2006 IMOs it would be a good idea to check multiple sources to confirm the proper form of the informal problem statement as far as possible; there may be some mistakes in the older statements on imo-official, and they may not all correspond to the English wording used at the time - pre-1967 IMOs probably didn't have any English wording used at the time at all).

Albert Jiang (Nov 24 2022 at 14:13):

There are quite a few naming issues with miniF2F; I spotted ~5 of them. There have been ~40 fixes to formal statements since v1. The cleanest, newest version is here: https://github.com/facebookresearch/miniF2F.

The specific issue with IMO 2007 P6 was corrected about 1 month ago in here as well:
https://github.com/facebookresearch/miniF2F/blob/5271ddec788677c815cf818a06f368ef6498a106/lean/src/test.lean#L231
https://github.com/facebookresearch/miniF2F/blob/main/isabelle/test/imosl_2007_algebra_p6.thy.

The problem is that previous papers all used the initial version of miniF2F (v1) for benchmarking, so it's impossible to compare with previous results if we change the benchmark. Going forward, we should really use this version instead.

Kevin Buzzard (Nov 24 2022 at 14:19):

I hadn't understood that miniF2F was being regarded as "the canonical dataset" nowadays. This explains a lot -- thanks to all!

Albert Jiang (Nov 24 2022 at 14:32):

Kevin Buzzard said:

I hadn't understood that miniF2F was being regarded as "the canonical dataset" nowadays. This explains a lot -- thanks to all!

It's kind of the default dataset for ML folks, I think because it is lightweight, more or less standalone, and covers multiple proof assistants. We ML people like to work on algorithms that hopefully generalise to more than one proof assistant.

The hope is to grow miniF2F beyond what it currently has and make sure that the statements are correct. To ensure that, we really need some help from the community - spotting and fixing the errors in miniF2F v1 was a wide effort by Yutaka, Daniel, Kunhao, Wenda, Guillaume, Timothee, Marie-Anne, and myself. The more inspection and discussion about rigour and difficulty the better! I really hope one day we can make very confident claims about what we can solve. So the above discussion is really appreciated - thanks to everyone in advance!

The benchmark is supported and maintained here: https://github.com/facebookresearch/miniF2F because it's now difficult to merge things in the original OpenAI-maintained directory.

Eric Wieser (Nov 24 2022 at 14:32):

for example the problem "find the smallest N for which (some statement involving 7th powers) is true" was replaced by "show that N>=19"

I would wager that this problem is pervasive even in the "fixed" https://github.com/facebookresearch/miniF2F. I looked for the first problem with an inequality in its conclusion in the hope of finding the same mistake, but found a much worse one:

https://github.com/facebookresearch/miniF2F/blob/5271ddec788677c815cf818a06f368ef6498a106/lean/src/valid.lean#L169-L174

Which apparently formalizes this question...

Eric Wieser (Nov 24 2022 at 14:34):

because it's now difficult to merge things in the original OpenAI-maintained directory.

Why is this the case?

Albert Jiang (Nov 24 2022 at 14:35):

Eric Wieser said:

because it's now difficult to merge things in the original OpenAI-maintained directory.

Why is this the case?

I think because the original authors of the miniF2F project have left OpenAI. Please correct me if I'm wrong on this @Stanislas Polu @Kunhao Zheng @Jesse Michael Han

Joseph Myers (Nov 24 2022 at 14:36):

Eric Wieser said:

for example the problem "find the smallest N for which (some statement involving 7th powers) is true" was replaced by "show that N>=19"

I would wager that this problem is pervasive even in the "fixed" https://github.com/facebookresearch/miniF2F. I looked for the first problem with an inequality in its conclusion in the hope of finding the same mistake, but found a much worse one:

https://github.com/facebookresearch/miniF2F/blob/5271ddec788677c815cf818a06f368ef6498a106/lean/src/valid.lean#L169-L174

Which apparently formalizes this question...

That's another example I mentioned in #15621 as a mathzoo misidentification (problem 3 duplicate wrongly listed as problem 6). And I wasn't checking at all exhaustively for such issues, simply illustrating the very big gap between review standards in mathzoo and mathlib.

Tom Chen (Nov 24 2022 at 14:40):

@Albert Jiang By the way, looks like "ML folks" have another community / discussion board / platform to see progress other than this channel - I am quite interested in neural theorem provers, so could you please share the link? Thank you!

Albert Jiang (Nov 24 2022 at 14:43):

Eric Wieser said:

for example the problem "find the smallest N for which (some statement involving 7th powers) is true" was replaced by "show that N>=19"

I would wager that this problem is pervasive even in the "fixed" https://github.com/facebookresearch/miniF2F. I looked for the first problem with an inequality in its conclusion in the hope of finding the same mistake, but found a much worse one:

https://github.com/facebookresearch/miniF2F/blob/5271ddec788677c815cf818a06f368ef6498a106/lean/src/valid.lean#L169-L174

Which apparently formalizes this question...

This looks like a naming issue: I think it actually corresponds to this question instead (imo_2006_p3): https://artofproblemsolving.com/wiki/index.php/2006_IMO_Problems/Problem_3. Although the formalisation is not quite perfect either...

Thanks so much for spotting this! I'll open an issue for this problem.

Albert Jiang (Nov 24 2022 at 14:44):

Tom Chen said:

Albert Jiang By the way, looks like "ML folks" have another community / discussion board / platform to see progress other than this channel - I am quite interested in neural theorem provers, so could you please share the link? Thank you!

If such a channel exists, I'm not in it! Sorry!

Jason Rute (Nov 24 2022 at 14:44):

@Tom Chen You mean all of machine learning, or machine learning for math and theorem proving? I know a lot of general ML discussion happens on Twitter, including on theorem proving, but I'm not aware of another specific place to discuss the latter topic besides here and twitter.

Joseph Myers (Nov 24 2022 at 14:45):

All these issues I think help show the merits of the high standards of mathlib review - and with having a single community-associated place (the mathlib repository) where the review takes place and everything goes into a highly integrated whole, rather than repositories associated with one particular AI group that we're now learning get abandoned as individuals move on. I realise that for training an AI it may not matter much if problems are stated or identified correctly, but for claiming success in solving problems from particular competitions, accuracy is important.

I think the mathlib archive is an appropriate place to have formal solutions to all 380 past IMO problems (and other competitions, shortlist problems, etc. if people wish to add them). It's not a suitable place for formal statements without solutions (I think the roadmap directory, for formal roadmaps for things that might go in mathlib proper, is also a failed experiment) - but maybe there should be some leanprover-community repository for formal statements of problems that don't yet have proofs in the mathlib archive (and probably with problems removed from that repository when they get proofs in the mathlib archive), with reasonable standards of review applied?

Joseph Myers (Nov 24 2022 at 14:53):

(There's also a case for having statements without proofs in a separate repository even after proofs are in the mathlib archive, because those statements can follow whatever conventions are agreed for the IMO Grand Challenge when the mathlib archive might follow different conventions in some cases. E.g. when the problem asks you to determine some answer (for which there have been various past discussions of how to represent this in a formal IMO Grand Challenge statement), in the mathlib archive it's actually reasonable to just prove the result with the answer inserted, but that version of the statement makes things easier than desired if given to an AI solver.)
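As a rough Lean 3 sketch (placeholder predicate and a made-up answer, not an agreed convention), the contrast is between stating the known answer and demanding it as data:

```lean
-- Placeholder predicate; 19 is a made-up answer for illustration only.
variables (P : ℕ → Prop)

-- Archive-style statement: the known answer is written into the statement,
-- which is fine for a human-readable archive of solutions.
example : Prop := P 19 ∧ ∀ n, P n → 19 ≤ n

-- Solver-facing "determine" form: a Type whose inhabitants package the answer N
-- together with a proof that it is the least value satisfying P, so the
-- statement itself does not give the answer away.
example : Type := {N : ℕ // P N ∧ ∀ n, P n → N ≤ n}
```

This still leaves open how to rule out degenerate witnesses (e.g. an answer defined by merely re-encoding the problem), which is exactly the kind of question those past discussions were about.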

Albert Jiang (Nov 24 2022 at 14:55):

Joseph Myers said:

All these issues I think help show the merits of the high standards of mathlib review - and with having a single community-associated place (the mathlib repository) where the review takes place and everything goes into a highly integrated whole, rather than repositories associated with one particular AI group that we're now learning get abandoned as individuals move on. I realise that for training an AI it may not matter much if problems are stated or identified correctly, but for claiming success in solving problems from particular competitions, accuracy is important.

I think the mathlib archive is an appropriate place to have formal solutions to all 380 past IMO problems (and other competitions, shortlist problems, etc. if people wish to add them). It's not a suitable place for formal statements without solutions (I think the roadmap directory, for formal roadmaps for things that might go in mathlib proper, is also a failed experiment) - but maybe there should be some leanprover-community repository for formal statements of problems that don't yet have proofs in the mathlib archive (and probably with problems removed from that repository when they get proofs in the mathlib archive), with reasonable standards of review applied?

I wholeheartedly agree. It would be awesome to have some review system to host such things systematically.

On a related issue: can we add a tag in the comments of the past IMO solutions so that, for machine learning, we can easily filter them out of the training data? There has been some concern over whether the language models have seen and memorised the informal answers. Personally I really do not think that's the case, having seen the models be robust to large perturbations in the questions. But if that filtering can be done, the AI successes will be more convincing.

Eric Wieser (Nov 24 2022 at 16:59):

What would be the purpose of a tag in the comments where we already have a mathlib/archive/imo "tag" in the filename?

Albert Jiang (Nov 24 2022 at 17:34):

Eric Wieser said:

What would be the purpose of a tag in the comments where we already have a mathlib/archive/imo "tag" in the filename?

I was worried that some data crawling schemes (not operated by us) may not preserve the filename. So a tag is something nice to have so we can be very sure to filter them out. Otherwise we can do some manual filtering based on file contents, e.g., if content contains "imo" then drop.

Kunhao Zheng (Nov 25 2022 at 04:07):

Hi all! Thanks for the issues raised and discussions! It's good to see how much room there is for improvement in the quality of miniF2F and in the direction of work on it.

I totally second bringing more community effort/review pipelines into the development of miniF2F, e.g. enhancing the review process and hosting it in a community-shared place. From the very beginning, Stan, Jesse & I wanted miniF2F to evolve in a community-driven way to benefit AI people working on different ITPs, and over the year miniF2F has received many fixes. Since Stan & Jesse are moving on with other stuff, help from the community to make that happen is super appreciated. Let me know how I can help!

Jason Rute (Nov 27 2022 at 15:59):

@Kunhao Zheng @Albert Jiang In another place (the Coq Zulip), @Karl Palmskog mentioned that he thought it would be good to move the MiniF2F repo to a more collaborative organization like https://github.com/leanprover-community. MiniF2F isn't a Lean-specific project. Indeed, Isabelle sees as much love in MiniF2F as Lean, and the major need right now is to add Coq. However, leanprover-community does get a lot of visibility and support. I do understand that from a corporate perspective there is a push to put things under a corporate-branded GitHub, both for PR and for increased control, so if it moves elsewhere others may need to take more of a maintainer role.

Jason Rute (Nov 27 2022 at 15:59):

Also, it would be important for maintainers to have a clear sense of the needs of a machine learning benchmark. For example, now that this is an established benchmark, it is an issue for the tests to be changed even if they are buggy. Clear versioning would be needed for any major changes. The other concern is having tools to prevent any benchmark problems from ending up in the training data. Some datasets use "canary GUIDs" for this purpose, for example. Albert's tags are another way.

Joseph Myers (Nov 27 2022 at 17:57):

My suggestion of a leanprover-community repository wasn't so much for MiniF2F (or a machine learning benchmark at all), as for a hypothetical project to formalize many olympiad problem statements (without solutions - I think solutions belong in the mathlib archive), following consistent conventions for the translation between formal and informal statements and with careful review. This would probably need to go hand in hand with two other things: writing down the detailed conventions being used for that translation and proposed for the IMO Grand Challenge (every time some question arises about the right formal expression of some informal concept, or whether to go for a more literal translation or one that's more idiomatic in Lean, the question would need to be discussed and the conclusion written down and applied to existing and new formalizations), and tracking the cases where a clean Lean statement corresponding closely to the informal statement depends on some concepts, or linkages between concepts, that currently aren't in mathlib and so will need to be added to mathlib (with associated API, as usual for adding any definition, rather than just adding a definition in isolation without any API, so adding some problems to such a repository might wait quite a long time until mathlib has the required API).

Such a project would expect tests to change if found to be buggy, or if the conventions change, or if the original formalization was based on a paraphrase of the original version of an older problem and the original English (or other original language) version of the problem is subsequently confirmed, so allowing a more faithful conversion to Lean. So it might be better as a source from which problems can be taken for a machine learning benchmark than for direct use as such a benchmark.

Tom Chen (Nov 27 2022 at 23:29):

Just my two cents: as an alternative to being under facebook or leanprover-community, from my experience of open source it seems common to create a same-named GitHub organization and let that organization own miniF2F (e.g. an organization named minif2f). Many popular open-source libraries that were not created by a single person do this, such as https://github.com/flutter/flutter (even though it is Google's official product).

Albert Jiang (Nov 28 2022 at 11:07):

Jason Rute said:

Kunhao Zheng Albert Jiang In another place (the Coq Zulip), Karl Palmskog mentioned that he thought it would be good to move the MiniF2F repo to a more collaborative organization like https://github.com/leanprover-community. MiniF2F isn't a Lean-specific project. Indeed, Isabelle sees as much love in MiniF2F as Lean, and the major need right now is to add Coq. However, leanprover-community does get a lot of visibility and support. I do understand that from a corporate perspective there is a push to put things under a corporate-branded GitHub, both for PR and for increased control, so if it moves elsewhere others may need to take more of a maintainer role.

Great suggestion! I'm not currently affiliated with Facebook and I like the idea of having a "high council of ITP gurus in Lean, Isabelle, Coq, and HOL Light".

I am still torn though. On the one hand, there is currently a big team at Facebook working on theorem proving that can maintain the benchmark quite well, and since it's entirely open-sourced, it should be fairly easy to move elsewhere if that ever becomes necessary.

On the other hand, the community-driven advantage is pretty clear: it won't stop simply because people move to different companies.

To me the ideal situation is to form a GitHub organisation, as @Tom Chen suggested, that involves Kunhao, someone from Facebook, and experts in Coq and HOL Light too. This needs some time and drive from someone who volunteers to do it. I think the ML for formal maths community will really benefit a lot from it. Currently I don't have the capacity to take on such a project, but if nobody else does in 3-4 months' time, I can do it.

Tom Chen (Nov 28 2022 at 11:26):

@Albert Jiang If needed, I can spend some time volunteering to do it (though I'm quite busy now, so maybe not a lot of time). However, I am quite new to this field, so I'm not sure how much I can help :/

Karl Palmskog (Nov 28 2022 at 12:54):

many thanks to Jason for summarizing the discussion in the Coq Zulip here. A move to leanprover-community for this repo would also lower the bar for adding, for example, Coq equivalents to the benchmarks, since we would know there is somebody in the organization who could review and merge PRs for the long term.

Karl Palmskog (Nov 28 2022 at 12:55):

if you form a new GitHub organization for this repo, then I would worry about who the admin of the organization is, whether there is sufficient interest, etc. For leanprover-community, I think we know there will be people around for the long term.

Joseph Myers (Nov 28 2022 at 13:48):

For the specific case of conventions for formalizing problems (but maybe not the problems themselves), there's also the IMO-grand-challenge organization. I don't think the IMO Grand Challenge should be considered Lean-specific, even if it's currently described as such. Much of the formalization conventions for the IMO Grand Challenge would be independent of the particular system being used - it would be describing generally applicable principles for filling in details left out by informal problem statements, such as exactly what type to use for variables, and explicitly representing things that are left implicit, e.g. an informal statement that performs a division implicitly asserting that it's not dividing by zero. (See the paper that noted their ML system claimed a solution to an IMO problem which on further inspection was wrong, because the formal version of the problem incorrectly used truncating nat subtraction - avoiding truncating subtraction and division would be on the list of general formalization issues included in such conventions.) And then there would be more specific translations for specific constructs to use in each system. Ultimately, for any ambiguity concerning a particular problem (whether a past problem or a future one when actually running the IMO Grand Challenge) the aim would be for all systems to agree on the detailed interpretation being used for formalization (e.g. whether a variable is a natural number or an integer required to be nonnegative; whether two points in a geometry problem are specified as different or whether the solver must prove that if required).
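As a minimal Lean 3 illustration of the truncation pitfall (standard library behaviour, not tied to any particular benchmark statement):

```lean
-- Over ℕ, subtraction truncates at zero, so a literal-looking translation of an
-- informal statement about integers can silently say something different.
example : (2 : ℕ) - 5 = 0 := rfl

-- Over ℤ the expression means what the informal statement intends; the
-- alternative is to stay over ℕ and add an explicit hypothesis such as 5 ≤ a.
example : (2 : ℤ) - 5 = -3 := dec_trivial
```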

Timothee Lacroix (Nov 28 2022 at 13:59):

Hello,
Here's my two cents, coming from ML rather than formalisation. I think finding good ways to be faithful to the complexity of IMO problems is important, but we (as in, the Meta team working on neural theorem proving) are completely unqualified to do this. Furthermore, current problems in benchmarks like MiniF2F are already out of reach for our systems, so we'd rather spend time

Joseph Myers (Nov 28 2022 at 19:52):

As a concrete proposal, I suggest that versions of "determine" problems that tell the solver what it needs to prove, when a human contestant would have had to work out that answer themselves, should be named e.g. imo_2006_p3_easy_mode to avoid the name giving the misleading impression that what's asked is actually equivalent to an IMO problem. That way, such "easy mode" variants of problems would still be available for ML use, but avoiding the present misrepresentation of the nature of the problems.

Joseph Myers (Nov 28 2022 at 20:29):

I definitely expect that precise translations of competition problem statements into Lean or other systems following agreed conventions for such translations, and writing such conventions, should involve collaboration with people with expertise in competition problems and in formalization rather than just being done by ML people.

Joseph Myers (Nov 28 2022 at 20:34):

Fortunately, we have plenty of people here with expertise in both competition problems and in formalization. (I'm reminded of the comments in an early attempt to formalize the semantics of C - Norrish's thesis - about the difficulty in finding people who understand both formal methods and the details of C semantics - we're not in that situation here.)

Joseph Myers (Nov 28 2022 at 20:40):

I think it would be a good idea for the IMO Grand Challenge to start running officially before any entrants think they are likely to solve any problems - and benchmarks of too-hard-at-present problems provide something to aim for. Negative results are worthwhile results in science, and "all entrants scored 0 for five years before any entrant solved a real problem in the IMO Grand Challenge" gives more information than "no-one entered the IMO Grand Challenge for 20 years until someone asked to enter once they were confident their AI could solve a majority of IMO problems".

Junyan Xu (Nov 28 2022 at 21:15):

AI nowadays might achieve nonzero scores on some problems in natural-language IMO, right? (so AI might beat some countries if allowed to compete)

Adam Topaz (Nov 28 2022 at 21:39):

Is that really true? Who gets to judge whether the AI translated the natural language problem into a correct lean statement?

Kevin Buzzard (Nov 28 2022 at 21:42):

I think we're talking about turning natural language questions into natural language answers -- no Lean involved.

Adam Topaz (Nov 28 2022 at 21:42):

oh I see. sorry I should have read more carefully

Scott Morrison (Nov 28 2022 at 21:43):

Even for that I'm pretty skeptical. (Either that anyone has claimed this, or that any such claims would be believable.)

Scott Morrison (Nov 28 2022 at 21:44):

It's a relatively small pool of problems, and likely the "big" NL AIs have read the questions and answers during training.

Junyan Xu (Nov 28 2022 at 21:53):

I'm not aware of such claims, but Google's Minerva was evaluated on Poland's National Math Exam 2022 (of course not the same level as the IMO, but probably not in the training set), and achieved the national average:

"National Math Exam in Poland [...] is taken by approximately 270K high-school students every year.
We evaluated Minerva 62B on the National Math Exam in Poland and found that it achieves a score of 57%, which happened to be the national average in 2021. The 540B model achieves 65%."

(this is second-hand quoted from https://www.metaculus.com/questions/6728/ai-wins-imo-gold-medal/)

Jason Rute (Nov 28 2022 at 23:41):

@Joseph Myers while you can propose it, I'm not sure what state the IMO grand challenge is in and whether it will ever run before there is a system out there good enough to make non-negligible progress. Of course anyone can organize such a challenge, but I think the hard work is convincing folks to participate with their models.

Also, all the "rules" of the original IMOGC would have to be modified to reflect the current state of research. Open-source models are unlikely to be the first models to realistically do well. Also, I don't know if having problems in just Lean is a good idea. First, I think there is good reason to ask for the problem in both Lean and natural language. I also think there is reason to want to include other ITPs, like Isabelle and Coq (or whatever ITP the participating models use).

Jason Rute (Nov 28 2022 at 23:42):

I personally would like to see a competition with problems spanning a range of difficulties so we can get meaningful results sooner rather than later.

Jason Rute (Nov 28 2022 at 23:45):

But the same problems of getting participation remain.

Jason Rute (Nov 29 2022 at 00:43):

And I could be totally wrong about where the IMO GC currently stands. I just found an article on it from this month: #IMO-grand-challenge > Article about IMO grand challenge

Joseph Myers (Nov 29 2022 at 00:46):

I said above that I don't think it should be Lean-specific, and the rules should be up for discussion in general. The fact that AIs have solved some older IMO or IMO-easy-mode problems - even if those problems are easier than would now be likely to appear as problems 1 or 4 - is to my mind evidence that there's enough chance of an existing AI solving a new IMO problem - maybe 1% or 2% for each future problem 1 or 4 that's algebra or number theory? - that it would be worth having a competition if anyone would enter, even if there's a 99% chance they all score 0.

Joseph Myers (Nov 29 2022 at 00:50):

Human IMO contestants get the problems in up to three languages (at least one of which must be an official language - English, French, German, Russian, Spanish), so it seems fair for IMO Grand Challenge entrants to be able to get an informal version as well (and in principle a third version, whether formal or informal, though I don't know that that would be of any use to the entrants).

Joseph Myers (Nov 29 2022 at 00:53):

By involving new problems (and given a requirement that the AI code is frozen and receives no more training between the start of the IMO for human contestants and when the AI sits the IMO Grand Challenge), the IMO Grand Challenge addresses the concern about whether a machine learning system might have seen a solution somewhere to an existing problem it's finding a formal solution to. There are many national and regional olympiads out there that also use new problems and with a range of difficulty levels, so any of them could be used for something similar to the IMO Grand Challenge - provided the entrants commit in public that they will be doing such a challenge based on a particular olympiad, before the human contestants sit that olympiad.

Joseph Myers (Nov 29 2022 at 01:14):

Would any of the people working on AIs for formal-to-formal mathematical competition problem solving like to comment here now: would you enter the IMO Grand Challenge, or something similar based on easier-than-IMO competitions, if it were running now (assuming you could enter with whatever ITP you like, and that the AI doesn't need to be open source, but you do need to announce in advance of the IMO that you're entering, agree the formal statements of problems with the community following whatever conventions may be established, and commit to announce your results, including a full copy of any solutions achieved, promptly afterwards, even if those results are a score of 0 on the whole competition)? If not, is there a particular threshold you think should be reached before entering it? And for the IMO Grand Challenge committee members, do you intend to start running it as a real competition as soon as at least one entrant is available, as long as there's time to agree rules with that entrant and the community before the first IMO for which the competition is held?

Albert Jiang (Nov 30 2022 at 09:30):

Joseph Myers said:

Would any of the people working on AIs for formal-to-formal mathematical competition problem solving like to comment here now: would you enter the IMO Grand Challenge, or something similar based on easier-than-IMO competitions, if it were running now (assuming you could enter with whatever ITP you like, and that the AI doesn't need to be open source, but you do need to announce in advance of the IMO that you're entering, agree the formal statements of problems with the community following whatever conventions may be established, and commit to announce your results, including a full copy of any solutions achieved, promptly afterwards, even if those results are a score of 0 on the whole competition)? If not, is there a particular threshold you think should be reached before entering it? And for the IMO Grand Challenge committee members, do you intend to start running it as a real competition as soon as at least one entrant is available, as long as there's time to agree rules with that entrant and the community before the first IMO for which the competition is held?

I like this setting a lot. But we (or at least I) are not ready to let machines compete in the IMO yet. Maybe in two years' time we will be ready.

Albert Jiang (Nov 30 2022 at 09:34):

To be clear: I do think negative results (i.e., zero scores) are very important for science, and they should be made public afterwards.

Junyan Xu (Dec 01 2022 at 08:24):

some IMO stuff on Twitter from NeurIPS:
https://twitter.com/MichaelTrazzi/status/1597972575975768066
https://twitter.com/deliprao/status/1597770289676685312

Jason Rute (Jan 10 2023 at 19:47):

I should also point out that a major premise of the IMO grand challenge is that for a computer to do well on the IMO (and for a human to trust the computer's solution) it has to give a formal proof. Models like Minerva (and even ChatGPT) are starting to cast doubt on that assumption. (Although models like Draft, Sketch, and Prove (#Machine Learning for Theorem Proving > More papers on autoformalization) also show the two approaches are quite compatible, and systems which generate informal proofs and then convert them to formal proofs are a nice symbiosis.)

Johan Commelin (Jan 10 2023 at 19:50):

Why do you say that even ChatGPT starts to cast doubt on that assumption?
If anything, it has strengthened my belief that I'll only believe a computer-generated proof if it is formally verified. ChatGPT is way too good at bullshitting.

Jason Rute (Jan 10 2023 at 19:59):

Before Minerva, I was 99% sure the first computer solutions to the IMO grand challenge would be formal. After, I'm 80% sure. As for BS, the question is if that is a fundamental property of large language model applications, or if it can be fixed (without having to formalize everything in a formal system).

Jason Rute (Jan 10 2023 at 20:01):

(To be clear, I don't think ChatGPT itself, especially the current version, would do well.)


Last updated: Dec 20 2023 at 11:08 UTC