Zulip Chat Archive
Stream: IMO-grand-challenge
Topic: Suggested formalization conventions
Joseph Myers (Jan 06 2025 at 20:19):
I've now written up my preferred conventions for how to state IMO problems in Lean, if such statements are to be used as a challenge to AI or human solvers or formalizers. It also includes a more detailed geometry TODO list (list of geometry definitions that I think should be added to mathlib for cleanly stating all past IMO problems with geometrical elements) than my previous version that only looked at about ten years of problems.
https://github.com/jsm28/IMOLean
I don't know if anyone has plans for using future IMOs as a formal-to-formal challenge, but if they do then I think it would be preferable for all entrants in a given year to use a single Lean version of the problems (modulo any complications around different entrants wanting to use different versions of mathlib), written according to conventions agreed in advance, and this is my initial suggestion for what such conventions should look like.
I hope to add example statements following these conventions to this repository in future (somewhere between "past few years" and "everything 2006 onwards for which definitions are in mathlib"); they aren't ready yet because when I went to start writing them last night I ran into the problems with setting up / updating repositories using current mathlib.
Kevin Buzzard (Jan 06 2025 at 20:39):
Thanks a lot for putting your thoughts down! Yes, right now there is a cache issue; it will hopefully be fixed when 4.16-rc2 is released.
Eric Wieser (Jan 06 2025 at 21:57):
A potential alternative to
namespace IMO1637P6
def answer : Set ℕ := sorry
theorem result :
    {n : ℕ | ∃ a b c : ℕ, 0 < a ∧ 0 < b ∧ 0 < c ∧ a ^ n + b ^ n = c ^ n} = answer := by
  sorry
end IMO1637P6
would be
namespace IMO1637P6
abbrev Question (answer : Set ℕ) : Prop :=
  {n : ℕ | ∃ a b c : ℕ, 0 < a ∧ 0 < b ∧ 0 < c ∧ a ^ n + b ^ n = c ^ n} = answer
theorem result : Question sorry := sorry
end IMO1637P6
which means that you can perhaps drop result from the source file entirely
Eric Wieser (Jan 06 2025 at 21:58):
(which has the benefit of making the source file sorry-free)
Joseph Myers (Jan 07 2025 at 00:32):
That does have the disadvantage of making things further from idiomatic Lean in the case of universally quantified statements (where, as I've written it, the hypotheses go before the colon as normal, but with this structure you'd have only answer before the colon in the definition of Question, I suppose, and everything else after it).
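For illustration, a minimal made-up sketch of that contrast (the problem here is a placeholder, not one from the repository):
import Mathlib
namespace HypothesesBeforeColonSketch
-- Idiomatic statement: the quantified variable and its hypothesis sit
-- before the colon, as they would in ordinary mathlib code.
def answer : ℕ := sorry
theorem result (n : ℕ) (hn : 0 < n) : answer ≤ n + 1 := by
  sorry
-- Question-style statement: only answer may appear before the colon, so
-- the quantifier and the hypothesis have to move after it.
abbrev Question (answer : ℕ) : Prop :=
  ∀ n : ℕ, 0 < n → answer ≤ n + 1
end HypothesesBeforeColonSketch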
Eric Wieser (Jan 07 2025 at 00:39):
One advantage of this formalization is that it lets you state that two different formalizations of the question are equivalent to each other, without having to choose an answer to do so
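For example, using the toy problem from above together with a second, hypothetical phrasing Question' (not in the repository):
import Mathlib
namespace IMO1637P6
abbrev Question (answer : Set ℕ) : Prop :=
  {n : ℕ | ∃ a b c : ℕ, 0 < a ∧ 0 < b ∧ 0 < c ∧ a ^ n + b ^ n = c ^ n} = answer
-- A second, hypothetical phrasing of the same toy problem.
abbrev Question' (answer : Set ℕ) : Prop :=
  ∀ n : ℕ, n ∈ answer ↔ ∃ a b c : ℕ, 0 < a ∧ 0 < b ∧ 0 < c ∧ a ^ n + b ^ n = c ^ n
-- The two phrasings can be compared without committing to an answer.
theorem question_iff (answer : Set ℕ) : Question answer ↔ Question' answer := by
  sorry -- routine, via Set.ext_iff and Iff.comm
end IMO1637P6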
Eric Wieser (Jan 07 2025 at 00:42):
On another point, I think your plan to use #min_imports is not a good one; this greatly increases the difficulty of bumping mathlib, as any file split is likely to leave you with broken imports. Of course, you could have a script that replaces all the imports with import Mathlib, updates Lean, and re-minimizes, but I'm not sure what benefit the minimization is providing at this point.
Joseph Myers (Jan 07 2025 at 00:43):
At least if the two formalizations choose the same type for answer (which in general they might not); otherwise you also need to provide a map between the types in the statement of equivalence.
Eric Wieser (Jan 07 2025 at 00:45):
(I didn't explicitly say it earlier, but despite my small quibbles this document is a great resource!)
Joseph Myers (Jan 07 2025 at 00:48):
Using #min_imports is effectively just following idiomatic Lean/mathlib conventions; using import Mathlib would work equally well (and avoid AIs needing to add imports - though AIs might choose to remove imports internally if using Lean tactics to suggest intermediate steps that work better with fewer imports).
One thing I don't have a sense of is how large the risk is, with minimized imports, that an AI that finds it needs to add imports to solve the problem also accidentally makes the statement no longer defeq to the original one (and so any solution no longer valid under the rules) because different (propeq, hopefully, but not defeq) instances are found. We'd like the instances found with different imports to be defeq, but we surely do have cases where that's not so.
Eric Wieser (Jan 07 2025 at 00:52):
Unfortunately, adding imports often results in compilation errors; for instance:
import Mathlib.Algebra.Module.Basic
import Mathlib.Data.Nat.Factorial.Basic
import Mathlib -- syntax error with this line
open scoped Nat
example (φ : ℕ →+* ℕ) (n : ℕ) : φ (n !) = (φ n)! := by
  simp
(edit: I previously said "in the past" referring to a similar clash with π notation)
Eric Wieser (Jan 07 2025 at 00:56):
Most of these are cases of scoped notations sharing a namespace when they should not, but I think it would be better to discover these in CI for your project (and then fix them in mathlib) than to have the AI agents discover them.
Joseph Myers (Jan 07 2025 at 00:57):
Since I mentioned the issue of different AI entrants to a formal-to-formal challenge potentially wanting to use different mathlib versions, I should maybe elaborate on why I think there could be such an issue. Two things are in tension here:
- If an AI is trained with particular versions of mathlib and Lean, it will have learned lemma names, tactic usage etc. for those versions and may not work so well with other versions. This issue would tend to indicate AI entrants wanting an old version fixed well in advance of any challenge.
- But if the challenge wishes to encourage early sharing (like AIMO has its Early Sharing Prize), then you want any additions to Lean and mathlib that help AIs solve IMO problems (new lemmas, new or improved tactics, modifications to Lean to facilitate interaction with AIs, etc.) to be upstreamed by all entrants as they go along - you don't want entrants to hold onto those things for competitive advantage (and potentially end up producing solutions that start with a long sequence of prewritten lemmas and tactics). And being able to get the advantage of such early sharing requires using a recent mathlib version that has all the new features upstreamed by other entrants.
There is also an argument here that if such a challenge wishes to promote more broadly useful AI then it's desirable for such AI to be able to adapt to new Lean and mathlib versions without requiring a huge amount of retraining, in which case it could be considered a feature for the rules to say "the precise, recent version of mathlib master to be used will be announced shortly before the start of the competition".
Joseph Myers (Jan 07 2025 at 00:59):
(Which could be the version of mathlib tagged to work with the most recent Lean release, for example.)
Siddhartha Gadgil (Jan 07 2025 at 04:11):
I second the idea that the latest stable versions of Lean and Mathlib be used for the competition (modulo the slight technical issue if the competition is at the beginning of the month).
This will favour systems that are more useful to Lean users and mathematicians over those that purely use Lean to show technical AI skills.
Marcelo Fornet (Jan 08 2025 at 20:30):
When a problem involves specific but arbitrary numbers (for example, the year of the competition), but only depends on more limited properties of such numbers (say, depends on the year being at least 3, or on it being an even number), it's likely appropriate in a formal solution to generalize to arbitrary N satisfying the required conditions
I consider it better to express the formalisation in terms of the original numbers rather than using the generalised version. To some extent, using a more general statement could be a big hint towards the desired solution.
Kim Morrison (Jan 08 2025 at 22:26):
The quoted statement is talking about "a formal solution", not the problem statement, so generalization is appropriate I think.
Joseph Myers (Jan 09 2025 at 02:24):
Exactly. I've found that formalizing a solution often influences the formal problem statement. So problems in the mathlib archive and in compfiles may well have statements that were adjusted to make the solution more convenient to formalize, but that isn't wanted when giving a problem on its own as a challenge to a solver (or as a challenge to an autoformalizer along with a natural language solution, for that matter), hence this section discussing the differences between conventions for problems stated on their own and what might be appropriate for a problem accompanied by a solution.
Joseph Myers (Jan 26 2025 at 22:57):
I've now changed the conventions to use import Mathlib, added statements of non-geometry problems from IMO 2022, 2023 and 2024 following these conventions (the IMO 2022 statements include some that don't currently have statements in Compfiles, though with the usual caveats about the risk of mistakes in a formal statement written without a solution), and expanded the conventions in various places based on writing those statements.
Joseph Myers (Jan 26 2025 at 23:00):
I've also resurrected the old PR #7478 (defining congruence for indexed families of points): while congruence is not among the missing geometry definitions for stating past IMO problems (no problems explicitly refer to congruence), I think it will be useful in setting up API for some of the missing definitions.
Joseph Myers (Feb 10 2025 at 00:31):
I've now added statements of IMO 2020 and 2021 problems following these conventions (and expanded the conventions further for issues that showed up).
Joseph Myers (Feb 10 2025 at 00:34):
Having compared my choices with those in Compfiles, I think the Compfiles version of IMO 2020 P5 omits two conditions from the problem: I'd understand a pair of cards to be two different cards (possibly with the same number on them), and the geometric mean explicitly needs to be of a nonempty collection of cards. The two versions are easily equivalent, but I don't think the omission of those two conditions was intended in the informal statement.
Joseph Myers (Feb 10 2025 at 00:37):
PRs filed so far to work towards having more of the definitions to state more past IMO geometry problems: #21111, #21588, #21604, #21617, #21618, #21622
Joseph Myers (Mar 30 2025 at 19:05):
We now have evidence that formal statements following these conventions are of use to humans at least, in the form of Jovan's draft #23431 formalizing IMO 2020 P6 using my formal statement.
Joseph Myers (Mar 30 2025 at 19:07):
I don't know if such formalizations of past or future IMO problem statements are of use for AI developers, though they're intended to be, and I think these conventions do have various advantages over some choices that have been made in the past about how to state problems for AIs.
Joseph Myers (Mar 30 2025 at 19:10):
(i) Having a single, independently developed Lean version of a problem for use by all such AIs seems preferable for assessing the abilities of theorem-proving AIs compared to each developer of such an AI producing their own Lean problem statements (and especially for comparing the abilities of different AIs if more than one wishes to attempt problems from the same IMO).
Joseph Myers (Mar 30 2025 at 19:14):
(ii) Using separate defs and structures in some cases, as in my conventions, seems preferable to artificially forcing all definitions into hypotheses of a single theorem statement, as has sometimes been done for use with theorem provers. It's more idiomatic Lean in such cases, and more representative of non-competition mathematics, where it's normal to have some new definition and wish to prove results about it, without necessarily being given any Lean API for that definition in advance.
Joseph Myers (Mar 30 2025 at 19:17):
(Using separate defs and structures also makes it a lot simpler to state some more complicated problems such as IMO 2024 P5. An AI is of course always free to translate problems into a different form more convenient for that AI to work with, provided it can prove the equivalence of the two forms.)
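For a concrete illustration of the difference (a made-up placeholder problem, not one from the repository):
import Mathlib
namespace SeparateDefinitionsSketch
-- With a separate def and structure, the problem reads like ordinary
-- mathematics: a new notion is introduced, with no API provided for it,
-- and a result is stated about it.
def Sparse (s : Finset ℕ) : Prop :=
  ∀ x ∈ s, ∀ y ∈ s, x < y → x + 2 ≤ y
structure Configuration where
  s : Finset ℕ
  bounded : ∀ x ∈ s, x ≤ 100
  sparse : Sparse s
theorem result_structured (c : Configuration) : c.s.card ≤ 51 := by
  sorry
-- The same content artificially forced into hypotheses of one theorem.
theorem result_flat (s : Finset ℕ) (hb : ∀ x ∈ s, x ≤ 100)
    (hs : ∀ x ∈ s, ∀ y ∈ s, x < y → x + 2 ≤ y) : s.card ≤ 51 := by
  sorry
end SeparateDefinitionsSketch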
Joseph Myers (Mar 30 2025 at 19:26):
(iii) Given the prevalence of recent informal reasoning models that produce mathematics that looks superficially plausible (and often guesses the right answer) but turns out to have fatally flawed reasoning when you inspect the details of the output, I think fully formal proofs should definitely be the gold standard for assessing the theorem proving abilities of AIs. An AI that could produce informal IMO solutions of sufficient quality to count as a full solution when coming from a human (and of human-verifiable length) would be of some interest, but if AIs are to contribute proofs in mathematical research, it would be problematic for humans to need to check for fatal mistakes manually in a claimed ten-thousand-page solution to an unsolved problem; any AI hoping to make serious research contributions of proofs really needs to be able to verify its work formally as it goes along.
Joseph Myers (Mar 30 2025 at 19:33):
My current PRs to enable stating more past IMO geometry problems: #21960, #22178, #22483, #22745, #22747, #22751, #23255, #23323. I'm working towards setting up definitions of incenter and excenter. After that, the main missing geometry definitions relate to angle bisectors, arcs of circles and polygons, with a long tail of less frequently used definitions (including the previously discussed issue of Hausdorff measure normalization for talking about areas).
Joseph Myers (Mar 30 2025 at 19:35):
I do expect at some point to extend my collection of formal statements to cover more past IMOs, but I tend to give priority to setting up more definitions in mathlib to enable stating a greater proportion of problems, rather than stating incomplete sets of problems from more IMOs.
Joseph Myers (Mar 31 2025 at 23:58):
Thanks for the rapid reviews (all eight of those PRs are now in)! A few more notes:
Joseph Myers (Apr 01 2025 at 00:00):
- Assessing an AI based on independent Lean formalizations of problem statements following known conventions (facilitating comparisons between different AIs) is of course not mutually exclusive with assessing it based on formalizations of problem statements by the AI developers; you can compare AI performance on different formalizations using different conventions and see what advantage or disadvantage arises from the different choices made.
Joseph Myers (Apr 01 2025 at 00:04):
- Although there are still gaps in the problem statements covered in my collection (it now has five of the six problems from each year 2020 to 2024), it does have quite a few problems for which Compfiles lacks formal statements (see also my remark above about the Compfiles version of IMO 2020 P5).
Joseph Myers (Apr 01 2025 at 00:08):
- Since the conventions base formal statements on the English versions originally provided to contestants, any potential extension before 2006 would depend on scans of the English papers originally provided to contestants (as opposed to retyped and possibly reworded versions such as on imo-official) being contributed to https://www.imo-register.org.uk/papers/ (but I don't expect to extend the formal statements as far back as 2006 any time soon).
Joseph Myers (Apr 01 2025 at 00:17):
- While I'd hope to add formal statements following these conventions for future IMO problems as well as more past IMO problems, that's not something that will necessarily happen during or immediately after a given IMO without more specific planning to arrange it.
Joseph Myers (Apr 01 2025 at 00:29):
- It's very likely that formalizing solutions to geometry (or combinatorial etc. geometry) problems using newly added definitions will show up missing API that should be added to mathlib. This does not mean that generating such formal solutions is out of reach for AI until the API is added: API lemmas aren't generally hard to prove, or to work around the absence of, and in any case the ability to identify and fill out missing API would be a good feature for a formalization AI. I might try formalizing a solution to IMO 2024 P4 at some point to complete the set of formal solutions to IMO 2024 problems. I don't expect to formalize solutions to many, if any, other past IMO geometry problems (though in general, the likelihood of showing up new API to add to mathlib makes a solution more interesting to me to formalize - there are lots of past algebra and number theory problems that aren't likely to involve any new API, so I'm unlikely to attempt formalizing them myself in the absence of some personal connection such as having coordinated the problem, but they would make good formalization exercises for non-experts).
David Renshaw (Apr 01 2025 at 01:00):
Updated Compfiles Imo2020P5. Thanks!
Jovan Gerbscheid (Apr 03 2025 at 15:09):
Is there a reason why your problem statements contain names for hypotheses? These get picked up by the unused variables linter, so I would have to modify the statement to put it in mathlib.
Jovan Gerbscheid (Apr 03 2025 at 15:10):
In particular,
theorem result : ∃ c : ℝ, 0 < c ∧ ∀ {n : ℕ} (hn : 1 < n) {S : Finset P} (hSn : #S = n)
    (hSdist : (S : Set P).Pairwise fun x y ↦ 1 ≤ dist x y),
    ∃ l : AffineSubspace ℝ P, finrank ℝ l.direction = 1 ∧
      (∃ p₁ p₂, p₁ ∈ S ∧ p₂ ∈ S ∧ l.SOppSide p₁ p₂) ∧
      ∀ p ∈ S, c * (n : ℝ) ^ (-1 / 3 : ℝ) ≤ Metric.infDist p l := by
  sorry
has hn, hSn and hSdist unused.
Joseph Myers (Apr 03 2025 at 22:58):
- Do we have a way yet to put one line in lakefile.toml that means "always lint in mathlib mode" (i.e. produce the same linter warnings as would appear if the code were copied into mathlib)? I'd like to apply mathlib mode linting to all my Lean projects (thus, it's not intended for the statements to do things that such linting warns about, beyond any lints for use of sorry); without it, issues shown by linter warnings that only appear in mathlib are very likely to be missed.
- What do you suggest as an alternative for avoiding that linter? Naming such hypotheses _ (if something like that works to avoid the warning)? Using → not ∀ for such hypotheses (nested within the statement of the result after the :) that aren't referenced in what follows? (A minimal sketch of the two spellings follows below.)
- You can't always avoid unused hypotheses with my conventions, because sometimes the original problem might have an unused hypothesis. (Consider a problem involving a positive integer, where in fact the formal statement and Lean proof also work for zero. Then the formal statement will include 0 < n as part of accurately reflecting the informal statement, but that hypothesis will be unused when sorry is replaced by a proof.)
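For concreteness, a minimal made-up sketch of the two spellings (placeholder problem, not from the repository; the 0 < n hypothesis is deliberately unnecessary for the conclusion):
import Mathlib
namespace UnusedHypothesisSketch
-- Named binder in the statement: the unused variables linter can flag hn,
-- since the name is never used later in the statement or in the proof.
theorem result_named : ∀ n : ℕ, ∀ hn : 0 < n, n ≤ n ^ 2 := by
  sorry
-- The → spelling states the same thing without naming the hypothesis.
theorem result_arrow : ∀ n : ℕ, 0 < n → n ≤ n ^ 2 := by
  sorry
end UnusedHypothesisSketch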
Jovan Gerbscheid (Apr 03 2025 at 23:02):
- I think using → instead of ∀ is the usual way.
Jovan Gerbscheid (Apr 03 2025 at 23:04):
- That is a bit unfortunate indeed. But it can be solved using set_option linter.unusedVariables false.
Kim Morrison (Apr 03 2025 at 23:19):
- Not yet, but @Anne Baanen is working on it!
Joseph Myers (Apr 04 2025 at 00:10):
OK, I've changed two statements to use →. (There's at least one statement where the name isn't explicitly used but I think ∀ is required for it to be found in list indexing notation; see IMO 2023 P1. Hopefully the fact that the hypothesis is in fact used in the resulting terms, just not visibly in the input to Lean, suffices to avoid linter issues there.)
Jovan Gerbscheid (Apr 04 2025 at 00:14):
Another question I had about using the statement formalizations is about having other variables in scope. Should there be a rule that all the preceding auxiliary theorems (and variables) are inside a (named) section, to avoid accidentally changing the theorem statement?
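A sketch of the kind of layout being suggested (hypothetical file with placeholder content, not taken from the repository):
import Mathlib
namespace ExampleProblem
section Auxiliary
-- Auxiliary material lives inside a named section, so its variable
-- declarations go out of scope before the problem statement below.
variable {P : Type*} [MetricSpace P]
theorem aux_dist_comm (p q : P) : dist p q = dist q p := dist_comm p q
end Auxiliary
-- The statement itself sits outside the section and declares everything
-- it depends on explicitly (placeholder statement shown here).
theorem result : ∀ n : ℕ, 0 < n → n ≤ n ^ 2 := by
  sorry
end ExampleProblem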
Joseph Myers (Apr 04 2025 at 00:40):
My README says that, for any system involving automated evaluation of proposed solutions, the statement must remain defeq (as a function of the answer, when applicable) to the statement presented to the solver. I haven't attempted to define in detail how such automated evaluation might work. (But I guess it would involve processing oleans through a validator, and confirming only the permitted standard axioms are used, as well as doing the defeq checks, and human inspection of the proposed answer.)
This is not very relevant when solutions go through human review.
Patrick Massot (Apr 04 2025 at 07:23):
Kim Morrison said:
- Not yet, but Anne Baanen is working on it!
Oh that’s great. Are they also working on being able to disable all linters in downstream projects? I mean avoiding writing, in desperation, lakefiles that look like https://github.com/PatrickMassot/proofs_with_lean/blob/master/lakefile.toml#L8-L20 only to realize they have no effect, and then writing macros like https://github.com/PatrickMassot/proofs_with_lean/blob/master/ProofsWithLean/Lib.lean#L346-L349 and using them at the top of every single file.
Anne Baanen (Apr 04 2025 at 08:13):
I will make sure that disabling all linters can also be a one-liner, thanks for the complaint/bug report :)
Joseph Myers (Apr 07 2025 at 00:39):
Next major piece of definitions for being able to state more IMO problems in Lean now posted: #23752 has incenter, excenter and related definitions (with significant pieces left to followups, considering that's already a 600-line PR). Depends on #23712 and #23751; my other open PRs #23715, #23732 and #23735 are things I suspect might be relevant in such followups.
Jason Rute (Apr 17 2025 at 11:28):
Have you given any thought to making a formal version of https://matharena.ai/ (by @Mislav Balunović), which keeps track of how informal AI does on recent math competitions (including proof-based ones)? How much work would it be to formalize the competition problems into Lean for each math competition? (This seems more relevant now that we have Kimina-Prover.)
Mislav Balunović (Apr 19 2025 at 07:41):
This is definitely a direction we are considering expanding into at https://matharena.ai/ now that the competitions with numerical answers only (which are the easiest to evaluate) have been almost fully saturated by the recent models. Regarding proofs, our first take on this was evaluating natural language proofs on USAMO using human judges - the bottleneck here is obviously creating a grading scheme, manually reading LLM solutions, etc. We are definitely also thinking about introducing a category where we evaluate methods that produce formal solutions in Lean - if you have ideas in this direction please reach out!
Joseph Myers (Apr 20 2025 at 09:43):
A key step in doing such formal benchmarking is reviewing lots of past problems, identifying relevant definitions for those problems, and adding them (plus at least minimal API) to mathlib in advance, so that they are available for stating future problems. Otherwise the subset of problems in the benchmark is biased by what happens to be in mathlib at the time, since only problems that can already be stated can be included.
Joseph Myers (Apr 20 2025 at 09:45):
I'm not so worried about the difficulty of producing formal solutions being biased by what lemmas are in mathlib, since (a) an AI of serious practical use for formal theorem proving in research needs to be able to deal with definitions with little to no API, and build up basic results for them itself, and (b) AI developers (and their AIs, working together with the developers or autonomously) are free to contribute to mathlib just like anyone else, including API improvements that might help performance on future benchmark evaluations.
Joseph Myers (Apr 20 2025 at 09:56):
Once you have relevant definitions in mathlib, I don't think producing formal statements rapidly after a human competition is particularly hard. I think that the failure at DeepMind to get a formal statement of IMO 2024 P5 ready for AlphaProof on the day the problems were received (as mentioned in at least one talk) reflects the added difficulty of making such a statement under the artificial constraint of putting everything in a single theorem statement without separate defs using associated structures; once you can use separate types and definitions, such a problem is quite straightforward to translate to Lean (and if you do apply such an artificial constraint, the way to do it is probably first to write the statement without the constraint, and then iteratively replace custom types by products with separate hypotheses, then separate definitions by extra pairs of hypotheses).
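For instance, the last of those steps could look roughly like this (a made-up placeholder statement, not IMO 2024 P5):
import Mathlib
namespace FlatteningSketch
-- Step 0: the statement as it would be written without the constraint,
-- using a separate definition.
def Lovely (n : ℕ) : Prop := 3 ∣ n ∧ 10 ≤ n
theorem result : ∃ n, Lovely n ∧ n ≤ 20 := by
  sorry
-- Step 1: the definition replaced by a predicate variable together with a
-- hypothesis characterising it, so that everything fits in one theorem.
theorem result_flat (P : ℕ → Prop) (hP : ∀ n, P n ↔ 3 ∣ n ∧ 10 ≤ n) :
    ∃ n, P n ∧ n ≤ 20 := by
  sorry
end FlatteningSketch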
Avoiding mistakes in the formal definitions is a separate issue; having a human formalize a solution in order to check the formal statement is a lot of work for many olympiad combinatorics or geometry problems. Checklists of common pitfalls, and having multiple people check each other's formal statements word by word in both directions against the informal version, may help.
Last updated: May 02 2025 at 03:31 UTC