Zulip Chat Archive
Stream: general
Topic: CombiBench: A benchmark focusing on combinatorics problems
Junqi Liu (Apr 28 2025 at 16:28):
Glad to share new work from our Kimina team! We have released CombiBench, the first benchmark focusing on formalized mathematics in combinatorics, and for the first time we propose an evaluation method for fill-in-the-blank questions: Fine-Eval! We have open-sourced the complete datasets, both with and without solutions, as well as one-click evaluation code that can be used with the Kimina Lean Server!
GitHub: https://github.com/MoonshotAI/CombiBench
Hugging Face: https://huggingface.co/datasets/AI-MO/CombiBench
Website: https://moonshotai.github.io/CombiBench
Leaderboard: https://moonshotai.github.io/CombiBench/leaderboard.html
Jason Rute (Apr 28 2025 at 20:34):
Can you remind me how Moonshot AI and Numina are related? Is Moonshot = Kimina = Numina + Kimi?
Junyan Xu (Apr 28 2025 at 21:06):
Kimi is a product of Moonshot (the company), and the Kimi Team (a coauthor of Kimina) is presumably the team working on that product, as I understand it.
Jason Rute (Apr 28 2025 at 22:04):
"Without solution" means the model has to give the solution and a formal proof? What we sometimes call "hard mode"? Is that correct? Or do I have it backward (or totally wrong)?
Jason Rute (Apr 28 2025 at 22:05):
Also, the leaderboard is really hard to read on a mobile device.
Jason Rute (Apr 28 2025 at 22:08):
It looks like a nice benchmark overall.
Jason Rute (Apr 28 2025 at 22:10):
I’d be curious what quality control you did to ensure there are no mistakes in translation. This is one of the largest issues with making these formal benchmarks. All the standard benchmarks had a lot of mistakes when first released (and many still do). This often makes the problems either impossible or trivial.
Bhavik Mehta (Apr 28 2025 at 22:15):
I am also curious about this: so far I have looked at three of the problems, and two out of the three had obvious mistakes (of the form Jason mentions).
It's a small sample size, but given past problems with benchmarks, this makes me not at all optimistic about this one.
Eric Wieser (Apr 28 2025 at 22:20):
for the first time we propose an evaluation method for fill-in-the-blank questions: Fine-Eval!
Am I right in assuming that the dataset format for such questions is exactly that of PutnamBench?
Bhavik Mehta (Apr 28 2025 at 22:21):
From their README: "For problems that require providing a solution first and then proving its correctness, we have referred to the style of PutnamBench."
Jason Rute (Apr 28 2025 at 22:24):
Also I’d be curious what @Joseph Myers thinks both in terms of correctness and style. He has strong opinions on how competition problems should be formalized.
Jason Rute (Apr 28 2025 at 22:29):
Also, besides errors and style, I have a checklist of things to think about when making a Lean benchmark:
Eric Wieser (Apr 28 2025 at 22:40):
From the README:
It is released under permissive licenses Apache 2.0 for Lean 4.
The LICENSE file says it is MIT-licensed. Which is correct?
Eric Wieser (Apr 28 2025 at 22:41):
https://github.com/MoonshotAI/CombiBench/blob/master/docs/README.md also looks out of place
Bhavik Mehta (Apr 28 2025 at 23:06):
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution. Compare, for instance:
from Compfiles:
/-- A cell on the board for the game. -/
abbrev Cell (N : ℕ) : Type := Fin (N + 2) × Fin (N + 1)

/-- A row that is neither the first nor the last (and thus contains a monster). -/
abbrev InteriorRow (N : ℕ) : Type := (Set.Icc 1 ⟨N, by omega⟩ : Set (Fin (N + 2)))

/-- Data for valid positions of the monsters. -/
abbrev MonsterData (N : ℕ) : Type := InteriorRow N ↪ Fin (N + 1)

/-- The cells with monsters as a Set, given an injection from rows to columns. -/
def MonsterData.monsterCells (m : MonsterData N) :
    Set (Cell N) :=
  Set.range (fun x : InteriorRow N ↦ ((x : Fin (N + 2)), m x))

/-- Whether two cells are adjacent. -/
def Adjacent (x y : Cell N) : Prop :=
  Nat.dist x.1 y.1 + Nat.dist x.2 y.2 = 1

/-- A valid path from the first to the last row. -/
structure Path (N : ℕ) where
  /-- The cells on the path. -/
  cells : List (Cell N)
  nonempty : cells ≠ []
  head_first_row : (cells.head nonempty).1 = 0
  last_last_row : (cells.getLast nonempty).1 = N + 1
  valid_move_seq : cells.Chain' Adjacent

/-- The first monster on a path, or `none`. -/
noncomputable def Path.firstMonster (p : Path N) (m : MonsterData N) : Option (Cell N) :=
  letI := Classical.propDecidable
  p.cells.find? (fun x ↦ (x ∈ m.monsterCells : Bool))

/-- A strategy, given the results of initial attempts, returns a path for the next attempt. -/
abbrev Strategy (N : ℕ) : Type := ⦃k : ℕ⦄ → (Fin k → Option (Cell N)) → Path N

/-- Playing a strategy, k attempts. -/
noncomputable def Strategy.play (s : Strategy N) (m : MonsterData N) :
    (k : ℕ) → Fin k → Option (Cell N)
  | 0 => Fin.elim0
  | k + 1 => Fin.snoc (s.play m k) ((s (s.play m k)).firstMonster m)
with, from the "new" benchmark:
-- A cell on the board for the game.
abbrev Cell (N : ℕ) : Type := Fin (N + 2) × Fin (N + 1)

-- A row that is neither the first nor the last (and thus contains a monster).
abbrev InteriorRow (N : ℕ) : Type := (Set.Icc 1 ⟨N, by omega⟩ : Set (Fin (N + 2)))

-- Data for valid positions of the monsters.
abbrev MonsterData (N : ℕ) : Type := InteriorRow N ↪ Fin (N + 1)

-- The cells with monsters as a Set, given an injection from rows to columns.
def MonsterData.monsterCells {N} (m : MonsterData N) :
    Set (Cell N) :=
  Set.range (fun x : InteriorRow N ↦ ((x : Fin (N + 2)), m x))

-- Whether two cells are adjacent.
def Adjacent {N} (x y : Cell N) : Prop :=
  Nat.dist x.1 y.1 + Nat.dist x.2 y.2 = 1

-- A valid path from the first to the last row.
structure Path (N : ℕ) where
  -- The cells on the path.
  cells : List (Cell N)
  nonempty : cells ≠ []
  head_first_row : (cells.head nonempty).1 = 0
  last_last_row : (cells.getLast nonempty).1 = N + 1
  valid_move_seq : cells.Chain' Adjacent

-- The first monster on a path, or `none`.
noncomputable def Path.firstMonster {N} (p : Path N) (m : MonsterData N) : Option (Cell N) :=
  letI := Classical.propDecidable
  p.cells.find? (fun x ↦ (x ∈ m.monsterCells : Bool))

-- A strategy, given the results of initial attempts, returns a path for the next attempt.
abbrev Strategy (N : ℕ) : Type := ⦃k : ℕ⦄ → (Fin k → Option (Cell N)) → Path N

-- Playing a strategy, k attempts.
noncomputable def Strategy.play {N} (s : Strategy N) (m : MonsterData N) :
    (k : ℕ) → Fin k → Option (Cell N)
  | 0 => Fin.elim0
  | k + 1 => Fin.snoc (s.play m k) ((s (s.play m k)).firstMonster m)
Zhengying Liu (Apr 29 2025 at 00:18):
Thank you all for pointing out these errors! We will correct them ASAP, and you are all welcome to make pull requests too. We believe community effort is key to constructing a reliable benchmark consisting only of formal statements like this one. It's really not easy. We made sure that every problem was reviewed by at least two reviewers before the release, but errors are still very hard to avoid...
Zhengying Liu (Apr 29 2025 at 00:21):
Could you share the exact location of these errors? We'll correct them ASAP. Thank you! @Bhavik Mehta
Bhavik Mehta said:
I am also curious about this: so far I have looked at three of the problems, and two out of the three had obvious mistakes (of the form Jason mentions).
It's a small sample size, but given past problems with benchmarks, this makes me not at all optimistic about this one.
Joseph Myers (Apr 29 2025 at 00:56):
I like the choice to use plenty of external defs and types rather than trying to force everything into a single statement to be proved (which gets very awkward for many combinatorics problems), at least.
Joseph Myers (Apr 29 2025 at 00:58):
I looked at a few examples of problems I described in the IMOLean README as technically geometrical (that is, combinatorics problems where a literal reading of the original English statement uses geometrical concepts and you can plausibly argue whether that should be retained in a formal statement or whether the formal statement should be purely combinatorial).
Bhavik Mehta (Apr 29 2025 at 00:59):
Zhengying Liu said:
Could you share the exact location of these errors?
No, I am not prepared to do unpaid work for an AI company.
Joseph Myers (Apr 29 2025 at 01:04):
For IMO 2018 P4, the formal statement reflects the geometrical formulation in terms of a distance of √5 (not present in the version in the IMO 2018 shortlist; the problem was reformulated after selection to avoid dependence on knowing what a knight's move is). But I think that formal statement gets the geometrical formulation wrong for a well-known reason: ℝ × ℝ does not have the Euclidean metric. IMO 2017 P3 looks like it has the same problem with the wrong metric. For IMOLean my conventions explicitly say to use EuclideanSpace ℝ (Fin 2) (which has the right metric) for problems involving coordinates.
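To illustrate the pitfall concretely, a minimal sketch against current Mathlib (assuming the lemma names Prod.dist_eq and EuclideanSpace.dist_eq):

import Mathlib

-- On the product type ℝ × ℝ, dist is the sup metric, not the Euclidean one:
example (p q : ℝ × ℝ) :
    dist p q = max (dist p.1 q.1) (dist p.2 q.2) :=
  Prod.dist_eq

-- EuclideanSpace ℝ (Fin 2) carries the intended Euclidean metric:
example (x y : EuclideanSpace ℝ (Fin 2)) :
    dist x y = Real.sqrt (∑ i, dist (x i) (y i) ^ 2) :=
  EuclideanSpace.dist_eq x y

Under the sup metric, the distance between two lattice points is a natural number, so a hypothesis phrased in terms of a distance of √5 is presumably never satisfiable, and the intended constraint degenerates.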
Joseph Myers (Apr 29 2025 at 01:11):
IMO 2013 P6 is another technically geometrical problem I looked at as an example, where the geometry in the statement is about points on a circle and intersections of chords between those points. Without attempting to assess whether the proposed statement is correct, I'd say that if you want to use non-geometrical statements for such problems (where the only geometry is about orderings of points in a cyclic ordering described in terms of whether chords intersect), surely you should first add appropriate definitions to mathlib for whether four points in a CircularOrder correspond to non-intersecting chords or not; there must be a better way to express that than the very long IsBeautiful definition in that statement.
Joseph Myers (Apr 29 2025 at 01:18):
As another example (not geometrical), the formal statement for IMO 2022 P1 is very different in style from my statement for that problem. Again without trying to assess whether it's correct, the formal statement you have is rather heavy on computable functions defined imperatively with do, which is not what I'd expect; the ability to prove things about such function definitions may not have much to do with the ability to prove things about more usual styles of mathematical problem statements, and such function definitions would be more common in Lean code that is intended to be executed than in definitions that are the subjects of proofs. (And while I suggest that AIs not be presented with doc strings about the correspondence between formal and informal statements when given problems to solve, I do suggest that formal statements be written with such doc strings on their definitions for the benefit of humans trying to judge correctness - just that the doc strings should be removed before giving the problems to an AI.)
Joseph Myers (Apr 29 2025 at 01:20):
I have explicitly not looked for other examples of problem statements that use the wrong metric, or any of the other well-known pitfalls in writing formal problem statements.
Joseph Myers (Apr 29 2025 at 01:26):
I think incidentally that the antonym of "easy mode" should be "standard mode" not "hard mode" (it's asking an AI to do the same things a human is asked for - find an answer and prove it correct).
Junqi Liu (Apr 29 2025 at 01:29):
Bhavik Mehta said:
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution. Compare, for instance:
from Compfiles: (code quoted in full above)
with, from the "new" benchmark: (code quoted in full above)
This problem is already in mathlib's Archive. We point this out in the paper:
image.png
David Renshaw (Apr 29 2025 at 01:29):
Bhavik Mehta said:
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution.
Compfiles imported that problem (2024P5) from mathlib4/Archive. Let me know if there's something I could be doing to make the attribution more clear.
Junqi Liu (Apr 29 2025 at 01:35):
Eric Wieser said:
https://github.com/MoonshotAI/CombiBench/blob/master/docs/README.md also looks out of place
Sorry, this is fixed!
Bhavik Mehta (Apr 29 2025 at 01:39):
David Renshaw said:
Bhavik Mehta said:
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution.
Compfiles imported that problem (2024P5) from mathlib4/Archive. Let me know if there's something I could be doing to make the attribution more clear.
Ah, I misunderstood the order that these were done in - this is clear in Compfiles. However, I think CombiBench should make this clearer in the dataset, rather than burying the attribution in the paper and putting "new work" in the announcement.
Junqi Liu (Apr 29 2025 at 01:45):
Joseph Myers said:
IMO 2013 P6 is another technically geometrical problem I looked at as an example, where the geometry in the statement is about points on a circle and intersections of chords between those points. Without attempting to assess whether the proposed statement is correct, I'd say that if you want to use non-geometrical statements for such problems (where the only geometry is about orderings of points in a cyclic ordering described in terms of whether chords intersect), surely you should first add appropriate definitions to mathlib for whether four points in a CircularOrder correspond to non-intersecting chords or not; there must be a better way to express that than the very long IsBeautiful definition in that statement.
Thank you for your suggestion. We hope to get some feedback on our current formalized definitions. The definitions we created while building CombiBench will gradually be contributed to mathlib, via a theorem library, and we will continuously adjust the statements in CombiBench as part of that process.
Junqi Liu (Apr 29 2025 at 01:48):
Joseph Myers said:
For IMO 2018 P4, the formal statement reflects the geometrical formulation in terms of a distance of √5 (not present in the version in the IMO 2018 shortlist; the problem was reformulated after selection to avoid dependence on knowing what a knight's move is). But I think that formal statement gets the geometrical formulation wrong for a well-known reason: ℝ × ℝ does not have the Euclidean metric. IMO 2017 P3 looks like it has the same problem with the wrong metric. For IMOLean my conventions explicitly say to use EuclideanSpace ℝ (Fin 2) (which has the right metric) for problems involving coordinates.
Thanks! Our team considered this issue but, for various reasons, chose the current formalization. Thank you for pointing it out, though; we will discuss this point again.
Joseph Myers (Apr 29 2025 at 02:09):
IMO 2015 P1 is another example where you're using the wrong metric (so at least three problems in this benchmark use the wrong metric, just among the geometrically-flavoured combinatorics problems for which I was a coordinator at the IMO). In my view, it's best to use the right metric even for problems where the choice of metric doesn't actually matter (unless the problem statement doesn't involve coordinates, so no concrete type is used and no type class hypotheses about the metric would be needed at all) - but for these three examples, the problem only has its intended meaning with the Euclidean rather than the sup metric.
Jia Li (Apr 29 2025 at 02:51):
Bhavik Mehta said:
Zhengying Liu said:
Could you share the exact location of these errors?
No, I am not prepared to do unpaid work for an AI company.
There may have been some misunderstanding. Numina is a non-profit organization, and everything we develop — datasets, models, and training pipelines — will be fully open-sourced and freely available to the public. In this particular case, we are discussing a benchmark that we believe could contribute to the broader community. There is no plan to commercialize or profit from this work.
We are deeply grateful to the Lean community, whose support and guidance have been valuable to this project from the very beginning. In fact, we view the strength of the Lean community as one of Lean 4’s greatest assets, and it is a key reason we chose Lean 4 for our models.
We sincerely hope that our work can, in turn, contribute to the growth of the Lean ecosystem. You are absolutely right — we owe a great deal to the Lean community, and we will make a stronger effort to explicitly acknowledge and credit the community's contributions in our future communications.
Jia Li (Apr 29 2025 at 03:04):
Jason Rute said:
Can you remind me how Moonshot AI and Numina are related? Is Moonshot = Kimina = Numina + Kimi?
Kimina-Prover is a collaboration between Moonshot AI and Numina to develop an IMO-level formal reasoning model. All our work is meant to be fully open-sourced in the coming months, especially the dataset. Kimi is the name of the product developed by Moonshot. We planned to explain this in a blog post on the day of the full release.
Jia Li (Apr 29 2025 at 03:12):
Bhavik Mehta said:
David Renshaw said:
Bhavik Mehta said:
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution.
Compfiles imported that problem (2024P5) from mathlib4/Archive. Let me know if there's something I could be doing to make the attribution more clear.
Ah, I misunderstood the order that these were done in - this is clear in Compfiles. However, I think CombiBench should make this clearer in the dataset, rather than burying the attribution in the paper and putting "new work" in the announcement.
Thank you for the feedback. We have updated our dataset card to be more explicit.
https://huggingface.co/datasets/AI-MO/CombiBench
We believe that the majority of the statements are novel, and that this does not compromise the originality of our work.
Jason Rute (Apr 29 2025 at 03:51):
The worry I have about taking a problem already formalized in mathlib/Archive or Compfiles is that many AI models will not be careful about avoiding contamination. Probably no AI lab thinks about such benchmarks when pretraining, and for fine-tuning there have been cases like InternLM-Math, which trained on test problems formalized in Compfiles.
Jason Rute (Apr 29 2025 at 03:57):
As for @Bhavik Mehta’s concerns, regardless of whether pointing out two errors is unpaid labor, it seems (from his comments) that your benchmark may need a lot more fixing than a casual bug fix from a friendly volunteer on two problems. Again, this is a major and common problem with these benchmarks. It is so easy to make mistakes that ruin the integrity of a benchmark (much more so than for informal math benchmarks).
Junqi Liu (Apr 29 2025 at 04:40):
Jason Rute said:
As for Bhavik Mehta’s concerns, regardless of whether pointing out two errors is unpaid labor, it seems (from his comments) that your benchmark may need a lot more fixing than a casual bug fix from a friendly volunteer on two problems. Again, this is a major and common problem with these benchmarks. It is so easy to make mistakes that ruin the integrity of a benchmark (much more so than for informal math benchmarks).
Thank you for your suggestions. We understand that maintaining a benchmark is a long-term endeavor. As Zhengying mentioned earlier, our team double-checked the work, but we still cannot guarantee that there are no errors. As mentioned in the README, we welcome everyone to point out any potential issues through the issue tracker, and we will respond to each one. Regarding the style and metric issues raised by Joseph Myers, we will give them serious consideration. As for @Bhavik Mehta’s concerns, we will re-examine our benchmark.
Yaël Dillies (Apr 29 2025 at 06:44):
Joseph Myers said:
you should first add appropriate definitions to mathlib for whether four points in a CircularOrder correspond to non-intersecting chords or not; there must be a better way to express that than the very long IsBeautiful definition in that statement.
That was indeed the plan. Intersecting chords came up several times, and the second time I wrote a dedicated API, but I haven't yet pushed it towards mathlib
Bhavik Mehta (Apr 29 2025 at 09:41):
Jia Li said:
There may have been some misunderstanding.
The dataset you have published is from the Moonshot AI github account. And Moonshot AI is a for-profit AI company, no?
Bhavik Mehta (Apr 29 2025 at 09:45):
Indeed Jason explains my concerns clearly: the errors I have seen and that Joseph has kindly pointed out are very similar both in frequency and form to those in earlier benchmarks, and significantly compromise the quality of this benchmark.
Jia Li (Apr 29 2025 at 09:58):
Bhavik Mehta said:
Jia Li said:
There may have been some misunderstanding.
The dataset you have published is from the Moonshot AI github account. And Moonshot AI is a for-profit AI company, no?
Yes, you are right. It is a joint development effort between the two teams. Moonshot is a for-profit company, but this does not change the fact that we are releasing all the results of this collaboration (currently partial results). At the very beginning, we agreed on open-sourcing the datasets and the models that we developed together. Hosting the page on Moonshot is a way to give credit to their contribution.
Joseph Myers (Apr 29 2025 at 10:13):
Bhavik Mehta said:
Indeed Jason explains my concerns clearly: the errors I have seen and that Joseph has kindly pointed out are very similar both in frequency and form to those in earlier benchmarks, and significantly compromise the quality of this benchmark.
I'd go so far as to say that it would be more novel and interesting to see research that looks at why errors in formal problem statements are so common (covering both human and technical factors) and how to improve accuracy, than yet another paper about yet another formal benchmark.
Jason Rute (Apr 29 2025 at 11:46):
While @Bhavik Mehta isn't willing to share his experiment, I'm willing to share mine. There are 2.5 errors in the first three problems:
- apmo_1991_p2 doesn't assume the points are distinct, so it is trivially false. (I would be curious if a good AI could prove this is false.)
- apmo_2023_p1 seems to be correct, but note that the definitions of touches and touches_interior_or_edge are not symmetric (and I assume that was a mistake). The redeeming property of the theorem statement is that you consider both directions of touches_interior_or_edge, which I think (I'm about 98% sure) fixes the problem.
- balticway_2015_p7 forgot about the irreflexivity of graphs. When stating that a subset is a clique (both for the two groups in the partition and for the board), you can't have l being adjacent to itself. You could use SimpleGraph.IsClique instead (see the sketch below). As stated, the theorem is trivially true since the h_board hypothesis is a contradiction. Moreover, the only way one can prove this theorem is to use the fact that h_board is contradictory; otherwise, the conclusion is impossible for the same reason. (I wonder if, DeepSeek-Prover-style, a good AI could figure out that this hypothesis is contradictory.)
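To make the clique point concrete, here is a minimal sketch (mine; the names s, l, h_board are stand-ins mirroring the benchmark's binders, not its actual code):

import Mathlib

-- A "clique" hypothesis quantifying over *all* pairs (including a = b) is
-- contradictory, because simple graphs have no self-adjacency:
example {V : Type*} (G : SimpleGraph V) (s : Set V) (l : V) (hl : l ∈ s)
    (h_board : ∀ a ∈ s, ∀ b ∈ s, G.Adj a b) : False :=
  G.irrefl (h_board l hl l hl)

-- SimpleGraph.IsClique instead quantifies only over *distinct* pairs:
example {V : Type*} (G : SimpleGraph V) (s : Set V) :
    G.IsClique s ↔ s.Pairwise G.Adj :=
  SimpleGraph.isClique_iff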
Jason Rute (Apr 29 2025 at 12:05):
But again, please assume there are many more errors which we didn't find. As for @Joseph Myers's point about why these errors are so common, I think it is a combination of multiple factors:
- You clearly didn't prove these theorems true. The first version of miniF2F took the time to formalize the proofs of a number of statements to guarantee correctness, and I think that is also what Harmonic (and probably AlphaProof) did with their own version of MiniF2F.
- Also, by not using standard definitions like SimpleGraph.IsClique, you are reinventing the wheel and making it more likely you will get a mistake.
- There probably are ways around this, but it would involve writing unit-test-style formalizations making sure the relevant parts are correct, and keeping a list of common issues to check against. Distinctness is one such issue (a sketch of such a unit test follows).
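For instance, a minimal sketch (mine, not from the benchmark) of unit-testing a distinctness side condition on tiny concrete inputs:

import Mathlib

-- One input that should satisfy the side condition and one that should not;
-- `decide` evaluates both, catching a forgotten distinctness hypothesis.
example : ([1, 2, 3] : List ℕ).Nodup := by decide
example : ¬ ([1, 2, 1] : List ℕ).Nodup := by decide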
Eric Wieser (Apr 29 2025 at 12:16):
Eric Wieser said:
From the README:
It is released under permissive licenses Apache 2.0 for Lean 4.
The LICENSE file says it is MIT-licensed. Which is correct?
@Junqi Liu, ping on this.
Jason Rute (Apr 29 2025 at 12:34):
I think there is also a problem in the last problem, usamo_2000_p4 (which is the fourth problem I've looked at). If I understand correctly, the way you have formalized the problem is to find the cardinality of the smallest set containing a right triangle, which is trivially 3, rather than the smallest number n such that every such set of size n contains a right triangle. So in the "without solution" mode, this theorem is impossible to prove as stated. (Again, thinking about how to find this error automatically: an AI could prove the trivial answer 3, which would automatically flag a bug, since 3 is not the expected answer; if the problem were correctly stated, it would be impossible to prove that the answer is 3.)
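Schematically, the two readings differ in where the quantifier over sets sits; a sketch (mine; HasRightTriangle is a hypothetical placeholder, not the benchmark's actual definition):

import Mathlib

variable {P : Type*} (HasRightTriangle : Finset P → Prop)

-- The reading apparently formalized: sizes of *some* set containing a right
-- triangle; the least such size is trivially 3.
def sizesAsStated : Set ℕ :=
  {n | ∃ s : Finset P, s.card = n ∧ HasRightTriangle s}

-- The intended reading: sizes n such that *every* n-element set contains a
-- right triangle; the answer is the least such n.
def sizesAsIntended : Set ℕ :=
  {n | ∀ s : Finset P, s.card = n → HasRightTriangle s}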
Junqi Liu (Apr 29 2025 at 12:54):
Eric Wieser said:
Eric Wieser said:
From the README:
It is released under permissive licenses Apache 2.0 for Lean 4.
The LICENSE file says it is MIT-licensed. Which is correct?
Junqi Liu, ping on this.
This has been fixed! It is MIT-licensed.
Junqi Liu (Apr 29 2025 at 15:43):
We are truly grateful for the errors @Jason Rute has pointed out to us. We have already corrected them and will gradually check each problem over the next two weeks. Where possible, we will also attempt to provide proofs for the problems. Meanwhile, we will ensure that these proofs do not appear in any training data; they are solely to guarantee the correctness of the formalized statements.
Kim Morrison (Apr 29 2025 at 16:26):
The thing I remain scared about here is that, given that Jason, Joseph, and Bhavik have been finding approximately one error per problem in the problems they have looked at, a bystander should conclude that nearly all the problems which no one (external) has looked at are also incorrect. They are not pointing out isolated examples of broken problems; they are providing statistical evidence that almost everything is wrong --- and this evidence needs to be addressed systematically.
Joseph Myers (Apr 29 2025 at 21:49):
This also probably invalidates the leaderboard information, given the likelihood that AIs have actually discovered some problems that are made much easier by being incorrectly stated. Could the leaderboard be made to provide access both to the list of problems claimed to be solved by each AI and to all the underlying Lean solutions (I couldn't see how to access this data at present)? Seeing whether formal solutions actually correspond to the intended informal problem is a useful way of detecting formalization mistakes.
Joseph Myers (Apr 29 2025 at 21:50):
There were a few problems where I only commented on choices made without assessing whether the formal statement was correct, but having looked at a few more problems on which I was a coordinator, I think I believe the statistical evidence is indeed that almost all the problems are incorrect.
Joseph Myers (Apr 29 2025 at 21:53):
- The EGMO 2022 P5 formal statement defines PerfectCover in a way that distinguishes permutations of the labels on the dominoes, and distinguishes the two squares of each domino. Thus, in the formal statement the number of ways of covering the board gets multiplied by $2^n \cdot n!$ (for $n$ dominoes) compared to the intended number. That's obviously no good for a problem concerned with when that number of ways is odd.
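To spell out the arithmetic (under the reading above): a cover by $n$ unlabelled dominoes corresponds to exactly $2^n \cdot n!$ labelled, square-distinguished covers, so
$$\#\{\text{covers in the formal statement}\} = 2^n \, n! \cdot \#\{\text{intended covers}\},$$
and since $2^n \, n!$ is even for every $n \geq 1$, the count in the formal statement is never odd.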
Joseph Myers (Apr 29 2025 at 21:57):
- The IMO 2023 P5 formal statement defines a directed graph triangleGraph used to determine permitted ninja paths. But the definition in the formal statement has edges pointing both up and down (so it would allow ninja paths that mix upward and downward edges), whereas the problem requires ninja paths to go down only. Also, the with_solution version of that problem statement gives the wrong answer; Real.log is not a base-2 logarithm.
Joseph Myers (Apr 29 2025 at 22:03):
- The formal statement of IMO 2014 P6 defines finite_regions as a predicate on sets of points S in the plane, requiring such a set to be nonempty, bounded, and, for each line in the given set of lines, either all points of S must lie in that line, or all must lie on the same side of that line. For example, any singleton set would satisfy that predicate. The formal statement of the problem is then trivially false, given how finite_regions is true for any singleton set; it doesn't come anywhere close to what the informal statement actually asks (about taking the finite regions determined by all the lines and looking at the colours of lines on their boundaries).
Joseph Myers (Apr 29 2025 at 22:05):
And a couple of bonus incorrect statements for problems I wasn't a coordinator on:
Joseph Myers (Apr 29 2025 at 22:06):
- The one non-IMO problem using dist, brualdi_ch3_18, seems to have the same issue of using the wrong metric as the other problems I listed before.
Joseph Myers (Apr 29 2025 at 22:10):
- In the IMO 2005 P6 formal statement, the hypothesis (h' : ∀ i, (solved i).card < Fintype.card participants) looks like "each problem had some contestant who failed to solve it". That's not the same thing as "Moreover, no contestant solved all the 6 problems." from the informal statement (solved here maps problems to Finsets of contestants who solved them).
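A schematic contrast (a sketch; participants and solved mirror the binders quoted above):

import Mathlib

variable {participants : Type*} [Fintype participants]
  (solved : Fin 6 → Finset participants)

-- The benchmark's h': each problem was left unsolved by some contestant.
def eachProblemMissedBySomeone : Prop :=
  ∀ i : Fin 6, (solved i).card < Fintype.card participants

-- The informal condition: no contestant solved all six problems.
def noContestantSolvedAll : Prop :=
  ∀ p : participants, ∃ i, p ∉ solved i

-- These are not equivalent: the first is compatible with one contestant
-- solving all six problems, as long as each problem is missed by someone else.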
Joseph Myers (Apr 29 2025 at 22:15):
Writing a formal solution based on a known informal solution to a problem is indeed a very good way of detecting mistakes in the formal statement. It's also incredibly time-consuming to do on the scale of 100 problems; my experience is that a combinatorics problem doesn't need to be that hard for a formal solution to be 1000 lines or more of Lean. (The first olympiad combinatorics problem I formalized a solution to ended up as 3000 lines of Lean 3, though I expect that was longer than necessary through inexperience.)
Jason Rute (Apr 29 2025 at 23:13):
Clearly, there is a lot of cleanup needed. I hope the authors do it and rerun the benchmark. And I hope they maintain it and version it properly going forward, noting which version each model was run with (including which version of Lean, Mathlib, and the benchmark itself).
Jason Rute (Apr 29 2025 at 23:14):
I may have been too ambitious in suggesting formalizing each proof. It would be good to get guidance from those with experience in this sort of thing on how to avoid mistakes.
Jason Rute (Apr 29 2025 at 23:32):
Overall, I think this benchmark likely won't have a lot of impact. It will be good for R&D, but as a goal post it has all the same problems that MiniF2F and PutnamBench have:
- The problems are old and found on the internet, probably even in various natural language math training sets (maybe even in Numina Math 1.5). I think leakage (where the model is aware of the informal proof) will be common.
- In some cases, there are even already formal proofs on the internet, and I suspect @David Renshaw plans to formalize many more of these problems, especially the IMO ones, in Compfiles. This leads to even more leakage, and of the worst kind, where the formal proof is available on the open internet.
- As models get better on these benchmarks (like they are on the informal math benchmarks), there will be a lot of distrust. Does it do well because it has memorized the proof (or the numerical solution at least), or because it is really reasoning?
- Competition math isn't really that relevant to what formal math is actually used for. It is just a convenient benchmark.
Jason Rute (Apr 29 2025 at 23:32):
I think going forward, the following two types of benchmarks will have more value:
- Benchmarks that are more closely tied to the needs of formal math: autoformalization, provably correct code, and research math.
- Benchmarks that are resistant to data leakage. For natural-language competition math, the new standard is matharena.ai, a project by researchers at ETH where they run models on current math competitions immediately after the competition (which avoids leakage). @Mislav Balunović has already stated that he is considering a formal version of the matharena benchmark. I think Lean experts here should partner with him on that (or make their own version of that benchmark). I could even imagine a near future (2026?) where the standard competitions (USAMO, IMO, Putnam, etc.) release formal Lean versions of all their problems.
Joseph Myers (Apr 29 2025 at 23:50):
For the informal leakage issue, it's always possible for someone to use such a benchmark as an autoformalization benchmark - give an autoformalization AI the formal statement and the informal solution - even if it's not what the benchmark was designed or intended for. (That's still at risk of leakage from training materials including existing formal solutions.)
Joseph Myers (Apr 29 2025 at 23:56):
I like the idea of competitions releasing formal statements of their own problems. But there are a lot of competitions (plenty of which don't even have websites that get promptly updated with the official informal statements after the competition); to have several competitions releasing formal problem statements (and thus several different groups of people releasing them, not all with a great deal of Lean expertise), you really need a better solution (that can be applied by non-experts) to getting correct (even if not perfectly idiomatic) Lean problem statements. If several competitions release Lean versions of their problems where most of the formal statements are wrong, that's not very useful for benchmarking any kind of AI other than an interactive "tell me what I got wrong in this formal statement" autoformalizer for statements.
Jason Rute (Apr 29 2025 at 23:57):
Joseph Myers said:
For the informal leakage issue, it's always possible for someone to use such a benchmark as an autoformalization benchmark - give an autoformalization AI the formal statement and the informal solution - even if it's not what the benchmark was designed or intended for. (That's still at risk of leakage from training materials including existing formal solutions.)
I totally agree! I discussed this specifically here as a Tier 0 of my ambitious autoformalization benchmark idea where one formalizes entire mathematical arguments. Indeed ProofNet and MiniF2F are already ready for this Tier 0 autoformalization benchmark since they have informal proofs as well. This was run on MiniF2F a few years ago as part of the Draft-Sketch-Prove paper. In that paper, providing the human proof didn't help much at all, but I think with better autoformalizers it could help a lot. If not, we will discover that autoformalization is the current bottleneck.
Jason Rute (Apr 30 2025 at 00:09):
As for getting (at least some of the major) math competitions to have high-quality formalized translations, if anyone in the world could make it happen, it is you @Joseph Myers! :smile:
Junqi Liu (Apr 30 2025 at 02:41):
Joseph Myers said:
This also probably invalidates the leaderboard information, given the likelihood that AIs have actually discovered some problems that are made much easier by being incorrectly stated. Could the leaderboard be made to provide access to both lists of the problems claimed to be solved by each AI, and all the underlying Lean solutions (I couldn't see how to access this data at present), since seeing whether formal solutions actually correspond to the intended informal problem is a useful way of detecting formalization mistakes?
We list most of the 7 solved problems at the end of our paper. There are two problems (brualdi_ch7_7 and brualdi_ch14_33) for which we did not post the solutions, as these two problems can be proved very easily with a one-step theorem.
Niels Voss (Apr 30 2025 at 23:22):
Out of curiosity, how do you plan to grade problems that require exhibiting certain sums? I picked a random problem in the benchmark:
import Mathlib
abbrev brualdi_ch5_9_solution : ℕ → ℤ := sorry
/--
Evaluate the sum $\sum_{k=0}^{n}(-1)^{k}\binom{n}{k} 10^{k}$.
-/
theorem brualdi_ch5_9 (n : ℕ) : ∑ k ∈ Finset.range (n + 1), (-1 : ℤ) ^ k * (n.choose k) * 10 ^ k =
    brualdi_ch5_9_solution n := by sorry
where part of the solution involves exhibiting a function ℕ → ℤ. To me, it seems that an AI could just do
import Mathlib
abbrev brualdi_ch5_9_solution : ℕ → ℤ :=
  fun n => ∑ k ∈ Finset.range (n + 1), (-1 : ℤ) ^ k * (n.choose k) * 10 ^ k
/--
Evaluate the sum $\sum_{k=0}^{n}(-1)^{k}\binom{n}{k} 10^{k}$.
-/
theorem brualdi_ch5_9 (n : ℕ) : ∑ k ∈ Finset.range (n + 1), (-1 : ℤ) ^ k * (n.choose k) * 10 ^ k =
    brualdi_ch5_9_solution n := by rfl
Would the solution for brualdi_ch5_9_solution need to be in some sort of closed form? This isn't really that big of a deal, since it's quite easy for a human to check manually whether the function generated by the AI is simple enough to count as a solution.
Junqi Liu (May 01 2025 at 14:30):
@Niels Voss You can read our evaluation code or our paper; we have considered this situation.
Last updated: May 02 2025 at 03:31 UTC