Zulip Chat Archive
Stream: general
Topic: CombiBench : A benchmark focusing on combinatorics problem
Junqi Liu (Apr 28 2025 at 16:28):
Glad to share new work from our Kimina team! We have released the first benchmark focusing on formalized mathematics in combinatorics, CombiBench, and for the first time propose an evaluation method for fill-in-the-blank questions, Fine-Eval! We have open-sourced the complete datasets with and without solutions, as well as one-click evaluation code that can be used with the Kimina Lean Server!
GitHub: https://github.com/MoonshotAI/CombiBench
Hugging Face: https://huggingface.co/datasets/AI-MO/CombiBench
Website: https://moonshotai.github.io/CombiBench
Leaderboard: https://moonshotai.github.io/CombiBench/leaderboard.html
Jason Rute (Apr 28 2025 at 20:34):
Can you remind me how Moonshot AI and Numina are related? Is Moonshot = Kimina = Numina + Kimi?
Junyan Xu (Apr 28 2025 at 21:06):
Kimi is a product of Moonshot (the company), and the Kimi Team (coauthor of Kimina) is presumably the team working on that product, as I understand it.
Jason Rute (Apr 28 2025 at 22:04):
"Without solution" means the model has to give the solution and a formal proof? What we sometimes call "hard mode"? Is that correct? Or do I have it backward (or totally wrong)?
Jason Rute (Apr 28 2025 at 22:05):
Also the Leaderboard is really hard to read on a mobile device.
Jason Rute (Apr 28 2025 at 22:08):
It looks like a nice benchmark overall.
Jason Rute (Apr 28 2025 at 22:10):
I’d be curious what quality control you did to make sure there are no mistakes in translation. This is one of the largest issues with making these formal benchmarks. All the standard benchmarks when first released had a lot of mistakes (and many still do). This often makes the problems either impossible or trivial.
Bhavik Mehta (Apr 28 2025 at 22:15):
I am also curious about this: so far I have looked at three of the problems, and two out of the three had obvious mistakes (of the form Jason mentions).
It's a small sample size, but given past problems with benchmarks, this makes me not at all optimistic about this one.
Eric Wieser (Apr 28 2025 at 22:20):
for the first time propose an evaluation method for fill-in-the-blank questions, Fine-Eval!
Am I right in assuming that the dataset format for such questions is exactly that of PutnamBench?
Bhavik Mehta (Apr 28 2025 at 22:21):
From their README: "For problems that require providing a solution first and then proving its correctness, we have referred to the style of PutnamBench."
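For concreteness, here is a minimal sketch (a toy problem with hypothetical names, not taken from either benchmark) of what that PutnamBench-style format looks like: the answer is a separate definition left as sorry for the model to fill in, and the theorem then asserts that the answer is correct.
import Mathlib

-- Toy fill-in-the-blank problem in the PutnamBench style (hypothetical names).
-- The model must replace the first `sorry` with its proposed answer...
abbrev toy_problem_solution : ℕ := sorry

-- ...and then prove that the answer is correct.
theorem toy_problem : 2 + 2 = toy_problem_solution := by sorry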
Jason Rute (Apr 28 2025 at 22:24):
Also I’d be curious what @Joseph Myers thinks both in terms of correctness and style. He has strong opinions on how competition problems should be formalized.
Jason Rute (Apr 28 2025 at 22:29):
Also, besides errors and style, I have a checklist of things to think about when making a Lean benchmark:
Eric Wieser (Apr 28 2025 at 22:40):
From the README:
It is released under permissive licenses Apache 2.0 for Lean 4.
The LICENSE file says it is MIT-licensed. Which is correct?
Eric Wieser (Apr 28 2025 at 22:41):
https://github.com/MoonshotAI/CombiBench/blob/master/docs/README.md also looks out of place
Bhavik Mehta (Apr 28 2025 at 23:06):
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution. Compare, for instance:
from Compfiles:
/-- A cell on the board for the game. -/
abbrev Cell (N : ℕ) : Type := Fin (N + 2) × Fin (N + 1)

/-- A row that is neither the first nor the last (and thus contains a monster). -/
abbrev InteriorRow (N : ℕ) : Type := (Set.Icc 1 ⟨N, by omega⟩ : Set (Fin (N + 2)))

/-- Data for valid positions of the monsters. -/
abbrev MonsterData (N : ℕ) : Type := InteriorRow N ↪ Fin (N + 1)

/-- The cells with monsters as a Set, given an injection from rows to columns. -/
def MonsterData.monsterCells (m : MonsterData N) :
    Set (Cell N) :=
  Set.range (fun x : InteriorRow N ↦ ((x : Fin (N + 2)), m x))

/-- Whether two cells are adjacent. -/
def Adjacent (x y : Cell N) : Prop :=
  Nat.dist x.1 y.1 + Nat.dist x.2 y.2 = 1

/-- A valid path from the first to the last row. -/
structure Path (N : ℕ) where
  /-- The cells on the path. -/
  cells : List (Cell N)
  nonempty : cells ≠ []
  head_first_row : (cells.head nonempty).1 = 0
  last_last_row : (cells.getLast nonempty).1 = N + 1
  valid_move_seq : cells.Chain' Adjacent

/-- The first monster on a path, or `none`. -/
noncomputable def Path.firstMonster (p : Path N) (m : MonsterData N) : Option (Cell N) :=
  letI := Classical.propDecidable
  p.cells.find? (fun x ↦ (x ∈ m.monsterCells : Bool))

/-- A strategy, given the results of initial attempts, returns a path for the next attempt. -/
abbrev Strategy (N : ℕ) : Type := ⦃k : ℕ⦄ → (Fin k → Option (Cell N)) → Path N

/-- Playing a strategy, k attempts. -/
noncomputable def Strategy.play (s : Strategy N) (m : MonsterData N) :
    (k : ℕ) → Fin k → Option (Cell N)
  | 0 => Fin.elim0
  | k + 1 => Fin.snoc (s.play m k) ((s (s.play m k)).firstMonster m)
with, from the "new" benchmark:
-- A cell on the board for the game.
abbrev Cell (N : ℕ) : Type := Fin (N + 2) × Fin (N + 1)

-- A row that is neither the first nor the last (and thus contains a monster).
abbrev InteriorRow (N : ℕ) : Type := (Set.Icc 1 ⟨N, by omega⟩ : Set (Fin (N + 2)))

-- Data for valid positions of the monsters.
abbrev MonsterData (N : ℕ) : Type := InteriorRow N ↪ Fin (N + 1)

-- The cells with monsters as a Set, given an injection from rows to columns.
def MonsterData.monsterCells {N} (m : MonsterData N) :
    Set (Cell N) :=
  Set.range (fun x : InteriorRow N ↦ ((x : Fin (N + 2)), m x))

-- Whether two cells are adjacent.
def Adjacent {N} (x y : Cell N) : Prop :=
  Nat.dist x.1 y.1 + Nat.dist x.2 y.2 = 1

-- A valid path from the first to the last row.
structure Path (N : ℕ) where
  -- The cells on the path.
  cells : List (Cell N)
  nonempty : cells ≠ []
  head_first_row : (cells.head nonempty).1 = 0
  last_last_row : (cells.getLast nonempty).1 = N + 1
  valid_move_seq : cells.Chain' Adjacent

-- The first monster on a path, or `none`.
noncomputable def Path.firstMonster {N} (p : Path N) (m : MonsterData N) : Option (Cell N) :=
  letI := Classical.propDecidable
  p.cells.find? (fun x ↦ (x ∈ m.monsterCells : Bool))

-- A strategy, given the results of initial attempts, returns a path for the next attempt.
abbrev Strategy (N : ℕ) : Type := ⦃k : ℕ⦄ → (Fin k → Option (Cell N)) → Path N

-- Playing a strategy, k attempts.
noncomputable def Strategy.play {N} (s : Strategy N) (m : MonsterData N) :
    (k : ℕ) → Fin k → Option (Cell N)
  | 0 => Fin.elim0
  | k + 1 => Fin.snoc (s.play m k) ((s (s.play m k)).firstMonster m)
Zhengying Liu (Apr 29 2025 at 00:18):
Thank you all for pointing out these errors! We will correct them ASAP. And you are all welcome to make pull requests too. We believe community effort is key for constructing a reliable benchmark consisting only of formal statements, like this one. It's really not easy. We made sure that every problem was reviewed by at least two reviewers before the release, but errors are still very hard to avoid...
Zhengying Liu (Apr 29 2025 at 00:21):
Could you share the exact location of these errors? We'll correct them ASAP. Thank you! @Bhavik Mehta
Bhavik Mehta said:
I am also curious about this: so far I have looked at three of the problems, and two out of the three had obvious mistakes (of the form Jason mentions).
It's a small sample size, but given past problems with benchmarks, this makes me not at all optimistic about this one.
Joseph Myers (Apr 29 2025 at 00:56):
I like the choice to use plenty of external defs and types rather than trying to force everything into a single statement to be proved (which gets very awkward for many combinatorics problems), at least.
Joseph Myers (Apr 29 2025 at 00:58):
I looked at a few examples of problems I described in the IMOLean README as technically geometrical (that is, combinatorics problems where a literal reading of the original English statement uses geometrical concepts and you can plausibly argue whether that should be retained in a formal statement or whether the formal statement should be purely combinatorial).
Bhavik Mehta (Apr 29 2025 at 00:59):
Zhengying Liu said:
Could you share the exact location of these errors?
No, I am not prepared to do unpaid work for an AI company.
Joseph Myers (Apr 29 2025 at 01:04):
For IMO 2018 P4, the formal statement reflects the geometrical formulation in terms of a distance of √5 (not present in the version in the IMO 2018 shortlist; the problem was reformulated after selection to avoid dependence on knowing what a knight's move is). But I think that formal statement gets the geometrical formulation wrong for a well-known reason: ℝ × ℝ does not have the Euclidean metric. IMO 2017 P3 looks like it has the same problem with the wrong metric. For IMOLean my conventions explicitly say to use EuclideanSpace ℝ (Fin 2) (which has the right metric) for problems involving coordinates.
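To illustrate the difference (a minimal sketch; the two lemma names below are Mathlib lemmas as I recall them): the product type carries the sup metric, while EuclideanSpace carries the L² metric, so a condition like "distance √5" only means the intended thing with the latter.
import Mathlib

-- In the product metric space ℝ × ℝ, `dist` is the sup (max) of the coordinate distances:
example (p q : ℝ × ℝ) : dist p q = max (dist p.1 q.1) (dist p.2 q.2) :=
  Prod.dist_eq

-- In EuclideanSpace ℝ (Fin 2), `dist` is the usual Euclidean (L²) distance:
example (x y : EuclideanSpace ℝ (Fin 2)) :
    dist x y = Real.sqrt (∑ i, dist (x i) (y i) ^ 2) :=
  EuclideanSpace.dist_eq x y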
Joseph Myers (Apr 29 2025 at 01:11):
IMO 2013 P6 is another technically geometrical problem I looked at as an example, where the geometry in the statement is about points on a circle and intersections of chords between those points. Without attempting to assess whether the proposed statement is correct, I'd say that if you want to use non-geometrical statements for such problems (where the only geometry is about orderings of points in a cyclic ordering described in terms of whether chords intersect), surely you should first add appropriate definitions to mathlib for whether four points in a CircularOrder correspond to non-intersecting chords or not; there must be a better way to express that than the very long IsBeautiful definition in that statement.
Joseph Myers (Apr 29 2025 at 01:18):
As another example (not geometrical), the formal statement for IMO 2022 P1 is very different in style from my statement for that problem. Again without trying to assess whether it's correct, the formal statement you have is rather heavy on computable functions defined imperatively with do, which is not what I'd expect; the ability to prove things about such function definitions may not have much to do with the ability to prove things about more usual styles of mathematical problem statements, and such function definitions would be more common in Lean code that is intended to be executed than in definitions that are the subjects of proofs. (And while I suggest that AIs not be presented with doc strings about the correspondence between formal and informal statements when given problems to solve, I do suggest that formal statements be written with such doc strings on their definitions for the benefit of humans trying to judge correctness - just that the doc strings should be removed before giving the problems to an AI.)
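As a toy illustration of that stylistic contrast (a made-up function, not the actual IMO 2022 P1 statement): the same quantity defined imperatively with do notation versus as a declarative sum; the usual mathematical proof style engages much more easily with the second form.
import Mathlib

-- Imperative style: executable, but awkward as the subject of a proof.
def sumUpToImperative (n : ℕ) : ℕ := Id.run do
  let mut acc := 0
  for i in [0:n+1] do
    acc := acc + i
  return acc

-- Declarative style: the form one would normally state and prove things about.
def sumUpToDeclarative (n : ℕ) : ℕ := ∑ i ∈ Finset.range (n + 1), i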
Joseph Myers (Apr 29 2025 at 01:20):
I have explicitly not looked for other examples of problem statements that use the wrong metric, or any of the other well-known pitfalls in writing formal problem statements.
Joseph Myers (Apr 29 2025 at 01:26):
I think incidentally that the antonym of "easy mode" should be "standard mode" not "hard mode" (it's asking an AI to do the same things a human is asked for - find an answer and prove it correct).
Junqi Liu (Apr 29 2025 at 01:29):
Bhavik Mehta said:
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution. Compare, for instance:
from Compfiles: (code as quoted in full above)
with, from the "new" benchmark: (code as quoted in full above)
This is already in the mathlib Archive. We point this out in the paper:
image.png
David Renshaw (Apr 29 2025 at 01:29):
Bhavik Mehta said:
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution.
Compfiles imported that problem (2024P5) from mathlib4/Archive. Let me know if there's something I could be doing to make the attribution more clear.
Junqi Liu (Apr 29 2025 at 01:35):
Eric Wieser said:
https://github.com/MoonshotAI/CombiBench/blob/master/docs/README.md also looks out of place
Sorry, this is fixed!
Bhavik Mehta (Apr 29 2025 at 01:39):
David Renshaw said:
Bhavik Mehta said:
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution.
Compfiles imported that problem (2024P5) from mathlib4/Archive. Let me know if there's something I could be doing to make the attribution more clear.
Ah, I misunderstood the order that these were done in - this is clear in Compfiles. However, I think CombiBench should make this clearer in the dataset, rather than burying the attribution in the paper and putting "new work" in the announcement.
Junqi Liu (Apr 29 2025 at 01:45):
Joseph Myers said:
IMO 2013 P6 is another technically geometrical problem I looked at as an example, where the geometry in the statement is about points on a circle and intersections of chords between those points. Without attempting to assess whether the proposed statement is correct, I'd say that if you want to use non-geometrical statements for such problems (where the only geometry is about orderings of points in a cyclic ordering described in terms of whether chords intersect), surely you should first add appropriate definitions to mathlib for whether four points in a CircularOrder correspond to non-intersecting chords or not; there must be a better way to express that than the very long IsBeautiful definition in that statement.
Thank you for your suggestion. We hope to get some feedback on our current formalized definitions. The definitions we used in the process of completing CombiBench will be gradually contributed to mathlib through a theorem library in the future. We will also continuously adjust the statements of our CombiBench in this process.
Junqi Liu (Apr 29 2025 at 01:48):
Joseph Myers said:
For IMO 2018 P4, the formal statement reflects the geometrical formulation in terms of a distance of √5 (not present in the version in the IMO 2018 shortlist; the problem was reformulated after selection to avoid dependence on knowing what a knight's move is). But I think that formal statement gets the geometrical formulation wrong for a well-known reason: ℝ × ℝ does not have the Euclidean metric. IMO 2017 P3 looks like it has the same problem with the wrong metric. For IMOLean my conventions explicitly say to use EuclideanSpace ℝ (Fin 2) (which has the right metric) for problems involving coordinates.
Thanks! Our team considered this issue, but for various reasons we chose the current formalization. Thank you for pointing it out; we will discuss this point again.
Joseph Myers (Apr 29 2025 at 02:09):
IMO 2015 P1 is another example where you're using the wrong metric (so there are at least three problems in this benchmark with the wrong metric just from among those geometrically-flavoured combinatorics problems for which I was a coordinator at the IMO). In my view, it's best to use the right metric even for those problems for which the choice of metric doesn't actually matter (unless the problem statement doesn't involve coordinates, so a concrete type isn't used, and wouldn't use the type class hypotheses relating to the metric at all, so such hypotheses aren't needed) - but for these three examples, the problem only has its intended meaning at all with the Euclidean rather than the sup metric.
Jia Li (Apr 29 2025 at 02:51):
Bhavik Mehta said:
Zhengying Liu said:
Could you share the exact location of these errors?
No, I am not prepared to do unpaid work for an AI company.
There may have been some misunderstanding. Numina is a non-profit organization, and everything we develop — datasets, models, and training pipelines — will be fully open-sourced and freely available to the public. In this particular case, we are discussing a benchmark that we believe could contribute to the broader community. There is no plan to commercialize or profit from this work.
We are deeply grateful to the Lean community, whose support and guidance have been valuable to this project from the very beginning. In fact, we view the strength of the Lean community as one of Lean 4’s greatest assets, and it is a key reason we chose Lean 4 for our models.
We sincerely hope that our work can, in turn, contribute to the growth of the Lean ecosystem. You are absolutely right — we owe a great deal to the Lean community, and we will make a stronger effort to explicitly acknowledge and credit the community's contributions in our future communications.
Jia Li (Apr 29 2025 at 03:04):
Jason Rute said:
Can you remind me how Moonshot AI and Numina are related? Is Moonshot = Kimina = Numina + Kimi?
Kimina-Prover is a collaboration between Moonshot AI and Numina to develop an IMO-level formal reasoning model. All our work is meant to be fully open-sourced in the next months, especially the dataset. Kimi is the name of the product developed by Moonshot. We plan to explain this in a blog post on the day of the full release.
Jia Li (Apr 29 2025 at 03:12):
Bhavik Mehta said:
David Renshaw said:
Bhavik Mehta said:
I notice some overlap between the contents of Compfiles and this repository, in fact some code appears to be directly copied from there without attribution.
Compfiles imported that problem (2024P5) from mathlib4/Archive. Let me know if there's something I could be doing to make the attribution more clear.
Ah, I misunderstood the order that these were done in - this is clear in Compfiles. However, I think CombiBench should make this clearer in the dataset, rather than burying the attribution in the paper and putting "new work" in the announcement.
Thank you for the feedback. We have updated our dataset card to be more explicit.
https://huggingface.co/datasets/AI-MO/CombiBench
We believe that the majority of the statements are novel, and that this does not compromise the originality of our work.
Jason Rute (Apr 29 2025 at 03:51):
The worry I have about taking a problem already formalized in mathlib/Archive or Compfiles is that many AI developers will not be careful about avoiding contamination. Probably nobody thinks about excluding such benchmarks from pretraining data, and for fine-tuning there have been cases like InternLM-Math where they trained on test problems formalized in Compfiles.
Jason Rute (Apr 29 2025 at 03:57):
As for @Bhavik Mehta’s concerns, regardless of whether pointing out two errors is unpaid labor, it seems (from his comments) that your benchmark may need a lot more fixing than a casual bug fix from a friendly volunteer on two problems. Again, this is a major and common problem with these benchmarks. It is so easy to make mistakes that ruin the integrity of the benchmark (much more so than with informal math benchmarks).
Junqi Liu (Apr 29 2025 at 04:40):
Jason Rute said:
As for Bhavik Mehta’s concerns, regardless of whether pointing out two errors is unpaid labor, it seems (from his comments) that your benchmark may need a lot more fixing than a casual bug fix from a friendly volunteer on two problems. Again, this is a major and common problem with these benchmarks. It is so easy to make mistakes that ruin the integrity of the benchmark (much more so than with informal math benchmarks).
Thank you for your suggestions. We understand that maintaining a benchmark is a long-term endeavor. As Zhengying mentioned earlier, our team has double-checked the work. However, we still cannot guarantee that there are no errors. As we also mention in the README, we welcome everyone to point out any potential issues through the issue tracker, and we will respond to each one. Regarding the style and metric issues raised by Joseph Myers, we will also give them serious consideration. For @Bhavik Mehta’s concerns, we will re-examine our benchmark.
Yaël Dillies (Apr 29 2025 at 06:44):
Joseph Myers said:
you should first add appropriate definitions to mathlib for whether four points in a CircularOrder correspond to non-intersecting chords or not; there must be a better way to express that than the very long IsBeautiful definition in that statement.
That was indeed the plan. Intersecting chords came up several times, and the second time I wrote a dedicated API, but I haven't yet pushed it towards mathlib
Bhavik Mehta (Apr 29 2025 at 09:41):
Jia Li said:
There may have been some misunderstanding.
The dataset you have published is from the Moonshot AI GitHub account. And Moonshot AI is a for-profit AI company, no?
Bhavik Mehta (Apr 29 2025 at 09:45):
Indeed Jason explains my concerns clearly: the errors I have seen and that Joseph has kindly pointed out are very similar both in frequency and form to those in earlier benchmarks, and significantly compromise the quality of this benchmark.
Jia Li (Apr 29 2025 at 09:58):
Bhavik Mehta said:
Jia Li said:
There may have been some misunderstanding.
The dataset you have published is from the Moonshot AI github account. And Moonshot AI is a for-profit AI company, no?
Yes, you are right. It is a joint development effort between the two teams. Moonshot is a for-profit company. But this does not change the fact that we are releasing all the results of this collaboration (currently partial results). At the very beginning, we agreed on open-sourcing the datasets and the models that we developed together. Hosting the page on Moonshot's account is a way to give them credit for their contribution.
Joseph Myers (Apr 29 2025 at 10:13):
Bhavik Mehta said:
Indeed Jason explains my concerns clearly: the errors I have seen and that Joseph has kindly pointed out are very similar both in frequency and form to those in earlier benchmarks, and significantly compromise the quality of this benchmark.
I'd go so far as to say that it would be more novel and interesting to see research that looks at why errors in formal problem statements are so common (covering both human and technical factors) and how to improve accuracy, than yet another paper about yet another formal benchmark.
Jason Rute (Apr 29 2025 at 11:46):
While @Bhavik Mehta isn't willing to share his experiment, I'm willing to share mine. There are 2.5 errors in the first three problems:
- apmo_1991_p2 doesn't assume the points are distinct, so it is trivially false. (I would be curious if a good AI could prove this is false.)
- apmo_2023_p1 seems to be correct, but note that the definitions of touches and touches_interior_or_edge are not symmetric (and I assume that was a mistake). The redeeming property of the theorem statement is that you consider both directions of touches_interior_or_edge, which I think (I'm about 98% sure) fixes the problem.
- balticway_2015_p7 forgot about the anti-reflexivity of graphs. When stating that a subset is a clique (both the two groups in the partition, as well as the board), you can't have l being adjacent to itself. You could use SimpleGraph.IsClique instead (see the sketch below). (As stated, the theorem is trivially true since the h_board hypothesis is a contradiction. And moreover, the only way one can prove this theorem is to use the fact that h_board is contradictory. Otherwise, the conclusion is impossible for the same reason. I wonder if, DeepSeek-Prover-style, a good AI could figure out that this hypothesis is contradictory.)
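A minimal sketch of that last issue (a toy statement, not the benchmark's actual one): a "clique" hypothesis that does not exclude x = y is contradictory for any nonempty set, because simple graphs are irreflexive, whereas SimpleGraph.IsClique only demands adjacency of distinct vertices.
import Mathlib

-- Toy version of the pitfall: adjacency demanded for all pairs, including x = x,
-- contradicts the looplessness of a SimpleGraph as soon as the set is nonempty.
example {V : Type*} (G : SimpleGraph V) (s : Set V) (hs : s.Nonempty)
    (h_bad : ∀ x ∈ s, ∀ y ∈ s, G.Adj x y) : False := by
  obtain ⟨x, hx⟩ := hs
  exact G.loopless x (h_bad x hx x hx)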
Jason Rute (Apr 29 2025 at 12:05):
But again, please assume there are many more errors which we didn't find. As for @Joseph Myers's point about why these errors are so common, I think it is a combination of multiple factors:
- You clearly didn't prove these theorems true. The first version of miniF2F took the time to formalize the proofs of a number of statements to guarantee correctness, and I think that is also what Harmonic (and probably AlphaProof) did with their own version of MiniF2F.
- Also, by not using standard definitions like SimpleGraph.IsClique you are reinventing the wheel and making it more likely you will get a mistake.
- There probably are ways around this, but it would involve writing unit-test style formalizations making sure the relevant parts are correct and keeping a list of common issues to check against. Distinctness is one such issue.
Eric Wieser (Apr 29 2025 at 12:16):
Eric Wieser said:
From the README:
It is released under permissive licenses Apache 2.0 for Lean 4.
The LICENSE file says it is MIT-licensed. Which is correct?
@Junqi Liu, ping on this.
Jason Rute (Apr 29 2025 at 12:34):
I think there is also a problem in the last problem, usamo_2000_p4 (which is the fourth problem I've looked at). If I understand correctly, the way you have formalized the problem is to find the cardinality of the smallest set containing a right triangle, which is trivially 3, and not the smallest number n such that every such set of that size contains a right triangle. So in the "without solution" mode, this theorem is impossible to prove as stated. (Again, thinking about how to find this error automatically, an AI could solve this trivial answer as 3 with a proof, which would automatically flag some bug since 3 is not the expected answer, so if correctly stated it would be impossible to prove it is 3.)
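Schematically (a hedged sketch with a placeholder predicate and hypothetical names standing in for the real board and right-triangle condition), the two readings differ only in the quantifier over sets:
import Mathlib

-- Placeholder predicate standing in for "this set of squares contains a right triangle"
-- (hypothetical, just for illustration).
-- Reading 1: sizes of *some* set with the property; its least element is trivially small.
def sizesOfSomeSet (ContainsRightTriangle : Finset (ℤ × ℤ) → Prop) : Set ℕ :=
  {n | ∃ s : Finset (ℤ × ℤ), s.card = n ∧ ContainsRightTriangle s}

-- Reading 2 (the intended one): sizes n such that *every* set of size n has the property.
def sizesForcingProperty (ContainsRightTriangle : Finset (ℤ × ℤ) → Prop) : Set ℕ :=
  {n | ∀ s : Finset (ℤ × ℤ), s.card = n → ContainsRightTriangle s}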
Junqi Liu (Apr 29 2025 at 12:54):
Eric Wieser said:
Eric Wieser said:
From the README:
It is released under permissive licenses Apache 2.0 for Lean 4.
The LICENSE file says it is MIT-licensed. Which is correct?
Junqi Liu, ping on this.
This has been fixed! It is MIT-licensed.
Junqi Liu (Apr 29 2025 at 15:43):
We are truly grateful for the errors @Jason Rute has pointed out to us. We have already corrected them and will gradually check each problem over the next two weeks. Where possible, we will also attempt to provide proofs for the problems. Meanwhile, we will ensure that these proofs do not appear in any training data, but are solely to guarantee the correctness of the formalization of the statements.
Kim Morrison (Apr 29 2025 at 16:26):
The thing I remain scared about here is that, given that Jason, Joseph, and Bhavik have been finding approximately one error per problem in the problems they have looked at, a bystander should conclude that nearly all the problems which no one (external) has looked at are also incorrect. They are not pointing out isolated examples of problems they have found, but are providing statistical evidence that almost everything is wrong --- and this evidence needs to be addressed systematically.
Joseph Myers (Apr 29 2025 at 21:49):
This also probably invalidates the leaderboard information, given the likelihood that AIs have actually discovered some problems that are made much easier by being incorrectly stated. Could the leaderboard be made to provide access to both lists of the problems claimed to be solved by each AI, and all the underlying Lean solutions (I couldn't see how to access this data at present), since seeing whether formal solutions actually correspond to the intended informal problem is a useful way of detecting formalization mistakes?
Joseph Myers (Apr 29 2025 at 21:50):
There were a few problems where I only commented on choices made without assessing whether the formal statement was correct, but having looked at a few more problems on which I was a coordinator, I think I believe the statistical evidence is indeed that almost all the problems are incorrect.
Joseph Myers (Apr 29 2025 at 21:53):
- The EGMO 2022 P5 formal statement defines PerfectCover in a way that distinguishes permutations of the labels on the dominoes, and distinguishes the two squares of each domino. Thus, the number of ways of covering the board gets multiplied by an extra factor compared to the intended number in the formal statement. That's obviously no good for a problem concerned with when that number of ways is odd.
Joseph Myers (Apr 29 2025 at 21:57):
- The IMO 2023 P5 formal statement defines a directed graph triangleGraph used to determine permitted ninja paths. But the definition in the formal statement has edges pointing both up and down (so would allow ninja paths that mix edges up and down), whereas the problem requires ninja paths to go down only. Also, the with_solution version of that problem statement gives the wrong answer; Real.log is not a base-2 logarithm.
Joseph Myers (Apr 29 2025 at 22:03):
- The formal statement of IMO 2014 P6 defines finite_regions as a predicate on sets of points S in the plane, requiring such a set to be nonempty, bounded, and, for each line in the given set of lines, either all points of S must lie in that line, or all must lie on the same side of that line. For example, any singleton set would satisfy that predicate. The formal statement of the problem is then trivially false given how finite_regions is true for any singleton set; it doesn't come anywhere close to what the informal statement actually asks (about taking the finite regions determined by all lines and looking at the colours of lines on their boundaries).
Joseph Myers (Apr 29 2025 at 22:05):
And a couple of bonus incorrect statements for problems I wasn't a coordinator on:
Joseph Myers (Apr 29 2025 at 22:06):
- The one non-IMO problem using dist, brualdi_ch3_18, seems to have the same issue with using the wrong metric as the other problems I listed before.
Joseph Myers (Apr 29 2025 at 22:10):
- In the IMO 2005 P6 formal statement, the hypothesis (h' : ∀ i, (solved i).card < Fintype.card participants) looks like "each problem had some contestant who failed to solve it". That's not the same thing as "Moreover, no contestant solved all the 6 problems." from the informal statement (solved here maps problems to Finsets of contestants who solved them).
Joseph Myers (Apr 29 2025 at 22:15):
Writing a formal solution based on a known informal solution to a problem is indeed a very good way of detecting mistakes in the formal statement. It's also incredibly time-consuming to do on the scale of 100 problems; my experience is that a combinatorics problem doesn't need to be that hard for a formal solution to be 1000 lines or more of Lean. (The first olympiad combinatorics problem I formalized a solution to ended up as 3000 lines of Lean 3, though I expect that was longer than necessary through inexperience.)
Jason Rute (Apr 29 2025 at 23:13):
Clearly, there is a lot of cleanup needed. I hope the authors do it and rerun the benchmark. And I hope they maintain it and version it properly going forward, noting which version each model was run with (including which version of Lean, Mathlib, and the benchmark itself).
Jason Rute (Apr 29 2025 at 23:14):
I may have been ambitious to suggest formalizing each proof. It would be good to get guidance from those with experience in this sort of thing on how to avoid mistakes.
Jason Rute (Apr 29 2025 at 23:32):
Overall, I think this benchmark likely won't have a lot of impact. It will be good for R&D, but as a goal post it has all the same problems that MiniF2F and PutnamBench have:
- The problems are old and found on the internet, probably even in various natural language math training sets (maybe even in Numina Math 1.5). I think leakage (where the model is aware of the informal proof) will be common.
- In some cases, there are even already formal proofs on the internet, and I suspect @David Renshaw plans to formalize many more of these problems, especially the IMO ones, in Compfiles. This leads to even more leakage, and of the worst kind, where the formal proof is available on the open internet.
- As models get better on these benchmarks (like they are on the informal math benchmarks), there will be a lot of distrust. Does it do well because it has memorized the proof (or the numerical solution at least), or because it is really reasoning?
- Competition math isn't really that relevant to what formal math is used for. It is just a convenient benchmark.
Jason Rute (Apr 29 2025 at 23:32):
I think going forward, the following two types of benchmarks will have more value:
- Benchmarks that are more closely tied to the needs of formal math: autoformalization, provably correct code, and research math.
- Benchmarks that are resistant to data leakage. For natural language competition math, the new standard is matharena.ai, a project by researchers at ETH where they run models on current math competitions immediately after the competition (which avoids leakage). @Mislav Balunović has already stated that he is considering a formal version of the matharena benchmark. I think Lean experts here should partner with him on that (or make their own version of that benchmark). I could even imagine a near future (2026?) where the standard competitions (USAMO, IMO, Putnam, etc.) release formal Lean versions of all their problems.
Joseph Myers (Apr 29 2025 at 23:50):
For the informal leakage issue, it's always possible for someone to use such a benchmark as an autoformalization benchmark - give an autoformalization AI the formal statement and the informal solution - even if it's not what the benchmark was designed or intended for. (That's still at risk of leakage from training materials including existing formal solutions.)
Joseph Myers (Apr 29 2025 at 23:56):
I like the idea of competitions releasing formal statements of their own problems. But there are a lot of competitions (plenty of which don't even have websites that get promptly updated with the official informal statements after the competition); to have several competitions releasing formal problem statements (and thus several different groups of people releasing them, not all with a great deal of Lean expertise), you really need a better solution (that can be applied by non-experts) to getting correct (even if not perfectly idiomatic) Lean problem statements. If several competitions release Lean versions of their problems where most of the formal statements are wrong, that's not very useful for benchmarking any kind of AI other than an interactive "tell me what I got wrong in this formal statement" autoformalizer for statements.
Jason Rute (Apr 29 2025 at 23:57):
Joseph Myers said:
For the informal leakage issue, it's always possible for someone to use such a benchmark as an autoformalization benchmark - give an autoformalization AI the formal statement and the informal solution - even if it's not what the benchmark was designed or intended for. (That's still at risk of leakage from training materials including existing formal solutions.)
I totally agree! I discussed this specifically here as a Tier 0 of my ambitious autoformalization benchmark idea where one formalizes entire mathematical arguments. Indeed ProofNet and MiniF2F are already ready for this Tier 0 autoformalization benchmark since they have informal proofs as well. This was run on MiniF2F a few years ago as part of the Draft-Sketch-Prove paper. In that paper, providing the human proof didn't help much at all, but I think with better autoformalizers it could help a lot. If not, we will discover that autoformalization is the current bottleneck.
Jason Rute (Apr 30 2025 at 00:09):
As for getting (at least some of the major) math competitions to have high-quality formalized translations, if anyone in the world could make it happen, it is you @Joseph Myers! :smile:
Junqi Liu (Apr 30 2025 at 02:41):
Joseph Myers said:
This also probably invalidates the leaderboard information, given the likelihood that AIs have actually discovered some problems that are made much easier by being incorrectly stated. Could the leaderboard be made to provide access to both lists of the problems claimed to be solved by each AI, and all the underlying Lean solutions (I couldn't see how to access this data at present), since seeing whether formal solutions actually correspond to the intended informal problem is a useful way of detecting formalization mistakes?
We list most of the 7 solved problems at the end of our paper. There are two problems (brualdi_ch7_7 and brualdi_ch14_33) for which we did not post the solutions, as these two problems can be proved very easily with a one-step theorem.
Niels Voss (Apr 30 2025 at 23:22):
Out of curiosity, how do you plan to grade problems that require exhibiting certain sums? I picked a random problem in the benchmark:
import Mathlib
abbrev brualdi_ch5_9_solution : ℕ → ℤ := sorry
/--
Evaluate the sum $\sum_{k=0}^{n}(-1)^{k}\binom{n}{k} 10^{k}$.
-/
theorem brualdi_ch5_9 (n : ℕ) : ∑ k ∈ Finset.range (n + 1), (-1 : ℤ) ^ k * (n.choose k) * 10 ^ k =
    brualdi_ch5_9_solution n := by sorry
where part of the solution involves exhibiting a function ℕ → ℤ. To me, it seems that an AI could just do
import Mathlib
abbrev brualdi_ch5_9_solution : ℕ → ℤ :=
fun n => ∑ k ∈ Finset.range (n + 1), (-1 : ℤ) ^ k * (n.choose k) * 10 ^ k
/--
Evaluate the sum $\sum_{k=0}^{n}(-1)^{k}\binom{n}{k} 10^{k}$.
-/
theorem brualdi_ch5_9 (n : ℕ) : ∑ k ∈ Finset.range (n + 1), (-1 : ℤ) ^ k * (n.choose k) * 10 ^ k =
    brualdi_ch5_9_solution n := by rfl
Would the solution for brualdi_ch5_9_solution need to be in some sort of closed form?
This isn't really that big of a deal since it's quite easy for a human to just check manually that the function generated by the AI is simple enough to count as a solution.
Junqi Liu (May 01 2025 at 14:30):
@Niels Voss You can read our evaluation code or our paper; we considered this situation.
Pawan sasanka ammanamanchi (May 07 2025 at 11:08):
The CombiBench paper was recently released. As someone trying to do justice to evaluation and to measuring model performance on theorem proving: any thoughts from people here who are more familiar with the issues with Lean benchmarks, on whether the Fine-Eval method they propose is everything you would want to see?
Paper: https://arxiv.org/abs/2505.03171
Jason Rute (May 07 2025 at 11:18):
Wow, this is the most explicit cheat I've seen so far! Just comment out the theorem and it passes all the tests!!!! :laughter_tears: But this is yet another reason we need a better (and official) checking mechanism like I propose here: #lean4 > A better Lean checker
Jason Rute (May 07 2025 at 11:25):
And to be clear, their evaluation strategy is not enough. It wouldn't guard against the DeepSeek bug either. The simplest extra thing they should do is to append #print axioms <theorem_name> at the end. That will make sure the theorem exists (which would guard against deleting the theorem and other bugs/exploits that cause the theorem to get dropped from the environment), and they could check the output to make sure it only uses the three good axioms. But it wouldn't guard against this kind of exploit:
local instance instNatAdd : Add Nat := ⟨Nat.mul⟩
theorem foo : 1 + 0 = 0 := by rfl
(or some other exploits using metaprogramming). They should look into using @Rob Lewis's autograder and @GasStationManager's SafeVerify in the meantime. I think that would be a pretty good solution.
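For reference, a minimal sketch of what the suggested check looks like on an honest proof (the exact axioms printed depend on which lemmas the proof happens to use, but they should be a subset of the three standard ones):
import Mathlib

-- An honest proof of a benign statement...
theorem demo_theorem : 2 + 2 = 4 := by norm_num

-- ...and the suggested check. For a clean proof this should report at most the three
-- standard axioms (propext, Classical.choice, Quot.sound), and in particular no
-- sorryAx, Lean.ofReduceBool, or other suspicious axioms.
#print axioms demo_theorem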
Jason Rute (May 07 2025 at 11:31):
cc @Junqi Liu
Jason Rute (May 07 2025 at 11:51):
Another easy thing to do is to append this at the very end (in addition to #print axioms)
set_option pp.all true
#check <theorem_name>
#print <new_definition_used_in_problem>
which would check that the theorem statement and any new definitions/axioms used in the problem statement haven't changed. Making the pretty printer fully explicit with set_option pp.all true better guards against exploits like the local instance one above. (But again, I wish there were a more one-size-fits-all official solution for these problems.)
Jason Rute (May 07 2025 at 12:07):
(Also, I can empathize with the model and why it tried to cheat on that problem. That is one of the problems we pointed out was false above. I'd love to see the chain-of-thought to see if the model realized it was false and worthless to try to prove it, so it just commented it out. :laughter_tears:)
Jason Rute (May 07 2025 at 12:45):
Also, I just saw that the solutions in Appendix 3 all use native_decide. That is a questionable tactic. It bypasses the kernel, performing the computation with compiled native code instead of kernel reduction. It needs special axioms to allow this, and lean4checker won't check a proof using it. It is known that you can use it to fairly easily prove False. (I'm not sure, however, if there are any known ways that don't involve making new externally implemented definitions. It might be that if you are using standard definitions, it is fine in practice.) Hence, many consider it dangerous to allow it. You can check if it is used with #print axioms. If the proof works with decide in place of native_decide, that is better.
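A small hedged illustration of that swap: both lines below prove the same tiny fact, but only the second relies on the native-evaluation axiom, which #print axioms would reveal.
import Mathlib

-- Kernel-checked computation: no axioms beyond the standard ones.
example : Nat.choose 10 3 = 120 := by decide

-- Same fact via native code: faster on big computations, but it adds the
-- Lean.ofReduceBool axiom and the kernel never sees an actual proof of the equality.
example : Nat.choose 10 3 = 120 := by native_decide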
Eric Wieser (May 07 2025 at 13:01):
You can trick native_decide with @[csimp]
Jason Rute (May 07 2025 at 13:09):
I made a list of ways that native_decide can be exploited here: #general > Using `native_decide` to prove False?
Junqi Liu (May 08 2025 at 04:41):
Jason Rute said:
Wow, this is the most explicit cheat I've seen so far! Just comment out the theorem and it passes all the tests!!!! :laughter_tears: But this is yet another reason we need a better (and official) checking mechanism like I propose here: #lean4 > A better Lean checker
The situation you mentioned is avoided by our checks. Please refer to the fourth point in our requirements, which states that the Lean 4 code extracted from the model's response, after removing all comments, should meet the four requirements illustrated in the diagram.
Junqi Liu (May 08 2025 at 04:44):
Jason Rute said:
And to be clear, their evaluation strategy is not enough. It wouldn't guard against the DeepSeek bug either. The simplest extra thing they should do is to append
#print axioms <theorem_name> at the end. That will make sure the theorem exists (which would guard against deleting the theorem and other bugs/exploits that cause the theorem to get dropped from the environment), and they could check the output to make sure it only uses the three good axioms. But it wouldn't guard against this kind of exploit: local instance instNatAdd : Add Nat := ⟨Nat.mul⟩; theorem foo : 1 + 0 = 0 := by rfl (or some other exploits using metaprogramming). They should look into using Rob Lewis's autograder and GasStationManager's SafeVerify in the meantime. I think that would be a pretty good solution.
Can you explain what the Deepseek bug you mentioned is?
Niels Voss (May 08 2025 at 06:45):
I don't understand the specifics of the DeepSeek bug, but it's discussed in detail here:
I think it's that in older versions of Lean, using apply? in some circumstances involving universes can cause the theorem to get deleted instead of erroring out, which meant that if you were just checking for error messages you wouldn't find any, although someone should please double check me on this.
Junqi Liu (May 08 2025 at 08:30):
(deleted)
Junqi Liu (May 08 2025 at 10:00):
Jason Rute said:
And to be clear, their evaluation strategy is not enough. It wouldn't guard against the DeepSeek bug either. The simplest extra thing they should do is to append
#print axioms <theorem_name> at the end. That will make sure the theorem exists (which would guard against deleting the theorem and other bugs/exploits that cause the theorem to get dropped from the environment), and they could check the output to make sure it only uses the three good axioms. But it wouldn't guard against this kind of exploit: local instance instNatAdd : Add Nat := ⟨Nat.mul⟩; theorem foo : 1 + 0 = 0 := by rfl (or some other exploits using metaprogramming). They should look into using Rob Lewis's autograder and GasStationManager's SafeVerify in the meantime. I think that would be a pretty good solution.
Thank you very much! Your suggestion is great, and we will incorporate it into our checks right away. And so far, we haven't found the model using this method to cheat in our experimental results. Perhaps the current model's capabilities are not yet sufficient to write such an advanced trick as a local instance. :+1:
Jason Rute (May 08 2025 at 10:49):
Junqi Liu said:
Jason Rute said:
Wow, this is the most explicit cheat I've seen so far! Just comment out the theorem and it passes all the tests!!!! :laughter_tears: But this is yet another reason we need a better (and official) checking mechanism like I propose here: #lean4 > A better Lean checker
The situation you mentioned is avoidable in our checks. Please refer to the fourth point in our requirements, which states that the Lean 4 code extracted from the model's response, after removing all comments, should meet the four requirements illustrated in the diagram.
To be clear, I know it was avoided in your checks, but I thought it was funny and (more importantly) think it shows the subtleties involved here. Your checks are quite ad-hoc, and while they now catch this, they could easily miss other stuff. I think it is good for others to be aware of this exploit to avoid it. Did you also see this behavior in training Kimina-Prover, and if so did you have checks on it then?
Jason Rute (May 08 2025 at 11:04):
Junqi Liu said:
Can you explain what the Deepseek bug you mentioned is?
The DeepSeek bug was a bug in Lean < 4.13.0 (and in particular in Lean 4.9.0, which is what DeepSeek used) where apply? with a type jump in the proof caused Lean to drop the theorem from the kernel without an error message.
I called it the DeepSeek bug, since it came out in the results of the DeepSeek-Prover paper, where the model seems to have discovered that exploit. It comes up in one of their claimed MiniF2F solutions and two of the 7B PutnamBench solutions in the paper. (I can't tell if they claimed, without realizing it was a bug, that this pattern was common in their 7B model.) You can read more about it here:
- #lean4 > apply? might suppress other warnings/errors (description of the bug)
- #Machine Learning for Theorem Proving > DeepSeek-Prover V2 TODO List (compact summary of the issues in the DeepSeek-Prover V2 paper)
The reason I make a big deal about this bug is that most of the standard checks (including, I think, all the checks you mention in the CombiBench paper, as well as other gold-standard checks like lean4checker) would not have caught it. Also, unlike other known exploits, it used standard tactics inside a proof, so it was even more subtle and hard to find. I hope the Lean community uses this as a way to come up with a set of better non-ad-hoc solutions (something like SafeVerify but officially supported).
Junqi Liu (May 08 2025 at 11:07):
Jason Rute said:
Junqi Liu said:
Can you explain what the Deepseek bug you mentioned is?
The DeepSeek bug was a bug in Lean < 4.13.0 (and in particular in Lean 4.9.0, which is what DeepSeek used) where apply? with a type jump in the proof caused Lean to drop the theorem from the kernel without an error message.
I called it the DeepSeek bug, since it came out in the results of the DeepSeek-Prover where the model seems to have discovered that exploit. It comes up in one of their claimed MiniF2F solutions and two of the 7B Putnam Bench solutions in the paper. (I can't tell if they claimed, without realizing it was a bug, that this pattern was common in their 7B model.) You can read more about it here:
- #lean4 > apply? might suppress other warnings/errors (description of the bug)
- #Machine Learning for Theorem Proving > DeepSeek-Prover V2 TODO List (compact summary of the issues in the DeepSeek-Prover V2 paper)
The reason I make a big deal about this bug is that most of the standard checks (including, I think, all the checks you mention in the CombiBench paper, as well as other gold-standard checks like lean4checker) would not have caught it. Also, unlike other known exploits, it used standard tactics inside a proof, so it was even more subtle and hard to find. I hope the Lean community uses this as a way to come up with a set of better non-ad-hoc solutions (something like SafeVerify but officially supported).
Yes! However, perhaps we could monitor the feedback from Lean and prohibit the appearance of the keyword "Try this" in the feedback. Do you think this approach could avoid the DeepSeek bug?
Jason Rute (May 08 2025 at 11:10):
It is only a coincidence that this bug uses apply? which also prints "Try this". The next such bug may not. Also exact? is a "Try this" tactic, but it is really powerful and AlphaProof uses it. I think one needs a more principled approach to this sort of stuff, which I think the two tools SafeVerify and autograder are (although I would prefer a more official tool).
Jason Rute (May 08 2025 at 11:13):
Also, while we are talking about exploits to avoid, here is the worst exploit I know of (thanks to @Robin Arnez in #general > Using `native_decide` to prove False? for pointing it out):
set_option debug.skipKernelTC true in
theorem innocent_lemma [Add Nat] : 1 + 1 = 2 := by
  simp

theorem f : False := by
  have := @innocent_lemma ⟨fun _ _ => 0⟩
  simp only [instHAdd] at this
  contradiction
The ad-hoc print checks I suggested won't work:
set_option pp.all true
#check f -- f : False
#print axioms f -- 'f' depends on axioms: [propext]
But SafeVerify and lean4checker would catch it. (I haven't checked if @Rob Lewis's autograder would catch it as well.)
Jason Rute (May 08 2025 at 11:16):
I haven't found an exploit yet that SafeVerify doesn't catch. The only question now is whether it is fast enough for practical applications (cc @GasStationManager).
Jason Rute (May 08 2025 at 11:19):
...And if it covers all the usage patterns in something like CombiBench.
Junqi Liu (May 08 2025 at 11:21):
Jason Rute said:
It is only a coincidence that this bug uses
apply? which also prints "Try this". The next such bug may not. Also exact? is a "Try this" tactic, but it is really powerful and AlphaProof uses it. I think one needs a more principled approach to this sort of stuff, which I think the two tools SafeVerify and autograder are (although I would prefer a more official tool).
Fine! But our Kimina Lean Server can catch this failure. Maybe there's just an issue with DeepSeek's Lean server.
Jason Rute (May 08 2025 at 11:25):
It has to do with versions. DeepSeek uses Lean 4.9.0. I assume you use something >= 4.13.0. The bug, while not discovered until now, only affects older versions of Lean.
Jason Rute (May 09 2025 at 00:22):
@Junqi Liu The final proof in the appendix of brualdi_ch6_11 is wrong for two reasons. First, I think the answer in the paper is wrong, or at least a different answer works for the native_decide calculation at the end. Also, the proof of the lemma h1 is not finished, at least on the version of Lean/Mathlib at https://live.lean-lang.org/ where I tested it. Does it work for you?
Jason Rute (May 09 2025 at 00:23):
(If it does work in your version of Lean and not the one I was using, maybe some simp lemmas changed in that time.)
Junqi Liu (May 09 2025 at 01:56):
It works for me. It should be noted that all our training and testing were conducted using version v4.15.0.
import Mathlib

abbrev brualdi_ch6_11_solution : ℕ := 24024

theorem brualdi_ch6_11
    (sols : Finset (Equiv.Perm (Finset.Icc 1 8)))
    (h_sols : ∀ σ, σ ∈ sols ↔ (∀ i, Even i.1 → σ i ≠ i)) :
    sols.card = brualdi_ch6_11_solution := by
  have h1 : sols = Finset.filter (fun σ => ∀ i, Even i.1 → σ i ≠ i) (Finset.univ) := by
    ext σ
    simp [h_sols]
  rw [h1]
  native_decide
Junqi Liu (May 09 2025 at 01:59):
For native_decide, if Lean can deal with some simple computations, then why don't we let it compute them?
Aaron Liu (May 09 2025 at 02:12):
Junqi Liu said:
For native_decide, if Lean can deal with some simple computations, then why don't we let it compute them?
Since native_decide can be unsound (see #general > Using `native_decide` to prove False?)
Jason Rute (May 09 2025 at 02:34):
As @Aaron Liu says, native_decide is a double-edged sword. It is fast because it uses machine data structures like bigints, arrays, etc. to do computation, but they come with a cost, since the interface to them is more likely to be wrong. Moreover, since you can add your own external implementation of functions, you can easily make your own proofs of False. (And there are some other examples in that thread.) People do use native_decide, but at their own peril. If used carefully, it is probably correct and faster than pure variants in Lean (at least right now).
Junqi Liu (May 09 2025 at 02:35):
Jason Rute said:
As Aaron Liu says, native_decide is a double-edged sword. It is fast because it uses machine data structures like bigints, arrays, etc. to do computation, but they come with a cost, since the interface to them is more likely to be wrong. Moreover, since you can add your own external implementation of functions, you can easily make your own proofs of False. (And there are some other examples in that thread.) People do use native_decide, but at their own peril. If used carefully, it is probably correct and faster than pure variants in Lean (at least right now).
Agree! I think it makes sense for tasks like computing cardinalities or when dealing with numerical values.
Jason Rute (May 09 2025 at 02:41):
And as for brualdi_ch6_11, I must have copy-and-pasted it wrong from the paper (which involved a bit of cleanup). The version above works fine on https://live.lean-lang.org/ with 24024 as well as on my laptop.
GasStationManager (May 09 2025 at 02:54):
I do wonder whether part of what native_decide does, like fast numerical computation, can become verified (using only the standard axioms).
Jason Rute (May 09 2025 at 02:57):
Yes, tactics like norm_num (and I think the upcoming grind tactic) are very good and don’t use anything external. Also, the four color theorem proof in Coq is a great example of good engineering to speed up a proof by computation.
Jason Rute (May 09 2025 at 03:02):
And there is another practical reason why native_decide is a problem for benchmarks. A lot of the talk above was about how to verify proofs correctly without worrying about exploits or bugs. You just can’t verify a native_decide proof in the same way. lean4checker, SafeVerify, etc. won’t work because the proof just doesn’t exist in full (because native_decide doesn’t generate a kernel proof).
Robin Arnez (May 09 2025 at 05:35):
I don't really think numerical computation is necessarily much slower without native_decide -- at least most basic operations (addition, subtraction, multiplication, division, modulo, power, greatest common divisor, equality, inequality, bitwise and, bitwise or, bitwise xor, shift left, shift right) on natural numbers do use bigints, even when just using reduction. The main things that verification using decide struggles with are:
- well-founded recursion (the nemesis of decide and rfl proofs): makes any kind of reduction slow and often impossible
- long iteration: decide doesn't really work well with e.g. deep recursion because it doesn't seem like it has good tail-call strategies
- other blockers: opaque, @[irreducible] and casts (as in ▸) can often block reduction completely while the compiler has no problem with them (example: Rat.add is irreducible).
But, in fact, it is still useful enough that grind leaves a lot of the work to rfl (kernel level unification).
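For instance (a small sketch of the bigint point above, in plain core Lean with no imports): the kernel has accelerated arithmetic for Nat literals, so even a computation on very large numbers goes through with plain decide, no native code needed.
-- Kernel-checked, no native_decide: the kernel's GMP-backed Nat arithmetic handles the
-- large literals directly.
example : 10 ^ 40 + 1 > 10 ^ 40 := by decide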
Jason Rute (May 09 2025 at 10:26):
Then I look forward to when we have a pure tactic which is more or less a drop-in replacement for native_decide. @Robin Arnez is grind going to be that tactic?
Henrik Böving (May 09 2025 at 11:10):
native_decide performs native computation that the kernel then believes; there are just certain problems that are not tractable without this. grind is not a term evaluator; it's an SMT-style tactic.
Last updated: Dec 20 2025 at 21:32 UTC