Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: LeanCat benchmark (category theory in Lean 4)
Rongge (Jan 04 2026 at 17:28):
New preprint + repo: LeanCat (Part I: 1-categories) — 100 statement-level Lean 4 problems for abstraction-heavy, library-grounded category-theory reasoning.
Paper: https://arxiv.org/pdf/2512.24796 Repo: https://github.com/sciencraft/LeanCat
Pinned toolchain: Lean 4.19.0 / mathlib 4.19.0; includes NL descriptions + metadata (Easy/Med/High = 20/42/38).
Baselines: best 8.25% pass@1 / 12.00% pass@4 (sharp drop on Medium/High).
Feedback welcome: (i) pointers to closely related work / similar benchmark efforts; (ii) best practices for evaluation + reproducibility (toolchain pinning, scripts, logging, etc.).
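For readers comparing the pass@1 / pass@4 numbers across papers: such metrics are often computed with the unbiased pass@k estimator from the HumanEval paper (whether LeanCat reports the raw fraction or this estimator is not restated here). A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts (c correct) passes."""
    if n - c < k:
        return 1.0  # fewer failures than samples: at least one pass is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 8 attempts on a problem, 1 of them correct:
print(pass_at_k(8, 1, 4))  # pass@4 estimate for that problem
```

Per-benchmark pass@k is then the mean of this quantity over problems.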
Rongge (Jan 04 2026 at 17:47):
Cross-post: LeanCat (Part I: 1-categories) — a Lean 4 benchmark of 100 statement-level category-theory tasks stressing abstraction + library navigation.
Paper: https://arxiv.org/pdf/2512.24796 Repo: https://github.com/sciencraft/LeanCat
Pinned Lean 4.19.0 / mathlib 4.19.0; metadata includes topic clusters + difficulty tiers (20/42/38); best 8.25% pass@1 / 12.00% pass@4.
Pointers welcome on robust eval/reproducibility protocols for tool-augmented proving (retrieval + compiler feedback); main thread: LeanCat announcement.
Eric Wieser (Jan 04 2026 at 17:54):
It seems a little strange to release the first version of the benchmark at a toolchain that is 7 months old; do you plan to update it?
Notification Bot (Jan 04 2026 at 17:56):
A message was moved here from #mathlib4 > LeanCat: a Lean 4 benchmark suite for category theory by Eric Wieser.
Notification Bot (Jan 04 2026 at 17:56):
A message was moved from this topic to #Machine Learning for Theorem Proving > LeanCat benchmark (category theory in Lean 4) by Eric Wieser.
Eric Wieser (Jan 04 2026 at 17:57):
(let's keep discussion in one place)
Jason Rute (Jan 04 2026 at 18:16):
Some questions:
- Are these new problems, or problems from Mathlib? The first one is literally in mathlib docs#CategoryTheory.NatTrans.id_comm, so any model trained on mathlib or which can use Mathlib search tools should get it, right?
- The last problem maybe isn't in Mathlib as-is, but are all the relevant parts in Mathlib? Does one just need to do definition chasing and Mathlib search to find the answer?
- If these are not theorems from Mathlib, how do you know they are true and formalized correctly? Did you formalize them by hand? Did you have an extensive review process? (For example, I'm a bit surprised that the third condition of the last problem (CAT_statement_S0100) doesn't depend on Foradj, but I don't know this area of math well, so maybe that is a well-known fact.)
- What does this benchmark add that other benchmarks don't? It seems easier than PutnamBench and MiniF2F, right? (The fact that you didn't use any Lean-specific baselines makes it a bit hard to compare, and Lean-specific LLM provers are rarely evaluated at pass@1.)
- Or is the goal of this project to make a benchmark with advanced math or more definition/diagram chasing?
- Do you think a SoTA Lean model (anything which does well on PutnamBench) would struggle on this benchmark? (My largest worry is that a SoTA model gets >90% on this benchmark immediately.)
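For concreteness on the "literally in mathlib" point: an Easy-style item of roughly this shape closes immediately once the library is in scope (sketch only; the exact benchmark statement and the direction of `NatTrans.id_comm` are not reproduced here):

```lean
import Mathlib

open CategoryTheory

-- Rough shape of the first (Easy) item discussed above. Identity natural
-- transformations cancel, so `simp` (or a library search surfacing the
-- mathlib lemma) closes it directly.
example {C D : Type*} [Category C] [Category D] {F G : C ⥤ D} (α : F ⟶ G) :
    𝟙 F ≫ α = α ≫ 𝟙 G := by
  simp
```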
Rongge (Jan 04 2026 at 18:37):
Eric Wieser said:
It seems a little strange to release the first version of the benchmark at a toolchain that is 7 months old; do you plan to update it?
Thanks for flagging this. We pinned Lean/mathlib 4.19 for reproducibility of the v1 results, and because the current eval harness is still tied to a LeanExplore-based retrieval + compiler-feedback loop on that toolchain. We'd like to support a more recent mathlib/Lean release as well, but I can't promise a timeline until we decouple/update the retrieval backend. If there's a community-preferred pattern here (e.g. frozen vs rolling), pointers would be very appreciated.
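For what it's worth, the usual frozen-pin pattern in a Lean project is a one-line `lean-toolchain` file plus a tagged mathlib dependency in the lakefile; a sketch (illustrative file contents, not the repo's actual configuration):

```lean
-- lean-toolchain (one line):
--   leanprover/lean4:v4.19.0

-- lakefile.lean (sketch; the package name is illustrative):
import Lake
open Lake DSL

package leancat

require mathlib from git
  "https://github.com/leanprover-community/mathlib4" @ "v4.19.0"
```

A "rolling" setup instead tracks mathlib's master and re-pins on each release.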
Rongge (Jan 04 2026 at 19:02):
Jason Rute said:
Some questions:
- Are these new problems, or problems from Mathlib? The first one is literally in mathlib docs#CategoryTheory.NatTrans.id_comm, so any model trained on mathlib or which can use Mathlib search tools should get it, right?
- The last problem maybe isn't in Mathlib as-is, but are all the relevant parts in Mathlib? Does one just need to do definition chasing and Mathlib search to find the answer?
- If these are not theorems from Mathlib, how do you know they are true and formalized correctly? Did you formalize them by hand? Did you have an extensive review process? (For example, I'm a bit surprised that the third condition of the last problem (CAT_statement_S0100) doesn't depend on Foradj, but I don't know this area of math well, so maybe that is a well-known fact.)
- What does this benchmark add that other benchmarks don't? It seems easier than PutnamBench and MiniF2F, right? (The fact that you didn't use any Lean-specific baselines makes it a bit hard to compare, and Lean-specific LLM provers are rarely evaluated at pass@1.)
- Or is the goal of this project to make a benchmark with advanced math or more definition/diagram chasing?
- Do you think a SoTA Lean model (anything which does well on PutnamBench) would struggle on this benchmark? (My largest worry is that a SoTA model gets >90% on this benchmark immediately.)
Fair questions — quick answers:
• It’s a mix by design: a few Easy items are literally in mathlib (calibration for library navigation / interface familiarity); most aren’t “look up this theorem”.
• Even when ingredients exist in mathlib, the point is composing definitions/lemmas across abstractions (diagram/definition chasing in a big library), not contest-style tricks.
• For items not already in mathlib, we wrote the statements in Lean and had them sanity-checked by multiple category theorists; for harder ones we also attach sources/citations (see per-problem refs in metadata.json).
• LeanCat’s goal is different from miniF2F / PutnamBench: those are competition-flavored, while we target abstraction-heavy, library-grounded category theory.
• On “instant saturation”: current generic LLM baselines are far from it (~8% pass@1 / ~12% pass@4 overall; sharp drop on Medium/High).
• Happy to hear suggestions on stronger Lean-specific baselines and best practices for eval/reproducibility here.
Joël Riou (Jan 04 2026 at 19:15):
There is a mistake in problem 76: in Nonempty ((Pro (FintypeCat)) ≃ Profinite), ≃ (bijection) must be replaced by ≌ (equivalence of categories).
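For readers skimming, the distinction is between `Equiv` and `CategoryTheory.Equivalence`; a minimal sketch of the two notations (not reproducing the `Pro (FintypeCat)` side of the actual problem):

```lean
import Mathlib

open CategoryTheory

-- `≃` is `Equiv`, a bijection between types; `≌` is
-- `CategoryTheory.Equivalence`, an equivalence of categories.
-- Problem 76 needs the latter.
example : Profinite ≃ Profinite := Equiv.refl _
example : Profinite ≌ Profinite := CategoryTheory.Equivalence.refl
```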
Rongge (Jan 04 2026 at 19:18):
FYI: please reply in the discussion thread: ML thread. I’ll follow up there.
Rongge (Jan 04 2026 at 19:25):
Joël Riou said:
There is a mistake in problem 76: in Nonempty ((Pro (FintypeCat)) ≃ Profinite), ≃ (bijection) must be replaced by ≌ (equivalence of categories).
Thanks a lot for catching this — you’re absolutely right. I’ll patch the statement + metadata.json; feel free to open an issue/PR if you prefer.
Rongge (Jan 04 2026 at 19:26):
Announcement/entry point: mathlib4 post. Let's keep discussion in this thread.
Jason Rute (Jan 04 2026 at 19:27):
@Eric Wieser Sorry. I posted in the wrong thread. I thought it was the other one. Can you move it? (I can't move between channels.)
Jason Rute (Jan 04 2026 at 19:42):
Rongge said:
On “instant saturation”: current generic LLM baselines are far from it (~8% pass@1 / ~12% pass@4 overall; sharp drop on Medium/High).
I think your baselines are very weak (even if the models used are strong models overall), including that you are using 1 to 4 model calls per problem, which is rare in this field. It would be nice to use some Lean-specific models. It might be that the way these problems are stated, with many in-file variables, abbrevs, and defs, will also present novel challenges, especially to models that were trained explicitly on the very simple minif2f format, but that is unclear.
Jason Rute (Jan 04 2026 at 19:45):
The question I'm trying to figure out is whether there is something inherently challenging, mathematically or Lean-wise, about this benchmark. If any mathlib category theory experts are in this thread: do you think you would be pleasantly surprised if a tool filled in these sorrys?
Rongge (Jan 04 2026 at 19:53):
Jason Rute said:
Rongge said:
On “instant saturation”: current generic LLM baselines are far from it (~8% pass@1 / ~12% pass@4 overall; sharp drop on Medium/High).
I think your baselines are very weak (even if the models used are strong models overall), including that you are using 1 to 4 model calls per problem, which is rare in this field. It would be nice to use some Lean-specific models. It might be that the way these problems are stated, with many in-file variables, abbrevs, and defs, will also present novel challenges, especially to models that were trained explicitly on the very simple minif2f format, but that is unclear.
Thanks Jason. For v1 we kept the baselines intentionally simple and reproducible: generate a full proof, then at most a small compiler-error repair loop (1–4 tries), not heavy search. I agree this isn’t the strongest Lean-specific setup, and it makes comparisons harder.
We’d like to add a Lean-specific prover baseline (Aesop / LeanDojo-style provers like ReProver) once we have the engineering/compute to wire it up cleanly; those are usually reported as “% theorems proved under a fixed time/search budget” rather than a straight pass@1, so we’d also like to align the protocol/metrics.
On “is it inherently challenging”: I wouldn’t be shocked if the Easy tier gets filled quickly with mathlib search. Where I’d be pleasantly surprised is strong, consistent performance on Medium/High — many items aren’t a single lookup, but require stitching interfaces/diagram chasing (and in a few cases we had to add missing glue lemmas). If there’s a standard baseline/config people expect here, pointers (or PRs) would be much appreciated.
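The generate-then-repair protocol described above can be sketched as follows; `generate` and `compile_lean` are hypothetical stand-ins for the model call and the Lean build step, not LeanCat's actual harness API:

```python
# Sketch of the v1 baseline loop: one full-proof generation per try, with
# compiler errors fed back into the next generation. No tactic-level search.
def attempt(problem: str, generate, compile_lean, max_tries: int = 4) -> bool:
    feedback = ""
    for _ in range(max_tries):
        proof = generate(problem, feedback)        # full proof in one shot
        ok, feedback = compile_lean(problem, proof)  # (success?, error text)
        if ok:
            return True                            # file compiles: solved
    return False
```

Under this protocol, `max_tries=1` corresponds to pass@1 and `max_tries=4` to the repair-augmented pass@4 reported above.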
Jason Rute (Jan 04 2026 at 20:09):
Rongge said:
We’d like to add a Lean-specific prover baseline (Aesop / LeanDojo-style provers like ReProver) once we have the engineering/compute to wire it up cleanly; those are usually reported as “% theorems proved under a fixed time/search budget” rather than a straight pass@1, so we’d also like to align the protocol/metrics.
Aesop is just a tactic and LeanDojo's Reprover is really outdated (and never that high performing). I would much rather see evaluations with Goedel-prover v2, DeepSeek-prover v2, or Kimina prover v2 (using them with their suggested prompts) with pass@32 or higher. Or with Aristotle or Claude Code (with the Lean MCP) for a more realistic eval (since these are the two tools people actually use in practice, but that might be harder).
Notification Bot (Jan 04 2026 at 22:08):
This topic was moved here from #mathlib4 > LeanCat: a Lean 4 benchmark suite for category theory by Kevin Buzzard.
Notification Bot (Jan 04 2026 at 22:09):
8 messages were moved here from #Machine Learning for Theorem Proving > LeanCat: a Lean 4 benchmark suite for category theory by Kevin Buzzard.
Kevin Buzzard (Jan 04 2026 at 22:09):
(@Rongge please don't crosspost)
Justin K. Wang (Jan 05 2026 at 00:02):
Jason Rute said:
do you think you would be pleasantly surprised if a tool filled in these sorrys?
Dear Jason,
Thank you for your question.
As a supplement to Rongge's answer: we’ve already completed most of the full proofs in Lean. For medium- and high-difficulty problems, it is common that some additional “bridging” lemmas are needed to connect existing results in the library and close the proof. For some of the problems, completing the proof requires more than 1,000 lines of Lean code. This reflects the benchmark’s difficulty and robustness.
Hope readers will get something meaningful from our paper! ;-D
Best,
Justin K. Wang
Jan 5, 2026
Jason Rute (Jan 05 2026 at 02:54):
Thanks @Justin K. Wang . That does help—both in understanding the difficulty and the effort you went through to ensure correctness. (We will have to see how hard it is to fill in those 1000 lines.)
Rongge (Jan 05 2026 at 21:38):
Jason Rute said:
Rongge said:
We’d like to add a Lean-specific prover baseline (Aesop / LeanDojo-style provers like ReProver) once we have the engineering/compute to wire it up cleanly; those are usually reported as “% theorems proved under a fixed time/search budget” rather than a straight pass@1, so we’d also like to align the protocol/metrics.
Aesop is just a tactic and LeanDojo's Reprover is really outdated (and never that high performing). I would much rather see evaluations with Goedel-prover v2, DeepSeek-prover v2, or Kimina prover v2 (using them with their suggested prompts) with pass@32 or higher. Or with Aristotle or Claude Code (with the Lean MCP) for a more realistic eval (since these are the two tools people actually use in practice, but that might be harder).
Quick clarification: I mentioned Aesop only as a cheap Lean-native sanity-check baseline, not “the main prover”.
Agree the more interesting strong baselines are modern LLM provers (DeepSeek/Goedel/Kimina/…), probably at higher pass@k with their recommended prompts. v1 started from a minimal reproducible setup.
If you already have DeepSeek/Goedel/Kimina running somewhere, would you be willing to try a quick pass@32 on LeanCat and share config/logs? Partial is totally fine.
Simon Sorg (Jan 06 2026 at 13:43):
Problem 3 contains 4 theorems. Do you give partial scores here, or do you flag it as unsolved if fewer than 4 theorems are filled in?
Nick Adfor (Jan 11 2026 at 03:15):
Eric Wieser said:
It seems a little strange to release the first version of the benchmark at a toolchain that is 7 months old; do you plan to update it?
It might be the same issue as with the FATE benchmark: 7 months may simply be the time it takes to write the benchmark code and submit the paper.
Rongge (Jan 12 2026 at 04:35):
Simon Sorg said:
Problem 3 contains 4 theorems. Do you give partial scores here, or do you flag it as unsolved if fewer than 4 theorems are filled in?
We treat each file as one problem: solved = everything in the file compiles (so S_0003 needs 4/4). Keeps pass@k clean + avoids weird weighting from file splitting.
No partial credit in the main score for now—maybe we’ll add a “#theorems solved / total” stat later as a nice-to-have analysis.
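The file-level rule above, as a sketch (the data layout is hypothetical, not the harness's actual format):

```python
# A file counts as solved only if every theorem in it compiles; pass@k is
# then computed over files, with no partial credit.
def file_solved(theorem_compiled: list[bool]) -> bool:
    return bool(theorem_compiled) and all(theorem_compiled)

# illustrative per-file compile results keyed by problem id
results = {"S_0003": [True, True, True, False], "S_0001": [True]}
solved = {f for f, r in results.items() if file_solved(r)}
print(sorted(solved))  # S_0003 has 3/4 closed, so it does not count
```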
Justin K. Wang (Jan 14 2026 at 10:41):
Nick Adfor said:
Eric Wieser said:
It seems a little strange to release the first version of the benchmark at a toolchain that is 7 months old; do you plan to update it?
It might be the same problem as FATE benchmark. 7 months might be the time it takes to write code for benchmark and submit paper.
Hi Nick,
Thank you for pointing this out. As Rongge mentioned, that's because the harness is built around LeanExplore; otherwise we'd have to build a search tool ourselves! :-P
Best, Justin
Rongge said:
Thanks for flagging this. We pinned Lean/mathlib 4.19 for reproducibility of the v1 results, and because the current eval harness is still tied to a LeanExplore-based retrieval + compiler-feedback loop on that toolchain. We'd like to support a more recent mathlib/Lean release as well, but I can't promise a timeline until we decouple/update the retrieval backend. If there's a community-preferred pattern here (e.g. frozen vs rolling), pointers would be very appreciated.
Nick Adfor (Jan 14 2026 at 16:47):
Justin K. Wang said:
Hi Nick,
Thank you for pointing this out. As Rongge mentioned, that's because the harness is built around LeanExplore; otherwise we'd have to build a search tool ourselves! :-P
Best, Justin
Hi, Justin,
Thank you for replying so quickly. What I most want to know is why a search tool is needed (it might be that I don't understand the paper well). I also know that FATE https://arxiv.org/abs/2511.02872 seems to use a search tool implicitly (their search tool is leansearch.net, which is not mentioned in the paper, but I know that when proofreading the benchmark dataset they still used leansearch, since their working document referred to it). What is the relation between the search tool and the benchmark dataset? (For personal use, I am more familiar with Moogle, since it is in VS Code, so I cannot tell the difference between the different search tools.)
Best, Nick
Rongge (Jan 16 2026 at 10:11):
Nick Adfor said:
Justin K. Wang said:
Hi Nick,
Thank you for pointing this out. As Rongge mentioned, that's because the harness is built around LeanExplore; otherwise we'd have to build a search tool ourselves! :-P
Best, Justin
Hi, Justin,
Thank you for replying so quickly. What I most want to know is why a search tool is needed (it might be that I don't understand the paper well). I also know that FATE https://arxiv.org/abs/2511.02872 seems to use a search tool implicitly (their search tool is leansearch.net, which is not mentioned in the paper, but I know that when proofreading the benchmark dataset they still used leansearch, since their working document referred to it). What is the relation between the search tool and the benchmark dataset? (For personal use, I am more familiar with Moogle, since it is in VS Code, so I cannot tell the difference between the different search tools.)
Best, Nick
Hi Nick, the benchmark dataset itself is independent of any search tool. We use search only in our workflow/baselines: in particular, LeanBridge (our kernel-checked pipeline) uses retrieval (e.g. LeanExplore) to fetch relevant mathlib lemmas/definitions as context, which makes both curation and tool-augmented proving much more efficient. So the search tool affects the solver and development workflow, not the benchmark items.
Regarding search tools: leansearch / LeanExplore / Moogle are just different retrieval frontends/backends over (roughly) the same corpus; they differ in ranking/UI/API, but they’re interchangeable from the dataset’s perspective.
Last updated: Feb 28 2026 at 14:05 UTC