Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: First impressions from using Claude 4 Opus and Sonnet


Adam Kurkiewicz (May 22 2025 at 19:43):

Anthropic today released two new LLMs, Claude 4 Opus (more powerful, expensive) and Claude 4 Sonnet (still expensive, but less).

They can both write syntactically correct Lean 4 code, and I think they should be able to auto-formalise Olympiad-level problems with minimal guidance from the user (I'm in fact in the process of formalising an Olympiad problem to demonstrate this).

For the outer reasoning loop I use Claude 4 Opus (initially I ask it to produce a draft of the formalisation):

I'm trying to formalise the following natural language proof:

https://web.evanchen.cc/exams/USAMO-2022-notes.pdf

Can you check if my formalisation is on a good track and propose the next good step in the formalisation?

<Lean proof goes here>

This produces mostly correct Lean 4 code, which is an improved version of the previous formalisation (more sorries filled). There are usually some issues.

I then use the less powerful (but in my experience syntactically better) model, Claude 4 Sonnet, to fix the errors (I simply paste in the code and the problem, and do several iterations of fixing the syntax before going back to the reasoning loop).

And the crazy thing is that I'm making progress (with previous versions of Claude my experience was syntactic salad that wasn't super usable).

The biggest issue so far seems to be the cost. It cost me ~10 USD to get ~400 lines of Lean proof (I expect the final version will be ~1000). Still cheaper than undergraduates ;)

Funnily enough, the Opus model sometimes switches its thinking from English to Chinese (but responses are always in English).

Adam Kurkiewicz (May 22 2025 at 19:47):

For the "syntactic improvement" inner loop I use lake build on the command line, because I don't yet have a setup, which would give Claude Sonnet interactive access to my VSCode.

Justin Asher (May 22 2025 at 21:24):

Are you formalizing a single problem? A proof? All the problems? It is not clear to me based on what you wrote. In any case, I am super excited to hear more about your progress!

I just finished writing a Python MCP server for my search engine, LeanExplore, which you can find in the repository. In the README I wrote an example of an LLM agent you can ask to search Lean repositories for you, and this should integrate nicely with Claude since Anthropic created MCP. I still need to write tests and documentation, but I think this could be very helpful for what you are trying to do!

Justin Asher (May 22 2025 at 21:30):

I tried asking it for relevant Lean statements for one of the given problems, and this is what I got:

Example

By adjusting the prompting you can get a lot more relevant data and whatnot.

Adam Kurkiewicz (May 23 2025 at 05:41):

Hey Justin, yes, I'm working on a single problem, specifically this one:

https://dwrensha.github.io/compfiles/problems/Compfiles.Usa2022P1.html

My starting point was what there was in Compfiles, which is this:

https://live.lean-lang.org/#url=https%3A%2F%2Fdwrensha.github.io%2Fcompfiles%2Fproblems%2FCompfiles.Usa2022P1.sol.lean

My objective is to see if I can "vibe-code" this problem, i.e. can I get a complete formalised solution out of the model with minimal input from my side. It's not complete autoformalisation, because I will still sometimes fix obvious proof errors, and sometimes I will decide to focus on a particular lemma or guide the model in a particular direction, but it's supposed to be at least 5x faster than regular coding. I suppose the difference is that the workflow I'm trialling is human-in-the-loop (most heavy lifting is done by the AI), as opposed to AI-in-the-loop (where most heavy lifting is done by a human).

Adam Kurkiewicz (May 23 2025 at 05:51):

What looks to me like a major shift in the state of the art is that all prior off-the-shelf AI models (ChatGPT, etc.) that I tried simply couldn't produce good enough Lean 4 code to justify using them in a project. Lots of hallucinated lemmas, mixing up Lean 3 and Lean 4 syntax, etc. Those problems seem to be substantially addressed in this newest release.

Justin Asher (May 23 2025 at 06:03):

Ah, this looks neat, I had not seen the Lean file that you linked before. I will have to check it out when I get more time.

I concur that the previous models can be quite poor. When I was coding Lean with Gemini 2.5 Pro, I had to copy and paste the source files. I am trying to build software to get the AI to do the heavy lifting, too.

Please let me know if you figure out any useful strategies! I think there are a lot of little optimizations just around things like prompt engineering.

Eric Taucher (May 23 2025 at 07:33):

Adam Kurkiewicz said:

initially I ask it to produce a draft of the formalisation

Could you post a share link for the conversation?

Thanks

Adam Kurkiewicz (May 23 2025 at 07:36):

I'm using the Claude API; I don't think I have the share button:

Screenshot 2025-05-23 at 09.36.21.png

Adam Kurkiewicz (May 23 2025 at 07:38):

I think sharing chats is for their Claude.ai consumer product, which I don't actually have (I think the models are identical; it's just a difference in pricing and UI).

Adam Kurkiewicz (May 23 2025 at 07:44):

If this is of any help, this is an example chat (just fixing a small lemma with Claude 4 Sonnet, which the "outer loop" produced using Claude 4 Opus):

System prompt:

You are an expert Lean 4 formaliser. Help the user with their Lean 4 formalisation

And the user prompt:

```lean
import Mathlib.Data.Finset.Card
import Mathlib.Data.Fintype.Card
import Mathlib.Data.Fintype.Prod
import Mathlib.Combinatorics.Hall.Basic
import Mathlib.Tactic
import Mathlib.Data.Rat.Cast.Order

import ProblemExtraction

problem_file { tags := [.Combinatorics] }

/-!
# USA Mathematical Olympiad 2022, Problem 1

Let a and b be positive integers. The cells of an (a+b+1) × (a+b+1) grid
are colored amber and bronze such that there are at least a² + ab - b
amber cells and at least b² + ab - a bronze cells. Prove that it is
possible to choose a amber cells and b bronze cells such that no two
of the a + b chosen cells lie in the same row or column.
-/

namespace Usa2022P1

-- Count amber cells in a given partition
def amber_count (a b : ℕ) (color : Fin (a + b + 1) × Fin (a + b + 1) → Fin 2)
    (R_A C_A : Finset (Fin (a + b + 1))) : ℕ :=
  ((R_A ×ˢ C_A).filter (fun p => color p = 0)).card

-- Count bronze cells in a given partition
def bronze_count (a b : ℕ) (color : Fin (a + b + 1) × Fin (a + b + 1) → Fin 2)
    (R_B C_B : Finset (Fin (a + b + 1))) : ℕ :=
  ((R_B ×ˢ C_B).filter (fun p => color p = 1)).card

-- The key probabilistic lemma
lemma expected_value_implies_existence
    {α : Type*} [Fintype α] [DecidableEq α]
    (f : α → ℚ)
    (k : ℚ)
    (h : k ≤ (Finset.univ.sum f) / Fintype.card α) :
    ∃ a : α, k ≤ f a := by
  by_contra h_not
  push_neg at h_not
  have : ∀ a : α, f a < k := h_not
  have sum_bound : Finset.univ.sum f < Fintype.card α * k := by
    have h1 : Finset.univ.sum f < Finset.univ.sum (fun _ => k) := by
      apply Finset.sum_lt_sum
      intro a _
      exact this a
    rwa [Finset.sum_const, Finset.card_univ, nsmul_eq_mul] at h1
  have : k ≤ (Finset.univ.sum f) / Fintype.card α := h
  rw [div_le_iff₀ (Nat.cast_pos.mpr Fintype.card_pos)] at this
  linarith
```

Problems:

```
[6275/6277] Building Compfiles.Usa2022P1
trace: .> LEAN_PATH=././.lake/packages/Cli/.lake/build/lib:././.lake/packages/batteries/.lake/build/lib:././.lake/packages/Qq/.lake/build/lib:././.lake/packages/aesop/.lake/build/lib:././.lake/packages/proofwidgets/.lake/build/lib:././.lake/packages/importGraph/.lake/build/lib:././.lake/packages/LeanSearchClient/.lake/build/lib:././.lake/packages/plausible/.lake/build/lib:././.lake/packages/mathlib/.lake/build/lib:././.lake/packages/tryAtEachStep/.lake/build/lib:././.lake/build/lib DYLD_LIBRARY_PATH= /Users/adamkurkiewicz/.elan/toolchains/leanprover--lean4---v4.17.0-rc1/bin/lean -Dpp.unicode.fun=true -DautoImplicit=false -DrelaxedAutoImplicit=false -Dweak.linter.style.multiGoal=true ././././Compfiles/Usa2022P1.lean -R ./././. -o ././.lake/build/lib/Compfiles/Usa2022P1.olean -i ././.lake/build/lib/Compfiles/Usa2022P1.ilean -c ././.lake/build/ir/Compfiles/Usa2022P1.c --json
error: ././././Compfiles/Usa2022P1.lean:45:34: typeclass instance problem is stuck, it is often due to metavariables
  Fintype ?m.2693
error: ././././Compfiles/Usa2022P1.lean:51:6: tactic 'rewrite' failed, did not find instance of the pattern in the target expression
  ?m.4871 / ?m.4872 ≤ ?m.4870
α : Type u_1
inst✝¹ : Fintype α
inst✝ : DecidableEq α
f : α → ℚ
k : ℚ
h : k ≤ Finset.univ.sum f / ↑(Fintype.card α)
h_not : ∀ (a : α), f a < k
this : ∀ (a : α), f a < k
sum_bound : Finset.univ.sum f < ↑(Fintype.card α) * k
this : k ≤ Finset.univ.sum f / ↑(Fintype.card α)
⊢ False
error: Lean exited with code 1
Some required builds logged failures:
- Compfiles.Usa2022P1
error: build failed
```

And then, after several iterations of pasting lake build errors, I get what looks like a correct Lean 4 formalisation of that particular lemma:

Looking at the errors:

1. The first error says "typeclass instance problem is stuck, it is often due to metavariables Fintype ?m.2693". This is at line 45:34.

Looking at line 45:
```lean
(h : k ≤ (Finset.univ.sum f) / Fintype.card α) :
```

The issue here is that `Fintype.card α` returns a `Nat`, but we're dividing by it in the rational numbers. We need to cast it to `ℚ`. The correct syntax should be `↑(Fintype.card α)` or `(Fintype.card α : ℚ)`.

2. The second error is at line 51:6 with the `rewrite` tactic. Looking at the code:
```lean
rw [div_le_iff₀ (Nat.cast_pos.mpr Fintype.card_pos)] at this
```

The issue is that `div_le_iff₀` expects a specific form. Let me check what the actual lemma looks like. The problem is that `this` has type `k ≤ Finset.univ.sum f / ↑(Fintype.card α)`, but the pattern matcher might be having trouble because of how the cast is written.

Let me fix both issues:

1. First, I need to fix the type annotation in line 45 to properly cast to rationals.
2. Second, I need to make sure the rewrite at line 51 works correctly.

Let me revise the proof:
Looking at the errors, there are two issues:

1. The type annotation on line 45 needs to properly cast `Fintype.card α` to rationals
2. The rewrite on line 51 is failing to match the pattern

Here's the corrected version:

```lean
import Mathlib.Data.Finset.Card
import Mathlib.Data.Fintype.Card
import Mathlib.Data.Fintype.Prod
import Mathlib.Combinatorics.Hall.Basic
import Mathlib.Tactic
import Mathlib.Data.Rat.Cast.Order

import ProblemExtraction

problem_file { tags := [.Combinatorics] }

/-!
# USA Mathematical Olympiad 2022, Problem 1

Let a and b be positive integers. The cells of an (a+b+1) × (a+b+1) grid
are colored amber and bronze such that there are at least a² + ab - b
amber cells and at least b² + ab - a bronze cells. Prove that it is
possible to choose a amber cells and b bronze cells such that no two
of the a + b chosen cells lie in the same row or column.
-/

namespace Usa2022P1

-- Count amber cells in a given partition
def amber_count (a b : ℕ) (color : Fin (a + b + 1) × Fin (a + b + 1) → Fin 2)
    (R_A C_A : Finset (Fin (a + b + 1))) : ℕ :=
  ((R_A ×ˢ C_A).filter (fun p => color p = 0)).card

-- Count bronze cells in a given partition
def bronze_count (a b : ℕ) (color : Fin (a + b + 1) × Fin (a + b + 1) → Fin 2)
    (R_B C_B : Finset (Fin (a + b + 1))) : ℕ :=
  ((R_B ×ˢ C_B).filter (fun p => color p = 1)).card

-- The key probabilistic lemma
lemma expected_value_implies_existence
    {α : Type*} [Fintype α] [DecidableEq α]
    (f : α → ℚ)
    (k : ℚ)
    (h : k ≤ (Finset.univ.sum f) / (Fintype.card α : ℚ)) :
    ∃ a : α, k ≤ f a := by
  by_contra h_not
  push_neg at h_not
  have : ∀ a : α, f a < k := h_not
  have sum_bound : Finset.univ.sum f < (Fintype.card α : ℚ) * k := by
    have h1 : Finset.univ.sum f < Finset.univ.sum (fun _ => k) := by
      apply Finset.sum_lt_sum
      intro a _
      exact this a
    rwa [Finset.sum_const, Finset.card_univ, nsmul_eq_mul] at h1
  have : k ≤ (Finset.univ.sum f) / (Fintype.card α : ℚ) := h
  rw [div_le_iff (Nat.cast_pos.mpr Fintype.card_pos)] at this
  linarith
```

The main changes are:
1. Changed `Fintype.card α` to `(Fintype.card α : ℚ)` to properly cast to rationals
2. Changed `div_le_iff₀` to `div_le_iff` (the non-zero version, since we have the positivity condition)
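As a sanity check, the cast fix is easy to see in isolation; a minimal example, assuming Mathlib is imported:

```lean
import Mathlib

-- In the original, nothing pinned the division down to `ℚ`, so elaboration
-- got stuck on a metavariable; the ascription `(Fintype.card α : ℚ)` makes
-- the `ℕ → ℚ` coercion explicit and fixes the ambient type.
example {α : Type*} [Fintype α] (q : ℚ) : ℚ := q / (Fintype.card α : ℚ)
```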

Adam Kurkiewicz (May 23 2025 at 07:56):

Here's another interesting snippet (from a "thinking" box in a different chat). It looks like it can somehow search Mathlib when it gets stuck. Previous versions of Claude couldn't do that:

I see the issue. I'm using the wrong lemma. Fintype.card_subtype_eq seems to be about a different thing than what I need. Let me think about what lemma I actually need.
I want to connect Fintype.card {s // color s = 0} with (Finset.univ.filter (fun p => color p = 0)).card.
Let me search for the correct lemma. I think it should be something like Finset.card_subtype or similar.
Actually, let me look more carefully. The error message showed that Fintype.card_subtype_eq has type:
∀ (y : ?m.5844) [inst : Fintype { x // x = y }], Fintype.card { x // x = y } = 1
This is clearly not what I want. This lemma is saying that the cardinality of {x // x = y} is 1, which makes sense because there's exactly one x such that x = y, namely x = y itself.
What I actually want is something that relates Fintype.card {s // p s} to (Finset.univ.filter p).card.
Let me think... I believe the correct lemma should be Finset.card_subtype or something similar.
Let me try to search in my memory for the correct lemma name. I think it might be:

Finset.card_subtype
Fintype.card_of_subtype
Finset.card_filter

Actually, let me think about this differently. We have:

Fintype.card {s // color s = 0} which is the cardinality of the subtype
(Finset.univ.filter (fun p => color p = 0)).card which is the cardinality of the filtered finset

I'm investigating a connection between these two cardinality expressions. The lemma I need should directly relate the cardinality of a subtype to the cardinality of a filtered universal set. I'll explore the standard library to find the precise relationship between these two representations.
I'll experiment with Finset.card_subtype as a potential solution, considering different approaches like simplification techniques or targeted lemma selection.

Jonathan Laurent (May 23 2025 at 08:04):

Does anyone know how Claude 4 compares to other models such as GPT-4.1 or Gemini 2.5 for auto-formalization?

Adam Kurkiewicz (May 23 2025 at 08:13):

@Jonathan Laurent anecdotally, the issue so far has been just producing syntactically valid Lean 4 without too many hallucinated lemmas and without mixing in Lean 3 syntax. I don't see any hallucinated lemma names with Claude 4, and its "thinking" box explicitly refers to searching Mathlib (as if it had that capability). But the lemmas' signatures sometimes appear hallucinated. For example, this looks like a hallucination, right? I took it from Claude's "thinking box". It says it's looking at the Mathlib documentation; the name of the lemma is real, but the signature looks suspicious:

Let me search for the correct lemma name. In Mathlib, this is typically called something like Fintype.card_subtype or similar.
Actually, let me check what the exact name is. The lemma should be in Mathlib.Data.Fintype.Card or Mathlib.Data.Finset.Card.
Looking at the Mathlib documentation, the correct lemma is:
Finset.card_subtype : ∀ {α : Type*} [inst : Fintype α] (p : α → Prop) [inst_1 : DecidablePred p], Fintype.card {a // p a} = (Finset.filter p Finset.univ).card

What I'm finding with Moogle is this:

https://www.moogle.ai/search/raw?q=Finset.card_subtype

Eric Taucher (May 23 2025 at 08:19):

Adam Kurkiewicz said:

I'm using the Claude API

Ahh, then it is not a sharable conversation. Thanks.

Jonathan Laurent (May 23 2025 at 08:22):

I have seen this problem of mixing Lean 3/4 syntax and hallucinating Mathlib lemmas with GPT-4.1 too. Did anyone find strategies around this? Can prompting help here? Is Claude 4 doing better mostly a product of doing web searches, or was it maybe trained/fine-tuned on more recent data?

Adam Kurkiewicz (May 23 2025 at 08:23):

My conclusion with previous tools was that they weren't super helpful. This doesn't seem to be the case with Claude 4, TBH. It seems to be doing quite well so far.

Adam Kurkiewicz (May 23 2025 at 08:24):

It's not exactly completely hands-off; one still needs to guide it. But it isn't hopeless.

Adam Kurkiewicz (May 23 2025 at 08:29):

OK @Justin Asher, here is a strategy that got me out of an AI-driven rabbit hole with no chance of progress:

User:

It doesn't work. Hey, different idea. Write out the full signature of the lemma you think exists. And I will search for it in Moogle.

And Claude Sonnet 4 response:

The user is asking me to write out the full signature of the lemma I think should exist that connects `Fintype.card {s // P s}` with `(Finset.univ.filter P).card`.

Let me think about what this lemma should look like:

1. We have a type `α`
2. We have a predicate `P : α → Prop`
3. We need `α` to be a `Fintype`
4. We need `P` to be decidable (i.e., `[DecidablePred P]`)
5. The lemma should state that `Fintype.card {a : α // P a} = (Finset.univ.filter P).card`

So the full signature should be something like:

```lean
theorem some_lemma_name {α : Type*} [Fintype α] (P : α → Prop) [DecidablePred P] :
  Fintype.card {a // P a} = (Finset.univ.filter P).card
```

This is the lemma I think should exist.
You're right, let me write out the exact signature of the lemma I think should exist:

```lean
theorem some_lemma_name {α : Type*} [Fintype α] (P : α → Prop) [DecidablePred P] :
  Fintype.card {a // P a} = (Finset.univ.filter P).card
```

This should be a lemma that connects:
- The cardinality of the subtype `{a // P a}` (elements of type `α` satisfying predicate `P`)
- With the cardinality of the filtered finset `(Finset.univ.filter P).card` (all elements of `α` that satisfy `P`)

Can you search for this signature in Moogle?

And sure enough, I search for the signature in Moogle, paste the Moogle link to Claude and it fixes the lemmas without a problem.

I'm not sure if this was faster than doing things myself, but it looks like a promising strategy to include if you want to independently guide this model to a solution.
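A cheaper variant of the same trick should also work: state the guessed signature as an `example` and let `exact?` search for it locally (a sketch, assuming a recent Mathlib):

```lean
import Mathlib

-- State the guessed lemma as a goal and let `exact?` hunt for it; if the
-- lemma exists under some name, the infoview reports the `exact` call to use.
example {α : Type*} [Fintype α] (P : α → Prop) [DecidablePred P] :
    Fintype.card {a // P a} = (Finset.univ.filter P).card := by
  exact?
```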

Adam Kurkiewicz (May 23 2025 at 08:34):

A side observation: it looks like the "personality" of Claude Sonnet 4 and Claude Opus 4 is very different. Claude Sonnet 4 seems really polite and sort of servile, while Claude Opus 4 is sort of assertive and not super politically correct. It told me at some point something I said was nonsensical :D. I don't think Claude Sonnet 4 would say something like that.

Matteo Cipollina (May 23 2025 at 08:37):

Jonathan Laurent said:

I have seen this problem of mixing Lean 3/4 syntax and hallucinating mathlib lemmas with GPT-4.1 too. Did anyone find strategies around this? Can prompting help here? Is Claude 4 doing better mostly a product of maybe doing web searches or maybe it was trained/fined-tuned on more recent data?

I think 'hallucinating' is not an accurate word. From what I could see, the AI tries to give a reasonable name (based on its experience with Mathlib names) to the lemma it thinks would be needed in the proof (which may still be wrong or nonsensical). If you ask it something like 'please provide xyz as a separate lemma' and then ask it to prove that lemma, you will find out whether the lemma exists under another name (by searching the API space with exact?, rw?, etc.), whether it is a specialisation of an existing lemma, or whether it is untrue (as stated) or unhelpful for the purpose. So I think those 'hallucinations' could sometimes be helpful (locally, not globally), as a sort of 'blind' hint.
I have the feeling that, at least with current tools, the area with the widest margin for improvement is the iterative workflow, though that would certainly take time, strategy, and coordination, and its output quality would, I think, be directly proportional to the quality of the human user responsible for it.

Eric Taucher (May 23 2025 at 08:43):

Adam Kurkiewicz said:

I'm not sure if this was faster than doing things myself, but it looks like a promising strategy to include if you want to independently guide this model to a solution.

I find that in some cases, when generating code for specialized languages like Prolog, providing the AI with both the coding task and a link to the API documentation, while requiring verification that each predicate/function is either noted in the provided API documentation or included in the generated code with test cases, significantly improved code quality and largely eliminated hallucinated predicate/function calls.

Kelly Davis (May 23 2025 at 08:52):

Adam Kurkiewicz said:

My conclusion with previous tools was that they weren't super helpful.

Did "previous tools" include (the free) DeepSeek-V3?

DeepSeek used DeepSeek-V3 to:

  1. Autoformalize a theorem's statement
  2. Create from a theorem's statement a series of "sorry proven" propositions which imply the theorem.

They had considerable success with this.
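In Lean terms, the decomposition pattern looks roughly like this (a toy sketch, not DeepSeek's actual output):

```lean
-- Toy sketch of the "sorry-proven propositions" pattern: the target theorem
-- is derived from intermediate steps whose proofs are deferred with `sorry`.
theorem toy_goal (n : Nat) : n + 0 = 0 + n := by
  have step1 : n + 0 = n := by sorry
  have step2 : 0 + n = n := by sorry
  rw [step1, step2]
```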

Adam Kurkiewicz (May 23 2025 at 09:31):

I didn't use DeepSeek-V3

Adam Kurkiewicz (May 23 2025 at 09:34):

Eric Taucher said:

I find that in some cases, when generating code for specialized languages like Prolog, providing the AI with both the coding task and a link to the API documentation

That's also been my experience. This still seems to be needed with Claude 4 Sonnet and Opus. An interesting difference I noticed: Claude 4 Sonnet can't read Moogle links (I have to copy-paste the search results myself); Claude 4 Opus seems to be able to do it.

Eric Taucher (May 23 2025 at 10:10):

Adam Kurkiewicz said:

An interesting difference I noticed: Claude 4 Sonnet can't read Moogle links (I have to copy-paste the search results myself); Claude 4 Opus seems to be able to do it.

I'm currently developing and testing Model Context Protocol (MCP) servers for integration with Claude Desktop (Windows version). Rather than providing extensive API documentation links, I've found that using OpenAPI specifications offers a more efficient solution. These specs are typically contained in single JSON or YAML files, making them substantially more compact than comprehensive API documentation while remaining easily interpretable by Large Language Models.

For example, the Zulip forum's OpenAPI specification is available here.

My current implementation uses FastMCP (video course) with its OpenAPI integration capabilities. It's important to note that security remains a significant concern with MCP implementations, though ongoing development is addressing these challenges.

Note: I have not looked for an OpenAPI spec for Lean 4 and one may not be possible for all of Lean 4 due to the user-extensible parser.
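For anyone who hasn't seen an MCP server before, the shape is tiny; a minimal hand-written FastMCP tool looks something like this (a sketch only; the server and tool names are hypothetical, and the OpenAPI integration generates such tools from a spec automatically):

```python
# Minimal FastMCP server sketch; the tool is a hand-written stub, whereas the
# OpenAPI integration would generate equivalent tools from a spec file.
from fastmcp import FastMCP

mcp = FastMCP("zulip-demo")  # hypothetical server name

@mcp.tool()
def get_stream_topics(stream_id: int) -> str:
    """Return the topics in a Zulip stream (illustrative stub)."""
    # A real implementation would call the Zulip REST API here.
    return f"topics for stream {stream_id}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, which Claude Desktop expects
```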

Justin Asher (May 23 2025 at 14:55):

@Eric Taucher I already did all of this. I used MCP, FastMCP, and OpenAI Agents to allow any model that supports MCP to search the packages indexed by my search engine. In particular, my repository allows models to search v4.19.0 of Mathlib, Std, Batteries, Init, and PhysLean (more to be added) using MCP.

Also, Moogle is outdated and not maintained. That is why I created LeanExplore. I eventually plan to index nightly versions of many of the repositories on Reservoir.

Here is the repository:
https://github.com/justincasher/lean-explore

Specifically look at:
https://github.com/justincasher/lean-explore/tree/main/src/lean_explore/mcp

Justin Asher (May 23 2025 at 14:56):

The good thing is that if you get an API key from my website:
https://www.leanexplore.com/api-keys

Then the setup is minimal since it just pings my server for requests instead of needing to run the index locally.

Justin Asher (May 23 2025 at 14:58):

I am planning on writing out the documentation for the repository today and uploading it to PyPI, if anyone is interested in testing my package.

Jonathan Laurent (May 23 2025 at 15:04):

@Justin Asher What models did you test your pipeline with? How did they compare?

Justin Asher (May 23 2025 at 15:09):

@Jonathan Laurent I finished writing the lean-explore package yesterday, and I now need to write code for testing models for autoformalization & automated theorem proving (this will be a new package). I will let you guys know when I am done with this.

I did, however, write a basic search agent using GPT-4.1 (see the README), which seemed to work really well at retrieving the correct Lean code that I was looking for. So, this should not be an issue anymore.

One useful feature I implemented is the ability both to get a Lean statement and to get information about its dependencies (i.e., the declarations it cites). This helps the model a lot in understanding statements. The models will have access to more and more information as what I index expands in the future.

Justin Asher (May 23 2025 at 15:13):

Adam Kurkiewicz said:

And sure enough, I search for the signature in Moogle, paste the Moogle link to Claude and it fixes the lemmas without a problem.

Please try the LeanExplore MCP when you get the chance. It should solve this problem. I know I need to write the documentation, but once I do, I really think you should try it. It works like a charm.

Adam Kurkiewicz (May 24 2025 at 18:24):

@Justin Asher, I just checked out LeanExplore -- looks really cool! I should try it one day.

For now it seems like the proof in the PDF is a bit too terse/high-level for the AI to deal with on its own. I think I would have to write a more detailed "blueprint" to test those models properly. I might do this at some point. They still seem like a step ahead in terms of generating quality Lean 4 code.

Justin Asher (May 24 2025 at 19:12):

@Adam Kurkiewicz Thanks for letting me know!

Adam Kurkiewicz (May 24 2025 at 19:46):

Instead of working out my own blueprint of this particular Olympiad problem for the informal evaluation of these models, I used a short "blueprint" from this thread.

I was able to get Claude 4 Opus to auto-formalise Lemma 1 from the blueprint (burning about 15 USD and consuming about half of the allowed 200k-token context window).

There must be a more efficient/agentic way of working with these models, which I think I should look into.

A great thing about this blueprint from the Equational project is that everything is completely self-contained, so searching the standard library isn't a concern here.

Justin Asher (May 25 2025 at 02:42):

@Adam Kurkiewicz For reference, documentation can be found here:
https://www.leanexplore.com/docs

Adam Kurkiewicz (May 25 2025 at 05:57):

@Justin Asher I'm considering running another experiment using Bruno's blueprint from the Equational project (which I linked to in the previous message).

But this time I'd like to try something more automated (and I'd like to see if I can get that Lemma 1 auto-formalised that way). Hopefully the "agentic" approach would also be cheaper.

I was thinking of trying Cursor; maybe it will have some kind of agentic mode.

Reading through your documentation it looks like you have already thought through this at some level:

https://www.leanexplore.com/docs/mcp

For this particular problem, unusually of course, there are no dependencies. The theory is self-contained in a very short "introduction" at the top of the Lean file, which I think makes it an excellent benchmark of pure reasoning (obviously I understand that a tool like Vampire or Prover9 can destroy this equation in a few seconds, but I think it's clear that what Claude Opus is doing is actually very different: it's reading the PDF and following the human-accessible logic, so this is still a good autoformalisation test IMO).

My question is: is there an agentic set-up that you could recommend for these kinds of autoformalisation challenges?

What I'm thinking to do is just let the agent run in the background and see if it can finish the proof after a couple of hours. I'd like to check if the following claim from Anthropic's press release could be translated into the realm of auto-formalisation challenges:

Claude Opus 4 excels at coding and complex problem-solving, powering frontier agent products. Cursor calls it state-of-the-art for coding and a leap forward in complex codebase understanding. Replit reports improved precision and dramatic advancements for complex changes across multiple files. Block calls it the first model to boost code quality during editing and debugging in its agent, codename goose, while maintaining full performance and reliability. Rakuten validated its capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance. Cognition notes Opus 4 excels at solving complex challenges that other models can't, successfully handling critical actions that previous models have missed.

Justin Asher (May 25 2025 at 08:52):

@Adam Kurkiewicz I do want to point out that while the equational theories project is relatively self-contained, it is still built on top of the Lean 4 base library. I think this is what a lot of people miss when, for instance, they assume that there is no need for search or for analyzing dependencies. The problem with current LLMs is that, as you noticed, they really struggle to understand libraries without a lot of code to train on, like Lean 4. Hence, when my system fetches dependency information, it looks not only at Mathlib but also at any other indexed libraries.

Let's look at the definition of a polynomial derivative using the CLI:

Screenshot 2025-05-25 at 4.49.36 AM.png

The key point here is that my dependency graph has picked up that this definition cites many declarations from the Init library, part of the base Lean 4 library.

Justin Asher (May 25 2025 at 08:55):

Adam Kurkiewicz said:

@Justin Asher My question is: is there an agentic set-up that you could recommend for these kinds of autoformalisation challenges?

I know Terence Tao seemed to ask something similar here:
#Machine Learning for Theorem Proving > Tools for semi-automated line-by-line formalization

Some other people mentioned packages that they had created, which might be helpful.

Justin Asher (May 25 2025 at 09:01):

To be honest, in my experience, the hard part about coding in Lean 4 (or using any other small repository) has been simply accessing the correct source files to know how to use the definitions properly. Whenever I provide Gemini 2.5 Pro with the correct source files, it tends to figure things out, barring any conceptual struggle, which is something that should not happen for the equational theories project. That is why I am advocating for a search and dependency heavy approach.

Justin Asher (May 25 2025 at 09:06):

Also, building a package to do things like what you want to do is my next goal. I will let you know if I make any good progress. I think there are a lot more efficient ways to provide the LLM with feedback and contextual dependency information.

Eric Taucher (May 25 2025 at 11:24):

@Justin Asher @Adam Kurkiewicz Thanks for sharing your thoughts publicly.

Something like LeanExplore was on my todo list, and now I may not have to do it!

My current MCP project with Lean is to extract information from Lean forum messages so that they can be added as context for questions with Claude Desktop. The idea is like providing a PDF to an LLM and then asking questions, but instead the context is messages from this forum.

Many times I see users asking about information on this forum and being unable to find it, needing it summarized, wanting the latest SoTA (state of the art), etc. The talk about blueprints and summaries is what inspired this idea. I use Discourse forums daily; they integrated AI over a year ago, and these kinds of questions can be answered by a built-in Discourse AI bot.

Eric Taucher (May 25 2025 at 11:24):

Here are the proof-of-concept cURL commands for Windows; the syntax is slightly different for Linux (ref).

```
curl -sSX GET -G https://leanprover.zulipchat.com/api/v1/streams -u %ZULIP_EMAIL_ADDRESS%:%ZULIP_API_KEY%

curl -sSX GET -G https://leanprover.zulipchat.com/api/v1/users/me/458659/topics -u %ZULIP_EMAIL_ADDRESS%:%ZULIP_API_KEY% --data-urlencode allow_empty_topic_name=true

curl -sSX GET -G https://leanprover.zulipchat.com/api/v1/messages -u %ZULIP_EMAIL_ADDRESS%:%ZULIP_API_KEY% --data-urlencode "narrow=[{\"operator\": \"channel\",\"operand\": 458659},{\"operator\": \"topic\",\"operand\": \"Upstream to Mathlib\"}]" --data-urlencode "anchor=oldest" --data-urlencode "num_after=1000"
```

Eric Taucher (May 25 2025 at 11:35):

This topic started as First impressions from using Claude 4 Opus and Sonnet and now has a few other ideas.

I would like to propose that the messages related to MCP be split off as a separate topic.

Justin Asher (May 25 2025 at 18:13):

@Eric Taucher Yes, I agree, creating a separate channel to discuss MCP tools for Lean would be a good idea, if you would like to. :slight_smile: :working_on_it: :robot:

Adam Kurkiewicz (May 26 2025 at 07:43):

Thanks Justin, yes, good point re searching the standard library. I might try to set something like this up later this week; I'll let you know how I get on if I do end up doing it.

Justin Asher (May 26 2025 at 14:26):

@Adam Kurkiewicz You also might be interested in what @Eric Taucher and I were discussing on the LeanExplore chat. We managed to get LeanExplore hooked up to Claude desktop (no API cost, just subscription). I will update the package and docs with how to do this when I can.


Last updated: Dec 20 2025 at 21:32 UTC