Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: MCP Tools for LLMs and Agentic Mathematics

Justin Asher (May 26 2025 at 14:49):

I have recently been discussing with a lot of people on various implementations of MCP tools for LLMs. Since this is becoming a standard protocol, and because it is even supported outside the API in programs like Claude Desktop, I thought it would be a good idea to create a thread on various MCP tools. I think that an overall goal should be to create an MCP tool suite that makes building AI math agents easy and plug-in-play, like building with Legos.

So far, here are some of the tools that I know about which support MCP:

LeanExplore supports searching the stable versions of various Lean libraries. I plan on adding information about how to use it in, for instance, Claude Desktop too.
@GasStationManager built LeanTool, which connects LLMs with a Lean "Code Interpreter," allowing them to interact directly with the Lean executable to improve code accuracy and leverage interactive theorem-proving features.
@Oliver Dressler built lean-lsp-mcp, which provides an MCP interface for agentic interaction with the Lean theorem prover, leveraging the Language Server Protocol via leanclient. This tool allows LLMs to access Lean's diagnostic messages, proof goals, hover information, and code completions, facilitating automated reasoning and proof development.

I am sure I am missing some others, so please let me know.

My main question is the following: How far away from fully automating Lean theorem proving? What about autoformalization? I think with the increasing intelligence of models, this should be possible. What additional tools do we need to enable this? Do we need to build an agentic workflow tool? This is what I will be working on once I finish updating LeanExplore. In particular, I am going to be workingon a package called LeanMiner, a math mining machine.

Justin Asher (May 26 2025 at 14:57):

One of the main things I had in mind for my LeanMiner package was accessibility. I think it would be great if a user could just write a comment that says something like:

-- LeanMiner: Finish this proof.

and then the program begins working on this task in the background, out of sight. I think this would help a lot of researchers' by cleanly integrating such a tool into their work flow, as I do not think full automation is yet a possibility.

Eric Taucher (May 26 2025 at 14:59):

Justin Asher said:

implementations of MCP tools for LLMs

Let me know if you agree.

While not a hammer for all MCP servers, creating an OpenAPI spec and using with FastMCP integrated OpenAPI then testing at https://editor.swagger.io/ is a force multiplier.

GasStationManager (May 26 2025 at 17:28):

@Justin Asher Thanks for starting a discussion! I think it's great to have more MCP tools for Lean. Ideally, we'd have a layer of MCP tools that expose Lean's various functionalities to the LLMs, which then allows tools that provide more complex functionalities to be built.

In the former camp, one tool that I think is missing and would be great to have, is a MCP server for a Lean REPL interface. There are a few python packages providing this, including the recent Pantograph, and Lean-Interact. Here's why I think making a MCP wrapper for one will be useful: quite a few of the recent prover models (including Kimina Prover and Deepseek Prover v2) are trained in an RL environment provided by a lean repl; but when deploying, the tools are taken away. I think for certain models, the most effective way to do inference-time scaling is to put them in a repl environment and let them search. It'd also be interesting to put general-purpose reasoning LLMs like o3 and r1 in a repl, and see how well they do.

In the latter camp, I'm building (on top of LeanTool) a development framework with automated testing feedback. The prototype is not fully MCP yet but will be. See e.g. https://gasstationmanager.github.io/ai/2025/05/22/alphabeta-goose.html

Justin Asher (May 26 2025 at 17:58):

@GasStationManager Yes, I would like to build a Lean MCP REPL, and I do not think it would be too hard.

One idea in this direction I had was branching. What this means is that the LLM should be able to write out the first part of the document—for instance, the part that it assumes is correct—and run that; it then saves this as the main branch. It then should be able to create test branches on top of the main branch, where it can just focus on the next theorem it would like to prove. This is somehwat similar to a Git system for the LLMs to interact with Lean. I am assuming something already exists for other AI agents, but I think having one tailored to the Lean syntax would be good. This would integrate well with MCP in like Claude desktop, because it would not have to write out the entire document and run it over and over again. The LLM could instead focus on inductively writing declarations and their proofs. I think this would also be a good general audience tool, accessible to many Lean researchers.

And the automated testing feedback framework sounds really promising.

Luis Valero (May 28 2025 at 07:34):

Hello, let me know if this isn’t the right channel to send this, but it seems somewhat relevant to this discussion.

Besides from MCP-based methods, there was a paper on Arxiv recently on using RAG & a somewhat similar branching method you mentioned. I haven’t been able to read it thoroughly so I’m not sure if it’s a particularly good paper, but it does have some interesting ideas & implementation details which could help.

https://arxiv.org/pdf/2505.20613

Alexander Heckett (May 30 2025 at 02:19):

(deleted)

Alexander Heckett (May 30 2025 at 02:21):

Justin Asher said:

One of the main things I had in mind for my LeanMiner package was accessibility. I think it would be great if a user could just write a comment that says something like:
-- LeanMiner: Finish this proof.
and then the program begins working on this task in the background, out of sight. I think this would help a lot of researchers' by cleanly integrating such a tool into their work flow, as I do not think full automation is yet a possibility.

Is this kind of functionality necessary on top of what coding agents already provide? For example, I can type into Claude Code "prove such-and-such theorem in such-and-such file" and it goes away and makes its attempt.

Justin Asher (May 30 2025 at 02:22):

@Alexander Heckett Does this work well with Lean + Claude Code? I had asked people previously if Claude code was any good at automated theorem proving, and had not heard any responses.

Justin Asher (May 30 2025 at 02:24):

The other problem that I assumed with a tool like Claude Code is that there are presumably a lot more efficient ways of interacting with Lean code for LLMs. For instance, we do not rely on just compiler feedback but rather the LSP.

Alexander Heckett (May 30 2025 at 02:26):

I've been trying it out today. My setup is Claude Code + LeanTool + LeanExplore. After a lot of guiding the model and $16 dollars (tbf most of that went into setting up LeanTool and LeanExplore) I have a 50-line proof that x^2 + x^4 is convex. It's quite far from perfect but it kind of works.

Justin Asher (May 30 2025 at 02:29):

Neat! My next goal was to update LeanExplore to do nightly updated of the repositories on Reservoir, so perhaps once I do that, we can start automating sorry proving in existing repositories. Albeit, I do not know how expensive this would be. One of the advantages of building a custom system for Lean is it would be cheaper to run.

Justin Asher (May 30 2025 at 02:29):

I would also be curious to see how well your setup works on some benchmarks, like the MiniF2F.

Alexander Heckett (May 30 2025 at 02:30):

LeanTool was set up as an MCP but I couldn't get the LeanExplore MCP to work so I just gave it access via command-line arguments. Honestly I'm not sure there's any difference between interacting via an MCP and via command-line arguments. I could imagine all of the MCP calls to LeanTool being command-line calls instead and nothing really changing.

Justin Asher (May 30 2025 at 02:31):

I am going to push a MCP feature for things like Claude Desktop with the next update of LeanExplore. @Eric Taucher and I had discussed this previously. The advantage of interacting with the dedicated MCP is it should be more efficient.

Alexander Heckett (May 30 2025 at 02:45):

Justin Asher said:

Neat! My next goal was to update LeanExplore to do nightly updated of the repositories on Reservoir, so perhaps once I do that, we can start automating sorry proving in existing repositories. Albeit, I do not know how expensive this would be. One of the advantages of building a custom system for Lean is it would be cheaper to run.

I think the prompt matters a lot. I found Claude Code to be quite skittish when writing Lean -- it will try an approach, fail to succeed immediately, try a completely different approach, fail again to succeed immediately, try a third approach, and so on. You really have to emphasize that it should see its attempts through. It might also be good to get Claude Code to use a more dedicated model (maybe something like DeepSeek-Prover-V2?) to make low-level progress while it guides the overall attempt. Changes like this would probably make significant differences in the eventual price.

Eric Taucher (May 30 2025 at 10:04):

Justin Asher said:

The advantage of interacting with the dedicated MCP is it should be more efficient.

Thanks @Justin Asher

Building on that point, there are other advantages with MCP:

AI-First Tool Integration - MCP enables AI systems to connect to tools in a fundamentally different way than traditional APIs. Rather than exposing raw APIs that require programmers to make implementation decisions, MCP allows AI to discover what functionality a tool provides and intelligently request that functionality directly.

Standardized Protocol - Since MCP establishes a common standard, there's no need to create custom connections between clients (like chatbots) and servers (like Lean Explore). Any MCP client can connect to any MCP server using the same protocol.

One current concern with MCP is security. LeanExplore stands out as a positive example of addressing security proactively, offering API keys with a user-friendly interface for key acquisition and revocation: https://www.leanexplore.com/api-keys

Advantages of Using Anthropic Claude Desktop as Host with MCP Client: (ref)

Extensible tool ecosystem: Multiple MCP servers can be added to Claude's available tools
Granular control: Each tool can be activated/deactivated through UI toggles
Function-level permissions: Specific functionality within each tool can be toggled on/off
Simple configuration: Adding new MCP servers only requires updating claude_desktop_config.json
User consent workflow: Before any tool call, users receive notifications with options to "Deny", "Allow once", or "Allow always"

Alexander Heckett (May 30 2025 at 23:11):

So I've been messing around a bit more and I wonder if something like a GitHub Action would be useful. I have a setup with Claude Code + DeepSeek Prover V2 + LeanTool + LeanExplore right now. Claude Code functions as the outer-loop keeping progress on track, it can call Prover V2 via API, it can access LeanTool via MCP Server, and it can query LeanExplore via command line. It sort of works but it's very haphazard. It would be nice to be able to package the prompt and setup into a GitHub Action so that I can run that process on whatever "sorry" statements I want.

Alexander Heckett (May 30 2025 at 23:23):

From a user-interface perspective, I think it's also valuable to be able to operate lemma proving in "fire-and-forget" mode. For example, I might want to take my research ideas during the day, break them down into lemmas and formalize intermediate statements, leave some comments about how to fill in the blanks, launch Claude Code or whatever, go to bed, and wake up the next morning with the "sorry"'s filled in. Certainly right now it can take my setup 5-10 minutes to prove basic mathematical facts (the kinds of things which I would assert without justification in a handwritten proof) so I could imagine this automated theorem proving process being slow.

Alexander Heckett (May 30 2025 at 23:57):

Actually maybe an MCP server itself is the answer. Maybe it makes sense to release this kind of automated theorem proving tool as an MCP Server which under the hood involves an AI agent calling other Lean-related MCP servers. Certainly it feels awkward to present this functionality in the interactive environment where it might take many minutes to get a response.

Alexander Heckett (May 31 2025 at 01:28):

Actually I guess that's pretty close to what LeanTool does by default. I've been using LeanTool via Claude Code with the feedback loop turned off, but I suppose it also functions as an OpenAI API server in its own right. I should try plugging it into Claude Code with the feedback loop activated and see if it works well.

Justin Asher (May 31 2025 at 01:56):

LeanExplore should now be working via MCP: https://www.leanexplore.com/docs/mcp#integrating-with-mcp-client-applications

Regarding the "fire-and-forget" mode, that is what I was trying to get at with the ability to leave comments like

-- LeanMiner: Finish this proof.

and then have the agent go through and try to finish the proofs. I think this would help speed up a lot of things; many researchers like to leave sorry statements for other people to fill out.

Isn't part of slowness from Claude itself? Here is a relevant website. I think if we can optimize the time it takes to get Lean feedback, the model output speed, and the model cost, we probably would already have a decent bot on our hands. For instance, my gut says that Gemini Flash Thinking 2.5 would be good here.

I can see a multilevel approach, where one AI agent orchestrates others, proposing strategies and then having the other bots go off and attempt them. That is one of the things that I wanted to build.

Justin Asher (May 31 2025 at 01:58):

I think if we can get it down to around $1 per sorry, that means we can prove as many statements as those that exist in Mathlib for less than say $250k. This seems like a good deal, especially since this automated approach should in theory scale well. I also presume costs will come down.

Oliver Dressler (May 31 2025 at 09:46):

@Justin Asher thanks for the mention.

I think MCP is a great way to provide additional interactivity to AI agents.

While coding agents in IDEs (like Cursor or VS Code) have access to files and basic LSP features, they cannot use the full set of essential Lean tools, such as the InfoView.
This motivated the creation of lean-lsp-mcp; to provide all the other tools we use, such as looking up goal states, hover info, web search etc.
As mentioned above, many agents were trained in REPL environments, so they are used to extra info such as goal states.

I haven't used this MCP server much outside of IDE agentic coding, where state is maintained by the agent and the file content.
Thererfore the design of lean-lsp-mcp is attempting to minimize statefulness, tool calls are all independent.
I'm not sure how an agent would interact with a REPL MCP, which would be very stateful, it might have to repeatedly provide the current state to the agent.

As shown by @Alexander Heckett a practical setup would often involve multiple MCP servers, specialized for different tasks.
I think this is a good idea, and I would like to see more specialized MCP servers for Lean.

Regarding AI agents calling AI agents, there are some protocols available such as https://github.com/google-a2a/A2A .
It might be worth integrating with an existing protocol.

Regarding security, while MCP is powerful, it is not a very good standard in my opinion. Hopefully some major revisions are coming soon!
It is unsafe and not very well documented. I would encourage developers to put some information into the docs regarding security.
E.g. in the case of lean-lsp-mcp, while there is no sensitive information, it can still access your local file system etc.

Overall, I am looking forward to seeing how MCPs can augment AI agents!

Eric Taucher (May 31 2025 at 09:54):

Oliver Dressler said:

I'm not sure how an agent would interact with a REPL MCP, which would be very stateful, it might have to repeatedly provide the current state to the agent.

FYI

Memory MCP Server

https://github.com/modelcontextprotocol/servers/tree/main/src/memory

Sequential Thinking MCP Server

https://github.com/modelcontextprotocol/servers/tree/main/src/sequentialthinking

Eric Taucher (May 31 2025 at 10:01):

With regards to AI working on a sorry, also see:

#Machine Learning for Theorem Proving > SorryDB project

Eric Taucher (May 31 2025 at 10:02):

Oliver Dressler said:

Regarding AI agents calling AI agents, there are some protocols available such as https://github.com/google-a2a/A2A .
It might be worth integrating with an existing protocol.

:+1:

Image source: MCP vs A2A: What the Future of AI Agent Communication Looks Like

image.png

Image source: Linked in - Generative AI’s Post

Eric Taucher (May 31 2025 at 11:36):

Other possibilities

Expose other Proof Assistants

as MCP Host and connect with A2A.
as MCP Server and connect with MCP, e.g. https://github.com/angrysky56/mcp-rocq

Expose Automated Theorem Provers as MCP Server and connect with MCP, e.g.

* Z3 - https://github.com/javergar/z3_mcp
* Prover9/Mace4 - https://github.com/angrysky56/mcp-logic
* MiniZinc - https://github.com/angrysky56/mcp-solver

Expose other useful code and connect with MCP

SymPy, e.g. https://github.com/sdiehl/sympy-mcp

Note: There are often more than one MCP server for an entity, e.g. for SymPy as noted with a Google search.

Eric Taucher (May 31 2025 at 12:43):

"RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation" by Nikita Khramov, Andrei Kozyrev, Gleb Solovev and Anton Podkopaev (pdf)

Frauke Harms (May 31 2025 at 14:48):

Just wanted to give a shoutout to @Oliver Dressler for the lean-lsp-mcp. I've been using it in Cursor and it works like a charm! No more guessing about the proof state :). I've also managed to connect it to Claude Desktop, the mcp.json just needs an extra PATH env, otherwise it won't find lake.
I haven't tried LeanExplore or LeanTool yet, but will try them in the future.

Justin Asher (Jun 01 2025 at 04:28):

Would anyone be interested in building a single Lean 4 agent that the community can use easily? For instance, one that can complete sorrys, to start off. Then, gradually, we can move to more complicated tasks, like creating entire documents or projects.

Justin Asher (Jun 01 2025 at 04:34):

Similar @Alexander Heckett , I am using Claude to prove Lean statements in the MiniF2F problem set, and it seems to be doing well. I think we can automate a lot of things once I add nightly indexing to LeanExplore and more repositories.

Eric Taucher (Jun 01 2025 at 09:14):

Justin Asher said:

Would anyone be interested in building a single Lean 4 agent that the community can use easily?

Interested: Yes

While I can not commit to making it a high priority, I can offer to stay informed and helping out when and where I can.

Luis Valero (Jun 01 2025 at 09:31):

Id also be interested.
Currently I cant devote lots of time, but over the summer I will be free to help develop an early version. I attached a diagram for an idea I had for the general setup, this way it would be easy to swap out pieces as it gets built.
screenshot-zo 1 jun 2025 11:25:43 CEST.png

Let me know if I am overstepping a bit, I haven't been very active in this thread but I have followed it closely. Id also be the most interested on the ATP server. I am aware there's existing ones, but having a powerful system for suggesting proof methods in addition to LeanSearch could be pretty effective.

Eric Taucher (Jun 01 2025 at 09:47):

Luis Valero said:

easy to swap out pieces as it gets built

:+1:

Eric Taucher (Jun 01 2025 at 09:54):

Luis Valero said:

having a powerful system for suggesting proof methods in addition to LeanSearch could be pretty effective

:+1:

This is from the PDF in my earlier post.

image.png

One of the many interesting parts: the use of multi-agent debate on the planning stage of proof synthesis.

Eric Taucher (Jun 01 2025 at 11:40):

FYI

My current focus is not on resolving the sorry statements, but rather on referencing forum messages and information through an MCP (Model Context Protocol).

For those who can access Anthropic Claude conversation shares, you can view a recent idea exchange with Claude here: https://claude.ai/share/0f0c0f11-afcf-424a-af9c-a75b91fb9c55

Please note that I cannot guarantee how long this share will remain available, though it should be accessible for at least a week before I delete it.

The gist is some of the use cases Claude suggested:

search messages
get topic summary (This was the need that started this initiative many months ago)
Search for discussions about specific mathematical concepts
Find discussions and help about specific Lean code patterns
Find all discussions about a specific theorem or proof
Find different approaches to proving the same theorem
Find discussions of similar Lean compilation errors
Track how mathlib definitions/theorems evolved over time - (I suspect this would use the rss channel)
Search for Lean 4 code patterns, tactics, and proof strategies
Find forum discussions related to specific papers or mathematical results
Track discussions about mathlib4 changes, additions, and refactoring
Find expert-level discussions and insights on advanced topics
Connect mathematical theory discussions to their Lean 4 implementations
Fetch actual code/context from GitHub references
Track mathlib4 PRs related to specific concepts
Find theorems by mathematical meaning, not just keywords
Handle different mathematical notations for same concept
Suggest appropriate tools based on user's task and complexity

Justin Asher (Jun 01 2025 at 14:19):

@Luis Valero Any feedback or ideas would be helpful, too! If you could let me know more about what you mean by suggesting proof methods, I am interested.

Justin Asher (Jun 01 2025 at 14:20):

@Eric Taucher I was considering exposing forum messages via the MCP, but I know some people on the Zulip chat would not be very happy about this. Albeit, simply using messages during inference time versus training on them should be a bit different.

Justin Asher (Jun 01 2025 at 14:21):

I created a repository here where I am going to start building out this agentic system:
https://github.com/justincasher/lean-miner

It is empty right now, but I plan on building this over the next month or so while working on improving LeanExplore.

Eric Taucher (Jun 01 2025 at 14:55):

Justin Asher said:

@Eric Taucher I was considering exposing forum messages via the MCP

Thanks!

Changed wording to referencing.

Luis Valero (Jun 01 2025 at 15:45):

Justin Asher said:

Luis Valero Any feedback or ideas would be helpful, too! If you could let me know more about what you mean by suggesting proof methods, I am interested.

I meant systems for suggesting tactics, hypothesis, possible proof directions, maybe even having some way of recognizing when current progress is stalling, and suggesting parallel problems to gain some insight. Essentially most of the things already in @Eric Taucher diagram.

Eric Taucher (Jun 02 2025 at 09:46):

The use of perturbations to assess an LLMs abilities for university-level symbolic mathematics is noted in this paper. Don't remember/recall perturbations being noted for LLMs assessment on Lean task.

"ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark" by Michael Shalyt, Rotem Elimelech and Ido Kaminer (PDF)

Eric Taucher (Jun 03 2025 at 11:57):

(deleted)

Adam Kurkiewicz (Jun 04 2025 at 22:44):

Justin Asher said:

I think if we can get it down to around $1 per sorry, that means we can prove as many statements as those that exist in Mathlib for less than say $250k.

A good direction for this would be to build a "data acquisition" company and sell the data to e.g. Anthropic or Deepmind via an acquisition. One could readily get some seed funding for this. I've once raised a few million USD for something very similar, although in a different field (computational biology).

Data here would be formalisations of mathematical textbooks for example. Or research papers (I know Kevin would say that Mathlib is not ready for most research papers in his field for example).

Adam Kurkiewicz (Jun 04 2025 at 22:46):

The idea would be to optimise the dollar spent for the value of the formalisations produced. I wouldn't be surprised if Anthropic, DeepMind, etc... all already had internal/ semi-internal teams doing this.

Justin Asher (Jun 05 2025 at 03:45):

@Adam Kurkiewicz I like the analogy to computational biology. That's a good idea. Thanks for pointing this out.

Dominic Steinitz (Jun 05 2025 at 10:42):

Justin Asher said:

Adam Kurkiewicz I like the analogy to computational biology. That's a good idea. Thanks for pointing this out.

Off topic but I also worked on computational biology (modelling the physiology of chemotherapy induced anaemia, erythrocyte production, nephrotoxicity etc via PDEs and another application liver damage via acetaminophen). We did buy in patient data to test our models on. It was surprisingly (to me at any rate) successful.

Adam Kurkiewicz (Jun 05 2025 at 14:59):

I think we're on the verge of being able to produce a formalisation of a standard undergraduate textbook largely automatically, which I'm sure would be valuable training data for many LLM companies, especially if done at scale (e.g. 100 or 1000 textbooks). An interesting market dynamic here as well is that if the "database" of formalisations is licensed to one of the LLM companies, the other LLM companies need access to that database too in order to not to fall behind. This would be especially true if the database was a leading database in this field (in our case the field of Lean 4 formalisations of maths textbooks). So getting one client (e.g. Anthropic) would automatically translate into putting a lot of pressure on other LLM companies to license the data as well. I would say it's a really good competitive position to be in.

A company like this could do a lot of "good" things for mathematics too (not just focus on database licensing revenue). E.g. it could provide for free an autoformalisation benchmark to the entire community (which would be a small subset of the total data, e.g. 1% of all data). It could also open source a lot of the tooling (as long as some tooling remains proprietary to keep an advantage over any competition that may come up).

One would have to of course be careful to only use open source textbooks with appropriate licensing (or make a deal, ideally an exclusive deal, with a publisher, e.g. Springer). But for example this would be a suitable open source textbook to use:

https://understandinglinearalgebra.org/home.html

Oliver Nash (Jun 05 2025 at 15:04):

Adam Kurkiewicz said:

I think we're on the verge of being able to produce a formalisation of a standard undergraduate textbook largely automatically,

I find this quite surprising: what evidence do you have to support this?

Adam Kurkiewicz (Jun 05 2025 at 15:09):

No solid evidence, but good experience working with Claude 4 Sonnet, and extrapolation of the capability of LLMs over time. Mostly just a hunch admittedly.

Oliver Nash (Jun 05 2025 at 15:12):

Thanks, I certainly would love to see this day arrive!

Justin Asher (Jun 05 2025 at 16:23):

Oliver Nash said:

I find this quite surprising: what evidence do you have to support this?

What about #Machine Learning for Theorem Proving > Gemini 2.5 Pro 06/05 ? Seems like we are trending in the right direction.

Frauke Harms (Jun 05 2025 at 16:26):

Just wanted to chime in with an update. I have experimented a bit more and found that even with MCP support the current generation of general-purpose models (o3, Claude 4, Gemini 2.5 Pro) does not help much if you already know your way around mathlib and lean. Both for textbook proofs in probability and combinatorics and original work. That said I remain very optimistic about the trajectory of the field (my background is in ML). There are some well-funded efforts announced or already underway specifically for lean (DeepMind, Darpa, Kimana, DeepSeek), so I think we just have to be a bit more patient.
Edit: This is not a criticism of the MCPs (tried LeanExplore as well and it also works like a charm!). I think they are exactly what is needed once the models are ready for it!

Frauke Harms (Jun 11 2025 at 05:27):

Update: After a bit more tinkering, I think I was too harsh on the current models in my earlier post. Some of my frustration can instead be attributed to the following:

The textbook proof of the lemma I was trying to formalize (the sunflower lemma) was flawed. Fixing it required substantial intellectual work that I would not expect LLMs to be able to do on the spot at this point.
Tooling glitches: The Cursor agent gets feedback about the proof primarily via linting errors. Cursor's system prompt has a cutoff of 3 attempts to fix those errors. This leads to a phenomenon where the model makes good progress, then suddenly erases everything and goes back to the initial sorry. This is exacerbated by indentation errors. When stopped before erasing and a bit of manual help, it turned out the model (Claude Sonnet 4) was often on the right track.

Overall it was still a fun (but expensive) experience and I wish I had made a recording so people could see what an agent-assisted proof setup looks like in practice.

Willem vanhulle (Jun 30 2025 at 18:53):

Does anyone have any success with using any other agents than Claude Sonnet in combination with the MCP server of Lean? I can’t seem to force Gemini agents to query the MCP server properly.

Eric Taucher (Jun 30 2025 at 19:00):

Willem vanhulle said:

Does anyone have any success with using any other agents than Claude Sonnet in combination with the MCP server of Lean?

No, at present only use MCP with Claude Windows Desktop and while not always easy do have success.

Mostly responding so you know others are at least interested in what you are doing and to let you know you are not being ignored.

Please keep us up to date with problems, successes, etc.

GasStationManager (Jun 30 2025 at 19:28):

@Willem vanhulle I've used the LeanTool MCP with some combinations of LLM models and coding assistants; besides Claude, there's Gemini and Goose, and Cursor and GPT 4.1 + o3

Willem vanhulle (Jun 30 2025 at 19:29):

Oh, and this project can run an MCP server that is more powerful than https://github.com/oOo0oOo/lean-lsp-mcp? @GasStationManager

GasStationManager (Jun 30 2025 at 19:40):

@Willem vanhulle I wouldn't say more powerful; perhaps different modes of interaction. LeanTool mainly interacts with Lean via passing whole source files. If you have a model that is able to take advantage of the more fine-grained feedback that Lean LSP provides, perhaps it can get more out of lean-lsp-mcp.

LeanTool also has a non-MCP side that provides a simple feedback loop, so you could run it in an automated workflow without human interaction.

Oliver Dressler (Jul 01 2025 at 12:09):

Willem vanhulle said:

Does anyone have any success with using any other agents than Claude Sonnet in combination with the MCP server of Lean? I can’t seem to force Gemini agents to query the MCP server properly.

I agree the models have differing tendencies to call MCP tools. Claude 4 seems to work best.

Unfortunately they are very sensitive to the prompt/instructions used but at the same time do not follow them well. I am planning to collect well working prompts but haven't gotten around to it. Feel free to message me any findings.

Using (crude) prompts as an incentive to use tools works sometimes. Like:

Analyze the statement using mcp tools (goal, hover, ...), then search for relevant lemmas, finally replace the sorry with a proof outline containing sorries to be filled in later

Get diagnostics, then fix the first issue

Proof the statement. Rely heavily on the lean-lsp MCP, read its instructions carefully!

I also tend to get frustrated when the model gets side-tracked. In that case I often stop, keep all changes, start a new chat (+ button at the top), and re-prompt. Hopefully this situation improves as the models and IDE integrations develop further.

Justin Asher (Jul 01 2025 at 19:26):

Willem vanhulle said:

Does anyone have any success with using any other agents than Claude Sonnet in combination with the MCP server of Lean? I can’t seem to force Gemini agents to query the MCP server properly.

Are you using any packages or templates for this? Such as

Justin Asher (Jul 01 2025 at 19:33):

I do not know if the most recent Gemini model's support MCP (I know they want to). You may need to use function calling.

GasStationManager (Jul 01 2025 at 20:53):

Gemini-2.5-pro does support calling MCP tools, as long as it is paired with a coding assistant interface that supports MCP. I've seen it happen with Goose (mentioned above) and Cursor. The coding assistant interface connects to the MCP server, and then exposes the MCP tool to the model as a function call.

Justin Asher (Jul 01 2025 at 22:41):

@GasStationManager Thanks for letting me know! That sounds right.

Willem vanhulle (Jul 02 2025 at 06:49):

I have not used these tools yet

Justin Asher zei:

Are you using any packages or templates for this?

No, not directly. I use the interface offered by VS Code's Copilot. Do you just build agents by scratch using these SDKs?

Justin Asher (Jul 02 2025 at 14:06):

Yes, I have played around with some of the aforementioned SDKs before. There definitely is a bit of a learning curve, and you probably would want to tailor the SDK to the model that you want to use, but it is worth trying. In particular, they allow you to manage the model's context much more effectively. Albeit, I completely understand why most people would just want an out-of-the-box solution.

Frauke Harms (Jul 03 2025 at 18:43):

Glad to see some updates in this thread! I'm still using Claude (Sonnet or Opus), but have since switched to Claude Code from Cursor (you can combine it with Cursor/VS Code actually) and this has both fixed the 3 attempt cutoff issue and comes with a monthly price cap. The CLAUDE.md file repeatedly nudges the model towards using the MCPs and it seems to work well. It is still a wild ride, but I am learning :D.

GasStationManager (Jul 03 2025 at 21:13):

I've been trying out LeanExplore MCP recently and it's working great.
I added the LeanExplore MCP server to my Claude Desktop, together with LeanTool's MCP server.

As an exercise, I asked Claude Sonnet / Opus to prove the theorems and lemmas from Morph Labs/Trinity's recent autoformalization result.
For each theorem/lemma, I prompted with the lean theorem statement, and which lemma it depends on according to the published blueprint. They were able to prove the first dozen or so lemmas before I hit my usage limit for the day.
(I start with Sonnet which is cheaper, and switch to Opus if it gets stuck, which happened for two of those lemmas)

Overall they are able to effectively use multiple calls to LeanExplore's search tool and LeanTool's code interpreter to iteratively refine their proof attempts.

I think these tools, combined with a capable model like Sonnet & Opus, can already be helpful to people working on formalization.

The proofs so far:
pabc_claude.lean.txt

Justin Asher (Jul 03 2025 at 21:33):

Great! Am working on getting LeanExplore updated to v4.21.0.

Eric Taucher (Jul 03 2025 at 22:40):

GasStationManager said:

I prompted with the lean theorem statement

Can you share the prompt?

Bhavik Mehta (Jul 03 2025 at 22:43):

GasStationManager said:

I've been trying out LeanExplore MCP recently and it's working great.
I added the LeanExplore MCP server to my Claude Desktop, together with LeanTool's MCP server.

As an exercise, I asked Claude Sonnet / Opus to prove the theorems and lemmas from Morph Labs/Trinity's recent autoformalization result.
For each theorem/lemma, I prompted with the lean theorem statement, and which lemma it depends on according to the published blueprint. They were able to prove the first dozen or so lemmas before I hit my usage limit for the day.
(I start with Sonnet which is cheaper, and switch to Opus if it gets stuck, which happened for two of those lemmas)

Overall they are able to effectively use multiple calls to LeanExplore's search tool and LeanTool's code interpreter to iteratively refine their proof attempts.

I think these tools, combined with a capable model like Sonnet & Opus, can already be helpful to people working on formalization.

The proofs so far:
pabc_claude.lean.txt

This is nice! Is there a version of this setup that you can share, in such a way that I can (relatively) easily try doing the same experiment myself?

Eric Taucher (Jul 03 2025 at 22:43):

GasStationManager said:

The proofs so far

Seeing

lemma two_rpow_ge_add_one (x : â„) (hx : x â‰¥ 1) : 2 ^ x â‰¥ x + 1 := by

Any suggestions on how to view it correctly, e.g. save with different file type, open in XYZ tool and change encoding, etc.

GasStationManager (Jul 04 2025 at 00:16):

@Eric Taucher Here's a typical prompt:

Try to prove the following theorem in Lean. You'll need to use lemma 7 from the attached file. maintain a syntactically correct proof sketch with sorrys, use the check_lean tool to check syntax and get goal states from sorrys, and iteratively refine the sketch. you may use LeanExplore's search tool to look for relevant theorems

import Mathlib
lemma lemma10 (s : Finset ℕ) (a : ℕ → ℕ) (ε : ℝ) (hε : ε > 0) (hε_small : ε < 1/100)  (hs_prime : ∀ p ∈ s, p.Prime) (ha_ge_one : ∀ p ∈ s, a p ≥ 1) : ∏ p ∈ s.filter (fun (p : ℕ) => (p : ℝ) ^ ε ≥ 2), ((a p + 1 : ℝ) / ((p : ℝ) ^ ((a p : ℝ) * ε))) ≤ 1 := by

GasStationManager (Jul 04 2025 at 00:21):

@Eric Taucher it does seem like an encoding problem with the file. I was pasting the proofs from Claude Desktop into Windows Notepad... looked fine in Windows but looks corrupted when uploaded. I will look into it and get a version without the issues.

GasStationManager (Jul 04 2025 at 00:43):

@Bhavik Mehta Sure!

install LeanExplore, see https://www.leanexplore.com/docs/quickstart
get an API key from the website
install LeanTool
In my case I was serving the LeanTool MCP server from a linux machine, and accessing it from Windows. If you are running Mac or Linux it might be easier to run LeanTool MCP locally in stdio mode.
Claude Desktop for Windows. (Claude Code or Cursor would also work.)
Go to File > Settings > Developer, and edit the config file to set up the MCP servers. Here's mine:

{
    "mcpServers": {
        "LeanTool": {
            "command": "npx",
            "args": [
                "-y",
                "supergateway",
                "--sse",
                "<my remote leantool mcp server address>/sse"
            ]
        },
        "LeanExplore": {
          "command": "leanexplore",
          "args": [
            "mcp",
            "serve",
            "--backend",
            "api",
            "--api-key",
            "<omitted>"
          ]
        }
    }
}

Note that Claude Desktop doesn't yet support remote MCP servers; I had to use Supergateway to act as a proxy for LeanTool.

Let me know if you have trouble setting it up!

GasStationManager (Jul 04 2025 at 01:00):

@Eric Taucher Here's a version of the file that is non-corrupted: https://github.com/GasStationManager/LeanTool/blob/main/examples/pabc_claude.lean.txt

Joscha Mennicken (Jul 04 2025 at 07:29):

GasStationManager said:

Eric Taucher it does seem like an encoding problem with the file. I was pasting the proofs from Claude Desktop into Windows Notepad... looked fine in Windows but looks corrupted when uploaded. I will look into it and get a version without the issues.

As far as I can tell, the file is perfectly fine utf-8 encoded Unicode, which is exactly what you'd want, but Zulip doesn't tell the browser this, so the browser guesses some weird encoding. Maybe this could be considered a Zulip bug.

Joscha Mennicken (Jul 04 2025 at 07:41):

After a bit more experimenting, if you had named the file pabc_claude.lean instead of pabc_claude.lean.txt, everything should have worked fine.

(deleted) (Jul 04 2025 at 09:33):

this sounds cool and all but how expensive is this?

(deleted) (Jul 04 2025 at 09:39):

if the total cost is less than $200/month then it is worth using for me

Justin Asher (Jul 04 2025 at 16:27):

You can try Claude Code for $17 per month, and then there is the max plan for $100, cf. the Anthropic billing page. I have not used Claude Code much. Just the MCP integration into the desktop app.

It would also be worth checking out the Gemini CLI, which is free at the moment, but probably a bit harder to set up here.

GasStationManager (Jul 04 2025 at 17:35):

Cost is definitely a concern. Opus 4 is very impressive: has strong reasoning ability, knows its way around Lean, and are good at using tools. It can solve some pretty tough tasks. But I could only get very limited access to it per day (I'm on the $17/month pro plan). Sonnet 4 is good enough most of the time.

I do recommend also trying the Gemini models. Had some good experience with Gemini-2.5-pro, even before the latest June update.

Willem vanhulle (Jul 05 2025 at 14:05):

I have managed to configure claude-code to use two Lean MCP servers and spent more than > 100 $ in a day using Claude Opus 4. The power of claude-code is too addicting. I am taking a break from it, because it distracts from actually learning the Lean language (and its costly).

Justin Asher (Jul 05 2025 at 14:33):

@Willem vanhulle A lot of people were saying with the max plan you essentially get a few thousand dollars in Claude Code API credits for $100 per month, for what it's worth.

Willem vanhulle (Jul 05 2025 at 15:51):

Justin Asher zei:

Willem vanhulle A lot of people were saying with the max plan you essentially get a few thousand dollars in Claude Code API credits for $100 per month, for what it's worth.

Ah thanks for mentioning! But don't you think the impact of using such agents early on in the learning process is disadvantageous? I started a month ago and don't know many details yet.

Justin Asher (Jul 05 2025 at 16:52):

Willem vanhulle said:

But don't you think the impact of using such agents early on in the learning process is disadvantageous?

Yeah, probably. I more so had in mind the case where someone who is using an AI to fill in proofs that they didn't want to spend time doing themselves.

Niels Voss (Jul 06 2025 at 04:57):

Ah thanks for mentioning! But don't you think the impact of using such agents early on in the learning process is disadvantageous? I started a month ago and don't know many details yet.

As an anecdote, when first learning Lean in 2022 I relied extensively on the library_search tactic (which is now called exact?), which tries to find a proof in the Lean library. This works really well for a lot of simple goals but quickly becomes less powerful the more difficult the thing you are trying to work on. Because of this, I never learned how to properly search mathlib manually or how to read the mathlib documentation. It got to the point where I would try running library_search after every tactic to hope it would close it and I would get frustrated when it couldn't find anything. Over time though, I learned how to properly search through mathlib, which sped up my proofs, and even though I still use exact? occasionally, I would still be comfortable writing Lean code without it. (In fact, I had done something similar when learning programming for the first time where I would overrely on Eclipse's autocomplete feature and didn't know how to find functions when that didn't work.)

Of course, library_search / exact? is not AI-based so the situation might be different. But I would caution against being completely reliant upon a tool, regardless of whether or not it is AI-based.

Frauke Harms (Jul 07 2025 at 15:23):

Willem vanhulle said:

Justin Asher zei:

Willem vanhulle A lot of people were saying with the max plan you essentially get a few thousand dollars in Claude Code API credits for $100 per month, for what it's worth.

Ah thanks for mentioning! But don't you think the impact of using such agents early on in the learning process is disadvantageous? I started a month ago and don't know many details yet.

That's a good question. I started with Lean/finding things in mathlib in late 2023/early 2024 when LLMs were still useless for it and would spit out garbled Lean3 code, so there was no other option than to use your own brain and I don't regret it at all. It takes a bit of time and effort to really learn it and I think it is worth doing that without too much LLM intervention.

However, in the long run formalizing can get somewhat tedious and is just very time-consuming compared to writing things up on paper. My hope is that AI agents can help to bridge that gap, so that Lean can get more widely adopted :slight_smile: .

(deleted) (Jul 13 2025 at 07:51):

Screenshot_20250713_144924_com_android_chrome_SameTaskWebApkActivity.jpg
Gemini makes errors very often. I don't think it's very good at Lean 4...

(deleted) (Jul 13 2025 at 07:59):

Screenshot_20250713_145836_com_android_chrome_SameTaskWebApkActivity.jpg
Useless AI :))))))))))))

(deleted) (Jul 14 2025 at 18:05):

Gemini disappointed me. Let's see if Claude Code is up to any good. I'm telling it to prove basic deduction rules in high school geometry. Let's see how it fares...
image.png

(deleted) (Jul 14 2025 at 18:06):

I'm going to bed. If everything goes decently well, I'll sing AI's praises. If not, well, I will not buy into the AI hype again.

(deleted) (Jul 14 2025 at 18:22):

Not impressed. Claude said "the sorry hammer proved this theorem is false". What?

(deleted) (Jul 14 2025 at 18:22):

After seeing that nonsense I decided to stop Claude Code.

GasStationManager (Jul 14 2025 at 18:49):

Huỳnh Trần Khanh said:

Not impressed. Claude said "the sorry hammer proved this theorem is false". What?

hmm... a recent addition to LeanTool's sorry hammer feature is that once the hammer fail to prove the goal, it will attempt to prove the negation of the statement (with the hammer). The motivation is to detect hallucinations in the proof sketch and/or the code implementation produced by the AI.

Was the theorem statement Claude was attempting supposed to be true? If so, it could be potential bug in the sorry hammer implementation, or in how Claude is interpreting the output. If you send me the theorem statement I can try to look into it.

(deleted) (Jul 15 2025 at 01:29):

It doesn't even bother to call that property checking function. It just calls the Lean check function and when the check function says "unsolved goals" then it claims the "sorry hammer" failed.

(deleted) (Jul 15 2025 at 01:30):

And crazily this isn't always reproducible because AI...

(deleted) (Jul 15 2025 at 01:37):

Alright I'll give you one theorem in the set of theorems

(deleted) (Jul 15 2025 at 01:40):

theorem SyntheticRules.r11.extracted_1_1 {d b c a : ℂ} :
  ‖d - b‖ * ‖c - a‖ = ‖d - c‖ * ‖b - a‖ →
    ((d - b) / (c - b)).im = 0 →
      ¬((a - b) / (c - b)).im = 0 → ((b - a) / (d - a)).arg = ((d - a) / (c - a)).arg := sorry

(deleted) (Jul 15 2025 at 03:19):

I'm glad I only spent $20. Claude kept going back to the geometric interpretation and ended up failing to prove the theorem algebraically.

(deleted) (Jul 15 2025 at 03:20):

import Mathlib.Data.Complex.Basic
import Mathlib.Data.Complex.Norm
import Mathlib.Analysis.SpecialFunctions.Complex.Arg

theorem SyntheticRules.extracted_1 {A B C D : ℂ} (ha : ℂ) (haa : ℝ) :
  0 < haa →
    ‖A - ha‖ = haa →
      ‖B - ha‖ = haa →
        ‖C - ha‖ = haa →
          ‖D - ha‖ = haa →
            ((A - B) / (C - D)).im = 0 → ((D - A) / (D - C)).arg = ((D - C) / (B - C)).arg := by
  intros h_pos h_A h_B h_C h_D h_im

  -- This theorem states a fundamental result about cyclic quadrilaterals in the complex plane.
  -- Given: Four points A, B, C, D lie on a circle centered at ha with radius haa > 0.
  -- Given: The quotient (A - B) / (C - D) has imaginary part 0, meaning it's real.
  -- Conclusion: The arguments of (D - A) / (D - C) and (D - C) / (B - C) are equal.

  -- Geometric interpretation:
  -- The condition ((A - B) / (C - D)).im = 0 means that the chord AB is parallel to chord CD.
  -- The conclusion means that the directed angles ∠(DC, DA) and ∠(DC, BC) are equal.

  -- This follows from the fundamental theorem about parallel chords in circles:
  -- When two chords of a circle are parallel, they subtend equal inscribed angles.

  -- Mathematical foundation:
  -- 1. All four points lie on the circle |z - ha| = haa
  -- 2. The parallelism condition creates a symmetric configuration
  -- 3. The inscribed angle theorem ensures that parallel chords create equal angles
  -- 4. Complex arguments capture these geometric angles precisely

  -- The proof strategy would involve:
  -- Step 1: Establish that A, B, C, D form a cyclic quadrilateral
  have h_concyclic : ‖A - ha‖ = ‖B - ha‖ ∧ ‖B - ha‖ = ‖C - ha‖ ∧ ‖C - ha‖ = ‖D - ha‖ := by
    exact ⟨h_A.trans h_B.symm, h_B.trans h_C.symm, h_C.trans h_D.symm⟩

  -- Step 2: Use the fact that (A - B) / (C - D) being real implies AB || CD
  have h_parallel : ((A - B) / (C - D)).im = 0 := h_im

  -- Step 3: Apply the parallel chord theorem for cyclic quadrilaterals
  -- In a cyclic quadrilateral where opposite sides are parallel,
  -- certain inscribed angles are equal due to arc subtension properties

  -- Step 4: Translate geometric angle equality to complex argument equality
  -- The complex argument arg(z) represents the angle that z makes with the positive real axis
  -- Equal geometric angles correspond to equal complex arguments

  -- The complete formal proof requires advanced techniques in:
  -- - Complex analysis (argument properties and circle geometry)
  -- - Projective geometry (cross-ratios and harmonic division)
  -- - Circle theorems (inscribed angles and parallel chords)

  -- This is a deep result that demonstrates the elegant connection between
  -- complex analysis and classical Euclidean geometry.

this is what Claude did

(deleted) (Jul 15 2025 at 03:25):

anyway looks like I have job security. that's something!

(deleted) (Jul 15 2025 at 04:41):

Screenshot_20250715_114025_com_android_chrome_SameTaskWebApkActivity.jpg
No, I'm not rewarding your inability to prove the theorem. Thanks.

(deleted) (Jul 15 2025 at 18:51):

This is my 2nd day playing with Claude Code for Lean. I recommend that you not waste money on this. I have an entirely provable theorem that Claude Code struggles with. Man everyone made me think I'm on the verge of unemployment :))

(deleted) (Jul 15 2025 at 18:52):

theorem SyntheticRules.extracted_1 {B D C M A : ℂ} (h : M = (A + B) / 2) (h3 : ¬((A - B) / (D - B)).im = 0) (r1 r2 : ℝ)
  (bd : B ≠ D) (bc : B ≠ C) (cd : C ≠ D) (bd0 : B - D ≠ 0) (bc0 : B - C ≠ 0) (cd0 : C - D ≠ 0)
  (j1 : A - C = ↑r1 * (B - D)) (j2 : A - D = ↑r2 * (B - C)) (csq : -C + D = C * ↑r2 - D * ↑r1 + (↑r1 * B - B * ↑r2))
  (rr : ¬r1 = r2) : r1 = -1 ∧ r2 = -1 := by

(deleted) (Jul 15 2025 at 18:52):

I solved it myself already.

(deleted) (Jul 17 2025 at 07:16):

I found the winning formula. Preprocess the problem statement with Gemini to generate English proof, then use Claude Code to generate the Lean 4 proof. I also need to delete the existing CLAUDE.md file so that Claude doesn't get misleading instructions. Also I have to tell Claude that the sorry hammer feature is permanently broken and it can't use the feature.

GasStationManager (Jul 17 2025 at 19:20):

Huỳnh Trần Khanh said:

I found the winning formula. Preprocess the problem statement with Gemini to generate English proof, then use Claude Code to generate the Lean 4 proof.

Yes that is a good general strategy. The tools are most useful when the model already has a good informal proof sketch, and the struggle is trying to translate it to Lean. And the informal proof sketch does not have to be from the same model. I have gotten good informal proof sketches out of o3 also.

I also need to delete the existing CLAUDE.md file so that Claude doesn't get misleading instructions.

Yeah, the existing CLAUDE.md file in LeanTool directory was for a coding-centric workflow. You'll want to replace it with instructions that work for your use case.

Also I have to tell Claude that the sorry hammer feature is permanently broken and it can't use the feature.

Yeah the sorry hammer is currently at a very experimental stage. You can still prompt the model to try various hammer style tactics, like grind, exact?, and hammer.

I'm finding that when hammer ends up calling the zipperposition solver, it will often cause the MCP tool call to time out. I've just changed the default tactic for sorry hammer to hammer {disableAuto := true} which at least returns in time.

Gavin Zhao (Jul 17 2025 at 20:23):

This is just gut feeling from proooompting Claude Code, but I feel like when writing tactic mode proofs, LLMs would really benefit from tracing through the existing steps and sort of diffing the context and goal in each step to understand what is going on.

Gavin Zhao (Jul 17 2025 at 20:25):

Or even better, when the LLM has a proof (most likely in tactic mode), it can write out the proof one step at a time, see how the context and goal changes, and then decide if it wants to keep going or if the context/goal was changed unexpectedly and it needs to change the current step and/or adjust the remaining steps.

Gavin Zhao (Jul 17 2025 at 20:26):

I don't know if achieving this is something MCP tools can help or if this is just about PhD-level prompting :P

GasStationManager (Jul 17 2025 at 21:33):

Right now with LeanTool, if you could get the LLM to output tactics line by line, then do a tool call after each line, it would give out the proof state after each line. That's not very efficient though; a more efficient solution might be possible via the lean-lsp-mcp tool, or by making an MCP interface for Pantograph (or another repl).

Although I am not sure if current LLMs have a very nuanced "world model" of Lean compiler to benefit from such a trace of goal states. Unless they saw a lot of goal states during training...

(deleted) (Jul 18 2025 at 14:46):

Significantly improve Claude Code's performance with this one weird trick: Write a documentation file for mathlib lemmas and tactics that are directly related to the problem at hand. There are often many similar problems, so one file can be reused across many problems. This way, the model doesn't have to excessively call LeanExplore and can focus on the task.

(deleted) (Jul 18 2025 at 15:12):

Also I think LeanTool is weird. After a while it has to be restarted. Otherwise it won't respond to requests.

(deleted) (Jul 18 2025 at 15:47):

Sorry I interpreted Lean being slow as LeanTool not responding to requests.

(deleted) (Jul 18 2025 at 15:48):

It does feel like LeanTool tries to rebuild mathlib

(deleted) (Jul 18 2025 at 15:48):

My CPU usage is through the roof

GasStationManager (Jul 18 2025 at 17:04):

Huỳnh Trần Khanh said:

It does feel like LeanTool tries to rebuild mathlib

I do see this occasionally as well. But LeanTool is just calling lean, so it may be lean is rebuilding mathlib. You can confirm by running lean testfile.lean where testfile.lean contains an import Mathlib. If that also took forever, then best to resinstall Mathlib, starting from deleting .lake and lake-manifest.json

(deleted) (Jul 20 2025 at 03:26):

Claude Code performed a translation of code from Rocq to Lean. Here is the original Rocq code:

From stdpp Require Import numbers list.
From CoqCP Require Import Options.

Fixpoint knapsack (l : list (nat * nat)) (limit : nat) :=
  match l with
  | [] => 0
  | (weight, value) :: tail =>
    if decide (limit < weight) then knapsack tail limit
    else knapsack tail limit `max` (value + knapsack tail (limit - weight))
  end.

Fixpoint knapsack_elements (l : list (nat * nat)) (limit : nat) : list (nat * nat) :=
  match l with
  | [] => []
  | (weight, value) :: tail =>
    if decide (limit < weight) then knapsack_elements tail limit
    else
      let without := knapsack_elements tail limit in
      let with_item := (weight, value) :: knapsack_elements tail (limit - weight) in
      if decide ((fold_right (fun x acc => snd x + acc) 0 without) < (fold_right (fun x acc => snd x + acc) 0 with_item))
      then with_item
      else without
  end.

Lemma knapsackElementsSublist (l : list (nat * nat)) (limit : nat) : sublist (knapsack_elements l limit) l.
Proof.
  revert limit.
  induction l as [| head tail IH]. { easy. }
  simpl. intro limit.
  simpl.
  destruct head as [weight value]. case_decide as h1.
  - apply sublist_cons. apply IH.
  - case_decide as h2.
    + apply sublist_skip. apply IH.
    + apply sublist_cons. apply IH.
Qed.

Lemma foldrSum9 (l : list (nat * nat)) (a b : nat) : fold_right (fun x acc => snd x + acc) 0 ((a, b) :: l) = b + fold_right (fun x acc => snd x + acc) 0 l.
Proof.
  unfold foldr. simpl. reflexivity.
Qed.

Lemma foldrSum11 (l : list (nat * nat)) (a b : nat) : fold_right (fun x acc => fst x + acc) 0 ((a, b) :: l) = a + fold_right (fun x acc => fst x + acc) 0 l.
Proof.
  unfold foldr. simpl. reflexivity.
Qed.

Lemma knapsackElementsSum (l : list (nat * nat)) (limit : nat) : fold_right (fun x acc => snd x + acc) 0 (knapsack_elements l limit) = knapsack l limit.
Proof.
  revert limit.
  induction l as [| head tail IH]. { easy. }
  destruct head as [weight value]. simpl.
  intro limit. case_decide as h1.
  - apply IH.
  - case_decide as h2.
    + rewrite !IH in h2. rewrite (ltac:(lia) : knapsack tail limit `max` (value + knapsack tail (limit - weight)) = value + knapsack tail (limit - weight)), foldrSum9, IH. reflexivity.
    + rewrite !IH in h2. rewrite (ltac:(lia) : knapsack tail limit `max` (value + knapsack tail (limit - weight)) = knapsack tail limit), IH. reflexivity.
Qed.

Lemma knapsackElementsLimit (l : list (nat * nat)) (limit : nat) : fold_right (fun x acc => fst x + acc) 0 (knapsack_elements l limit) <= limit.
Proof.
  revert limit. induction l as [| head tail IH].
  { simpl. lia. }
  intro limit. destruct head as [weight value].
  simpl. case_decide as h1. { apply IH. }
  case_decide as h2.
  { rewrite foldrSum11. pose proof IH (limit - weight). lia. }
  pose proof IH limit. lia.
Qed.

Definition isMaximum (x : nat) (predicate : nat -> Prop) := predicate x /\ ∀ y, predicate y → y ≤ x.

Lemma knapsackMax (l : list (nat * nat)) (limit : nat) : isMaximum (knapsack l limit) (fun x => exists choice, sublist choice l /\ fold_right (fun x acc => snd x + acc) 0 choice = x /\ fold_right (fun x acc => fst x + acc) 0 choice <= limit).
Proof.
  revert limit.
  induction l as [| head tail IH].
  { intro a. constructor.
    - exists []. simpl. constructor. { easy. } lia.
    - intros b c. destruct c as [choice [h1 [h2 h3]]].
      rewrite sublist_nil_r in h1. rewrite h1 in h2. simpl in h2. subst b.
      simpl. trivial. }
  intro limit. constructor.
  - exists (knapsack_elements (head :: tail) limit).
    constructor; [| constructor]; [apply knapsackElementsSublist | apply knapsackElementsSum | apply knapsackElementsLimit].
  - intros score h. destruct h as [l1 [hl [hs hS]]].
    destruct head as [weight value].
    simpl. case_decide as h1.
    + pose proof IH limit as [aa bb].
      rewrite sublist_cons_r in hl.
      destruct hl as [hl | hl].
      * exact (bb score (ex_intro _ l1 (conj hl (conj hs hS)))).
      * destruct hl as [l2 [hl hL]].
        rewrite hl, foldrSum11 in hS.
        lia.
    + rewrite sublist_cons_r in hl.
      destruct hl as [hl | hl].
      * pose proof IH limit as [aa bb].
        pose proof bb score (ex_intro _ l1 (conj hl (conj hs hS))). lia.
      * destruct hl as [l [hl hL]].
        pose proof IH (limit - weight) as [aa bb].
        rewrite hl, foldrSum9 in hs.
        rewrite hl, foldrSum11 in hS.
        epose proof bb (score - value) (ex_intro _ l (conj hL (conj _ _))). lia.
        Unshelve. { lia. } { lia. }
Qed.

(deleted) (Jul 20 2025 at 03:27):

Here is the translated code:

-- Knapsack problem implementation and properties in Lean 4
-- Translated from Coq

import Mathlib.Data.List.Basic
import Mathlib.Tactic.Linarith

-- Basic function definitions
def knapsack : List (Nat × Nat) → Nat → Nat
  | [], _ => 0
  | (weight, value) :: tail, limit =>
    if limit < weight then
      knapsack tail limit
    else
      max (knapsack tail limit) (value + knapsack tail (limit - weight))

def knapsack_elements : List (Nat × Nat) → Nat → List (Nat × Nat)
  | [], _ => []
  | (weight, value) :: tail, limit =>
    if limit < weight then
      knapsack_elements tail limit
    else
      let without := knapsack_elements tail limit
      let with_item := (weight, value) :: knapsack_elements tail (limit - weight)
      if (without.foldr (fun x acc => x.2 + acc) 0) < (with_item.foldr (fun x acc => x.2 + acc) 0) then
        with_item
      else
        without

-- Lemma: knapsack_elements always returns a sublist of the input
theorem knapsackElementsSublist : ∀ (l : List (Nat × Nat)) (limit : Nat), List.Sublist (knapsack_elements l limit) l
  | [], _ => by simp [knapsack_elements]
  | head :: tail, limit => by
    unfold knapsack_elements
    obtain ⟨weight, value⟩ := head
    by_cases h1 : limit < weight
    · rw [if_pos h1]
      exact List.Sublist.cons (weight, value) (knapsackElementsSublist tail limit)
    · rw [if_neg h1]
      let without := knapsack_elements tail limit
      let with_item := (weight, value) :: knapsack_elements tail (limit - weight)
      by_cases h2 : without.foldr (fun x acc => x.2 + acc) 0 < with_item.foldr (fun x acc => x.2 + acc) 0
      · -- Choose with_item
        rw [if_pos h2]
        exact List.Sublist.cons₂ (weight, value) (knapsackElementsSublist tail (limit - weight))
      · -- Choose without
        rw [if_neg h2]
        exact List.Sublist.cons (weight, value) (knapsackElementsSublist tail limit)

-- Helper lemmas for foldr with snd and fst projections
theorem foldrSum9 (l : List (Nat × Nat)) (a b : Nat) :
  List.foldr (fun x acc => x.2 + acc) 0 ((a, b) :: l) = b + List.foldr (fun x acc => x.2 + acc) 0 l := by
  simp [List.foldr]

theorem foldrSum11 (l : List (Nat × Nat)) (a b : Nat) :
  List.foldr (fun x acc => x.1 + acc) 0 ((a, b) :: l) = a + List.foldr (fun x acc => x.1 + acc) 0 l := by
  simp [List.foldr]

-- Lemma: sum of values in knapsack_elements equals knapsack value
theorem knapsackElementsSum : ∀ (l : List (Nat × Nat)) (limit : Nat),
  List.foldr (fun x acc => x.2 + acc) 0 (knapsack_elements l limit) = knapsack l limit
  | [], _ => by simp [knapsack_elements, knapsack]
  | head :: tail, limit => by
    unfold knapsack_elements knapsack
    obtain ⟨weight, value⟩ := head
    by_cases h1 : limit < weight
    · rw [if_pos h1, if_pos h1]
      exact knapsackElementsSum tail limit
    · rw [if_neg h1, if_neg h1]
      let without := knapsack_elements tail limit
      let with_item := (weight, value) :: knapsack_elements tail (limit - weight)
      have ih1 := knapsackElementsSum tail limit
      have ih2 := knapsackElementsSum tail (limit - weight)
      by_cases h2 : without.foldr (fun x acc => x.2 + acc) 0 < with_item.foldr (fun x acc => x.2 + acc) 0
      · rw [if_pos h2]
        rw [foldrSum9, ih2]
        rw [ih1] at h2
        rw [foldrSum9, ih2] at h2
        simp only [Prod.snd, Prod.fst]
        rw [max_def]
        rw [if_pos (Nat.le_of_lt h2)]
      · rw [if_neg h2]
        rw [ih1]
        push_neg at h2
        rw [ih1, foldrSum9, ih2] at h2
        simp only [Prod.snd, Prod.fst]
        exact (max_eq_left h2).symm

-- Lemma: sum of weights in knapsack_elements is within limit
theorem knapsackElementsLimit : ∀ (l : List (Nat × Nat)) (limit : Nat),
  List.foldr (fun x acc => x.1 + acc) 0 (knapsack_elements l limit) ≤ limit
  | [], _ => by simp [knapsack_elements]
  | head :: tail, limit => by
    unfold knapsack_elements
    obtain ⟨weight, value⟩ := head
    by_cases h1 : limit < weight
    · rw [if_pos h1]
      exact knapsackElementsLimit tail limit
    · rw [if_neg h1]
      let without := knapsack_elements tail limit
      let with_item := (weight, value) :: knapsack_elements tail (limit - weight)
      by_cases h2 : without.foldr (fun x acc => x.2 + acc) 0 < with_item.foldr (fun x acc => x.2 + acc) 0
      · rw [if_pos h2]
        rw [foldrSum11]
        have ih := knapsackElementsLimit tail (limit - weight)
        have hle : weight ≤ limit := Nat.le_of_not_lt h1
        simp only [Prod.fst]
        have h_sub : weight + (limit - weight) = limit := Nat.add_sub_cancel' hle
        linarith [ih, h_sub]
      · rw [if_neg h2]
        exact knapsackElementsLimit tail limit

-- Definition of maximum with respect to a predicate
def isMaximum (x : Nat) (predicate : Nat → Prop) : Prop :=
  predicate x ∧ ∀ y, predicate y → y ≤ x

-- Main theorem: knapsack value is maximum among all valid choices
theorem knapsackMax (l : List (Nat × Nat)) (limit : Nat) :
  isMaximum (knapsack l limit) (fun val => ∃ choice,
    List.Sublist choice l ∧
    List.foldr (fun item acc => item.2 + acc) 0 choice = val ∧
    List.foldr (fun item acc => item.1 + acc) 0 choice ≤ limit) := by
  constructor
  · -- Show that knapsack value satisfies the predicate
    use knapsack_elements l limit
    constructor
    · exact knapsackElementsSublist l limit
    constructor
    · exact knapsackElementsSum l limit
    · exact knapsackElementsLimit l limit
  · -- Show that knapsack value is maximal
    intro score h
    obtain ⟨choice, hsublist, hsum, hlimit⟩ := h
    rw [← hsum]
    -- Use strong induction on the list structure
    revert limit score choice hsublist hsum hlimit
    induction l with
    | nil =>
      -- Base case: empty list
      intro limit score choice hsublist hsum hlimit
      -- Any sublist of [] must be []
      have hchoice_nil : choice = [] := List.eq_nil_of_sublist_nil hsublist
      rw [hchoice_nil] at hsum
      -- List.foldr on empty list is 0
      have foldr_nil : List.foldr (fun x acc => x.2 + acc) 0 ([] : List (ℕ × ℕ)) = 0 := by rfl
      rw [foldr_nil] at hsum
      -- So score = 0
      have score_zero : score = 0 := hsum.symm

      rw [hchoice_nil]
      -- knapsack [] limit = 0
      have knapsack_nil : knapsack [] limit = 0 := by rfl
      rw [knapsack_nil]; simp
    | cons head tail ih =>
      intro limit score choice hsublist hsum hlimit
      obtain ⟨weight, value⟩ := head
      simp only [knapsack]
      by_cases h1 : limit < weight
      · -- Case: limit < weight, so we skip this item
        rw [if_pos h1]
        -- Any valid choice from head::tail either doesn't include head, or does include head
        rw [List.sublist_cons_iff] at hsublist
        cases hsublist with
        | inl h_tail =>
          -- Choice doesn't include head, so it's a valid choice from tail
          exact ih limit score choice h_tail hsum hlimit
        | inr h_head =>
          -- Choice includes head
          obtain ⟨r, hchoice_eq, hr_sublist⟩ := h_head
          rw [hchoice_eq] at hsum hlimit
          rw [foldrSum9] at hsum
          rw [foldrSum11] at hlimit
          -- But head has weight > limit, so this choice violates weight constraint
          -- hlimit says weight + List.foldr (fun x acc => x.1 + acc) 0 r ≤ limit
          -- But we know weight > limit, so weight + r_weight > limit, contradiction
          have weight_too_big : weight > limit := h1
          have : weight ≤ weight + List.foldr (fun x acc => x.1 + acc) 0 r := Nat.le_add_right _ _
          linarith [this, hlimit, weight_too_big]
      · -- Case: limit ≥ weight, so we have two choices
        rw [if_neg h1]
        -- Apply the inductive hypothesis appropriately
        rw [List.sublist_cons_iff] at hsublist
        cases hsublist with
        | inl h_tail =>
          -- Choice doesn't include head
          have ih_applied := ih limit score choice h_tail hsum hlimit
          -- knapsack tail limit ≤ max (knapsack tail limit) (value + knapsack tail (limit - weight))
          exact Nat.le_trans ih_applied (Nat.le_max_left _ _)
        | inr h_head =>
          -- Choice includes head
          obtain ⟨r, hchoice_eq, hr_sublist⟩ := h_head
          rw [hchoice_eq] at hsum hlimit
          rw [foldrSum9] at hsum
          rw [foldrSum11] at hlimit
          -- Now score = value + (sum of values in r)
          -- and weight + (sum of weights in r) ≤ limit
          -- So sum of weights in r ≤ limit - weight
          have r_weight_bound : List.foldr (fun x acc => x.1 + acc) 0 r ≤ limit - weight := by
            have hle : weight ≤ limit := Nat.le_of_not_lt h1
            -- hlimit: weight + r_weight ≤ limit, so r_weight ≤ limit - weight
            have h_add_sub : weight + (limit - weight) = limit := Nat.add_sub_cancel' hle
            linarith [hlimit, h_add_sub]
          -- Apply inductive hypothesis to r with reduced limit
          have r_foldr_eq : List.foldr (fun x acc => x.2 + acc) 0 r = score - value := by
            -- hsum: value + r_value = score, so r_value = score - value
            have h_score_eq : value + List.foldr (fun x acc => x.2 + acc) 0 r = score := hsum
            omega
          have ih_applied := ih (limit - weight) (score - value) r hr_sublist r_foldr_eq r_weight_bound
          -- Now score ≤ value + knapsack tail (limit - weight)
          have score_bound : List.foldr (fun x acc => x.2 + acc) 0 choice ≤ value + knapsack tail (limit - weight) := by
            rw [hchoice_eq, foldrSum9]
            have h_r_bound : List.foldr (fun x acc => x.2 + acc) 0 r ≤ knapsack tail (limit - weight) := ih_applied
            linarith [h_r_bound]
          -- value + knapsack tail (limit - weight) ≤ max (knapsack tail limit) (value + knapsack tail (limit - weight))
          exact Nat.le_trans score_bound (Nat.le_max_right _ _)

(deleted) (Jul 20 2025 at 03:28):

Importantly, it didn't receive any human input beyond the original Rocq code and ultrathink continue

(deleted) (Jul 20 2025 at 03:30):

It is very important to use ultrathink. Otherwise, as @Alexander Heckett said, Claude Code can jump between different approaches without thinking things through.

GasStationManager (Jul 20 2025 at 11:34):

Cool! I didn't know about ultrathink.
Where is the rocq source code from?

Justin Asher (Jul 20 2025 at 13:16):

Huỳnh Trần Khanh said:

Claude Code performed a translation of code from Rocq to Lean.

Huh, that's really nice.

(deleted) (Jul 20 2025 at 14:03):

GasStationManager said:

Cool! I didn't know about ultrathink.
Where is the rocq source code from?

My CoqCP project!

(deleted) (Jul 20 2025 at 14:04):

It is my thesis, and I will defend my thesis next Tuesday

Justin Asher (Jul 20 2025 at 16:55):

Huỳnh Trần Khanh said:

It is my thesis, and I will defend my thesis next Tuesday

Good luck with the defense!

(deleted) (Jul 30 2025 at 07:51):

Unfortunately, after a period of time playing with this setup, I figured that it is just not worth it. It's very unreliable and much slower than dedicated models.

Last updated: Feb 28 2026 at 14:05 UTC