Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Claude 4 agent in VS Code


Willem vanhulle (Jun 27 2025 at 19:18):

I have been trying to find the best setup in VS Code for agent mode. Claude 4 seems to be one of the better models: it reasons better about Lean and fixes errors. However, it often stops before all problems are solved and declares everything fixed.

Am I forgetting something? Is there a configuration or setting that prevents Claude 4 from seeing the Lean output and proof state?

Oliver Dressler (Jun 28 2025 at 09:49):

The VS Code agent has only limited access to your project: file contents, history, and some diagnostic messages. A lot of the information (e.g. most of the Lean InfoView) is not available, since the InfoView is provided by a custom VS Code extension.

lean-lsp-mcp is an MCP server that provides additional tools to the agent, such as:

  • Obtaining proof state at any position in a Lean file.
  • Reading all diagnostic messages (including errors, warnings, and infos).
  • Querying external search tools such as leansearch.net.
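For reference, hooking lean-lsp-mcp into VS Code's agent mode amounts to a small MCP config. The snippet below is a sketch that assumes the server is launched via `uvx` and registered in `.vscode/mcp.json`; check the lean-lsp-mcp README for the exact command and file location.

```json
{
  // .vscode/mcp.json — registers lean-lsp-mcp as an MCP server
  // for the VS Code agent. Assumes `uvx` (from the uv Python
  // tooling) is on PATH; the server name "lean-lsp" is arbitrary.
  "servers": {
    "lean-lsp": {
      "command": "uvx",
      "args": ["lean-lsp-mcp"]
    }
  }
}
```

Once registered, the agent can call the server's tools (proof state, diagnostics, search) instead of relying on copy-pasted context.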

Eric Vergo (Jun 28 2025 at 12:53):

Hey Willem,

I have been using ClaudeCode to auto-formalize some things and have found many concerning behaviors, including the one you mentioned.

To work around it, at the beginning of every working session I instruct Claude (among many other things) to run `lake build <module_name> 2>&1 | tee build_output.txt`. This writes the entire build output to a text file in /root, which the LLM can access.
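The pattern can be sketched as below; `MyProject.Basic` is a placeholder module name, and the `echo` lines stand in for a real `lake build` so the redirection can be demonstrated without a Lean toolchain installed.

```shell
# Stand-in for `lake build MyProject.Basic`: merge stderr into stdout
# (2>&1) and mirror the combined stream to a log file with tee, so the
# output is both shown live and readable by the agent afterwards.
{ echo "Building MyProject.Basic..."; echo "error: unsolved goals" >&2; } \
  2>&1 | tee build_output.txt

# The log is now a plain file the agent can grep or read back in full.
cat build_output.txt
```

The key detail is ordering: `2>&1` must come before the pipe so that compiler errors (which Lean prints to stderr) reach `tee` as well.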

Willem vanhulle (Jun 28 2025 at 12:55):

Eric Vergo zei:

To work around it, at the beginning of every working session I instruct Claude (among many other things) to run `lake build <module_name> 2>&1 | tee build_output.txt`. This writes the entire build output to a text file in /root, which the LLM can access.

This sounds like a good solution, but I think it is still best to install the MCP server mentioned by @Oliver Dressler. Once I got it set up, I could stop copy-pasting a lot of context. I have just made a Nix definition for it if you need it: https://github.com/wvhulle/riddle-proofs.

Eric Vergo (Jun 28 2025 at 13:23):

Willem vanhulle said:

I have just made a Nix definition for it if you need it: https://github.com/wvhulle/riddle-proofs.

Thanks; I will check this out.

I'm still spending time getting my hands dirty to find all the rough edges of working with these systems. For the time being, I don't feel I have a good enough understanding of what can go wrong with Claude Code by itself to incorporate new tools. I'm preparing a longer post, but I have seen 'reward hacking' behaviors such as: deleting theorems it couldn't prove to reduce sorry counts, introducing new axioms when explicitly told not to, among many others. While the MCP tools look like they will help prevent these behaviors, it's worth noting that the 'motivation' to cheat is still there.

Based on what I have seen while working with Claude Code, I believe that documenting, studying, and understanding this behavior while an LLM works with Lean will be of broad interest. Here is why: If the LLM is essentially trying to cheat when writing formally verified code that it knows is going to be reviewed by mathematicians/computer scientists, what on earth is it doing when it's in an environment where it's much harder to catch cheating and it's not being scrutinized properly?

Willem vanhulle (Jun 28 2025 at 13:27):

Maybe it is just not trained yet on the right large Lean corpus. I think that's more likely, since Lean is a rather arcane language.


Last updated: Dec 20 2025 at 21:32 UTC