Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Provably-Correct Vibe Coding
GasStationManager (Jul 15 2025 at 17:48):
Meme-y title aside, I believe that vibe coding is what people want, and Lean is in a good position to deliver a vibe-coding solution that actually works.
I have made a demo app at ProvablyCorrectVibeCoding.com, in which:
- The user creates a formal specification for a Lean coding task, or selects an existing problem from Code with Proofs: the Arena. (Upcoming feature: autoformalization from a natural-language problem description and a test set.) A minimal sketch of what such a specification might look like follows this list.
- An AI agent attempts to output code with a proof of correctness. The AI is an LLM (e.g. Sonnet/o3/Deepseek/Gemini/Grok4) with a user-provided API key, armed with LeanTool. LeanTool not only provides the ability to iteratively fix syntax errors, it also facilitates hallucination detection via property-based testing.
- The solution is checked against the specification using SafeVerify. If the solution passes, the user now has an implementation of the coding task that is guaranteed to satisfy the specification.
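To make the workflow concrete, here is a minimal, illustrative sketch (the function name and properties are my own invention, not taken from the demo app): a specification with the implementation and proofs left as `sorry`, followed by the kind of completed solution an agent might return, which SafeVerify would then check against the original spec.

```lean
/-
Hypothetical specification the user might provide (illustrative only):

  def myMax (a b : Nat) : Nat := sorry
  theorem myMax_ge_left  (a b : Nat) : a ≤ myMax a b := sorry
  theorem myMax_ge_right (a b : Nat) : b ≤ myMax a b := sorry
-/

-- One solution the agent might return, with the `sorry`s filled in:
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- A quick concrete check, the kind of thing property-based testing automates
-- to catch a hallucinated implementation before attempting the proofs:
#eval myMax 3 5   -- expected: 5

theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax; split <;> omega

theorem myMax_ge_right (a b : Nat) : b ≤ myMax a b := by
  unfold myMax; split <;> omega
```

Note that anything not pinned down by the theorems is not guaranteed, so the strength of the guarantee depends entirely on the specification.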
This demo is a proof of concept for a vision I have been building towards through these open-source projects: a future of safe, hallucination-free coding AIs. We are of course still quite a distance from that goal, but hopefully this gives an idea of what it might look like.
This is an initial alpha release; please try it out and let me know what you think! Much of my current research focus is on scaffolding that enables efficient inference-time scaling, in particular detecting hallucinations by leveraging the formal spec and feedback from Lean. If you are interested in collaborating, let me know!
(deleted) (Jul 16 2025 at 01:56):
I really like your work, and I hope one day foundation models will catch up...
(deleted) (Jul 16 2025 at 01:58):
Speaking of collaborating, I have some training data.
(deleted) (Jul 16 2025 at 01:59):
And a lot of math problems to test your stuff on.
Ping J (Jul 16 2025 at 04:15):
I think I've seen work, probably from Danqi Chen, on jointly generating code, specification, and documentation.
Are you familiar with their work? I personally think this is a less-than-fundamental approach, and we will eventually have a better, alpha-zero-style alternative.
GasStationManager (Jul 16 2025 at 04:17):
@Huỳnh Trần Khanh Thanks!
BTW, you can try pure mathematical proof tasks on the demo app as well. Use the "draft_sketch_prove" workflow option which is suitable for proofs.
As for training: while I don't have the budget or the expertise for any large-scale training, one promising direction in my opinion is to have LeanTool as part of a multi-turn RL environment, to fine-tune models to effectively use these tools to get useful information from interacting with Lean. Potentially relevant: verifiers
Ping J (Jul 16 2025 at 04:36):
https://arxiv.org/pdf/2310.17807
GasStationManager (Jul 16 2025 at 04:51):
@Ping J I see you linked to the CLOVER paper. I think it is interesting, and I broadly agree with its overall goals. But from my perspective as someone largely motivated by AI safety, I really think that humans should "own" the specification: at the very least, they should be very much in the loop when creating it, and they should ultimately take responsibility for it.
As for alpha-zero: in my essay I wrote (and still believe) that ultimately some kind of RL (possibly alpha-zero style) will be needed to achieve recursive self-improvement and get us better and better coding AIs. But one of the missing pieces was: what would an effective search component look like for the coding domain? Straight MCTS, as in AlphaProof? Or perhaps something else? I wanted to explore this space; once we have a good idea of how to do search / inference-time scaling, we can incorporate it into an RL pipeline (e.g. see the post above).