Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Delta Prover

Jason Rute (Jul 25 2025 at 11:24):

What is the relation of this paper on Lean 4 by ByteDance to Seed Prover? https://arxiv.org/abs/2507.15225 This prover is called Delta Prover. Is it the same? Is this the paper/report for this project?

Jason Rute (Jul 25 2025 at 11:30):

Oh this is an agent based approach using existing LLMs, so maybe not. It it is something different, I can move this to its own conversation.

Jason Rute (Jul 25 2025 at 11:33):

From here, Seed Prover and Delta Prover seem to be different sister projects. Maybe one is trained and the other an agentic loop? https://github.com/ByteDance-Seed/Seed-Prover

(deleted) (Jul 25 2025 at 12:05):

Lol speculation

(deleted) (Jul 25 2025 at 12:05):

But honestly using a DSL is a very obvious way to circumvent the lack of training on Lean for off the shelf LLMs

(deleted) (Jul 25 2025 at 12:05):

But also very laborious...

Notification Bot (Jul 25 2025 at 12:07):

6 messages were moved here from #Machine Learning for Theorem Proving > Seed Prover Achieves Silver-Level Score at IMO 2025 by Jason Rute.

(deleted) (Jul 25 2025 at 12:08):

Phew. Much better home for this discussion.

(deleted) (Jul 25 2025 at 12:08):

By "speculation" I mean the connection is extremely tenuous

Jason Rute (Jul 25 2025 at 12:08):

What connection? Between Seed Prover and Delta Prover?

(deleted) (Jul 25 2025 at 12:09):

Yeah that's right. Delta Prover to me is a way to coax existing LLMs into writing Lean code through an intermediate DSL.

Jason Rute (Jul 25 2025 at 12:11):

I was just confused that ByteDance had two Lean AI prover projects achieving a very high score on MiniF2F. But it turns out they do. (Although we don't know how Seed Prover works, and what exactly they were using for the "Seed Prover" IMO results.)

Jason Rute (Jul 25 2025 at 12:16):

The corresponding authors are @yichi zhou @Pan Lu and @Hang Li

Zheng Yuan (Jul 25 2025 at 15:11):

Jason Rute said:

I was just confused that ByteDance had two Lean AI prover projects achieving a very high score on MiniF2F. But it turns out they do. (Although we don't know how Seed Prover works, and what exactly they were using for the "Seed Prover" IMO results.)

DeltaProver and SeedProver are two systems. Some techniques in DeltaProver are used in SeedProver. SeedProver joins IMO 2025.

Zheng Yuan (Jul 25 2025 at 15:13):

Jason Rute said:

Oh this is an agent based approach using existing LLMs, so maybe not. It it is something different, I can move this to its own conversation.

Yes, SeedProver is a self-trained model with some agentic loop evolved from DeltaProver. DeltaProver is a research project on agentic loop.

Jonathan Laurent (Jul 25 2025 at 17:14):

This is an interesting paper, with a significant result (state-of-the-art on MiniF2F, using only a generalist model). After a quick read, I have two questions.

First, one of the most interesting and novel aspects of their pipeline is their "Iterative Decomposition Repair" proposal: when a Draft-Sketch-Prove attempt fails, the whole process starts again, _with additional feedback_ regarding what subgoals could not be proved in the previous attempt. I would be very interested in seeing precisely how they prompt the model to leverage such feedback. However, I did not find this information in the prompting templates provided in appendix.

Second, a central point of the paper seems to be that Lean4 is not directly suitable for implementing Draft-Sketch-Prove (as opposed to Isabelle), hence the introduction of a custom DSL for managing and gluing subproofs. Citing the paper:

DSP uses the formal language Isabella, while current Lean 4 lack robust support for managing decomposed sub-problems. Specifically, there is no convenient way to simultaneously: (i) store decomposed sub-problems; (ii) extract them as formal statements; and (iii) integrate their individual proofs back into a unified whole proof. For instance, using the have tactic is inadequate for (iii), while listing sub-problems as lemmas falls short for (ii).

I am not sure I get this point and how have is inadequate for their purposes. The Lean language server can be used to validate a sketch that containing sorries, extract the current goal for each sorries, and validate a subproof in context by substituting the associated "sorry" with it and querying the LSP again. Admittedly, this whole process could be further facilitated by the existing tooling but I do not get the necessity of introducing a completely new DSL for managing subgoals, and the insistence on such a DSL in the paper, which I see as a minor engineering detail. Am I missing something?

(deleted) (Jul 25 2025 at 17:33):

My feeling is the paper doesn't tell the whole story and the DSL is much more extensive than what is detailed in the paper.

(deleted) (Jul 25 2025 at 17:35):

To the point of being something more similar to Waterproof.

Jonathan Laurent (Jul 25 2025 at 17:56):

@Huỳnh Trần Khanh Nothing in the paper is hinting at that. What makes you suspect this?

GasStationManager (Jul 25 2025 at 19:41):

@Jonathan Laurent FYI, a lot of the features of LeanTool are geared towards making DSP work in Lean: automatic extraction of goals from sorrys, and the recent feature to try a (list of) hammer tactics on a sorry. And with LeanTool available as an MCP server, you could easily compose it with other tools. I am really enjoying using LeanTool and LeanExplore together, and then it is not hard to prompt (say) Sonnet 4 into doing (recursive) DSP style proving. See here for some example proofs.

The DSL approach is potentially interesting, as a way to manage sub-goals on a larger scale. I will need to try to understand it further. For my use cases though, I often have sorrys inside of recursive functions, and it might not be possible to extract the goal into a lemma outside of the function.

(deleted) (Jul 26 2025 at 01:00):

I suspect that the DSL is much more extensive than what is claimed in the paper because the paper keeps mentioning the DSL a lot. I did an experiment where I pasted a long specification for a DSL into a reasoning model and the model consistently wrote correct code. Claude Code, even when armed with LeanTool and LeanExplore, still struggles with writing correct Lean code. Therefore, I suspect that the DSL covers much more than just subgoal decomposition.

(deleted) (Jul 26 2025 at 04:48):

I propose that we replicate this paper. We know that making a DSL is the winning formula, so we have the general direction already.

Justin Asher (Jul 26 2025 at 05:17):

Probably pretty expensive? You totally could cook up an AI researcher right now if you had funding. I would not doubt if that is the next thing to be done...

(deleted) (Jul 26 2025 at 05:18):

Well I already plan to replicate the paper myself so... After the AlphaGeometry transpilation project this project will be next...

(deleted) (Jul 26 2025 at 05:19):

I have free time and cash, I just need motivation

(deleted) (Jul 26 2025 at 05:21):

As for the cost, I don't think inference costs a lot, but it requires a lot of my time, which I think is a justifiable use of my time since I can exploit the work commercially

Justin Asher (Jul 26 2025 at 05:23):

Are you formalizing problems for work? You can totally use LLMs to do that. (Like in the paper.)

(deleted) (Jul 26 2025 at 05:25):

Yeah that's right I plan to use the strategy in the paper, which is to create an entirely new language for subgoal decomposition. After that, I also create a language for finer details of the proof. I try to make it close enough to Lean, but also different enough from Lean so the LLM doesn't use its existing intuition about Lean and end up generating broken code.

yichi zhou (Jul 26 2025 at 05:34):

@Jason Rute Delta-Prover is a separate project focused on researching test-time strategies. However, the test-time strategy used in Seed Prover is an extension of Delta Prover’s approach.

Acturally, in seed prover, we found a smarter way to effectively scale up at test time without relying on a new DSL. Details of Seed Prover’s test-time strategy will be available in our technical report.

Last updated: Feb 28 2026 at 14:05 UTC