Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Planner + Generator Trend in Recent-ish NTP models
Nicholas Jiang (Dec 04 2025 at 15:42):
I noticed that many recent-ish models combine a strong general LLM (e.g., Gemini 2.5 Pro) for high-level tasks like planning and goal decomposition with a specialized proof generator that writes the Lean (Hilbert, Deepseek Prover V2). However, this makes me think that, without knowing the capabilities of the proof generator, the planner can't choose between different plans so as to minimize the number of proof-generator samples needed to produce a proof. That is, I don't believe there is always a single valid plan, or that all plans produce intermediate goals of the same difficulty for the proof generator. While I have this intuition in mind, I don't really have a point of reference to say "we see a similar example in this paper, where a planner that knows the capabilities of its executor models does a better job".
Does anyone have thoughts on whether a prover-capability-aware planner can significantly improve "sample efficiency" (i.e., the average k in pass@k needed to prove a theorem), and whether there are any past or current papers that touch on this?
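To make that intuition concrete, here is a minimal sketch in Python (all success probabilities are made up, and I'm assuming independent samples with resampling until a subgoal closes, i.e. a geometric model, so the expected number of generator calls for a plan is the sum of 1/p over its subgoals):

```python
# Toy illustration (all numbers are made up): a "plan" is a list of
# intermediate subgoals, each with an estimated per-sample success
# probability for the proof generator. Under a geometric model
# (independent samples, retry each subgoal until it closes), the
# expected number of generator calls for a plan is sum(1 / p_i).

def expected_samples(plan: list[float]) -> float:
    """Expected total generator samples needed to close every subgoal."""
    return sum(1.0 / p for p in plan)

# Two hypothetical decompositions of the same theorem:
plan_a = [0.9, 0.9, 0.9]   # three easy subgoals
plan_b = [0.95, 0.05]      # one trivial subgoal, one very hard one

for name, plan in [("A", plan_a), ("B", plan_b)]:
    print(f"plan {name}: expected samples = {expected_samples(plan):.1f}")
```

Under that model, plan A costs ~3.3 expected samples and plan B ~21.1, so a capability-aware planner would prefer A even though B has fewer subgoals. A capability-unaware planner has no basis for that choice.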
Deleted User 968128 (Dec 08 2025 at 01:57):
Yes, splitting into plan/act is a baseline feature of agentic code clients and has been in use for some time now.
There have been serious efforts to do exactly what you suggest: fine-tune agentic models to be strong within the context of a flow and to be stronger 'generators'.
One of my favorites is Tencent's TRAE + Doubao-Seed-Code, which they had to build because Anthropic was getting banned for use by Chinese companies. GPT5.1-codex-max is very likely also this kind of model.
However, I think most fine-tuning is being done within the context of an agentic flow rather than for a dedicated planner model, as the latter would be more expensive to train and perhaps somewhat inflexible, with a tendency to reward hack.
One problem is that these sorts of models are harder to benchmark and prone to benchmaxxing. You can't just evaluate the model itself; you really need the appropriate scaffolding to get the most out of it. This is why you'll see companies like Anthropic report a superior result for their model on the same dataset than SWE-bench's official (but inferior) harness does.
On top of all this, we have a moving target. While this approach can yield very impressive gains, the risk of obsolescence is great given the pace at which frontier models are moving. What is very much fog of war, though, is how big and how costly the frontier labs' models are, and whether they are even using internal agentic flows.
Perhaps they are not obsoleting anything but rather just throwing money at it.
Deleted User 968128 (Dec 08 2025 at 02:08):
As I mentioned here, the paper from Aristotle (#Machine Learning for Theorem Proving > Aristotle achieves SOTA 96.8% proof generation on VERINA) discusses this further.
Deleted User 968128 (Dec 08 2025 at 02:12):
..
Simon Sorg (Dec 08 2025 at 10:15):
NTP frequently stands for Neural Theorem Proving, which also seems to be the intended meaning here.
Deleted User 968128 (Dec 08 2025 at 14:46):
Thanks, removed the comment. You can look at the edit history if you're curious.
Last updated: Dec 20 2025 at 21:32 UTC