Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Context Rot in MLTP
Alapan Chaudhuri (Aug 30 2025 at 15:22):
I was going through some of the benchmarks used for MLTP (ML-based Theorem Proving). MiniF2F, ProofNet, and competition-oriented datasets (IMO, Putnam problems) typically seem to involve relatively self-contained problems with short proof lengths.
I am new to this, so I am not sure, but how do LLM-based proof assistants handle context degradation in long-form theorem proving vs. competition-style benchmarks? Does context rot affect theorem proving any differently from other cases?
- Has anyone done systematic measurements comparing proof success rates as a function of dependency chain length or required context size?
- What strategies have been effective for managing this? Chunking proof goals differently, using RAG approaches for relevant lemmas, or finding that certain proof-structuring patterns help maintain coherence?
Kelly Davis (Aug 31 2025 at 11:16):
From what I've read, there are several methods that can address context degradation in long-form theorem proving:
- Generate a purported proof and pass it to Lean for validation. If the proof is invalid, retain the valid portion and continue proving from some leaf or non-leaf proof state of the partial proof. DeepSeek-Prover-V1.5 uses this, for example.
- Generate a purported proof and pass it to Lean for validation. If the proof is invalid, pass it together with the error message back to the model so it can correct the proof; repeat as needed. Goedel-Prover-V2 and Delta-Prover use this, for example (a rough sketch of this loop follows the list).
- Generate the proof one tactic/proof step at a time and use a more standard search algorithm, e.g. Monte Carlo Tree Search (MCTS), to organize the large-scale arc of the proof. HyperTree Proof Search (HTPS) and, to some extent, DeepSeek-Prover-V1.5 use this technique.
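If it helps to see the shape of the second method concretely, here is a minimal Python sketch of that generate, check, repair loop. `llm_generate` and `lean_check` are hypothetical placeholders for a model call and a Lean checker; this is not the actual interface of Goedel-Prover-V2 or Delta-Prover.

```python
def llm_generate(prompt: str) -> str:
    """Ask a model for a complete candidate Lean proof (placeholder)."""
    raise NotImplementedError  # call whatever model you use here

def lean_check(proof: str) -> tuple[bool, str]:
    """Check a candidate proof with Lean; return (ok, error_message) (placeholder)."""
    raise NotImplementedError  # e.g. drive a Lean REPL or `lake env lean`

def prove_with_repair(statement: str, max_rounds: int = 4) -> str | None:
    prompt = f"Prove the following theorem in Lean 4:\n{statement}"
    for _ in range(max_rounds):
        proof = llm_generate(prompt)
        ok, error = lean_check(proof)
        if ok:
            return proof
        # Repair step: show the model its failed attempt plus Lean's error
        # and ask for a corrected complete proof.
        prompt = (
            f"Theorem:\n{statement}\n\n"
            f"Previous attempt:\n{proof}\n\n"
            f"Lean error:\n{error}\n\n"
            "Return a corrected complete proof."
        )
    return None  # no valid proof within the repair budget
```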
Note that these methods were not necessarily explicitly constructed to address context degradation in long-form theorem proving. And surely there are other examples, but these were the first that came to mind.
As to your other questions:
- There are some measurements in the Kimina-Prover Preview paper (Figure 4) showing how miniF2F test accuracy increases with an increasing number of tokens. (There are likely other studies, but that was the first that came to mind.)
- Some of the things I mentioned earlier are effective for managing this, but there are also others, e.g. LEGO-Prover creates a growing library of lemmas that it can draw upon (a rough sketch of that idea is below); Draft, Sketch, and Prove first creates an informal proof, turns it into a formal proof sketch, and then uses the sketch to formally prove the target result.
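As a very rough illustration of the growing-lemma-library idea (not LEGO-Prover's actual implementation; `embed` is a placeholder for any text-embedding model):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for any text-embedding model."""
    raise NotImplementedError

class LemmaLibrary:
    """Accumulate proved lemmas and retrieve the few most relevant per goal."""

    def __init__(self) -> None:
        self.lemmas: list[str] = []          # Lean source of accepted lemmas
        self.vectors: list[np.ndarray] = []  # their embeddings

    def add(self, lemma_source: str) -> None:
        """Store a lemma once Lean has accepted its proof."""
        self.lemmas.append(lemma_source)
        self.vectors.append(embed(lemma_source))

    def top_k(self, goal: str, k: int = 5) -> list[str]:
        """Return the k stored lemmas most similar to the current goal."""
        if not self.lemmas:
            return []
        q = embed(goal)
        sims = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
                for v in self.vectors]
        best = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]
        return [self.lemmas[i] for i in best]
```

The retrieved lemmas get prepended to the prompt for the next goal, so the context stays small even as the library grows.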
Hope this helps, though this is only the tip of the iceberg.
Jared green (Aug 31 2025 at 15:24):
manage context: select a subset of the context to use in every query.
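One way to picture that (a hedged sketch; the relevance scorer and token counter are placeholders): score each candidate context item against the current goal and greedily pack the most relevant ones into a fixed token budget.

```python
def score(item: str, goal: str) -> float:
    """Placeholder relevance score, e.g. embedding similarity or BM25."""
    raise NotImplementedError

def num_tokens(text: str) -> int:
    """Placeholder token counter for whatever model is in use."""
    raise NotImplementedError

def select_context(items: list[str], goal: str, budget: int = 4000) -> list[str]:
    """Greedily keep the most relevant items that fit within the token budget."""
    ranked = sorted(items, key=lambda it: score(it, goal), reverse=True)
    chosen, used = [], 0
    for item in ranked:
        cost = num_tokens(item)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen
```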
Last updated: Dec 20 2025 at 21:32 UTC