Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Ensembling with NTP Models


Nicholas Jiang (Nov 07 2025 at 01:03):

Hi there! I'd like to share a white paper that some friends at the University of Waterloo and I wrote, investigating whether different prover models exhibit 'non-overlapping failures' and whether we could ensemble a group of whole-proof generation LLMs to take advantage of their differing strengths. The main empirical finding is that, under the same budget of 24 proof attempts per theorem for both a single model and an ensemble (i.e. 8 attempts per problem for each of 3 different models), the ensemble proves more theorems in MiniF2F-test over many re-sampling trials. Due to hardware limitations, we only ran this study with 7B models, but for the next stage we'd like to move to models closer to SOTA.

We want to apply this non-overlapping-failures property at the subgoal level, using a planner like the one from BFS-Prover-V2 to produce a 'proof sketch' that is verified, and then attack each subgoal with the ensemble. More details in the paper itself. We're currently struggling a bit with getting the planner in BFS-Prover-V2 to work; mainly we're hitting issues when trying to trace a MiniF2F repo with LeanDojo (e.g. this issue: https://github.com/yangky11/miniF2F-lean4/issues/31). If anyone is interested in this idea and would like to help us, please let me know!

The Unreasonable Effectiveness of Nonoverlapping Failures in LLM Prover Ensembles.pdf

Repo: https://github.com/script-jpg/non-overlapping-failures
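
For anyone skimming, here is a minimal sketch of the budget-split ensemble sampling described above (24 attempts split evenly across the models, stopping once any candidate proof verifies). The `ProverModel` interface, `generate_proof`, and `verify` names are illustrative placeholders, not the repo's actual API:

```python
from typing import Callable, Iterable, Protocol, Sequence, Set


class ProverModel(Protocol):
    """Illustrative interface for a whole-proof generation model (placeholder)."""

    def generate_proof(self, theorem: str) -> str: ...


def ensemble_pass(
    models: Sequence[ProverModel],        # e.g. 3 different 7B provers
    theorems: Iterable[str],              # MiniF2F-test statements
    verify: Callable[[str, str], bool],   # Lean checker: (theorem, proof) -> bool
    budget: int = 24,                     # total attempts per theorem
) -> Set[str]:
    """Split a fixed per-theorem attempt budget evenly across the ensemble
    (24 attempts -> 8 each for 3 models). A theorem counts as proved as soon
    as any model's candidate proof verifies."""
    per_model = budget // len(models)
    proved: Set[str] = set()
    for thm in theorems:
        for model in models:
            if any(verify(thm, model.generate_proof(thm)) for _ in range(per_model)):
                proved.add(thm)
                break  # stop spending budget once the theorem is proved
    return proved
```

The single-model baseline is the same loop with one model given the full budget; the comparison in the paper holds the total number of attempts fixed between the two settings.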

Rishabh Sharma (Nov 07 2025 at 01:13):

I'm one of the friends working on this paper; feel free to reach out!


Last updated: Dec 20 2025 at 21:32 UTC