Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Lean benchmarks for modern (opaque) LLMs
Jason Rute (Nov 05 2024 at 17:26):
I mean it is too late to hope that the ProofNet (Lean 3) and possibly ProofNet (Lean 4) pairs are not already trained on. That doesn’t mean the models will score 100%, and it might not even mean the benchmark is worthless, but it will draw skepticism about whether such results will generalize to new problems. (I admit I’m more skeptical than most, but not all, other researchers in this field.)
Auguste Poiroux (Nov 05 2024 at 17:42):
I agree with you on that point to some degree, especially because by multiplying the number of versions, we are increasing the chances of data contamination. For the Lean 4 version, most closed models still have training cut-offs before 2024, so it is more or less still fine for these models. But in a few months, we won’t know anymore.
However, there are also open models and datasets out there such as Llemma / ProofPile 2 that specifically filtered out ProofNet. These can be used to develop new autoformalization methods while ensuring no data contamination.
Jason Rute (Nov 05 2024 at 17:46):
Even Llemma only filters it out of the fine-tuning data (ProofPile). Was ProofNet used in the training of the LLaMa or Code-LLaMa base models? Would that make a difference? We just don’t know these things.
Jason Rute (Nov 05 2024 at 17:56):
I think either (a) we accept that our benchmarks are broken, (b) we make better, more thought-out benchmarks, or (c) we just work on real-world auto-formalization and stop bothering with broken toy benchmarks. (But I’m likely being at least a bit naive.)
Auguste Poiroux (Nov 05 2024 at 18:49):
I agree we can’t know for sure; we can only make guesses and a posteriori analyses. We know that Llemma is based on CodeLLama and Llama2, which have a cutoff of September 2022. ProofNet (the original, in Lean 3) was released in 2023, I believe? But then you are right, we don’t know if Meta accidentally included ProofNet in their fine-tuning data.
To me, the main problem is the lack of transparency in today’s LLM training, but we cannot do much about that. Unfortunately, like you said, this implies that we have to find new ways to evaluate these models.
I agree that (c) is the ideal setup, but I think it might be hard to compare different works. (b) sounds good, we just need a good idea (ideally not as expensive to maintain as running private evaluations) :upside_down:
I am happy to discuss more how we can overcome today's benchmark issues (I was secretly hoping that my message would lead to such discussion ^^). However, if we continue in that direction, maybe we should create a new thread since it is a larger issue than just ProofNet?
Jason Rute (Nov 05 2024 at 18:51):
Yeah, feel free to make a new thread and move any messages there.
Notification Bot (Nov 05 2024 at 18:57):
6 messages were moved here from #Machine Learning for Theorem Proving > ProofNet fix by Auguste Poiroux.
Jason Rute (Nov 06 2024 at 14:20):
My thoughts on benchmarks have been relatively consistent and I’ve shared them a few times:
- they should focus on the sorts of problems that theorem proving folks care about (which is probably not competition math)
- they should be benchmarks for systems and not LLMs (so connecting an LLM to a tree search, a symbolic solver, etc should be acceptable)
- they will leak into LLM training and one needs a plan for that. Canary strings are good, but likely not enough. Some possible options are (a) completely hidden problems (like in Kaggle), (b) temporary new problems (like this year’s math competitions, new additions to mathlib, recent sorry theorems, etc.), which are only valid tests for a short period before they become just another stale test set, or (c) problems for which no solution exists anywhere.
- benchmarks should probably steer clear of specific things the community wants to formalize. Otherwise, once that work is completed, formalized solutions to the test problems exist in the wild. (The exception is if one adopts the idea above that benchmarks are temporary.) MiniF2F has this problem, since if one wants to do better on the real math competitions it is good to formalize the older competitions (like in mathcomp, which has more and more overlap with MiniF2F and soon probably PutnamBench).
- we have to take into account that, unless the problem is novel, the English (or Chinese, etc.) version has a proof online which will be part of the training data. So then the task becomes autoformalization of a proof.
- benchmarks which don’t come with proofs probably have mistakes in roughly 10-20% of their Lean statements, making those problems trivially true or trivially false. (Examples include MiniF2F, PutnamBench, and ProofNet; a minimal illustration is sketched after this list.) One needs a plan for this.
- Lean (and Coq, Isabelle) are constantly changing and it doesn’t necessarily make sense to fix a particular version, but then how does one have a static benchmark? One needs a plan.
- Benchmarks shouldn’t be owned by individual researchers (and worse, by researchers at companies). The problem is that the researcher leaves the field (or the company) and then the benchmark can’t be fixed. That is why MiniF2F is so massively broken. There is no oversight or planning ahead for these issues.
- Benchmarks will be gamed (Goodhart’s law). We clearly see that with MiniF2F. Harmonic’s 90% on MiniF2F is a great example: they (probably) spent a lot of compute to solve each problem, they changed the problems, and they specifically trained on problems very similar to the test problems. (And they are not the only ones. DeepSeek is probably doing RL on the test problems without realizing it, and AlphaProof used 3 days of compute for inference, well beyond what people had originally intended when they thought about the “IMO grand challenge”.) Now this isn’t necessarily wrong, and the results are interesting, but it should be thought about ahead of time.
- Honestly, I think benchmarks aren’t helping this field much and it might be better to just focus on real-world applications: solving unsolved problems at the MathOverflow level (which may not actually be unsolved, just with solutions that exist somewhere unknown to the asker), (auto)formalization of real math, translation between ITPs, filling in sorries in real Lean projects, real-world use cases of ATP/ITP, and writing verified code. (Some of these can be benchmarks of course, but with an eye to solving the real problem, and maybe also where the benchmark is only a temporary stepping stone until the problem is solved.)
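To make the “trivially true or trivially false” point concrete, here is a minimal, made-up sketch (not taken from MiniF2F, PutnamBench, or ProofNet; the statement and names are invented) of how a mis-formalized problem can be closed without touching the intended mathematics:

```lean
import Mathlib

-- Hypothetical example, not copied from any benchmark.
-- Intended informal problem: "show that the positive real solution of x² = 4 is x = 2."
-- Suppose the formalizer accidentally introduces two contradictory sign hypotheses:
theorem misformalized (x : ℝ) (hx : x ^ 2 = 4) (hpos : 0 < x) (hneg : x < 0) :
    x = 2 := by
  -- The hypotheses are inconsistent, so the goal follows vacuously;
  -- `hx`, the actual mathematical content, is never used.
  linarith
```

A prover that “solves” such a statement gets credit on the benchmark while proving nothing about the original problem, which is why statement validation (or reference proofs) matters.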
Auguste Poiroux (Nov 06 2024 at 15:09):
Thank you for sharing and summarizing your thoughts! Very interesting, and I agree with most of them.
What about having competitions then? I think this could solve most of the problems here and also give better comparability between methods. We can imagine the following setup (although it might be hard to organize in practice):
- An annual competition with ATP and autoformalization tracks
- Each event comes with a brand-new set of unseen problems, ideally problems that theorem proving folks care about. For instance, it could be something like a Lean blueprint of an ongoing project, with proofs to fill in. (Maybe I am a bit naive here; even if the field is advancing fast, ATP methods are not at that level yet.)
- Two different leaderboards: unrestricted and fixed-compute (Kaggle-like).
Alex Meiburg (Nov 06 2024 at 19:21):
I think that the three objectives
> focus on the sorts of problems that theorem proving folks care about (which is probably not competition math)
> benchmarks should probably steer clear of specific things the community wants to formalize.
> we have to take account that unless the problem is novel, the English (or Chinese, etc) version has a proof online which will be part of the training data.
are in pretty great tension (and are even in tension pairwise). Novel problems that would not be of interest for humans to formalize, but are still the kind of thing that "theorem proving folks care about"? I have trouble picturing what that would look like.
Alex Meiburg (Nov 06 2024 at 19:25):
One route is problems that involve "big constants". Like, "compute the 12th homotopy group of the 10-dimensional Hawaiian earring" or similar. But these are sort of risky in the sense that there may just end up being a pretty 'routine' approach that computes the kth group of the n-dimensional Hawaiian earring, for k >= n, and the LLM memorizes this and just plugs in constants at the end.
Alex Meiburg (Nov 06 2024 at 19:27):
As another example, computing prime ramification or class numbers is a nice sort of exercise I remember doing a few times as an undergrad, and it's easy to plug in different number fields to get different problems.
But then there usually ends up being some kind of tractable formula for e.g. cubic fields, which could very well make its way into mathlib in the future and then the problem becomes very easy.
Thomas Zhu (Nov 07 2024 at 16:21):
At the risk of shameless self-promotion: our miniCTX benchmark aims to address exactly some of the issues raised here.
- Its problems are sourced from real-world Lean projects (PFR, PNT+, etc) instead of competition problems. In practice, we found some techniques that can significantly boost performance on these real-world problems, while not having any effect on miniF2F performance. So if the ultimate goal is a theorem proving helper for Lean users, over-optimizing on miniF2F-like benchmarks might not be the best approach.
- It is sourced from theorems created after a cut-off date (December 2023). This is so that LLMs pretrained before this date are safe to evaluate on miniCTX.
- It will continually automatically update in the future, to stay ahead of future pre-training cutoffs.
- Its problems consist of theorems together with their surrounding context and metadata, allowing any combination of LLMs, tree search, symbolic approaches, premise selection, etc. to be used to solve the problem, taking as input the theorem name, preceding file content, other content in the repository, natural-language comments, proofs of similar theorems, and so on (a rough sketch of such a context-dependent problem is below). In our ablations, we found all of these helpful for theorem proving.
Of course, this is not a perfect solution, and still requires some (albeit mostly automatic) maintenance on our side, but I think this is a first step toward a more real-world, LLM-aware benchmark.
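For flavor, here is a rough, made-up sketch of the kind of context-dependent problem described above; it is not an actual miniCTX entry, and `stepSum` / `stepSum_eq_sq` are invented names, not from any real project:

```lean
-- Hypothetical file excerpt, for illustration only.

/-- Project-local definition appearing earlier in the (fictional) file. -/
def stepSum : Nat → Nat
  | 0 => 0
  | n + 1 => stepSum n + (2 * n + 1)

/-- Target theorem: the prover is given the statement together with the preceding
file and repository context, and has to replace the `sorry` with a proof. -/
theorem stepSum_eq_sq (n : Nat) : stepSum n = n ^ 2 := by
  sorry
```

Without the definition of `stepSum` in context, memorized competition solutions are of little help, which is the sense in which these problems test something different from miniF2F-style statements.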
Last updated: May 02 2025 at 03:31 UTC