Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: ProofNet fix

Auguste Poiroux (Nov 05 2024 at 16:28):

Hello everyone,

Interest in the ProofNet benchmark has been growing in the ML community for both automated theorem proving and autoformalization. I believe it is a very interesting benchmark, in particular because it contains unique formalizations (and it is excluded from training sets such as ProofPile 2).
However, during my experiments, I noticed that current Lean 4 versions of this benchmark contain a non-negligible number of formalization errors: I estimate that more than 25% of the (natural language, formal) pairs contain inaccuracies (different semantic meaning). A significant portion of these errors also appear to be present in the original Lean 3 version.
I spent some time fixing these errors. The updated benchmark is available here for exploration and comparison with existing ProofNet versions: https://proofnet4-fix.streamlit.app/. While I am fairly confident in most of the fixes, I am not a Lean or Mathlib expert, so it's possible I overlooked errors or introduced new mistakes. Feedback and critiques are highly welcome ^^

My current plan is:

to release the updated benchmark as "ProofNet 1.1" to avoid comparison issues in future works.
to release it exclusively on HuggingFace to minimize the risk of data contamination.

While I advocate for better benchmarks in general, I believe it’s best to avoid having multiple variations of the benchmark whenever possible. Before moving forward, I’d appreciate any thoughts from the community on alternative approaches or considerations I might be missing.

Thank you in advance for your input!

Jason Rute (Nov 05 2024 at 16:36):

MiniF2F has the same problems. The Lean benchmarks are a big mess. How does putting it on huggingface reduce data contamination?

Auguste Poiroux (Nov 05 2024 at 16:38):

By data contamination, I meant "reducing the risk for the benchmark to be included in LLM training data". For pretraining data, I believe Github repos are scraped more often than HuggingFace datasets. But maybe I am mistaken?

Jason Rute (Nov 05 2024 at 16:39):

I think Huggingface datasets are prime training material.

Jason Rute (Nov 05 2024 at 16:39):

Also, isn’t ProofNet intentionally automatically translated? So such errors are inevitable and sort of part of the benchmark, no? (The benchmark probably should have been to prove/disprove the theorem.)

Auguste Poiroux (Nov 05 2024 at 16:44):

Well, good to know for HuggingFace datasets, it's quite unfortunate. Do you have any recommendation then?
One of the goal of ProofNet is to assess autoformalization performance by comparing predictions with ground truth formalizations (paper where it was introduced https://arxiv.org/abs/2302.12433). I don't think it was automatically translated, but @Zhangir Azerbayev can confirm. (it would seem weird to me to report the best autoformalization method at 16% otherwise)

Jason Rute (Nov 05 2024 at 16:53):

Sorry. I misremembered.

Jason Rute (Nov 05 2024 at 16:54):

Canary strings are supposed to be the way to keep this stuff out of the training data.

Jason Rute (Nov 05 2024 at 17:11):

But it might be too late for ProofNet (even a fixed up version)

Jason Rute (Nov 05 2024 at 17:11):

(deleted)

Auguste Poiroux (Nov 05 2024 at 17:22):

Damn, the field is moving super fast ^^ Lean 4 versions of ProofNet have been there for 9 months at best (this is the first one to the best of my knowledge: https://github.com/rahul3613/ProofNet-lean4). Jokes aside, I agree that for ATP results it might be too late.
However, for autoformalization, it is definitely not too late in my opinion. Current published results are evaluated by hand. I saw that at ICLR there is a submission (https://openreview.net/attachment?id=hUb2At2DsQ&name=pdf not my work) where they develop a reference-based metric. Since reference-based metrics solely depend on the quality of the reference labels, there is interest in having good benchmarks (otherwise results might be biased towards models making mistakes similar to human ones).

Alex Meiburg (Nov 06 2024 at 19:34):

I asked ChatGPT to 'complete' the problem statement, from ProofNet:

{"name": "exercise_1_16a", "split": "valid", "informal_prefix": "/-- Suppose $k \\geq 3, x, y \\in \\mathbb{R}^k, |x - y| = d > 0$, and $r > 0$. Prove that if $2r > d$,

and it output the exactly correct completion from the dataset:

{"name": "exercise_1_16a", "split": "valid", "informal_prefix": "/-- Suppose $k \\geq 3$, $x, y \\in \\mathbb{R}^k$, $|x - y| = d > 0$, and $r > 0$. Prove that if $2r > d$, then there exists a point $z \\in \\mathbb{R}^k$ such that $|x - z| = r$ and $|y - z| = r$. --/"

so it has _definitely_ been trained on, and memorized, significant parts of ProofNet. Note that the variable "z" was not even mentioned in my prompt.

Jason Rute (Nov 06 2024 at 19:36):

Although ChatGPT also has search. I don’t know how it exactly works but it could be just fetching that live from the internet. (But maybe it shows you a link in that case.)

Alex Meiburg (Nov 06 2024 at 19:38):

I'm pretty sure this was not with search, because (1) it responded much faster than is typical for a search-augmented answer (which I've seen it do -- there's usually a few second delay), (2) I'm on the free plan which _shouldn't_ have search, and (3) it typically gives links when it does search and it didn't here.

Alex Meiburg (Nov 06 2024 at 19:39):

Oh, I just realized its completion wasn't quite right. It's supposed to be "there exist infinitely many points z", not "there exists a point z". But otherwise it's right

Still enough for me to say yeah, that's memorization

Auguste Poiroux (Nov 06 2024 at 21:58):

In my opinion, memorization of the informal statement is not enough to conclude about data contamination. In fact, these informal statements have been extracted from existing text books (see ProofNet papers), on which ChatGPT may have been trained. Additionally, memorizing the informal statements is not an issue, because what we care about is formalization capabilities. To me the real issue is if the model has been trained on/memorized the formalizations.

Qi Liu (Dec 17 2024 at 13:01):

impressive work & really looking forward to your release. (btw. perhaps the Lean 4 version of MiniF2F should also be inspected. It also has a great number of inaccurate and even incorrect formal statements.)

Last updated: May 02 2025 at 03:31 UTC