Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Hard proof-based auto-formalization benchmark
Jason Rute (Apr 27 2025 at 21:44):
I'm making a thread just for my proof-based autoformalization benchmark proposal. It is the first benchmark mentioned in my slides here. (One can still use #Machine Learning for Theorem Proving > Talk: Preparing for the next stage in autoformalization to discuss my talk in general.)
Jason Rute (Apr 27 2025 at 21:44):
There is a lot of interest in this benchmark, and it would be good to get it fleshed out more. First, I should mention that I will likely not be able to participate much myself. (In short, this isn't compatible with my job and I don't have a lot of personal time.) But nonetheless, I'm excited about others' interest.
Jason Rute (Apr 27 2025 at 21:44):
To me the most important aspects of this benchmark are the following:
- Tests real-world Lean autoformalization. There are plenty of Lean proof benchmarks. This one should focus on autoformalization, even if in some ways that is a secondary, emergent property of the benchmark.
- It is scored by proof correctness only (a minimal sketch of such a check follows this list). This avoids the issues most common with other autoformalization benchmarks, where scoring is much more ambiguous and challenging.
- It is hard. Many of the proofs in the main benchmark should contain lemmas or definitions, so that when formalized into Lean they amount to more than a single small proof block (or one horribly long proof block).
- It avoids near-future data leakage. If you build the benchmark out of problems which will be formalized soon (what @Kim Morrison calls the rising tide of Mathlib), then there is a real danger that these problems will be formalized and contaminate the benchmark. So no workaday theorems, common theorems, famous theorems, etc. Pick really non-standard stuff. The subjects can be standard, and maybe should be, since they need to be formalized using existing mathlib definitions (or short non-mathlib definitions).
- It is very carefully vetted. The main issue with Lean/Rocq/Isabelle benchmarks is that they contain a lot of misformalizations (about a quarter of the problems in some cases). These can make problems too easy or impossible (and, for this benchmark, fail to match the paper proof). So problems need to be triple-checked for correctness, and assumptions and edge cases need to be carefully vetted.
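[Editor's sketch] A minimal illustration of how correctness-only scoring could work; this is not part of the proposal, and the names Entry and submission are hypothetical. The benchmark fixes a Lean statement, and a submission is any proof of it that the kernel accepts without sorry or extra axioms:

import Mathlib

-- Hypothetical benchmark entry: the formal statement is fixed by the benchmark authors
-- (a real entry would be a hard statement like Kim's LagariasGivenRobin below).
def Entry : Prop :=
  ∀ a b : ℕ, a + b = b + a

-- A submission is simply a proof of the fixed statement.
theorem submission : Entry :=
  fun a b => Nat.add_comm a b

-- Scoring is mechanical: the file must compile, and a check such as the following
-- must report no `sorryAx` (or other unexpected axioms).
#print axioms submission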
Jason Rute (Apr 27 2025 at 21:45):
Things I don't feel as strongly about:
- Should there be an easier trial version to start? This would be a version with no lemmas or definitions in the proofs; the proofs would be short and simple. The two obvious candidates are MiniF2F and ProofNet. (PutnamBench could work too, but it may be more work since the proofs are probably in PDF format, as pointed out to me by @Zeyu Zheng.)
- How do we avoid copyright issues? Probably the best solution is to link to papers or write proofs from scratch, but don't copy paper text.
- How to set up the tiers? I envisioned three tiers: Tier 1 is written by hand and self-contained; Tier 2 is self-contained papers; Tier 3 is non-self-contained papers. Maybe this is too many or not the right levels. (@Kim Morrison pointed out that you could turn a Tier 3 problem into a Tier 2 problem by including the prerequisite lemmas as assumptions of the main theorem; see the sketch after this list.)
- Where to store stuff? The simplest would be to make a repo.
- How to organize people and work? Let's discuss here.
- Do we want training data or just test/benchmark data? Training data is great if it comes easily, but it isn't necessary. If for some reason one can easily come up with many examples, sure, generate training data too. But if not, then a testing benchmark is sufficient.
- How many examples? I was envisioning 100 in each tier (if doing tiers), but that is open for discussion. Fewer may be fine as it is so hard.
- Should we avoid images and scanned PDFs for now? If so, then it's best to find public arXiv papers with LaTeX available (or our own written examples). (Again, we only need a link to the arXiv, not to duplicate the text in the benchmark.)
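[Editor's sketch] A rough illustration of the Tier 3 → Tier 2 trick mentioned above; the names are hypothetical. Results that the source paper cites but does not prove become hypotheses of the benchmark statement, so the entry stays self-contained. Kim's LagariasGivenRobin example below has exactly this shape:

-- `PrereqFromOtherPaper` stands for a result the paper cites without proof;
-- `MainTheoremOfPaper` is the paper's own result. Assuming the prerequisite
-- turns a non-self-contained (Tier 3) problem into a self-contained (Tier 2) one.
def Tier2Entry (PrereqFromOtherPaper MainTheoremOfPaper : Prop) : Prop :=
  PrereqFromOtherPaper → MainTheoremOfPaper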
Jason Rute (Apr 27 2025 at 21:45):
If you are interested in participating, maybe mention it here.
Jason Rute (Apr 27 2025 at 21:48):
Here is Kim's proposal. (She said there is an intentional mistake.)
import Mathlib

open ArithmeticFunction (sigma)
open Real (exp log eulerMascheroniConstant)

namespace LagariasElementaryStatement

local notation "γ" => eulerMascheroniConstant

-- There is an intentional elementary error in the statement of the problem here,
-- as a test of the review process!

/--
This is the statement (without proof) of Theorem 1 from
G. Robin, Grandes valeurs de la fonction somme des diviseurs et hypothèse de Riemann,
J. Math. Pures Appl. 63 (1984), 187-213.
-/
def Robin1 : Prop :=
  RiemannHypothesis → ∀ n, n ≥ 5041 → sigma 1 n < exp γ * n * log (log n)

/--
This is a statement from Lagarias, without proof,
that follows from Proposition 1 in Section 4 of the same paper.
-/
def Robin2 : Prop :=
  ¬ RiemannHypothesis →
    ∃ β : ℝ, 0 < β ∧ β < 1 / 2 ∧ ∃ C : ℝ, 0 < C ∧
      ∀ n, ∃ n', n < n' ∧ sigma 1 n > exp γ * n * log (log n) + C * n * log (log n) / (log n) ^ β

/--
This is the statement, without proof, of Lagarias's
"An Elementary Problem Equivalent to the Riemann Hypothesis",
Amer. Math. Monthly 109 (2002), 534-543.
-/
def Lagarias : Prop :=
  RiemannHypothesis ↔ ∀ n, 2 ≤ n → sigma 1 n < harmonic n + exp (harmonic n) * log (harmonic n)

/--
This is the intended benchmark goal,
using https://doi.org/10.1080/00029890.2002.11919883 as the source material.
-/
def LagariasGivenRobin : Prop :=
  Robin1 → Robin2 → Lagarias

end LagariasElementaryStatement
Jason Rute (Apr 27 2025 at 21:50):
It also seems to have an arXiv version: https://arxiv.org/abs/math/0008177
Joe Zhou (Apr 29 2025 at 14:35):
Thanks for sharing these thoughts on autoformalization benchmarking. This is great! I'd be interested in participating.
- I think the dynamic aspect of the potential benchmark is great (we have some recent work on dynamic benchmarking for testing other aspects of LLMs, such as knowledge components).
- For testing vs. training, yeah, it depends on how difficult the data is to acquire. They can also come from the same pipeline: for testing data we do more review and manual validation to ensure quality, while for training data we can rely on a more automated extraction and annotation pipeline with more tolerance for noise.
- Images and PDFs are very interesting too. I've been thinking about this during the ICERM workshop, and I think we can get more data from different sources with multimodal LLMs (the quality is another question, but it could be good based on my work with vision-language generation).
Jason Rute (Apr 30 2025 at 01:58):
I did a little digging, and the search term "self-contained proof" gives some interesting hits on arXiv. Some of these theorems are famous (or maybe even in mathlib already), so they may not make a good benchmark. But it is a good start to exploring what is out there. Some samples:
- The Wedderburn-Artin Theorem
- A combinatorial proof of Cramer's Rule
- $L^p$ estimate of the heat equation on a bounded domain
- An Elementary Proof of the Local Kronecker-Weber Theorem
Jason Rute (Apr 30 2025 at 02:13):
Another possible source of problems is counterexamples, like in the book Counterexamples in Topology (although I'm sure there are many in other subjects besides topology).
Advantages:
- Easy to state. (There exists an X which is not Y.)
- They involve complicated one-off constructions.
- They are often self-contained.
- They aren't the sort of thing I see in Mathlib right now, so they might be safe from human formalizers getting in the way.
- Could find a bunch of examples by taking two related concepts X and Y in Mathlib and asking an expert whether there is an X which is not a Y, or a related theorem (see the sketch below).
Disadvantages:
- They could be of interest to someone (not in Mathlib, but in the Lean community) who wants to formalize such things, so they may not be perfectly immune to the rising tide of formalization.
- We might need to write up the proofs ourselves.
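[Editor's sketch] A minimal illustration of what such an "X which is not Y" entry could look like, using Mathlib's separation-axiom classes; the name ExistsT1NotT2 is hypothetical, and a real entry would pick something less standard:

import Mathlib

-- Counterexample-style benchmark statement: there is a T₁ space that is not Hausdorff.
-- (The classical witness is the cofinite topology on an infinite type; the paper
-- proof accompanying the entry would construct it explicitly.)
def ExistsT1NotT2 : Prop :=
  ∃ (X : Type) (t : TopologicalSpace X), @T1Space X t ∧ ¬ @T2Space X t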
Floris van Doorn (Apr 30 2025 at 10:15):
A potential source of problems could be extracted from online databases, e.g. the pi-base topology database or the groupprops wiki. That would require manually writing up proofs.
Every claim in such databases could be a potential theorem. E.g. for pi-base, every pair (space, property) and every pair (property, property) could be a potential problem (prove an implication or its negation), usually on the easier side.
Though something like this might be better suited for a (larger) benchmark without corresponding paper proof, since writing/finding a paper proof for these might be time-consuming.
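[Editor's sketch] For a (property, property) pair, an entry could take roughly the following shape; this is an illustration only, the particular implication is already in Mathlib, and a real problem would pin down the exact pi-base claim:

import Mathlib

-- pi-base-style implication problem: every compact Hausdorff space is normal.
def CompactT2ImpliesNormal : Prop :=
  ∀ (X : Type) (t : TopologicalSpace X), @CompactSpace X t → @T2Space X t → @NormalSpace X t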
Alex Meiburg (Apr 30 2025 at 15:32):
A lot of classic counterexamples might, in general, be "welcome" in some form in https://github.com/leanprover-community/mathlib4/tree/master/Counterexamples . Not saying they're at high risk of being added to Mathlib soon, but it's possible.
Floris van Doorn (Apr 30 2025 at 15:35):
I don't think the Counterexamples folder is suited for a large collection of not-super-interesting counterexamples, so a typical X in pi-base should probably not go in that folder, unless it has a mathematically interesting idea. (I'm not speaking on behalf of the maintainers here, just my opinion.)
Bolton Bailey (May 01 2025 at 02:15):
I'd be interested in participating, and I've already started formalizing one of my early papers from grad school. Some questions:
- While the theorems here are purely theoretical, they are about idealized neural networks, so they might be considered "CS" rather than strictly "math". Is this acceptable?
- The theorems would require maybe a few dozen lines of code to make auxiliary definitions for the Wasserstein distance and the feedforward network evaluation function (see the sketch after this list). Is it still self-contained if I add these definitions to the file?
- I don't really expect this paper itself to be in mathlib soon, but perhaps Wasserstein distance could be. Is this a problem, either from the perspective of the results getting easier over time as more API on aspects of the problem comes out, or from the perspective of a new definition with a slightly different API coming out and conflicting with the one in the benchmark?
- This paper has a bunch of independent theorems, but the most interesting/difficult are in section 2. Should I formalize all the theorems or just the main ones?
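[Editor's sketch] A rough illustration of the kind of auxiliary definitions mentioned above; it is not taken from the paper, the names relu and evalOneHiddenLayer are hypothetical, and the Wasserstein distance is omitted since it would take more work to state:

import Mathlib

-- Hypothetical auxiliary definitions a benchmark file might ship alongside the
-- main statement: a ReLU activation and the evaluation of a one-hidden-layer
-- feedforward network with weights W, biases b, and output weights v.
noncomputable section

def relu (x : ℝ) : ℝ := max x 0

def evalOneHiddenLayer {n m : ℕ} (W : Matrix (Fin m) (Fin n) ℝ) (b v : Fin m → ℝ)
    (x : Fin n → ℝ) : ℝ :=
  Finset.univ.sum fun j => v j * relu (W.mulVec x j + b j)

end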
Floris van Doorn (May 01 2025 at 11:41):
Maybe bachelor / master theses are a good source of problems. They are often much more self-contained than published papers, and are often also not the kind of results that we will put in Mathlib soon.
Jason Rute (May 01 2025 at 13:12):
@Bolton Bailey:
- (Mathematical) CS would be perfectly acceptable. Diversity was one of my initial goals.
- Adding auxiliary definitions is fine. I do that in the examples in my talk.
- As for mathlib, the worry I have is just that others will formalize our examples, and then those formalizations will either leak into the training data, or the problems will become trivial because the same theorem is in the library. So I'm not just worried about mathlib, but about any formalization project which formalizes the whole proof. Do you plan to formalize your proof and put it on GitHub?
- I haven't given a lot of thought to independent theorems. I was initially thinking one theorem per paper. (If we have a training dataset, which we might not, it may make sense to have multiple theorems per paper. For testing, I don't see the point in having more than one. And we certainly shouldn't have the same paper in both training and test, even with different theorem statements.) So I would say just pick the one theorem you think best fits this dataset.
Last updated: May 02 2025 at 03:31 UTC