Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Lean-based advanced mathematics benchmark


Nathan Bowler (Apr 15 2025 at 13:54):

Johannes Carmesin and I recently began an initiative to create a benchmark consisting of new PhD level theorems, formalised in Lean, where the answer should be a proof of that theorem formalised in Lean. There's a short (5 page) overview attached to this message.
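
To give a rough sense of the intended format (a toy illustration I'm making up here, nowhere near the intended difficulty): each entry supplies a formalised statement, and a submission should replace the `sorry` with a complete Lean proof.

```lean
import Mathlib

-- Toy illustration of the format only; real entries would be unpublished,
-- research-level results. The task is to replace the `sorry` below with a
-- complete proof.
theorem toy_benchmark_entry (n : ℕ) : n < 2 ^ n := by
  sorry
```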

Let us know if you want to get involved! We're still gathering a team for this, and we are particularly in need of people who have a lot of experience working with Lean.

And let us know what you think! Any comments, ideas or suggestions would be very welcome.

We were hoping somebody else would already be doing this, but after asking around a little we haven't found anyone. Let us know if you have come across any projects along these lines, so we can get in touch.

ProofBench.pdf

Auguste Poiroux (Apr 15 2025 at 15:08):

Nice project! I am interested in such benchmarks as well; we can have a discussion some time. If you are interested in research-level mathematics, you can check out RLMEval. In short, we evaluate on existing Lean blueprint projects, which contain aligned informal and formal representations of statements and proofs. We evaluated only statement autoformalization methods, but this can be extended to neural theorem proving and proof autoformalization. There are some limitations to this benchmark, though, as we have to make sure that LLMs have not been (pre)trained on the selected Lean blueprint projects... That's where I believe your ProofBench project will shine, with new unseen problems ;) Looking forward to the release/milestones of this project!
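
To make the task concrete, here is a made-up toy example of the kind of aligned pair we work with (not taken from any of the evaluated projects): the model sees the informal statement and has to produce the formal one.

```lean
import Mathlib

-- Hypothetical aligned pair, for illustration only (not from any of the
-- evaluated blueprint projects).
-- Informal statement: "The sum of two even integers is even."
-- Statement autoformalization stops at producing the statement below;
-- the proof is included only so the file compiles.
theorem sum_of_evens_is_even {a b : ℤ} (ha : Even a) (hb : Even b) :
    Even (a + b) :=
  ha.add hb
```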

Thomas Zhu (Apr 15 2025 at 15:31):

I think this is a positive direction of research that deserves more attention! At some point in the future, machine learning on Lean will saturate IMO problems, and we will need better problems outside the restricted high-school-competition domain.

Unrelated: I think an example of the kind of problem you envision in ProofBench would be really clarifying. Is there an example problem consisting of the information (1)–(4) in the proposal?

Nathan Bowler (Apr 15 2025 at 16:04):

Thanks!

RLMEval looks great! Thanks for letting us know about it. It's good that there are benchmarks like that already, since the easier benchmarks will soon be saturated. As you say, we are hoping to avoid problems that might already be in the training data, which will mean a lot more work. We'll be in touch soon, @Auguste Poiroux

@Thomas Zhu: Putting together a few example problems with full information is one of the first milestones in our current roadmap, and I'll be sure to put a link in this chat to wherever we make them available. But it will still be a while before they are ready.

Jason Rute (Apr 15 2025 at 22:05):

@Nathan Bowler, congratulations on proposing a new benchmark. Here are a few questions I have about your benchmark, and for any other new formal benchmarks coming out (including @Auguste Poiroux's benchmark).

  1. Is this a proof or an autoformalization benchmark? If the former, would you be concerned if the AI has already seen the theorem's proof in English during training (e.g., in the original arXiv paper)?
  2. Are your problems already publicly formalized, in which case there is room for direct leakage, or are they of interest to other people formalizing math? For example, IMO and Putnam problems are of independent interest to formalizers (like mathcomp), which conflicts with benchmarks made of those very same problems.
  3. What quality control do you have? As you likely know, many Lean benchmarks have been plagued with typos, making problems either false or trivially true (a toy example of what I mean follows this list). How sure are you that your benchmark has few, if any, of those? Will you correct them if they happen, and if so, how will you version those changes?
  4. Will you keep your benchmark up-to-date as Lean and Mathlib change? Again, what versioning will there be?
  5. Will you be okay with someone targeting your benchmark specifically, using a lot of training and compute just to get a good score on it?
  6. Who will control the narrative? For example, with MiniF2F, whoever publishes the paper controls the narrative. There are many MiniF2F versions (even for Lean 4) with different corrections: some only correct false problems, some correct trivially true ones as well. Some evaluations use larger amounts of compute than others. (There is one company claiming 90% on their own custom version of MiniF2F, according to a blog post with almost no details.) On the other hand, the PutnamBench maintainers, while not perfect, control the narrative themselves: they have a central website with a leaderboard, as well as an actively maintained benchmark repository.
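
To illustrate what I mean by a typo making a problem trivially true (a made-up toy example, not from any actual benchmark): a single flipped hypothesis can make the statement vacuous, so a prover can close it without engaging with the intended mathematics.

```lean
import Mathlib

-- Toy example, not taken from any actual benchmark: the intended hypothesis
-- was presumably `0 < n`, but the typo `n < 0` is impossible for a natural
-- number, so the statement is vacuously true.
theorem accidentally_trivial (n : ℕ) (h : n < 0) :
    ∃ p q : ℕ, p.Prime ∧ q.Prime ∧ 2 * n + 2 = p + q :=
  absurd h (Nat.not_lt_zero n)
```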

And sorry, I haven't read your paper (or @Auguste Poiroux's) closely enough to know whether these questions are obviously answered. (I also asked many of these questions of the PutnamBench team when it came out a year ago.)

Auguste Poiroux (Apr 16 2025 at 08:57):

Hi @Jason Rute,
To add some clarification on RLMEval: RLMEval is a framework for evaluating on a set of Lean blueprint projects, and RLM25 is an instantiation of it on 6 Lean blueprint projects. You can find the list of these projects here, or directly in the paper on page 4. That said:

  1. RLMEval has been developed as an autoformalization benchmark, more specifically a statement autoformalization benchmark. But it can easily be extended to tackle the proof autoformalization and automated theorem proving tasks.
  2. Problems are already publicly formalized, hence one has to be careful when evaluating a given model with RLMEval. For RLM25, we selected projects whose first contribution was committed after the training cutoff of the models we evaluate. Why are we evaluating on existing Lean blueprint projects? Because we believe it gives a much better picture of actual performance for people interested in using autoformalization to formalize their mathematical results. For future evaluations, more recent projects will be selected. This approach is very similar to the miniCTX one, but applied to autoformalization. There is also the option of evaluating on private Lean blueprint projects.
  3. I believe current Lean blueprint projects are high-quality and well-polished in general. This is particularly true for projects that are finished or almost finished. However, when evaluating on ongoing Lean blueprint projects, some mistakes may indeed appear.
  4. RLM25 will stay as is to ensure consistent results (commits are fixed). However, we will likely release new versions of the benchmark using RLMEval (probably on a yearly basis).
  5. Yes, though it is not yet obvious how one could achieve that for statement autoformalization.
  6. Thanks for pointing this out. We will keep this in mind for future releases. For RLM25, we will soon add results directly to the GitHub repo, and we will add future published results as well. People interested in adding their results on this benchmark can contact me directly or make a PR to the repo.

Thanks for asking these questions! I believe these are very important points when building a benchmark.

Jason Rute (Apr 16 2025 at 11:02):

As for your point (4), it does seem like testing on data that is past the “light cone” of the model (to quote my physics colleague) is the optimal approach. I hope your benchmarks take off and that you keep updating them with new versions (like how https://matharena.ai tests language models on math competitions immediately after each competition ends).

Nathan Bowler (Apr 16 2025 at 12:31):

Hi @Jason Rute, thanks for the great questions. To be frank, I share some of the concerns that lie behind them. We'll be doing our best to mitigate them, but these are difficult issues and addressing them will take some work. Some of your questions were already answered in the overview attached to my original message, but I completely understand that not everybody has time to look through such documents in detail. I will indicate excerpts from the overview with quote marks.

  1. It is intended as a combined proof and autoformalisation benchmark. "Not only are the skills required for the two domains of proof formalisation and development closely related, the boundary between them is hard to draw clearly. A precise and detailed proof is easier to formalise than a vague proof sketch. The higher the level of the proof, the closer the problem becomes to one of pure proof development. That is the reason for addressing both problems with a single benchmark. By so doing, we are also able to evaluate, for each question, where on this continuum the current system's abilities lie." We do not intend for the proofs to be public or accessible for training. We will use results whose proofs have not been published. "the evaluation portion of the benchmark ... will be kept secret to avoid contamination of the training data."
  2. Our problems will be neither already publicly formalised nor publicly accessible. They will be freshly contributed by working mathematicians to avoid contamination. "Each question in the benchmark will consist of a theorem statement formalised in Lean, together with some additional helpful information. The theorem will be an unpublished result at the level which it would be reasonable to give as a short starter question to a PhD student."
  3. Quality control will be provided for the problems by other working mathematicians (similar to peer review), and for the formalisation of the problems through review by other members of the formalisation team.
  4. The benchmark will be regularly replenished with fresh problems. "after a quarantine period of one year, we will allow the contributors to publish the ideas if they wish, at which point the problem in question will be moved from the private to the public portion of the benchmark. The private portion will be refreshed with new questions on a rolling basis." This will certainly require careful versioning. The current plan is to have major overhauls once a year, including making sure we are up to date with Lean and Mathlib. Minor changes such as corrections of mistakes will be made as necessary, and only affect the part of the version number after the decimal point.
  5. We intend our questions to be sufficiently reflective of real-world mathematics that, if anyone chooses to do this, their high score will reflect genuine usefulness.
  6. We intend to control the narrative. As emphasised above, we will keep the questions private, and control the testing ourselves. We will maintain our own leaderboard.

Last updated: May 02 2025 at 03:31 UTC