Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Introducing APE-Bench I


Huajian Xin (Apr 27 2025 at 16:30):

We are excited to share our latest work, "APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries." In this paper, we introduce a novel paradigm of Automated Proof Engineering (APE), which goes beyond the traditional focus on isolated theorem proving by applying large language models to real-world development and maintenance tasks within Mathlib4. Under this new paradigm, models are expected to generate syntactically correct and semantically aligned code patches based on natural language instructions and pre-edit Lean files.
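To make the task format concrete, a single instance pairs a natural-language instruction with a pre-edit Lean file and asks the model for a patch. The sketch below is purely illustrative: the field names, file path, and lemma are hypothetical, not the benchmark's actual schema.

```python
# Hypothetical sketch of a file-level editing task; the field names, the file
# path, and the lemma are illustrative only, not APE-Bench I's actual schema.
task = {
    # Natural-language instruction describing the intended edit
    "instruction": (
        "Generalize `my_lemma` from natural numbers to an arbitrary "
        "additive commutative monoid and adjust its proof."
    ),
    # Path of the pre-edit Lean file inside Mathlib4 (its content is also given)
    "pre_edit_file": "Mathlib/Algebra/Example.lean",
    # Expected model output: a unified-diff patch against the pre-edit file
    "reference_patch": (
        "--- a/Mathlib/Algebra/Example.lean\n"
        "+++ b/Mathlib/Algebra/Example.lean\n"
        "@@\n"
        "-theorem my_lemma (a b : ℕ) : a + b = b + a := Nat.add_comm a b\n"
        "+theorem my_lemma {M : Type*} [AddCommMonoid M] (a b : M) :\n"
        "+    a + b = b + a := add_comm a b\n"
    ),
}
```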

Inspired by the design of SWE-Bench, we present APE-Bench I, the first benchmark targeting file-level structural edits in formal mathematics. We also develop a scalable verification infrastructure tailored to Lean that combines compilation checks with LLM-based semantic evaluation. Finally, we include a systematic evaluation of current state-of-the-art models on this challenging new task. The paper has been uploaded (paper link), and all datasets and code will be open-sourced in the near future.
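As a rough illustration of how the two checks fit together, the sketch below compiles a patched file and then takes a majority vote over judge samples; the `lake env lean` invocation and the `llm_judge` stub are simplifying assumptions for exposition, not our actual harness.

```python
import subprocess

def llm_judge(instruction: str, patched_source: str) -> bool:
    """Placeholder for a call to the judge model (e.g. prompting it to answer
    whether the edit semantically matches the instruction)."""
    raise NotImplementedError("plug in a judge model API of your choice")

def verify_patch(patched_file: str, instruction: str, n_judges: int = 4) -> bool:
    """Illustrative two-stage check: Lean compilation, then LLM-based judgement."""
    # Stage 1 (syntactic): elaborate the patched file inside the Mathlib project.
    compile_result = subprocess.run(
        ["lake", "env", "lean", patched_file],
        capture_output=True, text=True,
    )
    if compile_result.returncode != 0:
        return False  # fails to compile: reject without asking the judge

    # Stage 2 (semantic): majority vote over several judge samples.
    patched_source = open(patched_file, encoding="utf-8").read()
    verdicts = [llm_judge(instruction, patched_source) for _ in range(n_judges)]
    return sum(verdicts) > n_judges // 2
```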

As the starting point of the APE-Bench series, this work lays the groundwork for automated proof engineering. Looking ahead, we plan to extend the benchmark towards multi-file coordination scenarios (APE-Bench II) and autonomous agent systems with task planning and feedback-driven repair capabilities (APE-Bench III), paving the way toward practical, scalable, and engineering-ready formal mathematics powered by LLMs.


Justin Asher (Apr 27 2025 at 16:39):

@Huajian Xin Thanks for sharing this. It appears o3-mini does very well. Have you started testing newer models like Gemini 2.5 Pro or o4-mini? I would be very interested in seeing how these models perform.

Auguste Poiroux (Apr 27 2025 at 18:53):

Great benchmark!
I have a few comments and improvement suggestions. These are not specific to your work only. They represent my opinion, do what you want with them :)

  • LLM-As-A-Judge: Semantic evaluation is super hard, I agree. However, using LLM-As-A-Judge requires trusting the underlying LLM (1) to be accurate and (2) not to be biased/gamed towards some predictions.
    • I see that you do a majority voting over 4 samples per evaluation, which can mitigate (2) to some extent. But, it is known that LLMs have a tendency to overestimate the correctness of their own output. Claude Sonnet 3.7 is the Judge in your experiments. And Claude Sonnet 3.7 is also the only model evaluated with a 0.0% "Semantic Drop". Are the outputs overestimated? Or is Claude excellent at producing semantically valid code?
    • Regarding (1), did you manually measure how accurate your LLM-As-A-Judge metric is? I think this would really strengthen your results and claims in the paper if you add more information about that.
    • LLM-As-A-Judge is likely to return some false positives, even with majority voting. Doing pass@k on a metric with false positives can quickly become deceptive and overestimate the real accuracy as k increases (a toy calculation after this list illustrates the effect).
  • Data contamination: Since you are using Mathlib4 as data for APE-Bench I, it is very likely that LLMs are trained on it. How do you address this issue?
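A toy calculation for that last pass@k point (the 5% false-positive rate is made up, just to show the shape of the effect): even for a task the model never truly solves, a judge that occasionally accepts incorrect patches makes the observed pass@k climb towards 1 as k grows.

```python
# Toy illustration (numbers made up, not from the paper): how judge false
# positives can inflate pass@k as k grows.
def pass_at_k(p_accept: float, k: int) -> float:
    """Probability that at least one of k independent samples is accepted."""
    return 1 - (1 - p_accept) ** k

fp_rate = 0.05  # judge wrongly accepts an incorrect patch 5% of the time
p_true = 0.0    # a task the model never actually solves

# Per-sample acceptance probability as seen through the noisy judge
p_observed = p_true + (1 - p_true) * fp_rate

for k in (1, 4, 16, 64):
    print(f"k={k:2d}  true pass@k={pass_at_k(p_true, k):.2f}  "
          f"observed pass@k={pass_at_k(p_observed, k):.2f}")
# The true pass@k stays at 0.00, while the observed pass@k climbs to
# roughly 0.05, 0.19, 0.56, 0.96.
```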

Huajian Xin (Apr 28 2025 at 04:01):

Justin Asher said:

Huajian Xin Thanks for sharing this. It appears o3-mini does very well. Have you started testing newer models like Gemini 2.5 Pro or o4-mini? I would be very interested in seeing how these models perform.

Thanks for your comments! We will test them in the near future :)

Huajian Xin (Apr 28 2025 at 04:07):

Auguste Poiroux said:

Great benchmark!
I have a few comments and improvement suggestions. These are not specific to your work only. They represent my opinion, do what you want with them :)

  • LLM-As-A-Judge: Semantic evaluation is super hard, I agree. However, using LLM-As-A-Judge requires trusting the underlying LLM (1) to be accurate and (2) not to be biased/gamed towards some predictions.
    • I see that you do a majority voting over 4 samples per evaluation, which can mitigate (2) to some extent. But, it is known that LLMs have a tendency to overestimate the correctness of their own output. Claude Sonnet 3.7 is the Judge in your experiments. And Claude Sonnet 3.7 is also the only model evaluated with a 0.0% "Semantic Drop". Are the outputs overestimated? Or is Claude excellent at producing semantically valid code?
    • Regarding (1), did you manually measure how accurate your LLM-As-A-Judge metric is? I think this would really strengthen your results and claims in the paper if you add more information about that.
    • LLM-As-A-Judge is likely to return some false positives, even with majority voting. Doing pass@k on a metric with false positives can be quickly deceptive and overestimate the real accuracy as k increases.
  • Data contamination: Since you are using Mathlib4 as data for APE-Bench I, it is very likely that LLMs are trained on it. How do you address this issue?

Thanks for your comments! On LLM-as-a-Judge: as we mention in the paper, we also used o3-mini as the judge model, and it likewise reports nearly zero semantic drop for Claude Sonnet 3.7, and an even higher semantic drop for o3-mini itself. However, Sonnet's judgement responses are more stable than o3-mini's, so to keep the overall evaluation framework simple for future comparisons we chose Sonnet alone as the judge. As for data contamination, all test cases are extracted from commits made after February of this year, which we believe postdates the pretraining cutoffs of all the evaluated models, and among the sampled responses we have not observed any that exactly matches the ground truth.


Last updated: May 02 2025 at 03:31 UTC