Zulip Chat Archive

Stream: general

Topic: FormalMATH Benchmarking Formal Mathematical Reasoning of LLM


Zhouliang Yu (May 09 2025 at 05:36):

Excited to introduce FormalMATH: a large-scale formal math benchmark with 5,560 formally verified Lean 4 statements drawn from Olympiad- and undergraduate-level problems.
Best model performance: just 16.46% — plenty of room for progress!
We have open-sourced the complete datasets along with one-click evaluation code.

Github: https://github.com/Sphere-AI-Lab/FormalMATH-Bench
Huggingface: https://huggingface.co/datasets/SphereLab/FormalMATH-All
Website: https://spherelab.ai/FormalMATH/
LeaderBoard: https://spherelab.ai/FormalMATH/#toggleButton
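
For anyone who wants to explore the data, here is a minimal sketch of loading it with the Hugging Face `datasets` library; the split name used here is illustrative, so check the dataset card on the Hub for the exact split and field names:

```python
from datasets import load_dataset

# Load the FormalMATH statements from the Hugging Face Hub.
# The repository id comes from the link above; split="train" and the
# record layout are assumptions, so consult the dataset card first.
ds = load_dataset("SphereLab/FormalMATH-All", split="train")

print(len(ds))  # should be on the order of 5,560 statements
print(ds[0])    # inspect one record to see the actual field names
```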

Jason Rute (May 09 2025 at 10:48):

I worry about having too many benchmarks that are all essentially the same. What do you see this benchmark adding over, say, PutnamBench?

Jason Rute (May 09 2025 at 10:53):

The most interesting part of this benchmark is that, IIUC, it is automatically generated with a human in the loop. But that probably means a lot of mistakes. See the recent discussions on this Zulip about the CombiBench and ProverBench benchmarks for all the different ways one can easily make mistakes with such benchmarks. The first mistake I found in yours is in problem quantitative_reasoning_zh_blue_41. I think that is the fourth problem I looked at. You made the common mistake of forgetting that subtraction of natural numbers is truncated subtraction.

Jason Rute (May 09 2025 at 10:54):

The problem is false for x = 2.
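
For anyone unfamiliar with the pitfall, here is a minimal Lean 4 illustration (this is not the FormalMATH statement itself, just the general phenomenon):

```lean
-- Natural-number subtraction in Lean 4 is truncated: it never goes below zero.
#eval (2 : Nat) - 3        -- evaluates to 0, not -1

example : (2 : Nat) - 3 = 0 := rfl

-- So a condition written with `x - c` over ℕ can silently change meaning:
-- `x - 3 > 0` holds iff `x > 3`, which is how an informally true statement
-- can end up false for small values such as x = 2.
```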

Jason Rute (May 09 2025 at 11:05):

Also, 5k problems is huge for a benchmark of this type; many training datasets aren't this big. Do you really intend this for benchmarking?

Eric Wieser (May 09 2025 at 11:07):

Which version of Lean/Mathlib is this dataset for? Do you plan to update it for future versions?

Joseph Myers (May 09 2025 at 20:34):

Could you give more details of the human verification stage? How much Lean experience did the "12 International Mathematical Olympiad medalist-level human experts" have? What instructions were they given for verification - for example, what common errors (such as natural number subtraction) were in their checklist of things to look for, if any? How many people checked each problem statement? How many errors were found? For errors in problems checked by more than one person, what are the statistics of how many people found each error out of all the checkers of that problem (allowing appropriately for cases where there were multiple errors in the same statement and different checkers found different subsets of them)? Based on such data, do you have any statistical estimate of the number of statements with undetected errors similar to the errors that did get detected?


Last updated: Dec 20 2025 at 21:32 UTC