Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: A dataset of proofs for 20 IMO problems and extracted lemmas

Roozbeh Yousefzadeh (Mar 04 2025 at 04:42):

Hello, we wanted to share our paper and dataset which was recently published at TMLR. We provide the complete proofs for 20 IMO problems from miniF2F and beyond. We also extract 1329 lemmas from those proofs. We then evaluate the ability of LLMs on proving the lemmas in our dataset. The best accuracy is 39% achieved by DeepSeek Prover. DeepSeek and Goedel only prove lemmas with short proofs (average length of less than 3 lines), relying heavily on automatic Lean solvers such as nlinarith.

Here is the paper: https://openreview.net/pdf?id=CrKMqRAhBo and here is the dataset: https://github.com/roozbeh-yz/IMO-Steps/tree/main

Would appreciate any feedback or comments from the community.

Notification Bot (Mar 04 2025 at 04:53):

This topic was moved here from #general > A dataset of proofs for 20 IMO problems and extracted lemmas by Kim Morrison.

Joseph Myers (Mar 04 2025 at 11:24):

As of what date was "Lean proof publicly available" in Table 1 determined? 1962 P2, 1965 P2, 2022 P2 and 2023 P4 all currently have solutions in Compfiles. 2022 P2 also had a Lean 3 solution (!3#15621) available as far back as July 2022.

Notification Bot (Mar 04 2025 at 12:54):

13 messages were moved from this topic to #Machine Learning for Theorem Proving > A dataset of proofs for 20 IMO problems and extracted lem... by Eric Wieser.

Roozbeh Yousefzadeh (Mar 05 2025 at 02:14):

Thanks for your comment and for pointing out those proofs. I was not aware of the proofs available there (some of them are written very recently, e.g., December 2024, and none of them are in the IMO directory of Mathlib). For the Table 1 in our paper, I checked the proofs in the IMO directory of Mathlib. I also checked the proofs published in the AI literature.

Would Compfiles like to incorporate my proofs for the problems that do not yet have a formal proof there?

To make up for the 4 proofs that are already available publicly, I will release the proofs for 4 other IMO problems that do not have a formal proof at Compfiles: 1961p1, 1966p4, 1978p5, and 1993p5.

Roozbeh Yousefzadeh (Mar 05 2025 at 02:36):

Here is the list of problems that their formal proofs are released with the paper but Compfiles does not have a complete proof for them: 1969p2, 1974p3, 1983p6, 1984p6, 1985p6, 1992p1, 1997p5, 2022p5, 2007p6 (shortlist). I hope my formal proofs will be useful for others.

David Renshaw (Mar 05 2025 at 15:38):

Would Compfiles like to incorporate my proofs for the problems that do not yet have a formal proof there?

Pull requests to add these to Compfiles would be welcome!

Andreas Gittis (Mar 05 2025 at 21:47):

It seems to me writing public solutions for IMO problems in the test set of an ML benchmark can have a negative effect on the value of that benchmark. Wouldn't it have been better to do the same thing for 20 IMO problems that are not in miniF2F?

Joseph Myers (Mar 05 2025 at 22:27):

"2007p6 (shortlist)" is an odd description; recent shortlist problems have identifiers such as A1, C2, G3 or N4 (Algebra, Combinatorics, Geometry, Number Theory), not "p6". Don't follow the miniF2F mistake of inventing random numbers for shortlist problems and claiming they are IMO problems when they aren't at all!

I don't know the Compfiles view of IMO shortlist problems, but to me it seems a reasonable place for such problems and solutions (correctly identified and with a correct formal statement corresponding to the informal one, of course).

Joseph Myers (Mar 05 2025 at 22:33):

I'd like to see public formal statements and solutions to all 392 past IMO problems (up to mathlib archive standards, meaning that all lemmas arising of more general use go in mathlib proper in appropriate generality). (I'm gradually filling out more geometrical definitions to at least allow more past geometry problems to be stated cleanly with mathlib definitions.)

It's for people using public benchmarks to deal with excluding material (both formal and informal) related to those benchmarks from training data or otherwise account for how such material may have influenced performance on the benchmark. If you want to train an AI to solve future IMO problems, training on public formal solutions to past such problems is a perfectly reasonable thing to do (although 392 problems might be rather too little to help very much in AI training).

Roozbeh Yousefzadeh (Mar 12 2025 at 04:50):

Thank you for your comments and feedback. I appreciate it.

Last updated: May 02 2025 at 03:31 UTC