Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: chart minif2f progress over time
Quinn (Dec 02 2024 at 18:28):
has anyone seen a chart of minif2f progress over time? it's ok if it's fast and loose with the refactors/version control that it's undergone
Jason Rute (Dec 02 2024 at 20:49):
- https://paperswithcode.com/sota/automated-theorem-proving-on-minif2f-test
- https://paperswithcode.com/sota/automated-theorem-proving-on-minif2f-valid
They are missing MANY results, especially almost all newer ones. (That can be fixed by going through and adding them. A good but not complete list of recent models is here: https://leanprover.zulipchat.com#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/Proper.20LLM.20Lean4.20Integration.20with.20recursive.20checks.20for.20error/near/474449773).
Also they are grouped by pass@k, which is isn’t a good way to group things. In theorem proving, pass@1 with a tree search calling GPT-4 at every node and pass@100 with just GPT-4 (no tree search) are using similar amounts of compute/tokens. Whereas, a tree search using a smaller neural network using pass@100 is using far less compute than either. (The two DeepSeek-Prover papers have a slightly better way, where they approximately count calls to the model.)
Also, it is not clear what to do with current models using especially large amounts of compute like DeepSeek-Prover, InternLM2-Step-Prover, or (probably) Aristotle. This seems to be one of the ways models can improve there score on Minif2f. And Aristotle is just a blog post that sounds too-good-to-be-true so it is likely using a lot of compute. (If AlphaProof ever evaluated on MiniF2F it would also be using a lot of compute.) Also InternLM2-Step-Prover many to be consider as cheating since it trains on compfiles (although in fairness we don’t know if compfiles is in the training data of other LLMs).
And of course most of the best models are evenly split between Lean and Isabelle.
Maybe an even better source of recent numbers is just the tables in recent papers like InternLM2-Step-Prover, DeepSeek-Prover-v1.5, and ABLE which include their results and previous ones.
Quinn (Dec 02 2024 at 22:27):
wow. fantastic answer, thanks so much!
Last updated: May 02 2025 at 03:31 UTC