Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Percentage of Mizar theorems solved
Jason Rute (Jun 26 2024 at 00:32):
Moving these messages from #Machine Learning for Theorem Proving > DeepSeek-Prover.
Jason Rute (Jun 26 2024 at 00:36):
Don't think I can move them. I'll just quote them...
Jason Rute (Jun 26 2024 at 00:36):
Qian Hong said:
Jason Rute said:
Josef Urban likes to claim that automated methods can solve, say, 80% of Mizar theorems. (If someone could remind me where to find this reference that would be great.)
https://intelligence.org/2013/12/21/josef-urban-on-machine-learning-and-automated-reasoning/
Jason Rute (Jun 26 2024 at 00:36):
Mario Carneiro said:
This accords with my recollection: Josef did not say 80% now; he said 60% now and 80% in 10 years (20 years from the date of that interview).
Jason Rute (Jun 26 2024 at 00:37):
Jason Rute said:
Thanks Mario Carneiro. I misremembered. But I’m actually referring to the 75% number in that paper which is also the number on this website. In that number he is combining many different solvers. If we did the same for MiniF2F, Mathlib, or the Isabelle/HOL AFP (through the PISA benchmark), we would also probably see really high pass rates. (Heck, even in our Graph2Tac paper, if we combine all 25 solvers we benchmarked at some point on the test set, we get a combined 50% pass rate, almost twice the pass rate of our best single method. Of course, this naive combining requires 25x more compute than the single best method.)
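(To make the union-of-solvers arithmetic concrete, here is a minimal sketch. The solver names, solved sets, and test-set size are made up; only the union/compute bookkeeping is the point.)

```python
# Made-up per-solver result sets, just to show the union/compute bookkeeping.
solved_by = {
    "solver_a": {"thm01", "thm02", "thm03"},
    "solver_b": {"thm02", "thm04", "thm05"},
    "solver_c": {"thm05", "thm06"},
}
num_problems = 10  # size of the (hypothetical) test set

# Individual pass rates.
for name, solved in solved_by.items():
    print(f"{name}: {len(solved) / num_problems:.0%}")

# Combined pass rate: a problem counts as solved if any solver solved it,
# at the cost of running every solver on every problem.
union = set().union(*solved_by.values())
print(f"union of {len(solved_by)} solvers ({len(solved_by)}x compute): "
      f"{len(union) / num_problems:.0%}")
```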
Jason Rute (Jun 26 2024 at 00:38):
Lasse Blaauwbroek said:
Jason Rute said:
I misremembered. But I’m actually referring to the 75% number in that paper which is also the number on this website.
Note that this number is obtained by using human-supplied premises. As such, these proofs are not found fully autonomously. A bit further down on the website, it reports a 58% proof rate when using trained premise selectors.
Jason Rute (Jun 27 2024 at 12:19):
@Josef Urban sent me the following detailed response over email (edited and posted here with his permission):
I did announce in 2021 that AI/TP methods could solve 75% of Mizar when helped by human premise selection [1].
I announced in my 2021 final ERC report [2] and later in the 2023 Mizar60 ITP paper [3,4] that AI/TP methods could solve 58.4% ("about 60%") fully automatically (i.e., in the hammer/chainy mode) when using a total CPU limit of 420s.
The evaluations that use CPU-time-limited portfolio go way back at least to the Mizar40 [5] and AR-for-Flyspeck [6] papers. They are also regularly used in hammer evaluations, ATP competitions like CASC, etc. Doing strategy and system combinations within an allowed CPU limit has been one of the major weapons in the ATP domain since the time of Tammet's Gandalf. I even started a series of systems in 2012 (BliStr [7]) that evolve complementary strategies. It's a nice AI research area of its own today (sadly, mostly escaping the DL/LLM-blindsided people).
In [2-6], we are usually pretty clear that there is no time limit applied to the total number of problems solved (75% recently and 56% in Mizar40). It's just an interesting number saying: In all of our attempts (which also use human premises), we could solve this many problems. We could try to construct a time-limited portfolio for this, but we were typically too exhausted by doing it for the hammer/chainy mode.
I did mention 80% in the hammer/chainy mode (and using the same HW resources as before, i.e. 420 s) in my 2013 MIRI interview [8], and then put it as one of the challenges and bets in my 2014 IHP talk [9]. I guess I do "like" to mention these challenges/bets [10]. Maybe because they come from a time when almost nobody was doing ML for TP, there was no money and AI millionaires in it, and I was crazy enough to attach quite a bit of my "poor Dutch postdoc" money to the bets. But that's very different from claiming that we have done them. The 80% bet comes up in 2034 and anybody can still bet me.
As a matter of fact, the 75% total number has been very recently increased to 80% just by using cvc5 in interesting ways. The paper was posted on arXiv this week [11].
[1]: https://github.com/ai4reason/ATP_Proofs/blob/master/75percent_announce.md
[2]: http://ai4reason.org/PR_CORE_SCIENTIFIC_4.pdf
[3]: https://doi.org/10.48550/arXiv.2303.06686
[4]: https://doi.org/10.4230/LIPIcs.ITP.2023.19
[5]: https://doi.org/10.1007/s10817-015-9330-8
[6]: https://doi.org/10.1007/s10817-014-9303-3
[7]: http://arxiv.org/abs/1301.2683
[8]: http://intelligence.org/2013/12/21/josef-urban-on-machine-learning-and-automated-reasoning/
[9]: https://www.ciirc.cvut.cz/~urbanjo3/ihp14/ihp14.html
[10]: http://ai4reason.org/aichallenges.html
[11]: https://arxiv.org/abs/2406.17762
Jason Rute (Jun 27 2024 at 13:00):
The more I think about it, the more I agree it is a forgotten fact in the DL/LLM world that diversity of approaches helps so much. If aliens attacked us and demanded we submit a computer program solving as many theorems as possible, it would almost certainly be the best idea to throw in as many solvers as possible. This might be more true of theorem proving than of most other applications of ML research, since there is an obvious way to combine two solvers: run one, and if it doesn't solve the problem, run the other. (Of course, smart routing and scheduling of the various solvers would be even better, as Josef says.)
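(A minimal sketch of that "run one, then the other" combination, with per-solver time slices in the spirit of the CPU-limited portfolios Josef describes. The prover names, command lines, problem file, and the "Proof found" success string are all placeholders, not the commands of any particular prover.)

```python
import subprocess

# Placeholder portfolio: (label, command line, CPU-seconds slice).
PORTFOLIO = [
    ("prover_a", ["prover_a", "problem.p"], 60),
    ("prover_b", ["prover_b", "problem.p"], 120),
]

def try_portfolio(portfolio):
    """Run each solver in turn within its time slice; stop at the first success."""
    for label, cmd, slice_s in portfolio:
        try:
            out = subprocess.run(cmd, capture_output=True, text=True,
                                 timeout=slice_s)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            continue  # this solver ran out of time (or isn't installed here)
        if "Proof found" in out.stdout:  # the success string is prover-specific
            return label
    return None

print(try_portfolio(PORTFOLIO) or "no proof within the combined budget")
```

Smarter scheduling would reorder the portfolio and resize the slices per problem, but even this naive fallback already yields the union of what the individual solvers can do.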
Jason Rute (Jun 27 2024 at 13:00):
I still wonder what this realization means for benchmarks like MiniF2F, PISA, and the new ones coming out. If one really wants to get SoTA, then it would be best to union all known approaches. The danger is that such a union isn't very interesting unless it is done in a principled and generalizable way, and I think many of us would hope (maybe naively) for more general single solutions that can replace the hodgepodge of existing approaches.
Jason Rute (Jun 27 2024 at 13:15):
Here are a few follow-up questions for @Josef Urban:
- What do you think is a reasonable ballpark estimate of the percent of Mizar theorems that could be solved in a “reasonable time” given a smart strategy combining existing solvers? (“Reasonable time” is relative, of course, especially given that computers are getting faster and GPUs are more and more commonplace.) I would be interested in either the setting with or without human-selected premises.
- Do you have detailed information on which solvers solved which theorems? I could see that data being very interesting to look into.
- Do your numbers on the ATP Proofs site include work outside your group? In particular, I’m thinking about TRAIL by my colleagues at IBM Research, especially their newest paper. It is an RL approach, just like your RL of TP paper and DeepMind’s Incremental Learning paper (which unfortunately only tests on TPTP). Actually, I’m not sure if you included any RL results, including your own, on the ATP Proofs site. If not, I would be curious why.
Jason Rute (Jun 27 2024 at 17:55):
Actually, I think I have a slightly better idea about my third question. It looks like your rlCoP and plCoP, DeepMind's IL w/ HER, and IBM's TRAIL and NIAGRA all test on a smaller subset of the Mizar dataset, with about 2k problems. (I found the IL w/ HER results for Mizar in their ATP report.) There appear to be two such ~2,000-problem datasets, MPTP2078 and M2k, where M2k is much easier since it includes only problems already solvable by an ATP. I also assume both test sets include human-selected premises, making them easier.
Jason Rute (Jun 27 2024 at 17:55):
So I guess these methods can't be counted in the above 75%, since they don't cover the whole Mizar library like the results on the ATP Proofs website.
Jason Rute (Jun 27 2024 at 17:56):
Nonetheless, it is interesting to point out that the best methods on MPTP2078, including both Vampire and the best RL methods, each get close to 75% alone. (See Table 2 in the NIAGRA paper for a full summary of current results.) Of course, this might be using more compute than any of the approaches listed on the ATP Proofs website. (Also, RL is a multiple-problems-at-a-time approach rather than a one-problem-at-a-time approach.) I wonder how all the results in Table 2 do when combined. The combination must easily be approaching over 80% (again, in the easy-mode setting of using human premises).
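(To give a feel for what such a combination buys, here is a tiny greedy-union sketch. The solved sets below are made up and only stand in for the real per-system results in Table 2; the point is how quickly the marginal gains shrink once the strongest system is in.)

```python
# Made-up solved sets standing in for the per-system results in Table 2.
solved = {
    "Vampire": set(range(0, 1500)),
    "rlCoP":   set(range(400, 1550)),
    "TRAIL":   set(range(450, 1600)),
    "NIAGRA":  set(range(500, 1650)),
}
total = 2078  # size of MPTP2078

covered, remaining = set(), dict(solved)
while remaining:
    # Greedily take whichever system adds the most not-yet-solved problems.
    name, probs = max(remaining.items(), key=lambda kv: len(kv[1] - covered))
    gain = probs - covered
    if not gain:
        break
    covered |= gain
    del remaining[name]
    print(f"{name}: +{len(gain)} new, union now {len(covered) / total:.1%}")
```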
Jason Rute (Jun 27 2024 at 18:02):
Screen-Shot-2024-06-27-at-2.02.59-PM.png