Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Proposal: Apply premise selection to Hammer

Jason Rute (Jan 14 2020 at 01:22):

Here is a project which I think is very doable for an enterprising person. Recently, @Gabriel Ebner incorporated an SMT-based "hammer" into a branch of Lean3. Right now, it uses TF-IDF (with cosine similarity) to select premises/lemmas. However, Gabriel showed (on slide 28) that if one used better premise selection one could up-to double the success rate. I think this is a ripe opportunity to apply the recent advancements in machine learning premise selection for theorem proving.

Jason Rute (Jan 14 2020 at 01:25):

For supervised learning, Gabriel has already found a way to extract both the lemmas used in the original lean proofs as well as those used by the hammer. This would make a great training set. Then one could also apply reinforcement learning by using a trained premise selection model, extracting which lemmas have and have not been used, and then using this to train a new model. Repeat until convergence. Since the TF-IDF premise selector is already in C++, I imagine one could incorporate a trained TensorFlow model in there. (And still do the training in Python by using his code to extract the lemmas used.)

Jason Rute (Jan 14 2020 at 01:27):

See here for a discussion on some of this. https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Hammer.20talk

Jason Rute (Jan 14 2020 at 01:31):

Of course a learned model of premise selection would have trouble with never-seen definitions (unlike TF-IDF). Google's HOList however showed that sometimes this is ok. They trained a model on the core HOL Light library, but it did respectably on the Flyspeck library even though there were a number of new definitions. Moreover, there is recent research by IBM about this exact problem.

Jason Rute (Jan 14 2020 at 01:35):

Here is one of the best results in the area of premise selection (also by IBM) at least according to the two standard machine learning benchmarks (HOLStep and the Mizar dataset). While this might be complicated for a first pass, I could imagine starting with something simpler. There are a lot of references in that paper (including the original HOLStep paper). Also, I think Isabelle's Sledgehammer uses naive Bayes for another example.

Jason Rute (Jan 14 2020 at 01:42):

Of course, this might be premature as a project since Gabriel is still improving the hammer (or maybe this was already something he had in mind to add to it). Either way, I could imagine using the dependency graph of Lean premises as another standard premise selection dataset (this one coming from dependent type theory).

Christian Szegedy (Jan 14 2020 at 02:49):

Note that our DeepHOL-zero system for HOList https://arxiv.org/abs/1905.10501 demonstrates how TF-IDF style similarity metric can be combined with standard RL to bootstrap premise-selection without imitation learning. DeepHOL-zero achieves results that are almost at the level of systems that were trained with a combination of RL and imitation learning.

Jason Rute (Jan 14 2020 at 02:56):

Yes. I was going to mention your work too. I forgot you used TF-IDF (or something similar) as a starting point.

Christian Szegedy (Jan 14 2020 at 03:41):

TF-IDF is not spelled out explicitly, but we use a modified version of it. We have not reported the comparisons with the standard TF-IDF, but our variant was superior in our experiments.

Last updated: Dec 20 2023 at 11:08 UTC