## Stream: Machine Learning for Theorem Proving

### Topic: Paper: Simple Dataset for ... Isabelle/HOL

#### Jason Rute (Apr 28 2020 at 23:35):

I really the abstract on this paper:

Recently, a growing number of researchers have applied machine learning to assist users of interactive theorem provers. However, the expressive nature of underlying logics and esoteric structures of proof documents impede machine learning practitioners from achieving a large scale success in this field. In this data description, we present a simple dataset that represents the essence of choosing appropriate proof methods in Isabelle/HOL. Our simple data format allows machine learning practitioners to try machine learning tools to predict proof methods in Isabelle/HOL, even if they are unfamiliar with theorem proving.

#### Jason Rute (Apr 28 2020 at 23:36):

I was just thinking that we need to make a list of all these datasets (especially as I add some to the list).

#### Jason Rute (Apr 28 2020 at 23:37):

Here is the dataset: https://archive.org/details/PaMpeR_2018

#### Jason Rute (May 04 2020 at 02:59):

So I read this paper. It's not really a paper at all, but a short description and motivation for a dataset. So let me briefly describe the dataset.

• It is the dataset used to build PaMpeR, a proof method recommendation system for Isabelle/HOL by the same author. The idea is very simple: predict the proof method (the Isabelle word for tactic) for a given goal (and context). The author already has a system for doing this (PaMpeR), but I think they are looking for assistance (or competition) in improving their prediction algorithm.
• The dataset is very simple. It is the simplest dataset for theorem proving that I know of, and the hope is that it will attract “machine learning researchers” to this area. The dataset is just 382,801 lines of this:
~/shared/afp-2018-02-26_eval/thys/Incompleteness/Coding_Predicates.thy722 auto 1,0,0,0,1,1,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


You may say: I see the proof method (auto in this case), but where is the goal? Also what are all those 0s and 1s? That is where the simplicity comes in. They have already gone through the difficult step of turning the goal (and context) into 108 (I count 113 actually) Boolean attributes. This has both significant plusses and minuses.

#### Jason Rute (May 04 2020 at 03:01):

On the minus side, I doubt it will attract serious ML researchers. This sort of thing is basically a solved problem (or at least is no harder than a number of other standard datasets). Also, by holding the “feature extraction” constant, they limit the possible improvements. It is very possible that the largest gains in accuracy would come from a better feature extractor. I would say much of the current research on ML for theorem proving is focused on feature extraction for formulas (especially for premises selection).

On the plus side this dataset is pretty easy to work with. If you are looking to try out machine learning and want a (theorem proving) dataset to start with, this one is a good first project.

#### Jason Rute (May 04 2020 at 03:18):

I plan to take a look at it soon. I might make my own intro to ML tutorial using this dataset. It will give me a chance to brush up on my skills, and hopefully share important information with those interested in ML for TP from the TP side. I’ll start with simple algorithms like logistic regression, SVMs, naive Bayes, etc. Since PaMPeR is implemented in Standard ML, these have the advantage that they would be easy to code. (I mean the prediction function would be easy to code, where the prediction function takes inputs from the 113 Boolean attributes in addition to the model parameters, e.g. the logistic regression coefficients. The parameters could be read from a file, where the parameters were trained using an ML package in Python.) Then I will venture into the powerful black boxes, like gradient boosted trees and neural networks. (Although, honestly a small neural network could be easy to program in Standard ML and be reasonably performant as well.)

The arXiv version of the PaMPeR paper has all the statistics of how well the author did against their benchmarks on their evaluation dataset. They provide a ranked list of tactics. About 29% of times their tactic is what the human used. (However, if they guess simp always, they would be right 26% of the time). If one looks at their top 5 predictions, the human tactic appears in the top-5 about 73.55%. (Again, compare this to the fact that the top 5 most frequent tactics account for 70.2% of tactics uses.)

It is too preliminary to say for sure, but after a little playing around, I think some simple models from sci-kit learn and some simple neural models can increase those numbers up to by 5 to 10 percentage points, but I don't know if that is significantly more useful or not.

I’ll post back when I have something for concrete.

#### Jason Rute (May 04 2020 at 03:20):

However, I should also mention that while the original PaMPeR paper provides some metrics, the dataset doesn’t contain is a clearly defined evaluation metric. While this makes it hard to “compete” on getting the best scores, it is a real issue that working data scientists have to deal with every day. Does one want to predict just the most likely tactic, or the top three most likely tactics? Given that some tactics appear way more often than others, is it better to suggest a rarely used tactic and get it right or to suggest the most common tactic just because it is the most common? Last, given that the goal of this is to help others write Isabelle, is it important to test the prediction on new areas of math. Maybe the algorithm can recognize that one theorem is just like another in the same file, but totally fail when it gets to a new file written by a different author doing different math in a different style? These are important questions for us to think about.

#### Brando Miranda (May 06 2020 at 21:12):

Jason Rute said:

I plan to take a look at it soon. I might make my own intro to ML tutorial using this dataset.

Did you ever come around doing that? I might be interested in doing this, at I'd like to help to lay out the plans here. It might be interesting since this seems to be a hard field to get into, so producing a tutorial seems worthwhile. Would it be in python? Does Isabelle have a python API? I couldn't tell if it did just from the papers & short reports.

#### Jason Rute (May 06 2020 at 23:07):

@Brando Miranda See this colab notebook for some initial results. I tried a number of basic models (that would be easy to code in Standard ML since that is what PaMPeR is written it). I've done absolutely no hyper parameter tuning or cross-validation, etc. You are welcome to take this and turn it into something useful (I've sort of got interested in other things right now). Here are a few TODOs I intended to do before loosing interest:

• hyperparameter tuning (manually and/or automatically, for example try different regularization types and amounts for logistic regression)
• k-fold cross-validation (the dataset is fast to train on, so it should be possible to do say 10-fold CV)
• contact Yutaka Nagashima and get his thoughts. (Better, get him involved here on Zulip!)
• I think a better version of CV would be to split the training set up by the file hierarchy into 10-100 roughly equal sections where for example /shared/afp-2018-02-26_eval/thys/Integration is a section. Then one would train on all the other sections and validate on this one. I think this is really important, because if tools like this are to be useful, it is important to apply them to other mathematical theories. This is my largest pet peeve against the ML for TP research. It doesn't taking the user workflow into account, leading to inflated accuracy numbers.

#### Jason Rute (May 06 2020 at 23:19):

Also, here is a neural network model. I used Fast AI, but that was maybe a mistake. Fast AI is a fast, but the interface is super confusing. I think Keras would have been better. However, I think (if my fast AI code isn't broken) that it gives pretty good results, not amazing, but pretty good.

#### Jason Rute (May 06 2020 at 23:22):

Something people outside of Machine Learning often don't know is that a small neural network like this (2-3 hidden layers) would be easy to code in almost any language (just the forward pass) and reasonably performant. If one only needs to evaluate it only hundreds of times a second or so (which would be enough for PaMPeR I think) that shouldn't be a problem. Then one could train the network in Tensorflow/Pytorch in python and save the weights in a file.

#### Jason Rute (May 06 2020 at 23:52):

@Brando Miranda Wrote:

Would it be in python? Does Isabelle have a python API? I couldn't tell if it did just from the papers & short reports.

To be clear, I was only talking about the dataset which is just a tabular dataset. There is no connection to Isabelle (and no formulas of any kind in the dataset since they already did the feature extraction). Really, it is a funny dataset since if one didn't know it came from theorem proving, it would look like any other tabular dataset out there.

#### Jason Rute (May 06 2020 at 23:54):

But if you are curious, I think PaMPeR is written in Standard ML (which I assume is the metalanguage for Isabelle). I think they even wrote (and trained) a decision tree in SML.

Last updated: Dec 20 2023 at 11:08 UTC