Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Paper: Tactic Learning and Proving for the Coq Proof Assista


view this post on Zulip Jason Rute (Apr 03 2020 at 02:54):

A paper recently came out on applying machine learning to Coq: Tactic Learning and Proving for the Coq Proof Assistant by Lasse Blaauwbroek, Josef Urban, and Herman Geuvers. Artifacts of Tactic Learning and Proving for the Coq Proof Assistant also includes datasets and Coq configurations for the project (but I don’t think it contains usable code to recreate one’s own Coq AI or to use theirs).

view this post on Zulip Jason Rute (Apr 03 2020 at 02:54):

It has been mentioned on the Coq discourse, but there is no discussion so far, so I’d like to start a discussion here.

view this post on Zulip Jason Rute (Apr 03 2020 at 02:55):

In short, the paper is TacticToe, but for Coq. (TacticToe was one of the first competitive applications of machine learning for ITP and was originally for HOL4. See my notes on it here.) This project follows a similar methodology to TacticToe, and since they don’t give it a name, I’m going to call it TacticToe for Coq. Let me mention some of the highlights of the paper.

view this post on Zulip Jason Rute (Apr 03 2020 at 02:55):

TacticToe for Coq is actually the third machine learning based theorem prover for Coq. The other two projects are CoqGym/ASTactic and Proverbot9001. (See some of my notes on the two projects here.) Annoyingly, they don’t mention Proverbot9001 even though they cite earlier papers and Proverbot9001 seems to be far and away the state of the art in machine learning applied to Coq. I’ll mention how this project compares to the others in a bit, but as usual, the lack of standard benchmarks makes such a comparison difficult.

view this post on Zulip Jason Rute (Apr 03 2020 at 02:55):

They make a big deal out of the fact that TacticToe for Coq is built using OCaml (the language Coq is implemented in). I agree. While Python is better for rapid prototyping and getting ML researchers involved, OCaml makes it more likely that this tool will eventually be available to the working proof engineer. So even if this project isn’t the state of the art, it may be the most likely tool that Coq engineers will actually use. However, I don’t think it is publicly available yet.

view this post on Zulip Jason Rute (Apr 03 2020 at 02:55):

They implement their own proof recording. I think it is completely separate to the proof recording used for the other two Coq projects. The techniques are similar to what I’m playing around with in Lean. They record all goal-state-tactic pairs and use those as a dataset to train a supervised algorithm.

view this post on Zulip Jason Rute (Apr 03 2020 at 02:55):

Unlike other Coq provers, this project doesn’t use neural networks. Instead they predict tactics by using k-nearest neighbors (k-NN) similar to the original TacticToe. If you are not familiar, the idea of k-NN that they take all goal-tactic pairs in the training set and encode the goals as vectors. Then for any new goal one finds the goal (in the training set) closest (in some metric) to the new goal. From that they can figure out which tactic to apply. Some more technical details about this part:

  • Since pure k-NN requires searching through all the training data every time, they use a faster approximate version called LSHF.
  • They use a variation of TF-IDF for formulas to encode the vectors. The distance metric is similar to cosine distance, but a bit different.
  • They don’t mention tactic argument selection (e.g. premise selection), but if it is similar to TacticToe for HOL4, then I think they find premises which are closest to the current goal in some metric (probably similar to the metric used for tactic selection).
  • They also don’t mention the list of tactics they use, but I’m sure it is pared down from the full list.

view this post on Zulip Jason Rute (Apr 03 2020 at 02:55):

They also use a proof search where the most likely tactics are explored more than less likely tactics. It seems similar to beam search or some tree searches which use a notion of “fuel”.

view this post on Zulip Jason Rute (Apr 03 2020 at 02:55):

Ok, now onto results! I’ve learned by now that pure percentages of solved theorems don’t mean a lot in isolation and it is difficult to compare results across theorem provers—and even within the same prover. However, TacticToe for Coq solves ~39% of the theorems in the Coq standard library. (I would have liked more discussion about test/train split, but I’ll give them the benefit of the doubt that they did this sufficiently.) On the other hand, CoqHammer solves about ~40-42% of theorems in the standard library (when you combine Vampire, E, and Z3).

view this post on Zulip Jason Rute (Apr 03 2020 at 02:56):

First, let’s compare this to TacticToe for HOL4 which solves 66.4% of the theorems in the HOL4 standard library (twice as much as the E prover on HOL4). I think this shows how similar ML strategies can have very different numbers for different theorem proving libraries. I don’t know if this means that HOL-based provers are more amenable to automation than dependent type theorem logics, or if the HOL4 tactics are more powerful, or if just the libraries are different.

view this post on Zulip Jason Rute (Apr 03 2020 at 02:56):

Second, let’s compare it to CoqGym/ASTactic and Proverbot9001. One problem is that all three projects use different test and training sets. TacticToe for Coq uses the Coq standard library. CoqGym uses the standard library and the packages listed on the Coq Package Index. Proverbot9001 trains on CompCert and test on different projects (they were also kind enough to retrain and retest CoqGym/ASTactic on their dataset). The saving grace of these three projects is that they all compare their results to CoqHammer. Assuming that the CoqHammer configurations are similar between the three projects (which honestly they might not be), we can hammer-normalize the solve rates. Specifically, in this paper, TacticToe for Coq solves 3240 theorems while CoqHammer solves 3534 theorems. Dividing the two numbers gives a hammer-normalized score of 0.92. Here are the hammer normalized scores for the three provers:

  • ASTactic: 0.69 in CoqGym paper, 0.62 in Proverbot9001 paper
  • Proverbot9001: 2.62
  • TacticToe for Coq: 0.91

view this post on Zulip Jason Rute (Apr 03 2020 at 02:56):

So it seems like Proverbot9001 is (far and away) better than the others, but a lot of caution is needed (especially since the test set for Proverbot9001 was much smaller). Really, this highlights the needs for uniform benchmarks, not only within a theorem prover, but across theorem provers as well. (I would love to see the TacticToe algorithm applied to HOL Light, Isabelle and Lean.)

view this post on Zulip Jason Rute (Apr 03 2020 at 02:56):

Finally, if the authors are reading this (e.g. @Josef Urban), I think I found a small error/typo. I don’t see where eq_z and eq_x come from in the feature set example on the bottom of page 3. I might not understand what a 2-shingle is, but if I do, x and z are not next to = in the syntax tree of the example formula.

view this post on Zulip Scott Morrison (Apr 04 2020 at 03:05):

Thanks for the great summary!

view this post on Zulip Rongmin Lu (Apr 06 2020 at 03:02):

(deleted)

view this post on Zulip Brando Miranda (May 09 2020 at 15:22):

@Jason Rute thanks for the summary. There is a detail that you didn't mention (as far as I could tell) that in the ASTactic/CoqGym paper they actually generate tactics by doing a tree expansion on the tactic grammar. This is not how Proverbot9001 does it (as I understand from your summary Proverbot9001 is mostly a tactic selection + premise selection paper).

Do people think this comment is an accurate representation of the main differences in proving methodology between Proverbot9001 & ASTactic/CoqGym?

view this post on Zulip Brando Miranda (May 09 2020 at 15:26):

As side comment relevant to the tactic generation via tree expansion, I believe ASTactic is the only paper I am aware of that generates tactics this way (instead of tactic selection + premise selection). Is this a correct (with high probability) statement? Can someone shed me some light on why generation of tactics via tree expansion is not a popular method? Do people know what are it's drawbacks?

view this post on Zulip Jason Rute (May 09 2020 at 21:32):

I have to admit that there is a lot I'm unfamiliar with (or forgot) in the various papers. (It's on my TODO list to look over these papers again and record more of the fine details.) In this case, what does it mean to "do a tree expansion on the tactic grammar"? Most of the current algorithms (DeepHOL, TacticToe, ProverBot, and this one) do some sort of tree search (MCTS, beam search, etc). Is doing a "tree expansion" a fundamentally different operation?

view this post on Zulip Jason Rute (May 09 2020 at 21:55):

Actually, I should be careful. There are two "tree"s in these algorithms. First is the search tree (trying tactics and backtracking if necessary). In this tree, each node is a goal stack (or something similar). The second tree is the proof. This gets a bit tricky. Let's for the moment assume tactics only apply to one goal at a time (which in Lean at least is not always true). Then sometimes when you apply a tactic to a goal it produces multiple new goals. The if this is the right set of goals, it doesn't matter which goal you prove first since you have to prove all of them. Some algorithms like DeepHOL make the simplifying assumption that you are always working on the first goal in the stack. Then your proof tree can be written as a flat tactic script since it is just a depth-first search through the proof tree you are building. Of course, since we don't know we are working on the right set of goals, and may need to backtrack, there could be a significant advantage of working on the goals out of order. I don't remember which algorithms allow this. I wonder (and I should really look more at the paper before I speculate too much) if by "tree expansion" they mean expanding the tactic tree by choosing which goal to work on instead of working on the first goal each time. If so, that could have some advantages, but I think it is still a matter of tactic selection and premise selection, right? Do they just use different terminology?

view this post on Zulip Jason Rute (May 09 2020 at 21:55):

Ok, I'm going to go read the paper again before wildly speculating more...

view this post on Zulip Jason Rute (May 09 2020 at 23:51):

I'm moving this conversation to Paper: Learning to Prove Theorems via Interacting with Proof so that we have a good place to discuss ASTactic and CoqGym.


Last updated: May 09 2021 at 22:13 UTC