Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Paper: An Experimental Study of Formula Embeddings for Autom
Jason Rute (Feb 08 2020 at 23:13):
I'm starting a new thing. As I find new (or old) papers on AI/ML for TP, I'm going to start threads to discuss them (and I encourage others to do the same).
Jason Rute (Feb 08 2020 at 23:13):
Here is the first: An Experimental Study of Formula Embeddings for Automated Theorem Proving in First-Order Logic by Ibrahim Abdelaziz, Veronika Thost, Maxwell Crouse, Achille Fokoue (from IBM research and related organizations).
Jason Rute (Feb 08 2020 at 23:13):
I'm currently enjoying reading the recent papers by this group. I like their systematic approach, and this paper seems to be a continuation of that trend. They look at numerous approaches to creating vector embeddings of first-order formulas and show the trade-offs in terms of both completion rate and runtime.
Jason Rute (Feb 08 2020 at 23:19):
I just started reading it, but I'd love to hear the opinions of others, especially the Google folks and @Stanislas Polu. We've already had debates on this Zulip (I guess some of them private) about whether GNNs are worth it or whether sequence models (like LSTMs on the sequence of tokens) are better.
Jason Rute (Feb 08 2020 at 23:27):
Also, if anyone here knows the authors, feel free to encourage them to join the discussion...
Jason Rute (Feb 09 2020 at 00:46):
The game is RL for first-order theorem proving in their TRAIL system (not to be confused with the TRAIL patterns approach, which is also evaluated in this paper). They test on the Mizar and TPTP datasets. I think the idea is that each agent is given theorem statements (but not proofs) to train on with RL, and then it is evaluated on the test sets. (I don't know if the agents do any additional reinforcement learning on the test sets; since it is RL, this should be clarified.) I think the agents produce a policy over actions (but I'm not yet clear on the action space). Since the goal is proving theorems, the algorithms are graded on the total number of problems solved (completion rate) within a fixed time budget per problem (100 seconds per problem, I think).
Jason Rute (Feb 09 2020 at 00:48):
The winner according to them is their TRAIL formula-embedding algorithm, which was first introduced in this paper. That algorithm has other advantages, like being able to be trained on one theory database and still performing reasonably when tested on another (at least within the same logical system).
Jason Rute (Feb 09 2020 at 00:50):
However, it should be noted that all the systems tested did worse than the Beagle prover baseline.
Jason Rute (Feb 09 2020 at 01:13):
Actually, let me correct myself here. I no longer think they are saying TRAIL patterns are the clear winner. Instead, they highlight both the TRAIL patterns approach and the message passing neural network (MPNN) approach, which is a type of graph neural network.
Jason Rute (Feb 09 2020 at 18:31):
Ok, I carefully read the paper and understand it much better. They are evaluating a number of embedding methods for use in an RL-based, first-order-logic, saturation-based theorem prover. The embedding methods take a formula and turn it into a 64-dimensional vector. After that, an attention-based mechanism handles the action selection. Basically all the details of the RL and attention-based prover are in the original TRAIL paper. The only difference here is the change of the embedding function. They experiment with two categories of embeddings: (1) hand-designed feature embeddings and (2) graph neural network embeddings.
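To make the setup concrete, here is a rough sketch (in PyTorch) of how I picture the pipeline: a pluggable embedder maps each formula to a 64-dimensional vector, and an attention-style score ranks the candidate actions against the current state. All the names and details here are my own guesses, not the paper's actual code.

```python
import torch
import torch.nn as nn

class AttentionActionSelector(nn.Module):
    """Rough sketch only: a pluggable embedder maps formulas to 64-dim vectors,
    and a dot-product attention score ranks candidate actions against the
    current proof state. Names and details are my guesses, not the paper's."""

    def __init__(self, embedder, dim=64):
        super().__init__()
        self.embedder = embedder               # any of the embedding methods discussed
        self.query = nn.Linear(dim, dim)       # projects the state into a query vector

    def forward(self, state_formulas, candidate_actions):
        state = self.embedder(state_formulas).mean(dim=0)   # (64,) summary of processed clauses
        actions = self.embedder(candidate_actions)          # (num_actions, 64)
        logits = actions @ self.query(state)                # (num_actions,) attention scores
        return torch.softmax(logits, dim=0)                 # policy over candidate actions
```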
Jason Rute (Feb 09 2020 at 18:31):
Before I talk about the details, it should be mentioned that they were careful not to assume a fixed vocabulary. It should also be mentioned that the problem they are trying to solve involves a trade-off between speed of computation and quality of results. An embedding that is fast to compute means that in a tree search one can explore more nodes. An embedding of higher quality can significantly improve the selection of nodes, meaning the tree search goes down fewer dead ends before reaching the goal. This is the trade-off they are exploring. This is good, since a lot of the work on graph neural networks has been done on "offline" problems where one doesn't have to do a tree search and this speed concern doesn't matter.
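As a back-of-the-envelope illustration of that trade-off (the numbers here are completely made up, not from the paper):

```python
# Made-up numbers, just to show the shape of the speed/quality trade-off:
budget_s = 100.0        # per-problem time limit
prover_step_ms = 1.0    # assumed cost of one inference step excluding the embedding
for name, embed_ms in [("hand-designed features", 0.1), ("GNN embedding", 10.0)]:
    steps = budget_s * 1000 / (embed_ms + prover_step_ms)
    print(f"{name}: roughly {steps:,.0f} steps fit in the budget")
```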
Jason Rute (Feb 09 2020 at 18:32):
As for the "hand-designed" feature embeddings the key is that they are not learned, but instead given by an algorithm. However, it should be noted that these algorithms are sufficiently general that they could be applied to a number of problem domains and logics I believe. (Neither is a bag-of-words or TIDF approach, but that is a good example of the type of approach they are going for.) One hand-engineered approach is term-walks. This features in a lot of the work of @Josef Urban and co-authors. (However, one complaint I have is that most of these papers argument the term-tree walks by other features like the top-level symbol and the term length. I think if the authors of this paper added all these additional features to both of the hand-designed feature embeddings they might get better results.) The other hand-designed features are the TRAIL patterns from the original TRAIL paper. This approach is one of the best overall mostly because of its speed, but it does create a higher number of useless steps. (I don't quite understand why the term-walk approach, which is even faster and has few useless steps does worse. I am missing something?)
Jason Rute (Feb 09 2020 at 18:33):
As for the GNN embeddings, they try a number of graph neural net architectures. The winner seems to be the MPNN, which I think is an extension of the graph convolutional network with edge embeddings as well as node embeddings. I think, but am not certain, that the HOList GNN is more like a GCN without edge embeddings. I don't know about NeuroSAT off-hand. They also tried their GLSTM-MPNN from their other paper (which is a combination of a tree-LSTM and an MPNN). It does well, but not as well as the MPNN. The MPNN (as with the other GNNs) does take a lot longer to compute than the hand-designed features, which is a trade-off.
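For reference, a minimal message-passing layer in the spirit of what I described (messages conditioned on both the neighbouring node embedding and an edge embedding) might look like the following. This is purely my sketch; the paper's exact update rules may differ.

```python
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    """Minimal message-passing layer: messages depend on the neighbouring node
    embedding *and* an edge embedding (e.g. argument position), which a plain
    GCN-style layer would ignore. My sketch, not the paper's architecture."""

    def __init__(self, node_dim=64, edge_dim=8):
        super().__init__()
        self.msg = nn.Linear(node_dim + edge_dim, node_dim)
        self.update = nn.GRUCell(node_dim, node_dim)

    def forward(self, h, edge_index, edge_attr):
        # h: (num_nodes, node_dim), edge_index: (2, num_edges) src -> dst,
        # edge_attr: (num_edges, edge_dim)
        src, dst = edge_index
        messages = torch.relu(self.msg(torch.cat([h[src], edge_attr], dim=-1)))
        aggregated = torch.zeros_like(h).index_add_(0, dst, messages)  # sum incoming messages
        return self.update(aggregated, h)                              # GRU-style node update
```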
Jason Rute (Feb 09 2020 at 18:33):
> The most simple embedding approaches consider the formula as a sequence of characters and use one-hot encodings followed by standard sequence encodings like LSTMs [Alemi et al., 2016]. This encoding does not capture much logical information, not even syntactically.
I wish they had included approaches like the kind quoted above as a baseline. This is like what two of the HOList papers do with WaveNets. It is also the approach being advocated by @Stanislas Polu. Tree-recursive neural nets seem like another natural approach to try.
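Such a character-level baseline is almost trivial to write down, e.g. something like this (a minimal sketch, not anyone's published architecture):

```python
import torch
import torch.nn as nn

class CharLSTMEncoder(nn.Module):
    """Character-level sequence baseline: embed each character and take the final
    LSTM hidden state as the 64-dim formula embedding. A minimal sketch only."""

    def __init__(self, num_chars=128, dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_chars, dim)       # learned lookup instead of raw one-hot
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, formula: str):
        ids = torch.tensor([[min(ord(c), 127) for c in formula]])
        _, (h, _) = self.lstm(self.embed(ids))
        return h[-1]                                    # (1, 64) embedding of the formula

print(CharLSTMEncoder()("p(X) | ~q(f(X))").shape)       # torch.Size([1, 64])
```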
Jason Rute (Feb 09 2020 at 18:33):
In the end, the TRAIL patterns and the MPNN do fairly similarly on this problem. However, as the authors mention, this really depends on a number of factors. As I see it, here are some advantages of the "hand-crafted" features:
- For problems like saturation-style proving that involve a lot of steps, they are really fast.
- They generalize well to other domains since they are not fit to a particular vocabulary or theory. (I think this needs to be given serious concern, because when we deploy these systems in practice, they will encounter symbols and theories they have never seen before.)
- They aren't differentiable architectures, and hence the other parts of the system can be fast ML models, like gradient-boosted trees. (It would be interesting if the authors tried to replace their transformer architecture with gradient-boosted trees.)
- They probably would do even better on this problem set if given a few additional hand-engineered features (like those used in https://arxiv.org/pdf/1805.07563.pdf or https://arxiv.org/pdf/1903.03182.pdf).
- They would be easy to implement in an ITP (like Lean).
Jason Rute (Feb 09 2020 at 18:33):
Here are the advantages of the GNN features (and in general fully learned features).
- One doesn't have to consider the logic involved too closely and these can generalize to a number of domains (although they need to be retrained for each setting).
- For problems like tactic based proving where there is more time needed to perform an action, having better embeddings is probably much more important than the increase in speed in computing the embeddings.
- This is still a new field and there is probably a lot of improvement that could be made in designing and speeding-up these neural architectures.
Jason Rute (Feb 09 2020 at 18:34):
Some additional thoughts:
- One can also consider creating "tactics" for saturation solvers which handle the routine steps but let the agent guide the high-level decisions. For example, @Jesse Michael Han has a paper out where he guides a SAT solver at a much higher level, reducing the number of times one has to compute graph embeddings.
- I'd love to see an analysis like this done for HOList. I guess with the current high-level DeepHOL Python API, this wouldn't be hard to code. Joint collaboration of IBM Research and Google Research? :)
- The work of Google on reasoning in embedding space and the work of DeepMind on MuZero show that it might be possible to avoid computing embeddings during the tree search. I think this would be even more true in FOL, where the steps are more routine and the next states are more predictable (I think). In that case, one could do the tree search only in embedding space, which would significantly speed things up. Or one could do the tree search in neural embedding space plus hand-designed features like TRAIL patterns. This would maybe provide a best-of-both-worlds approach (at least in settings like saturation proving where executing each step is really quick).
Jason Rute (Feb 09 2020 at 18:55):
To add some clarity to my last comment: if one follows something like MuZero, it would still use a neural embedding for the root of the tree search (and maybe for select nodes), but overall one can avoid computing an expensive embedding at each step of the search. For tactic proving, one could even avoid executing the tactics for some of the tree search. This would make the GNN embeddings more cost-efficient.
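Concretely, I imagine something like a learned transition model in embedding space: embed the root once with the expensive encoder, then roll the search forward with a cheap dynamics network. Everything below is hypothetical and not from the TRAIL paper (the helper `expensive_gnn_embedding` is made up).

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """MuZero-flavoured sketch: embed the proof state once with the expensive
    encoder, then roll the tree search forward with a cheap learned transition
    model in embedding space. Hypothetical; not from the TRAIL paper."""

    def __init__(self, dim=64, num_actions=32):
        super().__init__()
        self.action_embed = nn.Embedding(num_actions, dim)
        self.transition = nn.GRUCell(dim, dim)   # predicts the next latent state
        self.value = nn.Linear(dim, 1)           # value estimate to guide the search

    def step(self, latent_state, action_id):
        action = self.action_embed(torch.tensor([action_id]))
        next_state = self.transition(action, latent_state)
        return next_state, self.value(next_state)

# Usage (hypothetical): root = expensive_gnn_embedding(state), then explore by
# repeatedly calling dynamics.step(...) without re-running the GNN at each node.
```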