Stream: Machine Learning for Theorem Proving

Topic: Papers on Neural Conjecturing

Jason Rute (Jun 09 2020 at 23:02):

Two papers recently came out on the topic of using neural networks to generate synthetic mathematical statements:

Both papers have a focus on computer generated mathematical conjectures generated by neural networks.

Jason Rute (Jun 09 2020 at 23:03):

I have have to read both papers more and I hope to report back here soon about each. While I haven't looked at the papers or their conjectures too closely, I want to emphasize that this language modeling approach they both use should not be judged alone on the quality of the conjectures generated. The ability to work with and synthesize mathematical formulas is just a first step in creating better machine learning assisted tools for interactive and automated theorem proving. Even if the conjectures are trivially true or false, it could be a good first step.

Jason Rute (Jun 11 2020 at 03:00):

Before going into either paper specifically, let me talk about how state of the art language modeling works in 2020. Consider these three natural language processing tasks:

• Translation: Given a sentence (or paragraph, document, etc) of one language, translate it to another language. This is a sequence-to-sequence task. That is we take a sequence (the source language sentence) and output another sequence (the target language sentence). We have also seen sequence-to-sequence tasks in mathematics.
• Classification: Given an English sentence, classify it according to its sentiment (positive, negative), or flag it for offensive content, or something similar. This is a classification task. One needs to classify a sequence into single category.
• Sentence completion: Given the first half of an English sentence suggest the next word or suggest the rest of the sentence. This is what smart phones, and now email programs do. This is a generative modeling task.

Jason Rute (Jun 11 2020 at 03:01):

There are a number of ways to accomplish the above goals, but in 2020, the state of the art is a Transformer architecture. It is a type of neural network. There are a number of variants, but these papers use two common variants:

• The Encoder-Decoder Transformer from the Attention Is All You Need paper is a sequence-to-sequence model. It has two parts. The Encoder takes in the input sequence and summarizes it into a vector. Then the Decoder takes that vector and uses it to generate an output sequence. For example, Deep Learning for Symbolic Mathematics uses this Transformer model to “translate” an integration problem into its antiderivative. (If you want to know more about this model, see The Illustrated Transformer as well as that author’s preceding post on Attention.)
• The GPT-2 model is a variant of the transformer that only contains the decoder. It is a generative model which given a partial sentence will suggest the next word in the sentence. You can use it in a number of ways. For example, if you start from nothing, you can use it to construct a sentence from scratch (unconditioned samples). If instead of picking the mostly likely next word each time, you pick one of the, say, top-10 mostly likely words at random, you have a random sample generator. One can also give GPT-2 a prompt, and it will complete the rest of the sentence. While it isn’t at human level, a pre-trained GPT-2 model can generate fairly realistic looking paragraphs of text. (If you want to learn more about GPT-2, see The Illustrated GPT-2.)

Jason Rute (Jun 11 2020 at 03:01):

Before getting into these papers more, let me mention a few important facts about transformer models.

• Transformer models should be trained on a large corpus of the language. Pre-trained versions of GPT-2 were trained using Wikipedia and other open-source English language corpuses.
• The good news about pre-training a language model like GPT-2 is that one just needs raw text. Unlike image processing, one doesn’t need human annotated labels. (This is less true for machine translation, where aligned text is still important.)
• For GPT-2, the pre-training is straight forward. The model is just trained to predict the next word in the text from the previous words. However, other transformer-based language models like BERT have a different architecture and they are trained with more complicated tasks, like trying to predict a missing word in a sentence or if two sentences are adjacent.
• The bad news about pre-training is that it can get expensive. A state-of-the-art language model can cost 100s of thousands of US dollars to train from scratch. This however can be mitigated by not going for state of the art (i.e. using a smaller model, training for fewer steps, and using your own hardware training at a slower rate).
• The silver lining is that once you have a pre-trained model, one can transfer it by fine-tuning it to another similar problem. For example, if you take a GPT-2 model trained on Wikipedia and further train it on arXiv, it will transfer its understanding of “standard” English to mathematical English. It will train much faster and be less expensive than training a model from scratch. It will also probably produce much better results.
• Also, while models like GPT-2 look like they are only good for sentence generation, one can use them for a number of natural language precessing tasks. Consider translation. If you train the models on aligned Spanish-English pairs such as ¿Cómo estás/ <$> How are you?, it will learn to translate. Then you just give it a prompt like Estoy bien. <$> and it should complete the text with I am well. (Here <\$> is just used as a token marking the end of the Spanish. It can, similarly, be used for classification tasks.

Jason Rute (Jun 11 2020 at 03:05):

Now let me summarize each paper quickly before going into the details.

• The first paper, First Neural Conjecturing Datasets and Experiments by @Josef Urban and Jan Jakubův, provides four datasets based on Mizar. Two of them are basically just the raw Mizar articles in different formats (and cleaned up a bit). Then one can use the pre-trained GPT-2 Model to construct synthetic Mizar “theorems” and/or “proofs”. The third and fourth are automated proofs of Mizar theorems using the E-prover. The important thing is that they contain both the theorem to prove and the premises used in the proof, so using a trained GPT-2 model, one can extract both synthetic “theorems” and/or “premises” useful for proving those theorems.
• The second paper, Language Modeling for Formal Mathematics by @Markus Rabe, Dennis Lee, Kshitij Bansal, and @Christian Szegedy, is very similar, but their aim is a bit different. They use the HOList dataset of HOL Light goals. (The dataset also contains tactics applied to those goals, but they throw this information away.) They train an Encoder-Decoder Transformer to guess missing sub-terms. For example, the Transformer Encoder would receive (x + (y + z) = <PREDICT> + z and the Decoder would be expected to produce y + z. (I’m simplifying the form of the input here, since they use s-expressions which include type infomation as well.) This pre-training task is similar to that used by language models like BERT, but with a key difference. They always mask out sub-expressions, instead of single tokens or arbitrary groups of tokens. Natural language models can’t (easily) do this. If a natural language model wanted to mask out an entire prepositional clause, it would have to find where the prepositional clause starts and ends which is not an easy task. But for mathematics (especially in s-expression notation) it is trivial. They show that a model trained just on this task can accomplish a number of interesting tasks like type-inference, filling in missing assumptions, equation completion, and free-form conjecturing.

Jason Rute (Jun 11 2020 at 03:05):

Ok, now to more details.

In the First Neural Conjecturing Datasets and Experiments paper, the first two datasets are just Mizar code in two formats. By training a GPT-2 models (for a few weeks) on this “language corpus” they are effectively making a generative model for Mizar code. This isn’t the first time that GPT-2 has been applied to computer code, but it is probably the first time it’s been applied to theorem proving. What can one do with such a Mizar model. Well, one can do a few things:

• Have it generate synthetic Mizar “theorems” from scratch.
• Have it generate a “theorem” and its “proof”.
• Give it a theorem as a prompt and have it fill in the proof.

Jason Rute (Jun 11 2020 at 03:06):

Now the question is “how good is this model”? They don’t really have much to say yet. They give some examples and have a “website” with more examples. (I don’t read Mizar well, so it is hard for me to follow.) But overall, they mention evaluation as future work. However, if you are interested, I think the model is available to play with.

Jason Rute (Jun 11 2020 at 03:06):

As for the second two datasets, the data is of the following form:

theorem statement
premise statement 1
premise statement 2
…
premise statement n


Dataset 3 is more verbose and the model was having trouble with all the parentheses and commas, so they stopped training on that and made Dataset 4, which is a cleaner version using Polish notation. The way one runs this model is to give it a prompt theorem statement and ask it to complete the pattern. By doing so it will guess premises useful for proving the theorem. These might be premises in the Mizar database, these might be new mathematical conjectures the agent has never seen before, or they might be gibberish (but I think they for the most part type check). One problem is that even if a premise is in the Mizar database, it might come after the theorem (or be the theorem itself). There is no way to currently enforce what is selected as a premise.

Jason Rute (Jun 11 2020 at 03:06):

Every one of the “premises” that the model predicts can be thought of as a conjecture. For the most part, they can automatically disprove many of the conjectures using an ATP system. For others, they can automatically prove them.

Jason Rute (Jun 11 2020 at 03:07):

I personally found the paper very difficult to follow, I think partly because of the page limit, and because I don’t understand Mizar syntax. However, I think this approach offers a lot of interesting potential even if it is not really that useful right now.

Jason Rute (Jun 11 2020 at 03:07):

In the second paper, Language Modeling for Formal Mathematics, they just train on one dataset. As I said, it is all the goal statements (including intermediate goal statements) that come up in HOL Light proofs. Then they mask subexpressions of these statements in two ways. First, they mask a single subexpression at random by replacing it with <PREDICT>. Also, to make the problem harder, they find another subexpression at random and mask all occurrences of it with <MASK>. Then the goal (since there are using a sequence-to-sequence transformer model) is to predict the subexpression masked with <PREDICT>. After the model is trained, then they use the trained model to solve certain types of problems:

• Type inference: By covering up a type subexpression with <PREDICT> , the model must guess what it is. There are two variations of this. First, the easy version just replaces some type with <PREDICT>, e.g. (f: <PREDICT>) (n: nat) (m: nat) = (0: nat) for example. In this case, we hope that the answer given is nat -> nat -> nat. The harder version, masks out all of the types. While this is not a uniquely solvable problem, the algorithm still gets it right over 90% of the time. (I imagine in many cases it can use context, like variable names, to guess at the types.) I get the impression that they are not trying to claim that neural networks are faster or better at type inference than classical algorithms, but instead they are trying to test their language model overall. (If it was Lean code, one could similarly train this same sort of model to do type class inference. For example one could feed a model trained on Lean expressions with has_add <MASK> <PREDICT> (n : nat) (0 : nat) = (n : nat) and the model would have to return nat.add.)
• Assumptions: Like the premise selection task from the other paper, they take goals or theorems of the form A -> B for expressions A and B and replace A with <PREDICT>. The goal is to predict the assumptions needed to prove a result. In many cases the predicted assumptions are what the human put. In other cases, they are other assumptions that would make the theorem true (sometimes trivially like putting false or repeating B.) They later also see if they can prove any of the assumptions with the DeepHOL theorem prover.
• Equality completion: Given an equality, they replace one half with <PREDICT> and have the model guess what was there. Again, this leads to interesting results. Sometimes the model guess what the human put. Other times, it guesses something else that is still true. Other times, it is false. Again, they try to prove some of these new results with the DeepHOL theorem prover.
• Last, they feed the model just <PREDICT> and have it go to town. The idea here is that the model will predict a theorem statement which serves as a free form conjecture. (Technically, they also have a way to label an expression as a theorem or an intermediate goal, so they actually feed the model <theorem> <PREDICT> where <theorem> says it is a theorem.) I don’t think they get any earth breaking conjectures here. Many are false, or repeats of something in the database. Some turn out to be true and moderately useful. Like many language models, their model tends not to stray far from the text (or theorems) in the training data. However, I think this approach opens up a lot of possibilities. Also the paper is well written and provides some nice examples.

Jason Rute (Jun 11 2020 at 03:07):

Here are some final thoughts of mine:

• I think these two papers show a lot of potential for language modeling ideas in theorem proving. While it would be awesome if an out of the box language model like an Encoder-Decoder Transformer or GPT-2 “just works”, I think more ideas will be needed to do useful stuff with these models.
• It is a glaring omission that Google didn’t attempt the obvious next step. Since they have both the goal statements and the tactic expression applied to that goal, they could do that prediction. Indeed the form of the input would be almost exactly the same. Indeed, I think if they just want to predict the tactic (and not its parameters), that should be a really easy transfer learning task. One can take a goal, run it through the Encoder to get a vector embedding, and then train a simple model like logistic regression to get the tactic. Selecting premises or other tactic parameters would require more thought though. (Actually, it looks like they plan to talk about this at AITP!)
• I wonder if for the Encoder step a graph neural network would do better. Counting parentheses is hard for humans and I think it would be hard for transformers too (based on my understanding of how they work). I could imagine a graph neural network (which is in some sense a generalization of a transformer) could be used to much better effect. It would be more efficient since it doesn’t require that every token connect to every other token. (Alternately, I could imagine that instead of the standard position embeddings used in transformers, one could use tree-aware position embeddings which take into account the position in the expression tree.)
• Similarly, I could imagine that for the Decoder, some autoregressive tree- or graph-like model could be more applicable. Of course, I don’t know that anything like that exists, so that would be an additional challenge to build it.
• Last, my guess of why Google is performing this line of research is in the hope of creating mathematical formula models that can be used for many tasks. As I already said, I think this Transformer Encoder could already be used to construct vector embeddings for mathematical expressions. That would be a huge gain. Pre-trained mathematical expression embeddings would be applicable to a number of theorem proving tasks, like tactic and lemma selection, tactic argument construction, finding a theorem in a database of theorems without knowing its name or exactly how it is written, finding a theorem in arXiv or in another ITP library from its HOL Light (or Lean) expression, aligning theorems across databases, etc. I know this has been the focus of many of Google’s recent works (see for example their other AITP abstract).
• For the Mizar generator project, I look forward to hearing more as well, especially if Mizar users try it out. I’d be very interested in their thoughts.

Jalex Stark (Jun 11 2020 at 03:10):

Jason Rute said:

For example, the Transformer Encoder would receive(x + (y + z) = <PREDICT> + z and the Decoder would be expected to produce y + z.

i feel silly asking this, but just to check my understanding, should this say the following?

For example, the Transformer Encoder would receive(x + (y + z)) = <PREDICT> + z and the Decoder would be expected to produce x + y.

:face_palm:

Jalex Stark (Jun 11 2020 at 03:13):

I think long expositions like this on zulip are easier to read if there are more breaks between messages

Jalex Stark (Jun 11 2020 at 03:13):

if for no other reason than "it's easier to refer back to a specific paragraph, e.g. via the quote and reply button"

Jason Rute (Jun 11 2020 at 03:13):

Breaks in time? Or breaks in the message blocks?

Jalex Stark (Jun 11 2020 at 03:14):

breaks in the message blocks

Jalex Stark (Jun 11 2020 at 03:14):

have roughly as many zulip messages as paragraphs

Jason Rute (Jun 11 2020 at 03:14):

I tried to break it up by paragraph, but I guess all the bullets got lumped together. I'll try to break it up more in the future.

Jalex Stark (Jun 11 2020 at 03:15):

(I could be totally wrong, anyway; these kinds of things are impossible to get exactly right)

Jalex Stark (Jun 11 2020 at 03:17):

you mention that extending to "given tactic state, predict the next tactic applied" benefits from transfer learning

Jalex Stark (Jun 11 2020 at 03:17):

if you had the weights from the model trained in the paper, would that be a useful place to start on the tactic prediction stuff?

Jalex Stark (Jun 11 2020 at 03:18):

it's farther from the end-goal than the example you gave earlier of using a "predict next word" model to do language translation

Jason Rute (Jun 11 2020 at 03:21):

Predict tactic could mean a few things, but if you just mean predict the tactic (and not the arguments), I imagine that if I had the vector embeddings for all the goals in HOLstep, I could train a fairly good logistic regression model in a few minutes. (Writing the code to parse and extract everything would be a days work, and I guess hyper parameter tuning might be needed too.)

Jason Rute (Jun 11 2020 at 03:24):

One thing I'm not sure about however, is if the fact that the model was trained with <PREDICT> in every encoded goal, will screw up the embeddings if I have an embedding without a <PREDICT>. However, think if that is the case, one could do fine-tuning on the end-to-end model to get a good result.

Jalex Stark (Jun 11 2020 at 03:25):

hmm i think i need to hack through a bunch of confusion before i can usefully ask followup questions

Jalex Stark (Jun 11 2020 at 03:25):

thanks for the writeup!

Jason Rute (Jun 11 2020 at 03:26):

Another concern for theorem proving is that I don't know how long transformers take to run. They might be too computationally expensive for something that needs to run in a tree search. I'd be curious what Google has to say about it at AITP. Unfortunately, I won't be there, and Google has a lot of restrictions on sharing slides. :(

Jason Rute (Jun 11 2020 at 03:27):

Actually, I think I take that back. The slides from last years AITP are up for Google's talk.

Jalex Stark (Jun 11 2020 at 03:28):

does it help if it's one of those "monte carlo tree searches"?

Jason Rute (Jun 11 2020 at 03:28):

(BTW, I don't know what to call Szegedy's lab at Google Research. Google? Google Research? Szegedy, et al.? N2Formal? The HOList group?)

Jalex Stark (Jun 11 2020 at 03:29):

where if the intuition of the model is good in the current domain, then it doesn't have to explore very many branches?

Jason Rute (Jun 11 2020 at 03:30):

If a model is really good, then it can be really slow since it doesn't have to branch much. That is why humans can solve theorems. However, I think we aren't there yet, so we need the right balance of accuracy (for lack of a better term) and speed.

Jalex Stark (Jun 11 2020 at 03:30):

if you're citing them in a paper it'll be by lists of coauthors, seems fine to fall back on that in informal conversations if nothing else is more natural

Jalex Stark (Jun 11 2020 at 03:31):

why are transformers hard to run? is it worse than the usual problems with deep neural nets?

Jalex Stark (Jun 11 2020 at 03:32):

do they do a lot of sequential computation or something?

Jason Rute (Jun 11 2020 at 03:33):

They are n-squared in a sense since they have to compare every word to every other word. I assume that makes them slower than other models, but I don't have numbers to back that up.

Jalex Stark (Jun 11 2020 at 03:35):

empirics are pretty useful for finite-scale runtime questions

Jason Rute (Jun 11 2020 at 03:38):

An interesting way to get around the slow model however is to use a learned world model. Basically, the network learns a compact representation of the state (in this case the goal with assumptions) and it also learns how an action (tactic) transitions its compact representation for state0 to the representation for state1. Then it doesn't have to work with raw observations (formulas in this case) as often or make as many calls to the simulation (the theorem prover in this case). This is the idea behind mu-zero, and dreamer, and a number of recent papers in reinforcement learning in the past year. I think it has a lot of potential for theorem proving.

Markus Rabe (Jun 11 2020 at 03:56):

Thanks for the great summary Jason! I fully agree that this is only a first step.

Markus Rabe (Jun 11 2020 at 04:05):

And yes, Transformer models are computationally heavy, especially if you want to run them on CPUs. There is a lot of great work recently on how to improve that. For example Reformer https://arxiv.org/abs/2001.04451, and perhaps this one: https://arxiv.org/abs/2006.04768. Our current implementation has a couple of performance issues that are unrelated to the Transformer architecture so I cannot give a concrete estimate right now.

Markus Rabe (Jun 11 2020 at 04:10):

And it is a great observation that the model might not produce great embeddings if no <PREDICT> token occurs in the formula.

Christian Szegedy (Jun 11 2020 at 06:35):

Jason Rute said:

Here are some final thoughts of mine:

• It is a glaring omission that Google didn’t attempt the obvious next step. Since they have both the goal statements and the tactic expression applied to that goal, they could do that prediction. Indeed the form of the input would be almost exactly the same. Indeed, I think if they just want to predict the tactic (and not its parameters), that should be a really easy transfer learning task. One can take a goal, run it through the Encoder to get a vector embedding, and then train a simple model like logistic regression to get the tactic. Selecting premises or other tactic parameters would require more thought though. (Actually, it looks like they plan to talk about this at AITP!)

Actually, we did that significantly earlier (we have used transformers since November), but need some more work before publishing. We will come out with a paper once the results are mature.
Also Dennis has prototypes that generates the whole tactic application as a sequence, with (pointers to) parameters, at once. This simplification of the infrastructure would also help with integrating with other proof assistants, such as Lean. There are still various scalability issues, but most of them are fixable.

Christian Szegedy (Jun 11 2020 at 06:38):

Jason Rute said:

They are n-squared in a sense since they have to compare every word to every other word. I assume that makes them slower than other models, but I don't have numbers to back that up.

Trivial evaluation code is even worse with transformer: O(n^3). You need special caching code to make it O(n^2).

Christian Szegedy (Jun 11 2020 at 18:21):

Jason Rute said:

(BTW, I don't know what to call Szegedy's lab at Google Research. Google? Google Research? Szegedy, et al.? N2Formal? The HOList group?)

I think N2Formal is the most accurate, however it is an inofficial Google internal code-name, but I don't think there is danger of confusion with anything, so it is fine in informal exchanges.

Jason Rute (Jun 18 2020 at 12:18):

One correction that I was recently thinking about is about parentheses. I said that "counting parentheses" would be difficult for a transformer. As I think about it, a transformer can theoretically count parentheses. (For those who know how a transformer attention head works, here is a hand crafted example. Create the query and keys so that q * k is 1 if the key is to the left of the query and 0 if to the right, using the position embeddings only. Then the value network returns 1 for ( and -1 for ) and 0 otherwise. The resulting weighted average will be the number of open left parentheses divided by the total number of symbols to the left. Then using the current position embedding it should be possible to recover some sort of depth at least for the main token. And this is just one way of many that the transformer could learn this information. They don't even need the parentheses since the arity of each of the type formers is known, so they can learn depth from Polish notation alone I believe by doing the same thing but +1 for any symbol with a positive arity and -1 for symbols with 0 arity.) Nonetheless, I still stand by what I said about adding the depth or tree position as an explicit given quantity (by possibly adjusting the position encoding to be a tree position encoding or using a graph neural network). In particular, 2/3rds or so of the tokens are parentheses and the paper mentions that the "S-expression syntax is thus very verbose, which can cause some expressions to not fit into the size constraints of our Transformer model." Actually, even without a depth/tree encoding, I could imagine that one can still drop parentheses since the Polish notation should be sufficient.

Christian Szegedy (Jun 26 2020 at 01:44):

Jason Rute said:

Actually, even without a depth/tree encoding, I could imagine that one can still drop parentheses since the Polish notation should be sufficient.

We had small technical issues with the polish notation (the fact that our s-expressions) have some poor choice of token names that makes them slightly ambiguous if we drop the parenthesis. However, it is certainly possibly. Also we think that adding several other position encodinds could help the transformer to work with the tree structure as several earlier papers suggest. We just wanted to try the simplest possible baseline first.

Brando Miranda (Jun 03 2021 at 19:14):

@Markus Rabe First I wanted to give you & your team kudos for the skip-tree paper. The results are quite interesting! I had a question about the evaluation part of your skip-tree trained transformer model for conjecturing useful terms/lemmas/thms. Apologies if this is an easy question but I found that section hard to digest without a table with nice percentages like the other ones you had in the paper. But at one point in the paper you say that most of the conjectures were alpha-equivalent to theorem statements already in the database. Would this not mean that model is not "really" conjecturing but reconstructing the original data set (perhaps in a different format)? Perhaps a good table would be to see how many times non-alpha equivalent conjectures were used as premises to prove unseen theorems e.g. validation set? Let me know your thoughts or if I misunderstood that section. Thanks for your time!

Markus Rabe (Jun 03 2021 at 22:05):

Hey Brando! I we once had a statement in the paper like "many of the conjectures are alpha-equivalent", but I think it aren't "most". Unfortunately we have no hard numbers on this (and it would be a bit of work to determine that).

What we have is that some of the statements that were used most in the "usefulness" test were actually new and there were no have alpha-equivalent statements in the theorem database. While this shows some potential for neural conjecturing, I do not think that this is a great way to generate conjectures. The whole approach certainly stays close to the training set, and I see this experiment mostly as a motivating example that generating new and useful conjectures is possible - even with such a naive approach. In the future we will need conjecturing approaches that help us to explore new theories outside of our training set, and such a conjecturing should happen in a more goal-directed manner.

Brando Miranda (Jun 22 2021 at 15:28):

Got it thanks for the response! So it seems they do stay close to the data set as expected and as you mentioned to create good conjectures that should be "the goal" of the system during training (or something like that). Thanks again!

Brando Miranda (Aug 09 2021 at 17:31):

@Markus Rabe Hi Markus! I was curious about the details of your optimizer in the paper "Mathematical Reasoning via Self-supervised Skip-tree Training". What type of optimizer did you use (+ scheduler) and what hyperparameter search did you do to find them?

Markus Rabe (Aug 20 2021 at 02:58):

@Brando Miranda : We built on top of the BERT codebase. You can find a snapshot of our codebase from around the time of that paper here: https://github.com/tensorflow/deepmath/blob/experimental/deepmath/deephol/train/translate.py