Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Announcing Graph2Tac, a prover based on Tactician's new API
Jason Rute (Jan 12 2024 at 15:24):
We are pleased to announce Tactician's API, a new AI interface for theorem proving, building on Tactician. This includes a new graph-based dataset of over 520k definitions (of which 260k are theorems) in 120 Coq packages, one of the largest datasets for AI interactive theorem proving. We also present a new state-of-the-art neural theorem proving solver Graph2Tac, designed for proving theorems in Coq projects with new definitions not seen during training.
See the announcement on Coq Discourse.
We would love to hear your feedback!
Jason Rute (Jan 12 2024 at 15:34):
Min-Hsien Weng said:
Kaiyu Yang said:
Aesop automatically applies some built-in tactics, e.g., simp_all, and it also provides a set of "rule builders" that allow users to configure the set of tactics for expanding a node during best-first search. Our extension basically allows users to provide not only fixed tactics but also "tactic generators" of type MVarId → MetaM (Array (String × Float)). The tactic generator may or may not be based on LLMs.
I came across the TensorFlow GNN (Graph Neural Networks) library (https://github.com/tensorflow/gnn). It helps to work with graph data structures in neural networks, such as using GNNs to model abstract syntax trees (ASTs) in compilers.
It seems that the output of tactic generators could be graph data structures combined with rankings. GNNs may potentially make it easier to implement "tactic generation/Aesop as a neural network problem."
It's funny you mentioned this. :smile: This is exactly the approach (using GNNs to model ASTs) we took in Graph2Tac (which was already in review at that point), even using TensorFlow GNN.
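For readers unfamiliar with TensorFlow GNN, here is a minimal sketch of how a small AST-like term graph can be represented with that library. This is illustrative only, not Graph2Tac's actual graph schema; the node-set and edge-set names and the feature ids are made up.
```python
# Minimal sketch: encode a toy AST-like term graph with TensorFlow GNN.
# Illustrative only; not Graph2Tac's schema. Names and ids are made up.
import tensorflow as tf
import tensorflow_gnn as tfgnn

# Toy graph for the term "f x": nodes 0 (application), 1 (f), 2 (x),
# with edges application -> function and application -> argument.
graph = tfgnn.GraphTensor.from_pieces(
    node_sets={
        "ast_nodes": tfgnn.NodeSet.from_fields(
            sizes=tf.constant([3]),
            features={"node_kind": tf.constant([0, 1, 2])},  # e.g. app/const/var
        )
    },
    edge_sets={
        "ast_edges": tfgnn.EdgeSet.from_fields(
            sizes=tf.constant([2]),
            adjacency=tfgnn.Adjacency.from_indices(
                source=("ast_nodes", tf.constant([0, 0])),
                target=("ast_nodes", tf.constant([1, 2])),
            ),
        )
    },
)
print(graph.node_sets["ast_nodes"].features["node_kind"])
```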
Alex Sanchez-Stern (Jan 12 2024 at 17:58):
Really cool work! How does the API interact with Coq on the backend? Is it using coq-serapi, coq-lsp, or some other interface?
Jason Rute (Jan 12 2024 at 18:02):
I’ll let @Lasse give more details, but it is neither. It is a custom interface built on top of Tactician.
Lasse Blaauwbroek (Jan 12 2024 at 18:23):
@Alex Sanchez-Stern: The interaction happens through an OCaml plugin for Coq (namely, the Tactician plugin). The big difference between this and using SerAPI or an LSP is that it inverts the control. If you use SerAPI, the external prover (for example, a Python program) acts as a client that sends queries to Coq (the server). When you use a plugin, Coq becomes the initiator, while the external prover becomes a server. (Note, however, that in our interaction protocols, 'client' and 'server' are somewhat fluid: once Coq asks an external proving server to synthesize a proof, the roles may be reversed such that the agent can instruct Coq to run a tactic on an arbitrary proof state and receive feedback.)
An advantage of such a model is that a Coq user can interact with the external prover through regular Coq commands and tactics (such as Tactician's synth tactic).
The OCaml plugin extracts data directly from Coq's kernel and converts it either to text (with Coq's built-in printer) or to our new graph representation. The actual communication protocol we use is based on Cap'n Proto, which provides a rich remote procedure call (RPC) protocol as well as the data format for our dataset.
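To make the "inverted control" idea concrete, here is a toy sketch of an external prover running as a server that the proof assistant side would connect to. It uses plain JSON lines over TCP purely for illustration; it is not the actual Cap'n Proto protocol or Tactician's message format, and the payload fields are made up.
```python
# Toy stand-in for an "inverted control" setup: the external prover is a server,
# and the proof assistant initiates requests. Not the real Cap'n Proto protocol.
import asyncio
import json

async def handle_client(reader, writer):
    # Each request is a JSON line describing a proof state; we answer with
    # ranked tactic suggestions. The payload format here is made up.
    while line := await reader.readline():
        request = json.loads(line)
        suggestions = [{"tactic": "auto", "score": 0.5}]  # dummy model output
        reply = {"id": request.get("id"), "suggestions": suggestions}
        writer.write((json.dumps(reply) + "\n").encode())
        await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_client, "127.0.0.1", 33333)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```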
Alex Sanchez-Stern (Jan 12 2024 at 18:26):
Oooh yeah, that makes sense. Cool!
Jason Rute (Jan 12 2024 at 19:22):
The user-facing interface with the synth tactic and Suggest commands is similar to Lean Copilot (where of course Tactician has had this interface for a while). As for the internal interface between Coq and the Python client, I was hinting at our paper when I said the following. I don't remember the context offhand, but we were discussing Lean Copilot and how it could handle storing definition embeddings. Note, our interface (unlike Lean Copilot, I think) is able to process new definitions in real time.
Jason Rute said:
I know of another (unpublished, but soon to be released) project that works like that. The ITP client sends a whole bunch of goal and environment data to the server to process. The server keeps a list of vector embeddings for each definition and also handles predicting new tactics. The client handles the tree search (which I think it would in this case too since aesop handles the search, but it could go the other way as well where the server does the search as in Lean gym).
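A rough sketch of the server-side idea in that quote, illustrative only, with made-up definition names and random vectors standing in for real learned embeddings: keep one embedding per definition and rank definitions by similarity to a proof-state embedding.
```python
# Sketch of a definition-embedding store on the prediction server.
# Illustrative only; names and vectors are made up.
import numpy as np

definition_names = ["Nat.add_comm", "Nat.mul_comm", "List.app_nil_r"]
definition_embeddings = np.random.rand(3, 64).astype(np.float32)

def nearest_definitions(state_embedding: np.ndarray, k: int = 2) -> list[str]:
    """Rank stored definitions by cosine similarity to a proof-state embedding."""
    defs = definition_embeddings / np.linalg.norm(definition_embeddings, axis=1, keepdims=True)
    query = state_embedding / np.linalg.norm(state_embedding)
    scores = defs @ query
    return [definition_names[i] for i in np.argsort(-scores)[:k]]

print(nearest_definitions(np.random.rand(64).astype(np.float32)))
```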
Jason Rute (Jan 17 2024 at 16:00):
Kaiyu Yang said:
Hi Jason Rute, congrats on the great work! I really like the comparisons in Graph2Tac between Transformers and simpler models such as KNN. I'm wondering why you chose to train the Transformer model from scratch on Coq data, instead of fine-tuning an existing language model such as ByT5 or Mistral 7B. The latter experiment seems more interesting to me, as LLMs are less about Transformers than about pretraining the model on Internet-scale data. E.g., RWKV or Mamba are probably more like LLMs compared to Transformers without pretraining.
I'm moving this conversation to this thread. Our primary reason for training a Transformer from scratch is because our test theorems have been on the internet for a while, so it is very possible they ended up in the pretraining data, creating a possibility of data leakage. (This is still a very poorly understood topic, in my opinion.) Another, lesser concern is that it wouldn't have been as much of an apples-to-apples comparison with the graph model, but given how much better the graph model does than the transformer, it is a fair question how much pretraining would help (especially if we could be sure to avoid leakage).
One nice thing about our setup is that it wouldn't be that much work to swap our language model with another as long as they have a proofstate -> tactic interface. I hope this makes it possible to test other approaches, like for example the recent Llemma experiments, in Coq.
Kaiyu Yang (Jan 17 2024 at 16:16):
Our primary reason for training a Transformer from scratch is because our test theorems have been on the internet for a while
That's a good point. I think it doesn't hurt to include pretrained LLMs in the comparison. You could point out the possibility of data contamination, and it's up to the reader how to interpret the results. I'm less sure about the point on "apples-to-apples comparison", as whether a class of models can benefit from large-scale pretraining is an important factor.
Lasse Blaauwbroek (Jan 17 2024 at 16:50):
@Kaiyu Yang I believe that an important task of a research paper is to interpret results. History has shown time and time again that research can very easily be misinterpreted by fellow researchers, the press, and the general public alike.
As such, I believe that just 'letting it be up to the reader to interpret the results' is not good enough. I am fairly confident that showing results of pre-trained models together with results of non-pre-trained models would result in widespread misinterpretation of that data. Just adding a note about the possibility of contamination is not enough, in my opinion.
Kaiyu Yang (Jan 17 2024 at 16:58):
showing results of pre-trained models together with results of non-pre-trained models would result in wide-spread misinterpretation of that data.
If some models' advantage is to benefit from large-scale pretraining, it is not fair to compare them with other models only in the setting without pretraining.
Jason Rute (Jan 17 2024 at 17:02):
I'm having trouble parsing that sentence @Kaiyu Yang?
Lasse Blaauwbroek (Jan 17 2024 at 17:08):
The big question with pre-training is whether its advantage comes from (1) an inherent advantage of having large and diverse datasets or (2) because similar data is already present in the pre-training set.
If the advantage comes mainly from (2), then the comparison is not fair. In such a situation, the task of the LLM would be more to translate the knowledge it already has into the appropriate format, rather than inventing this knowledge from scratch. Such a translation task is obviously also interesting, but it is a very different task, and this is not commonly realized. As such, I consider mixing these two settings without a proper analysis of the results (which is very difficult and possibly impossible to do) to be inappropriate.
Kaiyu Yang (Jan 17 2024 at 17:12):
Jason Rute said:
I'm having trouble parsing that sentence Kaiyu Yang?
This is what I meant: For example, if Transformers can benefit significantly from pretraining whereas RNNs can benefit very little, we shouldn't only compare RNNs with Transformers (w/o pretraining).
Jason Rute (Jan 17 2024 at 17:14):
The big question with pre-training is whether its advantage comes from (1) an inherent advantage of having large and diverse datasets or (2) because similar data is already present in the pre-training set.
Without getting into the debate about whether it is (1) or (2), I think this is incredibly important to figure out. And it may differ between approaches: I could see an approach that uses GPT-4 directly as having (2) be a huge factor, and an approach that continues training from a pretrained model as having (1) be the main factor.
Jason Rute (Jan 17 2024 at 17:14):
I share both @Lasse's skepticism and @Kaiyu Yang's fear about throwing the baby out with the bathwater. How do we as a community solve this?
Jason Rute (Jan 17 2024 at 17:16):
Also, one thing we didn't explore is synthetic pretraining, which is a hot topic.
Min-Hsien Weng (Jan 23 2024 at 23:49):
Kaiyu Yang said:
This is what I meant: For example, if Transformers can benefit significantly from pretraining whereas RNNs can benefit very little, we shouldn't only compare RNNs with Transformers (w/o pretraining).
Here is my personal experience. Pretrained LLMs like Mistral-7B need far more data (500,000+ texts) than supervised models (around 20,000) to reach better accuracy. But LLMs are more robust: fine-tuned Mistral-7B kept better accuracy on unseen data (0.86), while the supervised model (SGDClassifier) overfitted, with its accuracy dropping (0.92 -> 0.76).
I agree with @Kaiyu Yang. Evaluating the performance of different models may need a more systematic approach, given the numerous factors involved, including training time, data, and LLM-specific factors like longer inference times. This would help us choose a suitable model for different case studies.
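For reference, a generic sketch of the kind of supervised baseline mentioned above (an SGDClassifier on TF-IDF features with a held-out split); the data here is made up and this is not the original experiment.
```python
# Generic supervised-baseline sketch: SGDClassifier on TF-IDF features,
# comparing train vs. unseen-data accuracy. Toy data; not the experiment above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["apply auto", "use induction on n", "rewrite add_comm", "apply ring",
         "case analysis on b", "apply simp"]
labels = [0, 1, 0, 0, 1, 0]  # hypothetical labels

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=1 / 3, random_state=0, stratify=labels)
model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
model.fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("unseen accuracy:", accuracy_score(y_test, model.predict(X_test)))
```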
Lasse Blaauwbroek (Jan 24 2024 at 00:27):
Let me try to clarify my position: I'm certainly not looking to block anyone from publishing research based on LLMs in this space. But it seems obvious that we need some rules of engagement in this space. I don't know what these rules should be. But I can at least come up with something that I would consider the 'minimum acceptable':
- Pre-training sets should be publicly available. This at least affords any researcher the opportunity to go and search for contaminations.
- I would expect a good-faith effort on the part of authors to find and remove data contaminations from their dataset. This would at least include removing (a) exact copies of the test set, and derived data such as (b) auto-generated documentation, (c) LaTeX documents such as papers written to accompany the code, and (d) forks of the original project. If one is operating under the hypothesis that pre-trained LLMs get their advantage from a large and diverse dataset, it should be possible to cast a pretty wide net and exclude any data that might possibly be considered 'suspicious'.
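As a crude illustration of such a good-faith check, here is a sketch that flags pre-training documents sharing long character n-grams with held-out test theorems. The n-gram length and the example strings are arbitrary choices for illustration, not a recommendation from the thread.
```python
# Crude contamination check: flag pre-training documents that share a long
# character n-gram with any held-out test document. Illustrative only.
def ngrams(text: str, n: int) -> set[str]:
    """Character n-grams of a whitespace-normalized document."""
    s = " ".join(text.split())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def looks_contaminated(doc: str, test_docs: list[str], n: int = 40) -> bool:
    """True if the document shares any length-n character n-gram with the test set."""
    doc_grams = ngrams(doc, n)
    return any(doc_grams & ngrams(t, n) for t in test_docs)

test_set = ["Lemma add_comm : forall n m : nat, n + m = m + n."]
pretrain_corpus = [
    "Lemma add_comm : forall n m : nat, n + m = m + n. Proof. lia. Qed.",  # near copy
    "Completely unrelated prose about category theory.",
]
clean = [d for d in pretrain_corpus if not looks_contaminated(d, test_set, n=40)]
print(clean)  # only the unrelated document survives
```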
Lasse Blaauwbroek (Jan 24 2024 at 00:33):
How do we as a community solve this?
Perhaps a fun way to explore this is to organize some kind of game/competition. In such a competition we would have a fixed test set of mathematical libraries on which a machine learning model is evaluated. Anyone can then submit a pre-training dataset in which they have cleverly hidden a copy of the test set, such that a machine learning model can exploit this copy to gain a large advantage. That is, the performance of a model should increase significantly when training on the tampered-with pre-training set vs. a non-tampered one.
Then, the challenge for other researchers is to find where and how the copy has been hidden. Whenever someone finds a copy of the test set, someone buys them a beer :-)
Min-Hsien Weng (Jan 24 2024 at 01:14):
Lasse Blaauwbroek said:
How do we as a community solve this?
Perhaps a fun way to explore this is to organize some kind of game/competition.
I highly recommend Kaggle. That platform has hosted a large number of competitions and has a lot of experience dealing with data leakage and contamination. Most importantly, there are many machine learning grandmasters on the platform, and they often come up with innovative and mind-blowing solutions!
namibj (Feb 10 2024 at 19:16):
Lasse Blaauwbroek said:
The OCaml plugin extracts data directly from Coq's kernel and converts it either to text (with Coq's built-in printer) or to our new graph representation. The actual communication protocol we use is based on Cap'n Proto, which provides a rich remote procedure call (RPC) protocol as well as the data format for our dataset.
I found the description of the graph structure in the ArXiv paper lacking; can you provide something better, by chance?
I'd love to get a good enough understanding to plan how to replace your GNN with a Chromatic Self-Attention-based transformer.
And also to explore diffusion models for the possibility of synthesizing arbitrary tactic arguments, as this allows emergence of (meta) tactics once RL training gets involved.
Jason Rute (Feb 10 2024 at 19:21):
Did you see this paper? It describes the dataset and graphs in detail: https://arxiv.org/abs/2401.02950
Jason Rute (Feb 10 2024 at 19:23):
Also you can explore the graphs here.
Jason Rute (Feb 10 2024 at 22:00):
As for replacing the GNN with graph-based Transformers, that is certainly an option. Transformers are typically slower, so that is a trade off. The advantage of text Transformers is they can synthesize full tactics with arguments.
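To make "graph-based Transformer" concrete, here is a toy sketch of one self-attention step masked to the graph's edges. This is a heavily simplified illustration, not Graph2Tac's architecture or any particular published graph transformer, and random projections stand in for learned weights.
```python
# Toy sketch of one "graph transformer" step: standard self-attention over node
# features, masked so each node only attends to its neighbours (plus itself).
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 4, 8
x = rng.normal(size=(num_nodes, dim))        # node features
adj = np.array([[1, 1, 0, 0],                # adjacency with self-loops
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=bool)

w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

scores = (q @ k.T) / np.sqrt(dim)
scores[~adj] = -np.inf                       # forbid attention across non-edges
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
updated = weights @ v                        # new node features
print(updated.shape)                         # (4, 8)
```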
Jason Rute (Feb 10 2024 at 22:02):
Also @namibj if you want to actually use our interface with your own model, we could show you a bit more around the code. I think it might be a bit hard to replace our GNN with your own without some support.
namibj (Feb 10 2024 at 22:07):
Jason Rute said:
Also you can explore the graphs here.
Thanks, I'll have a look.
Seems far better than the very cut-down example graphs in the paper (you may have updated it by now; I read it ~2 weeks ago, and how lambdas work/exist, let alone more, wasn't clear to me).
namibj (Feb 10 2024 at 22:17):
Jason Rute said:
As for replacing the GNN with graph-based Transformers, that is certainly an option. Transformers are typically slower, so that is a trade off.
They don't suffer as much from vanishing gradient during training, though.
The advantage of text Transformers is they can synthesize full tactics with arguments.
I'm not looking at text transformers beyond using papers on them for prior art, as their output is just as discrete as the AST-based graphs used here.
I'm hoping to rig one of the graph-generating diffusion model approaches to this, though, to more naturally generate tactic arguments. Possibly even hypotheses relevant to some theorem, as a reverse operation to the usual RAG; but that's something to try after it works competitively with Graph2Tac.
namibj (Feb 10 2024 at 22:23):
Jason Rute said:
Also namibj if you want to actually use our interface with your own model, we could show you a bit more around the code. I think it might be a bit hard to replace our GNN with your own without some support.
Thanks for the offer; we'll see about that after I've taken care of LeanCopilot RAG and analyzed the graph structure to see how it could fit appropriately with chromatic self-attention.
Last updated: May 02 2025 at 03:31 UTC