Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Paper: HyperTree Proof Search for Neural Theorem Proving

Jason Rute (May 25 2022 at 00:48):

There is a new neural theorem prover paper by Meta AI (along with Gabriel Ebner and Amaury Hayat).

Title: HyperTree Proof Search for Neural Theorem Proving
Authors: Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, Timothée Lacroix
Twitter: https://twitter.com/GuillaumeLample/status/1529107248013795333?cxt=HHwWioCz1Zi8vbgqAAAA
arXiv: https://arxiv.org/abs/2205.11491

The paper has state of the art results for the Lean MiniF2F benchmark (which you may recall from the recent OpenAI work) and on Metamath. I notice that a number of the authors are in this Zulip. Would they be interested in telling us more about the paper? :smile:

Excited to release our latest work: https://arxiv.org/abs/2205.11491 We present a new algorithm, HyperTree Proof Search (HTPS) inspired by the recent success of AlphaZero. Our model is able to prove mathematical theorems in a fully automated way and significantly outperforms the SOTA. 1/n https://twitter.com/GuillaumeLample/status/1529107248013795333/photo/1

- Guillaume Lample (@GuillaumeLample)

Jason Rute (May 25 2022 at 15:05):

I actually have a comparison question that maybe one of the authors or @Stanislas Polu could address. In this paper it seems that they are directly comparing the GPT-f Metamath results to this paper's Metamath results, but it is not clear to me that the numbers are comparable.

The comparison I'm questioning is this from the abstract:

In particular, we show that with HTPS alone, a model trained on annotated proofs manages to prove 65.4% of a held-out set of Metamath theorems, significantly outperforming the previous state of the art of 56.5% by GPT-f.

Here is what I gather from the two papers:

In Generative Language Modeling for Automated Theorem Proving (with the GPT-f model) they say: "We split that dataset between a train set and two valid and test sets each containing ∼1k proofs sampled randomly (∼90k proof steps each)." The valid set seems to be the one used for expert iteration with a final pass rate of 56.5% after Expert Iteration. The test set which is not used for expert iteration and only looked at the end has a pass rate of 56.22%.
In this paper (with the Evariste model), they say: "We first derive a graph of dependencies between statements, and generate a random train-valid-test split of theorems, with 1000 valid and test theorems. We use the DAG to ensure that each theorem in the valid or test set is not used to prove another theorem." Again, the validation theorems are used for a type of iterative training (which they call "online training"). However, the number 65.4% posted above is for supervised learning only (on the validation set), with no online training. If online training is added, I think it becomes an 82.6% pass rate on the validation set.

So I'm observing two things:

The Evariste model using their proof search, even without any form of iterative training, outperforms GPT-f which does expert iteration. When a form of iteration (online learning) is added, it does significantly better.
However, the validation sets used for measuring success are very different. The validation set for GPT-f is a random sample of theorems. The validation set for Evariste if I understand correctly consists entirely of "leaf theorems" which are not used in the proof of any other theorem.
- On one hand, these "leaf theorems" may consist of newer theorems which have haven't had a chance to be used later and therefore are harder to prove. Also, by using leaf theorems one is avoiding certain types of data leakage which could come from training on a theorem which comes after a test/validation theorem.
- On the other hand, these "leaf theorems" can't be lemmas (since lemmas are used to prove another theorem), and since Metamath has no automation, I assume Metamath users break up theorems into lots of lemmas. This means (if I understand correctly), that the validation and test sets consist of lots of easier proofs which just combine all the lemmas together to get the final result. If so, it could be that this validation set is much easier than the one used by GPT-f and that could account for the much higher pass rate.

Stanislas Polu (May 25 2022 at 15:09):

That seems like a fair point. But I think the Lean / miniF2F numbers are what matters really here :P I don't think they have any Metamath miniF2F numbers shared in the paper do they? That would be a good way to compare the two approaches on Metamath I presume.

Jason Rute (May 25 2022 at 15:12):

Another small typo is that there are a number of references to "Table 6.1" which doesn't exist.

Guillaume (May 25 2022 at 15:43):

Ah yes, the Table 6.1 is in fact the Table 1 in Section 6, not sure what happened here.. Thank you for pointing this out!

Regarding the split, we experimented two different ways:
1) make the train-valid-test split fully random
2) make it with exclusively with leaves from the DAG in the valid and test

We did not mention it in the paper, but we also had an extra constraint to guarantee (in both cases) that theorems in the valid/test do not have a theorem in the training set with the same statement. This can happen in Metamath as you have many theorems called xxxx and xxxxOLD with the same statement but a different proof.

In early experiments, we did not see any difference of performance between the two splits, so we stuck the second one, as it simplifies things a bit at evaluation time (e.g. we never need to worry about specifying forbidden labels with the DAG). However, we did not re-evaluate this at the end as the focus was indeed to work with Lean.

Timothee Lacroix (May 25 2022 at 15:58):

In this paper, we feel like we pushed the numbers reasonably far, but the gap to truly "insightful" imo proofs in lean remains quite large despite our efforts ! Something amazing needs to happen to allow exploration as efficient as the one found in two player games. We also strongly believed in OpenAI's "manual" curriculum approach, but it seems statements only isn't enough...

Stanislas Polu (May 25 2022 at 15:59):

Agreed <3

Lasse (May 29 2022 at 07:11):

I'm having some trouble parsing some of the details in appendix A1:

Conversely, let U ⊂ G be the set of invalid nodes. A node g ∈ G \ U is invalid if it has been
expanded but has no tactics in the hypergraph, or all of its tactics have an invalid child. Formally:
{(g, t, c) ∈ T } = ∅ or ∀(g, t, c) ∈ T , c ∩ I != ∅

What here is U? It seems that this paragraph mandates that U is always the empty set. Is this a typo? Additionally, what is I? It does not appear to be defined anywhere.

Fabian Glöckle (May 29 2022 at 19:56):

I am not an author, but yes, I assume there are some typos there.
I think U and I are supposed to be the same and refer to the set of invalid nodes, and invalidity is defined for nodes g \in G.
Note that the first formula of "Formally:" means that the intersection of T with the preimage of {g} under the first projection is empty, and the second one makes the definition recursive (but well-foundedly).

Fabian Glöckle (May 29 2022 at 19:59):

(Say, the minimal set with these two properties)

Jason Rute (May 29 2022 at 23:16):

Since we are mentioning typos:

Subgoals sharing a metavariable cannot be solved in isolation. This is addressed in Polu and Sutskever [16] by using as input the entire tactic state.

I think the reference you wanted to mention is the PACT paper since you are referring to Lean and tactic states. (Disclosure, I’m an author on the PACT paper.)

Jason Rute (May 29 2022 at 23:59):

Also, thanks to the authors for the response to my earlier questions.

Lasse (May 30 2022 at 08:52):

Ah, that is probably indeed how it needs to be interpreted. Thanks!

Ayush Agrawal (Feb 14 2023 at 07:09):

Hi @Timothee Lacroix , In HTPS paper, I didn't quite understand the line:
"In order to batch calls to the policy and critic models over more nodes to expand, we run several selections sequentially, using a virtual loss [Chaslot et al, Alphazero] to produce different partial proof-trees. "
Since you are using virtual loss, why are the selections done sequentially? As I understand, virtual loss is used for tree level multi-threaded parallelization [Chaslot et al. https://project.dke.maastrichtuniversity.nl/games/files/articles/multithreadedMCTS2.pdf], are you using that?

Timothee Lacroix (Feb 14 2023 at 14:38):

Ayush Agrawal said:

Hi Timothee Lacroix , In HTPS paper, I didn't quite understand the line:
"In order to batch calls to the policy and critic models over more nodes to expand, we run several selections sequentially, using a virtual loss [Chaslot et al, Alphazero] to produce different partial proof-trees. "
Since you are using virtual loss, why are the selections done sequentially? As I understand, virtual loss is used for tree level multi-threaded parallelization [Chaslot et al. https://project.dke.maastrichtuniversity.nl/games/files/articles/multithreadedMCTS2.pdf], are you using that?

Hey, so in our set-up, the language model was slow enough that doing selection sequentially was not too much of a bottleneck.
We still need virtual loss because we do several selections before getting any feedback from the model / proving environment.
Let me know if that makes sense :)

venom Yu (Mar 03 2023 at 09:57):

Hi, after reading this paper, I have two questions:

How is the HTPS's efficiency? For example, how many nodes will be expanded every time a proof search is completed, and how long will it take?
In section 7.1.1, I'm confused that the Evariste is trained on minif2f-curriculum, and then evaluated on minif2f-curriculum again. And for a fair comparison, I think GPT-f should also be online-trained in the same setting as Evariste, but I notice that the GPT-f only achieved 36.6% on miniF2F-test as "Formal Mathematics Statement Curriculum Learning" reported. Does it mean online-training is meaningless to GPT-f? Could you please provide more detail about Table 2 in Section 7?

Thanks in advance!

Jason Rute (Mar 03 2023 at 22:14):

I think I'm reading Table 7.1.1 as follows: The rows correspond to the data being tested on and the (big) columns correspond to the data being trained on. I think the Evariste and GPT-f models use reinforcement learning on the statements (not the proofs). The numbers with daggers are where the training and testing theorems align. Since training doesn't use proofs, this isn't cheating. This is saying that if you have a set of similar theorem statements without proofs and 1 to 7 days to search for a proof, you can find a decent number of proofs. I've heard people even ask for a similar feature. Imagine you are starting a big development and you have broken it into lemmas. All your remaining work is then to fill in the lemmas. As these systems get better, it seems reasonable to have a process which continuously searches for days over all your lemma statements filling in proofs as it goes and learning from those proofs. (Also, as these algorithms get faster, 1 to 7 days could turn into a few hours.)

Jason Rute (Mar 03 2023 at 22:14):

Having said that, I'm also confused on a few details:

How is miniF2f-curriculum different from miniF2f-train and mini-F2F valid? They are disjoint according to the Polu et al. paper I think.
Where can one find miniF2f-curriculum?
What is the GPT-f 30.6 number in the table 7.1.1? I assume it would mean that the result of training GPT-f with online curriculum learning on miniF2F-curriculum (since that is the point of the Polu et al. paper), but if so, why doesn't it have a dagger and why can't I find it in the Polu et al. paper?

Jason Rute (Mar 03 2023 at 22:15):

Let's ping @Timothee Lacroix.

Eric Wieser (Mar 03 2023 at 22:44):

miniF2F curriculum is here, I think: https://github.com/openai/miniF2F/tree/statement_curriculum_learning

Eric Wieser (Mar 03 2023 at 22:44):

I have no idea why it's in a different branch

venom Yu (Mar 04 2023 at 05:04):

I see, if the online training is done on the statement, does pass rate@64 still make sense? It is equivalent to having searched this statement many times (a lot more than 64) during training period.

fzyzcjy (Mar 04 2023 at 10:28):

@venom Yu IMHO, if the model has perfectly memorized the previously found answer, then pass@64 will be the same as pass@1, and in the extreme case, the model should provide a search "linked-list" instead of a search "tree" since it provides the standard next-step each time perfectly. However, if the model does not, then it is "re-finding" the answer even after the long training period, then pass@64 does increase the probability of finding the answer.

venom Yu (Mar 04 2023 at 11:00):

@fzyzcjy I agree, but if the model have searched the answer for a long time. The comparison with GPT-f would be unfair because GPT-f has not seen these statements before. It's hard to say whether the improvement is due to HTPS or other factors.

fzyzcjy (Mar 04 2023 at 11:02):

@venom Yu I am also a bit confused what makes HTPS improved, since the ablation study seems still have several variable factors.

Jason Rute (Mar 04 2023 at 12:36):

I think in general it is really hard to compare any 2 neural automated theorem provers. They all use very different data, techniques, testing data, and compute budget. Having said that, do you feel the values in the inner most big column of 7.1.1 are (roughly) comparable? If I understand correctly, they all use the statements from MiniF2F-curriculum for iterative refinement (or RL or expert iteration or whatever you want to call it), and they test on the same statements. Of course they use different techniques and compute so it isn’t a perfect isolation of what exactly works best.

Jason Rute (Mar 04 2023 at 12:42):

As for pass@64 (which the GPT-f results also use), I think it makes sense just because it works. As I say elsewhere, it is a bit surprising that restarting the model from scratch 64 times is better than running it 64 times longer. It really shows that we don’t have a good grasp on exploration in these models.

Jason Rute (Mar 04 2023 at 12:44):

Overall, I feel the improvement in this field has been incremental, and until we start going down vastly different directions where it is just clear that something works much better than previous results (our own Alexnet or AlphaFold moment), it will be hard to compare what is best.

Kaiyu Yang (Mar 04 2023 at 19:11):

Note that to calculate Pass@64, HyperTree Proof Search uses different hyperparameters each time (see the top of page 9 or Appendix C). So it's different from simply running the search 64 times.

Jason Rute (Mar 04 2023 at 19:29):

@Kaiyu Yang Do you happen to know if this is the same with the Polu et al paper as well?

Kaiyu Yang (Mar 04 2023 at 19:33):

@Jason Rute I didn't notice similar statements in Polu et al.

Last updated: May 02 2025 at 03:31 UTC