Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Maintaining Papers with Code for Automated Theorem Proving


Jason Rute (Feb 09 2020 at 03:21):

There is a site https://paperswithcode.com. It contains most modern ML papers, especially those with associated open-source code bases. It also maintains leaderboards specifying which models are state of the art. (Such leaderboards should be taken with a grain of salt, but at the same time I think they are quite useful in guiding the field.)

Jason Rute (Feb 09 2020 at 03:21):

Anyway, the site is broken down by topic, and I've been, for the most part, unofficially maintaining the Automated Theorem Proving Topic section. I mostly do the following:

  • Add papers which haven't already been added. (Most papers are quickly added by others or by automated tools, but some get left out.)
  • Add links to the code for the papers. This is really important and one of the main purposes of the site.
  • Add items to the leaderboards and create new leaderboards. I think leaderboards/benchmarks/datasets are one of the best ways to advertise theorem proving to the ML community. However, my approach has been to only add a leaderboard if (1) it is a clear benchmark/dataset that someone could try to compete against, and (2) there have been at least two papers on that particular dataset/benchmark (so at least two data points demonstrating that others think this is a valuable metric). But that is just my approach. It also helps if the benchmark/dataset has a well-defined name. Since TP problems are often of the form "prove as many theorems as possible within time limit T," it is hard to sum up the benchmark in a short title if it doesn't have a well-defined name. (One unfortunately can't add links or a description.)

Jason Rute (Feb 09 2020 at 03:21):

Is anyone interested in helping me maintain that part of the site? My role isn't official, and anyone can add stuff to the site, but I think it is important that those with interest and understanding in ML and TP maintain that section. I've been preoccupied with other stuff and haven't added anything for a few months.

Tom Chen (Nov 29 2022 at 00:31):

I have added a few entries to https://paperswithcode.com/sota/automated-theorem-proving-on-minif2f-test (Sledgehammer and Sledgehammer with heuristics, which are baselines discussed in papers, so that readers of the page can get an idea of how ATPs perform; as well as PACT).

Jason Rute (Nov 29 2022 at 13:32):

Thanks for doing this. What does Pass@8 mean in this context, and more importantly, is it helpful or confusing? For example, all good models (including Hammer) have search. So it doesn't mean what it means in, say, AlphaCode, where Pass@8 means one of the top eight results is correct (but the model can't be sure which one), since in theorem proving you can check if a proof is correct and, if not, backtrack. If I remember correctly, OpenAI started using that term to mean just rerunning the model eight times. I think it is a useful metric, but I also think it gets into the difficulties of comparison. For example, the Minerva solution, while being Pass@100 in some sense, also visits fewer proof states than some other models which do more searching inside a single search tree. But of course the Minerva model probably uses way more compute (as measured, say, in FLOPS).

Jason Rute (Nov 29 2022 at 13:32):

It is a sad state of affairs that the best thing to do in a proof search sometimes is to just throw everything away and start from scratch. :frown: A good model should be able to learn from its mistakes.

Jason Rute (Nov 29 2022 at 13:41):

Also the line:

GPT-f 0.366 HyperTree Proof Search for Neural Theorem Proving 2022

confuses me. What is this referring to, and should it be 36.6? (Or should all the numbers be between 0.0 and 1.0?) Lean GPT-f is confusing. I think it is referring to the model mentioned above under Lean expert iteration.

Jason Rute (Nov 29 2022 at 13:44):

Actually, I really think Pass@N is confusing. In the original Mini-F2F paper, they say:

Pass rates are reported as Pass@N where N is the number of proof search attempts per statement. Pass@N is computed by running more attempts per statement, averaged to get an unbiased, low-variance estimate.

Note, they say averaged.

Jason Rute (Nov 29 2022 at 13:49):

So I think in the OpenAI papers, Pass@N is just used to reduce variance in the pass rate estimate.

Jason Rute (Nov 29 2022 at 13:52):

And the HTPS paper also uses average pass rate mostly, except in a few cases where it uses cumulative pass rate.

Jason Rute (Nov 29 2022 at 13:53):

So in general, the results of pass@8 and pass@64 are directly comparable. The latter just has lower variance.

Tom Chen (Nov 29 2022 at 13:56):

@Jason Rute

What does Pass@8 mean in this context, and more importantly is it helpful or confusing?

It also confuses me, and it was there before my modification. Maybe we should delete that pass@8?

GPT-f 0.366

Also not by me and yes that looks buggy ;)

Tom Chen (Nov 29 2022 at 13:57):

0.366 modified now.

Hmm, but what about "Lean Expert Iteration 29.6 34.5 36.6"? More passes look better. Maybe we should just keep the highest score?

Jason Rute (Nov 29 2022 at 13:58):

The Minerva result uses 100 to mean 100 attempts (but again this is just a form of proof search).

Tom Chen (Nov 29 2022 at 13:58):

Then I guess maybe we should just pick the highest score and merge all those "pass@N" columns.

Jason Rute (Nov 29 2022 at 13:59):

I think it would be good to have one column with the lowest variance number.

Tom Chen (Nov 29 2022 at 13:59):

Anyway here is a screenshot backup before further changes
image.png

Jason Rute (Nov 29 2022 at 13:59):

We could also add tags for which ITP.

Tom Chen (Nov 29 2022 at 13:59):

Jason Rute said:

I think it would be good to have one column with the lowest variance number.

I'm not very sure; so what should we do for the "Lean Expert Iteration" row?

Jason Rute (Nov 29 2022 at 13:59):

The link for PACT is just a repeat of the Lean GPT-f line above, I think. I don't think Thor reproduced it.

Tom Chen (Nov 29 2022 at 14:00):

Jason Rute said:

We could also add tags for which ITP.

done, ATPs are labelled (all others are ITP)

Jason Rute (Nov 29 2022 at 14:01):

Before we rush to change too much, we could ask the authors: @Albert Jiang @Timothee Lacroix @Guillaume @Kunhao Zheng, etc ?

Tom Chen (Nov 29 2022 at 14:01):

Jason Rute said:

The link for PACT is just a repeat the the Lean GPT-f line above I think. I don't think Thor reproduced it.

Sorry, I don't get it. Thor mentions in its paper: image.png

Tom Chen (Nov 29 2022 at 14:02):

Good idea! So I will not modify it further. Anyway, the previous modifications are harmless: they either add new rows (without touching old data) or labels, or fix an explicit bug (0.366 vs 36.6).

Jason Rute (Nov 29 2022 at 14:03):

The PACT paper (PACT is the training-data methodology) has the Lean GPT-f model. The original PACT paper didn't test on miniF2F since it didn't exist yet. It was tested on miniF2F in the miniF2F paper, but using the Lean GPT-f model from the PACT paper.

Jason Rute (Nov 29 2022 at 14:04):

Some people call the model PACT. Others call it Lean GPT-f. It is ambiguous.

Tom Chen (Nov 29 2022 at 14:05):

Then it looks like we should merge "Lean GPT-f" and "PACT" into one row and name it something like "PACT i.e. Lean GPT-f".

Jason Rute (Nov 29 2022 at 14:05):

Or "Lean GPT-f (a.k.a. PACT)"

Jason Rute (Nov 29 2022 at 14:15):

Wait, sorry! I misread the Pass@N stuff. It is talking about the number of attempts. But what they do is run many more than N attempts and average (where I think they split all the attempts into buckets of size N, take the Pass@N for each bucket, and average all the rates across buckets) to get Pass@N. So it does make sense that it goes up with N, since that is more attempts. But at the same time, it is still possible to compare pass@100 for one model and pass@8 for another. The first uses more compute (all other things being equal), but both models are performing a single, large proof search in the end.
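
For concreteness, here is a minimal Python sketch of that bucketed estimator, assuming a hypothetical run_proof_search function standing in for one stochastic proof search attempt (the bucket size and attempt counts are illustrative, not taken from the papers):

```python
import random

def run_proof_search(statement) -> bool:
    # Hypothetical stand-in for one stochastic proof search attempt;
    # returns True if a proof is found within the search budget.
    return random.random() < 0.3

def pass_at_n(statements, n=8, total_attempts=64):
    # Estimate Pass@N by splitting total_attempts into buckets of size n,
    # scoring each bucket as solved/unsolved, and averaging across buckets
    # and statements. More buckets give a lower-variance estimate.
    assert total_attempts % n == 0
    num_buckets = total_attempts // n
    rates = []
    for stmt in statements:
        solved_buckets = sum(
            any(run_proof_search(stmt) for _ in range(n))
            for _ in range(num_buckets)
        )
        rates.append(solved_buckets / num_buckets)
    return sum(rates) / len(rates)
```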

Jason Rute (Nov 29 2022 at 14:16):

Maybe it is easiest to leave it as is. Or to have a pass rate column and a separate attempts column.

Kunhao Zheng (Nov 29 2022 at 14:22):

For this question I'm not quite sure. Maybe summoning @Stanislas Polu and @Jesse Michael Han, who know the term better, will help here. :D

Kunhao Zheng (Nov 29 2022 at 14:25):

My thought is that the rationale behind pass@N is the non-determinism of sampling from language models. If you set the sampling budget to 128, the 128 proof terms sampled in this attempt could be different from those in the next attempt, which determines how the proof tree grows in the tactic states that follow. So in each attempt you grow a different proof tree.

Jason Rute (Nov 29 2022 at 14:26):

Last nitpick: I think all the OpenAI models listed there use external data, in that the models started from a pre-trained variant of GPT-3 and also incorporate arXiv, GitHub, and Math StackExchange.

Kunhao Zheng (Nov 29 2022 at 14:28):

Kunhao Zheng said:

My thought is that the rationale behind pass@N is the non-determinism of sampling from language models. If you set the sampling budget to 128, the 128 proof terms sampled in this attempt could be different from those in the next attempt, which determines how the proof tree grows in the tactic states that follow. So in each attempt you grow a different proof tree.

And that's why in the miniF2F paper you would not find anything other than pass@1 for tidy. Because you can regard tidy as deterministic sampling that always outputs the L+HL, pass@N with N>1 will be the same as pass@1.
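
As a toy illustration of this point (the stand-in provers below are made up, not the actual tidy baseline or GPT-f):

```python
import random

def tidy_attempt(statement: str) -> bool:
    # Deterministic stand-in for tidy: the same pre-determined tactic list is
    # tried in the same order every time, so repeated attempts give the same
    # answer and pass@N equals pass@1.
    return len(statement) % 3 == 0  # toy, fully deterministic "search"

def sampled_attempt(statement: str) -> bool:
    # Stochastic stand-in for a language-model prover: each attempt samples
    # different proof steps, so the grown proof tree differs between attempts
    # and extra attempts can succeed where earlier ones failed.
    return random.random() < 0.2

def pass_at_n(attempt, statement, n: int) -> bool:
    return any(attempt(statement) for _ in range(n))

# pass_at_n(tidy_attempt, s, 8) == pass_at_n(tidy_attempt, s, 1) for every s,
# while pass_at_n(sampled_attempt, s, 8) succeeds more often than with n=1.
```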

Jason Rute (Nov 29 2022 at 14:31):

What is strange about Pass@N for, say, an MCTS algorithm is that one could either do pass@64 visiting N nodes each time or do pass@1 visiting N×64 nodes. It just happens that the first does much better right now.

Jason Rute (Nov 29 2022 at 14:33):

And for an ATP like Hammer, you could run it 64 times in parallel for pass@64 (assuming it is stochastic), or run it 64 times longer for pass@1.

Kunhao Zheng (Nov 29 2022 at 14:57):

I gave a careful read to the original GPT-f paper https://arxiv.org/abs/2009.03393, sections 4.3.1 and 4.4.

Note that there are some typos around higher vs lower logprob...

Decreasing the number of expansions per goal e increases the variance as we less systematically explore the action space when expanding each goal. The e value can be taken quite high as each auto-regressive proof step generation uses the same query for a given goal and can therefore be batched together. Increasing e too much may also hurt performance given the breadth first nature of the cumulative logprob ordering. Increasing the number of attempts per proposition a decreases variance and consistently improves performance up to reasonably high values (We use a = 32 attempts for our final benchmarking). We found that a = 4 limited the variance in our evaluations while remaining tractable given the amount of compute we had available. Finally the proof search depth d has little impact on variance but naturally improves performance (we take d = 128 to evaluate our models and d = 256 for our final benchmarking).

Kunhao Zheng (Nov 29 2022 at 15:05):

The e here corresponds to the number of new nodes that Jason mentioned. I think the conclusion in this paragraph is under the assumption that the priority queue size is fixed.

The search setup used in miniF2F tidy is that we do not grow the search (i.e. insert new tactic states back into the priority queue) once the queue is full OR the tactic state reaches the depth limit.

Kunhao Zheng (Nov 29 2022 at 15:14):

Oh, also, that's under the setting of cumulative logprob for the priority queue. In my miniF2F tidy experiment, for each new tactic state, its value is computed as the value given by the pre-determined list (in the GPT-f setting it's given by the value function) multiplied by the parent's value. So I guess that's different from MCTS, where the value function does not decay in this way?

Kunhao Zheng (Nov 29 2022 at 15:16):

That said, even the same proof step at different depths of the tree will have different values due to the decay we set.
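
Roughly, the kind of best-first search being described might look like the following sketch: the queue is ordered by a cumulative score, a child's score is its own step score multiplied by the parent's (the decay mentioned above), and the search stops growing once the queue is full or the depth limit is hit. The function names and the expand/is_solved callbacks are illustrative assumptions, not the actual miniF2F or GPT-f code:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_score: float              # heapq is a min-heap, so store the negated score
    state: str = field(compare=False)
    depth: int = field(compare=False, default=0)

def best_first_search(root_state, expand, is_solved,
                      max_queue=256, max_depth=128):
    # expand(state) is assumed to return (step_score, new_state) pairs, e.g.
    # scored tactic applications from a language model or a pre-determined
    # tactic list; is_solved(state) checks whether the proof is finished.
    queue = [Node(-1.0, root_state, 0)]
    while queue:
        node = heapq.heappop(queue)
        if is_solved(node.state):
            return True
        # Do not grow the search past the depth limit or once the queue is full.
        if node.depth >= max_depth or len(queue) >= max_queue:
            continue
        for step_score, child_state in expand(node.state):
            # A child's score is its own step score times its parent's score,
            # so the same proof step gets a smaller value deeper in the tree.
            child_score = step_score * (-node.neg_score)
            heapq.heappush(queue, Node(-child_score, child_state, node.depth + 1))
    return False
```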

Guillaume (Nov 29 2022 at 21:37):

Tom Chen said:

Anyway here is a screenshot backup before further changes
image.png

Thor + expert iteration on auto-formalised theorems should not be added there, because many of the auto-formalised problems are actually in the miniF2F test set. So if you do expert iteration on these, you have already performed quite some search on the test theorems (so this is not pass@1 at all). Using a similar approach in Evariste, we get more than 50% on miniF2F-test with pass@1 when we do the same thing.

Guillaume (Nov 29 2022 at 21:39):

It's not clear what the best thing to report is, but the only way to make this fair is to only allow the model to see the same set of problems during online MCTS / expert iteration (even though that's not the most interesting setting, because in practice you want to do your iterations with as many statements as possible).

Albert Jiang (Nov 29 2022 at 23:38):

Guillaume said:

Tom Chen said:

Anyway here is a screenshot backup before further changes
image.png

Thor + expert iteration on auto-formalised theorems should not be added there, because many of the auto-formalised problems are actually in the miniF2F test set. So if you do expert iteration on these, you have already performed quite some search on the test theorems (so this is not pass@1 at all). Using a similar approach in Evariste, we get more than 50% on miniF2F-test with pass@1 when we do the same thing.

I agree with Guillaume that Thor + expert iteration is not pass@1. A portion of miniF2F (the MATH dataset ones) might have been seen between passes, if they are formalised correctly. So I'd frame this as online learning that allows a few tries (few = # expert iterations + 1; in our case, few = 3) per problem, where you can do additional learning in between tries.

Albert Jiang (Nov 29 2022 at 23:39):

It's in the "uses extra training data" category though, which tells you that it's not magic that it performs better than just Thor.

Guillaume (Nov 30 2022 at 01:39):

I think I would call that "test statements visible during training", or something similar, to highlight the fact that the model is never provided with the proofs of test theorems.

Tom Chen (Dec 22 2022 at 02:12):

Revisiting pass@N: in the HTPS paper, appendix C, they mention that there are a lot of hyper-parameters, such as decoding temperature, depth penalty, exploration constant, etc. Their approach is to randomly choose those hyper-parameters at the beginning of each proof. Thus, one more reason why pass@N is better when N is larger may be that, with more trials, the hyper-parameters are more likely to happen to be good for a given statement.
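
A rough sketch of that idea, with made-up parameter names and ranges (see HTPS appendix C for the actual ones):

```python
import random

def sample_search_hyperparams():
    # Each proof attempt draws its own search hyper-parameters, so with more
    # attempts (larger N in pass@N) at least one draw is more likely to land
    # on settings that suit the particular statement.
    return {
        "decoding_temperature": random.uniform(0.7, 1.4),   # illustrative range
        "depth_penalty": random.choice([0.9, 0.95, 1.0]),   # illustrative values
        "exploration_constant": random.uniform(1.0, 10.0),  # illustrative range
    }

def pass_at_n(statement, run_search, n):
    # run_search(statement, hyperparams) -> bool is assumed to perform one
    # full proof search attempt with the given settings.
    return any(run_search(statement, sample_search_hyperparams())
               for _ in range(n))
```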


Last updated: Dec 20 2023 at 11:08 UTC