Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: COPRA


Scott Morrison (Oct 10 2023 at 00:31):

https://arxiv.org/abs/2310.04353

Jason Rute (Oct 10 2023 at 00:36):

I haven’t got past the abstract but I really hope they address the massive possibilities of data leakage of testing GPT-4 on compcert. Do they have any reason to think compcert is not in the training data for GPT-4?

Scott Morrison (Oct 10 2023 at 00:36):

I only skimmed but didn't notice any mention of leakage.

Scott Morrison (Oct 10 2023 at 00:37):

Figure 8a seems to be saying they are successful on almost the same subset of problems as ReProver (although more efficient on those). I suspect that should be interpreted as discouraging news.

Jason Rute (Oct 10 2023 at 00:40):

Also not past the abstract:

We present the first language-agent approach to formal theorem-proving.

Even if they wouldn’t count the GPT-f style approaches, what about Draft-Sketch-Prove, Baldur, or Sagredo? (I guess Sagredo could be considered contemporaneous, unpublished, and untested, but I’m still curious how it is different.)

Jason Rute (Oct 10 2023 at 00:45):

I shouldn't be too critical without reading it. Looks very interesting! Also, I'd love to hear from the authors, Amitayush Thakur, Yeming Wen, Swarat Chaudhuri, but I don't see any of them on Zulip. :/

Jason Rute (Oct 10 2023 at 00:55):

Actually, I should take my Compcert comment back a little bit. Basically, every paper that uses a pre-trained LLM and tests it on an existing set of ITP theorems which came before the model was trained is probably just as guilty of this. This includes (my own) PACT paper, LeanDojo/ReProver, and Baldur. (Also the miniF2F benchmark isn't immune either since even if the formal proofs aren't available, the informal ones are.) Nonetheless, I'd still like to see it discussed, especially since they are comparing it to ProverBot9001 which has a much simpler training dataset that doesn't have this problem.

Kaiyu Yang (Oct 10 2023 at 01:09):

what about Draft-Sketch-Prove, Baldur, or Sagredo?

It depends on what "language-agent" really means. I don't think the term has a well-accepted meaning, but if we consider the following definition in the paper's abstract

Language agents, which use a large language model (LLM) capable of in-context learning to interact with an external environment,

Draft-Sketch-Prove and Baldur are not language agents in this sense?

Scott Morrison (Oct 10 2023 at 01:12):

What is the in-context learning that COPRA is doing?

Jason Rute (Oct 10 2023 at 01:31):

In-context learning usually means putting examples in the prompt for the task you want to do. But with modern instruction-finetuned models, it might also just mean putting anything extra in the prompt, like instructions, previous states, and useful information. I think Figures 1 and 4 sort of show the information they put in the prompt, but I'd have to read closer.
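As a rough illustration only (the field names and layout below are hypothetical, not the paper's actual prompt format), assembling that kind of prompt might look something like:

```python
# Hypothetical sketch of assembling an agent-style prompt; the field names and
# layout are made up for illustration and do not come from the COPRA paper.

def build_prompt(goals, history, retrieved_lemmas, last_error=None):
    parts = ["You are proving a theorem step by step. Reply with exactly one tactic."]
    parts.append("Current goals:\n" + "\n".join(goals))
    if retrieved_lemmas:
        parts.append("Possibly useful lemmas:\n" + "\n".join(retrieved_lemmas))
    if history:
        parts.append("Tactics tried so far (most recent last):\n" + "\n".join(history))
    if last_error:
        parts.append("The previous tactic failed with:\n" + last_error)
    return "\n\n".join(parts)
```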

Jason Rute (Oct 10 2023 at 02:07):

I should say I think “in-context learning” also implies you don't need to additionally train/fine-tune the model. You just take an off-the-shelf model.

Utensil Song (Oct 10 2023 at 04:46):

I am always skeptical of works that rely deeply on black-box LLMs: it's difficult to isolate the contributing factors and make them transferable to open LLMs. I would at least like to see experiments with the same method on an open LLM that has been finetuned on a properly curated ITP corpus with leakage control. Prompt engineering is sensitive to the target LLM; prompts that work well on GPT-4 and the like simply don't trigger the same way of organizing information on open LLMs. These two lines of work seem to be diverging. Finding (iii) of the experiments suggests "the use of GPT-4, as opposed to GPT-3.5, within the agent is essential for success", which is somewhat disappointing.

The debug-style agent is a research idea that is months old and already has quite a few real applications. By feeding failed attempts and the feedback from the ITP back to the LLM, it takes significantly fewer interactions (81.6 -> 1.75 for successful proofs) to achieve a similar level of correctness, which is the main advance of the method. As for pass rate, it's on the same level as ReProver and worse than Proverbot. For the interesting proofs, the paper lacks the kind of qualitative analysis found in other work on black-box LLMs, so it is less illuminating about the effectiveness of, and possible improvements to, the method.

Utensil Song (Oct 10 2023 at 04:50):

what about Draft-Sketch-Prove, Baldur, or Sagredo?

The paper says: " the use of informal proofs (in DSP) and repair models (in Baldur) — are orthogonal to our approach. "

Utensil Song (Oct 10 2023 at 05:06):

The MDP approach, with DFS and backtracking, making use of the goal information, gives COPRA the ability to better reach a local optimum.

But in the real practice of proving, most of the time when I mimic this way of using only the goal information, I end up somewhere I need knowledge from outside the context (e.g. how the theorem is proved informally, such as setting extra intermediate goals that may be very far from the superficial features of the current theorem, or how to attack a theorem of similar structure in a different field).

The approach is a good starting point, but a proper agent should make good use of pretrained knowledge, finetuned adaptation ability, external knowledge bases, tools, etc. in an integrated manner (goals should not be the only source of feedback from the environment; in practice you'll see people making a #mwe or running commands to get more information from the ITP, which is what a proactive agent should also try). I would argue that for teaching such an agent, the informal interactions on this Zulip are also essential and vital.

Utensil Song (Oct 10 2023 at 05:17):

There is also an interesting comment from the paper:

We depart from prior RL-based work for theorem-proving by imposing a partial order on MDP states, allowing rewards to have a textual component, and allowing history-dependent policies.

Reward design is a crucial part of RL, and rewards in textual form are an important gain from LLMs. The same applies to history-dependent policies.
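As a minimal sketch of what a reward with a textual component and a history-dependent policy could look like in code (hypothetical types, not the paper's implementation):

```python
from dataclasses import dataclass

# Hypothetical types, not the paper's implementation: the "reward" carries both
# a scalar signal and the proof assistant's textual feedback, and the policy is
# history-dependent because it sees all previous steps, not just the current state.

@dataclass
class Reward:
    value: float      # e.g. +1 proof finished, 0 progress, -1 invalid tactic
    feedback: str     # raw text from the ITP: error message, remaining goals, ...

@dataclass
class Step:
    tactic: str
    reward: Reward

def next_tactic(policy, current_goals, history):
    # `policy` is any callable (e.g. an LLM wrapper) that reads the goals plus
    # the full interaction history and returns the next tactic as a string.
    return policy(current_goals, history)
```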

These are the major things that make LLMs exciting: for the first time, accumulated human knowledge expressed in semantic form can be used to convey quantitative information to machines running statistics-based decision-making algorithms, and then fed back to humans in an interpretable way. (We are still far from actually arriving there, but the path has been laid down.)

And what agents reveal is the ability to communicate with any existing functionality, using a protocol that can be expressed in text form, and then react to it.

Jason Rute (Oct 10 2023 at 12:36):

As for it being worse than ProverBot9001, that seems to come down to speed. When measured by speed, ProverBot9001 does much better. When measured by inference steps, this paper does much better. It is an interesting trade-off when one solver makes better decisions and another works by brute force speed. For example, as language models get faster and better, it is possible this exact same approach, but with a different language model, could be 100x faster and then outcompete ProverBot9001. (But again, it is hard to trust that this finding would generalize to new projects that the language hasn't already seen because I imagine Compcert is in the GPT-4 training data.)

Utensil Song (Oct 10 2023 at 13:38):

By worse than Proverbot in pass rate, I was referring to

image.png

Utensil Song (Oct 10 2023 at 13:40):

But this is under the constraint of a cap of 60 inferences imposed only on COPRA (so it's unfair; sorry, I didn't read carefully enough). No figure is available in the paper for the pass rate with unlimited time/inference count, only that

as shown in the figure, COPRA proves many more theorems than PROVERBOT9001 if only 60 inferences are allowed.

Amitayush Thakur (Oct 10 2023 at 21:30):

Thank you so much for your valuable feedback. I would like to answer some of the questions raised above.

Amitayush Thakur (Oct 10 2023 at 21:34):

Kaiyu Yang said:

what about Draft-Sketch-Prove, Baldur, or Sagredo?

It depends on what "language-agent" really means. I don't think the term has a well-accepted meaning, but if we consider the following definition in the paper's abstract

Language agents, which use a large language model (LLM) capable of in-context learning to interact with an external environment,

Draft-Sketch-Prove and Baldur are not language agents in this sense?

We intend to have more human-like interactions with the ITP and to generate the proof step by step, along with using external knowledge.
a. In Baldur, the model is not trained to interact with the ITP and generate the proof steps individually (like humans do). It generates the whole proof in one go, and another proof-repair model is then trained to fix the proof.
b. In the case of DSP, while it might use external knowledge (like generating informal proofs, or a hammer), there is no interaction with the ITP. The proof sketch is generated all at once and then a hammer is used to fill the holes. Also, it is important to note that in certain situations, like software verification (e.g. CompCert), the notion of an informal proof is not well defined and there is no informal statement to begin with (which is required for approaches like DSP).

Amitayush Thakur (Oct 10 2023 at 21:49):

Jason Rute said:

I haven’t got past the abstract but I really hope they address the massive possibilities of data leakage of testing GPT-4 on compcert. Do they have any reason to think compcert is not in the training data for GPT-4?

The data leakage problem can be ignored to an extent given that GPT-4 alone could not prove all the theorems by itself. In fact, there is a 2x gain in the number of proofs when we use GPT-4 as a language agent vs GPT-4 alone. It is important to note that GPT-4 is not trained on elaborate proof states by hooking up the ITP unlike ReProver, Proverbot, or PACT, so it is hard to measure the impact of data leakage. Another interesting observation was most of the proofs generated by GPT-4 Prover Agent as well as GPT-4 one-shot were very different from the proofs mentioned in the CompCert or miniF2F dataset. It doesn't look like GPT is producing these proofs from memory even if it were trained on these datasets.

The improvement on few-shot vs agent:
image.png

An example where GPT-4 failed to produce from memory:
image.png

Amitayush Thakur (Oct 10 2023 at 22:31):

Jason Rute said:

As for it being worse than ProverBot9001, that seems to come down to speed. When measured by speed, ProverBot9001 does much better. When measured by inference steps, this paper does much better. It is an interesting trade-off when one solver makes better decisions and another works by brute force speed. For example, as language models get faster and better, it is possible this exact same approach, but with a different language model, could be 100x faster and then outcompete ProverBot9001. (But again, it is hard to trust that this finding would generalize to new projects that the language hasn't already seen because I imagine Compcert is in the GPT-4 training data.)

Speed correlates with the number of inferences needed because it is not just running inferences on the LLM that consumes time; running a wrong proof step on the ITP also consumes time. So even if per-step inference is slow, overall there is a big gain in wall-clock time due to multiple factors, including not having to waste time running wrong proof steps on the ITP.

We ran a comparison of wall-clock time with ReProver, and despite the per-inference speed being very slow for GPT (because of throttling and token limits), the average time taken to finish proofs is about 9x faster if we compare wall-clock times (even failing to prove is faster). This empirically establishes that there is a direct correlation between the number of inferences and actual wall-clock time. The number of inferences seems to be a more reasonable metric than wall-clock time because there can be differences in hardware, network issues, etc.
image.png

Amitayush Thakur (Oct 10 2023 at 22:36):

Utensil Song said:

The MDP approach, with DFS and backtracking, making use of the goal information, gives COPRA the ability to better reach a local optimum.

But in the real practice of proving, most of the time when I mimic this way of using only the goal information, I end up somewhere I need knowledge from outside the context (e.g. how the theorem is proved informally, such as setting extra intermediate goals that may be very far from the superficial features of the current theorem, or how to attack a theorem of similar structure in a different field).

The approach is a good starting point, but a proper agent should make good use of pretrained knowledge, finetuned adaptation ability, external knowledge bases, tools, etc. in an integrated manner (goals should not be the only source of feedback from the environment; in practice you'll see people making a #mwe or running commands to get more information from the ITP, which is what a proactive agent should also try). I would argue that for teaching such an agent, the informal interactions on this Zulip are also essential and vital.

I agree with you that we need a lot more context than just goals to prove something. In our work, we tried to use goals, history, some useful lemmas (through retrieval), error feedback, etc., but I do agree a lot more needs to be done here to make it closer to the human way of proving theorems.

Amitayush Thakur (Oct 10 2023 at 22:51):

Utensil Song said:

But this is under the constraint of a cap of 60 inferences imposed only on COPRA (so it's unfair; sorry, I didn't read carefully enough). No figure is available in the paper for the pass rate with unlimited time/inference count, only that

as shown in the figure, COPRA proves many more theorems than PROVERBOT9001 if only 60 inferences are allowed.

We had a strict limit of 60 queries for GPT-4, and there were a number of theorems for which we didn't finish the proof because of this constraint. However, this constraint was not imposed on ReProver and Proverbot9001, both of which ran more than 1k inferences for proving some theorems.

Scott Morrison (Oct 11 2023 at 00:24):

Thanks very much, @Amitayush Thakur, for coming here to post replies! It's really great to see authors of papers working with Lean here.

Kaiyu Yang (Oct 11 2023 at 01:42):

@Amitayush Thakur Thank you for the explanation! I have a question about the evaluation metric: Why not directly evaluate the number of successful proofs within a wall time budget of k minutes (with varying k)? The end of page 7 seems to suggest that the pass@k-inferences metric is designed to be correlated with this metric?

Kaiyu Yang (Oct 11 2023 at 01:47):

In addition, I'd be really impressed if COPRA can perform competitively with finetuned models (e.g., ReProver) on problems with in-domain training data (e.g., the test set of LeanDojo Benchmark). ReProver is not expected to perform well on miniF2F since it is based on a relatively old pretrained model (ByT5) and finetuned on mathlib. In contrast, GPT-4 has probably seen much more data related to miniF2F during training.

Utensil Song (Oct 11 2023 at 04:45):

Amitayush Thakur said:

Utensil Song said:

But this is under the constraint of a cap of 60 inferences imposed only on COPRA (so it's unfair; sorry, I didn't read carefully enough). No figure is available in the paper for the pass rate with unlimited time/inference count, only that

as shown in the figure, COPRA proves many more theorems than PROVERBOT9001 if only 60 inferences are allowed.

We had a strict limit of 60 queries for GPT-4, and there were a number of theorems for which we didn't finish the proof because of this constraint. However, this constraint was not imposed on ReProver and Proverbot9001, both of which ran more than 1k inferences for proving some theorems.

Thanks for the clarification: COPRA outperforms them remarkably under a constraint imposed only on itself.

While I appreciate the main advance of the paper (i.e. significantly accelerated proving), I wonder how many theorems COPRA would eventually prove with this constraint lifted. Should this more complete result be mentioned in the paper, even though it's not your emphasis?

Utensil Song (Oct 11 2023 at 05:04):

Amitayush Thakur said:

Jason Rute said:

I haven’t got past the abstract but I really hope they address the massive possibilities of data leakage of testing GPT-4 on compcert. Do they have any reason to think compcert is not in the training data for GPT-4?

The data leakage problem can be ignored to an extent given that GPT-4 alone could not prove all the theorems by itself. ......Another interesting observation was most of the proofs generated by GPT-4 Prover Agent as well as GPT-4 one-shot were very different from the proofs mentioned in the CompCert or miniF2F dataset. It doesn't look like GPT is producing these proofs from memory even if it were trained on these datasets. ......

Sure, the GPT-4 one-shot result can corroborate this. However, one can only be certain about it with more sampling (at a higher temperature, etc.) instead of just one sample, or maybe more ablation experiments, not only on the data leakage factor but also on the foundation model factor.

I sincerely hope this work is transferable to open LLMs. Do you have an educated guess as to why the use of GPT-4 is crucial to the method? If the factor is not that it knows how to prove the exact theorem beforehand, is it that it understands and executes the composite instructions better, or that it benefits from some particular design of its fine-tuning? Is it possible to identify the capability gap and shed light on how open LLMs should be finetuned to make this method transferable?

Sorry, I understand these are open research questions. It's just so exciting to see that COPRA is able to prove theorems in far fewer inference rounds (it also means more accurate attempts, which is a quantitative change that can lead to a qualitative change, i.e. to more sophisticated emergent behaviors).

Jason Rute (Oct 11 2023 at 12:37):

@Amitayush Thakur Thank you for responding. As for time, I guess I was asking more about your model versus ProverBot9001 than versus ReProver. Since ProverBot9001 isn't a Transformer language model, it is significantly faster. Do you have any reason to think right now that, given the same amount of time, COPRA would prove more theorems than ProverBot9001? I can't find in your paper what you say the stopping criterion is for testing COPRA and ProverBot9001 on CompCert, but I think it must be a timeout since neither model gets to a nice round number of average inferences on failure. (If your stopping criterion were just 60 inferences, then the average number of inferences on failure would be about 60, no?) So that suggests to me that you ran each model for the same amount of time and that COPRA rarely got up to the 60-inference limit, either because of a timeout or because it exhausted its search tree.

Jason Rute (Oct 11 2023 at 12:37):

As an aside, I'd love versions of Figures 4 and 5 but with time on the x-axis. Since Proverbot seems to be about 10x faster, if you changed Figure 5 to plot time I think you would find that ProverBot beats COPRA on time. (Of course, as language models get faster, that may quickly change.)

Jason Rute (Oct 11 2023 at 12:37):

I also understand that there is no perfect comparison between different models which use vastly different resources. GPT-4 runs in parallel on servers while I imagine ProverBot9001 has more meager resource usage (running on a CPU even?). This again is why plots like Figure 5 are so important, because they show the behavior at different resource amounts.

Jason Rute (Oct 11 2023 at 12:37):

On a practical level, while it is easy to laugh at models like ProverBot9001 as too simple since they are not using Transformers and assume they are obviously bad, it isn't clear to me that this is the case at all, and it could be that such models are much, much better than GPT-f (PACT), HTPS, ReProver, and the like. (And even more so since they can run on a CPU.) We just haven't compared the two approaches well enough. (For this reason, I am glad you did this comparison on CompCert even if I still have many questions about it.)

Jason Rute (Oct 11 2023 at 12:37):

As for CompCert data leakage, thank you for your comments. Are these in the paper? If not, I think you should put them in the appendix. (I think every paper testing an LLM on problems that existed before the LLM was trained needs to address these problems.) But I think your logic might be a bit flawed.

Jason Rute (Oct 11 2023 at 12:38):

The data leakage problem can be ignored to an extent given that GPT-4 alone could not prove all the theorems by itself. In fact, there is a 2x gain in the number of proofs when we use GPT-4 as a language agent vs GPT-4 alone.

This is an important point, but that doesn't mean there is no leakage, and that the leakage doesn't significantly help the model. Just that GPT-4 (which is a stochastic model) couldn't parrot character-for-character a proof.

Jason Rute (Oct 11 2023 at 12:38):

It is important to note that GPT-4 is not trained on elaborate proof states by hooking up the ITP unlike ReProver, Proverbot, or PACT, so it is hard to measure the impact of data leakage.

I used to think the same when it came to GPT-f in the PACT paper (which I am an author of). But then I realized the following. For the first tactic, the goal is the same as the theorem statement, so there is no hidden information. Even for later goals there might be a lot of overlap between the theorem statement and the proof state. Moreover (if CompCert was in GPT-4's training data), the model would have seen that certain lemmas and tactics were used to prove a fact.

Jason Rute (Oct 11 2023 at 12:38):

Another interesting observation was most of the proofs generated by GPT-4 Prover Agent as well as GPT-4 one-shot were very different from the proofs mentioned in the CompCert or miniF2F dataset. It doesn't look like GPT is producing these proofs from memory even if it were trained on these datasets.

Another interesting observation worth putting in the paper. (Is this in an appendix? An example would be nice.) But again, it isn't really conclusive. It also depends on how the proofs overlap. Does the Copra proof still use the same key lemma as the original, or start with the same initial sequence of tactics, or does it look like a completely different direction of proof?

Jason Rute (Oct 11 2023 at 12:38):

Finally, I've been a bit critical of your paper, but it does seem to have a lot of interesting advancements and features. I look forward to reading it in more detail. For one, it really makes good use of GPT-4's large context window to throw in all sorts of useful information like the history of the proof search so far and retrieved lemmas. It would be great to see, in this paper or a future one, a detailed ablation of which components are most useful.

Amitayush Thakur (Oct 12 2023 at 00:44):

Thank you so much for taking the time to write detailed comments. I think this has helped me better understand some aspects that we might have missed discussing in the paper's current version. It also suggests some interesting studies that we can add to better understand the contributions of different components within our approach.
After reading the comments, I can see four broad concerns that I think can be addressed to an extent in the next version and can be helpful in future work too.

  1. The motivation for the metric pass@k-inferences versus pass@k, pass@k-minutes, etc.
  2. Transferability of prompts from GPT-4 to other models.
  3. Data leakage problems.
  4. Need for more ablations.

Besides these, I would also like to discuss the comparison with Proverbot9001 because of certain unique aspects that are at play.

Amitayush Thakur (Oct 12 2023 at 00:46):

Kaiyu Yang said:

Amitayush Thakur Thank you for the explanation! I have a question about the evaluation metric: Why not directly evaluate the number of successful proofs within a wall time budget of k minutes (with varying k)? The end of page 7 seems to suggest that the pass@k-infefrences metric is designed to be correlated with this metric?

1. The motivation for the metric pass@k-inferences versus pass@k, pass@k-minutes:
The metric pass@k used in most theorem-proving approaches is inspired by the pass@k metric used in evaluating models for code (https://arxiv.org/pdf/2107.03374.pdf). However, the pass@k metric for theorem proving has not been well defined. For example, an approach that samples the whole proof in one go and another approach that calls the LLM for each proof step to complete one proof sample both report under the same metric, pass@1. The motivation behind the pass@k metric is to show the effectiveness of the model in generating proofs, which means that if pass@1 is higher for one approach, it is more effective at finding proofs than another that only reaches a similar number at pass@64.

So our motivation for pass@k-inferences is that a good metric should showcase the effectiveness of the guidance model and the search process together, regardless of the details of the implementation. For example, Baldur or DSP generates the whole proof at once, versus GPT-f, PACT, ReProver, or Proverbot generating the proof step by step. We argue that pass@k-inferences is a better metric because it correlates with the proof-finding time to an extent. Since the exact time might not always be a good reflection of effectiveness because of hardware differences, network throttling, etc., it makes sense not to compare directly on metrics like pass@k-minutes or pass@k-seconds. Not only will these metrics be brittle and very sensitive to the size, hardware, and other implementation details of the model, but not every search implementation is based on a timeout. For example, Proverbot9001 doesn't use timeout-based search (and hence we don't compare on the basis of time with Proverbot9001). It searches based on depth-limited search without any backtracking (where it continues the search until it finds a node that doesn't lead to any correct tactic). The depth window keeps increasing by a couple of steps as long as it keeps finding some correct tactic; this is very different from other searches where the depth is fixed and pre-decided.

What we want to argue is that pass@k-inferences reasonably resembles pass@k-minutes while avoiding brittleness and differences in hardware. Pass@k-inferences is also fairer than pass@k because it compares approaches regardless of the complexity of the underlying search mechanism. We can definitely add a plot for ReProver vs COPRA on pass@k-minutes and pass@k-inferences and show that these metrics are correlated, and only the scale of the plot changes. However, for all practical purposes, pass@k-inferences is a much more balanced metric that tries to accommodate the factors responsible for the effective search of proofs.
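For concreteness, here is a minimal sketch of how pass@k-inferences can be computed from per-theorem results (this is only an illustration, not our actual evaluation code); each result records whether a proof was found and how many guidance/inference calls were used:

```python
# Illustration only, not the actual evaluation code.
# Each result is a pair (proved: bool, guidance_calls: int).

def pass_at_k_inferences(results, k):
    """Fraction of theorems proved using at most k guidance/inference calls."""
    proved_within_k = sum(1 for proved, n in results if proved and n <= k)
    return proved_within_k / len(results)

# Sweeping k gives the pass@k-inferences curve:
results = [(True, 3), (True, 45), (False, 60), (True, 120)]
for k in (10, 60, 100, 500):
    print(k, pass_at_k_inferences(results, k))
```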

Kaiyu Yang (Oct 12 2023 at 00:58):

@Amitayush Thakur Thanks for elaborating on the pass@k-inferences metric. I fully agree that wall time can be sensitive to hardware differences. However, I think it is what end users ultimately care about. For mathematicians writing proofs in Lean, if they're going to use a proof automation tool powered by LLMs, the number of LLM inferences is probably not important, whereas the wall time is critical. In addition, the pass@k-inferences metric favors large models compared to small and more efficient models. And it does not allow comparisons between LLM-based methods and non-LLM-based methods.

Since different metrics have different trade-offs, maybe the paper could be made stronger by including multiple metrics.

Min-Hsien Weng (Oct 12 2023 at 01:12):

Amitayush Thakur said:

Thank you so much for taking the time to write detailed comments.

Thanks @Amitayush Thakur for responding. I just had minor questions about the "failure dictionary" (please correct me if I overlooked anything in the paper :sweat_smile: ).
(1) In Figure 3, a failure dictionary takes a proof goal as input and returns a set of actions that can be avoided. I wonder if there is an existing dictionary that we can use, or if we need to build our own.
(2) I'm curious if we should keep track of the last step (previous tactics), in addition to the proof goal, in this dictionary. Sometimes, we may find that two tactics appearing together cause errors, regardless of the goals they are given. If we could collect this mutually exclusive information, it may be valuable and can help speed up the search for tactics.

Amitayush Thakur (Oct 12 2023 at 01:16):

Kaiyu Yang said:

And it does not allow comparisons between LLM-based methods and non-LLM-based methods.

Inferences = guidance for non-LLM-based methods (that is why we limit each inference to exactly one sample, to eliminate any LLM-specific concepts). Maybe a better name would have been pass@k-guidance-steps. It can be used for most theorem-proving approaches where an ML model is consulted for the generation of a tactic or its arguments. We actually do the comparison with the number of steps Proverbot uses in the search as a proxy for the guidance it receives from its trained model.

Since different metrics have different trade-offs, maybe the paper could be made stronger by including multiple metrics.

I agree with you and we will add a plot comparing pass@k-minutes as well as pass@k-inferences in the next version.

Amitayush Thakur (Oct 12 2023 at 01:25):

Utensil Song said:

Amitayush Thakur said:

Jason Rute said:

I haven’t got past the abstract but I really hope they address the massive possibilities of data leakage of testing GPT-4 on compcert. Do they have any reason to think compcert is not in the training data for GPT-4?

The data leakage problem can be ignored to an extent given that GPT-4 alone could not prove all the theorems by itself. ......Another interesting observation was most of the proofs generated by GPT-4 Prover Agent as well as GPT-4 one-shot were very different from the proofs mentioned in the CompCert or miniF2F dataset. It doesn't look like GPT is producing these proofs from memory even if it were trained on these datasets. ......

Sure, the GPT-4 one-shot result can corroborate this. However, one can only be certain about it with more sampling (at a higher temperature, etc.) instead of just one sample, or maybe more ablation experiments, not only on the data leakage factor but also on the foundation model factor.

I sincerely hope this work is transferable to open LLMs. Do you have an educated guess as to why the use of GPT-4 is crucial to the method? If the factor is not that it knows how to prove the exact theorem beforehand, is it that it understands and executes the composite instructions better, or that it benefits from some particular design of its fine-tuning? Is it possible to identify the capability gap and shed light on how open LLMs should be finetuned to make this method transferable?

Sorry, I understand these are open research questions. It's just so exciting to see that COPRA is able to prove theorems in far fewer inference rounds (it also means more accurate attempts, which is a quantitative change that can lead to a qualitative change, i.e. to more sophisticated emergent behaviors).

2. Transferability of prompts from GPT-4 to other models:
2. Transferability of prompts from GPT-4 to other models:
One open LLM that comes close to ChatGPT is Llama-2. We did try some examples manually with Code Llama, but it couldn't correctly follow the formatting instructions, as opposed to GPT-3.5 and GPT-4. However, we didn't have the bandwidth to do extensive comparisons with Code Llama because it was released only a couple of weeks ago. We plan to do that in future work. As for why GPT-4 is better, I did check out some of the prompts that were generated by GPT-4 vs GPT-3.5, and it looks like GPT-4 has fewer hallucinations. One common theme was that the GPT-4 agent corrected errors when it saw past mistakes, while the GPT-3.5 agent didn't and hallucinated similar mistakes. This is also reflected in the average number of inferences when the proof is not found, which is higher in the case of the GPT-4 agent because it can change and try a variety of different tactics, as compared to getting stuck in a repetition loop with the same tactic.
image.png

Amitayush Thakur (Oct 12 2023 at 01:35):

Min-Hsien Weng said:

Amitayush Thakur said:

Thank you so much for taking the time to write detailed comments.

Thanks Amitayush Thakur for responding. I just had minor questions about the "failure dictionary" (please correct me if I overlooked anything in the paper :sweat_smile: ).
(1) In Figure 3, a failure dictionary takes a proof goal as input and returns a set of actions that can be avoided. I wonder if there is an existing dictionary that we can use, or if we need to build our own.
(2) I'm curious if we should keep track of the last step (previous tactics), in addition to the proof goal, in this dictionary. Sometimes, we may find that two tactics appearing together cause errors, regardless of the goals they are given. If we could collect this mutually exclusive information, it may be valuable and can help speed up the search for tactics.

(1) The failure dictionary has the lifetime of a single proof search; it is not a global dictionary. However, I do see a point in trying to keep a global dictionary. In our implementation, the dictionary gets built while proving a theorem and is then cleared once the theorem is proved. So we can only restrict the LLM from making repeated mistakes while solving a single problem.

(2) Do you mean the tactic before the previous one, or just the last tactic? We keep track of the last tactic and some history of the tactics, which we show in the prompt. We don't associate a pair of tactics (or a sequence of tactics) with bad actions for a given proof state. However, I agree with you that it might be helpful.
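A minimal sketch of what such a per-proof failure dictionary could look like (illustrative only, not the actual implementation):

```python
from collections import defaultdict

# Illustrative sketch, not the actual implementation: a failure dictionary keyed
# by the serialized proof state, holding tactics already known to fail there.
class FailureDict:
    def __init__(self):
        self._bad = defaultdict(set)

    def record_failure(self, proof_state: str, tactic: str) -> None:
        self._bad[proof_state].add(tactic)

    def is_known_failure(self, proof_state: str, tactic: str) -> bool:
        return tactic in self._bad[proof_state]

# Lifetime: one instance per theorem; it is discarded when that proof search ends.
# A global variant (as suggested above) would keep the instance across theorems,
# and could additionally key on (previous_tactic, goal) pairs.
```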

Amitayush Thakur (Oct 12 2023 at 02:00):

Jason Rute said:

The data leakage problem can be ignored to an extent given that GPT-4 alone could not prove all the theorems by itself. In fact, there is a 2x gain in the number of proofs when we use GPT-4 as a language agent vs GPT-4 alone.

This is an important point, but that doesn't mean there is no leakage, and that the leakage doesn't significantly help the model. Just that GPT-4 (which is a stochastic model) couldn't parrot character-for-character a proof.

3. Data leakage problems:
I think data leakage is a hard problem to solve for any in-context learning approach, and there is no way to prove that training had an impact or no impact. I also agree that it is important to discuss this in any work that uses a pre-trained LLM. We actually discuss this to an extent in our paper when we talk about the gains from few-shot GPT-4 to the GPT-4 agent and how it is not just fetching proofs from memory.

Another interesting observation worth putting in the paper. (Is this in an appendix? An example would be nice.) But again, it isn't really conclusive. It also depends on how the proofs overlap. Does the Copra proof still use the same key lemma as the original, or start with the same initial sequence of tactics, or does it look like a completely different direction of proof?

I just realized that we have some examples in our appendix from the miniF2F test set where the proof mentioned there and our initial proof step don't match. In fact, the whole proof is very different. However, I agree this should be highlighted, and we should make an explicit qualitative comparison with the proofs mentioned in the miniF2F test dataset. We can do the same thing for CompCert as well.
In the figure below, all the proofs we found are completely different (including the first steps) from the proofs mentioned in the miniF2F repo.
image.png

Amitayush Thakur (Oct 12 2023 at 02:09):

Jason Rute said:

It would be great to see, in this paper or a future one, a detailed ablation of which components are most useful.

4. Need for ablations:
We had to restrict to 60 steps not just because of the cost of GPT-4, but also because of OpenAI's per-month quota. We tried a few ablations, for example on why backtracking is essential during search. It turns out that the longer the proof, the more backtracking helps. I do agree with the need for more ablations; we can think about doing more extensive ablations on open LLMs in the future.

Amitayush Thakur (Oct 12 2023 at 02:22):

Jason Rute said:

On a practical level, while it is easy to laugh at models like ProverBot9001 as too simple since they are not using Transformers and assume they are obviously bad, it isn't clear to me that this is the case at all, and it could be that such models are much, much better than GPT-f (PACT), HTPS, ReProver, and the like. (And even more so since they can run on a CPU.) We just haven't compared the two approaches well enough. (For this reason, I am glad you did this comparison on CompCert even if I still have many questions about it.)

I think Proverbot9001 is a very interesting approach. We draw a couple of interesting ideas from their work, including maintaining a partial order on proof states, which avoids trying tactics that don't obviously simplify the goal. I think Proverbot's depth-limited search approach is also interesting in itself. Moreover, the CompCert dataset needs more domain-specific knowledge to prove theorems, as opposed to miniF2F. LLMs can benefit from training on MathWiki, Math Stack Exchange, and math books while solving miniF2F; however, CompCert is very domain-specific and may not have direct equivalents in the kind of math present in math books. In fact, that was one of our motivations to compare on CompCert: we wanted to see how well these approaches work for tasks from different domains. It will be interesting to study how automatic theorem proving in math compares with automatic theorem proving for software verification, and how much we can leverage the techniques developed in one domain in another. For example, we actually tried the concept of a partial order on proof states in both Lean and Coq, which helped in getting rid of some redundant tactics (ones that didn't simplify the goal).
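As a rough illustration of one way the "doesn't simplify the goal" check can be expressed (a simplification for intuition, not the exact partial order we use):

```python
# Illustration of one way to read "doesn't simplify the goal"; this is a
# simplification, not the exact partial order used in the paper.

def made_no_progress(old_goals, new_goals):
    """True when every old obligation is still present (possibly with extras),
    i.e. the new state is >= the old state, so the tactic can be discarded."""
    return set(old_goals).issubset(set(new_goals))

# Example: a tactic that leaves the goal unchanged (or merely adds obligations
# on top of it) is flagged; one that rewrites the goal into something new is not.
print(made_no_progress(["a + b = b + a"], ["a + b = b + a", "b + a = a + b"]))  # True
print(made_no_progress(["a + b = b + a"], ["a = a"]))                           # False
```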

Amitayush Thakur (Oct 12 2023 at 02:30):

Jason Rute said:

Amitayush Thakur Thank you for responding. As for time, I guess I was asking more about your model versus ProverBot9001 than versus ReProver. Since ProverBot9001 isn't a Transformer language model, it is significantly faster. Do you have any reason to think right now that, given the same amount of time, COPRA would prove more theorems than ProverBot9001? I can't find in your paper what you say the stopping criterion is for testing COPRA and ProverBot9001 on CompCert, but I think it must be a timeout since neither model gets to a nice round number of average inferences on failure.

Proverbot9001 doesn't use timeout-based search (and hence we don't compare on the basis of time with Proverbot9001). It searches based on depth-limited search without any backtracking (where it continues the search until it finds a node that doesn't lead to any correct tactic). The depth window keeps increasing by a couple of steps as long as it keeps finding some correct tactic; this is very different from other searches where the depth is fixed and pre-decided.
That said, I agree that we should have a plot with a time axis, and we can do that plot for ReProver vs COPRA.

(If your stopping criterion were just 60 inferences, then the average number of inferences on failure would be about 60, no?) So that suggests to me that you ran each model for the same amount of time and that COPRA rarely got up to the 60-inference limit, either because of a timeout or because it exhausted its search tree.

We also have an additional optimization of early stopping, which leads to the low number of inferences on failed proofs. When GPT-4 repeats incorrect actions matching some known action from the failure dictionary, we do preemptive backtracking to avoid trying the same tactics that we know will fail. This way we fail much more quickly.
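A simplified sketch of how the failure dictionary and preemptive backtracking can fit into a DFS-style loop (illustrative only, not the actual implementation; `propose`, `apply_tactic`, and `is_done` stand in for the LLM query, the ITP step, and the goal check):

```python
# Illustrative sketch, not the actual implementation. `propose` queries the LLM,
# `apply_tactic` runs a tactic in the ITP (returning the new state or None on
# error), and `is_done` checks whether all goals are closed.

def dfs(state, propose, apply_tactic, is_done, failed=None, depth=0, max_depth=20, width=4):
    if failed is None:
        failed = {}                       # proof_state -> set of tactics known to fail there
    if is_done(state):
        return []
    if depth >= max_depth:
        return None
    bad = failed.setdefault(state, set())
    for _ in range(width):                # at most `width` proposals per state
        tactic = propose(state, bad)
        if tactic in bad:
            return None                   # model repeats a known failure: backtrack preemptively
        new_state = apply_tactic(state, tactic)
        if new_state is None:             # the ITP rejected the tactic
            bad.add(tactic)
            continue
        rest = dfs(new_state, propose, apply_tactic, is_done, failed, depth + 1, max_depth, width)
        if rest is not None:
            return [tactic] + rest
        bad.add(tactic)                   # the whole subtree failed: remember it, try another tactic
    return None
```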

Amitayush Thakur (Oct 13 2023 at 16:20):

Just in case anyone would want to test it on their own benchmarks, we have our code here:
https://github.com/trishullab/copra
One needs an OpenAI account and access to GPT-4 to run it.

Alex Sanchez-Stern (Oct 17 2023 at 19:50):

Hey all, Alex Sanchez-Stern here, author of Proverbot9001. I wanted to respond to a few things on here.

Alex Sanchez-Stern (Oct 17 2023 at 19:50):

It searches based on depth-limited search without any backtracking (where it continues the search until it finds a node that doesn’t lead to any correct tactic).

This isn't accurate; Proverbot9001 does use backtracking. You might be thinking of an ASTactic or TacTok style search.

Alex Sanchez-Stern (Oct 17 2023 at 19:53):

The depth window keeps increasing by a couple of steps as long as it keeps finding some correct tactic; this is very different from other searches where the depth is fixed and pre-decided.

Again, not quite accurate. Proverbot9001 does have a mechanism for adding extra fuel, but it's not for simply producing tactics which don't error; it's only for "refunding" the cost of a solved subgoal, so that the depth limit is not for the whole proof considered as a linear series of tactics, but for the depth of the proof as a tree.
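To illustrate the refunding idea (a toy sketch only, not Proverbot9001's actual code): each tactic on the current path costs one unit of budget, but once a subgoal is fully discharged its cost is given back, so the budget bounds the depth of the proof tree rather than the total number of tactics in the proof.

```python
# Toy sketch of the fuel-refund idea only, not Proverbot9001's actual code.

def prove(goal, candidates, apply_tactic, fuel):
    """Return (proof, remaining_fuel); proof is None if the search fails."""
    if fuel == 0:
        return None, fuel
    for tactic in candidates(goal):
        subgoals = apply_tactic(goal, tactic)
        if subgoals is None:                  # tactic not applicable here
            continue
        remaining = fuel - 1                  # pay for this tactic
        steps, solved = [tactic], True
        for sub in subgoals:
            sub_proof, remaining = prove(sub, candidates, apply_tactic, remaining)
            if sub_proof is None:
                solved = False
                break
            steps += sub_proof                # a solved subgoal has refunded its own cost
        if solved:
            return steps, fuel                # refund this node's cost to the caller too
        # otherwise try the next candidate tactic with the original fuel
    return None, fuel
```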

Alex Sanchez-Stern (Oct 17 2023 at 20:04):

Amitayush Thakur said:

Jason Rute said:

As for it being worse than ProverBot9001, that seems to come down to speed. When measured by speed, ProverBot9001 does much better. When measured by inference steps, this paper does much better. It is an interesting trade-off when one solver makes better decisions and another works by brute force speed. For example, as language models get faster and better, it is possible this exact same approach, but with a different language model, could be 100x faster and then outcompete ProverBot9001. (But again, it is hard to trust that this finding would generalize to new projects that the language hasn't already seen because I imagine Compcert is in the GPT-4 training data.)

Speed correlates with the number of inferences needed because it is not just running inferences on the LLM that consumes time; running a wrong proof step on the ITP also consumes time. So even if per-step inference is slow, overall there is a big gain in wall-clock time due to multiple factors, including not having to waste time running wrong proof steps on the ITP.

We ran a comparison of wall-clock time with ReProver, and despite the per-inference speed being very slow for GPT (because of throttling and token limits), the average time taken to finish proofs is about 9x faster if we compare wall-clock times (even failing to prove is faster). This empirically establishes that there is a direct correlation between the number of inferences and actual wall-clock time. The number of inferences seems to be a more reasonable metric than wall-clock time because there can be differences in hardware, network issues, etc.
image.png

So while this analysis is good for ReProver, it doesn't hold for Proverbot9001, because the average time to predict and execute a tactic in Proverbot9001 is closer to 100-1000x faster than COPRA, without accounting for hardware differences. Once you account for the fact that the GPT-4 queries are running on OpenAI hardware, and Proverbot9001 makes search predictions on a consumer grade cpu, you can add another few orders of magnitude to that. So it's unlikely that any language model approach in our current paradigm will be able to approach Proverbot9001 in terms of prediction speed, but a big enough accuracy improvement could help it compete.

Alex Sanchez-Stern (Oct 17 2023 at 20:06):

This is why the 60 inferences limit is tricky, and should probably be discussed more in the next revision; in the time it takes COPRA to make 60 predictions, Proverbot9001 could have already searched 60,000 proof states.

Alex Sanchez-Stern (Oct 17 2023 at 20:25):

Utensil Song said:

But this is under the constraint of a cap of 60 inferences imposed only on COPRA (so it's unfair; sorry, I didn't read carefully enough). No figure is available in the paper for the pass rate with unlimited time/inference count, only that

as shown in the figure, COPRA proves many more theorems than PROVERBOT9001 if only 60 inferences are allowed.

The issue with this sort of comparison is that Proverbot9001's search is not built to be limited by the number of inferences (we have some experimental searches we're building which work better with this kind of constraint, but probably even a simple breadth-first search would do better on this metric than Proverbot9001's search). Since you can't directly measure pass@k-inferences on Proverbot9001, the authors used a proxy for it by taking existing results with no inference limit and counting how many inferences were actually used before getting to a solution or hitting the depth limit. This is a similar metric to pass@k-inferences, but not quite the same.

Alex Sanchez-Stern (Oct 17 2023 at 20:26):

Jason Rute said:

As for it being worse than ProverBot9001, that seems to come down to speed. When measured by speed, ProverBot9001 does much better. When measured by inference steps, this paper does much better. It is an interesting trade-off when one solver makes better decisions and another works by brute force speed. For example, as language models get faster and better, it is possible this exact same approach, but with a different language model, could be 100x faster and then outcompete ProverBot9001. (But again, it is hard to trust that this finding would generalize to new projects that the language hasn't already seen because I imagine Compcert is in the GPT-4 training data.)

Well, it doesn't technically just come down to speed. As far as we can tell, Proverbot9001 also is just able to prove more theorems than COPRA regardless of resource limits; it's possible that COPRA would solve some of these if given more resources, but probably not all.

Alex Sanchez-Stern (Oct 17 2023 at 20:32):

Hopefully the authors will be able to add a threats to validity section before publication so that readers can be aware of the subtleties here.

Jason Rute (Oct 17 2023 at 23:36):

@Alex Sanchez-Stern

Well, it doesn't technically just come down to speed. As far as we can tell, Proverbot9001 also is just able to prove more theorems than COPRA regardless of resource limits; it's possible that COPRA would solve some of these if given more resources, but probably not all.

This is where it gets tricky. I'm a bit confused, but COPRA has a limit of 60 inferences. I can't tell if they hit that limit, as I discussed above. If they are regularly hitting that limit, then it could be that if they increased it they would solve more theorems than ProverBot. Figure 6 suggests this as well. But this might take a lot more time (and money). I'd love @Amitayush Thakur's thoughts on this.

Resource limits are one of the many things that are challenging about this field. If I recall correctly, @Alex Sanchez-Stern, your ProverBot9001 paper also doesn't use time as the stopping criterion but some other limit. It's also not always clear which resource cost a user wants to optimize. For example, we often use web apps on our smartphones instead of some locally installed app which might require fewer resources overall. These web apps can take advantage of economies of scale like data centers. Also, I think the modern LLM approaches have a huge advantage over older approaches like ProverBot9001: they are much faster to develop, deploy, and modify even if they aren't as good. I think this is a big reason why they are taking over. It's just a lot easier to build an AI with an LLM (and while I feel COPRA is currently too big and bulky to be practical, another LLM more like LeanInfer could be just the right thing).

Overall, I wish there were more comparisons between models, even if messy, and I think the best comparison is just to give the models to the end user so they can say which ones (if any) they like.

Utensil Song (Oct 18 2023 at 03:01):

IMHO it's not about what resource cost a user wants to optimize, and taking advantage of economies of scale like data centers is not a fair comparison here.

If an algorithm consumes less entropy from the universe to retrieve a better or equivalent result, that is a strong indication that it has done something more correct or smart, and that it captures some insight about real intelligence. In applying this principle, the effort of curating data for training should also be counted. Advances in hardware and data availability let DL methods triumph in applications, but ultimately what we have learned from them needs to be distilled back into a more crystallized form, which would be more interpretable and economical.

I believe people working in the ITP/ATP field (or attracted to it) lean toward symbolism to an extent, and may see connectionism as a healthy and productive supplement but not as superior, whereas arguably connectionism takes the products of symbolism (e.g. actually writing the proofs, tactics, or more traditional ATP algorithms) for granted as "data"; the principle here would need to count them in and give them proper credit (as a significant contributing factor to the success of ML-based AIs). A truly "end-to-end" solution would need to include the effort and intelligence that went into generating these data.

In this view, better performance on the pass@k-inferences metric only starts to make sense once the definition of an inference is properly aligned with the granularity of the computational budget. As @Alex Sanchez-Stern has pointed out, "the average time to predict and execute a tactic in Proverbot9001 is closer to 100-1000x faster than COPRA, without accounting for hardware differences", which means Proverbot9001 uses significantly less computational budget while achieving better results per Table 1 in the COPRA paper, and thus has a huge advantage in terms of what we can learn from the approach.

Utensil Song (Oct 18 2023 at 03:16):

BTW, I'm curious whether there is an equivalent of Proverbot9001 (for Coq) to try for Lean. Efforts on this might be seen as obsolete in the era of LLMs, but in the view above, it can be good both for end users (in wall-clock time) and for research (to better identify the factors that confer an advantage).

Jason Rute (Oct 18 2023 at 03:26):

No, I don't think there is anything like Proverbot (or Tactician, ASTactic, or Magnushammer) in Lean, which is a bit sad since it makes comparisons with other state-of-the-art models hard.

Jason Rute (Oct 18 2023 at 03:27):

I think Tactician would be the easiest to replicate.

Jason Rute (Oct 18 2023 at 03:32):

But cross-platform approaches like COPRA are nice for comparison. If indeed ProverBot beats COPRA very strongly, but COPRA doesn't beat ReProver that strongly, maybe this suggests ProverBot is better than ReProver. (Of course, this really should be taken with a grain of salt.)

Amitayush Thakur (Dec 18 2023 at 19:52):

Hi everyone,

After good feedback from the community, we incorporated various changes in our approach and analysis. Please check out our latest version: Paper: https://arxiv.org/abs/2310.04353, Code: https://github.com/trishullab/copra
Some of these are listed below:

  1. Analysis of the correlation between pass@k-seconds and pass@k-guidance-steps: We show that pass@k-guidance correlates very well with the wall-clock time for finding proofs, using the metric pass@k-seconds. pass@k-seconds measures the number of proofs that an approach can find in less than k seconds. The plots for pass@k-guidance-steps and pass@k-seconds show a similar trend. (See A.1.4 for details.)
  2. Transferability of prompts to open models: Added code to test the effectiveness of our strategy on open LLMs like CodeLlama. Just like the GPTs, COPRA improves the ability of CodeLlama to prove theorems.
  3. An analysis of data leakage: If we set aside the straightforward simple cases, then about 92% of the proofs generated by COPRA are either different from the proofs mentioned in miniF2F or are for theorems that have no proof mentioned in the miniF2F dataset. Out of all proofs generated by COPRA, about 25.37% are for theorems that have no proofs mentioned in the miniF2F test dataset, as compared to 22.95% for ReProver. (Check sections A.1.5 and A.2 in our new version.)
  4. Added capabilities to allow the use of informal proofs, which are first generated one-shot using the GPT models.
  5. Added more than 5 different ablations to understand the role of retrieval and informal proofs better.

With our latest changes and capabilities, we can achieve 29.92% on miniF2F. We are planning to add more capabilities and experiments in the near future (including the possible support for Lean 4 and Isabelle in addition to Coq and Lean 3).

Alex Sanchez-Stern (Dec 18 2023 at 20:39):

Hmm, looks like the Proverbot9001 comparison didn't change much except changing the claim that "COPRA is faster than Proverbot9001" to "COPRA is faster than Proverbot9001 in terms of number of model inferences", is that right? Oh, and the COPRA CompCert results seem to have gotten much worse in this new version of the paper? In the old version COPRA proved 76/118 theorems with an average proof time of 140.86 seconds, whereas in the new version COPRA only proves 57/118 theorems, while taking more than twice as long (an average time of 338.04 seconds).

Amitayush Thakur (Dec 18 2023 at 21:20):

Doing a wall-clock comparison with Proverbot requires changes to Proverbot's code. We do plan to do it later. However, with the various experiments we have done, we wanted to get this version out first.

The older version was with GPT-4; the new version is with GPT-4-turbo. Hence, the numbers differ. However, we also changed the way we calculate the number of proofs done by COPRA, by excluding one-shot GPT results and always running the COPRA agent. The 76 result was for the ensemble of COPRA and one-shot GPT. Even now the ensemble is 67.

Amitayush Thakur (Dec 18 2023 at 21:22):

During the rebuttal period we got feedback to exclude one-shot results and report both the ensemble and COPRA separately (which we did for both miniF2F and CompCert).

Utensil Song (Dec 19 2023 at 01:24):

Excited to learn that COPRA added tests on open LLMs. From the paper, CodeLlama proves 0 theorems without COPRA and 14/244 with COPRA, but the full experimental setup, COPRA + One-Shot + Informal-One-Shot (LLM + retrieval + informal proof), hasn't been tested on CodeLlama, nor have experiments been done on other newer LLMs with better built-in Lean capabilities. Is there a reason for this?


Last updated: Dec 20 2023 at 11:08 UTC