Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: PACT Paper
Joe Palermo (Apr 23 2021 at 19:54):
Hi, I'm reviewing the PACT paper and I'd love to clarify a few questions I have. @Jason Rute @Jesse Michael Han
1 - For the diagram in Figure 1, how are the expressions in the boxes generated? Also I understand the tactic proof and I can see how the proof term it generates corresponds to it and proves the theorem, but what exactly are the arrows in the syntax tree pointing at?
2 - Section 3.2 describes a recursion into the structure of each proof term. Are the sub-terms simply exprs within the expr of the whole proof term? i.e. this would imply that it is a recursion into an expr-tree as defined here?
Jesse Michael Han (Apr 23 2021 at 20:55):
re: 2, it's a recursion through the proof-term expr, but it's not quite expr.mfold --- we skip lambda-abstractions and instead accumulate them in a list of binders for creating synthetic tactic states later on. In the lean-step codebase the core function is expr.mfold_with_binders.
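For intuition, here is a minimal sketch (Lean 3 meta code, not the actual lean-step implementation; the name and the elided datapoint-emission step are my own) of a recursion that walks a proof term and accumulates the binders of the lambdas it passes through, instead of folding over them with expr.mfold:
meta def walk_with_binders : list expr → expr → tactic unit
| bs (expr.lam n bi ty body) := do
  -- replace the bound variable with a fresh local constant and remember it
  x ← tactic.mk_local' n bi ty,
  walk_with_binders (bs ++ [x]) (body.instantiate_var x)
| bs (expr.app f a) := do
  -- here one would emit a datapoint built from the binders `bs` and this subterm
  walk_with_binders bs f,
  walk_with_binders bs a
| bs _ := pure ()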
Jason Rute (Apr 24 2021 at 12:25):
re 1: I agree it is a confusing picture. Each edge represents a context + goal and the child tree of that edge is the term which completes that goal. So let's work from the top. Here is the tactic proof:
lemma sub_ne_zero_of_ne (a b : int) (h : a ≠ b) : a - b ≠ 0 :=
begin
  intro hab,
  apply h,
  apply int.eq_of_sub_eq_zero hab,
  tactic.trace_result
end
I added tactic.trace_result at the end so we can see the proof being generated. It is:
id (λ (hab : a - b = 0), h (int.eq_of_sub_eq_zero hab))
Ignore the extra id, so we get:
λ (hab : a - b = 0), h (int.eq_of_sub_eq_zero hab)
Now, like in any mathematical formalism, this expression is a tree. That is the tree drawn in Figure 1. Specifically, the constructors are lam for lambda abstraction and app for function application, so we can write this expression as something like:
( lam (hab : a - b = 0) ( app h ( app int.eq_of_sub_eq_zero hab ) ) )
Now, for every subexpression (or at least the ones in the body of the lambda and on both sides of the application), we have a context and a goal. The context consists of the variables bound by the outer lambda binders, and the goal is the type of the subexpression. These form the edges of the tree.
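To make this concrete for the proof term above, here are the (context, goal) pairs at the two application nodes inside the lambda body, written as standalone examples whose proofs are exactly the corresponding subterms (my own illustration, assuming the same mathlib context as the lemma; it is not output of the pipeline):
-- subterm ( app h ( app int.eq_of_sub_eq_zero hab ) ):
-- context [a b : int, h : a ≠ b, hab : a - b = 0], goal false
example (a b : int) (h : a ≠ b) (hab : a - b = 0) : false :=
h (int.eq_of_sub_eq_zero hab)
-- subterm ( app int.eq_of_sub_eq_zero hab ): same context, goal a = b
example (a b : int) (h : a ≠ b) (hab : a - b = 0) : a = b :=
int.eq_of_sub_eq_zero hab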
Jason Rute (Apr 24 2021 at 12:25):
Does that make sense?
Joe Palermo (Apr 24 2021 at 14:05):
@Jesse Michael Han Thanks - I'm going to dig into the code
@Jason Rute Ahh yes that makes sense. Thank you for the clarification.
Joe Palermo (Apr 27 2021 at 19:54):
@Jesse Michael Han Is it fair to say that the PACT data pipeline as described in Section 3.2 is best described by lean_step_main_core_aux? I've been reading the code and I noticed that mfold_with_binders is used only in gather_used_premises, as a step before starting lean_step_main_core.
Jesse Michael Han (Apr 27 2021 at 21:43):
oops yeah, mfold_with_binders was the first implementation of the idea, but lean_step_main_core_aux is the more general version that gathers all the data I needed at once
Joe Palermo (Apr 28 2021 at 17:57):
The paper mentions the use of the fairseq API:
"The fairseq backend supports querying a locally hosted Transformer via the Fairseq CLI (Ott et al., 2019), returning a list of candidates found by beam search, ranked by cumulative log-probabilities."
I'm assuming you trained some models using the fairseq API as a quick baseline to compare against OpenAI's models. I'm assessing different libraries for training LMs, and I'm getting mixed signals about fairseq. I'm curious if you had problems with it or if you would recommend it. Thanks!
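(Side note on the quoted sentence, just to spell out the ranking criterion as I read it: the cumulative log-probability of a candidate $y$ given the prompt $x$ is the sum of its per-token log-probabilities,
$$\log p_\theta(y \mid x) = \sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{<t}),$$
so beam search returns the candidates with the highest such sums.)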
Jason Rute (Apr 29 2021 at 11:52):
@Yuhuai Tony Wu would know best about FairSeq.
Jason Rute (Apr 29 2021 at 11:59):
If your question is more generally which LM library to use for generating Lean term proofs, I generally don’t know enough from the practical side to be super helpful, but one concern I think you should consider is whether to start with a pretrained model. I think it is fairly clear now that pretrained models, especially if they contain math, but even if they don’t, are good starting points. This will of course limit the options of models and libraries you can use and limit your ability to do domain specific encoding. Also, if you work with a model trained on the Pile or another dataset containing GitHub, you should double check if any Lean repos are in that dataset.
Joe Palermo (Apr 29 2021 at 13:11):
Thanks, definitely planning to use a pre-trained model. Was planning to pre-train on LIME (https://arxiv.org/abs/2101.06223) also.
Brando Miranda (May 20 2021 at 17:52):
hi @Jason Rute @Jesse Michael Han! I was also reading the paper and saw a term used without any reference I could follow to figure out what it meant. In particular, it said:
...it is possible to augment a seed dataset of human proofs with new successful trajectories using reinforcement learning or expert iteration.
What is expert iteration? Is there a paper or reference to learn about it? Thanks for your time in advance
Jesse Michael Han (May 20 2021 at 17:54):
expert iteration is described in this paper: https://arxiv.org/abs/1705.08439
Brando Miranda (May 20 2021 at 18:55):
I'm curious, is expert iteration not very popular? (or does it not work very well?) I've never heard of it and had not seen it mentioned by name before PACT. Does someone know? (Thanks for the reference!)
Yuhuai Tony Wu (May 20 2021 at 19:18):
Expert iteration can be understood as a very general concept, also known as iterated amplification and distillation. Essentially the idea is that you start with a weak agent, apply some amplifier (usually a search method) to obtain better data ("expert trajectories"), and then distill that data back into the agent by supervised learning. AlphaGo/AlphaZero also rely on the same idea.
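A purely schematic way to write down that loop (Model, Data, amplify, and distill are hypothetical placeholders just to pin down the structure, not anything from the PACT code):
constants Model Data : Type
-- the amplifier: e.g. a proof search guided by the current model,
-- producing expert trajectories
constant amplify : Model → Data
-- distillation: supervised learning of the model on the amplified data
constant distill : Model → Data → Model
-- run `n` rounds of amplification followed by distillation
def expert_iteration : ℕ → Model → Model
| 0       m := m
| (n + 1) m := expert_iteration n (distill m (amplify m))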
Brando Miranda (Aug 09 2021 at 22:01):
Hi PACT authors! I was curious about the details of the optimizer you used. What type of optimizer did you use (+ scheduler) and what hyperparameter search did you do to find them (if any)?
Aidan Swope (Aug 09 2021 at 23:04):
The optimization strategy is described on page 6 of the PACT paper
(I am not an author of the work)
Jason Rute (Aug 09 2021 at 23:08):
If you still have questions after looking at the paper, I think @Stanislas Polu would know best.
Yichen Xu (Mar 30 2022 at 10:30):
Hi @Jason Rute @Jesse Michael Han! I was curious about the details of co-training when reading the paper. When training the model on the tactic, mix1, and mix2 datasets, does the language modeling loss of the prompt part (for instance, the LM loss computed on the GOAL <TacticState> PROOFSTEP part of proof step samples) get masked? Or is the model optimized with the loss calculated on the whole sample?
Stanislas Polu (Mar 30 2022 at 10:36):
We used the whole sample, no masking at training :+1:
Jason Rute (Mar 30 2022 at 10:40):
Is it also the same for the validation loss we report in the charts? Whole string, no masking? @Stanislas Polu
Stanislas Polu (Mar 30 2022 at 10:43):
Validation loss is masked :+1:
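(In symbols, just to spell out the convention described above rather than quoting the paper: writing $x_1,\dots,x_m$ for the prompt tokens up to and including PROOFSTEP and $y_1,\dots,y_n$ for the completion, training minimizes the loss over the whole sample,
$$\mathcal{L}_{\text{train}} = -\sum_{t=1}^{m} \log p_\theta(x_t \mid x_{<t}) - \sum_{t=1}^{n} \log p_\theta(y_t \mid x, y_{<t}),$$
while the reported validation loss masks the prompt and keeps only the second sum.)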
Yichen Xu (Mar 30 2022 at 10:59):
Thanks a lot!
Yichen Xu (Mar 30 2022 at 17:40):
Hi @Stanislas Polu @Jason Rute! Thanks again for the clarification. I have another small question about the tactic dataset: I followed the instructions from lean_proof_recording and ran the full pipeline to extract the dataset. While Section B.2 of the paper says that the dataset has around 128K examples, I ended up with over 200K examples. I am wondering whether a difference in the runtime environment is causing this. Or does PACT clean/filter the generated samples before training the model on them?
Arthur Paulino (Mar 30 2022 at 17:52):
mathlib has grown since then. That's probably why
Yichen Xu (Mar 30 2022 at 17:54):
And it seems that the local context classification task is missing from the lean-step-public repo. Is it because the released extraction script for mix1 and mix2 is not that up-to-date? :D
EDIT: I realised that the Proof step classification task is actually the local context classification task. Sorry for the noise!
Yichen Xu (Mar 30 2022 at 17:54):
Arthur Paulino said:
mathlib has grown since then. That's probably why
Ah yes! That looks reasonable.
Yichen Xu (Mar 30 2022 at 18:00):
So the dataset size nearly doubled in one year, which may result in better model performance even without any changes to the algorithms! It is exciting to see that the machine learning models, in a sense, grow together with the humans writing mathlib. :D
Yichen Xu (Apr 20 2022 at 02:15):
Hey @Stanislas Polu! Recently I have been trying to reproduce the results of PACT with a GPT model trained by our lab. I am curious about the technical details of the decoding strategy: is PACT using nucleus sampling? And how does temperature affect performance?
Yichen Xu (Apr 20 2022 at 02:26):
And we found that our model is really vulnerable to overfitting on mix1 and mix2. Did this also happen in your settings?
Stanislas Polu (Apr 20 2022 at 09:56):
@Yichen Xu we generally simply sample at T=1.0, which is rather high, but we found that the diversity of samples helps performance overall.
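(For reference, sampling at temperature $T$ means drawing each token from the rescaled distribution
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$
where the $z_i$ are the model's logits; $T = 1.0$ leaves the model's distribution unchanged, while lower values of $T$ concentrate mass on the highest-scoring tokens.)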
Yichen Xu (Apr 20 2022 at 13:16):
Tons of thanks! :^)
venom Yu (Feb 16 2023 at 14:23):
@Jason Rute Hi, I am trying to reproduce some automated theorem proving work and I noticed that it was all pretrained on the WebMath dataset. Has this dataset been released?
Jason Rute (Feb 16 2023 at 16:29):
@venom Yu No. The closest public dataset I know is the Pile, especially the arXiv, GitHub, and stack exchange parts of the Pile.
Yuhuai Tony Wu (Feb 16 2023 at 18:14):
https://huggingface.co/datasets/hoskinson-center/proof-pile from @Zhangir Azerbayev is also quite a valuable asset
Zhangir Azerbayev (Feb 16 2023 at 18:37):
Yuhuai Tony Wu said:
https://huggingface.co/datasets/hoskinson-center/proof-pile from Zhangir Azerbayev is also quite a valuable asset
Thanks for the mention Tony. I've also trained 1.3B and 6.7B parameter models on this dataset. Here are the huggingface links: 1.3B, 6.7B
Jason Rute (Feb 16 2023 at 22:01):
Nice!
venom Yu (Feb 17 2023 at 06:41):
Thanks a lot!
Last updated: Dec 20 2023 at 11:08 UTC