Zulip Chat Archive

Stream: Is there code for X?

Topic: Code for GPT-F or other LLM used for ATP


Dongwei Jiang (May 12 2023 at 02:02):

Hi there, does anyone know how I can find the code/API for GPT-F? I want to replicate some of OpenAI's work and figure out some of the weaknesses of current ATP systems. I've checked Jesse Michael Han's repo and it seems like the API is no longer available (https://github.com/jesse-michael-han/lean-gptf/issues/12). I tried to email him and got no response...

Jason Rute (May 12 2023 at 09:56):

The API is turned off. However, there is a similar API from Meta AI based on the HyperTree Proof Search paper ( #general > api suggestion vscode integration); I believe that is still up. Also, the code to generate the data for this paper is still available from two repositories. You can find a lot of the repos listed in #Machine Learning for Theorem Proving > What is the AMI project's vision for AI in Lean?

Jason Rute (May 12 2023 at 09:57):

Also, cross-posting is discouraged.

Dongwei Jiang (May 12 2023 at 17:03):

Thank you for directing me towards the model developed by Meta. It certainly appears to be quite promising. However, what I am particularly interested in is an end-to-end solution, one that encompasses not only the model but also the search algorithm, such as a priority queue, that assembles the entire proof from individual proof steps. My interest in this is primarily to gain a more in-depth understanding of the current model's errors on benchmarks like miniF2F, and potentially to classify these errors using insightful heuristics, such as proof length, proof tree depth, the number of lemmas used, and so forth.

While I am aware that there are repositories containing the original data used for training the model, it seems that my current best course of action is to gather this data, reformat it into strings like "GOAL <GOAL> PROOFSTEP <PROOFSTEP>", and fine-tune an open-source model, such as LLAMA, on it, correct?

I also want to apologize for any confusion caused by my cross-posting. I assure you it won't happen again.

Jason Rute (May 13 2023 at 14:49):

@Dongwei Jiang I don't think there is an easy way to completely reproduce either paper. The easiest might be as you said to (1) get the data, (2) train your own language model, and (3) run a proof search with your model.

As for (1), I mentioned the repositories where the original data came from, but at the same time, I think there are better approaches now to getting the data, including Mario's AST extractor, and (I don't think it is released yet, so I don't want to share more, but) I know someone is making a special tool for this sort of thing.

As for (3), lean-gym might be a good choice. It isn't supported anymore, but I think it is also the environment that OpenAI used for later papers. If you use the code we used for the PACT paper (which I think is on Jesse's repo), then I think that also has some support for the FairSeq library, and if you know metaprogramming in Lean, it would be easy to use any similar LLM interface.

Dongwei Jiang (May 20 2023 at 15:31):

@Jason Rute Sorry for the delay in responding and thank you for your advice! I've been tied up working on the code. Here's what I've accomplished so far and my plans moving forward:

1 I've extracted data from the PACT paper, using guidelines from lean-step-public.
2 For the second step, I've decided to use LLAMA as my base model for fine-tuning. I've had some success with the alpaca-lora project, but there's more work needed for successful fine-tuning, like setting parameters and handling vocab.
3 For the third step, I've looked at lean-tpe-public. It seems to only contain beam search code (in Lean), and I didn't find a best-first search as there's no passing of score. I also checked lean-gptf, which seems to have something similar to best-first search. lean-gym didn't seem helpful in this regard. I'll still need to figure out how Lean works and how to tie all these steps together.

If you have any other suggestions or comments, I'd appreciate them! Once I've managed to recreate some of these papers' methods using these open-source tools, I'll document it to help anyone else who might want to work on it in the future.

Jason Rute (May 21 2023 at 00:05):

@Dongwei Jiang I'm a bit confused:

1 I've extracted data from the PACT paper, using guidelines from lean-step-public.

That is some of the data we used for PACT, but it doesn't have the most important part, the tactics! You need to get them from another source, like https://github.com/jasonrute/lean_proof_recording

3 For the third step, I've looked at lean-tpe-public. It seems to only contain beam search code (in Lean), and I didn't find a best-first search as there's no passing of score.

That has best-first search. It is implemented in https://github.com/jesse-michael-han/lean-tpe-public/blob/master/src/evaluation.lean. I don't know what you are saying about beam search. The fairseq API uses beam search decoding, while the OpenAI API uses sampling (with a temperature parameter) for decoding, but to my understanding that is the only place beam search is used.

I also checked lean-gptf, which seems to have something similar to best-first search.

That to my understanding doesn't have any search (or if it does, it was never fully implemented). It is just a tactic to make tactic suggestions.

lean-gym didn't seem helpful in this regard.

LeanGym is probably the easiest way to interact with Lean. You have to do all the search on the Python side, but you can easily implement your own best-first search algorithm (or iterated depth-first search, or MCTS, or whatever).
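For a rough idea of the shape of such a search, here is a minimal Python sketch (not the code from either paper; `suggest_tactics` and `run_tac` are placeholders for your language model and for the Lean Gym call):

```python
import heapq

def best_first_search(init_state, suggest_tactics, run_tac, max_expansions=512):
    """Best-first search over tactic states.

    suggest_tactics(state) -> list of (tactic, log_prob) pairs from the model.
    run_tac(state, tactic) -> (new_state, done) or None if the tactic fails.
    Both callables are placeholders for whatever model/Lean interface you use.
    """
    # Priority queue ordered by cumulative negative log-probability (lower = better).
    queue = [(0.0, 0, init_state, [])]   # (cost, tie-breaker, state, proof so far)
    counter = 1
    seen = {init_state}

    for _ in range(max_expansions):
        if not queue:
            break
        cost, _, state, proof = heapq.heappop(queue)
        for tactic, log_prob in suggest_tactics(state):
            result = run_tac(state, tactic)
            if result is None:            # tactic failed in Lean
                continue
            new_state, done = result
            if done:                      # no goals left: proof found
                return proof + [tactic]
            if new_state in seen:         # avoid revisiting identical goals
                continue
            seen.add(new_state)
            heapq.heappush(queue, (cost - log_prob, counter, new_state, proof + [tactic]))
            counter += 1
    return None                           # search budget exhausted
```

The priority queue orders partial proofs by cumulative negative log-probability, so the most promising branches are expanded first.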

Dongwei Jiang (May 21 2023 at 23:52):

@Jason Rute Thank you for your response! I have to admit, my understanding of these repositories is not as deep as I'd like it to be, so your help is really appreciated. You're right, the lean-step-public repository is mainly for classification tasks like proof step classification and seq2seq tasks, such as theorem name prediction and next-lemma prediction. But what I need is the tactics task, so I've started extracting data from your repository.

When it comes to the search part, I had missed the evaluation file you mentioned, probably because I'm not very familiar with Lean. I'll definitely dig into it more. And thanks again for explaining the other two repositories!

Dongwei Jiang (May 28 2023 at 16:06):

@Jason Rute Thank you for your previous response; it has been immensely helpful. I've since dived further into your repository, read the introduction to Lean and the first five chapters of Theorem Proving in Lean, and I've come across a few additional points that I'd appreciate your insights on.

With respect to tactics data generation, I've read your notebook and I have generated the 'cleaned_training_data'. However, I'm uncertain about the formatting of data for LLM fine-tuning. Should the data be presented in the "GOAL <TacticState> PROOFSTEP <Tactic>" format? Or, should I just keep the 'GOAL' as a condition (as it will be provided by Lean) and just fine-tune the 'Tactic' part?

In terms of the search, I'm still familiarizing myself with lean-tpe-public and it seems I need to implement something similar to the fairseq API considering my usage of a custom language model. However, I've noticed that the documentation surrounding the fairseq API is somewhat limited, and I'm curious if you could provide any additional insights to aid in my understanding of it (like how does the fairseq API work and what do I need to make the search work with custom language model).

Furthermore, I have a question about debugging. I'm curious to know if it's possible to debug this Lean project in a similar manner to how we typically handle debugging in a Python project? My preliminary research suggests that this might not be possible, but I thought it worth inquiring nonetheless.

Jason Rute (May 29 2023 at 21:06):

Dongwei Jiang said:

With respect to tactics data generation, I've read your notebook and I have generated the 'cleaned_training_data'. However, I'm uncertain about the formatting of data for LLM fine-tuning. Should the data be presented in the "GOAL <TacticState> PROOFSTEP <Tactic>" format? Or, should I just keep the 'GOAL' as a condition (as it will be provided by Lean) and just fine-tune the 'Tactic' part?

How you train your model is up to you. If you are using an encoder-decoder architecture, you would probably put the proof state as the input and the tactic as the target (generating a loss only for the tactic). If you were using a decoder-only architecture, as we did in the paper, then you would have to let the model know where the proof state ends and the tactic begins. (Also, you need to decide if you want to mask the loss on the input. I believe we didn't. IIRC, we trained on all the tokens.) That is the purpose of the GOAL and PROOFSTEP tags. The other purpose also was that we put in lots of other auxiliary tasks like predicting lemma names. To do this, we wanted the model to know which task it was working on, so we used other tags besides GOAL and PROOFSTEP to signify the type of data that was to follow. This is basically a form of prompt engineering, and you could do it many different ways.
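As a concrete sketch of the decoder-only case (assuming a Hugging Face-style tokenizer; the names and details here are illustrative, not our actual training code):

```python
def format_example(tactic_state: str, tactic: str) -> str:
    # Decoder-only setup: one flat string, with tags marking where the proof
    # state ends and the tactic begins (the role of GOAL/PROOFSTEP above).
    return f"GOAL {tactic_state} PROOFSTEP {tactic}"

def to_training_tensors(tokenizer, tactic_state: str, tactic: str, mask_prompt: bool = False):
    # mask_prompt=True computes the loss only on the tactic tokens;
    # mask_prompt=False trains on all tokens.
    prompt_ids = tokenizer.encode(f"GOAL {tactic_state} PROOFSTEP", add_special_tokens=False)
    target_ids = tokenizer.encode(f" {tactic}", add_special_tokens=False)
    input_ids = prompt_ids + target_ids
    labels = ([-100] * len(prompt_ids) if mask_prompt else list(prompt_ids)) + list(target_ids)
    return input_ids, labels
```

Here -100 is the usual ignore index for the cross-entropy loss. The same idea extends to the auxiliary tasks: just swap in the corresponding tags.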

In terms of the search, I'm still familiarizing myself with lean-tpe-public and it seems I need to implement something similar to the fairseq API considering my usage of a custom language model. However, I've noticed that the documentation surrounding the fairseq API is somewhat limited, and I'm curious if you could provide any additional insights to aid in my understanding of it (like how does the fairseq API work and what do I need to make the search work with custom language model).

The main API we used was actually the OpenAI API, which was really simple. We used curl to call the API and used Lean's JSON functionality to parse the response. You should just need to figure out the API for the language model you use. Having said that, see below about Lean Gym.
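For illustration only, the request amounts to something like the following (a Python sketch against the legacy OpenAI completions endpoint; the exact endpoint, model name, and fields depend on the provider and on how you serve your own model):

```python
import os
import requests

# Hypothetical example of a completions-style request; adjust endpoint and
# fields to whatever language model API you actually use.
resp = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "your-finetuned-model",   # placeholder model name
        "prompt": "GOAL x : ℂ ⊢ complex.tan (-x) = -complex.tan x PROOFSTEP",
        "max_tokens": 128,
        "n": 8,                            # several candidate tactics
        "temperature": 1.0,                # sampling, as described above
    },
    timeout=60,
)
candidates = [choice["text"] for choice in resp.json()["choices"]]
```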

Furthermore, I have a question about debugging. I'm curious to know if it's possible to debug this Lean project in a similar manner to how we typically handle debugging in a Python project? My preliminary research suggests that this might not be possible, but I thought it worth inquiring nonetheless.

The document you referenced was for Lean 4, but all the data and tools you are referencing are for Lean 3. Of course you can debug with print statements and such, but there isn't a nice debugger for Lean 3. But let's back up. I think if you really want to use lean-tpe-public, you will need to learn Lean 3 meta-programming, which is different from learning Lean 3 and may not be a good use of your time. I'd recommend either partnering with someone who knows Lean 3 meta-programming well, or instead using Lean Gym. Lean Gym basically replaces lean-tpe-public and is what OpenAI (I think) used for later Lean projects. It handles all the meta-programming for you and exposes everything you need through a simple REPL interface that you can drive from Python. You can do whatever tree search you want in Python, call whatever language model interface, and generally have much more control and less pain. TL;DR: use Lean Gym instead of lean-tpe-public.

Now, I know others are working on tools to make all this easier for both Lean 3 and Lean 4, especially Lean 4. When those tools come out this should be less painful for ML researchers like yourself.

Dongwei Jiang (Jun 01 2023 at 13:21):

@Jason Rute Thank you so much for the information!

1 The clarification regarding the training data is appreciated. I'm using LLAMA, so I will follow the same training data organization as in your paper.
2 The purpose of Lean Gym is much clearer now. With the help of run_tac and the state information provided by the REPL, implementing my own search algorithm appears much more feasible. I have already implemented a Python wrapper for the REPL API (roughly like the sketch after this list) and I'm planning to start with a best-first search.
3 Your advice on collaborating with a mathematician resonates with me. I believe such a partnership could be beneficial. I'll look into making connections in the "new members" thread.
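For reference, my wrapper is roughly the sketch below (the init_search format is the one the lean-gym repl accepts; the run_tac argument shape here is my assumption and should be checked against the lean-gym README):

```python
import json
import subprocess

class LeanGym:
    """Thin wrapper around the lean-gym REPL (sketch; run from the lean-gym repo root)."""

    def __init__(self):
        self.proc = subprocess.Popen(
            ["lean", "--run", "src/repl.lean"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        )

    def _send(self, cmd):
        # Each command is a JSON array on one line; each response is one JSON line.
        self.proc.stdin.write(json.dumps(cmd) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def init_search(self, decl_name, namespaces=""):
        # e.g. ["init_search", ["complex.tan_neg", ""]]
        return self._send(["init_search", [decl_name, namespaces]])

    def run_tac(self, search_id, tactic_state_id, tactic):
        # e.g. ["run_tac", ["0", "0", "intro x"]]  (argument order assumed)
        return self._send(["run_tac", [search_id, tactic_state_id, tactic]])
```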

Thanks again for your suggestions.

Dongwei Jiang (Jun 10 2023 at 12:47):

Hi @Jason Rute ,

I hope you're doing well. First, let me express my gratitude for your invaluable advice so far. Following your guidance, I've proceeded with data preparation, model training, and search. However, I'm observing that the model is underperforming, failing to prove anything in miniF2F.

Given this context, I have a couple of questions about the training data derived from your repository, which I've organized as per the PACT paper ("GOAL " + line from train.src + " PROOFSTEP " + line from train.tgt).

1 I've noticed certain keys appearing multiple times in train.src, associated with different train.tgt entries. For instance, the first two lines in train.src are the same, but the first two lines in train.tgt are different:

train.src

α : Type u, b : buffer α, i : ℕ, h : i < b.size, v : α ⊢ b.write ⟨i, h⟩ v = b.write' i v
α : Type u, b : buffer α, i : ℕ, h : i < b.size, v : α ⊢ b.write ⟨i, h⟩ v = b.write' i v
α : Type u, i : ℕ, v : α, b_fst : ℕ, b_snd : array b_fst α, h : i < buffer.size ⟨b_fst, b_snd⟩ ⊢ buffer.write ⟨b_fst, b_snd⟩ ⟨i, h⟩ v = buffer.write' ⟨b_fst, b_snd⟩ i v
α : Type u, i : ℕ, v : α, b_fst : ℕ, b_snd : array b_fst α, h : i < buffer.size ⟨b_fst, b_snd⟩ ⊢ ⟨b_fst, b_snd⟩.write ⟨i, h⟩ v = ⟨b_fst, b_snd⟩.write' i v
α : Type u, b : buffer α, i : ℕ, h : i < b.size, v : α ⊢ b.write ⟨i, h⟩ v = b.write' i v

train.tgt

cases b; unfold write write'; simp [array.write_eq_write']
cases b; unfold write write'
unfold write write'
simp [array.write_eq_write']
cases b

Upon inspection, it seems that one of them acts as the "master tactic", while the others function as "sub-tactics". Although this is understandable given the way the data was traced, I'm concerned it might be causing ambiguity during training. Given our step-by-step proving approach, would it be advisable to remove the "master tactics" from training (by deleting the longest entry with the same key)?

2 I've attempted to prove some theorems in Lean Gym using training data. However, I find that the intermediate tactics state from a theorem in Lean Gym bears similarities to the traced tactic state, albeit with subtle differences. Take the theorem "complex.tan_neg" as an example:

lean-gym output:

lean --run src/repl.lean
["init_search", ["complex.tan_neg", ""]]
{"error":null,"proof_steps":[],"search_id":"0","tactic_state":"⊢ ∀ (x : ℂ), complex.tan (-x) = -complex.tan x","tactic_state_id":"0"}

train.src corresponding data:

x : ℂ ⊢ complex.tan (-x) = -complex.tan x

train.tgt corresponding data:

simp [tan, neg_div]

the lemma in original mathlib:

@[simp] lemma tan_neg : tan (-x) = -tan x := by simp [tan, neg_div]

I had assumed that the "tactic_state" from Lean Gym would be exactly the same as the entries in train.src. But that does not seem to be the case. Is there any more processing I need to do during the search?

Lastly, I wanted to share that I have connected with another Lean enthusiast who also has a keen interest in ATP. This connection was facilitated through a professor I'm collaborating with. Your recommendation has indeed been helpful!

Thank you for your patience and support as I navigate this learning curve.

Jason Rute (Jun 10 2023 at 18:26):

I'll get back to you later when I have more time (maybe tomorrow or Monday), but screenshots are hard to follow. It is better to use markdown #backticks.

Dongwei Jiang (Jun 10 2023 at 23:48):

Jason Rute said:

I'll get back to you later when I have more time (maybe tomorrow or Monday), but screenshots are hard to follow. It is better to use markdown #backticks.

Thanks for the info! I've replaced the screenshot with markdown.

Jason Rute (Jun 11 2023 at 01:39):

@Dongwei Jiang I’m very interested to hear about your reproduction. I personally wasn’t involved in the model training side of the PACT paper so I’m especially curious to see how it goes.

As for the duplication, that is due to compound tactics, e.g. tactics with semicolons. We left them in there for the PACT paper and it let the model produce proofs with semicolon tactics for example. I don’t know what is best, but that is what we did. We did however (if I recall correctly) remove any exact duplicates where the proofstate-tactic pair were the exact same. As for the goal mismatch, what you are seeing is due to the following: when you write a theorem in Lean, you often use the pattern:

theorem foo (x : nat) : x = x := 

But Lean stores the theorem as

forall (x : nat), x = x

This does lead to a discrepancy. Since in the former the first goal is

x : nat
|- x = x

But in the latter the first goal is

|- forall (x : nat), x = x

To get to the first goal, you need the tactic intro x. After that the proof should be the same. I don’t think this was a big deal in practice (as there still was a lot of data using intro x), but I could be mistaken.

Nonetheless you do have to make sure the goals are processed the same way in the data and in Lean Gym. After the intro x in my example you should expect the goals to be exactly the same, including little things like making sure there are tabs separating the hypotheses instead of new lines.
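For example, applying the same whitespace normalization to both the train.src entries and the Lean Gym tactic_state strings before comparing them (a sketch only; the right convention depends on how you extracted your data) can make the difference:

```python
import re

def normalize_goal(tactic_state: str) -> str:
    # Make hypothesis separators and spacing consistent before comparing the
    # lean-gym "tactic_state" string with entries from train.src. Here every
    # run of whitespace (newlines, tabs, spaces) is collapsed to a single
    # space; the important point is to apply the *same* function to both sides.
    return re.sub(r"\s+", " ", tactic_state).strip()
```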

It would also be a good idea to check you can prove theorems found in the training data.

Dongwei Jiang (Jun 12 2023 at 13:39):

@Jason Rute Thanks for your detailed explanations on duplication and goal mismatch issues. They were very helpful!

I'm currently fine-tuning LLAMA on the Lean data. The initial results aren't stellar, perhaps due to the domain difference between the original LLAMA training data and Lean states/tactics. It's also possible that the LoRA/PEFT setup I've used might need adjustments.

I'll be sure to share my training setup and results when there's substantial progress. Thank you for your continued support!

Dongwei Jiang (Jul 20 2023 at 04:23):

Hi @Jason Rute Just circling back with a few more updates on my LLAMA fine-tuning. Using the standard Alpaca repo, I've successfully fine-tuned the 7B LLAMA model with 4 A100 80GB GPUs on the data from your repo. During training, LLAMA demonstrates solid performance on the validation set.

However, when testing on miniF2F, the accuracy is only around 11% with argmax for Pass at 1. The generated proofs are mostly simple one-liners like norm_num, ring, or linarith. I think the result underscores the need for adding more contextual information to tactic generation - a concept similar to premise selection in ReProver or PACT, which I plan to explore further.

Simultaneously, I've been dabbling in formalizing natural language reasoning problems to Lean, with a focus on using ATP for problem-solving. This approach, I believe, could help reduce hallucination in LLM during reasoning and leverage the power of ATP models for proof generation on natural language problems. So this might slow my progress on conventional theorem proving.

Jason Rute (Jul 20 2023 at 08:06):

I wonder if this thread should get moved to #Machine Learning for Theorem Proving . I think people there would find your progress interesting.

Jason Rute (Jul 20 2023 at 08:16):

@Dongwei Jiang

the accuracy is only around 11% with argmax for Pass at 1

To be clear, what does this mean, especially the argmax part? Are you using a proof search like best first search, or just taking the top-1 tactic at each step until failure?

Dongwei Jiang (Jul 20 2023 at 10:04):

@Jason Rute Yes, you're correct. I'm currently utilizing a strategy of taking the top-1 tactic at each step until failure. While a best-first search approach might offer improved results, I guess its impact could be marginal at this stage, as the model seems primarily to be matching tactic states that follow simpler proofs like norm_num, ring, or linarith, rather than constructing robust chains of mathematical reasoning.

Regarding my work on formalizing natural language reasoning tasks, I'd be delighted to collaborate with anyone interested. So I suppose I'll post something about it in the machine learning channel.

Marcello Seri (Sep 04 2023 at 07:10):

I just stumbled on https://leandojo.org/ which seems to be quite new for lean

Johan Commelin (Sep 04 2023 at 07:11):

See also https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/Releasing.20LeanDojo


Last updated: Dec 20 2023 at 11:08 UTC