Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: ProofGPT's performance on theorem proving


Shun Zhang (May 25 2023 at 23:02):

Hello everyone,

I'm new to this forum. I have been working on code generation and am now interested in theorem proving (particularly in Lean). I'm looking for a backbone transformer to use or fine-tune.

I was wondering if ProofGPT has been evaluated on proving Lean theorems. The authors of the ProofNet paper mentioned that they "leave an investigation of formal theorem proving and proof autoformalization to future work." But ProofGPT seems to be the only open-source model that can be used for theorem proving?

So I have the following questions.

  • Is ProofGPT the most appropriate open-source model for Lean theorem proving? I want to play with a model or possibly fine-tune it, so I do not want to use GPT-3.5/GPT-4 at this stage.
  • Would miniF2F (or another dataset?) be a suitable evaluation set? That is, would it be too difficult or too easy? I'd like to run some evaluation experiments, but I just want to make sure they're meaningful.

Thanks!

Dongwei Jiang (May 26 2023 at 12:57):

Hi Shun, like you, I'm also new to this forum and to the task of theorem proving. It seems we're encountering some similar issues. I'd like to share my current understanding and experience, but please bear in mind that I'm still learning. Corrections or additional insights from more seasoned members would be very welcome.

  • Is ProofGPT the most appropriate open-source model for Lean theorem proving? I want to play with a model or possibly fine-tune it, so I do not want to use GPT-3.5/GPT-4 at this stage.

Although I'm not extensively familiar with ProofGPT, I've been working on replicating the results from the PACT paper using LLaMA as the backbone. If you're interested, you might find additional details here.

  • Would miniF2F (or another dataset?) be a suitable evaluation set? That is, would it be too difficult or too easy? I'd like to run some evaluation experiments, but I just want to make sure they're meaningful.

I have the same question. The current SOTA result on miniF2F is not high (<50% for Pass@64). But after reviewing some of the human-written proofs for miniF2F, I found a wide range of difficulty levels. Some proofs are surprisingly straightforward, involving just one line of tactics or a simple calculation rather than a complex argument (more examples can be found here and here). Conversely, others are quite long or intricate, which raises the chance for an LLM to make a mistake. I'm interested in understanding which problems current large language models struggle to answer correctly.
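
As an aside, Pass@k here means sampling k candidate proofs and counting the problem as solved if any of them verifies; the unbiased estimator from the Codex paper lets you estimate this from n ≥ k samples. A minimal Python sketch (the function name is mine):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k proofs, drawn without replacement from n
    sampled proofs of which c verified, succeeds."""
    if n - c < k:
        # Fewer than k failures in total, so every k-subset contains a success.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 64 samples per problem, 3 of which verified:
print(pass_at_k(n=64, c=3, k=64))  # 1.0, since c > 0 and k = n
print(pass_at_k(n=64, c=3, k=8))   # estimated Pass@8 from the same samples
```

With n = k = 64 this reduces to asking whether any of the 64 sampled proofs checks, which is how the Pass@64 numbers above read.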

On a separate note, I believe I've read your impactful paper on program synthesis that uses pass rates on test cases as a heuristic for forward-checking with MCTS. If you're open to discussing it, I'd be curious to understand why you shifted your interest from program synthesis to theorem proving. I'm still in the process of understanding the role theorem proving can play in enhancing LLMs' general abilities (reasoning and planning).

Shun Zhang (May 27 2023 at 00:10):

Hi Dongwei, thanks for your reply. It's very helpful!
I'm actually considering theorem proving as a new test domain, since it appears to be more challenging than code generation. We may have to come up with new algorithms that reason better in order to solve theorem-proving problems.

Zhangir Azerbayev (May 27 2023 at 00:47):

Hi Shun, glad to hear you're interested in the proofGPT model.

As far as I'm aware, proofGPT is the only open-source model fine-tuned on math, so it is a natural candidate for work on theorem proving. I'd expect proofGPT to be better than the pre-trained models used in the OpenAI and Meta papers, since proofGPT has a larger parameter count and more pre-training.
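
For anyone who wants to try it, loading it with Hugging Face transformers looks roughly like the sketch below; substitute the exact checkpoint ID from the release if it differs, and the prompt and sampling settings are just illustrations.

```python
# Minimal sketch: sample a proof continuation from proofGPT.
# The checkpoint ID is an assumption; verify it against the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hoskinson-center/proofGPT-v0.1-6.7B"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative Lean theorem statement as the prompt.
prompt = "theorem add_comm' (a b : ℕ) : a + b = b + a :="
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```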

miniF2F is a challenging benchmark, and I can't think of an obvious way in which the test data could've contaminated the proofGPT training set.

Shun Zhang (May 27 2023 at 07:08):

@Zhangir Azerbayev Thanks for confirming that I'm on the right track! It's also good to know that miniF2F is unlikely to be in ProofGPT's training set.

Jason Rute (May 27 2023 at 10:42):

@Zhangir Azerbayev is ProofGPT's training data, both the general pre-training and the math training, publicly available? So that one could just grep for the names of the miniF2F problems?

Jason Rute (May 27 2023 at 12:43):

Note that most of the data leakage in the PACT paper came from the general-purpose pre-training (Common Crawl), not from the math-specific pre-training (GitHub, arXiv, Stack Exchange). And of course the miniF2F problems are in many cases standard competition problems, so their natural-language versions are very possibly in the training data.

Zhangir Azerbayev (May 27 2023 at 16:10):

@Jason Rute The general pre-training dataset is the Pile, and the math-specific set is available here. I agree the natural-language versions are likely in the Pile, although the formal versions are almost certainly not (another example of data decontamination being an ill-posed problem).
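
For concreteness, the grep Jason describes could look roughly like the sketch below: stream the math-specific set and scan for miniF2F theorem names. The dataset ID and the "text" field are assumptions based on the links above, and the names are illustrative (take the full list from the miniF2F repository); scanning the Pile the same way is mechanically identical, just much slower.

```python
# Rough contamination check: stream the math-specific pre-training set
# and search each document for miniF2F theorem names.
# Dataset ID and the "text" field are assumptions; the names below are
# illustrative -- take the full list from the miniF2F repository.
from datasets import load_dataset

minif2f_names = ["mathd_algebra_478", "amc12a_2019_p21"]

ds = load_dataset("hoskinson-center/proof-pile", split="train", streaming=True)
for example in ds:
    text = example.get("text", "")
    hits = [name for name in minif2f_names if name in text]
    if hits:
        print("possible contamination:", hits)
```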


Last updated: Dec 20 2023 at 11:08 UTC