Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: MATH-AI workshop at NeurIPS

Rémy Degenne (Oct 27 2022 at 10:02):

There will be a workshop about AI and math at NeurIPS on December 3 in New Orleans: https://mathai2022.github.io/
It looks similar to the MATH-AI workshop that took place at ICLR. There is no list of papers on the website yet.

Tom Chen (Nov 23 2022 at 23:13):

By the way looks like they have a list now: https://mathai2022.github.io/papers/

Jason Rute (Oct 31 2023 at 13:28):

It looks like the papers for MATH-AI 2023 recently dropped on OpenReview. The official website hasn’t been updated yet.

Jason Rute (Oct 31 2023 at 13:29):

(I’m stealing the thread from the 2022 workshop, but this year it is still part of NeurIPS.)

fzyzcjy (Oct 31 2023 at 23:11):

I have checked it a few days ago as well, but it seems that there is still no pdf yet (only abstract)
image.png

Jason Rute (Oct 31 2023 at 23:19):

Yeah, I think I jumped the gun. I should have waited until it was on the website including pdfs.

fzyzcjy (Oct 31 2023 at 23:22):

Anyway looking forward to it!

Michal Novomestsky (Nov 02 2023 at 11:38):

Hi all,

Has anyone here attended a similar NeurIPS workshop in the past? If so, what format are they hosted in? Are they a series of short lectures or more of an open discussion/networking session, or something else entirely? I'd love to attend this year's, but as an undergrad in Australia I'd need to fund my own way there (not cheap for a student :sob:).

If anyone has any experience from previous workshops/NeurIPS events, I'd greatly appreciate any advice or comments on what it was like - I heard that some NeurIPS events are streamed online as well, but if this workshop isn't (which seems to be the case as far as I can tell) or if people suggest it's more worthwhile to come in person, then I'll see if I can't find an academic at my uni who'll be able to reimburse the travel costs on the uni's behalf for me.

Thanks!

Rémy Degenne (Nov 02 2023 at 11:56):

I have attended and organized such workshops, and the format depends a lot on what the organizers want, but most often the program contains talks by invited speakers, talks by some of the authors who contributed papers, and poster sessions. See for example the program of the same workshop last year: https://mathai2022.github.io/schedule/ .

Kaiyu Yang (Nov 02 2023 at 14:04):

I haven't checked with NeurIPS. From past experience, the talks will be live-streamed to virtual NeurIPS attendees during the conference, and the recordings will be available later to the public. The poster sessions will be entirely in-person.

Michal Novomestsky (Nov 03 2023 at 12:09):

Hey Remy and Kaiyu,

Thanks for the details! The fact that it's livestreamed is a huge help :smile:

Jason Rute (Nov 24 2023 at 02:12):

I don’t know when it was updated but it looks like the paper list for MathAI was finally uploaded. Lots of interesting looking papers!

RexWang (Nov 27 2023 at 16:32):

There is a mix-up in the official website's list of papers, where the ~~author details and~~ link for paper 29 titled "Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts" points to the fifth paper (Chameleon). The correct version is available on arXiv at 2310.14628.pdf. May I report the issue here?

Additionally, the link to the first paper(TinyGSM) has gone missing, and I can't find one online.

Jason Rute (Nov 27 2023 at 17:07):

@RexWang I would reach out the the organizers (although many of them are on this Zulip so you could DM them).

RexWang (Nov 29 2023 at 16:16):

I have briefly read the abstracts of the listed papers, among which 7 are related to formalization. Excluding Paper No. 22, lemur, which discusses formal program verification, there are 6 papers in this category. Several of these have already been discussed in the lean zulip forum.

llmstep: The paper LLM proofstep suggestions in Lean, with repository at wellecks/llmstep and the community post llmstep. The work is based on LeanDojo and provides practical tools, as well as releasing code for fine-tuning and evaluation.
COPRA: The paper A Language-Agent Approach to Formal Theorem-Proving, with repository at trishullab/copra, the community post COPRA and an openreview here. This paper has also been submitted to this year's ICLR.
The paper Temperature-scaled large language models for Lean proofstep prediction by Meta. It appears that no code has been released, and the core idea is "temperature scaling as a regularization method for multi-epoch training on small datasets."
Magnushammer: The paper A Transformer-Based Approach to Premise Selection with dataset: premise_selection_in_isabelle, and the community post: New paper: Magnushammer. This work focuses on premise selection methods, further improving upon Sledgehammer.
LeanCopilot: The paper Towards Large Language Models as Copilots for Theorem Proving in Lean. Similar to llmstep, this can be highly useful for Lean users, but it seems that the code has not been made publicly available.
The paper LLM vs ITP and the website: Simon Frieder. This work references Formalizing 100 Theorems and has released the GHOSTS dataset: friederrr/GHOSTS and LLMKnow. However, the link for "LLMKNOW" at https://llmknow.friederrr.org is not accessible (currently indicating "coming soon"). The study discusses "analyzing whether the knowledge contained in LLMs matches that encoded in ITPs", which is quite interesting; this essentially explores evidence that LLMs could be used for ITP.

Jason Rute (Nov 29 2023 at 17:25):

I would also include Llemma, which while mostly about natural language math, does have a couple experiments about formal math in Lean and Isabelle, and they put a lot of work into making sure it had good coverage of ITP data (both at the level of code and also proofstate-tactic pairs).

Jason Rute (Nov 29 2023 at 17:29):

Note, this is just a workshop, so these papers are ridiculously short, with most of the details in the appendix. These usually will not be the only version. Many have fuller length versions on arXiv or submitted to conferences.

Jason Rute (Nov 29 2023 at 17:30):

Some of these have already been talked about here #Machine Learning for Theorem Proving, but others haven’t and probably should be.

Jason Rute (Nov 29 2023 at 17:31):

Also most of the authors of the above papers are on Zulip, and if they haven’t already, should probably share more details about their work. :)

Jason Rute (Nov 29 2023 at 17:37):

I notice @RexWang mentioned one of these was submitted to ICLR, but actually there are a number of these (and more about ITP) currently being reviewed for ICLR 2024. Go to https://openreview.net/group?id=ICLR.cc/2024/Conference , select the active submissions tab, and search for “theorem proving” (or more specific keywords like “autoformalization”, “Lean”, “Coq”, or “Isabelle”.)

RexWang (Nov 29 2023 at 17:59):

Thanks a lot for the suggestions!! @Jason Rute :grinning: Indeed, I should not have neglected the paper llemma. I will give more thoughts and feedback after delving further into them.

Rémy Degenne (Dec 10 2023 at 06:50):

Who here will attend NeurIPS this week? Answer with :+1: to indicate that you will be there!

I am starting the long trip from France to New Orleans.

Junyan Xu (Dec 16 2023 at 04:40):

@Albert Jiang live tweeted the MATH-AI panel, featuring Sir Timothy Gowers, Talia Ringer, and Tony Wu.

Kevin Buzzard (Dec 16 2023 at 09:06):

Yesterday I bumped into someone who attended Neurips (Sara Veneziale, who presented on her work doing algebraic geometry inspired by neural networks) and she expressed surprise that so many talks mentioned lean "even though there were hardly any mathematicians there" :-)

Jason Rute (Dec 16 2023 at 14:30):

I think Lean for whatever reason (I have a few speculative thoughts as to why) has become the go to place for neural theorem proving research, with Isabelle in second place, and Coq in third.

Alex Sanchez-Stern (Dec 16 2023 at 19:04):

I thiink the Coq tools that predate the LLM proving are still SOTA though, hard to tell.

Alex Sanchez-Stern (Dec 16 2023 at 19:10):

My impression is that the new lean+llm tools are still playing catch up, happy to be corrected though.

Kaiyu Yang (Dec 16 2023 at 19:23):

Alex Sanchez-Stern said:

I thiink the Coq tools that predate the LLM proving are still SOTA though, hard to tell.

My belief is quite the opposite :rolling_on_the_floor_laughing:. It also depends on the evaluation strategy. If I make the walltime limit small enough, tidy should outperform both of them?

Abhishek Anand (Dec 17 2023 at 03:20):

What papers from the broader NeurIPS conference/workshops (other than math-ai) did people find potentially useful for theorem proving? I found the tree of thought paper (https://arxiv.org/abs/2305.10601) promising for improving the backtracking tree searches in theorem proving: specifically, the idea of the LLM reflecting and self-critiquing its various proof strategies+steps and THEN picking the most promising next strategy+step

Jason Rute (Dec 17 2023 at 04:45):

Kaiyu Yang said:

Alex Sanchez-Stern said:

I thiink the Coq tools that predate the LLM proving are still SOTA though, hard to tell.

My belief is quite the opposite :rolling_on_the_floor_laughing:. It also depends on the evaluation strategy. If I make the walltime limit small enough, tidy should outperform both of them?

I'd love to revisit this conversation in a week... :)

Kaiyu Yang (Dec 17 2023 at 05:01):

I'd love to revisit this conversation in a week... :)

Are you preparing a manuscript "tidy is all you need"?

Albert Jiang (Dec 17 2023 at 19:47):

Jason Rute said:

I think Lean for whatever reason (I have a few speculative thoughts as to why) has become the go to place for neural theorem proving research, with Isabelle in second place, and Coq in third.

The very fact that this discussion is so active on the Lean zulip and not the other two is very telling.

Jason Rute (Dec 17 2023 at 20:10):

I know the history of that since I started this stream. I was having some private discussions with Brando Miranda, Jesse Michael Han, and Stan Polu here on Zulip as well as a long thread on the general channel (which was mostly an argument between me and Tim Daly). Then I met Markus Rabe at Lean Together in Jan 2020. We generally agreed it would be good to have a dedicated place to discuss this stuff. While at that point none of the work was done in Lean, it seemed good anyway to start a stream here and everyone agreed. Many of the early discussions were about HOList actually. (It still surprises me that even with the popularity of Isabelle for AI research for theorem proving, that there is almost no discussion on the Isabelle Zulip.)

Jason Rute (Dec 17 2023 at 20:13):

There has started to be more discussion on the Coq Zulip.

Wenda Li (Dec 18 2023 at 11:47):

Jason Rute said:

I know the history of that since I started this stream. I was having some private discussions with Brando Miranda, Jesse Michael Han, and Stan Polu here on Zulip as well as a long thread on the general channel (which was mostly an argument between me and Tim Daly). Then I met Markus Rabe at Lean Together in Jan 2020. We generally agreed it would be good to have a dedicated place to discuss this stuff. While at that point none of the work was done in Lean, it seemed good anyway to start a stream here and everyone agreed. Many of the early discussions were about HOList actually. (It still surprises me that even with the popularity of Isabelle for AI research for theorem proving, that there is almost no discussion on the Isabelle Zulip.)

After an initial setup, I guess there isn't much difference between the ML model for different ITPs -- models are not instructed to tell the difference between simple type theory and dependent type theory. In this case, it does not matter which Zulip to have those discussions. Also, I believe @Jason Rute here has done an excellent job of compiling progress in this area and moderating the discussions :-)

Alex Sanchez-Stern (Dec 18 2023 at 17:37):

Kaiyu Yang said:

Alex Sanchez-Stern said:

I thiink the Coq tools that predate the LLM proving are still SOTA though, hard to tell.

My belief is quite the opposite :rolling_on_the_floor_laughing:. It also depends on the evaluation strategy. If I make the walltime limit small enough, tidy should outperform both of them?

Hmm, interesting. What evaluation strategy do you think benefits one versus the other? It seems like in terms of number of predictions/model queries a tool like COPRA might win out, but if you measure total time needed to complete the proofs Proverbot9001 and Diva are still ahead, since their predictions are orders of magnitude faster.

Alex Sanchez-Stern (Dec 18 2023 at 17:44):

Unless the eval times for the newer tools are particularly short, I don't think they win out on speed; the proverbot9001 evaluation runs in a few hours on university resources. And in terms of total numbers, they seem to be behind (although again, we only have transitive evidence so far through COPRA, since the benchmarks are different).

Alex Sanchez-Stern (Dec 18 2023 at 18:42):

Maybe there's some eval evidence I haven't seen?

Jason Rute (Dec 18 2023 at 19:47):

I think you two aren’t disagreeing as much as you think. Of the models which predict tactic steps and do a tree search, there is a trade off between per-step prediction strength and per-step speed. Roughly heuristics > kNN models > tree models > graph models > transformers > large language models when it comes to speed. Hopefully it would be the opposite when it comes to per prediction usefulness. It seems that if the wall clock time was 1 hours and one had unlimited access to GPT-4 then a model like COPRA might do quite well since each step is really good even if it takes a long time (and a lot of money!) to run. On the fast end, if you just have 1/10 of a second on a laptop then a fast heuristic-based model like aesop might be the best option. The question is what is the current balance of time per prediction strength (and resources/money) and that can only be determined by experimentation. It isn’t obvious. (Hopefully we’ll release something in a few days which gives a bit more data in this direction, but again more experiments are likely needed.)

Alex Sanchez-Stern (Dec 18 2023 at 20:03):

Hmm, that could prove to be the case. But right now as far as I can tell, the evidence points to the conclusion that, for any particular time limit, Proverbot9001 would prove more theorems than, say, ReProver or COPRA. Is there a time scale in particular that you think the lean tools would perform better at than the Coq tools?

Alex Sanchez-Stern (Dec 18 2023 at 20:12):

Jason Rute said:

It seems that if the wall clock time was 1 hours and one had unlimited access to GPT-4 then a model like COPRA might do quite well since each step is really good even if it takes a long time (and a lot of money!) to run.

It looks to me like COPRA was given much longer than an hour to run on it's compcert subset. Did you mean a (single-threaded) hour per-proof? In that case Proverbot90001 would have about 4x as much time as in it's normal eval, so it's unlikely that COPRA would reach it's success levels. But one would have to do the eval to be sure.

Alex Sanchez-Stern (Dec 18 2023 at 20:13):

And then once you consider the hardware differences between using Open-AI's advanced hardware and few university GPU's the gap widens.

Jason Rute (Dec 18 2023 at 20:15):

Yes, I meant one hour per proof.

Jason Rute (Dec 18 2023 at 20:17):

And that Copra change some of its hyper parameters to account for this like returning more suggestions. Similarly proverbot9001 could change its hyper parameters and use more hardware. It was just a thought experiment to prove a point.

Alex Sanchez-Stern (Dec 18 2023 at 20:24):

Yeah it's unclear, since you hit diminishing returns pretty quickly. But even under that thought experiment it still looks like Proverbot9001 would prove more theorems. If we allow Proverbot9001 to use 64 threads to compensate for the hardware differences in OpenAI, then giving it an hour per proof would correspond to a 500x increase in time budget (the first 500 proofs finish in an hour, there's a long tail of 8 proofs that take another hour or two), but only a 10x increase in time budget for COPRA (the paper reports an average time of about 6 minutes per-proof, so being generous we could call that the max).

Abhishek Anand (Dec 18 2023 at 20:27):

Do you think that the non-LLM based ML approaches of Proverbot9001 would have done as well as ChatGPT in the lemur work in coming up with code/loop invariants? https://arxiv.org/pdf/2310.04870.pdf

Alex Sanchez-Stern (Dec 18 2023 at 20:34):

Hmm not sure, I haven't tried loop invariant generation, it's a pretty different task. But i noticed that paper doesn't compare to any of the classic invariant generation tools like Daikon (from the 90's), not sure how well it performs in that context.

Abhishek Anand (Dec 18 2023 at 20:39):

IIUC, in Daikon, you need to run your program many times to discover invariants. Lemur does it without running the program. But I agree that could have been a good comparison.

Alex Sanchez-Stern (Dec 18 2023 at 20:42):

Hmm that's a fair point. But they didn't do a quantitative comparison to any prior tools, there are certainly more than a few prior loop invariant generation tools that don't require running the program.

Kaiyu Yang (Dec 19 2023 at 03:39):

Alex Sanchez-Stern said:

Hmm, that could prove to be the case. But right now as far as I can tell, the evidence points to the conclusion that, for any particular time limit, Proverbot9001 would prove more theorems than, say, ReProver or COPRA.

What evidence are you referring to? The experiments in COPRA?

Alex Sanchez-Stern (Dec 19 2023 at 18:15):

Yeah, that seems to be the only comparison point

Kaiyu Yang (Dec 21 2023 at 03:51):

Alex Sanchez-Stern said:

Yeah, that seems to be the only comparison point

I don't think that is a valid comparison point. Even if COPRA experiments suggest "Proverbot9001 > COPRA on CompCert under some evaluation setting/criterion" and "COPRA > ReProver on MiniF2F under some evaluation setting/criterion" (I think these are what @Alex Sanchez-Stern is suggesting, but I don't have sufficient knowledge of COPRA to comment on whether they are true. Maybe @Amitayush Thakur is the best person to comment on that), there are too many alternative hypotheses. For example, here is one:

DISCLAIMER: BELOW IS JUST A HYPOTHESIS:
COPRA is based on GPT-4, whereas Proverbot9001 and ReProver are finetuned models specialized in formal proofs. GPT-4 has seen many problems related to high-school math and math olympiads during training (from textbooks, online forums, etc.), so it can perform competitively with finetuned models on MiniF2F. However, GPT-4 is less familiar with the domain of verified compilers since there is much less data about it on the Internet. Therefore, it cannot outperform models finetuned on CompCert.

People might have different opinions on how plausible this hypothesis is versus others hypotheses. My main point is that we have too little empirical evidence to suggest "X is state of the art". And I think asserting "X is state of the art" without context (dataset, evaluation protocol, etc.) can be misleading, as it oversimplifies the problem and suggests there is a single, universal state of the art. I hope @Jason Rute's forthcoming work will shed more light on this question.

Amitayush Thakur (Dec 21 2023 at 06:26):

Kaiyu Yang said:

Alex Sanchez-Stern said:

Yeah, that seems to be the only comparison point

I don't think that is a valid comparison point. Even if COPRA experiments suggest "Proverbot9001 > COPRA on CompCert under some evaluation setting/criterion" and "COPRA > ReProver on MiniF2F under some evaluation setting/criterion" (I think these are what Alex Sanchez-Stern is suggesting, but I don't have sufficient knowledge of COPRA to comment on whether they are true. Maybe Amitayush Thakur is the best person to comment on that), there are too many alternative hypotheses. For example, here is one:

DISCLAIMER: BELOW IS JUST A HYPOTHESIS:
COPRA is based on GPT-4, whereas Proverbot9001 and ReProver are finetuned models specialized in formal proofs. GPT-4 has seen many problems related to high-school math and math olympiads during training (from textbooks, online forums, etc.), so it can perform competitively with finetuned models on MiniF2F. However, GPT-4 is less familiar with the domain of verified compilers since there is much less data about it on the Internet. Therefore, it cannot outperform models finetuned on CompCert.

People might have different opinions on how plausible this hypothesis is versus others hypotheses. My main point is that we have too little empirical evidence to suggest "X is state of the art". And I think asserting "X is state of the art" without context (dataset, evaluation protocol, etc.) can be misleading, as it oversimplifies the problem and suggests there is a single, universal state of the art. I hope Jason Rute's forthcoming work will shed more light on this question.

For guidance steps <= 60 COPRA outperforms Proverbot9001 with regards to the number of theorems proved, we could not run COPRA on more steps because of budget limits. Proverbot9001 can be faster than COPRA with regards to wall-clock time, but we don't have evidence to firmly establish that Proverbot9001 > COPRA on the CompCert dataset in terms of proving more theorems alone (given a higher number of proof steps budget). My hypothesis is the following:

There is a tradeoff in the size of the model vs the number of proof steps that one can sample from the model. However, if the model size is too small then the sampled proof steps can be bad enough to keep moving the proof in the wrong direction. A proof going in the wrong direction will also mean executing those wrong proof steps will take more time on ITP as opposed to longer sampling time, but with much better proof steps coming from a bigger model, we can still finish the proofs faster overall. (I guess this is kind of similar to what @Jason Rute was mentioning)

I would like to point out that comparison between so many different approaches is not smooth and transitive because we don't have interfaces to seamlessly do a transferability study of various approaches on different datasets. I would love to be a part of any initiatives where we try to create domain and language agnostic abstractions so that it becomes easy to compare future works.

Alex Sanchez-Stern (Dec 21 2023 at 19:03):

Hmm, right the comparisons points we have definitely aren't great. With a fixed wall clock time, it seems pretty likely that Proverbot9001 proves more compcert theorems though, as the wall clock time of each step is about 100x faster than COPRA, so COPRA would have to, at 60 steps, beat Proverbot9001 at 6000 steps, which isn't the way the data is trending in the COPRA eval.
But it's certainly very weak evidence, I definitely wouldn't try to publish it :grinning_face_with_smiling_eyes: . Still, we don't yet have any evidence for the alternative theories, so I think the point I was originally trying to make still stands (not sure if I was very clear on what that point is). The comparison points are really weak, but so far the evidence tends in the direction that models with small specialized architectures still outperform LLM-based tools, so it's not yet clear whether LLM's are the most promising way to prove theorems yet. That's not to rule LLM-based tools out, they may still prove to be the stronger proving models, but they haven't yet.
To know for sure how the current approaches compare to each other, we'll need new progress in cross-language benchmarks and tools, which is why I think it's such an important direction. Even that doesn't necessarily tell us which approaches will work the best going forward, but knowing what works best right now will be a great indicator of which approaches to pursue.

Alex Sanchez-Stern (Dec 21 2023 at 19:19):

Oh, and I want to also say that one of the most promising advances I've seen in making these kinds of comparisons more robust is Amitayush's great work in developing the cross-language abstraction that COPRA is built on. As far as I know, it's the first tool to be able to run on more than one prover, which in opinion is a huge contribution to the future of the field. I would be happy to help with that work on language agnostic abstractions that Amitayush mentioned, and am planning on looking at the existing code in COPRA for developing my future tools.

Amitayush Thakur (Dec 21 2023 at 19:39):

We are already planning to separate the ITP abstraction code from COPRA to a new repository while adding support for Isabelle (probably by next month). We want to create a Python library from these abstractions so that anyone can use it for collecting data from various ITPs in a uniform format, testing their approach on any ITP, and do cross-training on different domains and languages.

Jason Rute (Dec 21 2023 at 19:41):

The DT-Solver paper also does a fairly nice cross platform comparison, even reimplementing past works. Note the Lisa and GPT-f in that paper are very similar methodologies. Unfortunately the code isn’t shared.

Jason Rute (Dec 21 2023 at 19:42):

Also miniF2F is a useful cross platform comparison, even though it also brings into account the power of the library and tactics in each language. It is unfortunate it is not translated to Coq.

Jason Rute (Dec 21 2023 at 19:46):

The GPT-f style of transformer + tree search is actually fairly easy to build (and has been replicated many times in Isabelle and Lean) and is a good baseline approach.

Jason Rute (Dec 21 2023 at 20:18):

So I think there has already been good Lean-Isabelle cross platform comparison. Hence, I think we have a fairly good idea how a Sledgehammer clone for Lean would fair against something like ReProver. But the sort of approaches which @Alex Sanchez-Stern is advocating for (Proverbot9001, Tactician/k-NN, ASTactic and its successors) are Coq only and have no equivalents in Isabelle or Lean, nor is there anything like the GPT-f/Lisa baseline I mentioned above for Coq. Another point is that Coq typically tests on entirely new projects which might advantage different approaches (although something like this scenario is what ReProver was designed for).

Jason Rute (Dec 21 2023 at 20:20):

And Coq projects usually use less compute, running without a GPU.

Alex Sanchez-Stern (Dec 22 2023 at 19:22):

It's ironic that there has been so much more cross-prover work between Lean and Isabelle, since the foundations of Coq and Lean are much more similar. It is probably just because of the previous disconnect of two communities though; people who come from theorem proving people use Coq, while people coming from machine learning picked up on Isabelle and then Lean before there was much discussion between the two.

Last updated: May 02 2025 at 03:31 UTC