Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: MATH-AI workshop at NeurIPS


Rémy Degenne (Oct 27 2022 at 10:02):

There will be a workshop about AI and math at NeurIPS on December 3 in New Orleans: https://mathai2022.github.io/
It looks similar to the MATH-AI workshop that took place at ICLR. There is no list of papers on the website yet.

Tom Chen (Nov 23 2022 at 23:13):

By the way, it looks like they have a list now: https://mathai2022.github.io/papers/

Jason Rute (Oct 31 2023 at 13:28):

It looks like the papers for MATH-AI 2023 recently dropped on OpenReview. The official website hasn’t been updated yet.

Jason Rute (Oct 31 2023 at 13:29):

(I’m stealing the thread from the 2022 workshop, but this year it is still part of NeurIPS.)

fzyzcjy (Oct 31 2023 at 23:11):

I checked it a few days ago as well, but it seems there are still no PDFs yet (only abstracts).

Jason Rute (Oct 31 2023 at 23:19):

Yeah, I think I jumped the gun. I should have waited until it was on the website, including the PDFs.

fzyzcjy (Oct 31 2023 at 23:22):

Anyway looking forward to it!

Michal Novomestsky (Nov 02 2023 at 11:38):

Hi all,

Has anyone here attended a similar NeurIPS workshop in the past? If so, what format are they usually hosted in? Are they a series of short lectures, more of an open discussion/networking session, or something else entirely? I'd love to attend this year's, but as an undergrad in Australia I'd need to fund my own way there (not cheap for a student :sob:).

If anyone has any experience from previous workshops/NeurIPS events, I'd greatly appreciate any advice or comments on what it was like. I heard that some NeurIPS events are streamed online as well, but if this workshop isn't (which seems to be the case as far as I can tell), or if people suggest it's more worthwhile to come in person, then I'll see if I can find an academic at my uni who can reimburse the travel costs on the uni's behalf.

Thanks!

Rémy Degenne (Nov 02 2023 at 11:56):

I have attended and organized such workshops, and the format depends a lot on what the organizers want, but most often the program contains talks by invited speakers, talks by some of the authors who contributed papers, and poster sessions. See for example the program of the same workshop last year: https://mathai2022.github.io/schedule/.

Kaiyu Yang (Nov 02 2023 at 14:04):

I haven't checked with NeurIPS. From past experience, the talks will be live-streamed to virtual NeurIPS attendees during the conference, and the recordings will be available later to the public. The poster sessions will be entirely in-person.

Michal Novomestsky (Nov 03 2023 at 12:09):

Hey Remy and Kaiyu,

Thanks for the details! The fact that it's livestreamed is a huge help :smile:

Jason Rute (Nov 24 2023 at 02:12):

I don’t know when it was updated, but it looks like the paper list for MATH-AI was finally uploaded. Lots of interesting-looking papers!

RexWang (Nov 27 2023 at 16:32):

There is a mix-up in the official website's list of papers, where the author details and link for paper 29, titled "Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts", point to the fifth paper (Chameleon). The correct version is available on arXiv at 2310.14628.pdf. May I report the issue here?


Additionally, the link to the first paper (TinyGSM) has gone missing, and I can't find one online.

Jason Rute (Nov 27 2023 at 17:07):

@RexWang I would reach out to the organizers (although many of them are on this Zulip, so you could DM them).

RexWang (Nov 29 2023 at 16:16):

I have briefly read the abstracts of the listed papers, among which 7 are related to formalization. Excluding Paper No. 22, Lemur, which discusses formal program verification, there are 6 papers in this category. Several of these have already been discussed on the Lean Zulip.

  1. llmstep: The paper LLM proofstep suggestions in Lean, with repository at wellecks/llmstep and the community post llmstep. The work is based on LeanDojo and provides practical tools, as well as releasing code for fine-tuning and evaluation.

  2. COPRA: The paper A Language-Agent Approach to Formal Theorem-Proving, with repository at trishullab/copra, the community post COPRA and an openreview here. This paper has also been submitted to this year's ICLR.

  3. The paper Temperature-scaled large language models for Lean proofstep prediction by Meta. It appears that no code has been released, and the core idea is "temperature scaling as a regularization method for multi-epoch training on small datasets"; a minimal sketch of this mechanism follows after the list.

  4. Magnushammer: The paper A Transformer-Based Approach to Premise Selection with dataset: premise_selection_in_isabelle, and the community post: New paper: Magnushammer. This work focuses on premise selection methods, further improving upon Sledgehammer.

  5. LeanCopilot: The paper Towards Large Language Models as Copilots for Theorem Proving in Lean. Similar to llmstep, this can be highly useful for Lean users, but it seems that the code has not been made publicly available.

  6. The paper LLM vs ITP and the website: Simon Frieder. This work references Formalizing 100 Theorems and has released the GHOSTS dataset: friederrr/GHOSTS and LLMKnow. However, the link for "LLMKNOW" at https://llmknow.friederrr.org is not accessible (currently indicating "coming soon"). The study discusses "analyzing whether the knowledge contained in LLMs matches that encoded in ITPs", which is quite interesting; this essentially explores evidence that LLMs could be used for ITP.
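For item 3, here is a minimal, hypothetical sketch of what temperature scaling of the logits looks like. This is not code from the paper (whose exact training recipe I have not seen), just a generic PyTorch illustration with made-up shapes.

```python
import torch
import torch.nn.functional as F

def temperature_scaled_loss(logits, targets, temperature=2.0):
    """Cross-entropy computed on logits divided by a temperature.

    Dividing the logits by T > 1 flattens the predicted distribution,
    which discourages overconfident fits when a small dataset is seen
    for many epochs. Illustration only; the paper's actual setup may
    differ in where and how the temperature is applied.
    """
    return F.cross_entropy(logits / temperature, targets)

# Toy usage: a batch of 4 examples over a 10-token vocabulary.
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(temperature_scaled_loss(logits, targets, temperature=2.0))
```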

Jason Rute (Nov 29 2023 at 17:25):

I would also include Llemma, which, while mostly about natural-language math, does have a couple of experiments on formal math in Lean and Isabelle, and they put a lot of work into making sure it had good coverage of ITP data (both at the level of code and also proofstate-tactic pairs).

Jason Rute (Nov 29 2023 at 17:29):

Note: this is just a workshop, so these papers are ridiculously short, with most of the details in the appendix. These usually will not be the only versions. Many have full-length versions on arXiv or submitted to conferences.

Jason Rute (Nov 29 2023 at 17:30):

Some of these have already been talked about here in #Machine Learning for Theorem Proving, but others haven’t and probably should be.

Jason Rute (Nov 29 2023 at 17:31):

Also most of the authors of the above papers are on Zulip, and if they haven’t already, should probably share more details about their work. :)

Jason Rute (Nov 29 2023 at 17:37):

I notice @RexWang mentioned one of these was submitted to ICLR, but actually there are a number of these (and more about ITP) currently being reviewed for ICLR 2024. Go to https://openreview.net/group?id=ICLR.cc/2024/Conference , select the active submissions tab, and search for “theorem proving” (or more specific keywords like “autoformalization”, “Lean”, “Coq”, or “Isabelle”.)

RexWang (Nov 29 2023 at 17:59):

Thanks a lot for the suggestions!! @Jason Rute :grinning: Indeed, I should not have neglected the Llemma paper. I will share more thoughts and feedback after delving further into them.

Rémy Degenne (Dec 10 2023 at 06:50):

Who here will attend NeurIPS this week? Answer with :+1: to indicate that you will be there!

I am starting the long trip from France to New Orleans.

Junyan Xu (Dec 16 2023 at 04:40):

@Albert Jiang live tweeted the MATH-AI panel, featuring Sir Timothy Gowers, Talia Ringer, and Tony Wu.

Kevin Buzzard (Dec 16 2023 at 09:06):

Yesterday I bumped into someone who attended NeurIPS (Sara Veneziale, who presented on her work doing algebraic geometry inspired by neural networks), and she expressed surprise that so many talks mentioned Lean "even though there were hardly any mathematicians there" :-)

Jason Rute (Dec 16 2023 at 14:30):

I think Lean for whatever reason (I have a few speculative thoughts as to why) has become the go-to place for neural theorem proving research, with Isabelle in second place and Coq in third.

Alex Sanchez-Stern (Dec 16 2023 at 19:04):

I think the Coq tools that predate LLM-based proving are still SOTA, though it's hard to tell.

Alex Sanchez-Stern (Dec 16 2023 at 19:10):

My impression is that the new Lean+LLM tools are still playing catch-up; happy to be corrected, though.

Kaiyu Yang (Dec 16 2023 at 19:23):

Alex Sanchez-Stern said:

I think the Coq tools that predate LLM-based proving are still SOTA, though it's hard to tell.

My belief is quite the opposite :rolling_on_the_floor_laughing:. It also depends on the evaluation strategy. If I make the walltime limit small enough, tidy should outperform both of them?

Abhishek Anand (Dec 17 2023 at 03:20):

What papers from the broader NeurIPS conference/workshops (other than MATH-AI) did people find potentially useful for theorem proving? I found the Tree of Thoughts paper (https://arxiv.org/abs/2305.10601) promising for improving the backtracking tree searches in theorem proving: specifically, the idea of the LLM reflecting on and self-critiquing its various proof strategies+steps and THEN picking the most promising next strategy+step.
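To make that concrete, a propose-critique-select loop around a proof search could look roughly like the sketch below. This is only an illustration of the Tree of Thoughts idea, not the paper's implementation; propose_steps, critique, apply_step, and is_solved are hypothetical hooks, not any existing API.

```python
# Sketch of Tree-of-Thoughts-style step selection for proof search.
# Hypothetical hooks:
#   propose_steps(state, k) -> k candidate next tactics from an LLM
#   critique(state, step)   -> LLM self-assessment score in [0, 1]
#   apply_step(state, step) -> new proof state, or None if the step fails

def tot_search(goal, propose_steps, critique, apply_step,
               beam_width=3, num_candidates=8, max_depth=20):
    frontier = [goal]
    for _ in range(max_depth):
        scored = []
        for state in frontier:
            if state.is_solved():  # assumed method on the proof state
                return state
            for step in propose_steps(state, num_candidates):
                # Ask the model to reflect on each candidate *before* committing.
                scored.append((critique(state, step), state, step))
        # Keep only the most promising (state, step) pairs, then expand them.
        scored.sort(key=lambda t: t[0], reverse=True)
        frontier = [new_state
                    for _, state, step in scored[:beam_width]
                    if (new_state := apply_step(state, step)) is not None]
        if not frontier:
            return None  # this beam is exhausted; a caller could backtrack
    return None
```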

Jason Rute (Dec 17 2023 at 04:45):

Kaiyu Yang said:

Alex Sanchez-Stern said:

I think the Coq tools that predate LLM-based proving are still SOTA, though it's hard to tell.

My belief is quite the opposite :rolling_on_the_floor_laughing:. It also depends on the evaluation strategy. If I make the walltime limit small enough, tidy should outperform both of them?

I'd love to revisit this conversation in a week... :)

Kaiyu Yang (Dec 17 2023 at 05:01):

I'd love to revisit this conversation in a week... :)

Are you preparing a manuscript "tidy is all you need"?

Albert Jiang (Dec 17 2023 at 19:47):

Jason Rute said:

I think Lean for whatever reason (I have a few speculative thoughts as to why) has become the go-to place for neural theorem proving research, with Isabelle in second place and Coq in third.

The very fact that this discussion is so active on the Lean Zulip and not the other two is very telling.

Jason Rute (Dec 17 2023 at 20:10):

I know the history of that since I started this stream. I was having some private discussions with Brando Miranda, Jesse Michael Han, and Stan Polu here on Zulip, as well as a long thread on the general channel (which was mostly an argument between me and Tim Daly). Then I met Markus Rabe at Lean Together in Jan 2020. We generally agreed it would be good to have a dedicated place to discuss this stuff. While at that point none of the work was done in Lean, it seemed good anyway to start a stream here and everyone agreed. Many of the early discussions were about HOList actually. (It still surprises me that, even with the popularity of Isabelle for AI research for theorem proving, there is almost no discussion on the Isabelle Zulip.)

Jason Rute (Dec 17 2023 at 20:13):

There has started to be more discussion on the Coq Zulip.

Wenda Li (Dec 18 2023 at 11:47):

Jason Rute said:

I know the history of that since I started this stream. I was having some private discussions with Brando Miranda, Jesse Michael Han, and Stan Polu here on Zulip, as well as a long thread on the general channel (which was mostly an argument between me and Tim Daly). Then I met Markus Rabe at Lean Together in Jan 2020. We generally agreed it would be good to have a dedicated place to discuss this stuff. While at that point none of the work was done in Lean, it seemed good anyway to start a stream here and everyone agreed. Many of the early discussions were about HOList actually. (It still surprises me that, even with the popularity of Isabelle for AI research for theorem proving, there is almost no discussion on the Isabelle Zulip.)

After an initial setup, I guess there isn't much difference between the ML models for different ITPs -- the models are not instructed to tell the difference between simple type theory and dependent type theory. In that case, it does not matter much which Zulip hosts those discussions. Also, I believe @Jason Rute here has done an excellent job of compiling progress in this area and moderating the discussions :-)

Alex Sanchez-Stern (Dec 18 2023 at 17:37):

Kaiyu Yang said:

Alex Sanchez-Stern said:

I think the Coq tools that predate LLM-based proving are still SOTA, though it's hard to tell.

My belief is quite the opposite :rolling_on_the_floor_laughing:. It also depends on the evaluation strategy. If I make the walltime limit small enough, tidy should outperform both of them?

Hmm, interesting. What evaluation strategy do you think benefits one versus the other? It seems like in terms of number of predictions/model queries a tool like COPRA might win out, but if you measure the total time needed to complete the proofs, Proverbot9001 and Diva are still ahead, since their predictions are orders of magnitude faster.

Alex Sanchez-Stern (Dec 18 2023 at 17:44):

Unless the eval times for the newer tools are particularly short, I don't think they win out on speed; the Proverbot9001 evaluation runs in a few hours on university resources. And in terms of total numbers, they seem to be behind (although again, we only have transitive evidence so far through COPRA, since the benchmarks are different).

Alex Sanchez-Stern (Dec 18 2023 at 18:42):

Maybe there's some eval evidence I haven't seen?

Jason Rute (Dec 18 2023 at 19:47):

I think you two aren’t disagreeing as much as you think. Among the models which predict tactic steps and do a tree search, there is a trade-off between per-step prediction strength and per-step speed. Roughly, heuristics > kNN models > tree models > graph models > transformers > large language models when it comes to speed. Hopefully it is the opposite when it comes to per-prediction usefulness. It seems that if the wall-clock limit were one hour and one had unlimited access to GPT-4, then a model like COPRA might do quite well, since each step is really good even if it takes a long time (and a lot of money!) to run. On the fast end, if you just have 1/10 of a second on a laptop, then a fast heuristic-based model like aesop might be the best option. The question is where the current balance between prediction strength and time (and resources/money) lies, and that can only be determined by experimentation. It isn’t obvious. (Hopefully we’ll release something in a few days which gives a bit more data in this direction, but again more experiments are likely needed.)
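As a rough illustration of that trade-off (not anyone’s actual evaluation setup; the function names and numbers here are made up), a time-budgeted search looks roughly like this. A slower but stronger `predict` only wins if its extra per-call quality buys back the expansions it loses to the clock.

```python
import time

def prove_within_budget(goal, predict, apply_tactic, budget_seconds=60.0):
    """Depth-first proof search under a wall-clock budget (illustrative sketch).

    `predict(state)` returns candidate tactics, best first, and
    `apply_tactic(state, tactic)` returns the new state or None.
    Whether a fast-but-weak or a slow-but-strong `predict` proves more
    theorems depends entirely on the budget, which is the point above.
    """
    deadline = time.monotonic() + budget_seconds
    stack = [goal]
    while stack and time.monotonic() < deadline:
        state = stack.pop()
        if state.is_solved():  # assumed method on the proof state
            return state
        # Push candidates in reverse order so the top-ranked one is popped first.
        for tactic in reversed(predict(state)):  # one model call per expansion
            new_state = apply_tactic(state, tactic)
            if new_state is not None:
                stack.append(new_state)
    return None  # budget exhausted, or the search space was closed off
```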

Alex Sanchez-Stern (Dec 18 2023 at 20:03):

Hmm, that could prove to be the case. But right now as far as I can tell, the evidence points to the conclusion that, for any particular time limit, Proverbot9001 would prove more theorems than, say, ReProver or COPRA. Is there a particular time scale at which you think the Lean tools would perform better than the Coq tools?

Alex Sanchez-Stern (Dec 18 2023 at 20:12):

Jason Rute said:

It seems that if the wall-clock limit were one hour and one had unlimited access to GPT-4, then a model like COPRA might do quite well, since each step is really good even if it takes a long time (and a lot of money!) to run.

It looks to me like COPRA was given much longer than an hour to run on its CompCert subset. Did you mean a (single-threaded) hour per proof? In that case Proverbot9001 would have about 4x as much time as in its normal eval, so it's unlikely that COPRA would reach its success levels. But one would have to do the eval to be sure.

Alex Sanchez-Stern (Dec 18 2023 at 20:13):

And then once you consider the hardware differences between using OpenAI's advanced hardware and a few university GPUs, the gap widens.

Jason Rute (Dec 18 2023 at 20:15):

Yes, I meant one hour per proof.

Jason Rute (Dec 18 2023 at 20:17):

And COPRA could change some of its hyperparameters to account for this, like returning more suggestions. Similarly, Proverbot9001 could change its hyperparameters and use more hardware. It was just a thought experiment to prove a point.

Alex Sanchez-Stern (Dec 18 2023 at 20:24):

Yeah it's unclear, since you hit diminishing returns pretty quickly. But even under that thought experiment it still looks like Proverbot9001 would prove more theorems. If we allow Proverbot9001 to use 64 threads to compensate for the hardware difference with OpenAI, then giving it an hour per proof would correspond to a 500x increase in time budget (the first 500 proofs finish in an hour, and there's a long tail of 8 proofs that take another hour or two), but only a 10x increase in time budget for COPRA (the paper reports an average time of about 6 minutes per proof, so being generous we could call that the max).
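For the back-of-the-envelope arithmetic behind those multipliers (just my reading of the figures quoted above, not a reported measurement):

```python
# Rough reading of the figures in this thread; not from either paper's eval.
proverbot_avg_s = 3600 / 500   # ~7.2 s/proof if 500 proofs finish in an hour
copra_avg_s = 6 * 60           # ~360 s/proof reported average for COPRA
budget_s = 3600                # the hypothetical hour per proof

print(budget_s / proverbot_avg_s)  # ~500x Proverbot9001's average per-proof time
print(budget_s / copra_avg_s)      # ~10x COPRA's average per-proof time
```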

Abhishek Anand (Dec 18 2023 at 20:27):

Do you think that the non-LLM-based ML approaches of Proverbot9001 would have done as well as ChatGPT in the Lemur work at coming up with code/loop invariants? https://arxiv.org/pdf/2310.04870.pdf

Alex Sanchez-Stern (Dec 18 2023 at 20:34):

Hmm, not sure; I haven't tried loop invariant generation, which is a pretty different task. But I noticed that the paper doesn't compare to any of the classic invariant generation tools like Daikon (from the '90s), so I'm not sure how well it performs in that context.

Abhishek Anand (Dec 18 2023 at 20:39):

IIUC, in Daikon, you need to run your program many times to discover invariants. Lemur does it without running the program. But I agree that could have been a good comparison.

Alex Sanchez-Stern (Dec 18 2023 at 20:42):

Hmm, that's a fair point. But they didn't do a quantitative comparison to any prior tools; there are certainly more than a few prior loop invariant generation tools that don't require running the program.

Kaiyu Yang (Dec 19 2023 at 03:39):

Alex Sanchez-Stern said:

Hmm, that could prove to be the case. But right now as far as I can tell, the evidence points to the conclusion that, for any particular time limit, Proverbot9001 would prove more theorems than, say, ReProver or COPRA.

What evidence are you referring to? The experiments in COPRA?

Alex Sanchez-Stern (Dec 19 2023 at 18:15):

Yeah, that seems to be the only comparison point


Last updated: Dec 20 2023 at 11:08 UTC