Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: ML for Theorem Proving Tutorial - NeurIPS 2023


Zhangir Azerbayev (Dec 01 2023 at 01:05):

Hi everyone,

At this year's NeurIPS, I will be moderating a panel discussion at the Tutorial on Machine Learning for Theorem Proving made up of a number of acclaimed ML researchers. I'm starting this thread to solicit questions from the Lean community. If you have any questions about what formal math can do for machine learning and what machine learning can do for formal math, ask away!

Kevin Buzzard (Dec 01 2023 at 08:11):

"ChatGPT is now a year old and we were told it would change everything but in terms of applications to mathematics beyond school level it has changed nothing. When will it change anything?"

Scott Morrison (Dec 02 2023 at 00:18):

While appreciating the spirit of the question, the letter seems incorrect; I get the impression quite a few people use Moogle now.

Kevin Buzzard (Dec 02 2023 at 09:15):

Happy to change "changed nothing" to "changed epsilon"

Albert Jiang (Dec 02 2023 at 11:47):

Kevin Buzzard said:

"ChatGPT is now a year old and we were told it would change everything but in terms of applications to mathematics beyond school level it has changed nothing. When will it change anything?"

Out of curiosity: who said ChatGPT would change everything? There was quite some excitement over transformers or LLMs for mathematics but I don't recall anything specifically related to ChatGPT. Moogle also has little to do with ChatGPT iiuc.

Albert Jiang (Dec 02 2023 at 11:48):

I also appreciate the spirit of the question :)

Kevin Buzzard (Dec 02 2023 at 13:09):

The answer to your question is "Twitter". I'm only trying to cause trouble because I'm still a skeptic until I see evidence.

Kevin Buzzard (Dec 02 2023 at 13:13):

I absolutely agree that Moogle is useful; I always use it instead of docs search or anything else, even if I know something else will work. It's just that I always know that I could have done it another way. But the answer to "what can AI do for me" is "auto-translate Mazur's paper 'Modular curves and the Eisenstein ideal' into Lean" (because I need it for Fermat), and this still feels like a very, very long way away.

Fabian Glöckle (Dec 02 2023 at 20:43):

“Most people overestimate what they can achieve in a year and underestimate what they can achieve in ten years.”

Kevin Buzzard (Dec 03 2023 at 01:35):

I would love to see theorem provers eating mathematics within ten years! It would vindicate my change of area. But the strides we've taken recently, e.g. formalising several results in modern combinatorics in real time, have come from hard human graft.

Abhishek Anand (Dec 03 2023 at 21:32):

AFAIK, none of the current work on LLMs for Lean or Coq tries to get LLMs to write custom tactics to solve the proof goals, even when there exist reusable algorithms to solve the goal. If that is correct, I would like the panel to discuss the challenges in doing that. My guess is: like humans, LLMs cannot do long mechanical computations reliably; they seem better at writing programs that do such things.

Junyan Xu (Dec 04 2023 at 02:44):

Take a look at Peano. I think the compositional approach has great potential!

Jason Rute (Dec 04 2023 at 15:14):

@Zhangir Azerbayev who is on the panel? Is the focus all forms of AI theorem proving, including formal (ATPs and ITPs), natural language, and possibly informal reasoning in real world ambiguous situations, or just certain subsets?

Jason Rute (Dec 04 2023 at 15:19):

Oh, I see the website has a panelist section. I think that is going to be a group with very diverse perspectives and differing viewpoints, which is good.

Kaiyu Yang (Dec 04 2023 at 15:26):

Is the focus all forms of AI theorem proving, including formal (ATPs and ITPs), natural language, and possibly informal reasoning in real world ambiguous situations, or just certain subsets?

The presentation (by Albert, Emily, and me) will be pretty focused on machine learning for proving theorems in ITP (with synergies to natural language and software verification, e.g., Draft, Sketch, and Prove). I expect the panel discussion to be complementary to the presentation, so it will be broadly related to theorem proving, including topics such as LLMs for informal mathematical reasoning, code generation, etc.

Jason Rute (Dec 04 2023 at 15:27):

My questions are as follows:

  1. What existing methods are showing the most promise right now for users of formal theorem provers? What are the best current systems? (I think there will be vast disagreement here among the panelists.)
  2. (If there is disagreement) what can we do to improve this via better benchmarks, competitions, etc?
  3. What methods and ideas show the most promise in the future to take AI theorem proving to the AlexNet/AlphaGo/AlphaFold/ChatGPT moment where we really believe major progress is happening?

Patrick Massot (Dec 04 2023 at 15:33):

Question: how can actual users actually use those methods instead of listening to people talking about them or watching demos we cannot reproduce?

Kaiyu Yang (Dec 04 2023 at 15:42):

Question: how can actual users actually use those methods instead of listening to people talking about them or watching demos we cannot reproduce?

There are a few ongoing projects giving you access to these models directly in Lean, such as LLMStep and LeanInfer. We're currently working on augmenting LeanInfer with some new features and will probably release something new this month. If you have any comments on these tools (e.g., how their user experience can be improved), I'm more than happy to hear!
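
To give a rough idea of what using one of these looks like, here is a minimal, hypothetical sketch of the LLMStep workflow (the module name, setup, and suggested tactic are my assumptions and may differ between versions; the package also needs to be added as a Lake dependency, with its suggestion server running):

```lean
-- Hypothetical usage sketch; assumes the llmstep package is a Lake
-- dependency and its suggestion server has been started separately.
import LLMstep

example (a b : Nat) : a + b = b + a := by
  -- Interactively, you would write `llmstep ""` here; candidate next
  -- tactics appear in the infoview, and clicking one inserts it into
  -- the proof. Accepting a suggestion like the line below closes the goal.
  exact Nat.add_comm a b
```

LeanInfer works along the same lines, with its own in-editor suggestion tactic.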

Jason Rute (Dec 04 2023 at 15:57):

Patrick Massot said:

Question: how can actual users actually use those methods instead of listening to people talking about them or watching demos we cannot reproduce?

For Coq there is Tactician, which runs on a normal CPU and is not too hard to install via opam. It is better than people give it credit for, and for a while I've said there needs to be a Lean clone of it. The Isabelle people have been using Sledgehammer frequently and the Coq people, CoqHammer. Both, on the level of a normal CPU, are still probably top-tier systems.

Jason Rute (Dec 04 2023 at 15:59):

But yes, most of this field is just research software which only the authors know how to run.

Kaiyu Yang (Dec 04 2023 at 16:08):

With cloud release, I expect the current version of LeanInfer to be pretty easy to install and run on most systems (Linux, Mac, or Windows WSL, w/ or w/o GPUs), and we're happy to help if anyone opens an issue on GitHub.

Jason Rute (Dec 04 2023 at 17:38):

Kaiyu Yang said:

With cloud release, I expect the current version of LeanInfer to be pretty easy to install and run on most systems (Linux, Mac, or Windows WSL, w/ or w/o GPUs), and we're happy to help if anyone opens an issue on GitHub.

What does “cloud release” mean in this context? Just that it is released as a GitHub release, or that you intend to make a cloud interface to use the model?

Kaiyu Yang (Dec 04 2023 at 17:48):

Cloud release is a feature supported by lake. It allows me to upload the package's artifacts (e.g., binary libraries) to GitHub. When the user runs lake build in a downstream package, it will not re-compile everything in LeanInfer but fetch the artifacts I uploaded. LeanInfer uses FFI and depends on a few 3rd-party C++ libraries (OpenBLAS, CTranslate2, etc.); it can be challenging for downstream users to build it locally w/o cloud release.
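
For anyone curious what this looks like on the package side, here is a rough, hypothetical lakefile sketch (the package, repo, and library names are placeholders, and the option names follow my reading of Lake's cloud-release support, so they may differ across Lake versions):

```lean
import Lake
open Lake DSL

package myFfiPackage where
  -- Placeholder GitHub repo whose releases host the prebuilt artifacts
  -- (e.g., compiled object files and shared libraries).
  releaseRepo? := some "someuser/my-ffi-package"
  -- Prefer fetching the prebuilt archive over compiling locally, so a
  -- downstream `lake build` doesn't need the whole C++ toolchain.
  preferReleaseBuild := true

@[default_target]
lean_lib MyFfiPackage
```

The maintainer uploads the built artifacts to a GitHub release (via `lake upload`, if I remember correctly), and downstream packages then fetch them automatically on `lake build`.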

Alex Sanchez-Stern (Dec 04 2023 at 19:15):

So, by cloud release, they mean binary release? Either way you're downloading stuff from GitHub; it doesn't seem any more or less "cloud" than a normal release.

Alex Sanchez-Stern (Dec 04 2023 at 19:18):

I understand the urge to use buzzwords though

Alex Sanchez-Stern (Dec 04 2023 at 19:25):

Maybe if the binaries were being built on a server when you pushed your code as part of CI, the term "cloud release" would be more appropriate? Though that's how many binary releases work anyway, so maybe not.

Kaiyu Yang (Dec 04 2023 at 19:44):

I don't know why lake chose to use this term, but I guess "cloud" here just means "not local".

Alex Sanchez-Stern (Dec 04 2023 at 19:49):

You mean not "from source"? Usually the "release" doesn't refer to the build process, but to the process by which someone obtains the software. In either case, they are downloading the software from GitHub, so you could call it a "GitHub" release either way. By this definition of "cloud release", almost all software used everywhere is a "cloud release".

Alex Sanchez-Stern (Dec 04 2023 at 19:50):

I'm not trying to blame you for this language though; clearly it's the lake team that got a bit overzealous in their marketing.

Alex Sanchez-Stern (Dec 04 2023 at 19:52):

Though, if you're trying to communicate about the release to others, "binary release" is the term that most people will understand as referring to the kind of release you're doing, where the binary is pre-built.

Kaiyu Yang (Dec 04 2023 at 20:00):

Hmm.. "binary release" doesn't work because the artifacts are not restricted to binaries. I agree "GitHub release" might be more accurate.

Alex Sanchez-Stern (Dec 04 2023 at 20:03):

"Github release" would apply to both you original source release and the new release though, so it doesn't differentiate the two. Maybe "pre-built release" then, if you want to make it clear that some built artifacts are not binaries. Keep in mind though that this is still generally referred to as a "binary release" in other contexts. If you install a debian package for instance, since it's not built from source its considered a "binary release", even though it also include artifacts that are not program binaries, such as man pages.

Jeremy Avigad (Dec 05 2023 at 22:14):

I am sorry to be late to the party, but I just thought of a good topic for discussion. Research in ML is generally driven by benchmarks, which give objective ways to compare systems and measure progress. But that is sometimes orthogonal to the goal of having tools that are genuinely useful to mathematicians. Benchmarks don't measure whether the tools are straightforward to set up and use in practice, whether they actually work on the problems that people want them to solve, and whether they work in the contexts and ways that people want to use them.

This is really a prelude to Patrick's question. I expect we'd have more usable tools if researchers had better incentives to put in the effort needed to get them in the hands of everyday users, and if there were better ways of assessing and rewarding their success in that regard. For me, a popularity contest on Zulip is much more meaningful than any benchmark competition. We can let people vote every year on their favorite tools and see which ones come out best.

@Zhangir Azerbayev It would be great if there were some discussion of this in the tutorial. Think of it as the old-fashioned version of the alignment problem.

Jeremy Avigad (Dec 05 2023 at 22:34):

discussion of this

(I.e. the question of incentives, not the popularity contest per se!)

Alex Sanchez-Stern (Dec 05 2023 at 23:18):

Hmm, maybe there are really two tracks for comparing these tools. Some of them have the goal of making users' experience easier right now, while others have the goal of pushing the boundaries of what a tool will be able to do in 10 years. It's sort of a backend/frontend distinction, where the frontends can compete on popularity and ease of use, and the backends can compete on proving more theorems. For instance, Llemma can be used as a new backend for llmstep, reusing the frontend work.

From my own perspective, I don't think a 20%-ish proof completion rate for Coq on projects like CompCert (which is state-of-the-art) is really good enough to be useful to most users, so I haven't done much frontend work yet. But when the backends are powerful enough to get closer to a 60% completion rate, then they'll be really useful and work on frontends becomes more important.

Alex Sanchez-Stern (Dec 05 2023 at 23:19):

But of course, the problem is that we don't have good backend benchmarks either, so the ecosystem is very fragmented and it's hard to tell which backends are state-of-the-art.

Kevin Buzzard (Dec 06 2023 at 00:07):

I thought of a less grumpy-old-man question than my earlier one (for which I apologise). I met Christian Szegedy at AITP in 2019 and I asked him when computers would enable me to retire early and he said "within 10 years". And I told him that I would remember the date 2029. Since then he came out with this

https://x.com/ChrSzegedy/status/1534082344096702464?s=20

and then this tweet making the claim more precise

https://x.com/ChrSzegedy/status/1534250947811024896?s=20

Do people think that these two claims of Szegedy's (autoformalisation of 100 textbooks, solution by machine of ten interesting open human conjectures by 2029) are optimistic or pessimistic?

Wenda Li (Dec 06 2023 at 00:24):

I don't think you will be retiring before 2029, Kevin. Mathematics is not just about proofs. To me, definitions are the hardest part. Even following a textbook, coming up with the right definition in a proof assistant is non-trivial, and it is hard for me to imagine how a machine learning model can achieve it like a pro mathematician in the near future.

Jason Rute (Dec 06 2023 at 01:37):

Alex Sanchez-Stern said:

But of course, the problem is that we don't have good backend benchmarks either, so the ecosystem is very fragmented and it's hard to tell which backends are state-of-the-art.

To emphasize how bad this is, I don't know what would do better on a CPU laptop in proving theorems within 10 minutes: CoqHammer/Sledgehammer (ATP backend), MagnusHammer (premise selection + powerful tactics), LeanInfer/LLMStep (language models), Proverbot9001/ASTactic (simpler neural networks), Tactician (simple but fast machine learning backend), or some new GPT-4-based system. It probably would depend on the sort of benchmark, but I don't know. I think the only things we really know for sure are that combining methods, both naively (running two different solvers for 5 minutes each) and smartly (like Thor), really helps, and that reinforcement learning (or expert iteration) works. (I'd like to be proved wrong.)

Sid (Dec 06 2023 at 12:20):

Zhangir Azerbayev said:

Hi everyone,

At this year's NeurIPS, I will be moderating a panel discussion at the Tutorial on Machine Learning for Theorem Proving made up of a number of acclaimed ML researchers. I'm starting this thread to solicit questions from the Lean community. If you have any questions about what formal math can do for machine learning and what machine learning can do for formal math, ask away!

How serious is the problem of data contamination? For example, solutions for a large portion of miniF2F are likely discussed in some form or another across the internet. Can this lead to a boost in performance for those problems even when the LLM is used to guide exact provers rather than directly generating a natural language proof?

Alex Sanchez-Stern (Dec 06 2023 at 18:03):

Even worse than that: the formal code for miniF2F was first committed to GitHub in May 2021, a month before GPT-3.5's training cutoff date. So not only has the model seen discussions of the problems, but it's actually trained on formal solutions to them. From what I can tell, no LLM-based tool has been evaluated on data that wasn't in its training set, since most major Coq, Lean, and Isabelle projects were present on GitHub before the training cutoff for GPT-3.5, which trained directly on GitHub code.

Jason Rute (Dec 06 2023 at 19:59):

While I agree on skepticism here, I think what you are saying @Alex Sanchez-Stern is a tiny bit too strong. Many MiniF2F test problems don't have published proofs (although some do :sad: ). Also, in the appendix of the PACT paper, we do a very careful evaluation where we test on theorems added to Lean after the models were trained. I think the first evaluation of MiniF2F used models which were trained before MiniF2F. Also, some LLM models have experiments or ablations where they use non-pretrained models. Also, Lean grows so fast that it is reasonable to test models on new theorems which came after pretraining.

Jason Rute (Dec 06 2023 at 20:01):

Of course, this is really tricky, and different levels of data contamination are a huge problem. There is also the issue of "soft" data contamination, where a solution of a similar theorem is in the data. Maybe some partial progress on a theorem discussed on Zulip, or the same theorem in a different prover, or the natural language version of the theorem, or, in the case of competition problems, the same problem with different numbers but where the main trick is the same. At some point it isn't even clear what is cheating and what is efficient use of background knowledge.

Alex Sanchez-Stern (Dec 06 2023 at 20:14):

Ah, I guess in your PACT paper you trained your own LLM from scratch, including pre-training? The paper says "we use decoder-only Transformers similar to GPT-3", and pretrained on WebMath. So, that would be safe from the kind of contamination I was talking about. Maybe I should refine my statement to "From what I can tell, only one LLM-based tool...". Of course, to do so, you had to use a particularly small "large" language model, ten times smaller than the smallest Llama model.

Most other LLM work isn't so careful though; many are based on GPT-3 or other internet-trained public large models, causing this problem. The problem with ablating on non-pretraining is that since it involves a significant drop in accuracy, it's not clear how much of that drop is because pre-training is generally useful and how much of it is because of test leakage. And for the tools based on large public models, it's just infeasible. I haven't seen any LLM-based prover evaluations where they only test on theorems from past a particular date; do you have any pointers to those?

Jeremy Avigad (Dec 06 2023 at 20:19):

Hmm, maybe there are really two tracks for comparing these tools. Some of them have the goal of making users' experience easier right now, while others have the goal of pushing the boundaries of what a tool will be able to do in 10 years.

That's a reasonable distinction, but I'd still argue that pushing the boundaries of what a tool will be able to do in 10 years requires paying attention to what mathematicians want the tools to do right now. Otherwise, it's not at all clear whether the long-term research is making progress toward anything genuinely useful.

Alex Sanchez-Stern (Dec 06 2023 at 20:24):

Hmm, you think automatically being able to prove more theorems is potentially not useful? Of course there are a lot of interesting questions to be asked outside of "theorems being proved", like the ability to help with other parts of the mathematician's workload such as formalization and lemma decomposition, and the abstraction boundary for those is definitely unclear, calling for more real-world usage and user studies. But in the domain of "given a lemma, prove it for me", it seems like the abstraction boundary is pretty well defined, and there is a lot of certainly-useful progress to be made in widening the set of theorems that can be proven, regardless of how people end up using the tools. Of course one could argue that the types of theorems that we're working on proving aren't necessarily the ones people want proved, but that comes down to good benchmark selection.

Jeremy Avigad (Dec 06 2023 at 20:43):

The benchmarks are focused on proving certain types of theorems expressed in certain ways in certain libraries with certain types of prior training and certain contextual information. Does that make them better at proving the kinds of theorems that we really want them to prove in the contexts that we really want to prove them? Testing the tools in the wild would provide some reassurance.

Jason Rute (Dec 06 2023 at 21:11):

@Alex Sanchez-Stern You misunderstand. You said "all" LLM tools, and I was just pointing out rare exceptions. Our PACT work does pretrain and is heavily prone to data contamination, and that appendix I talk about discusses that at length. In that appendix, one evaluation is on theorems which came after training the model, so "hard" contamination is physically impossible. But I don't disagree that this is a pervasive problem without a good understanding of whether it matters in practice (or even how to measure whether it matters in practice).

Jason Rute (Dec 06 2023 at 21:21):

As for other papers which test on theorems past the light cone of training, I think @Sean Welleck said he wanted to wait for some more new theorems in Mathlib before testing LLMStep for this very goal. I also know @Zhangir Azerbayev talks about the concept occasionally, so he might know of some papers which do this. I think occasional competitions with brand new problems would be helpful. The IMO Grand Challenge (now AI-MO) is an extreme example of this.

Jason Rute (Dec 06 2023 at 21:23):

And this is really an issue affecting all of AI right now, not just theorem proving.

Alex Sanchez-Stern (Dec 06 2023 at 21:39):

Oh I'm sorry, I didn't mean to misrepresent your work. By "one evaluation is on theorems which came after training the model", are you referring to the experiment in paragraph 3 of section B1 of this paper? It seems to be removing theorems with the same name as those in the training set from the test set. I wouldn't necessarily call that testing on theorems which came "after"; there doesn't seem to be a temporal dimension. A theorem which was based on one from the training data but named something different wouldn't be excluded, right?

Alex Sanchez-Stern (Dec 06 2023 at 21:58):

Or did you mean a different experiment?

Jesse Michael Han (Dec 06 2023 at 22:00):

he meant this paragraph at the end of section 4
image.png

Jason Rute (Dec 06 2023 at 22:06):

Sorry, I was typing on my phone and probably not very clear. Here are all the relevant experiments from the PACT paper related to data contamination:

  1. Figure 3 in the main text shows training without WebMath pretraining. Since WebMath only uses arXiv, Stack Exchange, and Python GitHub code, it probably wasn't a big source of data contamination anyway. Also, this still uses a pre-trained model (including Common Crawl pretraining) where data contamination is possible.
  2. The "time-stratified evaluation" on p 8 tests on theorems which came after the first version of the paper. In this case, "hard" data contamination where the exact same theorem is in train and test is impossible. (Soft forms of data contamination, including the possibility that a small number of these are theorems which were renamed, is still possible.) Also, note that the distribution changed a lot from the main test set. These might be harder problems. Also, unlike the main test set, they contain definitions and require lemmas not seen during training.
  3. In Appendix B.1 we see what happens if we remove all theorem names found in the test set. Of course again, there could be soft contamination where the theorem was like you suggested based on another theorem found in the training set, but I imagine this was rare if it happened at all.
  4. In Appendix B.1 we try restricting to test theorems from the main test set, but which were added after either (1) the pre-training set was created (April 18, 2020), or (2) the WebMath training set was created (September 11, 2020). The first doesn't remove too many theorems. The latter removes a lot more and changes the distribution a lot.

None of these evaluations are conclusive, but they do help to explore the situation. And again, various levels of soft (and maybe hard) contamination are certainly possible. Also, even the ProverBot and CoqGym papers could have some forms of contamination where the same theorem shows up twice, for example in two different libraries, one used for training and one for testing. I don't know if this was checked.

Jason Rute (Dec 06 2023 at 22:09):

This sort of evaluation should be a lot easier to do on models like Llemma where all the data is public. I know the Llemma authors tried to remove test data from their training (but it still might have ended up in the pretraining for Llama or Code Llama).

Alex Sanchez-Stern (Dec 06 2023 at 22:55):

Ooh I see, the "time-stratified evaluation" from section 4 makes a lot of sense for that claim. I guess I was just thrown off because you talked about it being in the appendix.

Your evaluation seems really good either way, though it would be hard to scale to a larger LLM, which might be part of the reason others don't do as much diligence. Generally it's the pre-training leakage that is hardest to avoid and most common, not the fine-tuning, so Llemma using public data to tune doesn't give me much confidence; it's in the base Llama training where the real leakage lies.

Jason Rute (Dec 06 2023 at 23:11):

But the Llama base model pretraining data is public, I think, so the Llemma authors could (hint, hint!) do a similar analysis. Even someone else could do an in-depth analysis and write a paper on it.

Alex Sanchez-Stern (Dec 06 2023 at 23:26):

Yeah, you're right: with Llama they could do an analysis and remove pre-training data from testing; that would be a much better evaluation for Llemma. I'd be interested to see the results of such a non-leaking evaluation.

It can be hard to find benchmarks that aren't on GitHub, or that appeared on GitHub for the first time after February of 2023 (the Llama paper doesn't specify a cutoff date, but was published then). Llama 2 specifies a cutoff of September 2022, which is a bit better, but Meta decided to stop making the training data sources available with Llama 2, so while you'd be able to use anything that appeared on GitHub after 9/2022, you would also have to avoid other software repositories, like GitLab, before that date.

Zhangir Azerbayev (Dec 07 2023 at 00:06):

@Jason Rute We've done a past-the-light-cone evaluation of a few language models, including Llemma, on the Hungarian national math exam. The major takeaway from these results is that the well-known pretrained models (e.g., GPT, Llama, Llemma, Grok) generalize well to guaranteed unseen data, whereas most of the open-source finetunes are overfit to the benchmarks.

The results above are all informal, but when we get around to releasing Llemma-f we'll use future mathlib as a test set.

Zhangir Azerbayev (Dec 07 2023 at 00:25):

My basic takes on evaling language models in 2023 are

  1. The models are now good enough that you don't have to create artificial benchmarks. You can use "natural benchmarks" that are past the light-cone of your training set, such as exams, olympiads, recent Lean projects, etc.
  2. Generally speaking, don't restrict yourself to evals that can be scored automatically. ProofNet autoformalization and the Hungarian math exam required grading solutions by hand and it was worth it. Obviously, this consideration doesn't apply to formal theorem proving.
  3. If you choose to use one of the well-established benchmarks (e.g., MATH or MMLU), in my experience data contamination doesn't affect results much for pretrained models. See sec 3.5 of the Llemma paper. Although this might change for larger models that are better at memorizing the training set.
  4. Point 1 being made, real users are the best test set (provided you can serve your model efficiently).

Alex Sanchez-Stern (Dec 07 2023 at 00:42):

Hmm, I've definitely seen ChatGPT spit out code verbatim from the CPDT book; we tried to test one of our lemma-finding benchmarks on it, but it figured out that we were using an example from CPDT and included the original variable names in the solution (which weren't in the prompt). Maybe it's more able to memorize data from formal proofs and code than it is for informal proofs? Not sure; the leakage problem might be very different from a formal vs. informal perspective.

Zhangir Azerbayev (Dec 07 2023 at 00:48):

Alex Sanchez-Stern said:

Hmm, I've definitely seen ChatGPT spit out code verbatim from the CPDT book; we tried to test one of our lemma-finding benchmarks on it, but it figured out that we were using an example from CPDT and included the original variable names in the solution (which weren't in the prompt). Maybe it's more able to memorize data from formal proofs and code than it is for informal proofs? Not sure; the leakage problem might be very different from a formal vs. informal perspective.

This is the "this might change for larger models" caveat biting. I recall that a while ago Dan Hendrycks said something like "if something is on the internet more than four times, GPT-4 knows about it". I've found that to be a reasonable heuristic. However, if you're evaling something more like a Llama 7B model, I wouldn't worry much about data contamination (unless it's in the finetuning data).

Alex Sanchez-Stern (Dec 07 2023 at 00:53):

Hmm, interesting. So somewhere between 7B and 175B, models gain the ability to spit out proof data verbatim from the pre-training data. Hopefully someone will explore that a bit more to find out where the line is at which pre-training leakage becomes a serious problem. Do you have any data or experiments that show that Llama 7B can't spit out its training data that I could look at?

Zhangir Azerbayev (Dec 07 2023 at 00:59):

Alex Sanchez-Stern said:

Hmm, interesting. So somewhere between 7B and 175B, models gain the ability to spit out proof data verbatim from the pre-training data. Hopefully someone will explore that a bit more to find out where the line is at which pre-training leakage becomes a serious problem. Do you have any data or experiments that show that Llama 7B can't spit out its training data that I could look at?

Sec 3.5 of the Llemma paper.

I also wouldn't characterize it as a binary "can regurgitate proofs from the training data" or "can't regurgitate proofs from the training data". For a fixed model, memorization will depend on how many times something appears in the training set. For example, it doesn't take a particularly strong language model to memorize the proof that no rational number squares to 2, but even GPT-4 likely won't recall a proof it's seen only once. In Llemma, we found that for our models and for the frequency that MATH problems appeared in our training set, there wasn't much memorization going on.

Alex Sanchez-Stern (Dec 07 2023 at 01:11):

Oh, for some reason I thought that section 3.5 of the Llemma paper only looks at leakage of the tuning data, not the pre-training data. It says it checks for test hits in the WebMath and AlgebraicStack datasets, but not the GitHub data that the base Llama model was trained on. Maybe the assumption is that if something doesn't get memorized in tuning, then it won't get memorized in training because training occurred farther in the past?

Zhangir Azerbayev (Dec 07 2023 at 01:27):

Alex Sanchez-Stern said:

Oh, for some reason I thought that section 3.5 of the Llemma paper only looks at leakage of the tuning data, not the pre-training data. It says it checks for test hits in the WebMath and AlgebraicStack datasets, but not the GitHub data that the base Llama model was trained on. Maybe the assumption is that if something doesn't get memorized in tuning, then it won't get memorized in training because training occurred farther in the past?

The effect size of contamination in the more recently seen math data should be at least as high as the effect size of contamination in the code/English pretraining.

Zhangir Azerbayev (Dec 07 2023 at 02:39):

Wenda Li said:

I don't think you will be retiring before 2029, Kevin. Mathematics is not just about proofs. To me, definitions are the hardest part. Even following a textbook, coming up with the right definition in a proof assistant is non-trivial, and it is hard for me to imagine how a machine learning model can achieve it like a pro mathematician in the near future.

When I took undergrad NLP, much of the first lecture was devoted to making us appreciate how difficult the field's problems were. There was a laundry list of examples such as Winograd schemas, Gricean maxims, abstract literature, and so forth that professors used to make solving NLP sound impossible. However, NLP is now more or less solved, and it wasn't solved via insight into these difficult subproblems, but by formulating an extremely general learning objective such that when a massive amount of compute is applied, all the NLP tasks fall out as a byproduct.

I expect something similar to happen in mathematics. We will eventually discover a simple learning rule that incentivizes a model to keep growing its knowledge of interesting mathematics. Once we feed this learning rule with sufficient compute, stochastic gradient descent will figure out all the hard subproblems.

So I wouldn't despair if there's some aspect (e.g., formulating definitions) of mathematical practice that feels intractable. All I worry about is doing something that scales well with compute.

Also see this blog post from 2012, and compare it to the state-of-the-art now.

Shreyas Srinivas (Dec 07 2023 at 15:29):

Would it really matter if LLMs could prove theorems if we couldn't understand what the proof was or how it was arrived at?

Adam Topaz (Dec 07 2023 at 15:33):

Yes, it would matter. We don't do math because we want to prove the most theorems. We do math because we want to understand what is going on.

Adam Topaz (Dec 07 2023 at 15:41):

Maybe I misunderstood what that was in response to... sorry. Are you referring to LLMs spitting out proofs from their training data?

Shreyas Srinivas (Dec 07 2023 at 15:44):

Adam Topaz said:

Maybe I misunderstood what that was in response to... sorry. Are you referring to LLMs spitting out proofs from their training data?

Yes. As you said just now, we want to understand what is going on. So what's the point of some DNN spitting out an incomprehensible proof that just happens to get past a theorem prover? We wouldn't even know if it was exploiting a bug in the kernel.

Junyan Xu (Dec 07 2023 at 16:05):

I think the good old ATP methods are much more likely to spit out incomprehensible proofs than LLMs :)

Shreyas Srinivas (Dec 07 2023 at 16:12):

Fair point, but that is also likely to be one of the reasons why they aren't used outside that specific niche despite their more powerful automation. It isn't exactly fun to keep manipulating terms with trial and error until the bit of hocus pocus at the last line works.

Shreyas Srinivas (Dec 07 2023 at 16:13):

What I am trying to get at is that, even with powerful LLM-based provers, mathematicians will be needed to understand what they are saying and apply them along productive directions. Therefore elegant and insightful proofs should probably be the gold standard, not superhuman AI mathematicians that can somehow just prove complex theorems without being comprehensible to trained humans.

Jason Rute (Dec 07 2023 at 17:58):

I think you are jumping ahead many generations of this tech. Right now, if AI could fill in basic lemmas with readable, human-style Lean/Coq/Isabelle code, that would significantly improve the user experience. We are already seeing the advantages of this sort of thing in the space of regular programming.

As for the distant future where AI can solve open problems with pages of computations, hopefully those same systems can be used more like scientific exploration tools, where we can query them with pointed questions so a human can begin to understand why the result is true. Also, it isn't completely crazy to expect that such powerful super mathematicians could also be super theoreticians, coming up with beautiful new areas of mathematics, and super teachers, generating human-focused explainers carefully explaining these concepts to us mortals. Of course this is science fiction, but I think that is what we are talking about here, no?

Jason Rute (Dec 07 2023 at 18:30):

Put another way, I think the 1900s fear of AI being some left-brain automaton outputting incomprehensible but logically correct facts, with no greater awareness of human communication norms, is largely different from the current AI systems, which output beautiful prose, careful explanations, and human-style output, only to have it be devoid of any logical content. When we figure out how to align these two sides of the coin, I think we will find that making the outputs human-readable isn't going to be as much of a problem as you fear. Indeed, ChatGPT and the much better LLM-based tree-search provers like ReProver already output fairly readable Lean proofs, since they were trained on human proof data.

Jason Rute (Dec 09 2023 at 13:24):

This is an addendum to my previous question: Do you think there is much value in working on AI for theorem proving specifically versus working on general AI? Conversely, will the problems and difficulties in AI for theorem proving just be solved more or less automatically by more general advancements in AI, brought about by folks who have little knowledge of theorem proving, with only minimal effort needed to apply those general solutions to this particular subdomain?

Jason Rute (Dec 11 2023 at 12:02):

Remind me. Will any part of this be recorded and later made available? If so which parts? The tutorial? The panel discussion?

Jason Rute (Dec 11 2023 at 22:57):

How was the tutorial and panel discussion?

Alex Sanchez-Stern (Dec 11 2023 at 23:07):

Good! I hope it's recorded in some way, there was some good discussion.

Jason Rute (Dec 12 2023 at 02:06):

When I look at the 2022 tutorials, they do seem to be recorded, including the panels. [Edit: I have no idea if that took hours, days, weeks, or months to put up. I'll just have to be patient.]

Jason Rute (Dec 12 2023 at 02:06):

I imagine (and hope) the panel had very different opinions, which I think leads to good discussions.

Kaiyu Yang (Dec 12 2023 at 06:36):

Hi, regarding the video recording, I believe it includes the panel and will be released some time after the conference. I'm checking with the conference's tutorial chairs to confirm and will post an update later.

Min-Hsien Weng (Dec 12 2023 at 21:57):

Kaiyu Yang said:

Hi, regarding the video recording, I believe it includes the panel and will be released some time after the conference. I'm checking with the conference's tutorial chairs to confirm and will post an update later.

Great! Love to see the new "bring your own LLM" feature included in the tutorial!

Abhishek Anand (Dec 13 2023 at 00:16):

I just logged into nips.cc with my virtual pass and can already see the recording. So I can at least confirm that the entire ML for theorem proving session was recorded, including the panel discussion. Hopefully the video will be released to the public soon.

Jason Rute (Dec 13 2023 at 18:57):

From https://nips.cc/virtual/2023/index.html it says:

All recordings will be available about a month after the conference.

Kaiyu Yang (Dec 19 2023 at 04:03):

Slides & demos are available here: https://machine-learning-for-theorem-proving.github.io/. The video recording will be released one month after the conference and will be updated on our website.

Junyan Xu (Dec 19 2023 at 05:03):

Thanks! But Demo: Training LLMs for Tactic Generation and Combining with Proof Search is a dead link as of now, and LeanCopilot links to LeanDojo.

Kaiyu Yang (Dec 19 2023 at 05:13):

Junyan Xu said:

Thanks! But Demo: Training LLMs for Tactic Generation and Combining with Proof Search is a dead link as of now, and LeanCopilot links to LeanDojo.

Permission issue, fixing it now.


Last updated: Dec 20 2023 at 11:08 UTC