Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Bangalore talk on LLM+ITP


Kevin Buzzard (Apr 11 2023 at 08:19):

I gave a talk last week in Bangalore (it's not online, as far as I know) about where I (as a mathematician who knows something about theorem provers but has only a passing understanding of things like large language models) think we are when it comes to computers taking my job. Here are the slides.

talk.pdf

I post them here for two reasons. Firstly I am mildly concerned that people are now asking me to give talks about large language models etc, when I am certainly not an expert in them and the total number of queries I have (consciously) made to any such system is three, namely the three ChatGPT queries I report on in the above talk. Note also that I am still very much a skeptic: a large part of me is still utterly baffled about why everyone is making such a fuss about these systems when it is so trivial to get them to talk absolute nonsense (see the ChatGPT proof that there are infinitely many even numbers ending in 7 which I highlight in the talk). So if there are people who think I'm being unfair about these models then perhaps I should be told sooner rather than later, because I don't want to be spreading misinformation publicly.

The second reason is that I am slowly trying to piece together what I think should be the "correct" approach to world domination of mathematics by computers, which I am proposing is some kind of LLM/ITP joint venture. My ideas are still extremely vague and again it would be good if I were shot down quickly if I'm talking nonsense.

Any comments would be appreciated! The reason I am asking publicly now is that I am in the process of writing a brief survey article about this stuff for a mathematics journal, and in contrast to the transient nature of talks, the things I say in the survey article are actually going to end up in the mathematics literature, so should ideally not contain major misunderstandings on my part.

Siddhartha Gadgil (Apr 11 2023 at 08:23):

Just one comment: many criticisms are, as you said, on the grounds that "it is so trivial to get them to talk absolute nonsense". The real question is whether it is possible to get them to talk enough sense.

Scott Morrison (Apr 11 2023 at 10:03):

@Kevin Buzzard, this is the point of sagredo --- of course it's easy for LLMs to spout nonsense, but Lean is really good at detecting nonsense, so we should have the two talk to each other! Moreover GPT has been trained as a conversationalist, so receiving feedback about its mistakes results in surprisingly good outcomes (sometimes).
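To make this loop concrete, here is a minimal sketch of the propose/check/retry pattern being described. This is not the actual sagredo implementation; `ask_llm` and `check_with_lean` are hypothetical callbacks standing in for a chat-model API and a Lean compilation step.

```python
from typing import Callable, Optional

def prove_with_feedback(
    statement: str,
    ask_llm: Callable[[list[str]], str],         # hypothetical: conversation -> candidate proof
    check_with_lean: Callable[[str, str], str],  # hypothetical: (statement, proof) -> "" or error text
    max_rounds: int = 5,
) -> Optional[str]:
    """Sketch of the LLM <-> Lean loop: propose a proof, check it, feed errors back, retry."""
    conversation = [f"Prove the following Lean theorem:\n{statement}"]
    for _ in range(max_rounds):
        candidate = ask_llm(conversation)               # LLM drafts a proof attempt
        errors = check_with_lean(statement, candidate)  # Lean tries to compile the attempt
        if not errors:
            return candidate                            # Lean accepted the proof
        # Otherwise, append Lean's complaints and ask for a revision.
        conversation += [candidate, f"Lean reported these errors, please fix them:\n{errors}"]
    return None                                         # gave up after max_rounds
```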

Scott Morrison (Apr 11 2023 at 10:04):

Still, it is nowhere close to actually doing any mathematics in any sense whatsoever. Maybe in the near future it will be a good tool for writing proofs so boring that they would only ever be written down while formalizing.

Shreyas Srinivas (Apr 11 2023 at 10:55):

In the context of this discussion, I have a question for AI folks: do we need LLMs to automate Lean proofs? How would that compare to just training a large NN on, say, mathlib?

Kevin Buzzard (Apr 11 2023 at 10:59):

I guess something I don't understand is @Daniel Selsam's position that at least for the IMO grand challenge, large language models are what is needed. Daniel, do you think that you can get gold on the IMO without an ITP? I think that my talk is somehow trying to figure out how this could happen, and failing to convince myself that it could. If my understanding of your position is correct, how far do you think LLMs can go in terms of mathematics? Scott's approach seems far more promising to me, but here I have no feeling whatsoever about how far it might go. For the LLM alone I'm finding it really hard to imagine it going much further (but I'm not an expert, which is exactly why I'm asking).

Shreyas Srinivas (Apr 11 2023 at 11:28):

Kevin Buzzard said:

I guess something I don't understand is Daniel Selsam's position that at least for the IMO grand challenge, large language models are what is needed. Daniel, do you think that you can get gold on the IMO without an ITP? I think that my talk is somehow trying to figure out how this could happen, and failing to convince myself that it could. If my understanding of your position is correct, how far do you think LLMs can go in terms of mathematics? Scott's approach seems far more promising to me, but here I have no feeling whatsoever about how far it might go. For the LLM alone I'm finding it really hard to imagine it going much further (but I'm not an expert, which is exactly why I'm asking).

If it is any consolation, CS theorists don't know either. Deep learning is one of those areas of CS where the theory is well behind practical usage.

Jason Rute (Apr 11 2023 at 12:27):

a large part of me is still utterly baffled about why everyone is making such a fuss about these systems when it is so trivial to get them to talk absolute nonsense (see the ChatGPT proof that there are infinitely many even numbers ending in 7 which I highlight in the talk). So if there are people who think I'm being unfair about these models then perhaps I should be told sooner rather than later, because I don't want to be spreading misinformation publicly.

A lot of AI optimism is due to extrapolation of history.

  1. There has been a lot of progress in the past decade. For example, consider image generation. A year ago, the best image generators were specialty GANs which could generate one type of image (faces, shoes, etc.) or general-purpose image generators which could only produce very low resolution images or dream-like abstract images. But with the introduction of DALL-E 2, we are now able to generate images of most things, and these are just getting better and better (to the point where fake photo-realistic images are a real concern). Also, the research has moved from pure academics to new companies, apps, and hobbyist communities which are pushing the state of the art at lightning speed. (And we are seeing the same thing with language models.)
  2. The methods of AI are quite general. There are a few standard neural network architectures like Transformers, and they are very well suited to a wide range of tasks. This means it is easy to do research on more standard domains like natural language processing or image processing, but then use the same methods (sometimes out of the box) on esoteric domains like interactive theorem proving. Magnushammer (#Machine Learning for Theorem Proving > New paper: Magnushammer), for example, uses very standard ideas in text retrieval which have nothing to do with theorem proving or premise selection (or even mathematics/code).
  3. Scaling in terms of both model size and data gives these models magical properties. Very large models like GPT-4 which are trained on a lot of data have a huge ability to generalize to new situations. One very impressive feat was the auto-formalization paper which came out less than a year ago. With very little Isabelle data (and even less of it aligned to natural-language mathematics) it was able to translate theorem statements. Then the Draft-Sketch-Prove paper showed it was able to translate theorem proofs after generating them in natural language. I think one way to think about this is that these models now come with general-purpose knowledge, not unlike undergraduate students entering college. You don't have to teach them to read and write and what every single English word means. They also have a lot of intuitions and biases about the world (good and bad) like we have. The trick now is how to teach and guide these models, just like we can teach a student. That is still unsolved, but it is seeming less unrealistic by the month.

Now, all that said, yes, these models are mostly crap at mathematics. And it is hard to predict the specifics of the future. Is scaling enough? Is reinforcement learning the answer? Or better decoding methods? Or LLM/ITP interaction? Or neural-symbolic methods of another sort? Or embodied AI/robotics? Or world-models? I don't have a clear answer. But I think two general things are true: (1) this field is moving quickly and a lot of things which seem impossible this year will seem normal next year (or the year after, etc.), and (2) until these large models are trustworthy, having an unforgiving ITP-like system checking the result (likely in a collaborative manner) will be a huge value-add in domains where it is possible to do this.

Jason Rute (Apr 11 2023 at 12:37):

And (3) a lot of the more ambitious predictions are over-selling and will take years or decades longer than predicted.

Kevin Buzzard (Apr 11 2023 at 12:43):

Right. The thing which I'm kind of wrestling with right now is that although I do understand that there has been exponential improvement recently, I find it hard to infer from this that exponential improvement will continue. I know it's not maths, but I've been around in maths for long enough to know how it works, and how it works is that there are occasional new insights which suddenly lead to important new progress -- theorems being proved that the experts thought were inaccessible -- and then the ideas are understood conceptually and are taken as far as they will go, and then progress calms down again.

Jason Rute (Apr 11 2023 at 12:50):

I think another thing we as a community have a tendency to do is tend toward extremes. Either these systems are crap, or they can solve the Riemann Hypothesis. I think if we could get a system at the level of an undergrad REU student, we could find all sorts of good uses for it. It could prove the correctness of boring but important stuff, like the fact that some computer code follows a specification, or do some long calculation that we don't care to do. I think people have been saying for a while now that GPT-like language models are too untrustworthy to be useful, but once ChatGPT was released it turned out to be very useful to a lot of people (to the point that they pay for it). Once the theorem proving, coding, and Lean skills in these systems are a bit better, certain folks who know how to get the most out of them could find a way to use them to do really cool stuff.

Shreyas Srinivas (Apr 11 2023 at 13:10):

The "exponential improvement" thing has been happening more or less for 10 years now. Also nobody understands the limits of these systems. At what point do they stop generalising and just memorise stuff? How sensitive are they to being manipulated (see : https://cs.stanford.edu/~mpkim/pubs/undetectable.pdf)? So the hype makes sense. You just keep trying new architectures and applying them to new domains.

That said, I wish there were tools that focussed on making it easy to do small things (for example, selecting the most likely next step in a proof if I sketch it) and doing them well, rather than trying to do everything at once. Or tools that can identify quick fixes to proofs when some definition or theorem statement changes.

Fabian Glöckle (Apr 11 2023 at 13:15):

Shreyas Srinivas said:

In the context of this discussion, I have a question for AI folks: do we need LLMs to automate Lean proofs? How would that compare to just training a large NN on, say, mathlib?

The reason is "data scarcity": mathlib is nowhere close to the dataset sizes required for effective machine learning. So the approach in papers like https://arxiv.org/abs/2102.06203 and https://arxiv.org/abs/2205.11491 is to:

  • first, pretrain on natural language (to get the transformer's information processing circuits going),
  • then, train on supervised data like mathlib,
  • then generate your own data from proof searches and continue training on that data (a rough sketch of this last loop is below).
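A rough sketch of what that last bullet (sometimes called expert iteration) can look like; `sample_proof_search`, `lean_accepts`, and `finetune` are hypothetical stand-ins, not an API from either paper.

```python
def expert_iteration(model, statements, sample_proof_search, lean_accepts, finetune, num_rounds=10):
    """Sketch of the third step above ("expert iteration"): search for proofs with the
    current model, keep only the proofs that Lean accepts, and keep training on that
    self-generated data. The three callbacks are hypothetical stand-ins, not a real API."""
    for _ in range(num_rounds):
        new_data = []
        for statement in statements:
            proof = sample_proof_search(model, statement)        # model-guided proof search
            if proof is not None and lean_accepts(statement, proof):
                new_data.append((statement, proof))              # keep only verified proofs
        model = finetune(model, new_data)                        # continue training on them
    return model
```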

The qualitative change that we may have witnessed with current LLMs, or might be witnessing soon, is that they allow us to disentangle two components of doing mathematics: mathematical knowledge and mathematical reasoning capability. A human mathematician, from a certain point on, may no longer learn new ways of manipulating symbols according to certain rules in their mind but rather learn new symbols and rules for manipulation, i.e. acquire new knowledge. In easy domains, LLMs have already reached the stage where the rules and knowledge of a new domain can be explained to the model in text, and the model then interacts sensibly with this new environment.

Generally @Kevin Buzzard, I think the hype around LLMs for math is not so much because they have read many proofs and sometimes reproduce them correctly in a text interface, but rather because they show an "emergent reasoning capability" that no one explicitly trained them to have but that just appears in large models trained on a complicated and interesting distribution. And if you want to do mathematics on computers, it is a reasonable assumption that the programs that are best at adapting to a large variety of challenging tasks will also be best at doing mathematics.

Kevin Buzzard (Apr 11 2023 at 13:25):

If "emergent reasoning capability" means anything like "I learnt the proof that there are infinitely many prime numbers and now can generalise it to a completely false but plausible-looking proof that there are infinitely many prime numbers which end in 7" then I seriously question whether this is actually useful for mathematics. To give another example: Diophantine equations (solving polynomial equations in integers or rationals) is super-super-brittle in the sense that if you slightly change the input then you can completely change the nature of the problem. For example finding all the rational solutions (x,y)(x,y) to x4+y4=16x^4+y^4=16 (obvious solution: (2,0)) was a problem solved in the 1700s, but finding all the rational solutions to x4+y4=17x^4+y^4=17 (obvious solution: (2,1)) was a problem which was only solved in the late 1990s after decades if not centuries of breakthroughs in conceptual understanding. One needs a lot more than the ability to reason by analogy to do mathematics; one needs a profound conceptual understanding of what is going on. I am concerned that the AI enthusiasts have no real understanding of how phenomenally profoundly difficult mathematics actually is.

Fabian Glöckle (Apr 11 2023 at 13:39):

I cannot speak for everyone, and I fully agree with what @Jason Rute wrote above about the tendency for extremes. Concerning your example, I think there are still orders of magnitude between what might count as a breakthrough in LLM reasoning and the type of research mathematics you describe.

Certainly, the future will not simply be: let's read in all mathematical textbooks and articles and, given a research problem, generate the article that solves it from beginning to end. If there were training data the way it exists for humans (lectures mixing formal and casual explanation styles, tutorials which explain the process of finding solutions, working with others and learning from their solving techniques, feedback from others, PhD supervisors, reviewers, ...) it would be way easier to learn this "process of doing mathematics" that you have in mind (once sufficiently "smart" models are available to make sense of these inputs).

Kevin Buzzard (Apr 11 2023 at 13:45):

What do people think reasonable short- to medium-term goals are for LLMs and mathematics?

In my talk I suggest a chatbot which can operate at beginner PhD level. I have seen what looks to me like good progress towards informal-to-formal translation of mathematical statements. When will we see the statements of the theorems in the Stacks project being accurately translated into Lean? This is something I mention in the talk. I've seen LLMs translating basic undergraduate mathematics into Lean but I have no understanding of how well this scales to modern algebraic geometry.

Kevin Buzzard (Apr 11 2023 at 13:45):

Note that I'm not asking for proofs.

Fabian Glöckle (Apr 11 2023 at 13:50):

"Emergent reasoning capability" would for example mean that you can explain the rules of a new game or a made-up programming language to the model and it will interact "sensibly" with it (i.e. play legally and not stupidly or write coherent small programs for simple tasks).
This is where we are already for simple domains, and currently both model capabilities and bars are moving.

Fabian Glöckle (Apr 11 2023 at 13:53):

Kevin Buzzard said:

What do people think reasonable short- to medium-term goals are for LLMs and mathematics?

In my talk I suggest a chatbot which can operate at beginner PhD level. I have seen what looks to me like good progress towards informal-to-formal translation of mathematical statements. When will we see the statements of the theorems in the Stacks project being accurately translated into Lean? This is something I mention in the talk. I've seen LLMs translating basic undergraduate mathematics into Lean but I have no understanding of how well this scales to modern algebraic geometry.

I very much like the Stacks project translation project. From what I have heard, coming up with the right definitions and library layout is often a very hard part of a formalization effort for a new theory?

Siddhartha Gadgil (Apr 11 2023 at 13:58):

Kevin Buzzard said:

If "emergent reasoning capability" means anything like "I learnt the proof that there are infinitely many prime numbers and now can generalise it to a completely false but plausible-looking proof that there are infinitely many prime numbers which end in 7" then I seriously question whether this is actually useful for mathematics. To give another example: Diophantine equations (solving polynomial equations in integers or rationals) is super-super-brittle in the sense that if you slightly change the input then you can completely change the nature of the problem. For example finding all the rational solutions (x,y)(x,y) to x4+y4=16x^4+y^4=16 (obvious solution: (2,0)) was a problem solved in the 1700s, but finding all the rational solutions to x4+y4=17x^4+y^4=17 (obvious solution: (2,1)) was a problem which was only solved in the late 1990s after decades if not centuries of breakthroughs in conceptual understanding. One needs a lot more than the ability to reason by analogy to do mathematics; one needs a profound conceptual understanding of what is going on. I am concerned that the AI enthusiasts have no real understanding of how phenomenally profoundly difficult mathematics actually is.

If prompted with the Euler-Riemann zeta-function proof of the infinitude of primes, it is likely enough that GPT-3.5/4 will generalize it to something a little like the start of Dirichlet's correct proof. Obviously we do not know which proof will generalize. So we should feed it all known proofs, and then start a loop with analogical reasoning from the LLM and errors from Lean (à la Sagredo).

Again, that it gives nonsense is not a reasonable measure. The measure is whether it can, with some work, be prompted to give useful answers.

Fabian Glöckle (Apr 11 2023 at 14:03):

Regarding your question to @Daniel Selsam on LLMs and ITPs for IMO problems: for me you cannot do without LLMs, because they are our closest attempt at "all-purpose reasoners". But current LLMs lack grounding in mathematical semantics, as they have been largely trained on a next-word prediction task. You can either add grounding via feedback from an ITP, or via self-critiquing: a sibling model evaluates the soundness of an argument and this feedback is used to train the original model. This procedure has been used extensively to make chatbots friendly and "aligned" with our values (https://arxiv.org/abs/2212.08073), but there is nothing that prevents this method (called RLHF in the literature, or RLAIF when the feedback comes from a model rather than a human) from being used for capabilities instead of alignment (i.e. for mathematical soundness in our case instead of low toxicity etc.).

Note that many of the failure modes you describe for ChatGPT are due to the fact that the system is designed to be a helpful chatbot in everyday situations: usually, just making an attempt at an answer and not being overly shy about being wrong is useful. Usually, it is an unstated premise of a question that it must be satisfactorily answerable in a dozen or so lines. When mathematics diverges from these "chatbot assumptions" (which of course it does!), the training procedure / finetuning / the kind of dialogues you have with it must reflect this!

Daniel Selsam (Apr 11 2023 at 18:50):

Kevin Buzzard said:

I guess something I don't understand is Daniel Selsam's position that at least for the IMO grand challenge, large language models are what is needed. Daniel, do you think that you can get gold on the IMO without an ITP? I think that my talk is somehow trying to figure out how this could happen, and failing to convince myself that it could. If my understanding of your position is correct, how far do you think LLMs can go in terms of mathematics? Scott's approach seems far more promising to me, but here I have no feeling whatsoever about how far it might go. For the LLM alone I'm finding it really hard to imagine it going much further (but I'm not an expert, which is exactly why I'm asking).

Sadly, there is no established science for extrapolating LLM behavior on uniquely challenging tasks like this. I personally assign pretty high probability to GPT-n (+ some wrapper algorithm, e.g. chain-of-thought prompting plus iterative critique/refine) solving the IMO purely in LaTeX without any formal system behind it. But I can't point to any rigorous evidence for this conjecture. The summary thought experiment is just: GPT-4 is outrageously smarter than GPT-3, so what might something that much smarter than GPT-4 be like? Comparing GPT-4 to GPT-2 is even more ridiculous, so what might something that much smarter than GPT-4 be like? Not trying to convince anyone, just sharing my thoughts.

Kevin Buzzard (Apr 11 2023 at 18:54):

Thanks for sharing! For sure the area is moving fast, and it's interesting to see that even the experts are not sure exactly where it's going. It remains fascinating to watch, and one thing I like about watching it here is that the discussion seems to be totally free of this whole "AI is/is not going to mean the imminent end of humanity" thing which seems to be all over Twitter etc.

Zhangir Azerbayev (Apr 11 2023 at 22:28):

Daniel Selsam said:

Kevin Buzzard said:

I guess something I don't understand is Daniel Selsam's position that at least for the IMO grand challenge, large language models are what is needed. Daniel, do you think that you can get gold on the IMO without an ITP? I think that my talk is somehow trying to figure out how this could happen, and failing to convince myself that it could. If my understanding of your position is correct, how far do you think LLMs can go in terms of mathematics? Scott's approach seems far more promising to me, but here I have no feeling whatsoever about how far it might go. For the LLM alone I'm finding it really hard to imagine it going much further (but I'm not an expert, which is exactly why I'm asking).

Sadly, there is no established science for extrapolating LLM behavior on uniquely challenging tasks like this. I personally assign pretty high probability to GPT-n (+ some wrapper algorithm, e.g. chain-of-thought prompting plus iterative critique/refine) solving the IMO purely in LaTeX without any formal system behind it. But I can't point to any rigorous evidence for this conjecture. The summary thought experiment is just: GPT-4 is outrageously smarter than GPT-3, so what might something that much smarter than GPT-4 be like? Comparing GPT-4 to GPT-2 is even more ridiculous, so what might something that much smarter than GPT-4 be like? Not trying to convince anyone, just sharing my thoughts.

I would be very surprised if GPT-n for some n not much bigger than 4 can't get an IMO gold medal. However, I am concerned that the GPT paradigm will hit a ceiling on its mathematical capabilities, and for all we know that ceiling might be lower than the level required to contribute new ideas to research mathematics.

The precise version of my concern is that scaling with the data we have available to us won't be able to reduce loss on the arXiv below some floor that is higher than the intrinsic entropy of the data. In the infinite-parameter and pre-training-data limit, we'll use up all the mathematical text in the world quite quickly, and our pre-training corpus will mostly consist of things like the internet, books, periodicals, audio transcripts, and court cases, because this is what exists at the >10T-token scale. In this scenario, the improvements in mathematical reasoning will come from

  1. more parameters
  2. transfer learning from non-math data.

It is obvious from the mathematical form of the scaling laws that (1) is not enough to approach the intrinsic entropy of the arXiv. What about transfer learning from non-math data? In the classic Kaplan et al. (2020) paper there is a finding that

When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss – in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set.

If this finding generalizes and also applies to the case where we do some fine-tuning on the OOD task, this is really bad news. It means that the best loss GPT-n could get on the arXiv is $E+P$, where $E$ is the intrinsic entropy of the arXiv and $P$ is the fixed penalty we get from mostly training on internet/books/whatever and transferring to the arXiv. For all we know, $E+P$ might correspond to a rather lousy mathematician.
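To spell out the step about (1): in one common parametrization of the scaling laws (the Chinchilla-style form of Hoffmann et al. 2022; Kaplan et al. use a slightly different functional form), the loss is modelled as

$$ L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}, $$

so with the amount of in-domain data $D$ fixed (all the mathematical text already used), sending the parameter count $N \to \infty$ only removes the $A/N^{\alpha}$ term and the loss plateaus at roughly $E + B/D^{\beta}$, strictly above the irreducible term $E$. Whether transfer from non-math data can shrink what remains is exactly the question about the penalty $P$ above.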

@Daniel Selsam are you aware of any work that might shed light on this question? I am aware of Hernandez et al. 2021, but as far as I remember they don't see what happens when you scale the amount of non-target pre-training data. If there is none, I'm planning on running these experiments myself.

Daniel Selsam (Apr 12 2023 at 03:32):

@Zhangir Azerbayev what about the "Infinite Monkeys Playing With Lean" Theorem? :infinity: :monkey: :keyboard: :lean:

Johan Commelin (Apr 12 2023 at 05:34):

It is obvious from the mathematical form of the scaling laws that (1) is not enough to approach the intrinsic entropy of the arXiv.

@Zhangir Azerbayev is this something that can be measured? I really don't know much about this... but if you can attach numbers to "the intrinsic entropy of the arXiv" that would be really cool.

Fabian Glöckle (Apr 12 2023 at 09:39):

The point is that "Kolmogorov complexity is not computable", meaning that there is no program that can do optimal compression. But for each finite amount of computational resources there exists an optimal compressor for that amount of resources, and it is the hope of the ML community that their model architectures are somewhat close, and moving closer, to optimal resource-limited compressors.

The idea that "AI and compression are the same thing" is at the heart of several "AI formalisms" like https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_inductive_inference or https://en.wikipedia.org/wiki/AIXI
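One toy way to make the "AI and compression are the same thing" slogan concrete: a language model's cross-entropy on a text is (up to a couple of bits of rounding) exactly the length of the code an arithmetic coder driven by that model would produce, so lowering the loss literally means compressing better. In the sketch below, `model_prob` is a hypothetical next-token probability function, not a real API.

```python
import math

def compressed_size_in_bits(tokens, model_prob):
    """Toy sketch: total Shannon code length of `tokens` under a language model.
    An arithmetic coder driven by the model achieves this length up to ~2 bits,
    so minimizing cross-entropy loss is the same as maximizing compression.
    `model_prob(prefix, token)` is a hypothetical next-token probability function."""
    bits = 0.0
    for i, token in enumerate(tokens):
        p = model_prob(tokens[:i], token)  # model's probability of this token given the prefix
        bits += -math.log2(p)              # ideal code length for this token
    return bits
```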

Johan Commelin (Apr 12 2023 at 09:51):

I know you explained this to me 2 years ago. But I didn't know about the wiki links. Thanks!

Eric Taucher (Apr 12 2023 at 11:57):

Kevin Buzzard said:

a large part of me is still utterly baffled about why everyone is making such a fuss about these systems when it is so trivial to get them to talk absolute nonsense

Are you asking this in relation to just proofs, or about ChatGPT in general?

I started to write a response, but it had nothing specifically about proofs, so I shelved it until I knew more about what you are seeking. :smile:

In short, most of it was about how ChatGPT saves me time and trouble on the little tasks that it can do correctly, and when all of that time is added up it comes to a few hours each day. It also helps me explore areas that have a high learning curve to get started; it is a compass that, more often than not, points me in the right direction.

Kevin Buzzard (Apr 12 2023 at 18:00):

Yeah, I know ChatGPT is brilliant at some things. I'm specifically talking about getting it to be good at mathematical proofs. I guess this answers my question: I am failing to distinguish ChatGPT in general from ChatGPT as it pertains to taking my job.

Martin Dvořák (Apr 15 2023 at 20:59):

Fabian Glöckle said:

The idea that "AI and compression are the same thing" is at the heart of several "AI formalisms" like https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_inductive_inference or https://en.wikipedia.org/wiki/AIXI

BTW, could some of these formalisms be formalized in Lean? Could it be an interesting project?


Last updated: Dec 20 2023 at 11:08 UTC