Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: LLM+ITP(+AZ)?

Alex Kontorovich (Apr 17 2023 at 14:01):

To @Kevin Buzzard's point about LLM+ITP: I've been thinking about this a bit, and my thoughts can be summarized by watching this (hilarious) clip of ChatGPT playing Stockfish:
https://www.reddit.com/r/AnarchyChess/comments/10ydnbb/i_placed_stockfish_white_against_chatgpt_black/
The "opening" seems rather normally played, which is amazing, as ChatGPT knowings nothing about the rules of chess; it seems to have successfully inferred how pieces move "by osmosis". But then we get to the middle, and its queen jumps over pieces, rooks move like knights, pawns randomly appear out of nowhere (and Stockfish dutifully replies as best it can, including a final, merciful capture of Chat's king). This is how it feels to me to interact with ChatGPT with math; you ask it to prove something from a first year graduate course, and it correctly states the definition, correctly identifies the key property to be checked, and correctly concludes the end of the proof. But it's that "midgame", where all the "beef" of the proof happens, that it just writes complete BS (not unlike some of our undergrads/beginning grad students!...).

But we actually do know how to make amazing chess engines using NN, namely, AlphaZero. So my guess is that we would want LLM (maybe for the opening, to get to the beef) + (ITP+AZ : AlphaZero-type NN trained on the "rules" of math via an ITP) working in concert. Have the DeepMind people (who made AlphaZero) gotten into this business at all? (I have some contacts there; might check it out...) Needless to say, the "space" of possible "moves" in math seems unfathomably larger than that of chess; so maybe this approach is doomed to fail. But in many situations in practice, there are actually only a small handful of things that make sense to attempt in a given state of a proof, and perhaps those can be learned!..?

About changing x^4+y^4=16 to 17, I'm reminded of the experiment where chess experts vs lay people are briefly shown positions, either from actual games or from random arrangements, and asked to reproduce them. Unsurprisingly, the chess experts seeing actual games are able to reproduce a lot of the structure, and if they're off, entire groups of pieces are off while preserving their relationships; meanwhile lay people are actually slightly better than experts at reproducing random arrangements (presumably because the experts waste their time looking for structure in the chaos?). So I agree, math is "globally" difficult, and we shouldn't expect too much from ML immediately; but I think there are lots of situations (e.g. lots of analytic number theory arguments which amount to finding the "right" place to apply Cauchy-Schwartz, followed by more-or-less standard techniques in estimating exponential sums) where it really has the potential to "learn" and be a big boost in the hands of someone capable of harnessing its power...

Just my two cents... I would be happy to hear all the points where I'm revealing my ignorance...

Fabian Glöckle (Apr 17 2023 at 14:22):

ITP+AZ has been tried: https://arxiv.org/abs/2205.11491
and I guess the authors would agree that replacing the language model with a well-pretrained large language model would further increase performance
(of course there is an inference time - quality tradeoff for the choice of model size, and hierarchical methods like https://arxiv.org/abs/2210.12283 will probably turn out best)

Alex Kontorovich (Apr 17 2023 at 14:30):

Is there a way for a "mere mortal" to play with this stuff? Or does one have to work at DeepMind/OpenAI/etc to actually mess around with the models and not just their outputs...?

Albert Jiang (Apr 17 2023 at 15:47):

Alex Kontorovich said:

Is there a way for a "mere mortal" to play with this stuff? Or does one have to work at DeepMind/OpenAI/etc to actually mess around with the models and not just their outputs...?

Currently "mere mortals" can only play with the VSCode plugin for HTPS, which works in an autocomplete style (does not have search) afaik. @Jason Rute gave a good talk which covered widely available stuff here

Also: stick around, we're going to have more and more tools open-sourced tools people can play with on their laptop!

Jason Rute (Apr 17 2023 at 16:23):

Yes my talk covers the AlphaZero approach to theorem proving. Also look at Tony’s talk in the same IPAM meeting. The Draft-Sketch-Prove approach is similar to what @Alex Kontorovich was saying by using LLMs to “jump start” opening moves of a formal proof by having the model give a natural language description of the proof which is converted into a formal proof sketch which is then filled in by another model.

Fabian Glöckle (Apr 17 2023 at 17:24):

And regarding action spaces: for some time it was very fashionable in Reinforcement Learning to put gigantic numbers about the size of the search space. but of course as @Alex Kontorovich says, the question is actually how well-structured / "learnable" the data is (say, what the "dimension" of the "data manifold" is, not of the raw space into which we embed it)
And I think many mathematicians agree that the "sensible action space" is significantly smaller and can be learned (Timothy Gowers for instance: https://gowers.wordpress.com/2022/04/28/announcing-an-automatic-theorem-proving-project/)

But Reinforcement Learning is very data inefficient, so one should use supervised data and pretraining as much as possible before turning to RL for the last bits.

Kevin Buzzard (Apr 17 2023 at 18:35):

Most of the time, the sensible action space is small. But at that point where you need the new idea which gets you further than current techniques seem to go, the action space is vast and not well-understood.

Fabian Glöckle (Apr 17 2023 at 18:38):

Of course. But I like to think of this as a continuum.

Kevin Buzzard (Apr 17 2023 at 18:39):

I don't really know what you mean by that. When I'm formalising proofs I really feel like there are times when I'm on autopilot and there are other times where I'm entering a genuine idea into the system, so for me the amount of "cleverness" needed to write the next line in a tactic proof feels quite discontinuous. But maybe I'm misunderstanding you.

Fabian Glöckle (Apr 17 2023 at 18:41):

I think there is anything from "doing autopilot cleanup moves" via "applying standard techniques in standard ways" and "applying standard techniques in nonstandard ways" to "creating novel ideas which build on existing well-understood notions" and "creating novel ideas which build on novel notions"

Kevin Buzzard (Apr 17 2023 at 18:43):

Aah I see. Yes, I agree with you that there's a continuum of ideas of that nature. But when you're actually proving a theorem, you are jumping around within that continuum. For example the proof that there's infinitely many primes contains one brilliant idea $1 + \prod_{i=1}^n p_i$ and the rest is kind of mechanical.

Fabian Glöckle (Apr 17 2023 at 18:44):

More formally, you could measure the amount of surprise a certain mathematical agent has upon reading a sentence in a proof. An autopilot cleanup move has zero surprise / information content, and my claim is that while complexity may be exponentially distributed (information theory defines surprise := -log probability), it may still be the case that for every level of surprise there exsists a proof line with that surprise.

Kevin Buzzard (Apr 17 2023 at 18:47):

In the corpus of all proofs, I believe you!

Fabian Glöckle (Apr 17 2023 at 18:48):

FWIW Timothy's project is all about demystifying these "rabbit out of the hat" steps, doing boring mechanical search space reductions until there is no rabbit left.

Fabian Glöckle (Apr 17 2023 at 19:03):

Ridiculuously simple example:

Let us say we are proving:
forall m,n : Nat, exists x : Nat, x > m and x > n
The rabbit would be x := max(m, n) + 1.

If instead, we postpone the decision for x and focus on the first goal, we might retrieve forall k : Nat, k + 1 > k and close the goal with x := m+1. Turning to the second goal, now |- m+1 > n, applying transitivity with m+1 > m gives |- m \geq n, which cannot be proved.

However, instead of giving up, we have found a case split that we can exploit: it remains to prove
|- n > m |- exists x : Nat, x > m and x > n. The same as above fails in the case where we focus on the first goal first (x := m+1) but succeeds if we focus on the second goal first: x := n+1 leads to n > m |- n+1 > m. In other words, we have implicitly been led to the definition of max and discovered a simple "novel concept".

Arguably, this proof has a lower surprise level than the one that starts with "take x := max(m,n) + 1", and one could dream of such methods to lower the surprise level even in more complicated cases...

Kevin Buzzard (Apr 17 2023 at 20:01):

Right -- but the problem is that this doesn't scale to "let's make the amazing obervation that proving that a deformation ring is isomorphic to a Hecke algebra might be the key observation which unlocks Fermat's Last Theorem". This observation was made by a singular genius, and no amount of training on the internet will get you there -- I think.

Fabian Glöckle (Apr 17 2023 at 20:56):

I like to think of "genius" as a concept expressible in standard physics (as opposed to something transcendental), and as a continuum ;) Clearly, predictive language modelling won't get you there, but that hasn't been claimed in this thread I think. Rather, everyone agrees that we have to add a component of reasoning that encompasses a certain amount of search but is not "mere search" (similar to what happened in my example). You are right that there is a difference between your example and mine, and that is that while in my example, the reasoning can be expressed in the object language ("let's turn this proof goal which doesn't depend on x anymore into a by_cases split"), in yours the reasoning takes place in an "unsound but sufficiently sound" quotient of mathematical logic, where human mathematicians have established new rules like "every statement in this subtheory should easily be refutable or provable with a certain set of tools" or "this proof has generalization flexibility here and here, but seems to be a precise fit / seems to barely go through at this point". And while it is not at all clear how to model such ways of reasoning right now in maths (and we are still very far from it), what makes me slightly optimistic is that it is not unheard of that machine learning models develop new intuitions and concepts or learn them on an ad-hoc basis.

Scott Morrison (Apr 17 2023 at 21:14):

@Kevin Buzzard I think you should also notice in your own post above that you reached for an extreme example of "genius". At present there is just so much low hanging fruit in terms of "demystifying minor genius" (as in @Fabian Glöckle's example about discovering max) that I feel we should try for a certain agnosticism about major genius until we've done the work on minor genius that we're pretty sure is doable! I think we might just be surprised, either way, at the gap that remains.

Zhangir Azerbayev (Apr 18 2023 at 06:45):

Kevin Buzzard said:

Right -- but the problem is that this doesn't scale to "let's make the amazing obervation that proving that a deformation ring is isomorphic to a Hecke algebra might be the key observation which unlocks Fermat's Last Theorem". This observation was made by a singular genius, and no amount of training on the internet will get you there -- I think.

I wouldn't confidently assert that no amount of training on the internet will get us to a singular genius. The intuition behind how it could is that if you have a model that's really good at generalizing to unseen data, it should be able to generalize to levels of "insight", "cleverness", or whatever else you want that are not in the training data. This possibility is discussed at length in the essay Just Ask for Generalization.

In practice, it's an open question whether we have enough mathematical text to train LMs that are good at math: a lot depends on how helpful non-math data is for improving at math.

Zhangir Azerbayev (Apr 18 2023 at 20:02):

Re the topic of ITP+AZ, my prediction is that inspiration from alphazero will not be very helpful in designing mathematical AI. Reinforcement learning essentially consists of two steps:

Try new stuff
Keep doing the stuff that led to high reward.
The problem is that while we have lots of wonderful and sophisticated ways of doing (2), we don't have any ideas for doing (1) except random guessing. So alphazero-style algorithms (more precisely, algorithms based on Monte Carlo Tree Search), work well in domains like Go where you can find the best move by guessing, but don't extend well to problems with large and combinatorial action spaces like next-tactic prediction.

When we build an automatic mathematician, self-improvement will likely play an important part, but I expect it look more like a autogpt, a language model cascade, or a wake-sleep algorithm than like alphazero.

Fabian Glöckle (Apr 18 2023 at 20:40):

Clearly you should not use a random policy for exploration, but that is also not what RL methods or AlphaZero suggest (in AlphaZero and Online RL algorithms, you'd explore based on the current policy, possibly with some exploration bonus or entropy regularization). In the absence of finetuning data, RLHF is what makes predictive language models useful. That said, I fully agree that getting LLMs into a loop and abstraction finding a la DreamCoder are very promising ideas!

Zhangir Azerbayev (Apr 19 2023 at 01:27):

Fabian Glöckle said:

Clearly you should not use a random policy for exploration, but that is also not what RL methods or AlphaZero suggest (in AlphaZero and Online RL algorithms, you'd explore based on the current policy, possibly with some exploration bonus or entropy regularization). In the absence of finetuning data, RLHF is what makes predictive language models useful. That said, I fully agree that getting LLMs into a loop and abstraction finding a la DreamCoder are very promising ideas!

This is a point on which I was unclear: when I said "do something random" I meant randomly sampling from the current policy.

RLHF has been very useful for alignment, but not tremendously useful for acquiring new capabilities. We can RLHF a model into not saying racist things or being better at following instructions, but (at least going off of publicly available information), we can't RLHF a model into being much better at problem solving.

The reason RLHF works well for alignment is that the desired behavior is already assigned quite high probability by the base model. If there's an ingenious mathematical trick the base model thinks is low probability, RLHF won't find it except by luck, which becomes exponentially less probable as your action space becomes more complicated.

Fabian Glöckle (Apr 19 2023 at 08:50):

My assumption would be that training on all math textbooks would give you enough "base capabilities" and the question is how to turn these into useful behavior instead of text prediction. And there, "aligning" models with the rules of sound mathematical reasoning would look like "acquiring a new skill" while in reality it just means slightly favoring something that was already there, namely the "soundness feature" over other features in the model's activations.

Fabian Glöckle (Apr 19 2023 at 08:54):

But my answer to the question of creativity is also not that you should sample ideas from the model until the singular ingeniuous insight is sampled from the haystack, but rather that as described above, externally brilliant behavior might be the result of a long iteration of boring search space reductions and small-scale sampling.

Zhangir Azerbayev (Apr 19 2023 at 17:13):

Fabian Glöckle said:

But my answer to the question of creativity is also not that you should sample ideas from the model until the singular ingeniuous insight is sampled from the haystack, but rather that as described above, externally brilliant behavior might be the result of a long iteration of boring search space reductions and small-scale sampling.

It sounds like we agree then. My point about AZ being the wrong tree to bark down was the narrow one that "directly optimizing for reward by sampling from the policy saturates quickly in complex action spaces". I am open to any exploration method more sophisticated than sampling from P(proof | conjecture).

I agree that RLHF can be used to favor soundness.

Patrick Nicodemus (Apr 23 2023 at 23:20):

Zhangir Azerbayev said:

In practice, it's an open question whether we have enough mathematical text to train LMs that are good at math: a lot depends on how helpful non-math data is for improving at math.

Christian Szegedy has the radical idea of scanning whole textbooks so that the model can read the text and the diagrams. Some convolutional layer at the beginning will recognize the text and the diagrams and learn to read the formulas and interpret the diagrams and diagrams. No more processing and formatting text - just pixels!

Patrick Nicodemus (Apr 23 2023 at 23:21):

He stresses the reduction in engineering effort this would save

Patrick Nicodemus (Apr 23 2023 at 23:26):

This basically sounds crazy to me and absurdly inefficient but I guess if you're doing research on Creative mode it makes sense

Wojciech Nawrocki (Apr 23 2023 at 23:31):

Less structure in the representation of data can be beneficial, for example in handling non-English scripts.

Adam Topaz (Apr 23 2023 at 23:31):

I don’t think it’s that crazy (from my naive mathematician’s understanding). Especially when it comes to diagrams. At least for humans it’s MUCH easier to understand diagrams (like commutative diagrams, for example) as opposed to, say, the latex source that produces them, or some other text-only encoding.

Scott Morrison (Apr 24 2023 at 00:08):

GPT4 is already multimodal (graphical input), and although there's no ChatGPT or API access to this (?) I wouldn't be at all surprised if people are already trying this.

Jason Rute (Apr 24 2023 at 02:09):

It seemed crazy to me years ago when Szegedy and the N2Formal team first suggested it almost 4 years ago, but given what has happened since then, including the vast power of general and multimodal models, I wouldn't be shocked if I saw a model which had a GPT-3 level of ability from raw images of text (and GPT-4-level if appropriately instruction/RLHF tuned). Indeed, @Wojciech Nawrocki shared with me this paper and this paper which purport to be able to process text from images. I honestly don't think we are going to hit a local maximum with the learning-from-raw-data-on-the-internet paradigm until at least we have multimodal models which learn from e.g. YouTube videos and hence can processes speech, (closed caption) text, and images/video (including e.g. chalkboard written text/drawings from lecture videos and live coding tutorial videos).

Albert Jiang (Apr 24 2023 at 15:37):

I think we have at least a good two years before we run out of internet data. I think it's eventually going to hit a limit, so maybe not incredibly useful for research level mathematics, where the data is scarce and dense. But to put any limit on the scaling approach at this point in history would be a big mistake.

Albert Jiang (Apr 24 2023 at 15:38):

My (totally unfounded) guess is that GPT4 already uses youtube.

Fabian Glöckle (Apr 24 2023 at 15:44):

I think an issue with internet data is that there is lots of high school / first year undergrad math and then research math, but (compared to print) very little expository / textbook-style intermediary content.

Zhangir Azerbayev (Apr 26 2023 at 00:18):

Patrick Nicodemus said:

Christian Szegedy has the radical idea of scanning whole textbooks so that the model can read the text and the diagrams. Some convolutional layer at the beginning will recognize the text and the diagrams and learn to read the formulas and interpret the diagrams and diagrams. No more processing and formatting text - just pixels!

All of arXiv + every math textbook ever published is still quite small by modern language modeling standards. I wouldn't be surprised if gpt-4 is already trained on a bunch of OCRed technical books.

To me it sounds like training on raw pixels would be much more difficult than training a small transformer that maps pdfs to LaTeX. There are already open source models that do this. Although having an image encoder for diagrams is very useful.

Scott Morrison (Apr 26 2023 at 00:37):

Given that GPT4 is multimodal and can read images, I would be rather surprised if they are not already using it to turn images of text into text for subsequent training data, at least at experimental scale.

Junyan Xu (Apr 26 2023 at 03:46):

I wouldn't expect GPT4 to be more efficient than existing OCR tools for turning images of text into text. It could be used to create embeddings of images though, which could capture text layout, diagrams and other information that would be discarded by OCR, and support cross-modality semantic search.

Junyan Xu (Apr 26 2023 at 03:51):

Recently this multimodal model was published: https://huggingface.co/Vision-CAIR/MiniGPT-4 but I guess we shouldn't expect it to solve the ecole polytechnique problem as GPT4 did ...
(Addendum: https://llava-vl.github.io/ is another recent model, achieves SoTA on ScienceQA)
Screenshot_20230425-234744.png

Scott Morrison (Apr 26 2023 at 03:56):

I'm surprised you say it wouldn't be more efficient! I would expect that with an sufficiently high accuracy required, there's nothing even to compare it to for efficiency. That is, GPT's background knowledge will make it a better OCR tool than anything else available (because with some careful prompting one could probably have it correct ambiguities in the visual data using contextual knowledge, without making too much stuff up).

Junyan Xu (Apr 26 2023 at 04:05):

If the image quality is low and a lot of guessing is required then it may be necessary to use a language model, otherwise it just cost too much to run ... google translate app on your phone does decent OCR but LLMs hardly run on it.

Notification Bot (Aug 29 2023 at 00:05):

3 messages were moved from this topic to #Machine Learning for Theorem Proving > Nougat, academic OCR model from Meta AI by Junyan Xu.

Last updated: May 02 2025 at 03:31 UTC