Zulip Chat Archive

Stream: Lean Together 2021

Topic: LeanStep (Jason + Jesse's talk)

Johan Commelin (Jan 06 2021 at 10:37):

This stuff is amazing. Jesse showed me a gpt-generated proof of a lemma in the Witt vector project. It's not extremely deep, but it's certainly not a refl lemma either. I'm impressed.

One challenge for ATP's (I think Brian Conrad said this at some point) is: can they prove infinitude_of_primes without human input.
How does gpt fair on that one? You have this tricky step where you have to cook up a prime number that is sufficiently big, and you basically have to come up with the n! + 1 trick out of thin air. Does it go anywhere?

And if you give it the p := min_fac (n! + 1). Can it take it from there?

Stanislas Polu (Jan 06 2021 at 11:42):

With the data it was trained on it is highly unlikely, even with a lot of search I think. But eventually, the hope is that as we build up a lot of synthetic data by sampling proofs (with a fair amount of search) and training on that signal, the model will get a chance to build such intuitions. It also obviously depends on the type of statements we use for that.

Stanislas Polu (Jan 06 2021 at 11:43):

In the meantime, also I think there's a ton of value in having a system that can assist humans in a productive way, and what we have today is not far from it. If you really only have to provide these insights to formalize infinitude_of_primes, then it's probably going to make working with Lean much easier for a whole lot of people?

Reid Barton (Jan 06 2021 at 11:50):

Right, I think the more useful metric for the short term is not really "how many top-level theorems does gpt prove" but more like how much shorter can proofs become if you try to prove every goal using gpt (especially the goals created by haves, or side conditions to lemmas for rw, or calc steps, etc.)

Johan Commelin (Jan 06 2021 at 11:57):

Sure, I definitely think that this can be very useful before we get infinitude_of_primes_without_human_input.

Johan Commelin (Jan 06 2021 at 11:57):

But I also think it's a nice benchmark. It will certainly turn some heads.

Stanislas Polu (Jan 06 2021 at 11:59):

Agreed :+1:

matt rice (Jan 06 2021 at 12:38):

when we say shorter proofs here, I took this to mean actual length of the text of the proof, and not e.g. size of the terms evaluated, correct?

Jason Rute (Jan 06 2021 at 14:08):

To put infinitude of primes in context, I think that theorem would have gotten Euclid (or whoever actually came up with it) the Fields medal of the time. Until that point (in my revisionist history) the way one proves that a set is infinite is to find a generator for the set. This instead is a proof where one shows that if there is a last element then a contradiction happens if you pick a very creative number. The thing that is going to be hardest to teach AI is to come up with creative solutions to problems. It might figure out (or have built in) proof-by-contradiction and that if a set isn't infinite then there is a finite set of numbers that whose product is a number. But to know that it needs to take that product and subtract one. That is human reasoning at its finest. (Also, most humans mathematicians are taught that proof, so it is unclear how many of them would be able to come up with it on their own.)

Johan Commelin (Jan 06 2021 at 14:10):

I very much agree with your final observation. :thumbs_up:

Kevin Buzzard (Jan 06 2021 at 17:29):

The problem of course is that mathematicians are taught it and they think it's brilliant but 5 years later they think it's trivial and have forgotten the brilliance of the idea; by that point it seems difficult to persuade them that actually is difficult.

Jesse Michael Han (Jan 07 2021 at 17:17):

@Jeremy Avigad During Jason's talk, you asked about how gpt performs on analysis proofs --- indeed, analysis was the largest module (23K LOC) with the largest disparity in favor of the Transformer models from our evaluation. Here are the analysis proofs it found. We get some repeated tactic applications because the model is not conditioned on the previous tactic states, only the current one, but many of the proofs are nontrivial and diverge from their mathlib counterparts.

gpt_analysis.txt

some samples:

lemma norm_le_zero_iff : ∀ {α : Type u_1} [_inst_1 : normed_group α] {g : α}, ∥g∥ ≤ 0 ↔ g = 0 :=
begin
   intros,
   simpa [le_antisymm_iff, norm_nonneg] using @norm_eq_zero α _ g,
end

lemma asymptotics.is_O_with.add : ∀ {α : Type u_1} {F : Type u_4} {E' : Type u_6} [_inst_2 : has_norm F] [_inst_4 : normed_group E'] {g : α → F} {l : filter α} {c₁ c₂ : ℝ} {f₁ f₂ : α → E'}, is_O_with c₁ f₁ g l → is_O_with c₂ f₂ g l → is_O_with (c₁ + c₂) (λ (x : α), f₁ x + f₂ x) g l :=
begin
   intros,
   filter_upwards [h₁, h₂],
   intros x hx₁ hx₂,
   apply le_trans (norm_add_le _ _),
   apply le_trans (add_le_add hx₁ hx₂),
   rw add_mul,
end

lemma differentiable_on_const_add_iff : ∀ {𝕜 : Type u_1} [_inst_1 : nondiscrete_normed_field 𝕜] {E : Type u_2} [_inst_2 : normed_group E] [_inst_3 : normed_space 𝕜 E] {F : Type u_3} [_inst_4 : normed_group F] [_inst_5 : normed_space 𝕜 F] {f : E → F} {s : set E} (c : F), differentiable_on 𝕜 (λ (y : E), c + f y) s ↔ differentiable_on 𝕜 f s :=
begin
   intros x hx,
   intros E unique_diff_on_univ,
   intros,
   simp [differentiable_on, differentiable_within_at_univ],
end

Jesse Michael Han (Jan 07 2021 at 17:28):

Fun observation: in the first example above, it used @norm_eq_zero with the correct number of arguments (and even the "dreadful @ _ idiom"), but the string @norm_eq_zero never appears in mathlib (commit 33483a3d).

Jeremy Avigad (Jan 07 2021 at 17:49):

Thanks for this information! I am excited about the potential for gpt for interactive use. I know when/how to use intro and cases, so, for me, the real win is when we have a tool that suggests theorems and rewrite rules. So here are some useful queries for the model:

If I were to use simp or rw right now, what additional rewrite rules do you think I should add?
If I were to use apply or refine or exact within the next three or four steps, what is your best guess as to the theorem I would use?

Patrick Massot (Jan 07 2021 at 17:50):

Yes, this is was also my remark during the talk.

Jeremy Avigad (Jan 07 2021 at 17:54):

I guess another useful query:

If I were to use cases or rcases on something other than a hypothesis, what would it be?

Stanislas Polu (Jan 07 2021 at 18:07):

Hi @Jeremy Avigad. Anything that can be prompted by a prefix of the full completion will work as expected. So as an example we can create a specialized tactic gptf-rw that will "force" the use of rw as a tactic and the model will gadly complete the steps. For cases or rcases we can do the same. There is no trivial way to specify (do not use the hypotheses) but we can query the model and filter out the response to achieve that goal.

Stanislas Polu (Jan 07 2021 at 18:07):

Does that make sense?

Stanislas Polu (Jan 07 2021 at 18:08):

This is great feedback (thanks @Jeremy Avigad and @Patrick Massot) we commit to working on exactly that in the coming weeks and will circle back once we have something working.

Stanislas Polu (Jan 07 2021 at 18:09):

The main blocker right now is one commit to the lean repository, after that we'll be able to share the system with a broader set of people very easily cc @Jesse Michael Han

Kevin Lacker (Jan 07 2021 at 18:12):

is there a PR you're referring to?

Stanislas Polu (Jan 07 2021 at 18:13):

Yes, Jesse's lean-tpe branch here: https://github.com/jesse-michael-han/lean/tree/lean-tpe

Stanislas Polu (Jan 07 2021 at 18:13):

But it needs clean-up I believe :)

Jesse Michael Han (Jan 08 2021 at 15:50):

Kevin Lacker said:

is there a PR you're referring to?

all the changes have been incorporated into this PR by @Edward Ayers

Chris Hughes (Jan 09 2021 at 19:12):

Was late in watching this talk. Am I right in saying that the gpt tactic is just producing a string and doesn't by design know how to produce a string that is syntactically correct, so if it does produce meaningful strings it's because it "learnt" to do that and not because it's been programmed to do so?

Jason Rute (Jan 09 2021 at 19:22):

Yes that is correct. It “learnt” in the sense of machine learning what a valid tactic string looks like. It uses a string-to-string machine learning algorithm called a Transformer that is often used for natural language where there is no formal grammar.

Jason Rute (Jan 09 2021 at 19:28):

If you are curious, both of these threads have some discussion on transformers, including the GPT transformer. #Machine Learning for Theorem Proving > Papers on Neural Conjecturing and #Machine Learning for Theorem Proving > GPT-f paper.

Kevin Lacker (Jan 11 2021 at 08:15):

in a very fuzzy nonrigorous way, I suspect that a lot of the "power" of the gpt is spent on learning the lean syntax, and if there were some way to integrate a tree search that knew about valid lean syntax with an AI model doing AI things, it would be more effective

Kevin Lacker (Jan 11 2021 at 08:15):

like you can teach a gpt to play chess but it isn't as good as the monte carlo based ones

Johan Commelin (Jan 11 2021 at 08:24):

If you would print chess moves in regular chess notation, one move per line. Is it possible to "measure" how much "power" gpt spends on figuring out the syntax?

EricGT (Jan 11 2021 at 10:19):

FYI - I have not used anything from this site but do keep it on reference for future needs. The site https://huggingface.co/ is about solving NLP. They have models using GPT. I found one when searching for chess: https://huggingface.co/shtoshni/gpt2-chess-uci . Take all of this with a grain of salt.

Stanislas Polu (Jan 11 2021 at 13:11):

@Kevin Lacker in the current result we don't have any tree-search as explained during the talk. But we're working on adding that which should give an healthy boost. Only the valid tactics would be kept as is the case in the original GPT-f paper. Interestingly, in AlphaGo MCTS was almost exclusively used to train the value function (the policy network was probably not even needed as the "valid" action space in each board state is fairly limited). In formal maths we have the increased difficulty that the action space even limited to "valid" steps is enumerable, so we need a strong enough policy to productively expand each node. But same as is the case for games, lot of the "proving power" will probably come from a trained value function, which does not even exist yet in the Lean setup.

Stanislas Polu (Jan 11 2021 at 13:12):

All in all, completely agreed with your statement, just wanted to provide a bit of additional context.

Kevin Lacker (Jan 11 2021 at 17:02):

how would you train a value function? the value is something like, the value for lemma X is an estimate of whether proving lemma X would be helpful in concluding theorem Y?

Jason Rute (Jan 11 2021 at 17:05):

Kevin Lacker said:

how would you train a value function? the value is something like, the value for lemma X is an estimate of whether proving lemma X would be helpful in concluding theorem Y?

@Kevin Lacker You should check out the GPT-f paper, or the discussion at #Machine Learning for Theorem Proving > GPT-f paper. But some simple approach (not too different from AlphaGo is to train a binary classifier to say if the model's tree search is able to find a proof at a particular goal state which came up during some sort of iterative training (expert iteration or RL). (For transformer models you can use tricks for binary classification as done in the GPT-f paper.)

Kevin Lacker (Jan 11 2021 at 17:06):

Johan Commelin said:

If you would print chess moves in regular chess notation, one move per line. Is it possible to "measure" how much "power" gpt spends on figuring out the syntax?

I only tried playing chess against GPT-3, so I don't have any fancy analysis, but "syntax" was probably the wrong thing for me to say. GPT-3 playing chess will almost always say moves that are the correct syntax, but sometimes it will make illegal moves, like it will try to move its pawn from e2 to e4 but it already moved that pawn earlier and it "forgot". I am sure you could make a better one by training solely on chess games but imo this sort of failure hints that the GPT architecture is doing some work to detect legal moves that could just be done by hard-coding them

Johan Commelin (Jan 11 2021 at 17:11):

Thanks for the anecdote

Jason Rute (Jan 11 2021 at 17:24):

As for the syntax discussion, my thinking on this has evolved over time. It is certainly true that a model could be given what is called an "inductive bias" forcing it to give correct syntax. The HOList models have a stronger inductive bias for example. Also, even with transformer models there are ways to supply tree position encodings rather than text position encoding However there are some considerations:

It is a lot easier to use a text model like a transformer, since one doesn't have to engineering what formal syntax looks like. For example, if our tool we presented last week used the HOList technology, it would be a lot more engineering to make a DSL for tactics (and even more so for Lean since Lean needs term and name parameters way more than HOL Light tactics, and HOList doesn't handle those well).
There is an economy of scale thing going on. OpenAI and other companies (like HuggingFace) have a lot of infrastructure around pertained text models since (1) they are easy to pre-train (you basically dump a large part of the internet into the model) and (2) they are extremely flexible. This makes it easy to create a new model for some esoteric purpose like theorem proving or symbolic mathematics. Someone could take time to gather lots of tree structured data for a tree position transformer, but this would take a lot of curation work. (Also, it would be a lot of work to parse lean text into a syntax tree, especially with Lean's change-as-you-go parser.)
It provides flexibility. We are not using low level formal lean syntax, but high level pretty printed and human written syntax. This is shorter and easier to fit into our model. This is easy with a text transformer since it doesn't need to know the exact syntax used.
Transformers can also learn unspoken syntax rules, such as good variable names and common ways to write code. The output looks more human like than if it just outputted an AST.

Jason Rute (Jan 11 2021 at 17:26):

As for chess, note, this sort of unbiased approach is also used for chess with MuZero. (They use conv nets instead of transformers, but the idea is similar.) One doesn't need to program in the rules of chess to the model. Instead the model observes human games enough to learn the action space, and then improves its understanding via RL. It is interesting that illegal moves and really dumb moves basically look the same to the model. I think in that paper, they argue this even helps learning in the long run.

Jesse Michael Han (Jan 11 2021 at 17:29):

Kevin Lacker said:

in a very fuzzy nonrigorous way, I suspect that a lot of the "power" of the gpt is spent on learning the lean syntax, and if there were some way to integrate a tree search that knew about valid lean syntax with an AI model doing AI things, it would be more effective

It's well known that transformers will tend to produce syntactically correct outputs if trained sufficiently, to the point that structured decoding is not really needed.

When it comes to neural theorem proving using gptf, there are two search procedures: a search for a candidate sequence of tokens by the decoding procedure wrapping the transformer, and a search for candidate proof by the tree search which wraps the decoding procedure wrapping the transformer. The outer search already knows about valid Lean tactics, but the inner search over sequences of tokens does not know about Lean syntax. One can certainly try making this inner search Lean-syntax-aware and having it mask out invalid tokens during decoding. This can improve the predictions returned by the inner search. In practice, however, the model usually predicts valid Lean syntax (interactive tactic invocations in mathlib tend to not have very complicated parse trees) and most failures occur in the outer search, when the tactic execution fails to make progress. Currently, I don't think we would see much benefit from such domain-specific engineering in the decoding process.

If we were to ask much more of the transformer, say predicting entire proof terms instead of tactic steps, then we might see more benefits from structured decoding.

Kevin Lacker (Jan 14 2021 at 22:40):

MuZero does know about the tree structure of the chess game, though, it isn't just reading in a sequence of chess moves and putting it into a GPT

Kevin Lacker (Jan 14 2021 at 22:42):

"tree search" isn't quite the right way to describe a theorem prover, though, a set of possible proofs really maps to a DAG rather than the way a set of possible chess games maps to a tree and that seems like a big difference

Jason Rute (Jan 14 2021 at 22:45):

Board games are also DAGs, but tree search is still the term usually used.

Jason Rute (Jan 14 2021 at 22:51):

Lean GPT-f is also not currently outputting more than one tactic at a time, so it isn’t that different from outputting a chess move. A complete proof, sequence of tactics comes from a tree search.

Jason Rute (Jan 14 2021 at 23:04):

But to be clear, there are many differences between Muzero and GPT-f. The only point I wanted to make is that technically muzero also (like GPT-f) doesn’t know what a valid move is a priori. It has to be learned. (One not small difference also is that muzero is constrained to output 64x64 possible moves (many not valid) while GPT-f can output any text string.)

Last updated: May 02 2025 at 03:31 UTC