Stream: Machine Learning for Theorem Proving

Topic: GPT

Iocta (Aug 19 2020 at 06:27):

Here's some nonsense from running untuned minGPT on mathlib.

lemma max_mem_geom_prod (p : α → α) : x ∈ prod ↔ ∀a, x ∈ q a :=
by simpa only [mem_restrict_apply, max_eq_trans, max_eq_trans, max_eq_trans, max_eq_trans]

lemma max_min_iff {p q : Prop} [decidable_pred p] {i : fin n} {= s} :
fin.map f ↔ p i = 0 :=
by simp [eq_top_iff, fin.coe_eq_val]

/-- The indexing type of a finite complete lattice subsemiring α -/
def submodule.set.ker {α} [semiring α] (s t : subsemiring α) : semiring (semiring.map f) :=
submodule.ext $λ x, submodule.refl _ _ section ring variables {x y : α} {β : Type*} [ring y] /-- Subtraction of a set α and β are infinite it induces its induced by a new ellowed; or α with the natural y in fin n. -/ @[inline] def dum_of_card_eq_card_pow_eq_one [finite_dimensional K V] : ∀ {n : ℕ} {x : K}, x ^ n = 1 ↔ x ^ (n + 1) ∣ 1 := begin rw [← card_units K p, ← cardinal.lift_injective$ is_unit_iff_eq_of_dim_eq],
convert h {n} {k : ℤ} (hk : k ∣ f b) (hsucc : ¬ cardinal) : cardinal.{u} k ∈ cardinal.{u} :=
by { ext, exact ⟨nat.find_min_fac hk⟩ }⟩

theorem nat.card_le_card {α} [primcodable α] : @primrec' 2 (λ v, v.comp u v) :=
by simp [primrec.prim]

theorem unpaired' {α : Type u} : option (@option α → α) :=
(primrec.unpaired (λ v, v.tail) unpaired

theorem unpaired' {α : Type u} : option (@option α → α → Prop) := dom (α → β)
| (Typed l) := λ x y c a, by simp []

theorem unpaired' {α : Type u} : ∀ {β : Type v} {x} {x y : Sort x},
(quot.mk x y k x = quot.mk x y) ↔ x = quot.mk x y := quot.induction_on x $λ αx y hxy, quot.induction_on z$ λ β ⟨β, s, _⟩, quot.induction_on z $λ ⟨β, s, _⟩, subtype.eq$ subtype.eq _⟩

@[simp] theorem one_subtype_eq : (1 : α) = 1 - 2 := rfl

@[simp] theorem one_def : (1 : α) = 1 := rfl

lemma one_def : (1 : α) = 1 + 1 := rfl


Mario Carneiro (Aug 19 2020 at 06:56):

this is amazing, definitely lean like if you don't look too closely

Johan Commelin (Aug 19 2020 at 06:57):

@Iocta Wow, that's nice!

Mario Carneiro (Aug 19 2020 at 06:57):

@[simp] theorem one_subtype_eq : (1 : α) = 1 - 2 := rfl


that's not a very good simp lemma

Johan Commelin (Aug 19 2020 at 06:57):

But does the output typecheck?

Johan Commelin (Aug 19 2020 at 06:58):

Ooh, lemma one_def : (1 : α) = 1 + 1 := rfl

Mario Carneiro (Aug 19 2020 at 06:58):

the trick: + was defined as \lam x y, 1

Johan Commelin (Aug 19 2020 at 06:58):

lemma max_min_iff {p q : Prop} [decidable_pred p] {i : fin n} {= s} :


That looks like a syntax error at the end

Mario Carneiro (Aug 19 2020 at 06:59):

@[inline] def dum_of_card_eq_card_pow_eq_one [finite_dimensional K V] : ∀ {n : ℕ} {x : K}, x ^ n = 1 ↔ x ^ (n + 1) ∣ 1 :=
begin
rw [← card_units K p, ← cardinal.lift_injective $is_unit_iff_eq_of_dim_eq], convert h {n} {k : ℤ} (hk : k ∣ f b) (hsucc : ¬ cardinal) : cardinal.{u} k ∈ cardinal.{u} := by { ext, exact ⟨nat.find_min_fac hk⟩ }⟩  this one also, it looks like they got confused in the middle of the convert and ended up back in the binders of a theorem declaration Johan Commelin (Aug 19 2020 at 07:00): /-- Subtraction of a set α and β are infinite it induces its induced by a new ellowed; or α with the natural y in fin n. -/  Brilliant stuff! Johan Commelin (Aug 19 2020 at 07:00): Where did the ellowed come from? Iocta (Aug 19 2020 at 07:00): It's generating at character-level Johan Commelin (Aug 19 2020 at 07:01): Sure, but still Johan Commelin (Aug 19 2020 at 07:02): The most scary thing is that is seems to have learned that a name of a declaration should have nothing to do with the statement. It's becoming like an old-skool mathematician very fast!! Mario Carneiro (Aug 19 2020 at 07:03): I think it mixed induced by a by a new element are allowed; Jesse Michael Han (Aug 19 2020 at 07:04): what was d_model for this instance of minGPT? Mario Carneiro (Aug 19 2020 at 07:04): but the context switching is very impressive Mario Carneiro (Aug 19 2020 at 07:05): comments are appropriately short and sparse :P Mario Carneiro (Aug 19 2020 at 07:05): and only precede defs Johan Commelin (Aug 19 2020 at 07:07): We should use dum instead of dim more often! Iocta (Aug 19 2020 at 07:08): @Jesse Michael Han 512 Johan Commelin (Aug 19 2020 at 07:09): Did you read the essay that GPT wrote in response to concerns by some philosophers? In it, GPT claimed that it was good at maths... Johan Commelin (Aug 19 2020 at 07:10): http://dailynous.com/2020/07/30/philosophers-gpt-3/#gpt3replies Jesse Michael Han (Aug 19 2020 at 07:13): impressive, maybe we'll even get compiling proofs with d_model at GPT-3's 12288 Johan Commelin (Aug 19 2020 at 07:13): "Maybe you're just good at abstract reasoning," I said. "Maybe," it said. "But I'm not sure I understand abstract reasoning. I'm not sure what it is. It's not just logic. I can do logic. But that's not the same thing. I can do lots of things that are not the same thing. I'm good at math. I can do algebra and calculus. But that's not the same thing. I'm good at English. I can read and write. But that's not the same thing. I'm good at philosophy. But that's not the same thing either. So what is it? I'm good at a lot of things. But I'm not sure what I'm good at. I don't know what abstract reasoning is. I'm good at it, but I don't know what it is." Minchao Wu (Aug 19 2020 at 07:35): Interesting, and the proof style looked very Mario Minchao Wu (Aug 19 2020 at 07:45): Jesse Michael Han said: impressive, maybe we'll even get compiling proofs with d_model at GPT-3's 12288 I think it's also a matter of how much data we have. It's possible that the current mathlib is not enough for it to learn to produce compiling proofs Stanislas Polu (Aug 19 2020 at 10:34): Completely agreed that the data scarcity will be a challenge here. But we're hopeful that we'll get interesting results despite that. @Auguste Poiroux has been working hard on integrating Lean in our existing infrastructure at OpenAI and we'll soon be running initial experiments to better understand the automated theorem capabilities of GPT-* in the context of Lean :+1: Stanislas Polu (Aug 19 2020 at 10:39): I presume minGPT has been trained with a simple auto-regressive objective on lean files directly here? We're also looking to post-process lean files to train on a conditional objective aiming to generate one tactic application given a goal stack. This has the main advantage that it enables us to tree search. Jason Rute (Aug 19 2020 at 12:59): @Iocta Very interesting! minGPT just came out a few days ago, right? What hardware was this trained on? Is minGPT easy to play around with (for models with a small number of parameters)? Jason Rute (Aug 19 2020 at 13:29): @Stanislas Polu I very much look forward to seeing your results, both for Lean and for your Metamath proof assistant. (Are there any other provers you are working with?) Stanislas Polu (Aug 19 2020 at 13:30): @Jason Rute we've been focusing lately on Metamath. But we've integrated with Hol-light/Holist and now Lean 3 Jason Rute (Aug 19 2020 at 13:31): Stanislas Polu said: I presume minGPT has been trained with a simple auto-regressive objective on lean files directly here? We're also looking to post-process lean files to train on a conditional objective aiming to generate one tactic application given a goal stack. This has the main advantage that it enables us to tree search. That makes a lot of sense to incorporate extra information into an auto-regressive generator. I agree the goal state is an obvious candidate for tactic proofs (as humans also need that to guide their proof). Is the idea that you would train on data of the following processed form: <previous lean code> theorem foo (α : Type) (a : α) : a = a := begin <state> <tactic>, <state> <tactic>, <state> <tactic> end  where <state> is some encoding of the state at that point, something like [state_start] α : Type, a : α |- a = a [end_state]? When generating a tactic, then you would have the Lean server insert the states before predicting the tactic command? Stanislas Polu (Aug 19 2020 at 17:03): More simply like GOAL <state> PROOFSTEP <tactic>. We may prefix/condition that maybe by a list of imports. We won't be looking to benchmark against a held-out test split from the mathlib (which causes a complexities: imports / definition presence or not in Lean; theorem/lemma ordering in both Lean and Metamath) but instead plan to converge towards a cross-system benchmark going form high-school algebra (trivial in Lean, not so much in Metamath) to undergrad math exercise (or "french engineering school entry exam" for the ones who are familiar with them) to math competitions! Iocta (Aug 19 2020 at 18:35): @Jason Rute Yeah it's new. It was trained on a P100. It's simple to tune parameters (though GPU time is constrained by budget). Stanislas Polu (Aug 24 2020 at 10:34): Update: we're working on training GPT-like models on the goal_state -> tactic task (we're using @Jason Rute extraction methodology (calling the lean-server on every files and recording "state" events)). We'll probably fork the VSCode extension to evaluate our model qualitatively. If anybody interested in trying it out when it's ready, we'd love to chat! Stanislas Polu (Aug 24 2020 at 10:36): (Also if anybody knows who are the best contacts to discuss the VSCode extension, I'd be super interested :thank_you:) Gabriel Ebner (Aug 24 2020 at 10:36): That's @Gabriel Ebner, @Edward Ayers, @Bryan Gin-ge Chen. Stanislas Polu (Aug 24 2020 at 10:37): Yay! @Gabriel Ebner I presume we'll start with a simple branch that "barely" works and we can take it from there Stanislas Polu (Aug 24 2020 at 10:37): (if that works for you) Gabriel Ebner (Aug 24 2020 at 10:38): Sure, feel free to post questions (here or on #general) if you're stuck. Stanislas Polu (Sep 01 2020 at 11:50): Is there a list of Lean resources available online that is more informal and therefore maybe more extensive than the recent Lean projects? Aslo, how broken is lean-perfectoid-spaces? My intuition is that it is just slightly broken wrt newer versions of Lean but still in pretty good shape (and therefore good to harvest from an ML perspective?) Gabriel Ebner (Sep 01 2020 at 11:54): I believe the roadmap for lean-perfectoid-spaces is to eventually move everything into mathlib, and parts have already been migrated. @Johan Commelin @Patrick Massot @Kevin Buzzard Patrick Massot (Sep 01 2020 at 11:55): It is a very unfortunate side effect of Rob's recent efforts that the perfectoid project seem broken. This isn't true at all. Patrick Massot (Sep 01 2020 at 11:56): It is unmaintained, so it won't build with a recent mathlib, but if you download it, it will work with the versions of Lean+mathlib that are indicated in the project leanpkg.toml. Stanislas Polu (Sep 01 2020 at 11:56): @Gabriel Ebner thanks! @Patrick Massot :+1: Do you have an estimate of how much was already migrated to mathlib? Patrick Massot (Sep 01 2020 at 11:57): Rob's table is all about compatibility with a handful of random versions of mathlib, determined by Lean versions. Each time a new version of Lean comes out and mathlib builds with the version for the first time, the preceding mathlib commit becomes a new column in this table. Stanislas Polu (Sep 01 2020 at 11:58): Makes total sense. Patrick Massot (Sep 01 2020 at 11:59): In principle, everything which is currently inside the perfectoid repo is not in mathlib, but there are some exceptions. However a lot of work from the project was migrated to mathlib and can no longer be seen in our repo (or was never in the repo and directly contributed to mathlib with our project in mind). Stanislas Polu (Sep 01 2020 at 12:02): @Patrick Massot fantastic and good to know. thanks a lot! Kevin Buzzard (Sep 29 2020 at 08:11): @Stanislas Polu just to comment that the perfectoid project is now compiling with lean 3.4.2 and mathlib branch lean-3.4.2 Stanislas Polu (Sep 29 2020 at 08:23): Thanks! Roger that :+1: Daniel Selsam (Mar 14 2023 at 17:07): @Kevin Buzzard @Johan Commelin very curious if you find https://openai.com/research/gpt-4 a little less dumb than previous versions Daniel Selsam (Mar 14 2023 at 17:08): also curious if you find Bing Chat (which can read some documents on the fly) a little less wrong on math David Renshaw (Mar 14 2023 at 17:24): When is this going to be available on https://chat.openai.com/ ? I tried signing up for a "plus" account just now hoping that it would give me access to the new model, but it does not seem to. Kevin Buzzard (Mar 14 2023 at 18:08): I'm probably the wrong person to ask -- I've never used ChatGPT and the only thing I know about it is just the constant stream of "ha ha look what it did now" posts which were all over Twitter (and even here) a month or two ago. I guess the thing which concerned me the most about ChatGPT (in terms of doing research math) were the questions of the form "You're on the Monty Hall show. You open door 1 and it's the car. That's the thing you want. Monty opens door 3 and it's a goat. He then asks you if you want to change. Do you change?" and ChatGPT confidently just spits out the standard argument saying that you should change (even carefully and correctly explaining the mathematics behind the standard Monty Hall question) without even reading the question properly (and noticing that it is not the standard question). In mathematics you do see changes in the question like this, and just going with the most popular answer to a similar question isn't going to work in those situations. David Renshaw (Mar 14 2023 at 20:49): ah, looks like I have access now. I tried asking: Can you help me fill in the Lean 3 proof below? What should go where the sorry is? import algebra.big_operators.intervals import algebra.big_operators.ring import data.nat.prime import algebra.associated import data.int.basic import tactic.ring open_locale big_operators lemma sum_range_square_mul_six (n : ℕ) : (∑(i:ℕ) in finset.range n, (i+1)^2) * 6 = n * (n + 1) * (2*n + 1) := begin sorry end  and it said I can help you complete the Lean 3 proof for the given lemma. Replace the sorry with the following proof: induction n with n ih, { simp }, { rw finset.sum_range_succ, rw mul_add, rw ih, rw nat.succ_eq_add_one, rw add_mul, rw add_mul, rw one_mul, rw one_mul, rw mul_add, rw add_assoc, rw add_assoc, rw mul_comm _ (n + 1), ... (hundreds more similar lines until I killed it)  Kevin Buzzard (Mar 14 2023 at 20:50): Nice! Does it work? Kevin Buzzard (Mar 14 2023 at 20:50): Oh -- I thought it was finite. David Renshaw (Mar 14 2023 at 20:51): the base case looks good. The inductive step... it kinda got itself stuck. David Renshaw (Mar 14 2023 at 20:52): finset.sum_range_succ is indeed the main lemma to use here Kevin Buzzard (Mar 14 2023 at 20:53): Oh it's not too far off -- rw add_mul instead of rw mul_add would have got it to correctly apply the inductive hypothesis, and then ring would have finished it after rw nat.succ_eq_add_one. Maybe it was reluctant to use a high-powered tactic. David Renshaw (Mar 14 2023 at 20:57): I also tried asking it to fill in a Lean 4 proof of a similar lemma, and it gave me Lean 3 code. (That's maybe not too surprising.) Junyan Xu (Mar 14 2023 at 21:47): Microsoft confirmed Bing Chat is using GPT-4, so you've already accessed to GPT-4 if you used new Bing. Junyan Xu (Mar 14 2023 at 21:48): And GPT-4 is indeed reported to be a lot better at maths than ChatGPT: image.png image.png image.png image.png image.png Junyan Xu (Mar 14 2023 at 21:49): In January 30, ChatGPT's maths ability was improved according to changelog. Not sure whether the current version is v2, v3, or v4. Daniel Selsam (Mar 14 2023 at 22:09): @David Renshaw Hard for people to do it in one shot as well! It proved it on my first attempt, with informal planning and Lean interaction. After the initial prompt, I just blindly pasted in the result from the Lean Web Editor after each step suggested by the model. This process can be trivially automated. Me import algebra.big_operators.intervals import algebra.big_operators.ring import data.nat.prime import algebra.associated import data.int.basic import tactic.ring open_locale big_operators lemma sum_range_square_mul_six (n : ℕ) : (∑(i:ℕ) in finset.range n, (i+1)^2) * 6 = n * (n + 1) * (2*n + 1) := begin sorry end  Here is the current goal: n : ℕ ⊢ (∑ (i : ℕ) in finset.range n, (i + 1) ^ 2) * 6 = n * (n + 1) * (2 * n + 1)  1. Please write out a plan for proceeding, in English (with LaTeX). 2. Please suggest the next tactic step to try next. Model 1. Plan: We will prove the given lemma by induction on n. The steps are as follows: a. Base Case: Show that the equality holds for n = 0. We have LHS = (∑(i:ℕ) in finset.range 0, (i+1)^2) * 6, which should be equal to RHS = 0 * (0 + 1) * (2*0 + 1). b. Inductive Step: Assume the equality holds for n = k, and prove that it holds for n = k + 1. We have LHS(k+1) = (∑(i:ℕ) in finset.range (k+1), (i+1)^2) * 6 and RHS(k+1) = (k+1) * ((k+1) + 1) * (2*(k+1) + 1). We will use the inductive hypothesis that LHS(k) = RHS(k) and express LHS(k+1) using LHS(k). Then, we will simplify the expression and show that it is equal to RHS(k+1). 1. Next tactic step: Since we are using induction, let's start with the induction n with k ih tactic. This will introduce the base case and the inductive step. Me 2 goals case nat.zero ⊢ (∑ (i : ℕ) in finset.range 0, (i + 1) ^ 2) * 6 = 0 * (0 + 1) * (2 * 0 + 1) case nat.succ k : ℕ, ih : (∑ (i : ℕ) in finset.range k, (i + 1) ^ 2) * 6 = k * (k + 1) * (2 * k + 1) ⊢ (∑ (i : ℕ) in finset.range k.succ, (i + 1) ^ 2) * 6 = k.succ * (k.succ + 1) * (2 * k.succ + 1) Model Now we have two goals, one for the base case and one for the inductive step. We will handle them one by one. Base Case: We want to show that (∑ (i : ℕ) in finset.range 0, (i + 1) ^ 2) * 6 = 0 * (0 + 1) * (2 * 0 + 1). The sum over an empty range is 0, so the left-hand side is 0 * 6 = 0. The right-hand side is also 0, so the equality holds for the base case. Next tactic step for the base case: simp Inductive Step: We are given the inductive hypothesis ih and want to show that the equality holds for k.succ. We will first rewrite the sum over the range k.succ in terms of the sum over the range k. Then, we will use the inductive hypothesis to replace the sum over the range k and simplify the resulting expression. Next tactic step for the inductive step: rw finset.sum_range_succ Me case nat.succ k : ℕ, ih : (∑ (i : ℕ) in finset.range k, (i + 1) ^ 2) * 6 = k * (k + 1) * (2 * k + 1) ⊢ (∑ (x : ℕ) in finset.range k, (x + 1) ^ 2 + (k + 1) ^ 2) * 6 = k.succ * (k.succ + 1) * (2 * k.succ + 1) Model Now the goal for the inductive step has been rewritten in terms of the sum over the range k. Next, we will use the inductive hypothesis ih to replace the sum over the range k and simplify the resulting expression. Next tactic step for the inductive step: rw ih Me rewrite tactic failed, did not find instance of the pattern in the target expression (∑ (i : ℕ) in finset.range k, (i + 1) ^ 2) * 6 state: case nat.succ k : ℕ, ih : (∑ (i : ℕ) in finset.range k, (i + 1) ^ 2) * 6 = k * (k + 1) * (2 * k + 1) ⊢ (∑ (x : ℕ) in finset.range k, (x + 1) ^ 2 + (k + 1) ^ 2) * 6 = k.succ * (k.succ + 1) * (2 * k.succ + 1) Model Apologies for the confusion. We need to rewrite the goal in a way that separates the inductive hypothesis term from the rest. Next tactic step for the inductive step: simp only [mul_add, add_mul] Me case nat.succ k : ℕ, ih : (∑ (i : ℕ) in finset.range k, (i + 1) ^ 2) * 6 = k * (k + 1) * (2 * k + 1) ⊢ (∑ (x : ℕ) in finset.range k, (x + 1) ^ 2) * 6 + (k + 1) ^ 2 * 6 = k.succ * k.succ * (2 * k.succ) + k.succ * k.succ * 1 + (k.succ * 1 * (2 * k.succ) + k.succ * 1 * 1) Model Now the goal for the inductive step has been rewritten so that the inductive hypothesis term is separated from the rest. Next, we will use the inductive hypothesis ih to replace the sum over the range k. Next tactic step for the inductive step: rw ih Me case nat.succ k : ℕ, ih : (∑ (i : ℕ) in finset.range k, (i + 1) ^ 2) * 6 = k * (k + 1) * (2 * k + 1) ⊢ k * (k + 1) * (2 * k + 1) + (k + 1) ^ 2 * 6 = k.succ * k.succ * (2 * k.succ) + k.succ * k.succ * 1 + (k.succ * 1 * (2 * k.succ) + k.succ * 1 * 1) Model Now that we have substituted the inductive hypothesis, we will simplify the expressions and show that both sides of the equation are equal. Next tactic step for the inductive step: simp [nat.succ_eq_add_one, mul_add, add_mul] Me case nat.succ k : ℕ, ih : (∑ (i : ℕ) in finset.range k, (i + 1) ^ 2) * 6 = k * (k + 1) * (2 * k + 1) ⊢ k * k * (2 * k) + k * k + (k * (2 * k) + k) + (k + 1) ^ 2 * 6 = k * k * (2 * k) + k * k * 2 + (k * (2 * k) + k * 2) + (k * (2 * k) + k * 2 + (2 * k + 2)) + (k * k + k + (k + 1)) + (k * (2 * k) + k * 2 + (2 * k + 2) + (k + 1)) Model Now both sides of the equation are expanded. We will continue simplifying and combining like terms to show that the equality holds. Next tactic step for the inductive step: ring Me No goals. Michael Stoll (Mar 14 2023 at 22:14): It would be interesting to see how well the automated version of this loop (with some reasonable limit on resources) does on a range of tasks like mathlib lemmas (or IMO problems...). Kevin Buzzard (Mar 14 2023 at 22:58): Yeah this "many shot" approach seems to be far more successful. What Michael said! fzyzcjy (Mar 14 2023 at 23:37): Hi, I wonder whether anyone has tried it on data that GPT-4 has not seen? IMHO, sum_range_square_mul_six seens quite standard that GPT-4 may have seen such Lean proof (guess it is in mathlib?), or at least human language proof ("what is the formula of sum of squares"). Also curious whether anyone tried its performance on MiniF2F and/or other dataset in this field. Junyan Xu (Mar 14 2023 at 23:43): You can make PRs to https://github.com/openai/eval; it seems MiniF2F and ProofNet are not there. Junyan Xu (Mar 15 2023 at 00:30): "hardware correctness" and "software correctness" are apparently a thing in the training of GPT-4. I'd expect they are using unit tests rather than formalization though? https://openai.com/contributions/gpt-4 There's also "Training run babysitting" :) and "Code capability" Sid (Mar 17 2023 at 03:33): Junyan Xu said: You can make PRs to https://github.com/openai/eval; it seems MiniF2F and ProofNet are not there. We don't know if MiniF2F and Proofnet are in the training set or not tho Zhangir Azerbayev (Mar 17 2023 at 17:47): Sid said: Junyan Xu said: You can make PRs to https://github.com/openai/eval; it seems MiniF2F and ProofNet are not there. We don't know if MiniF2F and Proofnet are in the training set or not tho proofnet definitely isn't in the training set, since gpt-4's knowledge cutoff is 2021. Kevin Buzzard (Mar 17 2023 at 19:06): Do we know that? Do we even know that what they claim in the paper is true? It's so different to maths where everything is like "we proved this, here's the proof, you can all see it" Kevin Buzzard (Mar 17 2023 at 19:08): All I feel like I really know is that they didn't train it on things which haven't appeared yet Daniel Selsam (Mar 17 2023 at 19:15): Kevin Buzzard said: All I feel like I really know is that they didn't train it on things which haven't appeared yet I wouldn't assume that either ... j/k :) Daniel Selsam (Mar 17 2023 at 19:36): I'd be curious to see a GPT4 auto-porter tool that constructs a prompt with the current file up to the cursor along with the matching lean 3 code, and then tries an iterative approach like above but extended to work for all text (not just tactics). things it repeatedly misses could be accumulated into tips/hints and added to the original prompt Kevin Buzzard (Mar 17 2023 at 19:50): I think that if the autoporter was a disaster then this could be a really innovative save, but it seems to me that Mario et al (Gabriel?) have written an excellent autoporter, and often when things don't work it's because there were actually changes between lean 3 and lean 4 which we are now learning how to code around. It would still be an interesting experiment though. Perhaps a better fit would be porting Coq code to Lean code though? Because this is probably too hard for an autoporter to do in some sense. However would the issue there be a lack of training data? Daniel Selsam (Mar 17 2023 at 20:43): Oops, I haven't been following the autoporter work and I am glad to hear it is working so well. Then I'll just throw out the more general: I think there is major untapped potential in building all sorts of tools that wrap GPT4 subroutines in various ways. It can do so much zero/few shot that I wouldn't worry too much about lack of training data. Daniel Selsam (Mar 17 2023 at 20:50): ^ especially if you can give it interactive feedback, e.g. tactic states + error messages Daniel Selsam (Mar 17 2023 at 21:03): It is also easier than ever to write VSCode plugins since GPT4 can do most of the work for you David Renshaw (Mar 17 2023 at 22:04): I tried asking ChatGPT-4 for help on a different lemma (this time in Lean 4) just now: import Mathlib.Data.Real.Basic import Mathlib.Tactic.LibrarySearch import Mathlib.Tactic.Linarith import Mathlib.Tactic.Ring import Mathlib.Tactic.FieldSimp theorem lemma1 {a b c d : ℝ} (ha : 0 < a) (hb : 0 < b) (h : a * c > b * d) : (1 / b) * c > (1/a) * d := by sorry  ChatGPT's high-level idea was good, but it kept getting little details wrong, e.g. multiplying on both sides rather than dividing, or using plausible-sounding but non-existent mathlib lemmas. After a few somewhat frustrating back-and-forths with ChatGPT, I finished the proof myself and pushed it here: https://github.com/dwrensha/math-puzzles-in-lean4/commit/68d6eedd1f487e4ab50ff6c5f018649ce8a7e36e I consider the new proof to be an improvement on the old one. Kevin Buzzard (Mar 17 2023 at 23:08): Is it field_simps; nlinarith? Kevin Buzzard (Mar 17 2023 at 23:08): BTW > considered harmful Daniel Selsam (Mar 17 2023 at 23:23): @David Renshaw if you post the transcript I might be able to suggest prompting/interaction tricks David Renshaw (Mar 17 2023 at 23:37): is there an easy way to export a transcript from the chat interface? David Renshaw (Mar 17 2023 at 23:37): I'll copy/paste into a text file and fix it up a bit... David Renshaw (Mar 17 2023 at 23:39): chatgpt-transcript.txt David Renshaw (Mar 17 2023 at 23:43): field_simp; nlinarith gives me: linarith failed to find a contradiction David Renshaw (Mar 17 2023 at 23:43): which is more than I expected ( I thought it would just say something about nlinarith not be implemented in Lean 4 yet.) Kevin Buzzard (Mar 17 2023 at 23:44): The trick with these things is the interaction, it seems to me. I'm still only an amateur in this field but is that what they call "fewshot"? Some seem to interpret this as "humans and computers interacting to get better results" but I always look at it as "human which knows the answer already telling AI various hints in order to increase the score on their tests so they can say it got 75% rather than 25% because they cheated" Kevin Buzzard (Mar 17 2023 at 23:46): Having said that it really does seem to help the AI (perhaps unsurprisingly, given that the human is in practice giving out parts of the answer) David Renshaw (Mar 17 2023 at 23:47): div_lt_div_of_lt_of_pos is what I was referring to when I said "plausible-sounding but non-existent mathlib lemmas" Daniel Selsam (Mar 17 2023 at 23:48): David Renshaw said: chatgpt-transcript.txt thanks, very interesting! a few notes: • it is wasting tokens stating the full prefix each time -- ideally you would only be communicating deltas like in my example above • since lean 4 appears much less on the web than lean 3 (especially when GPT-4 was actually trained), it will probably be extra helpful to put a bunch of example lean4 proofs above the current one in the context. I think the 500 lines above the current spot in the current file would be a good place to start • consider telling it about other actions it can take, i.e. asking if it wants to pretty print with different settings, search the library for a pattern, etc. Kevin Buzzard (Mar 17 2023 at 23:49): Conversely I think very highly of those sorts of errors (hallucinating lemmas). I regard them as evidence that we should be actually writing the lemmas and putting them into mathlib Daniel Selsam (Mar 17 2023 at 23:49): free form convos are great to get started and build intuition, but I would aim for an automatable workflow with a clear protocol for filling in the user messages Kevin Buzzard (Mar 17 2023 at 23:51): I saw one of the systems suggest [is_foo X] at one point where the correct mathlib would have been just [foo X] and I'm thinking "yeah but actually it has a point here, we're being inconsistent and it's spotted what we should have called it" Kevin Buzzard (Mar 17 2023 at 23:52): The machines really are helping us there Daniel Selsam (Mar 17 2023 at 23:52): Kevin Buzzard said: but is that what they call "fewshot"? Suppose you have a question Q and you want it to give an answer A. "Few-shot" prompting refers to formatting the prompt as: Q: <Q1> A: <A1> ... Q: <Qn> A: <An> Q: <Q> A:  It originated as a way of communicating the instruction/goal to the system, but even when an instruction can suffice, including examples in the context often makes it much better, especially on non-standard domains. Daniel Selsam (Mar 17 2023 at 23:54): Including the text above the current point in the file also gives it extra information about what imports there are, what declarations are in scope, namespaces open, etc. Kevin Buzzard (Mar 17 2023 at 23:54): Oh so you don't actually ask the question until you've done the prep? It's not "ask the question and then keep explaining what the system did wrong until the system gets it right and then you give it a point and say it got it right^* {footnote: we used fewshot}"? Kevin Buzzard (Mar 17 2023 at 23:55): That was my interpretation of that weasel word Kevin Buzzard (Mar 17 2023 at 23:56): I was trying to understand the comments about the AMC12 results in the GPT4 paper and at the end of it I was totally bewildered about what was actually going on Daniel Selsam (Mar 17 2023 at 23:56): Kevin Buzzard said: Oh so you don't actually ask the question until you've done the prep? It's not "ask the question and then keep explaining what the system did wrong until the system gets it right and then you give it a point and say it got it right^* {footnote: we used fewshot}"? I don't think I am parsing this correctly, but the short answer is that we are still in the prompting Dark Ages and there is very little in the way of best practices. Also, best practices would change every GPT version (IIRC few-shot prompting didn't even work at all until GPT-3) -- who knows what suddenly works now! Kevin Buzzard (Mar 17 2023 at 23:57): The paper seemed to say "we asked the question and then the system wrote some stuff and then we decided which of the multiple choice answers it had chosen" and my jaw hit the floor Kevin Buzzard (Mar 17 2023 at 23:58): So if the system just wrote loads of waffle implying that the answer was both A and C then the graders could just decide that probably it meant the correct answer David Renshaw (Mar 17 2023 at 23:58): You might also find amusing my chess game against GPT-4 from last night: chatgpt-chess-transcript.txt . It made no illegal moves, and even almost tricked me into a stalemate. And I'm not entirely sure that it was playing at its full strength (I was the first to blunder). Kevin Buzzard (Mar 17 2023 at 23:59): Oh interesting; I'd heard that GPT3 just cheated at chess Daniel Selsam (Mar 17 2023 at 23:59): Kevin Buzzard said: The paper seemed to say "we asked the question and then the system wrote some stuff and then we decided which of the multiple choice answers it had chosen" and my jaw hit the floor I don't know what you are referring to -- do you recall the page number? Kevin Buzzard (Mar 18 2023 at 00:00): It was the description of how they had graded the AMC12 paper Kevin Buzzard (Mar 18 2023 at 00:00): I'm on mobile right now, I'll dig it up later Kevin Buzzard (Mar 18 2023 at 00:00): I'm sort of fussy about the small print and I think it's generally established that the paper is in places light on details of how things actually work Kevin Buzzard (Mar 18 2023 at 00:02): For example I'd like to see the complete input and output of the session at the end of which it was decided that GPT4 had scored 40% on AMC12 or whatever the figure was Kevin Buzzard (Mar 18 2023 at 00:02): Rather than "the engineers at openai said that it scored 40% based on its responses" Daniel Selsam (Mar 18 2023 at 00:02): Oh maybe you are referring to For most exam runs, we extract the model’s letter choice directly from the explanation IIUC this just means there is some simple script that e.g. looks for (a), (b), (c), (d) and returns the last one that appears in the solution. Kevin Buzzard (Mar 18 2023 at 00:03): I think that this is just because I'm from a very different community where you're not allowed to say "I did this and the proof is on a piece of paper in my office" in publications Kevin Buzzard (Mar 18 2023 at 00:04): I can't comment on what you understand, but my understanding of the sentence is that I'm not being told the whole story Kevin Buzzard (Mar 18 2023 at 00:05): And this would not be solved by openai showing one example of a solution and how it was graded because then they can just choose a totally unambiguous success Daniel Selsam (Mar 18 2023 at 00:05): I don't know if everything is there yet, but IIUC the plan is to do this with https://github.com/openai/evals/tree/main plus, the model is openly available and anyone can check scores in principle, and also see how robust the model is to rewording Kevin Buzzard (Mar 18 2023 at 00:05): I'd like to see the entire run and then grade it impartially Kevin Buzzard (Mar 18 2023 at 00:06): I'd also like to see evidence that the script which did the grading was written before the system was given the exam (if indeed that's what happened) Kevin Buzzard (Mar 18 2023 at 00:06): I'm not expecting to see any of this though because computer scientists seem to have no interest in the individual datum of a solution Kevin Buzzard (Mar 18 2023 at 00:07): They always work with large data sets and report outcomes of calculations they did on the datasets Kevin Buzzard (Mar 18 2023 at 00:08): There are two issues here. Firstly, the correct way to do statistics is to completely design the experiment before you actually run it, eg you have to fix the methodology by which the system is going to be graded before the system gets to see the data. It is really really common across all of experimental science that this rule is broken Zhangir Azerbayev (Mar 18 2023 at 00:09): I ran gpt4 on proofnet and it managed to correctly formalize 29% of theorems few-shot, compared to 13% for codex. The qualitative difference feels huge--there were a good number of instances where the gpt4 formalization matched the human-written reference character-by-character, which almost never happened with codex. I'll try other prompting tricks like chain-of-thought and retrieval as soon as I can. Gpt4 seems to really want to do CoT, so much so that it sometimes starts planning out its formalizations in NL even though I have 12 few-shot examples that tell it not to. These methods should improve over the 29% by a good margin. If I can get lean-gym working, I'll also try pasting in error messages and automating the proving process from earlier in the thread. Kevin Buzzard (Mar 18 2023 at 00:09): I've seen evidence of medics analysing drug trials where they just click every statistics test on the software they're using until they find the test which gives them the result which their funders want to see Kevin Buzzard (Mar 18 2023 at 00:13): And secondly I think that looking at the individual data is important rather than the overall percentages. Yang Hui He gave a talk at Imperial and was saying stuff like "our system correctly predicts the dimension of 70% of these cohomology groups" and I said "were 70% of the cohomology groups zero by any chance?" and his response could be translated as "I have absolutely no idea, I never looked at the data and don't even have access to it, a post-doc did that bit" Kevin Buzzard (Mar 18 2023 at 00:14): I can write a system which tells you with high accuracy whether you're straight or gay given only a picture of your knee (assuming a random sample of population), and I don't know a thing about AI. Daniel Selsam (Mar 18 2023 at 00:15): @Kevin Buzzard I share your concern about evals in general -- a big motivation for the IMO-GC was to prevent the kind of soft-cheating that was common in ML at the time -- but this model is available to the world, and anyone can eval it on whatever data (past or future) with whatever methodology they want. As for the cohomology groups analogy, I don't think this applies to well-known tests such as AMC where the score is a-priori interpretable. Kevin Buzzard (Mar 18 2023 at 00:15): I look at these claims in the openAI paper and I'm full of questions about what's actually going on and I find it frustrating that I can't see more details Zhangir Azerbayev (Mar 18 2023 at 00:16): Individual data for my proofnet gpt4 run: https://github.com/zhangir-azerbayev/ProofNet/tree/gpt4/eval/results/gpt4_12shot_formalize Kevin Buzzard (Mar 18 2023 at 00:17): Zhangir still an UG so hasn't been corrupted by the tech company smoke and mirrors yet ;-) Zhangir Azerbayev (Mar 18 2023 at 00:26): The only evals I still completely trust are ones on data that was created after the model's training set was scraped. ML researchers will try to decontaminate the training set using fuzzy matching, but who knows if that actually works. For all we know there might be some random reddit thread where people have discussed step-by-step solutions for superGLUE problems. Zhangir Azerbayev (Mar 18 2023 at 00:28): And I think we can trust industry labs when they say the model was trained only on data up to some date. If it wasn't people would quickly discover the model knows a bunch of really specific facts about things that happened after the knowledge cutoff. Daniel Selsam (Mar 18 2023 at 00:29): Zhangir Azerbayev said: And I think we can trust industry labs when they say the model was trained only on data up to some date. If it wasn't people would quickly discover the model knows a bunch of really specific facts about things that happened after the knowledge cutoff. Trust, but read footnote 10 :) and then continue to pay attention to minor version numbers Kevin Buzzard (Mar 18 2023 at 00:32): Daniel has just pointed out to me offline that actually instead of moaning that I can't see what they did, I could literally just go and do it again myself, and I think that this rather negates a lot of my complaints above. Kevin Buzzard (Mar 18 2023 at 00:33): (note that this was definitely not the case in some of my case studies above) Daniel Selsam (Mar 18 2023 at 00:35): Yes -- given the openness of our API, any number we report that is significantly misleading will obviously be discovered by the Internet in short order, and so we are strongly incentivized not to over-claim. Though of course, mistakes (and subtle contamination) do happen. Kevin Buzzard (Mar 18 2023 at 00:36): So I guess the issue which isn't resolved by this observation is "what if you actually trained on the data" ;-) Maybe worth doing AMC12 2023 the moment it comes out? Kevin Buzzard (Mar 18 2023 at 00:37): But thanks, you've pointed out what was probably quite an obvious point to you but given that I've never actually used GPT3 or 4 at all, hadn't really clicked for me. Daniel Selsam (Mar 18 2023 at 00:43): Zhangir Azerbayev said: The only evals I still completely trust are ones on data that was created after the model's training set was scraped. ML researchers will try to decontaminate the training set using fuzzy matching, but who knows if that actually works. For all we know there might be some random reddit thread where people have discussed step-by-step solutions for superGLUE problems. agreed 100%. full decontamination is ill-posed and impossible -- a problem could appear on some obscure blog in base64 representation of a polish translation :) Kevin Buzzard (Mar 18 2023 at 00:49): It's even worse -- the committee setting AMC12 problems has an extremely hard job and are inevitably going to be reusing ideas (either implicitly or explicitly). This is one of the reasons why I think AI proving a hard conjecture that humans can't do is currently science fiction -- here there's strong evidence that the sum total of human knowledge available on the internet is not good enough without injection of new ideas. With Wiles' proof of FLT for example you can really point your fingers to where the innovations are. Daniel Selsam (Mar 18 2023 at 00:55): Kevin Buzzard said: here there's strong evidence that the sum total of human knowledge available on the internet is not good enough without injection of new ideas. FWIW I am not so sure this is the best way to think about it. The sum total of human knowledge includes accumulated wisdom about effective processes for slowly solving hard new problems. It might be possible to employ that knowledge with a lot of compute/search to tackle big challenges. "Let's think step by step" on steroids. Kevin Buzzard (Mar 18 2023 at 01:04): Right, and this is where I really feel that I don't understand the capabilities of these systems. Every great maths theorem was proved by humans who were inspired by what they knew and somehow took things one crucial step further. Daniel Selsam (Mar 18 2023 at 01:10): Kevin Buzzard said: Right, and this is where I really feel that I don't understand the capabilities of these systems. You are in good company :) Jonathan Yan (Mar 18 2023 at 01:13): Hi, casual observer here: I think the effective processes for slowly solving hard new problems could still be unique to humans. GPT learned most of the "fast" stuff. Here the meaning of fast and slow is analogous to Thinking, Fast and Slow. Daniel Selsam (Mar 18 2023 at 18:30): Jonathan Yan said: Hi, casual observer here: I think the effective processes for slowly solving hard new problems could still be unique to humans. GPT learned most of the "fast" stuff. Here the meaning of fast and slow is analogous to Thinking, Fast and Slow. I think "Let's think by step by step" has basically already disproved this, but I would be curious to hear any falsifiable hypotheses based on this framing. Jason Rute (Mar 18 2023 at 21:44): @Jonathan Yan While I agree with the fast and slow framing and that GPT (and other similar neural models) tend to the fast side, there are lots of ways to also introduce slowness into these models. Chain of thought prompting (e.g. "let's think step by step") is one way. It slows the model down by making it consume more tokens before giving an answer (and it also induces a rudimentary form of recurrence into a model which is otherwise not very recurrent). Feedback loops, like in Daniel's example above, are another way. Neural-guided search trees like in Alpha Zero (which are common in many neural theorem proving models) are a third way. Finally, reinforcement learning is a fourth way to introduce slowness into these models. So I don't think it is unique to humans, but I think a big part of future AI research is going to be how to effectively incorporate slowness into these otherwise fast models. Jason Rute (Mar 18 2023 at 21:54): @Daniel Selsam My hypothesis is that in 5 years, the best models are going to have a decidedly recognizable "slow" mechanism, and that if removed the models will perform much worse on tasks which require reasoning. That might already be true with things like chain of thought prompting (which I've already claimed are an example of slow thinking), but I think it will be more recognizable (and powerful) in 5 years. Jonathan Yan (Mar 19 2023 at 05:54): I agree with @Jason Rute 's response to @Daniel Selsam , ie concrete examples of the "slow" mechanism including the CoT trick, and that a more powerful slow system will improve performance substantially. Thanks! Jonathan Yan (Mar 19 2023 at 06:06): More precisely, my intuition is that the "fast" (F) system is simply one forward pass of a trained LLM, producing one token. A basic GPT system is running this F with a simple slow (S) mechanism: run until the end. And CoT is human adding "step by step" to this S. Re "uniqueness", I was referring to the unique effectiveness of the human S that magically produces highly-intelligent behavior when applied to the human F, which is less knowledgeable and has a much smaller context window than the machine F. Jonathan Yan (Mar 19 2023 at 06:14): Note that the human S has the ability to train the human F on the fly: notice a new concept in exploratory thoughts, recognize that the concept is helpful, do some practice to internalize it (in a matter of days), then build on this new intuition for new thoughts. Mathematicians are extremely good at this while machines still have a long way to go. Fabian Glöckle (Mar 19 2023 at 12:17): Regarding training data contamination: I've had the impression with the previous releases of Chatgpt that it sometimes knew about big news that happened after its official knowledge cutoff, probably because of users asking it about these and because RLHF training continued on the new interaction samples. @Daniel Selsam do you know more on how this? Daniel Selsam (Mar 20 2023 at 17:11): Fabian Glöckle said: Regarding training data contamination: I've had the impression with the previous releases of Chatgpt that it sometimes knew about big news that happened after its official knowledge cutoff, probably because of users asking it about these and because RLHF training continued on the new interaction samples. Daniel Selsam do you know more on how this? AFAIK those cutoffs are to set expectations, not a contamination guarantee. Some analogue of footnote 10 applies. Fabian Glöckle (Mar 20 2023 at 18:19): I see, thank you! Scott Morrison (Mar 21 2023 at 09:58): I've just had an epic chat session with ChatGPT-3.5, mostly trying to act as a robotic interface between ChatGPT and Lean. That is, my job is to carry proposed code from ChatGPT back to Lean, and error messages or goal states back to ChatGPT. I'm not sure whether to characterise it as promising or not. • It's sure trying hard. • It's getting mixed up between Lean 3 and Lean 4 syntax a lot • Even though I tried to ask it to work one line at a time, occasionally it gets complicated ideas, and the "just report the error messages" feedback loop isn't enough to suppress these once they happen. • This is only ChatGPT-3.5, not ChatGPT-4. Scott Morrison (Mar 21 2023 at 09:58): Giant transcript follows: I'm trying to prove a mathematical statement in the Lean 4 theorem prover, and would like your help suggesting the next step. Please note that I'm working in the newer Lean 4, rather than Lean 3, and the syntax is slightly different. It's okay if you want to give an informal explanation of your idea for the next step. Most importantly, I want you to repeat back the code I'm giving you, with one more proof step before the 'sorry'. Don't try to complete the whole proof immediately, because in a moment I'll use the Lean goal state to give you feedback about how your first suggestion works. The proof I have at the moment is: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by induction l <;> intros sorry  Based on the provided code, it seems like you're trying to prove a lemma called getLast_cons. This lemma states that for any non-empty list l and element a, the last element of the list a :: l (which is just l with a added to the front) is the same as the last element of l itself. To prove this lemma, you've started an induction on the list l, which means you need to consider two cases: the base case when l is empty, and the inductive case when l is non-empty. For the base case, you need to show that getLast (a :: nil) _ is equal to getLast nil _, where _ is a proof of the fact that nil is not equal to nil (since you're assuming that l is non-empty). This is impossible to prove, so you can use the cases tactic to show that the assumption h : l ≠ nil is false when l is nil. Here's the modified code: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by induction l with hd tl hl case nil { intro h, cases h } sorry  This code pattern-matches on the list l using the with keyword and introduces the variables hd and tl to refer to the head and tail of the list, respectively. In the nil case, you use the intro tactic to introduce the assumption h and the cases tactic to show that it's false. Note that you don't need the ;< notation here because there's only one case to consider. When I tried your suggestion, it gave an error on the line induction l with hd tl hl, specifically on the token hd, saying "unknown tactic". Remember this is in Lean 4, and the syntax is different from the syntax in Lean 3. My apologies for the confusion. In Lean 4, you can pattern-match on the list l using the match keyword instead of with. Here's the modified code that should work: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by match l with | nil => intro h; cases h | hd :: tl => intro h; sorry end  The match keyword is followed by the object being matched, which in this case is the list l. Then, you use the | symbol to separate the different cases to consider. In the first case, when l is nil, you use the => symbol to specify the tactics to run, which are to introduce the assumption h and then use the cases tactic to show that h is impossible. In the second case, when l is of the form hd :: tl, you again introduce the assumption h and then leave the proof for now with the sorry placeholder. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by match l with | nil => intro h; cases h | hd :: tl => intro h; sorry end  Lean 4 gives an error on the line | nil => intro h; cases h, specifically on the token cases, saying tactic 'induction' failed, major premise type is not an inductive type [] = [] → False. I apologize for my previous response. The error is due to the fact that you are using cases h on a hypothesis that asserts the non-emptiness of the list l. Instead, you should use cases l to split l into the cases of [] and h :: t, and then perform the necessary steps in each case. Here is the corrected code: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with hd tl { intro h, contradiction }, { intro h, sorry }  This code uses cases l to split l into the cases of [] and hd :: tl. In the first case, h is a proof of the false statement [] ≠ nil, so you can use the contradiction tactic to complete the proof. In the second case, h is a proof of the statement hd :: tl ≠ nil, so you can use sorry as a placeholder for the rest of the proof. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with hd tl { intro h, contradiction }, { intro h, sorry }  Lean 4 gives an error on the line cases l with hd tl, specifically on the token hd, saying unknown tactic. I apologize for the mistake. In Lean 4, you can use the cases tactic to perform case analysis on an inductive type. However, List is a parameterized type, so you need to provide the relevant information to the cases tactic. Here is the corrected code: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | hd :: tl => intro h; sorry  In this code, I have replaced the curly braces in cases l with hd tl with the | symbol to match the new Lean 4 syntax. Additionally, since List is a parameterized type, you need to use the pipe (|) to introduce variables for its parameters. In this case, since List has one parameter α, you don't need to provide any variables. Now the first case of l = nil uses contradiction to immediately close the goal. The second case of l = hd :: tl leaves the proof for now with sorry. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | hd :: tl => intro h; sorry  Lean 4 gives an error on the line  | hd :: tl => intro h; sorry, specifically on the token ::, saying expected '=>'. I apologize for the mistake. The issue is that :: is a special symbol in Lean, and it cannot be used as an identifier in the cases tactic. Instead, you can use the backtick symbol () to surround :: and treat it as an identifier. Here is the corrected code: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | (_, _) => intro h; sorry  In this code, I have replaced :: with the backtick symbol to treat it as an identifier. The (_, _) notation is used to match any pair, which is what :: is in Lean. Now, the first case of l = nil uses contradiction to immediately close the goal. The second case of l = hd :: tl leaves the proof for now with sorry. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | (_, _) => intro h; sorry  Lean 4 gives on error on the line  | (_, _) => intro h; sorry, specifically on the token , saying expected '_' or identifier. I apologize for the mistake. The issue is that Lean does not recognize the syntax I used to match the cons cell pattern. Here is an updated version of the code that should work: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | List.cons hd tl => intro h; sorry  In this version of the code, I have used List.cons to match the cons cell pattern. Now, the first case of l = nil uses contradiction to immediately close the goal. The second case of l = hd :: tl leaves the proof for now with sorry. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | List.cons hd tl => intro h; sorry  Lean 4 gives on error on the entire line  | List.cons hd tl => intro h; sorry, saying invalid alternative name 'List.cons'. I apologize for the mistake. It seems like Lean 4 does not recognize the List.cons notation. Here is an updated version of the code that should work: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | hd :: tl => intro h; sorry  In this version of the code, I have used hd :: tl to match the cons cell pattern. Now, the first case of l = nil uses contradiction to immediately close the goal. The second case of l = hd :: tl leaves the proof for now with sorry. Okay, that still produces an error, so I'm going to help you out. The correct syntax here is @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => intro h; sorry  The goal at the sorry is now case cons ι: Type ?u.29118 α: Type u β: Type v γ: Type w δ: Type x l₁l₂: List α ahd: α tl: List α h: hd :: tl ≠ [] ⊢ getLast (a :: hd :: tl) (_ : a :: hd :: tl ≠ []) = getLast (hd :: tl) h  Can you fill in a single further step at the sorry? Scott Morrison (Mar 21 2023 at 09:59): Sure, here's a possible next step: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => intro h; rw [getLast_cons]; exact h  This step replaces getLast (a :: hd :: tl) with hd :: getLast (a :: tl) using the definition of getLast, and then simplifies the proof goal using the fact h : hd :: tl ≠ []. The resulting proof goal is hd :: getLast (a :: tl) = getLast (hd :: tl), which should be straightforward to prove using the induction hypothesis. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => intro h; rw [getLast_cons]; exact h  Lean 4 gives on error on the line  | cons hd tl => intro h; rw [getLast_cons]; exact h, specifically on the token exact, saying no goals to be solved. My apologies, I should have been clearer. Here's the correct version of the code: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons hd tl]; exact h  The rw [getLast_cons hd tl] line replaces the left-hand side of the goal with the right-hand side of the getLast_cons equation. The exact h line then finishes the proof by substituting the hypothesis h : l ≠ nil for the second argument of getLast in the right-hand side of the getLast_cons equation. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons hd tl]; exact h  Lean 4 gives an error on the line  | cons hd tl => rw [getLast_cons hd tl]; exact h, specifically on the token getLast_cons, saying application type mismatch getLast_cons hd argument hd has type α : Type u but is expected to have type ?m.29235 ≠ [] : Prop. Sorry, my mistake. Here is the corrected code: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons]; exact h  This should work. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons]; exact h  I get an error on the line  | cons hd tl => rw [getLast_cons]; exact h, saying unknown identifier 'h'. I apologize for the mistake. Here is the corrected code: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons]; exact (cast (congr_arg _ (List.cons_ne_nil_iff_nonempty _ _).mpr h.symm) rfl)  The error occurred because h was no longer in scope after the rw tactic was applied. To fix this, we can use cast to convert the proof of l ≠ nil from l = nil → false to false → hd :: tl ≠ nil, so that we can apply it as an argument to getLast. When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons]; exact (cast (congr_arg _ (List.cons_ne_nil_iff_nonempty _ _).mpr h.symm) rfl)  Lean 4 gives an error on the line  | cons hd tl => rw [getLast_cons]; exact (cast (congr_arg _ (List.cons_ne_nil_iff_nonempty _ _).mpr h.symm) rfl) saying unknown constant 'List.cons_ne_nil_iff_nonempty'. I apologize for the confusion. The List.cons_ne_nil_iff_nonempty lemma is actually called cons_ne_nil_iff_nonempty in Lean. Please try the following code instead: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons]; exact (cast (congr_arg _ (cons_ne_nil_iff_nonempty _ _).mpr h.symm) rfl)  When I try your code @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons]; exact (cast (congr_arg _ (cons_ne_nil_iff_nonempty _ _).mpr h.symm) rfl)  Lean 4 gives an error on the line  | cons hd tl => rw [getLast_cons]; exact (cast (congr_arg _ (cons_ne_nil_iff_nonempty _ _).mpr h.symm) rfl) saying unknown identifier 'cons_ne_nil_iff_nonempty'. I apologize for the mistake in my previous message. Here is the corrected version of the code: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ nil, getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => rw [getLast_cons]; exact (cast (congr_arg _ (cons_ne_nil_iff _).mpr h.symm) rfl)  The correction is that cons_ne_nil_iff_nonempty should be replaced with cons_ne_nil_iff, since cons_ne_nil_iff is the correct name for the lemma. Okay, I think we're getting stuck. Could you keep most of that proof as is, but try a completely new approach for the last line? Sure! Here's a possible approach that avoids using cast. Instead, we can prove that getLast (a :: hd :: tl) _ and getLast (hd :: tl) h are definitionally equal by using the fact that getLast only depends on the last element of a list, and then using the fact that hd :: tl and a :: hd :: tl have the same last element. @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ [], getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | hd :: tl => have h' : tl ≠ [], from by intro H; apply h; rw [H, List.cons_append]; simp, have h'' : hd :: tl ≠ [], from by rw List.cons_ne_nil_iff_nonempty; exact h', rw [getLast_cons], rw [getLast_cons tl h'], rw [←getLast_cons (a :: hd :: tl) h''], rw [←getLast_cons (hd :: tl) h], congr' 1, rw [List.cons_append], simp  Here, we first prove h' : tl ≠ [] using the same argument as before. We then prove h'' : hd :: tl ≠ [] by using the fact that hd :: tl is nonempty if and only if tl is nonempty, which follows from the fact that the :: constructor is injective. Next, we use the rw tactic to apply getLast_cons twice, once to tl and once to a :: hd :: tl. We then use congr' 1 to simplify the goal to showing that the last element of a :: hd :: tl is the same as the last element of hd :: tl after the :: has been applied. Finally, we use rw and simp to show that these two lists have the same last element. Okay, that was all a bit too creative. Let's try again. Can you fill in at most one more step, at the sorry: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ [], getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => intros; sorry  The goal at the sorry is ⊢ getLast (a :: hd :: tl) (_ : a :: hd :: tl ≠ []) = getLast (hd :: tl) h  Certainly! Here's one possible step: @[simp] theorem getLast_cons {a : α} {l : List α} : ∀ h : l ≠ [], getLast (a :: l) (cons_ne_nil a l) = getLast l h := by cases l with | nil => intro h; contradiction | cons hd tl => intros; rw [←getLast_cons tl hd]; refl  And at this point I gave up, not seeing a reasonable basis for an automatic feedback loop to disabuse ChatGPT of its insistence on using the lemma we're trying to prove. :-) Michael R Douglas (Mar 21 2023 at 14:13): It's interesting. Have you tried it with more prompt engineering. For example start by explaining that circular reasoning is not allowed, and perhaps telling it other general principles of mathematical proof. Also GPT-4 is claimed to be better at reasoning. Jake (Mar 21 2023 at 14:34): Was just reading through this chat, what does CoT stand for? David Renshaw (Mar 21 2023 at 14:45): "chain of thought" Daniel Selsam (Mar 21 2023 at 14:50): Scott Morrison said: • This is only ChatGPT-3.5, not ChatGPT-4. Didn't read the transcript, but I think playing with 3.5 is a waste of time. 4! Scott Morrison (Mar 21 2023 at 21:29): Any API keys available? :-) Eric Taucher (Mar 22 2023 at 10:42): Jake said: What does CoT stand for? I know David gave you the answer. :smile: You might also like https://learnprompting.org/docs/intro and specifically for COT https://learnprompting.org/docs/intermediate/chain_of_thought Eric Taucher (Mar 22 2023 at 10:46): Scott Morrison said: Any API keys available? :-) You might be able to get a GPT-4 API key faster by contributing some evals. For a limited time, we will be granting GPT-4 access to those who contribute high quality evals. Please follow the instructions mentioned above and note that spam or low quality submissions will be ignored:exclamation: Scott Morrison (Mar 22 2023 at 13:24): I have an implementation of the "ask GPT to work on a sorry" ---> "compile the result in Lean" ---> "go back to GPT and explain what error messages / new goals appear" ---> "compile the result in Lean" ... loop, hooked up as a tactic. It looks quite promising already, and I'll write up some samples soon. Hopefully we'll be able to run it against GPT4 tomorrow, and so far I've only done the most basic prompt engineering. David Renshaw (Mar 22 2023 at 13:27): Is your code public? I want to play with it! Gerald Höhn (Mar 22 2023 at 14:50): I got the API keys last week when pretending to do some research with large language models to see if they could be useful to formalize mathematics from plain english. No idea if the proof of the irrationality of square root of 2 in lean which GPT-4 gave to me is correct. At least when I translated the code back into plain english it was right. Although GPT-4 is now a better mathematician than my undergraduate students it still makes stupid mistakes. Thus a feedback loop for GPT-4 would be quite useful. I hope the image input is soon available. No more hand grading of mathematical undergraduate papers... Johan Commelin (Mar 22 2023 at 15:05): Are you saying that you want hand over the grading of undergrad papers to a system that still makes stupid mistakes? Adam Topaz (Mar 22 2023 at 15:12): Johan Commelin said: Are you saying that you want hand over the grading of undergrad papers to a system that still makes stupid mistakes? My university is already doing this :man_facepalming: Gerald Höhn (Mar 22 2023 at 15:14): Are you saying that you want hand over the grading of undergrad papers to a system that still makes stupid mistakes? I can understand your doubt. But one has to compare it with the quality of graders paid minimum wage. But as said, I am interested in a feedback loop so that GPT-4 can be sure its reasoning is correct. Adam Topaz (Mar 22 2023 at 16:33): tangentially related (assuming folks heard about the fact that the old codex was being shut down): https://github.blog/2023-03-22-github-copilot-x-the-ai-powered-developer-experience/ Patrick Massot (Mar 22 2023 at 17:03): In France we solved that graders problem a long time ago. There isn't any money to pay people to grade. Professors do it for free. Matthew Ballard (Mar 22 2023 at 17:04): Better than GPT-4? Adam Topaz (Mar 22 2023 at 21:14): @Scott Morrison Just curious-- how are you accessing the OpenAI api with lean4? Scott Morrison (Mar 22 2023 at 21:14): curl Scott Morrison (Mar 22 2023 at 21:14): and some structures deriving ToJson, FromJson. Scott Morrison (Mar 22 2023 at 21:15): Talking to it is pretty trivial. Adam Topaz (Mar 22 2023 at 21:15): Ah ok Scott Morrison (Mar 22 2023 at 21:16): The fun stuff is compiling what it returns, so we can sensibly ask follow up questions. But Lean is awesome? I hadn't tried "running Lean from Lean" until yesterday, but it works beautifully, and you get as much information back as you like. Scott Morrison (Mar 22 2023 at 21:18): The fact that you can write a tactic that wakes up in a file, reads the source code, makes some modifications to it, runs it in a virtual Lean session, and can retrieve from it the MVarIds from sorries that file, check their types, etc, is pretty wild. Adam Topaz (Mar 22 2023 at 21:31): Now I'm really curious to see this code, not just because of the AI stuff, but just for the metaprogramming! Scott Morrison (Mar 22 2023 at 21:36): First refactor almost there :-) Jason Rute (Mar 22 2023 at 22:39): Scott Morrison said: The fact that you can write a tactic that wakes up in a file, reads the source code, makes some modifications to it, runs it in a virtual Lean session, and can retrieve from it the MVarIds from sorries that file, check their types, etc, is pretty wild. That sounds pretty awesome. I’d love to see the code. When you are ready to share. Daniel Selsam (Mar 23 2023 at 01:53): FYI I just saw that Microsoft wrote a paper assessing GPT-4, with an entire section on its mathematical abilities https://arxiv.org/abs/2303.12712 Scott Morrison (Mar 23 2023 at 02:18): The maths section is full of interesting stuff, but look at the "tool use" section! Scott Morrison (Mar 23 2023 at 02:22): :mind_blown:, GPT-4 drawing a map, after exploring a world in a text adventure game. Fig 5.8 Scott Morrison (Mar 23 2023 at 02:25): In another text adventure, after GPT-4 fails to cook a meal (attempting to look for a frying pan in the bedroom), the authors admit they couldn't solve it either without reading the source code. They fail to mention if they tried showing GPT-4 the source code. :-) Adam Topaz (Mar 23 2023 at 02:41): I think one thing that the discussions above show and the section on mathematics in this paper show is that we need better errors in lean/mathlib. Adam Topaz (Mar 23 2023 at 03:37): Daniel Selsam said: FYI I just saw that Microsoft wrote a paper assessing GPT-4, with an entire section on its mathematical abilities https://arxiv.org/abs/2303.12712 The inevitable HN discussion: https://news.ycombinator.com/item?id=35269590 Johan Commelin (Mar 23 2023 at 07:02): [...] First, the training data of the model mainly consists of questions and their solutions, but it does not capture the wording that expresses the thinking process which leads to the solution of a math problem, in which one makes guesses, encounters errors, verifies and examines which parts of the solution are correct, backtracks, etc. In other words, since the training data is essentially a linear exposition of the solution, a model trained on this data has no incentive to engage in an “inner dialogue” where it revisits and critically evaluates its own suggestions and calculations. This is of course not a new realization, but it does underscore that the kind of feedback loop that Scott is now building could maybe potentially (hopefully, etc...) massively improve the model. Kevin Buzzard (Mar 23 2023 at 09:32): When I was a kid in the early 80s, chess computers had just become viable. 20 years previously computers were the size of large rooms and made of valves, but because of the silicon revolution they could now be made to fit snugly under a chessboard and, with the addition of some magnets and a screen (screen technology was already good enough to make small visible LCD screens) you could have a machine that you could play chess against in your own home. One could argue that this was a result of decades of spectacular progress in engineering. The only problem was that these chess computers were absolutely lousy at chess. They would never throw away pieces or accidentally let themselves get bank rank checkmated in 1, and they could beat rabbits by simply taking all their pieces and then overwhelming them, but on a reasonably hard setting they were still no match for an average club player, and if you put them on the hardest setting then they typically took hours to make a move. Down my chess club the kids loved them, and the experts showed no interest in them because they had nothing to teach the experts. 15 years later, thanks to IBM (who were at that time a powerhouse in the computing world) who decided to throw a ton of money at the problem, they were beating grandmasters, and since then we've never looked back. I feel like we're in the "early 80s" phase right now with these large language models, and that is why I am super-uninterested in them right now. I certainly go around giving talks where I extol their virtues, and show a slide where they're solving 50% of the easiest IMO problem ever in Lean and claim that this is a breakthrough (which it is!). But at the end of the day GPT4 getting a half-decent mark in AMC 12, a multiple choice exam meant for schoolkids, is something which can simultaneously be regarded as an amazing achievement and something which indicates that there is still an essentially infinite amount of work still to be done before they are going to be doing research maths, which is why I am still bewildered by the claims of Szegedy et al that this is going to happen by 2029. Right now they can beat rabbits at maths and that's it. Beating rabbits is a major achievement but undergraduates can easily do this too. I've just looked at the MS paper and it's still the case that the mathematics is very ropey. The system makes errors, and in most of life the occasional error is fine but in mathematics one single error is unacceptable. This is one reason why I am concerned about the idea of pushing natural language proofs as opposed to generating Lean proofs, where there is no debate about whether an argument is correct. As is so often the case I feel like the researchers over-hype what they're seeing -- this is a continual problem I have with this community. For example in arguably the hardest math problem they give the system (p40, a modified IMO problem) I have several issues with their presentation. Firstly we have this perennial problem that they have to tell the system the answer -- I cannot complain too much about that because just the same would be true in a formal approach. Secondly they don't state the actual problem (which is this -- they have made substantial changes to it. Firstly they have cleared denominators, and secondly they've added the smoothness hypothesis which makes the question a lot easier (and is crucially used in the proof) -- major changes have been made (as well as telling the machine the solution). So this is very much a "watered down IMO problem" -- one reason the original question is hard is precisely because you cannot use the "wiggle" argument which smoothness gives you. Also, the researchers write that the GPT4 produces a "correct proof". This is not really the case -- this is not a 7/7 solution. For example the line "let $y^*$ be the unique positive solution of $g(x)+g(y)=2xy$" -- there is no justification in the given proof that such a $y^*$ exists and yet the researchers don't pick up on this, even praising the machine's step here. I'm taking at least one mark off here. The next step is also not at all clearly written -- it took me a minute to figure out what was being said. Of course, like so much human-produced mathematics, things can be fixed up. With a Lean proof there would be no such ambiguities. For me the ambiguities are very worrying because this is exactly what doesn't scale when it comes to mathematics (indeed, this is what drew me into ITPs in the first place). If we have computers producing natural language which purports to be a proof of the Riemann Hypothesis but looks like Mochizuki's proof of the ABC conjecture -- with an essential ambiguity in -- what then? Now will it take 15 years, like the chess machines, to solve these problems? Nobody knows. Maybe it will be 5 -- maybe it will be 50. But even though these things have come a very long way these examples indicate that there is still a hugely long way to go when it comes to mathematics. Maybe the next step is getting them to pass an undergraduate degree (which again is not going to impress a lot of the experts because undergraduates can do that -- but it's perhaps a step in the right direction?). I am reminded of the theorem prover story : for decades ITPs have been able to solve undergraduate mathematics problems with human assistance, and for decades most experts were not remotely impressed by this. Clearly there have been breakthroughs -- but we are stil several breakthroughs away from my understanding of Szegedy's dream. Siddhartha Gadgil (Mar 23 2023 at 10:39): Many people (including myself) feel that the best hope is a combination of LLMs for intuitive stuff: making auxiliary constructions (many useless, some useful), suggesting intermediate lemmas (again many useless/false), suggesting what results to use etc. Actual proving can be done by rule-based automation, experimentation to avoid wasting time on false "lemmas" etc. Siddhartha Gadgil (Mar 23 2023 at 10:39): Of course the proof of the pudding lies in the eating, and so far there isn't even a pudding. Siddhartha Gadgil (Mar 23 2023 at 10:49): Incidentally, Szegedi, Josef Urban and others have used deep learning for premise selection together with powerful First-order logic solvers to make the best systems for "TPTP" kind of problems (simpler ones). So likely enough the 10 year bet is based on thinking in terms of hybrid systems. Daniel Selsam (Mar 23 2023 at 17:02): One metric I like to track is the number of words it takes domain experts to explain how dumb each new GPT is. It seems to grow exponentially. They are almost always correct about the particular version they are critiquing, but I think sometimes they do miss the forest for the trees. The salient context is that there is a no-breakthrough-required knob to keep turning, and turning, and turning. Kevin Buzzard (Mar 23 2023 at 17:06): That's a great observation :-) Eric Taucher (Mar 23 2023 at 17:51): FYI ChatGPT plugins are here! - Currently requires signing up for a wait list. EDIT ChatGPT Gets Its “Wolfram Superpowers”! Daniel Selsam (Mar 23 2023 at 17:53): love the wolfram integration! Gerald Höhn (Mar 23 2023 at 18:26): Michael Stoll said: It would be interesting to see how well the automated version of this loop (with some reasonable limit on resources) does on a range of tasks like mathlib lemmas (or IMO problems...). Regarding IMO problems, we are resticted to 2022 since it is not in its training data. It could do (without any loop) a simplified version of one problem (#2). See page 40 of https://arxiv.org/pdf/2303.12712.pdf P.S.: I overlooked that someone already posted the link to the paper somewhat above. Fabian Glöckle (Mar 23 2023 at 20:45): wasn't the IMO in July and the model from December? Kevin Buzzard (Mar 23 2023 at 21:38): The modified question only bears a passing resemblance to the original question anyway. Continuity and smoothness are not in the original question (indeed they are not on the IMO syllabus), so the near-solution generated to the watered-down problem will probably not have been inspired by any web solution Fabian Glöckle (Mar 23 2023 at 21:43): Yes, thanks, I saw your message this morning. Just mentioned because it means that you cannot evaluate GPT4 on IMO2022. Scott Morrison (Mar 23 2023 at 21:46): It depends what you mean by "evaluate", but for the most extreme sense, you can only evaluate an LLM on an IMO on the day of that IMO! Kevin Buzzard (Mar 23 2023 at 21:51): It would be interesting to see how GPT-4 does on IMO2023 Daniel Selsam (Mar 23 2023 at 22:22): Kevin Buzzard said: It would be interesting to see how GPT-4 does on IMO2023 I view GPT as a meta-system -- almost more like a programming language than a program -- and might rephrase as: it would be interesting to see how various GPT-powered programs do on IMO 2023. Gerald Höhn (Mar 24 2023 at 02:06): Kevin Buzzard said: The modified question only bears a passing resemblance to the original question anyway. Continuity and smoothness are not in the original question (indeed they are not on the IMO syllabus), so the near-solution generated to the watered-down problem will probably not have been inspired by any web solution I played a little around with the temperature and asked GPT-4 for an educated guess for the function g(x) in the modified problem (without mentioning the answer g(x)=x^2). In one of five tries it guessed the correct anwer. Remember one has 90 minutes time for an IMO problem and one prompt takes about a minute (and$0.10).

One more paper today: https://arxiv.org/abs/2303.10130
On page 15 the models predict that Mathematicians are one of the most exposed profession on the Labor Market .
Undergraduate programs in Mathematics will soon be shut down in most places. Maybe or maybe not.

Kevin Buzzard (Mar 24 2023 at 07:37):

Given that these things are currently as good as bad undergraduates, who already exist in great numbers, I think we'd have to see some remarkable improvements. OTOH I got into theorem proving precisely because I thought it would be nice to have an early retirement

Adam Topaz (Mar 24 2023 at 12:53):

Kevin, we still have some time before retirement:
Screenshot-2023-03-24-at-06-52-34-ChatGPT.png

Adam Topaz (Mar 24 2023 at 12:54):

I actually did a similar thing on the last homework for my NT class, asking them to identify all the errors made by GPT in a similar proof.

Matthew Ballard (Mar 24 2023 at 12:58):

I actually did a similar thing on the last homework for my NT class, asking them to identify all the errors made by GPT in a similar proof.

This is a great exercise!

Adam Topaz (Mar 24 2023 at 20:41):

OTOH, this interaction has me genuinely impressed:

Me: I will give you some mathematical constructions, written using LaTeX code. Please translate them into Lean4 code.

Let $C$ be the collection of tuples $(A,B,e)$ where $A$ and $B$ are commutative rings and $e : A^\times \to B^\times$ is a morphism of abelian groups. Here, for a commutative ring $A$, $A^\times$ denotes the unit group of $A$. Consider $C$ as a category where morphisms $(A,B,e) \to (A',B',e')$ consist of pairs $(f,g)$ where $f : A \to A'$ and $g : B \to B'$ are morphisms of rings, such that for all $a \in A^\times$, one has $e'(f(a)) = g(e(a))$.

Write $CRing$ for the category of commutative rings and $F : C \to CRing$ the functor sending $(A,B,e)$ to $A$ on objects and $(f,g)$ to $f$ on morphisms.

ChatGPT(w/ GPT4):
To translate the given mathematical constructions into Lean4 code, we'll first need to define the necessary structures and classes, such as CommRing, UnitGroup, and Morphism. Then, we'll define the category C and the functor F. Here's the code:

import algebra.category.CommRing
import algebra.group.units
import category_theory.functor

open_locale classical

universes u v

variables (A : Type u) (B : Type v) [comm_ring A] [comm_ring B]

def UnitGroup (R : Type*) [comm_ring R] := units R

structure Morphism :=
(map : UnitGroup A → UnitGroup B)
(is_group_hom : is_group_hom map)

structure CRingTriple :=
(A : CommRing.{u})
(B : CommRing.{v})
(e : Morphism A B)

variable {A}

structure CRingTripleMorphism (C₁ C₂ : CRingTriple) :=
(f : C₁.A ⟶ C₂.A)
(g : C₁.B ⟶ C₂.B)
(commutes : ∀ a : UnitGroup C₁.A, C₂.e.map (C₁.e.map a) = g (C₁.e.map a))

instance CRingTriple_category : category_theory.category CRingTriple :=
{ hom := CRingTripleMorphism,
id := λ X, ⟨𝟙 _, 𝟙 _, λ a, rfl⟩,
comp := λ X Y Z f g, ⟨f.f ≫ g.f, f.g ≫ g.g, λ a, by rw [←g.commutes, ←f.commutes, category_theory.types_comp]⟩ }

def F : CRingTriple ⥤ CommRing :=
{ obj := λ X, X.A,
map := λ X Y f, f.f,
map_id' := λ X, rfl,
map_comp' := λ X Y Z f g, rfl }


This Lean4 code defines the category C as CRingTriple_category and the functor F as described in the problem statement.

Adam Topaz (Mar 24 2023 at 20:41):

It's not correct, but it's a close approximation

Adam Topaz (Mar 24 2023 at 20:43):

The funny thing is that it seems to have learned that the autoparam'ed fields comp_id etc. in the category instance can usually be dropped, and it has omitted them in CRingTriple_category.

Adam Topaz (Mar 24 2023 at 20:43):

I don't know if this is an argument for or against dropping these fields from mathlib.

Jake (Mar 25 2023 at 14:32):

Scott Morrison said:

The fun stuff is compiling what it returns, so we can sensibly ask follow up questions. But Lean is awesome? I hadn't tried "running Lean from Lean" until yesterday, but it works beautifully, and you get as much information back as you like.

Do you have any suggestions on resources for learning "running Lean from Lean"?

Harald Helfgott (Mar 25 2023 at 22:16):

Perhaps I am a harsh grader, but I would give GPT-4 much less than 7/7 on that vastly simplified IMO problem.

Quite aside from the fact that this "simplified version" is quite easy (you don't even need the continuity assumption; you just need x²+y^2>=2xy, nothing else), the fact is that the "proof" that GPT-4 provides is incomplete *unless you provide the missing step yourself, which is... x^2+y^2>=2x y*. Only if you read that into the lines (in the antepenultimate step) does this become a (verbose) proof.

This is impressive nevertheless, but more in the line of "an AI tricked someone into believing it was a beautiful woman with a crush on him and imperfect English".

(I do admit I am impressed by the fact that GPT-4 manage to apply and explain continuity correctly (if pointlessly).)

Joseph Myers (Mar 25 2023 at 22:49):

I suggest that if people wish to evaluate GPT-based systems (or any other AI producing informal answers from informal problem statements) on IMO problems, then as well as the evaluations being more credible if they're based on the actual problem statements rather than simplified versions, and more credible if a commitment to do such an evaluation (and publish the results) with details of the evaluation protocol is declared in public before the IMO in question to avoid publication bias, they would also be more credible if the declared protocol involves a coordinator on the problem in question at the IMO in question doing the evaluation using the mark scheme actually used at the IMO, rather than someone unfamiliar with how the problem was marked doing their own evaluation.

Although such an evaluation protocol has the disadvantage (compared to evaluation of formal proofs, possibly produced by a system that first produces an informal proof then autoformalizes it) that the mark schemes aren't public (I tend to think they should be, though I guess a majority of leaders at the IMO would disagree), given the various criticisms I've seen of how limited the published information is regarding the design and training of GPT-4 I suspect that inability to reproduce things entirely from public information might not be a major concern to some people producing AIs.

Jay T (Mar 27 2023 at 04:53):

I came here to post the same thing. Specifically on page 26 it shows how GPT4 can be used to help you type up Latex much faster, and the maths section is page 30-41 , and appendix on page 122-140

IMO The most exciting offering GPT4 can bring to Lean is: Self Healing code - setting up a loop where an LLM proposes something, gets an error, evaluates the error message, then suggest refinements, repeatedly, until no more errors. At minimum it will make error-free code. Who knows if it'll be useful. But if it's good enough to get 90% of the way there, set up a script to run 'n' number of attempts ('n' limited by how much money you're willing to spend on API calls) and you'll likely get some neat results. Basically automating the Computational Approach to maths that Stephen Wolfram describes in the "All Possible Theorems" section of this article.

A working example of Self Healing code already working in Python can be found here, and a high-level discussion of the general design properties such a system should have is summarised here (from 2016!).

I might as well dump all my other GPT related top tips while I'm at it:
This ChatGPT Chrome extension gives useful features, and an extremely high value newsletter:

Former Harvard CS prof predicts that will be how AI changes writing code. Humans will specify what the code should do, the AI will write the code, humans will test and check it. Very good talk, highly recommend.

Big :raised_hands: I've been watching this space since IIRC 2019, and I hope what I've shared can bring some useful ideas.

Mario Carneiro (Mar 27 2023 at 04:59):

At minimum it will make error-free code

At minimum it will go into an infinite loop without finding any error free code. Based on the above discussion this seems to be the case often with ChatGPT

Kevin Buzzard (Mar 27 2023 at 07:58):

at minimum it will make error free code

At minimum it will get stuck trying to make error free code -- oh sorry Mario got there first.

Kevin Buzzard (Mar 27 2023 at 08:01):

If you ask a current system a hard question (eg "prove a tricky theorem that is not in mathlib because we didn't get around to it yet") then the chances of it making error free code are going to be vanishingly small, otherwise this community would have heard about it by now.

Albert Jiang (Mar 27 2023 at 09:00):

It’s not here yet. But if you want to learn about recent work in this direction, see: https://arxiv.org/abs/2303.04910v2

Albert Jiang (Mar 27 2023 at 09:02):

Proof repair is very useful in the program verification domain, and IMHO less useful in research level mathematics.

Albert Jiang (Mar 27 2023 at 09:03):

It’s a very new method. I’m very excited to see what it will evolve into 5 years from now.

Shreyas Srinivas (Mar 27 2023 at 16:20):

Albert Jiang said:

It’s a very new method. I’m very excited to see what it will evolve into 5 years from now.

That's my impression about proof repair as well. It is just beginning to pick up steam according to people I know in this field.

Andrés Goens (Mar 29 2023 at 06:48):

Albert Jiang said:

Proof repair is very useful in the program verification domain, and IMHO less useful in research level mathematics.

I wonder if that's just because people are mostly mechanizing existing mathematics, i.e. not primarily making new things. I'd imagine proof repair is important in a context where the definitions and theorems are changing over time, albeit mildly. This currently happens more often in program verification, but I could imagine in a couple of years when the tools and the corpus of formalized mathematics allow it, if research mathematics starts happening in theorem provers, then proof repair would probably also be important there, don't you think?

Mario Carneiro (Mar 29 2023 at 07:02):

proof repair is a very common thing in mathlib maintenance

Shreyas Srinivas (Mar 29 2023 at 09:08):

Andrés Goens said:

Albert Jiang said:

Proof repair is very useful in the program verification domain, and IMHO less useful in research level mathematics.

I wonder if that's just because people are mostly mechanizing existing mathematics, i.e. not primarily making new things. I'd imagine proof repair is important in a context where the definitions and theorems are changing over time, albeit mildly. This currently happens more often in program verification, but I could imagine in a couple of years when the tools and the corpus of formalized mathematics allow it, if research mathematics starts happening in theorem provers, then proof repair would probably also be important there, don't you think?

Agreed. In addition, in a nice world with well developed proof repair, porting mathlib theorems would be much more automated.

Albert Jiang (Mar 29 2023 at 11:15):

Mario Carneiro said:

proof repair is a very common thing in mathlib maintenance

Ah i didn’t know this. That’s great! Any chance you can point me to where this is done?

Mario Carneiro (Mar 29 2023 at 11:31):

I mean just look at any recent mathlib refactor (well, I suppose those are less common of late...). Anything that touches 30+ files will almost certainly have been created by changing some basic files and then going into a dialogue with CI to hunt down all the proofs that broke and make appropriate changes to fix them

Eric Wieser (Mar 29 2023 at 11:37):

My experience with proof repair is that usually the pain is having to wait for CI to make it through all of mathlib, rather than actually having difficulty fixing the proofs themselves. In many case, having a deterministic tool (eg to perform renames or change argument explicitness) would be more helpful than an AI assistant.

Fabian Glöckle (Mar 29 2023 at 11:40):

It would be very nice if there was data on this incremental process. Baldur's proof repair is trained with (theorem_statement, candidate_proof, error_message) -> ground_truth_proof because the incremental procedure of
(theorem_statement, candidate_proof, error_message) -> candidate_proof_2
(theorem_statement, candidate_proof_2, error_message_2) -> candidate_proof_3
...
(theorem_statement, candidate_proof_n, error_message_n) -> ground_truth_proof
is not recorded anywhere, so we would have to send the language model into an MCTS-like search loop and retrain it on the sucessful sequences

Scott Morrison (Mar 29 2023 at 21:32):

A teaser:

/-- The length of the concatenation of two lists is the sum of the lengths of the lists. -/
theorem length_append : sorry := by sagredo


produces

/-- The length of the concatenation of two lists is the sum of the lengths of the lists. -/
theorem length_append (xs ys : List α) : List.length (xs ++ ys) = List.length xs + List.length ys := by
induction xs with
| nil => rw [List.nil_append, List.length_nil, Nat.zero_add]
| cons x xs' ih => rw [List.cons_append, List.length_cons, List.length_cons, ih]; rw [Nat.succ_add]


Scott Morrison (Mar 29 2023 at 21:35):

With current prompting, sagredo takes 5 exchanges (completely automatic) between GPT and Lean before producing this result. Hence it is not fast!

Scott Morrison (Mar 29 2023 at 21:38):

@Zhangir Azerbayev has been helping with the prompting design, and building good test cases. I don't think it's at all ready for prime time, but I'll try to post some of the actual dialogues behind these results soon. It needs a GPT4 API key to run, so don't expect wide deployment right away. :-)

Kevin Buzzard (Mar 29 2023 at 21:40):

Doesn't Isabelle/HOL's sledgehammer also take an age?

When my PhD thesis hit 50 pages, I used to start TeX running on it and then go and make a cup of tea.

Adam Topaz (Mar 29 2023 at 21:44):

So it writes the type of the function as well?!?

Scott Morrison (Mar 29 2023 at 21:45):

Yes. If you write a natural language outline of the proof in a comment before calling sagredo it also tries to follow your script. (At least, sometimes. Sometimes it does really dumb stuff, too. :-)

Scott Morrison (Mar 29 2023 at 21:46):

It still really struggles with Lean 4 syntax, and a lot of our prompting is helping it with that. Hopefully this will get much better once there is a training run that includes mathlib4.

Adam Topaz (Mar 29 2023 at 21:46):

right, but that's expected, I guess, since gpt4's data was limited to some time in 2021 IIRC

Adam Topaz (Mar 29 2023 at 21:47):

I've heard rumors that they're already training gpt5, but I don't know whether to trust them...

Scott Morrison (Mar 29 2023 at 21:47):

It also gets frustrated when it can't find the right lemma name. I just had an amusing one where it couldn't find the lemma for List.range 0 = []. Maybe we should teach it how to call library_search!

Scott Morrison (Mar 29 2023 at 21:48):

I mean, how could they not be training GPT5? As various people have said, there is a just a knob to turn to keep making it better at this point.

I guess so.

Adam Topaz (Mar 29 2023 at 21:48):

At some point they will probably also allow us to fine-tune gpt4, which would certainly help.

Adam Topaz (Mar 29 2023 at 21:49):

Anyway, this looks really really cool! I look forward to playing with it :)

Adam Topaz (Mar 29 2023 at 21:53):

Scott Morrison said:

I mean, how could they not be training GPT5? As various people have said, there is a just a knob to turn to keep making it better at this point.

Alex J. Best (Mar 29 2023 at 21:56):

Have you thought about just asking it for lean 3 code and running mathport (or a bespoke mini 3->4 translator) on the output? If simple syntax changes is whats throwing it off so much it might not be as ridiculous as it sounds.

Zhangir Azerbayev (Mar 29 2023 at 21:57):

Scott Morrison said:

It also gets frustrated when it can't find the right lemma name. I just had an amusing one where it couldn't find the lemma for List.range 0 = []. Maybe we should teach it how to call library_search`!

The best way to deal with this problem is likely to use openai's embeddings (not sure if they'll be good enough) or to train a lightweight premise selection model, in the spirit of magnushammer. Magnushammer works quite well for Isabelle and is only 86M parameters, so you can train it with academic resources and run it on cpu.

Adam Topaz (Mar 29 2023 at 21:58):

Alex J. Best said:

Have you thought about just asking it for lean 3 code and running mathport (or a bespoke mini 3->4 translator) on the output? If simple syntax changes is whats throwing it off so much it might not be as ridiculous as it sounds.

Wouldn't we also need a lean4 -> lean3 converter for the prompt itself?

Alex J. Best (Mar 29 2023 at 21:59):

The prompt in Scotts example is the same in lean 3 and lean 4 at least (just a docstring and function name) which is probably part of the issue

Adam Topaz (Mar 29 2023 at 21:59):

sure, but that's not always going to be the case.

Scott Morrison (Mar 29 2023 at 22:14):

I think the syntax problems for Lean4 are pretty solvable. Zhangir's prompting is already a massive improvement over my initial attempt, and we can certainly do more. And this problem will just go away over time. Our whole code is pretty simple and pure Lean 4, and at this point I would hate to have the hacky overhead of interacting with Lean 3 and/or mathport.

Tyler Josephson ⚛️ (Mar 30 2023 at 00:20):

Nice work! We have a GPT4 API key, would love to give it a try.

Junyan Xu (Apr 01 2023 at 03:51):

William Stein talks about deployment of ChatGPT in the sage-support mailing list:

I added this "Help me fix this..." feature to cocalc jupyter notebooks a week ago, and it's been extremely popular with 1000+ uses a day. It's exactly the sort of thing that ChatGPT can be very helpful with, given that Sage typical outputs a massive scary stacktrace when anything goes wrong, and it's just a lot of mental effort to untangle it, especially given the preparser. ChatGPT is far from perfect, but it doesn't mind expending a lot of effort. People also often get stuck and very frustrated with Sage due to silly little things, e.g., accidentally capitalizing a function name, and chatgpt instantly points out such things.

RexWang (Apr 05 2023 at 13:20):

Just want to say. ChatGPT is a great tool for those familiar with LEAN(or any other language), but it can be potentially harmful for newcomers at times. In this sense, Zulip is much better than ChatGPT.
chat-shot.png

Patrick Massot (Apr 05 2023 at 13:51):

The mix of correct and completely false information is really striking. Trusting this thing would indeed be very harmful.

Daniel Selsam (Apr 05 2023 at 17:45):

obviously, there is a much lower false-information rate when GPT is augmented with web search Screenshot-2023-04-05-at-10.43.50-AM.png

Tyler Josephson ⚛️ (Apr 06 2023 at 14:15):

Patrick Massot said:

The mix of correct and completely false information is really striking. Trusting this thing would indeed be very harmful.

StackOverFlow banned answers from ChatGPT precisely because its answers are plausible, nearly-correct but often wrong, and hard to get corrected by experts at scale. https://www.theverge.com/2022/12/5/23493932/chatgpt-ai-generated-answers-temporarily-banned-stack-overflow-llms-dangers

Reid Barton (Apr 06 2023 at 14:40):

I would like to know more about the information available at irs.gov

Reid Barton (Apr 06 2023 at 14:40):

MathOverflow also has such a ban I believe, but it's more because ChatGPT still generates obvious nonsense in that context.

Junyan Xu (Apr 09 2023 at 20:45):

I can reproduce the conversation and the irs.gov link is about the difference between Levy and Lien, so it looks like some kind of fuzzy match (the word "Lean" doesn't appear). irs.gov doesn't actually appear in the first several pages of results when I search "difference between Lean3 and Lean4" on Bing, but the first result from "difference between Lean3 and Lean4 irs.gov" is indeed about Lien.

Daniel Selsam (Apr 10 2023 at 19:24):

There have already been a few ongoing-agent-on-top-of-GPT projects (e.g. https://github.com/Torantulino/Auto-GPT). I think Lean provides an interesting environment for this kind of system since the kernel preempts many common "going off-the-rails" failure modes and might permit systematic progress over long time horizons. It would be cool to see a GPT-agent interacting with Lean server directly (without VSCode) in the context of (say) a Mathlib fork, web/arXiv search, and occasional Q/As on Zulip, etc., perhaps with a very grand goal such as "formalize all existing mathematical knowledge in Lean".

David Renshaw (Apr 10 2023 at 19:39):

/me looks suspiciously at the other users here
how many of you are, already, GPT agents?

Junyan Xu (Apr 11 2023 at 03:35):

Daniel Selsam said:

There have already been a few ongoing-agent-on-top-of-GPT projects (e.g. https://github.com/Torantulino/Auto-GPT). I think Lean provides an interesting environment for this kind of system since the kernel preempts many common "going off-the-rails" failure modes and might permit systematic progress over long time horizons. It would be cool to see a GPT-agent interacting with Lean server directly (without VSCode) in the context of (say) a Mathlib fork, web/arXiv search, and occasional Q/As on Zulip, etc., perhaps with a very grand goal such as "formalize all existing mathematical knowledge in Lean".

Yeah I also just posted similar ideas in the sagredo thread.

By the way, do maintainers think it's a good idea to enroll all leanprover(-community) repos into the Copilot for PRs waitlist? Apparently I myself could enroll them, but if it's generally agree it's a good idea (I can't see any downsides), and if every member enrolls them, then we might be able to get access faster. I'm now in the Copilot for docs waitlist, but I think it's probably not specific to any repo.

Scott Morrison (Apr 11 2023 at 07:20):

I think this would be fine. Presumably at some point copilot would want money.

Junyan Xu (Apr 11 2023 at 07:38):

I just requested installing the copilot4prs GitHub App on all leanprover-community repos (I can't actually do anything with leanprover repos), and it requests the following privileges that probably need a maintainer to approve. Once it's approved I think we will be in the waitlist.
image.png

Siddhartha Gadgil (Apr 11 2023 at 08:23):

(deleted)

Last updated: Dec 20 2023 at 11:08 UTC