Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Simple but hard puzzler for LLMs

Lars Ericson (Dec 13 2024 at 16:43):

I have tried Perplexity.AI, ChatGPT, Claude, Microsoft Copilot and Meta.AI on this simple theorem. They all get it wrong and most of them invent Mathlib theorems as they go along to apply their proofs. I think it has to do with a transition between Boolean and Prop usage for theorems about lists that happened in the past 18 months in Lean 4, differences between IsPrefix and isPrefixOf which become extremely tricky when going back and forth between Bool and Prop notations. I tried LeanGPT which also failed. Are there any openly accessible Lean AI proof assistants that can get this one right?

theorem prefix_to_suffix {a l : List (ℕ × ℕ × ℕ × ℕ)} (h : l <:+ a) : l.reverse <:+ a.reverse := sorry

Alex Meiburg (Dec 13 2024 at 16:50):

Well, I think your theorem statement is wrong. You have IsSuffix, <:+, on both h and the conclusion. One of those should be IsPrefix, <+:, instead. If I fix it, then it's a theorem (in Init, not even in Mathlib!) and so exact? or apply? find it immediately.

theorem prefix_to_suffix {a l : List (ℕ × ℕ × ℕ × ℕ)} (h : l <:+ a) : l.reverse <+: a.reverse := by
  exact List.reverse_prefix.mpr h

is a working proof. Technically,

theorem prefix_to_suffix {a l : List (ℕ × ℕ × ℕ × ℕ)} (h : l <:+ a) : l.reverse <+: a.reverse := by
  exact?

is a complete proof too.

Basically every AI agent can output exact? pretty easily, so they will all solve this with ease.

Joseph Tooby-Smith (Dec 13 2024 at 16:51):

ChatGPT o1 got this right for me (I think)

https://chatgpt.com/share/675c651a-c6c8-8000-bca0-382199bda1b0

theorem prefix_to_suffix {a l : List (ℕ × ℕ × ℕ × ℕ)} (h : l <+: a) : l.reverse <:+ a.reverse := by
  obtain ⟨t, rfl⟩ := h
  rw [List.reverse_append]
  exact ⟨t.reverse, rfl⟩

Alex Meiburg (Dec 13 2024 at 16:52):

ChatGPT actually spat out exact List.IsSuffix.reverse h for me, which works too.

Alex Meiburg (Dec 13 2024 at 16:57):

I would be impressed though if an AI could do it all as an explicit term without referring to the reverse_prefix theorem. :)

theorem prefix_to_suffix {a l : List (ℕ × ℕ × ℕ × ℕ)} (h : l <:+ a) : l.reverse <+: a.reverse :=
  h.rec fun w h ↦ ⟨w.reverse, h.rec (List.reverse_append _ _).symm⟩

Lars Ericson (Dec 13 2024 at 20:26):

@Alex Meiburg you're right I intended to do prefix to suffix so +: to :+.

Lars Ericson (Dec 13 2024 at 20:28):

It seems like the general LLMs have caught up enough with Lean that specialized LLM fine-tunes like LeanGPT aren't as much of a topic, where they were more so maybe last year.

Jason Rute (Dec 14 2024 at 00:43):

Alex Meiburg said:

Basically every AI agent can output exact? pretty easily, so they will all solve this with ease.

I don't know of a single AI agent for Lean, either local neural-symbolic AIs (e.g. Lean Copilot), LLM chatbots (e.g. Claude, ChatGPT, etc), or even symbolic AIs (e.g. aesop) that uses exact? (or it's underlying discrimination tree). Sure, they could. And maybe with the right prompting, they would. And arguably, they should (especially since it is so fast now). But I think in practice, they just don't.

Jason Rute (Dec 14 2024 at 00:46):

I think this discussion is really great. In this case it happened to be that the lemma was false as originally stated, but overall, I'd love to see a lot more in-the-trenches AI for Lean. What works and what doesn't? (For example, I learned in the last week that Sonnet doesn't have near the Lean troubles that OpenAI's models do.) What problems are tricky for AI but shouldn't be? What tips and tricks do those who do Lean day-to-day have?

Lars Ericson (Dec 14 2024 at 01:56):

Here is another example from same context, which is that I am trying to port this github from 18 months ago Lean 4 to today Lean 4, as a learning exercise. In particular, this lemma:

import Mathlib

lemma sum_is_prefix {a b n m : List (ℕ × ℕ × ℕ × ℕ)} (h1 : a ++ b = n ++ m) (h2 : a.length ≤ n.length) : a.isPrefixOf n := by
  induction a generalizing n with
  | nil =>  simp [List.isPrefixOf]
  | cons ha ta iha =>
      cases n with
      | nil =>
          simp at h2
      | cons hn tn =>
          simp [List.isPrefixOf]
          simp at h1
          apply And.intro
          . simp [h1.left]
          . simp at h2
            have h2 := Nat.le_of_succ_le_succ h2
            exact iha h1.right h2

Current Lean 4 has a problem with the 2nd to last line. I asked Perplexity for help. It took 9 passes and one hallucination before Perplexity was able to port the code. The final result was:

import Mathlib

lemma sum_is_prefix {a b n m : List (ℕ × ℕ × ℕ × ℕ)} (h1 : a ++ b = n ++ m) (h2 : a.length ≤ n.length) : a <+: n := by
  induction a generalizing n with
  | nil =>  simp [List.IsPrefix]
  | cons ha ta iha =>
      cases n with
      | nil =>
          simp at h2
      | cons hn tn =>
          simp [List.IsPrefix]
          simp at h1
          apply And.intro
          . simp [h1.left]
          . simp at h2
            have h2' : ta.length.succ ≤ tn.length.succ := Nat.succ_le_succ h2
            have ih := iha h1.right (Nat.le_of_succ_le_succ h2')
            exact ih

This also raises a question of the archival quality of Lean as a "LaTeX of proofs": it seems that the reliability of a proof is married to the version of the proof checker at the moment it checked. Is this a problem that needs to be solved? Does solving this problem inhibit and constrain further development? Or is it a kind of passing phase, and things will settle down?

Separate question that comes to mind: In Lean 4, for example, a Fieldis a CommRing and a DivisionRing, where a CommRing is a Ring and CommMonoid, and a Ring is a Semiring and an AddCommGroup and an AddGroupWithOne, and a Semiring is a NonUnitalSemiring, NonAssocSemiring, and a MonoidWithZero, and a NonUnitalSemiring is aNonUnitalNonAssocSemiring and a SemigroupWithZero, and a NonUnitalNonAssocSemiring is an AddCommMonoid and a Distrib and a MulZeroClass, and an AddCommMonoid is an AddMonoid and a AddCommSemigroup, and an AddMonoid is an AddSemigroup and an AddZeroClass, and an AddSemigroup is an Add, and somewhere along the line one of these is eventually a CommMagma, where a Magma is something French that used to be a Brandt Groupoid until it wasn't. This filleting of Field goes way way beyond MacLane and Birkhoff's Algebra book, which has no Magma in it and in which Field is defined as "a non-trivial commutative ring in which every non-zero element has a multiplicative inverse". I don't think that, Magma aside, if you carefully traversed all of the definitions underlying that sentence in pages 1-133 of the book, that you would get the exact breakdown that exists in Mathlib. So...what if the breakdown in Mathlib itself were to change, over time? It could, couldn't it? How people like to organize facts and definitions can change. What happens to all the proofs that depend on an earlier revision of that hierarchy? The simple answer is that all those proofs might break. Not because the syntax of the language has changed, but because the intellectual organization or semantics of the library has changed or been refactored.

Those are a couple of challenges to long-term use of Lean as a go-to archival language and lingua franca for proofs. These are things which don't matter in the moment (it is very satisfying to see a proof check), but worth thinking about long term. Otherwise we can expect to see a lot of digital rot of checkable proofs in the same way that people who own VHS videotapes and compact discs can see those assets age and become inaccessible as things change.

Alex Meiburg (Dec 14 2024 at 02:26):

Well -- unless something very strange happened -- that GitHub repo from 18 months ago will still compile and check just fine right now. And, in case you're worried about there being some old bug in a previous version of Lean that is fixed today, you can still verify the proof with the checker today (I don't think the binary form has changed incompatibly.) It will still work, because it specifies a particular version of Lean (in lean-toolchain) and of Mathlib (in lake-manifest).

I would argue that, based on this, the "reliability" of the proof isn't compromised: it's easy for me to download the repo and verify it now.

Now, it is true that porting it the newer versions is not very easy, if you want to then use those proofs in some new repo. But that feels, to me, like a very different flavor than VHS which many people now simply cannot view. It's more like some old and somewhat-fragile C code that someone wants to port to a modern C++ codebase (remember that reverse-compatibility with C is one of the top priorities for the C++ consortium, to the grief of many). Will it mostly work? Sure. Will porting it to fully work be easy? No. Can I always just compile the old C code in C, instead of C++, and get a working object file? Yes.

Alex Meiburg (Dec 14 2024 at 02:28):

That being said ... it might be nice if there was some kind of bisection tool for forward porting. You've got a proof from an old version that you want to use now, but it doesn't work and you don't know what changed.

So, imagine a script that run forward through revisions of mathlib and lean until the proof breaks. It shows you the offending commit to mathlib/lean that broke the proof. At that point, it should be much more obvious what changed, and you can probably fix it easily. Then you resume running forward, until you get to the current version.

I know I could have used such a tool a few days ago...

Thomas Zhu (Dec 14 2024 at 02:50):

Jason Rute said:

Alex Meiburg said:

Basically every AI agent can output exact? pretty easily, so they will all solve this with ease.

I don't know of a single AI agent for Lean, either local neural-symbolic AIs (e.g. Lean Copilot), LLM chatbots (e.g. Claude, ChatGPT, etc), or even symbolic AIs (e.g. aesop) that uses exact? (or it's underlying discrimination tree). Sure, they could. And maybe with the right prompting, they would. And arguably, they should (especially since it is so fast now). But I think in practice, they just don't.

Actually, we know that AlphaProof calls exact?. In their solution to IMO2024P6 https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/imo-2024-solutions/P6/index.html on one line they used "by hint", where hint is a tactic that tries a group of tactics including exact?.

Similar to exact?, you wouldn't find hint in any actual human written proof, because people would see the squiggly line and click to replace with the message in Infoview. So the only possibility is that AlphaProof either has human demonstration of specifically using hint, or has a prompt to try it (eg giving it the entire tactic documentation)

Jason Rute (Dec 14 2024 at 02:54):

Thomas Zhu said:

Actually, we know that AlphaProof calls exact?.

Great, I stand corrected! Still glad I made the claim, otherwise I wouldn't have noticed this. I'll now weaken it to saying no publically usable AI for Lean uses exact?, but happy to be corrected again. :stuck_out_tongue:

Thomas Zhu (Dec 14 2024 at 03:03):

I think this speaks to another flaw in many LLM for Lean training procedures: there are many tactics natural to proof search (aesop, simp_all, hint) but bad for proof representation (one would replace with simp only, exact _, etc). LLMs are trained with the latter data because of Mathlib code style, but the former might in fact prove more theorems. I remember the ABEL paper saying they replaced simp only with simp in the Appendix, because of this.

A parallel mismatch is between Mathlib and competition-style problems: for example, I am pretty sure basically 0 theorems on Mathlib can be proved using the tactic omega. But "by omega" solves a nontrivial percentage of miniF2F problems.

Lars Ericson (Dec 14 2024 at 15:49):

Here is one more I tried on Perplexity and just couldn't get anywhere. Perplexity assumes a Mathlib of a while ago and also frequently hallucinates Mathlib theorems that probably never existed. The theorem to prove is

theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) := by sorry

The unsuccessful dialogue is here: https://www.perplexity.ai/search/please-help-me-finish-this-lea-83qN.5UMRi6sgAl_.WabnQ

Lars Ericson (Dec 14 2024 at 16:41):

I tried this with Microsoft Copilot, Google Gemini, Meta AI and Claude and they all flailed in very similar ways. That makes this a nice benchmark problem because it is short and self-contained. I was hoping to try DeepMind AlphaProof but they don't have a public portal for that.

Lars Ericson (Dec 14 2024 at 18:04):

There are two other somewhat public LLMs fine tuned for Lean, LeanGPT and Morph Prover. LeanGPT fails exactly like the others. Rather than saying just import Mathlib, it says import Mathlib.Data.Nat.Basic, which all the others do, and this dates them all to training on something like Zulipchat at say a year ago:

import Mathlib.Data.Nat.Basic

theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) := by
  intro h
  have hmod3 := Nat.pow_mod 3 (n+1) 2
  have hmod2 := Nat.pow_mod 2 (m+1) 3
  rw [h] at hmod3
  have : 2^((m+1) % 3) ≡ 0 [MOD 3] := hmod3
  -- Contradiction by modular arithmetic
  exact Nat.not_mod_eq_zero_of_prime_ne (Nat.prime_two) (Nat.prime_three) this

The other thing it does, which they all do, is hallucinate convenient theorems which are not currently in Mathlib and maybe never where. In the above case, Nat.not_mod_eq_zero_of_prime_ne .

Lars Ericson (Dec 14 2024 at 18:08):

To use Morph Prover, the steps are to pip install some libraries

pip install transformers accelerate

and then run the prompt in a script

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("typeof/morph-prover-v0-7b-sharded")

# Load the model
model = AutoModelForCausalLM.from_pretrained("typeof/morph-prover-v0-7b-sharded", device_map="auto")

# Use the model
text = "Provide a logical proof for: theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1)  "
inputs = tokenizer(text, return_tensors="pt").to("cuda")  # Use GPU if available
outputs = model.generate(inputs["input_ids"], max_new_tokens=100)

# Decode the output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

This downloads a large sharded model which takes some time.

Mario Carneiro (Dec 14 2024 at 18:53):

Lars Ericson said:

The other thing it does, which they all do, is hallucinate convenient theorems which are not currently in Mathlib and maybe never where. In the above case, Nat.not_mod_eq_zero_of_prime_ne .

By the way, this is possibly valuable information, either for spotting theorems that are badly named or for spotting natural theorems that don't exist yet in mathlib. Our naming convention is designed so that you can guess a theorem name without knowing if it exists and find it a significant fraction of the time

Lars Ericson (Dec 14 2024 at 19:10):

On MorphProver above, the output is not helpful compared to the current state of the not specifically Lean 4 trained and publicly accessible LLMs:

Provide a logical proof for: theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1)

In the first paragraph (labeled Context), give context on the mathematical concepts, definitions, and arguments.

In the second paragraph (labeled Statement), explain how to arrive at the key insight, providing a detailed natural language line-by-line account of the proof.

In the third paragraph (labeled Summary), briefly summarize what you’ve said, in a declarative tone. [/INST] Context:
In this code block, we are working with

It seems to be giving back the system prompt which is hidden instructions to the model on what is expected of it, without actually doing anything. It doesn't seem like there is a public version of Morph Labs model that does much Lean. I'm guessing that this was a demo and not the focus of Morph Labs going forward. If anybody knows better please chime in.

Lars Ericson (Dec 14 2024 at 19:34):

Here is one more benchmark case:

theorem pow_injective {a b : ℕ} (h : 2^a = 2^b) : a = b  := sorry

and flailing Perplexity transcript: https://www.perplexity.ai/search/see-if-you-can-complete-this-p-hNsUjBHqQT6z.wHZK_in5Q

Jason Rute (Dec 15 2024 at 03:07):

@Jesse Michael Han what is this Morph Prover?

Jason Rute (Dec 15 2024 at 03:10):

@Lars Ericson I heard from others here that Sonnet 3.5 seems to get Lean syntax correct. Is that the version of Claude you are using?

Jason Rute (Dec 15 2024 at 03:13):

But maybe we shouldn’t be surprised these models struggle so much on Lean, especially since (1) they don’t have access to Lean’s goals and (2) Lean changes so fast and so drastically.

Jason Rute (Dec 15 2024 at 03:16):

It would be interesting if the situation was a lot better in Isabelle or Coq, but especially Isabelle, since it has a canonical library which has been more stable for a while.

Jason Rute (Dec 15 2024 at 03:23):

Also, there are locally run models. While AlphaProof isn’t (and may never be) available, there are solvers like Lean Copilot and LeanLLM that are user facing. There are also open models (not user facing) like DeepSeek-Prover, InternLM-Step, Able, Corpra, etc that one could try if one was adventurous. If I had more time, I’d make a personal arena of all these solvers (as well as the Coq/Isabelle ones) so I could test problems like this on them.

GasStationManager (Dec 15 2024 at 14:09):

A preliminary try with Sonnet 3.5: it outputting correct proof sketches with sorry, and was able to articulate what kind of lemma it is looking for, but was not able to precisely recall the lemma. Is there a good natural language search engine for Mathlib?

Jason Rute (Dec 15 2024 at 14:19):

GasStationManager said:

Is there a good natural language search engine for Mathlib?

There are three: https://www.moogle.ai, https://leansearch.net, and https://loogle.lean-lang.org. The last one is symbolic, while the other two are machine learning. I think you can now call all three from inside Lean, but I don’t know the details. (There was also a third machine learning search engine, but I don’t remember the name off hand.)

Lars Ericson (Dec 15 2024 at 14:58):

Jason Rute said:

Lars Ericson I heard from others here that Sonnet 3.5 seems to get Lean syntax correct. Is that the version of Claude you are using?

I used Claude.AI free version, which means Claude 3.5 Haiku. Claude gives a link to https://claude.ai/new but that just redirects you to Claude 3.5 Haiku. I think to get Sonnet you have to commit to sending $20/month to Anthropic, which is beyond my budget.

Lars Ericson (Dec 15 2024 at 15:04):

Lean Copilot is at https://leandojo.org/. I guessed (wrong) that that was a fine-tuning of Microsoft Copilot. I think they just borrowed the Copilot name, which Microsoft lawyers might eventually complain about. Not to mention:

:warning: Native Windows currently not supported

They also have a ChatGPT plugin. To get Lean (not-fintuned-Microsoft-)Copilot, it is an install using lake of a VS Code extension: https://github.com/lean-dojo/LeanCopilot?tab=readme-ov-file#getting-started-with-lean-copilot

I will try it.

Alex Meiburg (Dec 15 2024 at 15:21):

Lars Ericson said:

I think they just borrowed the Copilot name,

This feels a little unfair. The idea of an "AI Copilot" was certainly a phrase tossed around plenty before Microsoft announced their product.

Jason Rute (Dec 15 2024 at 15:28):

Here is another example which came up on PA.SX. https://proofassistants.stackexchange.com/a/4467/122. Can you either get an AI to correctly make an outline of the proof or get one to correctly fill in the steps. The steps should be standard analysis facts, but maybe Mathlib doesn’t have them yet or they are too hard to use with all the needed side conditions.

Lars Ericson (Dec 15 2024 at 15:32):

Alex Meiburg said:

Lars Ericson said:

I think they just borrowed the Copilot name,

This feels a little unfair. The idea of an "AI Copilot" was certainly a phrase tossed around plenty before Microsoft announced their product.

You are correct, complex history: https://www.perplexity.ai/search/is-ai-copilot-a-coinage-that-p-eAQvaiXITy2ngVyokNIODQ

Lars Ericson (Dec 15 2024 at 15:37):

Jason Rute said:

Here is another example which came up on PA.SX. https://proofassistants.stackexchange.com/a/4467/122. Can you either get an AI to correctly make an outline of the proof or get one to correctly fill in the steps. The steps should be standard analysis facts, but maybe Mathlib doesn’t have them yet or they are too hard to use with all the needed side conditions.

Lean Copilot has search_proof which pretty much ends the matter. I am getting it configured now. The instructions say to add this in lakefile.lean inside my package declaration:

  moreLinkArgs := #[
    "-L./.lake/packages/LeanCopilot/.lake/build/lib",
    "-lctranslate2"

and before that this require line:

require LeanCopilot from git "https://github.com/lean-dojo/LeanCopilot.git" @ "LEAN_COPILOT_VERSION"

where LEAN_COPILOT_VERSION is looked up from the table here, which goes up to lean-toolchain v4.11.0. I am actually at leanprover/lean4:v4.15.0-rc1, so I will try Copilot version v1.6.0, which is their latest version.

Lars Ericson (Dec 15 2024 at 15:43):

Additionally for Linux it is necessary to install Git LFS:

sudo apt-get update
sudo apt-get install git-lfs

Lars Ericson (Dec 15 2024 at 15:44):

and then run

lake update LeanCopilot
lake exe LeanCopilot/download
lake clean
lake build

This downloads around 23GB of model files.

Jason Rute (Dec 15 2024 at 15:52):

I don’t understand how your message relates to mine. Ends what matter? (Also I think I tried Lean Copilot on the sorries in my problem and it didn’t solve them if I recall correctly, but I could be mistaken.)

Lars Ericson (Dec 15 2024 at 16:01):

Jason Rute said:

I don’t understand how your message relates to mine. Ends what matter? (Also I think I tried Lean Copilot on the sorries in my problem and it didn’t solve them if I recall correctly, but I could be mistaken.)

That was an example of overly dry humor that missed the mark. My standup comedy skills need improvement.

Lars Ericson (Dec 15 2024 at 16:39):

Unfortunately, my lake build now hangs even after a lake clean:

$ lake clean
$ lake build -v
⣿ [?/?] Computing build jobs

with this lakefile.lean:

import Lake
open Lake DSL

require mathlib from git "https://github.com/leanprover-community/mathlib4.git"

require LeanCopilot from git "https://github.com/lean-dojo/LeanCopilot.git" @ "v1.6.0"

package Hybrid {
  moreLinkArgs := #[
    "-L./.lake/packages/LeanCopilot/.lake/build/lib",
    "-lctranslate2"
  ]
}

@[default_target]
lean_lib Hybrid {
  -- Explicitly list all your .lean files here
  roots := #[
    `Hybrid.PROP, `Hybrid.TotalSet, `Hybrid.SVAR, `Hybrid.TypeIff,
    `Hybrid.Completeness, `Hybrid.Eval, `Hybrid.Model, `Hybrid.Tautology,
    `Hybrid.Substitutions, `Hybrid.IteratedModalities, `Hybrid.Variables,
    `Hybrid.New_NOM, `Hybrid.NominalSubstitution, `Hybrid.FormSubstitution,
    `Hybrid.Nominals, `Hybrid.NOM, `Hybrid.Form, `Hybrid.Util, `Hybrid.Proof,
    `Hybrid.Tautologies, `Hybrid.Variants, `Hybrid.Satisfaction, `Hybrid.Truth,
    `Hybrid.Lemmas, `Hybrid.Soundness, `Hybrid.ExistenceLemma, `Hybrid.ListUtils,
    `Hybrid.RenameBound, `Hybrid.GeneralModel, `Hybrid.Canonical,
    `Hybrid.GeneratedSubmodel, `Hybrid.WitnessedModel, `Hybrid.ProofUtils,
    `Hybrid.FormCountable, `Hybrid.MCS, `Hybrid.CompletedModel,
    `Hybrid.StandardCompletedModel, `Hybrid.Lindenbaum, `Hybrid.LanguageExtension
  ]
}

Lars Ericson (Dec 15 2024 at 16:41):

I have one CPU cpre running 100% so it is spinning it's wheels doing something.

I would appreciate any suggestions on how to debug the lake build.

Jason Rute (Dec 15 2024 at 16:41):

I would try it in a fresh project first to make sure it isn’t an issue with your project.

Jason Rute (Dec 15 2024 at 16:43):

And one without mathlib. (I assume you downloaded the mathlib cache, right. If not you will be waiting for all of mathlib to compile.)

Lars Ericson (Dec 15 2024 at 17:05):

All of mathlib is there. I think I may need to move back from latest Lean to Lean v4.11.0 which corresponds to their September release of LeanDojo. I will try making a new emptyproject, add the lines for LeanDojo, and see if it builds.

Lars Ericson (Dec 15 2024 at 17:30):

I think I have it built, like this:

lake +leanprover/lean4:4.11.0 new lean_dojo math
cd lean_dojo
emacs lakefile.lean
lake update
lake clean
lake build -v
code .

where I edit lakefile.lean to be

import Lake
open Lake DSL

package "lean_dojo" where
  -- Settings applied to both builds and interactive editing
  leanOptions := #[
    ⟨`pp.unicode.fun, true⟩ -- pretty-prints `fun a ↦ b`
  ]
  moreLinkArgs := #[
    "-L./.lake/packages/LeanCopilot/.lake/build/lib",
    "-lctranslate2"
  ]

require "leanprover-community" / "mathlib"

require LeanCopilot from git "https://github.com/lean-dojo/LeanCopilot.git"

@[default_target]
lean_lib «LeanDojo»

Lars Ericson (Dec 15 2024 at 17:37):

At this point with a test like

import Mathlib
import LeanCopilot

theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) := by
  search_proof

I have to rebuild the imports including all of Mathlib which takes a while.

Lars Ericson (Dec 15 2024 at 17:44):

Open issue of build circularity: https://github.com/lean-dojo/LeanCopilot/discussions/57

Clean Dojo-use project here, I will try this: https://github.com/yangky11/lean4-example/tree/LeanCopilot-demo

Lars Ericson (Dec 15 2024 at 17:51):

sudo apt-get update
sudo apt-get install -y curl git git-lfs clang lld libc++-dev
git clone -b LeanCopilot-demo https://github.com/yangky11/lean4-example.git
cd lean4-example/
lake update LeanCopilot
lake exe LeanCopilot/download
lake build
code .

image.png

Lars Ericson (Dec 15 2024 at 18:28):

sigh
image.png

Lars Ericson (Dec 15 2024 at 18:31):

With maxRecDepth of 1000. It would be nice to be able to configure this to run inference on GPU. My GPU is idle. All CPU cores are running hot. Not much memory is being used, so stack space is not a problem.
image.png

Lars Ericson (Dec 15 2024 at 18:36):

suggest_tactics also gets nowhere. pow_injective is super hard it seems.

Alex Meiburg (Dec 15 2024 at 18:46):

So exact? doesn't get it the pow_injective problem? docs#Nat.pow_right_injective is basically exactly this theorem

Lars Ericson (Dec 15 2024 at 18:59):

I didn't know exact?, I'm a beginner. exact apply? I know and it didn't get that. I'm still trying to configure Lean Dojo. It got lost trying to do a proof. Running suggest_tactics gives these ideas after 10 minutes of thinking:

• exact Nat.pow_eq_zero_iff.1 h
• induction a
• cases a <;> contradiction
• induction a generalizing b
• cases a
• cases a <;> cases b <;> rfl
• cases a <;> cases b <;> simp [h]
• cases a <;> cases b <;> simp
• cases a <;> cases b
• cases a <;> cases b <;> cases b
• cases a <;> cases b <;> contradiction
• cases a <;> cases b <;> assumption

It seems that Nat.pow_eq_zero_iff is not in the standard library, it is in Mathlib.

The lean4-example GItHub doesn't include Mathlib, but LeanDojo knows Mathlib. It is a real tangle. I'm trying to add back Mathlib into the build but it is failing. I'm about to give up for now. LeanDojo doesn't seem to be an actively supported deliverable and doesn't look like it plays well with full Mathlib.

Alex Meiburg (Dec 15 2024 at 19:46):

(deleted)

Huajian Xin (Dec 15 2024 at 20:20):

Lars Ericson said:

There are two other somewhat public LLMs fine tuned for Lean, LeanGPT and Morph Prover. LeanGPT fails exactly like the others. Rather than saying just import Mathlib, it says import Mathlib.Data.Nat.Basic, which all the others do, and this dates them all to training on something like Zulipchat at say a year ago:
import Mathlib.Data.Nat.Basic

theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) := by
  intro h
  have hmod3 := Nat.pow_mod 3 (n+1) 2
  have hmod2 := Nat.pow_mod 2 (m+1) 3
  rw [h] at hmod3
  have : 2^((m+1) % 3) ≡ 0 [MOD 3] := hmod3
  -- Contradiction by modular arithmetic
  exact Nat.not_mod_eq_zero_of_prime_ne (Nat.prime_two) (Nat.prime_three) this
The other thing it does, which they all do, is hallucinate convenient theorems which are not currently in Mathlib and maybe never where. In the above case, Nat.not_mod_eq_zero_of_prime_ne .

Why not give DeepSeek-Prover (https://github.com/deepseek-ai/DeepSeek-Prover-V1.5) a try? It is currently the best-performing specialized LLM for Lean4 and is open-source. For example, in the case you provided, the RL version of DeepSeek-Prover V1.5 can produce a correct proof:

import Mathlib

/-- Show that 3^(n+1) ≠ 2^(m+1)-/
theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) := by
  /-
  We aim to show that \(3^{n+1} \neq 2^{m+1}\) for any natural numbers \(n\) and \(m\). We will use a proof by contradiction. Assume, for the sake of contradiction, that \(3^{n+1} = 2^{m+1}\). We will then derive a contradiction by examining the properties of the powers of 2 and 3.
  1. First, we consider the parity of both sides of the equation \(3^{n+1} = 2^{m+1}\).
  2. The left-hand side, \(3^{n+1}\), is odd because any power of 3 is odd.
  3. The right-hand side, \(2^{m+1}\), is even because any power of 2 is even.
  4. Since an odd number cannot equal an even number, we have a contradiction.
  5. Therefore, our assumption that \(3^{n+1} = 2^{m+1}\) must be false.
  Thus, we conclude that \(3^{n+1} \neq 2^{m+1}\).
  -/
  -- Assume, for the sake of contradiction, that 3^(n+1) = 2^(m+1).
  by_contra h
  -- Consider the parity of both sides of the equation.
  have h₀ : 3^(n+1) % 2 = 1 := by
    -- The left-hand side, 3^(n+1), is odd because any power of 3 is odd.
    simp [Nat.pow_mod, Nat.mod_eq_of_lt, show 3 % 2 = 1 by decide]
  have h₁ : 2^(m+1) % 2 = 0 := by
    -- The right-hand side, 2^(m+1), is even because any power of 2 is even.
    simp [Nat.pow_mod, Nat.mod_eq_of_lt, show 2 % 2 = 0 by decide]
  -- Since an odd number cannot equal an even number, we have a contradiction.
  simp_all

Jason Rute (Dec 15 2024 at 20:59):

@Huajian Xin I wish you would advertise more. I find one problem is that researchers don’t try to get anyone to use their research. For example, is it easy to install/use DeepSeek-Prover v1.5 on a laptop? Also what are the expectations? Since the target was minif2f, does it work on more advanced math, or just competition problems?

Lars Ericson (Dec 15 2024 at 21:18):

Thanks @Huajian Xin I will be happy to give DeepSeek a try.

GasStationManager (Dec 15 2024 at 21:32):

Nice one @Huajian Xin !
After Sonnet 3.5 got stuck, I asked what it would search for if given a search engine for Mathlib. It asked for Nat.pow_mod. I searched for it on moogle.ai and pasted the definition of the theorem to Sonnet. It was able to use the theorem to complete the proof. This is done on my command line chat interface where Sonnet can invoke Lean and fix its own syntax errors.

import Mathlib
theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) := by
  intro h

  -- For 3^(n+1) % 2
  have h1 : 3^(n+1) % 2 = 1 := calc
    3^(n+1) % 2 = (3 % 2)^(n+1) % 2 := by apply Nat.pow_mod
    _ = 1^(n+1) % 2 := by rw [show 3 % 2 = 1 by norm_num]
    _ = 1 % 2 := by simp
    _ = 1 := by norm_num

  -- For 2^(m+1) % 2
  have h2 : 2^(m+1) % 2 = 0 := calc
    2^(m+1) % 2 = (2 % 2)^(m+1) % 2 := by apply Nat.pow_mod
    _ = 0^(m+1) % 2 := by rw [show 2 % 2 = 0 by norm_num]
    _ = 0 := by simp

  -- From h: 3^(n+1) = 2^(m+1), their remainders mod 2 must be equal
  have contra : 1 = 0 := by
    rw [←h1, ←h2, h]

  -- This is our contradiction
  exact Nat.one_ne_zero contra

Next is to make this automatic. Does moogle.ai have an API?

Huajian Xin (Dec 16 2024 at 08:26):

Jason Rute said:

Huajian Xin I wish you would advertise more. I find one problem is that researchers don’t try to get anyone to use their research. For example, is it easy to install/use DeepSeek-Prover v1.5 on a laptop? Also what are the expectations? Since the target was minif2f, does it work on more advanced math, or just competition problems?

Thank you @Jason Rute !

Running a 7B large language model directly on a personal laptop is generally not practical without significant compromises, such as using heavily quantized versions (e.g., as mentioned here). However, quantization often leads to further performance degradation. For this reason, we believe that the most effective way to make tools like DeepSeek-Prover v1.5 widely accessible is for developers to build remote-access services, similar to how GitHub Copilot operates. This would allow users to interact with the model seamlessly without needing specialized hardware.
As for more advanced math, we haven't conducted extensive experiments beyond MiniF2F. We do have some preliminary results on ProofNet, but those are still in the early stages. Additionally, while we did a limited amount of training on mathlib, we haven’t fully evaluated the model on advanced math benchmarks. Maybe we can evaluate the model further on suitable benchmarks like miniCTX.

Lars Ericson (Dec 16 2024 at 14:11):

@Huajian Xin does DeepSeek need GPU to run the 8B model? For inference can it run on CPU? In that case, a laptop with 32GB of RAM should be sufficient? Or can you suggest a minimum RAM requirement for CPU inference?

Can you make a GitHub which shows a working Lean install using DeepSeek? For example along the lines of this one used to demo LeanCopilot which has the correctly configured lakefiles and so on: https://github.com/yangky11/lean4-example/tree/LeanCopilot-demo, together with exact build steps. I'm asking because you can see above in thread that I struggled for several hours trying to get LeanCopilot to install with Lean before I got it working more or less.

Is it able to handle import Mathlib? Importing mathlib seemed to cause problems for LeanCopilot .

Jason Rute (Dec 16 2024 at 14:30):

@Lars Ericson As for Mathlib and lean copilot, did you downloaded the mathlib cache so you aren’t building mathlib locally? @Huajian Xin why do your instructions say to build mathlib locally instead of just downloading the cache?

Huajian Xin (Dec 16 2024 at 14:36):

@Lars Ericson Basically a GPU with at least 24G memory (e.g. RTX 4090) is required to run the full-sized model.
@Jason Rute That's mainly because we modify the lakefile to support our custom setting for REPL. Maybe downloading the cache before building will also work.

Jason Rute (Dec 16 2024 at 14:52):

Ok, so to run DS-P I think one either needs to host a server with a GPU or get quanitization working. Neither is easy for the typical Lean user, but some people here are pretty talented and adventurous. (By the way, I’m going to talk about this sort of stuff at Lean Together next month, and I also gave a talk about it at AITP. The point is that there is a huge disconnect between research and users, and I think this discussion we are having right now really illustrates that divide.)

Lars Ericson (Dec 16 2024 at 20:11):

Huajian Xin said:

Lars Ericson Basically a GPU with at least 24G memory (e.g. RTX 4090) is required to run the full-sized model.
Jason Rute That's mainly because we modify the lakefile to support our custom setting for REPL. Maybe downloading the cache before building will also work.

I don't understand why inference for the trained model can't run on CPU. Have you tried it? Typically GPU is only required for training. Does it train on the fly? What makes it different from most other machine learning models on this point?

Otherwise the try-at-home price is around USD2500 including the 4090 card and power supply upgrade: https://www.amazon.com/ASUS-Gaming-GeForce-Graphics-DisplayPort/dp/B0C7JYX6LN/

Free tier Google Colab won't have as much VRAM for either GPU or TPU as the 4090: https://www.perplexity.ai/search/how-much-vram-is-on-a-google-c-rX.ezKv.RySkOuYCx_Nx4A

Alex Meiburg (Dec 16 2024 at 20:16):

Lars Ericson said:

I don't understand why inference for the trained model can't run on CPU. Have you tried it? Typically GPU is only required for training.

I don't think that's accurate. At least, not these days, with current LLMs. Neural networks are highly parallel structures, and benefit from the parallel power of GPUs for both training and inference. If you look at what people do with "inference-at-the-edge", it almost all uses accelerators (i.e. GPUs).

Of course you can run it on a CPU. But it's roughly 100x slower, usually much too slow to be interesting or useful.

Jason Rute (Dec 16 2024 at 20:26):

@Lars Ericson If you have enough memory, you could try on your CPU. It is an 8B model so it isn’t impossible, I think, just slow and memory intensive. But I’m also not sure what kind of search @Huajian Xin was using to get the result above. One sample? Many samples (like hundreds)? Or Monte Carlo tree search? The later would probably be quite slow on a CPU.

Lars Ericson (Dec 16 2024 at 22:12):

@Jason Rute in your AITP talk you might want to discuss the hardware requirements for inference for the available models like LeanCopilot and DS-P. Here is a guess from Perplexity.

Jason Rute (Dec 16 2024 at 22:56):

I don’t really appreciate big dumps on Zulip of AI written explainations.

Jason Rute (Dec 16 2024 at 23:11):

But for DS-P, it is a normal-ish size model for this type of research. I think it is sort of big to be running on a laptop (especially if you need multiple generations to do anything of value), but it is also a research model not intended to be run as such. (Hence, I was really surprised when they said "Have you tried DeepSeek-Prover?".) To make it more user-friendly, there are options like quantization or running it on a local GPU like those found on gamer laptops or newer Macbooks. There are also important questions about how this sort of thing looks in the future. For example, if some organization wants to start supporting a server of Lean-specific models or if there needs to be research into how exactly to make usable models that can fit on a laptop and finish in real-world time. I think it will come down to the various use cases. (For example, many Lean projects would be happy with tools that run in the background on a server. And there are plenty of non-transformer approaches that may be better for running on a laptop. Also, there are approaches that use existing LLM APIs but with ITP-specific integration to clean up the code. These again, could be good for laptop users.)

Jason Rute (Dec 16 2024 at 23:23):

As for LeanCopilot, it uses a smaller model that can more easily run on a laptop. I'm sure it is a very good model from experience, but some have said they got value out of it (and I don't know if we have ever had a particularly good user-facing model for Lean proof generation).

Lars Ericson (Dec 16 2024 at 23:53):

The laptop 4090 has 16GB of VRAM, so it wouldn't work. This size and type of model in current day will only run on suitably sized servers. For users who are not developers of the model, the model would need to be hosted on a server under some funding model, either users pay to use or a funding agency pays to make it free.

Huajian Xin (Dec 17 2024 at 17:32):

Jason Rute said:

But for DS-P, it is a normal-ish size model for this type of research. I think it is sort of big to be running on a laptop (especially if you need multiple generations to do anything of value), but it is also a research model not intended to be run as such. (Hence, I was really surprised when they said "Have you tried DeepSeek-Prover?".) To make it more user-friendly, there are options like quantization or running it on a local GPU like those found on gamer laptops or newer Macbooks. There are also important questions about how this sort of thing looks in the future. For example, if some organization wants to start supporting a server of Lean-specific models or if there needs to be research into how exactly to make usable models that can fit on a laptop and finish in real-world time. I think it will come down to the various use cases. (For example, many Lean projects would be happy with tools that run in the background on a server. And there are plenty of non-transformer approaches that may be better for running on a laptop. Also, there are approaches that use existing LLM APIs but with ITP-specific integration to clean up the code. These again, could be good for laptop users.)

OpenAI and the experiences of many other players have shown us that we must first scale up before we can scale down—only after developing a hundreds-of-billion-parameter, superhuman theorem-proving model can we distill it into a smaller, faster, and also stronger model with hundreds of millions of parameters. Therefore, our current progress is only halfway through the first half.

Auguste Poiroux (Dec 17 2024 at 18:08):

@Huajian Xin Lots of information here, I am looking forward a DeepSeekProver 2 model based on DeepSeek v2.5 236B and accessible through API then ^^ Speculation aside, DeepSeekProver 1.5 is a good model, thanks for making it open-source!

Jason Rute (Dec 17 2024 at 18:19):

@Auguste Poiroux when you say a “good model”, do you mean from personal experience or the benchmarks in their paper. I’m mostly just curious how many people have been able to get it working for themselves. And if so, what have you tried doing with it?

Auguste Poiroux (Dec 17 2024 at 18:36):

I mean from personal experience. When we use the provided prompts and generation parameters it works fairly well. I recommend using 32<=n_samples<=256. This range is a good trade-off between accuracy and compute-time. I am just using the whole-proof generation mode though, I didn't try the tree-search approach given the low improvement it seems to bring (according to the official paper results).
And beyond the provided prompts it is also capable of following some additional instructions. I wasn't expecting this in the beginning because I thought the heavy RL tuning would have destroyed these capabilities.
I am using it on ProofNet-like problems, but I didn't try it on real problems.

Jason Rute (Dec 17 2024 at 18:42):

Is there any Lean interaction, or do you manually check each proof? Also how long does it take to generate 32 to 256 proofs? And what kind of machine are you running it on?

Auguste Poiroux (Dec 17 2024 at 18:49):

I use Lean to check the proofs, I am not crazy enough to check thousands of Lean proofs by hand :big_smile:
As long as you have a good GPU with enough memory (A100, H100, RTX4090), generating proofs using parallel sampling is really fast. Currently, checking the proofs is the bottleneck in the pipeline for me (but not too bad).

Jason Rute (Dec 17 2024 at 18:52):

Sorry for the silly question, but does DeepSeek come with a way to check proofs in Lean, or did you write your own scripts? And are the proofs checked the “naive” way where one just checks a separate Lean text file from beginning to end for each generation, or do you use metaprogramming to check all the proofs in the same Lean instance?

Auguste Poiroux (Dec 17 2024 at 19:02):

DeepSeek has some code to run the model on benchmarks: https://github.com/deepseek-ai/DeepSeek-Prover-V1.5. So yes it comes with a way to check proofs. From what I understand in their code, they check complete files (approximately true). Under the hood they use Lean REPL. I am not using their code, but I am also using Lean REPL. It has a nice way to iterate on environment states, so we can load the context once and then reuse that environment multiple times to check several proofs. In practice, in benchmarks, the context is just "import Mathlib" and checking proofs has a tendency to take more CPU-time than to run "import Mathlib". So separating the import from the statement+proof is not that interesting in this case.

Auguste Poiroux (Dec 17 2024 at 19:05):

I may have read too fast DeepSeek's code, my bad. It seems that they use the REPL environment states: https://github.com/deepseek-ai/DeepSeek-Prover-V1.5/blob/main/prover/lean/verifier.py#L25

Jason Rute (Feb 01 2025 at 10:28):

I just tried the theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) on o3-mini (ChatGPT with reasoning turned on). (https://chatgpt.com/c/679df465-0a6c-8001-ad84-318bfd363c32) It got pretty close. I had to fix the Lean 4 syntax and replace a hallucinated theorem name with dvd_of_dvd_pow. Hard stuff for a new user, but not so hard for an experienced user. (I was also going to try DeepSeek-r1 since it might have been trained on Deepseek-Prover's data, or so someone here said. But alas it was busy when I tried.)

Eric Taucher (Feb 01 2025 at 10:51):

Jason Rute said:

Hard stuff for a new user, but not so hard for an experienced user.

Over the last two years since ChatGPT became public that statement has become so common that it should get a name and a Wikipedia entry. Would be curious to see examples as the years progress.

Jason Rute (Feb 01 2025 at 10:53):

Deepseek-r1's solution didn't seem as helpful. ~~(https://chat.deepseek.com/a/chat/s/f1cfb2dc-3a98-4d16-9f3b-f2226f6bf359)~~ It just tried to solve the whole thing with one hallucinated theorem. I don't know how to fix it.

Drophet (Feb 01 2025 at 11:11):

Unfortunately, the link provided isn't accessible to others. Could you please share the relevant parts of the conversation in another way?

Jason Rute (Feb 01 2025 at 11:46):

Ok, wasn't sure how it works. Here is the final "proof" from DS-r1:

"Proof"

theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) := by
  -- Use the fundamental theorem of arithmetic to show that the prime factorizations of 3^(n+1) and 2^(m+1) are distinct.
  apply Nat.pow_ne_pow_of_ne_of_lt (by norm_num) (by norm_num)
  -- `norm_num` is used to verify that 3 ≠ 2 and 3 < 2^(m+1) for all m ∈ ℕ, which are necessary conditions for the theorem to hold.
  <;> norm_num
  <;> linarith

Jason Rute (Feb 01 2025 at 11:51):

The reasoning trace (not shown) doesn't mention Lean until the proof at the end. It was just thinking through an English proof going through the fundamental theorem of arithmetic. Then the Lean "proof" uses some magic lemma that I don't think exists in Mathlib.

Kevin Buzzard (Feb 01 2025 at 11:52):

It's hard to guess what the lemma would say, for example 8^2=4^3, you will need some kind of primality assumption somewhere (4 != 8 and 4 < 8^(m+1))

Bolton Bailey (Feb 05 2025 at 19:33):

Jason Rute said:

I just tried the theorem prime_2_3 (n m : Nat) : 3^(n+1) ≠ 2^(m+1) on o3-mini (ChatGPT with reasoning turned on). (https://chatgpt.com/c/679df465-0a6c-8001-ad84-318bfd363c32)

I can't see anything at this link (Unable to load conversation 679df465-0a6c-8001-ad84-318bfd363c32), would you care to copy out the o3-mini answer as well?

Last updated: May 02 2025 at 03:31 UTC