Zulip Chat Archive

Stream: Lean Together 2024

Topic: Towards LLMs as Copilots - Kaiyu Yang


Mario Carneiro (Jan 11 2024 at 16:11):

@Mac Malone just noting a hidden lake feature request from the talk: the LeanCopilot README asks that you add link arguments to the downstream project; it should be possible for the upstream project to do this directly.
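For reference, the stanza the README asks downstream users to add is roughly the following (a sketch from memory; the exact flags and paths may differ from the current README):

-- downstream lakefile.lean (sketch; paths and flags are approximate)
package «my-project» where
  moreLinkArgs := #[
    "-L./.lake/packages/LeanCopilot/.lake/build/lib",  -- where LeanCopilot's native libraries land
    "-lctranslate2"                                     -- CTranslate2, used for local inference
  ]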

Rob Lewis (Jan 11 2024 at 16:38):

I have a broad question here! @Heather Macbeth has made the point that LLM-based tactic search is only as good as the underlying tactic language. The symbolic methods used by normal tactics are the building blocks for any kind of copilot, and improving these improves the copilot. But our current tactic suite is designed around humans; it's not necessarily the case that the best tactics for human use are also the best tactics for LLMs (at least if you're interested in generating full proofs). Has anyone thought about what AI-oriented symbolic tactics might look like?

Joachim Breitner (Jan 11 2024 at 16:49):

I have two questions, very broad as well:
I’m somewhat surprised that the same textual representation of the proof state that we humans use is also best for the machine model. Why not something that makes some of the implicit structure more easily “machine” readable, e.g. using S-expressions, adding more contextual information, maybe alpha-renaming to reduce irrelevant variation?
Or is it the case that seeing through these things is so simple for LLMs these days that there isn’t much to be gained?

Joachim Breitner (Jan 11 2024 at 16:50):

And the other one is about attack surface. The copilot runs arbitrary tactics, behind the scenes, without user interaction. A malicious tactic could take over the user's machine, right? So if someone were to train the model on, say, all of Reservoir, it seems it wouldn’t be too hard to poison it into trying some malicious tactics. How do Copilot-like systems prevent that so far?

Adam Topaz (Jan 11 2024 at 16:51):

S-expressions of Lean expressions end up being really huge, right (by which I mean compared with something like the pretty-printed expression)?

Eric Wieser (Jan 11 2024 at 17:03):

Joachim Breitner said:

I’m somewhat surprised that the same textual representation of the proof state that we humans use ...

In particular I'd claim that this is not what we humans use; we regularly use additional information accessible via hovers in the editor and the infoview.

Joachim Breitner (Jan 11 2024 at 17:08):

Maybe someone should ping @Kaiyu Yang about this thread :-)

Kaiyu Yang (Jan 11 2024 at 17:43):

There are attempts using more machine-oriented representations of states. E.g., ASTactic (https://arxiv.org/abs/1905.09381) and its follow-up works use the ASTs of terms in the form of S-expressions. Graph2Tac uses GNNs to encode the graph of terms. We don't really have enough empirical evidence to know how these non-LLM methods compare to LLM-based methods such as ours, due to difficulties in making a fair comparison across different proof assistants.

Nevertheless, I tend to believe it's better for LLM-based methods to take human-readable input formats such as pretty-printed tactic states. These models are pretrained on natural language from the web and programs (in Python, C++, etc.) from GitHub. It's likely that what they have learned can be better transferred to domains with similar human-readable representations.
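For intuition, here is a toy illustration (not taken from LeanDojo's data; the instance names in the comment are approximate) of the two kinds of input. The pretty-printed tactic state is compact, whereas the fully elaborated term underneath, written as an S-expression-style tree, is already much larger even for a trivial goal:

import Mathlib.Tactic

example (m n : ℕ) (h : m ≤ n) : m ≤ n + 1 := by
  -- Pretty-printed state, roughly what an LLM-based prover sees:
  --   m n : ℕ
  --   h : m ≤ n
  --   ⊢ m ≤ n + 1
  -- The same goal as a fully elaborated term, S-expression style (approximate):
  --   (LE.le ℕ instLENat m
  --     (HAdd.hAdd ℕ ℕ ℕ (instHAdd ℕ instAddNat) n
  --       (OfNat.ofNat ℕ 1 (instOfNatNat 1))))
  omega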

Kaiyu Yang (Jan 11 2024 at 17:45):

Furthermore, a promising approach (that we haven't explored yet) is to use LLMs to learn to align informal and formal math (see Draft, Sketch, and Prove). LLMs are likely needed for this setup, given their capabilities in understanding/generating natural language.

Kevin Buzzard (Jan 11 2024 at 17:49):

I just tried to add this copilot to the repo for the course which I started teaching today https://github.com/ImperialCollegeLondon/formalising-mathematics-2024 (a project which depends on mathlib) and to my surprise lake update LeanCopilot seemed to update std, and now lake exe cache get followed by lake build is building Std. Is this expected?

Adam Topaz (Jan 11 2024 at 17:49):

Sounds like some version mismatch between the std in the two dependencies?

Eric Wieser (Jan 11 2024 at 17:50):

What does git diff show?

Eric Wieser (Jan 11 2024 at 17:50):

If it modified the Std version in the lake-manifest, then Adam is probably correct (and the fix is to revert that change with git checkout -p HEAD lake-manifest.json)

Adam Topaz (Jan 11 2024 at 17:51):

How heavily does LeanCopilot depend on std @Kaiyu Yang ?

Kevin Buzzard (Jan 11 2024 at 17:52):

Eric Wieser said:

What does git diff show?

--- a/lake-manifest.json
+++ b/lake-manifest.json
@@ -4,26 +4,35 @@
  [{"url": "https://github.com/leanprover/std4",
    "type": "git",
    "subDir": null,
-   "rev": "d8610e1bcb91c013c3d868821c0ef28bf693be07",
+   "rev": "ee49cf8fada1bf5a15592c399a925c401848227f",
    "name": "std",
...

Eric Wieser (Jan 11 2024 at 17:52):

I would guess that LeanCopilot works fine with old std, and the problem is that lake update is too aggressive

Kevin Buzzard (Jan 11 2024 at 17:57):

so what do I do? I'm a lake noob

Eric Wieser (Jan 11 2024 at 17:58):

Revert the change to that line

Kevin Buzzard (Jan 11 2024 at 17:58):

(deleted)

Eric Wieser (Jan 11 2024 at 17:58):

Either manually, or with the checkout command I gave you above

Eric Wieser (Jan 11 2024 at 17:58):

[image]

I'm afraid that strategy won't work here

Kevin Buzzard (Jan 11 2024 at 17:58):

yeah that's the problem :-/

Kevin Buzzard (Jan 11 2024 at 18:00):

If I revert the change to that file then I'll also remove this part:

+   "rev": "5e5b19486d8b850aa4f3f3169e95de560f39b2f8",
+   "name": "LeanCopilot",
+   "manifestFile": "lake-manifest.json",
+   "inputRev": "v1.0.1",
+   "inherited": false,
+   "configFile": "lakefile.lean"},
+  {"url": "https://github.com/leanprover-community/quote4",
+   "type": "git",
+   "subDir": null,
+   "rev": "1c88406514a636d241903e2e288d21dc6d861e01",

Eric Wieser (Jan 11 2024 at 18:00):

I meant remove just the one line

Eric Wieser (Jan 11 2024 at 18:00):

git checkout -p asks you piece by piece which ones you want to revert

Eric Wieser (Jan 11 2024 at 18:01):

Which is why I suggested it :)

Kevin Buzzard (Jan 11 2024 at 18:01):

yeah, that was when I gave up

Kevin Buzzard (Jan 11 2024 at 18:01):

I had to choose between [y,n,q,a,d,s,e,?] with no clues

Eric Wieser (Jan 11 2024 at 18:01):

? is "give clues", y is "yes revert", n is "no leave alone"

Kevin Buzzard (Jan 11 2024 at 18:02):

I don't like the look of this though:

buzzard@buster:~/lean-projects/formalising-mathematics-2024$ git checkout -p HEAD lake-manifest.json
diff --git a/lake-manifest.json b/lake-manifest.json
index 4448554..368aca5 100644
--- a/lake-manifest.json
+++ b/lake-manifest.json
@@ -4,26 +4,35 @@
  [{"url": "https://github.com/leanprover/std4",
    "type": "git",
    "subDir": null,
-   "rev": "d8610e1bcb91c013c3d868821c0ef28bf693be07",
+   "rev": "ee49cf8fada1bf5a15592c399a925c401848227f",
    "name": "std",
    "manifestFile": "lake-manifest.json",
    "inputRev": "main",
    "inherited": true,
    "configFile": "lakefile.lean"},
-  {"url": "https://github.com/leanprover-community/quote4",
+  {"url": "https://github.com/JLimperg/aesop",
    "type": "git",
    "subDir": null,
-   "rev": "1c88406514a636d241903e2e288d21dc6d861e01",
-   "name": "Qq",
+   "rev": "69404390bdc1de946bf0a2e51b1a69f308e56d7a",
+   "name": "aesop",
    "manifestFile": "lake-manifest.json",
    "inputRev": "master",
    "inherited": true,
    "configFile": "lakefile.lean"},
-  {"url": "https://github.com/leanprover-community/aesop",
+  {"url": "https://github.com/lean-dojo/LeanCopilot.git",
    "type": "git",
    "subDir": null,
-   "rev": "69404390bdc1de946bf0a2e51b1a69f308e56d7a",
-   "name": "aesop",
+   "rev": "5e5b19486d8b850aa4f3f3169e95de560f39b2f8",
+   "name": "LeanCopilot",
+   "manifestFile": "lake-manifest.json",
+   "inputRev": "v1.0.1",
+   "inherited": false,
+   "configFile": "lakefile.lean"},
+  {"url": "https://github.com/leanprover-community/quote4",
+   "type": "git",
+   "subDir": null,
+   "rev": "1c88406514a636d241903e2e288d21dc6d861e01",
+   "name": "Qq",
    "manifestFile": "lake-manifest.json",
    "inputRev": "master",
    "inherited": true,
(1/1) Discard this hunk from index and worktree [y,n,q,a,d,s,e,?]?

Eric Wieser (Jan 11 2024 at 18:02):

Ouch

Kevin Buzzard (Jan 11 2024 at 18:03):

Which parts of these do I want to manually revert? Is it "everything other than LeanCopilot"?

Eric Wieser (Jan 11 2024 at 18:03):

This happened because the URL of aesop changed between LeanCopilot and your version of mathlib (LeanCopilot is too out of date)

Eric Wieser (Jan 11 2024 at 18:03):

this is a dead end, a piecewise revert through git is going to fail here

Kevin Buzzard (Jan 11 2024 at 18:03):

OK thanks for saving me from screwing everything up royally.

Eric Wieser (Jan 11 2024 at 18:04):

I think the thing to do is copy the LeanCopilot section from the file you currently have, then completely revert it

Eric Wieser (Jan 11 2024 at 18:04):

Then paste in just that new section

Eric Wieser (Jan 11 2024 at 18:06):

Maybe @Mac Malone can advise on a better approach here

namibj (Jan 11 2024 at 18:06):

@Eric Wieser
There are AST digesting language models that take an AST as input instead of taking a flat string.
They can't be pre-trained on loads of pre-existing human language like Google's ByT5 was.
There are also just general graph-capable transformer variants used quite successfully for e.g. molecules in chemistry.
The problem is that when you don't use the string variant, you have to somehow emit the structured variant for tactics that take fresh expressions.

If you or @Joachim Breitner want to investigate, DM me, I'm working towards this (for Lean4).

Adam Topaz (Jan 11 2024 at 18:12):

namibj said:

There are AST digesting language models that take an AST as input instead of taking a flat string.

Can you point us to some examples?

Kaiyu Yang (Jan 11 2024 at 18:13):

Kevin Buzzard said:

now lake exe cache get followed by lake build is building Std. Is this expected?

It's fine as long as it's not building mathlib? Building Std only shouldn't take long.

Eric Wieser (Jan 11 2024 at 18:13):

Mathlib will rebuild if std was rebuilt

Kaiyu Yang (Jan 11 2024 at 18:15):

OK, is there anything I can do on LeanCopilot's side to solve this problem?

E.g., updating the aesop dependency?

Kevin Buzzard (Jan 11 2024 at 18:17):

Eric Wieser said:

I think the thing to do is copy the LeanCopilot section from the file you currently have, then completely revert it

Then paste in just that new section

Thanks a lot Eric -- this seems to have worked; I have Lean Copilot running in my repo!

Mauricio Collares (Jan 11 2024 at 18:18):

What happens if you do lake update LeanCopilot followed by lake update mathlib? Does it keep LeanCopilot in the manifest but use mathlib's std version?

Eric Wieser (Jan 11 2024 at 18:18):

Updating aesop to the leanprover-community version would maybe make this less painful
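Concretely, that would mean a require line in LeanCopilot's lakefile.lean along these lines (a sketch; the exact branch or rev is up to you):

require aesop from git
  "https://github.com/leanprover-community/aesop" @ "master"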

Eric Wieser (Jan 11 2024 at 18:18):

Mauricio Collares said:

What happens if you do lake update LeanCopilot followed by lake update mathlib? Does it keep LeanCopilot in the manifest but use mathlib's std version?

Yes, but this also bumps mathlib!

namibj (Jan 11 2024 at 18:19):

Adam Topaz said:

namibj said:

There are AST digesting language models that take an AST as input instead of taking a flat string.

Can you point us to some examples?

https://arxiv.org/abs/1911.01545
They gave up on / didn't publish their attempts at Python ASTs; it's only mentioned in one of the versions of the LaTeX source, which you can download under "other formats" (a .tar wrapped in a browser-handled .gz; rename it to .tar and you can then open it).
They "only" tackle some fairly simple math terms, basically phone-calculator tier in function availability, but the idea scales somewhat further.

They use it to classify term pairs as equal/unequal, and also to do blank filling up to (iirc) depth 2. Iirc the latter uses a fairly brute-force decoding approach; more modern approaches would probably use masked/blanked diffusion-model techniques to generate expressions compliant with some demand/requirements.

Alex Kontorovich (Jan 11 2024 at 18:22):

Kevin Buzzard said:

git_xkcd_edited.png

(I got yelled at for posting a meme to zulip... I guess Kevin is allowed :smile: )

Kaiyu Yang (Jan 11 2024 at 18:22):

Eric Wieser said:

Updating aesop to the leanprover-community version would maybe make this less painful

Doing it now. It will take 15 minutes or so since I have to rebuild the cloud release.

Eric Wieser (Jan 11 2024 at 18:23):

Updating to a similar quote4 might help too (assuming you have that as a dependency)

Eric Wieser (Jan 11 2024 at 18:23):

But I think realistically you either need a new lake feature, or instructions to manually edit the lake manifest

Mauricio Collares (Jan 11 2024 at 18:24):

Eric Wieser said:

Yes, but this also bumps mathlib!

Hmm, and what if you edit the lakefile (not the manifest) to pin a particular rev before running lake update mathlib? Does it try to update again when changing the lakefile back to master?

Eric Wieser (Jan 11 2024 at 18:24):

Yes, that should work
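Concretely, a sketch of that temporary pin in lakefile.lean (the rev below is a placeholder for whatever commit you want to stay on):

-- pin mathlib to a fixed commit while updating other dependencies;
-- run lake update mathlib, then switch the rev back afterwards
require mathlib from git
  "https://github.com/leanprover-community/mathlib4" @ "<pinned-commit-sha>"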

Kevin Buzzard (Jan 11 2024 at 18:27):

Re the meme: sorry :-( I'm not allowed at all (it's now deleted). I was frustrated. The infrastructure around Lean 4 is infinitely more complex than the analogous Lean 3 set-up. I guess this is a sign of progress though.

On the plus side, I'm at Xena right now and an undergrad just asked me how to prove this:

lemma foo (X : Type) (A : Set X) (h : ¬ (A = ∅)) : ∃ x : X, x ∈ A := by

and search_proof found a proof!
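For the record, one proof that closes this goal (not necessarily the one search_proof produced):

import Mathlib.Data.Set.Basic  -- or just: import Mathlib

lemma foo (X : Type) (A : Set X) (h : ¬ (A = ∅)) : ∃ x : X, x ∈ A := by
  -- A ≠ ∅ is notation for ¬ (A = ∅), and A.Nonempty unfolds to ∃ x, x ∈ A
  exact Set.nonempty_iff_ne_empty.mpr h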

Kaiyu Yang (Jan 11 2024 at 18:38):

Updated LeanCopilot's dependencies in v1.0.2. You would need to update your lakefile.lean if you want to use this version.
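Concretely, something like this in your lakefile.lean (using the URL from the manifest shown above):

require LeanCopilot from git
  "https://github.com/lean-dojo/LeanCopilot.git" @ "v1.0.2"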

I have to run to the airport for a ski trip to Colorado....will be back next Monday.

Jason Rute (Jan 16 2024 at 02:52):

(deleted)

Jason Rute (Jan 16 2024 at 03:22):

Adam Topaz said:

namibj said:

There are AST digesting language models that take an AST as input instead of taking a flat string.

Can you point us to some examples?

It is not a language model, but our Graph2Tac (#Machine Learning for Theorem Proving > Announcing Graph2Tac, a prover based on Tactician's new API) model takes as input a graph representing the Coq term of the proof state, as well as the Coq terms of any prerequisite definitions (and for argument selection, the Coq terms of any possible tactic arguments in the global context).

Jason Rute (Jan 16 2024 at 03:25):

Kaiyu Yang said:

There are attempts using more machine-oriented representation of states. E.g., [ASTactic] (https://arxiv.org/abs/1905.09381) and its follow-up works use the ASTs of terms in the form of S-expressions. Graph2Tac uses GNNs to encode the graph of terms. We don't really have enough empirical evidence to know how these non-LLM methods compare to LLM-based methods such as ours, due to difficulties in a fair comparison across different proof assistants.

Thanks for mentioning Graph2Tac! While I agree it is very hard to compare across systems and approaches, note we do compare Graph2Tac to a transformer-based approach similar to GPT-f, and Graph2Tac does much better. (And for very short time limits, like up to a minute, a simple k-NN model seems to be the best model.) See Figures 5, 6, 7, 12, 16, and 17, as well as the caveats in the Threats to validity section of the current version. This could be for a number of reasons, and I'm happy to discuss it on #Machine Learning for Theorem Proving > Announcing Graph2Tac, a prover based on Tactician's new API. But, overall, this suggests that more comparison is needed, and as a community we should be careful not to assume that LLM solutions, or even neural solutions, are SoTA in this area (especially since, like you said, there are so few comparisons so far).

Kaiyu Yang (Jan 17 2024 at 15:36):

Hi @Jason Rute, congrats on the great work! I really like the comparisons in Graph2Tac between Transformers and simpler models such as k-NN. I'm wondering why you chose to train the Transformer model from scratch on Coq data, instead of fine-tuning an existing language model such as ByT5 or Mistral 7B. The latter experiment seems more interesting to me, as LLMs are less about Transformers than about pretraining the model on Internet-scale data. E.g., a pretrained RWKV or Mamba is probably more of an LLM than a Transformer without pretraining.

Jesse Wright (Mar 16 2024 at 23:36):

(Apologies for leeching onto this thread, but I figured the people interested in this message would be in it.) I did some quick experiments with Claude 3 Opus (https://www.anthropic.com/claude) over the last couple of days, and it does quite an impressive job on reasoning tasks, as they claim; for instance, it was able to reproduce this proof with a tiny bit of prompt help, such as sample proofs to learn the proof structure (not a Lean proof, sorry; it's an N3 proof from the domain I primarily work in).

With the way things are going, it seems to me that there could be a valuable student project in creating something similar to Devin for (1) generating proofs from scratch and (2) converting papers to Lean 4 proofs. I'd imagine this would use a combination of talking to Claude/GPT, querying Moogle, using the existing tooling in LeanCopilot, and then following a debug cycle similar to that seen in the sample videos.

Jason Rute (Mar 17 2024 at 03:24):

Hi @Jesse Wright you might also be interested in the #Machine Learning for Theorem Proving stream. That would also be a good place to discuss this.


Last updated: May 02 2025 at 03:31 UTC