Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: LeanCopilot


Kaiyu Yang (Dec 11 2023 at 07:07):

We introduce Lean Copilot, which uses language models to suggest tactics, search for proofs, and select premises in Lean. Here is a short demo:

Lean-Copilot-v1.mp4

Its initial version was known as LeanInfer, but now we have significantly expanded its scope:

  • Faster tactic generation on both CPUs and GPUs (powered by CTranslate2)
  • Higher tactic-generation quality, using beam search instead of multinomial sampling
  • Integration of LLM-generated tactics with aesop for proof search
  • Retrieval for premise selection
  • Support for user-provided models
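
A minimal usage sketch (assuming Lean Copilot is already installed as a dependency of your project; the toy goal is just for illustration):

```lean
import Mathlib.Tactic
import LeanCopilot

example (a b : ℕ) : a + b = b + a := by
  suggest_tactics   -- show model-generated next-step suggestions in the infoview

example (a b : ℕ) : a + b = b + a := by
  search_proof      -- run the aesop-based search over model-generated tactics

example (a b : ℕ) : a + b = b + a := by
  select_premises   -- print retrieved premises for this goal
  exact Nat.add_comm a b
```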

Chris Birkbeck (Dec 11 2023 at 11:02):

This is probably a stupid question, but how far are we from being able to use something like this to "clean up" some code and get it PR-ready or close to PR-ready? I have a bunch of stuff in a repo I need to PR, and while I don't expect something like this to be able to prove the results it contains, it would be nice if it could learn from the proofs that are in the repo and then just tidy it up (i.e. I say "this is the main result I want you to prove", then it goes, finds the proof in the repo, breaks it up into smaller results, removes unused results, fixes spacing/layout, etc.). Perhaps I'm just way behind and this is already possible, or maybe this is sci-fi and it's actually hard to do and we are not there yet.

Chris Birkbeck (Dec 11 2023 at 11:04):

(also sorry if this is off topic for this thread, it was just that the premise selection and proof search bits got me thinking about this)

Jason Rute (Dec 11 2023 at 12:09):

One danger of putting the discussion here is that if this is really meant as a user-facing tool with good usability (easy to install, no GPU needed, etc.), then you may be missing much of your audience. If so, it might be worth a post in #announce with a link back here.

Jason Rute (Dec 11 2023 at 12:20):

I see you have a short paper on this at Math-AI 2023 (https://mathai2023.github.io/papers/4.pdf). I’m curious if this system could be used for AI research, especially with the build-your-own-model capabilities. In particular, one would need more baselines. You only test on 50 theorems, and they are from Mathematics in Lean. Why? This just makes it another baseline that is incomparable to all the others out there in the literature. I’m also curious about your test. Was it a full proof search? Was it a deterministic timeout? Either way, how long did it typically take to run per theorem? Did you use a GPU? And if so, how long would it take per theorem on a CPU?

Jason Rute (Dec 11 2023 at 12:23):

I’m also curious how this compares to LLMStep by @Sean Welleck, both in terms of capabilities and in terms of ease of use. I guess for both, it would be best if users played with them, but I’m not sure if either is designed for end users yet.

Jason Rute (Dec 11 2023 at 12:24):

And how it compares to published works like your own ReProver. Is it significantly nerfed?

Jason Rute (Dec 11 2023 at 12:45):

Lean 3 used to have the ability (maybe it still does) to just show the tactic suggestions automatically in the infoview (using the HTPS model via an API). That would be a much better user interface than typing a tactic every time, provided it doesn’t churn through your CPU too much. @Gabriel Ebner added it to Lean 3.

Jason Rute (Dec 11 2023 at 13:06):

The Lean 3 infoview suggestions don’t run or rank the tactics, though, which I think this does. That is even more important than putting them in the infoview. (I gave a demo of the Lean 3 tactic suggestions at the IPAM workshop once, and the audience was really skeptical since all of the first suggested tactics failed.)

Kaiyu Yang (Dec 13 2023 at 19:13):

Chris Birkbeck said:

This is probably a stupid question, but how far are we from being able to use something like this to "clean up" some code and get it PR-ready or close to PR-ready? I have a bunch of stuff in a repo I need to PR, and while I don't expect something like this to be able to prove the results it contains, it would be nice if it could learn from the proofs that are in the repo and then just tidy it up (i.e. I say "this is the main result I want you to prove", then it goes, finds the proof in the repo, breaks it up into smaller results, removes unused results, fixes spacing/layout, etc.). Perhaps I'm just way behind and this is already possible, or maybe this is sci-fi and it's actually hard to do and we are not there yet.

We think it would be interesting to have something that can learn from your repo (as long as it's public on GitHub and can be built via lake build), automatically fill in sorries, and submit pull requests to your repo. I don't see a fundamental difficulty here. It's just that we have limited personnel and have to prioritize.

BTW, I was wondering if "PR" here means pull requests or public relations :rolling_on_the_floor_laughing:.

Kaiyu Yang (Dec 13 2023 at 19:14):

Jason Rute said:

worth a post in #announce with a link back here.

Thanks for the suggestion! I'll do that once I get a chance.

short paper on this at Math-AI 2023

That paper describes an earlier version of Lean Copilot (around September) and hasn't been updated. We plan to release a full paper in early 2024. We haven't decided whether to submit it to an ML conference or to conferences like ITP. Personally I lean towards ITP, since we want the main contribution to be a user-facing tool.

Kaiyu Yang (Dec 13 2023 at 19:21):

You only test on 50 theorems, and they are from Mathematics in Lean. Why? This just makes it another baseline that is incomparable to all the others out there in the literature. I’m also curious about your test. Was it a full proof search? Was it a deterministic timeout? Either way, how long did it typically take to run per theorem? Did you use a GPU? And if so, how long would it take per theorem on a CPU?

And how it compares to published works like your own ReProver. Is it significantly nerfed?

Just to clarify, Lean Copilot does not introduce new models or algorithms for machine learning to prove theorems. You can think of it as a "frontend" that interfaces with existing methods. Under the hood, it uses the same ReProver (w/o retrieval) model as in LeanDojo (and you can bring your own models). Therefore, I don't think it's necessary to evaluate Lean Copilot empirically on benchmarks. The results should be similar to what we described in the LeanDojo paper, apart from some minor differences, e.g., the different implementations of beam search in CTranslate2 and Hugging Face.

Jason Rute (Dec 13 2023 at 19:40):

From the user side, I’m wondering how much worse performance gets on a typical user’s computer versus your LeanDojo evaluation machine with eight GPUs. But it is also true that the real test is what @Kevin Buzzard or other Lean users think.

Jason Rute (Dec 13 2023 at 19:40):

As for the ML side, we need better tools for AI researchers (like me) to experiment with new ideas without having to rebuild everything from scratch. Since you can plug in your own models here, this might be one such tool. (And in another thread you suggested that.) If so, it would be good to have baselines and ways to run those baselines. [Edit: One advantage of using this system as a research tool is that research experiments can be automatically distributed as user-facing tools right away.]

Jason Rute (Dec 13 2023 at 19:43):

But I also think I’ve been too negative. Great job, and I hope the Lean community finds this to be a valuable tool which is easy to use and powerful enough to be useful. I don’t know if there are any such tools yet in ITP. (Maybe Sledgehammer, CoqHammer, and Tactician come closest, and I’d love an informal comparison with those from users familiar with them.)

Jason Rute (Dec 13 2023 at 19:56):

Jason Rute said:

But it is also true that the real test is what Kevin Buzzard or other Lean users think.

I playfully (!) pick on Kevin here since I think he is the most vocal critic of the current research, but I know @Heather Macbeth, @Johan Commelin, @Scott Morrison, @Jeremy Avigad, and others have also expressed that they don't have a chance to play with the published AI work. So I hope this is not only a chance for them to play with it, but also to tell us AI researchers what these systems still need to go the extra mile (or 10, or 100), both in terms of usability and capabilities.

Filippo A. E. Nuccio (Dec 13 2023 at 19:59):

When calling lake update LeanCopilot I get the error

error: no error (error code : 0)

which causes the next command, lake exe LeanCopilot/download, to fail.
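
For reference, my setup looks roughly like this (a sketch; the revision tag below is a placeholder, not necessarily what the docs pin):

```lean
-- lakefile.lean of the downstream project
import Lake
open Lake DSL

package «my_project»

-- hypothetical revision; use whatever the LeanCopilot README recommends
require LeanCopilot from git
  "https://github.com/lean-dojo/LeanCopilot.git" @ "main"
```

followed by lake update LeanCopilot and then lake exe LeanCopilot/download to fetch the model.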

Filippo A. E. Nuccio (Dec 13 2023 at 20:00):

Oh, I see; I must be in WSL and not in Windows itself.

Notification Bot (Dec 13 2023 at 20:01):

2 messages were moved here from #announce > LeanCopilot by Rob Lewis.

Richard Copley (Dec 13 2023 at 20:04):

That message does get printed in Windows. It seems to happen when Lake and/or Lean is already running (as a language server in VS Code, for example).

Richard Copley (Dec 13 2023 at 20:07):

I reported that here. The conversation tailed off.

Kaiyu Yang (Dec 13 2023 at 20:08):

Filippo A. E. Nuccio said:

Oh, I see; I must be in WSL and not in Windows itself.

Yep, Windows does not work yet. See https://github.com/lean-dojo/LeanCopilot/issues/31

Filippo A. E. Nuccio (Dec 13 2023 at 20:13):

I see, no problem. I have WSL installed (with an oldish version of lake and little time to update now), but I will be happy to test on Windows whenever it becomes available.

Kevin Buzzard (Dec 13 2023 at 20:56):

I'll rise to the bait here. Come April (when teaching is out of the way) I hope to be spending a lot of time proving Fermat's Last Theorem. What can this system offer me? I tend to just ignore all the messages in this stream because my instinct is that right now they can offer me nothing. But I'd be very happy to be proved wrong. In practice I'm going to be translating technical mathematics from research papers into Lean.

Jason Rute (Dec 13 2023 at 21:20):

@Kevin Buzzard see the 2 min video at the top of this thread. Previous works like lean-gptf had tactic suggestions, but not full proof search or premise selection for end users. (I’m not involved in the work, but) I’m curious if you think the tactic predictions and/or the proofs it finds are any good. I’m also curious whether you think tools like this would be useful enough if, say, they had better user interfaces (automatic tactic/premise suggestions, the ability to search for proofs in the background, etc.). In short, I’m curious where the state of this field is from the point of view of end users.

Jason Rute (Dec 13 2023 at 21:21):

@Kaiyu Yang Does premise selection include new premises, or just those seen during training?

Kevin Buzzard (Dec 13 2023 at 21:28):

I guess I watch these videos and feel like they are so far away from my actual use case, but really I should just try the software.

Jason Rute (Dec 13 2023 at 21:31):

And say what your use case is… maybe in its own thread, or even a blog post, after trying these tools.

Bolton Bailey (Dec 13 2023 at 21:32):

I am mildly disappointed that it doesn't seem to work like GitHub Copilot. Is there a way to just let my cursor sit where I need to complete my proof and have it automagically finish it for me if it can?

Bolton Bailey (Dec 13 2023 at 21:37):

Another minor complaint: the installation seems to indicate that I need to import the project as a Lean dependency. This makes it hard to use for contributing to Mathlib, since I generally just work in the Mathlib repository itself, and I would need to modify Mathlib's lakefile to import this but then not commit those modifications.

Bolton Bailey (Dec 13 2023 at 21:39):

(I guess I'll just make some feature suggestions)

Jason Rute (Dec 13 2023 at 21:41):

OpenAI has spoiled everyone. Haha. But I imagine part of this is just UI, and the Lean community could help build better UI for this sort of thing around the existing tech. Avi Shinnar had some good takes on what a good UI would look like at the recent National Academies meeting (at 38:50 in).

Jason Rute (Dec 13 2023 at 21:42):

The Mathlib comment was the number one reason no one ever used Lean GPT-f, so it is probably a big deal.

Jason Rute (Dec 13 2023 at 21:46):

Actually, maybe that wasn't quite the same, since lean-gptf required mathlib and so couldn't be used for mathlib development, whereas I guess you are just saying it is annoying to use for mathlib development.

Mario Carneiro (Dec 13 2023 at 21:56):

@Kaiyu Yang I remarked on this before, but I think it's a bit more important now that it's a tactic and not just a Python API for Lean: your tactic select_premises is misnamed; these are not 'premises', they are theorems and lemmas in Lean terminology. (I'm aware that this terminology is used especially in the ATP field, but it makes a bit more sense there, given that all premises are taken as axioms or hypotheses when proving the conjecture in question. In the context of Lean/mathlib, these are all proved theorems.)

Jason Rute (Dec 13 2023 at 22:24):

“Lemma selection” seems to be the second most common term for this in the literature, so maybe select_lemmas? But then technically you could be selecting a non-Prop definition as well…

Junyan Xu (Dec 14 2023 at 01:42):

The installation seems to indicate that I need to import the project as a Lean dependency. This makes it hard to use for contributing to Mathlib, since I generally just work in the Mathlib repository itself, and I would need to modify Mathlib's lakefile to import this but then not commit those modifications.

I guess we wouldn't need to worry about that if LeanCopilot could become an official dependency of mathlib master ... I hope it doesn't mean every mathlib user has to download GBs of model weights; apart from the weights, it's probably not too heavy a dependency.

In the meantime, maybe we could consider making LeanCopilot run in another copy of the mathlib repo (or of whatever project you're working on)? It's probably not very helpful to retrieve the new lemmas you added in the branch you want to PR to mathlib, since you know them well and know when to apply them; it's much more useful to retrieve unfamiliar lemmas from the wider existing mathlib, and periodically merging master into the copy should be good enough. I don't think much of LeanCopilot's functionality depends on running within the same project, but correct me if I'm wrong.

I think most existing LLM code assistants work by looking at the plain text of the current file and other relevant files in your repo, and maybe also by communicating with the language server at times. Most are not written in the target language; in fact, many support multiple languages. The Lean-native approach of LeanCopilot could have speed advantages, which may not matter much to a human user but may prove crucial if we go on to train automated agents, where faster interaction with the environment could lead to a faster feedback loop, faster iteration and improvement, and big savings in resources.

Junyan Xu (Dec 14 2023 at 02:09):

making LeanCopilot run in another copy of the mathlib repo

To elaborate, I think a realistic proposal is to split LeanCopilot into a lightweight user-facing client side and a server side. The client just needs to implement suggest_tactics, search_proof, and select_premises (maybe in some generalized, extensible form that could support other assistants in the future) and know the server's address to send requests to and receive suggestions from. It should be relatively easy to get this merged into mathlib master. The server side will host the inference framework, retriever models, indices (are they there yet?), etc. It also needs a copy of mathlib to retrieve from, but that doesn't have to be the client's copy.
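
To make the shape of the protocol concrete, something like the following (all names are hypothetical; just a sketch of the data the two sides would exchange):

```lean
/-- What the lightweight client would send: a serialized goal plus
    how many suggestions it wants back. -/
structure SuggestionRequest where
  goal        : String  -- pretty-printed tactic state
  numTactics  : Nat     -- tactic candidates requested
  numPremises : Nat     -- premises to retrieve
  deriving Repr

/-- What the server (hosting the model and the retrieval index) returns. -/
structure SuggestionResponse where
  tactics  : List String  -- candidate tactics, e.g. "rw [Nat.add_comm]"
  premises : List String  -- fully qualified names of retrieved lemmas
  deriving Repr
```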

Jason Rute (Dec 14 2023 at 02:16):

But doesn't that ruin the generalizability of the bring-your-own-model part of the project? Or is it still possible? Do you now have to split the custom model into a client and server portion, where the client portion reads the current state and environment, and sends stuff back and forth to the server?

Jason Rute (Dec 14 2023 at 02:18):

And I don't understand the "also needs a copy of mathlib to retrieve from". I'm still not sure whether retrieval is intended to be project-independent. Can Kevin retrieve a lemma from his FLT project, or does it only work for lemmas which existed in mathlib when the model was trained (in which case maybe it is possible to have a separate server with its own copy of mathlib, since that is all that matters)?

Jason Rute (Dec 14 2023 at 02:30):

(Actually, if premise selection is only based on stuff seen during training, then the server doesn't need any access to Mathlib since it was already seen during training.)

llllvvuu (Dec 14 2023 at 02:51):

My experience with retrieval as a user (mostly from when I tried Cursor) is that you typically need to click a button to refresh the vector index for whatever repository you have open. The speed is actually pretty fast even on a large repository. I could see there being a "LeanCopilot Server" VS Code extension that manages this stuff instead of Lean/Mathlib.

Junyan Xu (Dec 14 2023 at 02:55):

A decoupled server-client design should make it easier to customize the server side (including the model), no? There would need to be some protocol that the client and the server both follow, but that's all.

The LeanDojo docs talk about tracing the mathlib repo (or other project repos) and caching the results; they host a number of traced repos on AWS. I'm not sure what the purpose is, though; maybe it's just for the extraction of training data, because I don't see anything about embedding and storing in a vector database for retrieval.

Jason Rute (Dec 14 2023 at 02:55):

Sorry, I just looked at the types of custom models it supports, and they are all text models, so I think I was thinking of a different project (namely aesop) that can work with non-text-based models that directly look at the environment.

Jason Rute (Dec 14 2023 at 02:55):

I know this project depends on aesop, so I guess the main parts are aesop, some llm-step code for the interface, and the ReProver model in the backend. And I guess the suggestion is to move the ReProver model to a separate server.

Jason Rute (Dec 14 2023 at 02:57):

I know of another (unpublished, but soon-to-be-released) project that works like that. The ITP client sends a whole bunch of goal and environment data to the server to process. The server keeps a list of vector embeddings for each definition and also handles predicting new tactics. The client handles the tree search (which I think it would in this case too, since aesop handles the search, but it could also go the other way, where the server does the search, as in Lean gym).

Junyan Xu (Dec 14 2023 at 02:58):

The aesop integration is in this PR. I think it may well serve as part of the client side.

Scott Morrison (Dec 14 2023 at 02:59):

I'm surprised there is still interface code from llm-step. We upstreamed the Try these widget to Std already, but perhaps LeanCopilot is not using that.

Junyan Xu (Dec 14 2023 at 03:06):

The word "embedding" appears twice in the LeanDojo paper, and they do say it can be precomputed. I agree it's not very useful to keep the mathlib repo around once you have extracted the cache including embeddings.


Jason Rute (Dec 14 2023 at 03:09):

Even if they are not precomputed, you can still use the client to send the environment data to the server to compute lemma embeddings or to align them with pre-computed embeddings. (At least that is what the other project I know about does.)

Chris Birkbeck (Dec 14 2023 at 11:43):

Kaiyu Yang said:

Chris Birkbeck said:

This is probably a stupid question, but how far are we from being able to use something like this to "clean up" some code and get it PR-ready or close to PR-ready? I have a bunch of stuff in a repo I need to PR, and while I don't expect something like this to be able to prove the results it contains, it would be nice if it could learn from the proofs that are in the repo and then just tidy it up (i.e. I say "this is the main result I want you to prove", then it goes, finds the proof in the repo, breaks it up into smaller results, removes unused results, fixes spacing/layout, etc.). Perhaps I'm just way behind and this is already possible, or maybe this is sci-fi and it's actually hard to do and we are not there yet.

We think it would be interesting to have something that can learn from your repo (as long as it's public on GitHub and can be built via lake build), automatically fill in sorries, and submit pull requests to your repo. I don't see a fundamental difficulty here. It's just that we have limited personnel and have to prioritize.

BTW, I was wondering if "PR" here means pull requests or public relations :rolling_on_the_floor_laughing:.

Sorry, yes, PR meant pull request, to mathlib in this case. It's encouraging that such a thing could maybe be done soon. One of the things that worries me is that there are lots of bits of code out there in repos that people (including me) haven't had the time to add to mathlib, and they slowly get lost. Something like this, even if it's not perfect, would really help.

Kevin Buzzard (Dec 14 2023 at 19:10):

OK so if I want to try this out, is it still the case that I can't use it in a project which depends on mathlib master?

Kaiyu Yang (Dec 14 2023 at 19:59):

Scott Morrison said:

I'm surprised there is still interface code from llm-step. We upstreamed the Try these widget to Std already, but perhaps LeanCopilot is not using that.

I'm not aware of it. Can you give me a pointer? Thx!

Kaiyu Yang (Dec 14 2023 at 20:00):

Kevin Buzzard said:

OK so if I want to try this out, is it still the case that I can't use it in a project which depends on mathlib master?

Lean Copilot only depends on aesop, so you can definitely use it in mathlib or in other repos that depend on mathlib. Here is an example of a repo depending on both mathlib and Lean Copilot.

Scott Morrison (Dec 14 2023 at 23:08):

Kaiyu Yang said:

Scott Morrison said:

I'm surprised there is still interface code from llm-step. We upstreamed the Try these widget to Std already, but perhaps LeanCopilot is not using that.

I'm not aware of it. Can you give me a pointer? Thx!

(Sorry, my fault for not updating you on this!)

See https://github.com/leanprover/std4/blob/main/Std/Tactic/TryThis.lean#L386 for the main entry point, addSuggestions. This supports multiple suggestions and has customisation hooks for indicating tactics that do or don't succeed.

Also useful is the suggestion function provided by Mathlib's hint tactic, which takes a piece of syntax representing a tactic invocation and takes care of running it against the goal, checking whether it succeeded, and preparing the Suggestion with appropriate formatting to pass to the TryThis widget.

(To be clear, this is all inspired by llm-step's implementation, just with a bit more engineering work. :-)

It would be great if this could all be reused.

(The components that are still in Mathlib have open PRs moving them to Std; if it's helpful let me know and I can hurry those along. :-)
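
For a flavor of the API, here is a rough sketch (treat the exact names and signatures as from memory; the real entry points are in the file linked above):

```lean
import Std.Tactic.TryThis
open Lean Elab Tactic Std.Tactic.TryThis

/-- A toy tactic that offers one fixed suggestion through the widget;
    a model-backed tactic would generate the syntax instead. -/
elab "demo_suggest" : tactic => do
  let tac ← `(tactic| trivial)
  addSuggestion (← getRef) tac

example : True := by
  demo_suggest  -- the infoview shows "Try this: trivial"
  trivial
```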

Kaiyu Yang (Dec 16 2023 at 17:55):

Thanks @Scott Morrison! It's great to see these common frontend utilities moving upstream. We'll reuse them. In addition, do you have a similar function for displaying retrieved premises? For each premise, we have

  • fully qualified name
  • which module (file) it comes from
  • the code for defining it (optional)

Currently, we simply use the fully qualified name to find its type and doc-string (if available) and use logInfo to print it to the infoview panel. However, it would be great if it could be more interactive, e.g., displaying additional information when the user hovers the mouse over it (similar to hovering over the goals in the infoview panel), or allowing the user to "go to definition". Would that also be useful for the hammer the Lean FRO is developing?
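
For concreteness, the current behavior is roughly the following (a simplified sketch; the real code also fetches the doc-string when present):

```lean
import Lean
open Lean Meta

/-- Given a retrieved premise's fully qualified name, look it up in the
    environment and print its type to the infoview via `logInfo`. -/
def showPremise (name : Name) : MetaM Unit := do
  match (← getEnv).find? name with
  | some info => logInfo m!"{name} : {info.type}"
  | none      => logInfo m!"premise {name} not found"
```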

Kaiyu Yang (Dec 16 2023 at 18:00):

Jason Rute said:

But then technically you could be selecting a non-Prop definition as well…

@Mario Carneiro That was the main reason we used the term "premise" instead of "lemma", though informally (e.g., when explaining it to people coming to our poster), I actually use "lemma".

Mario Carneiro (Dec 16 2023 at 18:01):

I'm not sure that makes it better; people don't call definitions premises either.

Kaiyu Yang (Dec 16 2023 at 18:01):

I guess "constant" is technically the right term for Lean? But people would be confused.

Scott Morrison (Dec 17 2023 at 00:30):

How about we just say "lemma suggestions"?

Scott Morrison (Dec 17 2023 at 00:32):

We do not have a function for displaying "lemma suggestions", but creating one sounds like a great idea. As @Thomas Murrills did such a great job with the Try these widget, I wonder if they might be interested? :-)

Thomas Murrills (Dec 17 2023 at 01:45):

Thanks for thinking of me! You caught me at a good time; I’ve got a few days free! :D I’ll be happy to give it a shot! First I want to think a bit more deeply about the actual use cases of such a widget, and try to understand what information will (or might be) relevant to the user in these cases. I’ll probably ask some questions here as I think through it! :)

(I also want to consider whether it would make sense to integrate this with Try these or have a standalone widget. The answer might be readily apparent once I think about it…)

Min-Hsien Weng (Dec 17 2023 at 05:31):

@Kaiyu Yang

I recently came across the mistral-7b model and was surprised by its capability. I ran some initial experiments, including using it to detect whether texts were generated by an AI model. I found fine-tuning efficient (fully trained on a TPU in just 2 hours with LoRA layers) and the performance impressive (0.85 ROC accuracy).

Has anyone explored using mistral-7b on LeanDojo? That would be a very interesting experiment. :grinning_face_with_smiling_eyes:

Eric Rodriguez (Dec 17 2023 at 15:54):

I'm currently trying to install this, and I'm getting the error error: unknown executable «LeanCopilot/download»; has anyone else come across this?

Kaiyu Yang (Dec 17 2023 at 16:30):

Eric Rodriguez said:

I'm currently trying to install this, and I'm getting the error error: unknown executable «LeanCopilot/download»; has anyone else come across this?

Maybe open an issue on GitHub?

Kaiyu Yang (Dec 19 2023 at 03:32):

Min-Hsien Weng said:

Kaiyu Yang

I recently came across the mistral-7b model and was surprised by its capability. I ran some initial experiments, including using it to detect whether texts were generated by an AI model. I found fine-tuning efficient (fully trained on a TPU in just 2 hours with LoRA layers) and the performance impressive (0.85 ROC accuracy).

Has anyone explored using mistral-7b on LeanDojo? That would be a very interesting experiment. :grinning_face_with_smiling_eyes:

This is something I have always wanted to do but haven't had the chance to do yet :). I probably won't do it in the very near future, but I am interested to see others do it.

Kaiyu Yang (Dec 19 2023 at 03:35):

Thomas Murrills said:

Thanks for thinking of me! You caught me at a good time; I’ve got a few days free! :D I’ll be happy to give it a shot! First I want to think a bit more deeply about the actual use cases of such a widget, and try to understand what information will (or might be) relevant to the user in these cases. I’ll probably ask some questions here as I think through it! :)

(I also want to consider whether it would make sense to integrate this with Try these or have a standalone widget. The answer might be readily apparent once I think about it…)

Thank you! I'm happy to discuss more details when you set out to implement it!


Last updated: Dec 20 2023 at 11:08 UTC