Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: LeanCopilot


Kaiyu Yang (Dec 11 2023 at 07:07):

We introduce Lean Copilot for language models to suggest tactics, search for proofs, and select premises in Lean. Here is a short demo:

Lean-Copilot-v1.mp4

Its initial version was known as LeanInfer, but now we have significantly expanded its scope:

  • Faster tactic generation on both CPUs and GPUs (powered by CTranslate2)
  • Higher tactic generation quality, using beam search instead of multinomial sampling.
  • Integrating LLM-generated tactics with aesop for proof search.
  • Retrieval for premise selection
  • Supporting user-provided models.
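
For concreteness, here is a minimal usage sketch (assuming a project that already has Lean Copilot installed and its models downloaded; the example goal is just a placeholder, and the sorry lines mark where you would paste a suggestion):

import LeanCopilot

example (a b c : Nat) : a + b + c = a + c + b := by
  -- Ask the model for next-step tactic suggestions ("Try these:" in the infoview).
  suggest_tactics
  sorry

example (a b c : Nat) : a + b + c = a + c + b := by
  -- Combine LLM-generated tactics with aesop to search for a complete proof.
  search_proof

example (a b c : Nat) : a + b + c = a + c + b := by
  -- Print premises the retriever considers relevant to the current goal.
  select_premises
  sorry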

Chris Birkbeck (Dec 11 2023 at 11:02):

This is probably a stupid question, but how far are we from being able to use something like this to "clean up" some code and get it PR ready or close to PR ready? I have a bunch of stuff in a repo I need to PR and while I don't expect something like this to be able to prove the results it contains, it would be nice if it could learn from the proofs that are in the repo and then just tidy it up (i.e. I say this is the main result I want you to prove, then go, find the proof in the repo, break it up into smaller results, remove unused results, fix spacing/layout, etc.). Perhaps I'm just way behind and this is already possible, or maybe this is sci-fi and it's actually hard to do and we are not there yet.

Chris Birkbeck (Dec 11 2023 at 11:04):

(also sorry if this is off topic for this thread, it was just that the premise selection and proof search bits got me thinking about this)

Jason Rute (Dec 11 2023 at 12:09):

One danger of putting the discussion here is that if this is really meant as a user-facing tool with good usability (easy to install, no GPU needed, etc.), then you may be missing much of your audience. If so, it might be worthy of a post in #announce with a link back here.

Jason Rute (Dec 11 2023 at 12:20):

I see you have a short paper on this at Math-AI 2023 (https://mathai2023.github.io/papers/4.pdf). I’m curious if this system could be used for AI research, especially with the build-your-own-model capabilities. In particular, one would need more baselines. You only test on 50 theorems and they are from Mathematics in Lean. Why? This is just another incomparable baseline to all the others out there in the literature. I’m also curious about your test. Was it a full proof search? Was it a deterministic timeout? Either way, how long did it typically take to run per theorem? Did you use a GPU? And if so, how long would it take per theorem on a CPU?

Jason Rute (Dec 11 2023 at 12:23):

I’m also curious how this compares to LLMStep by @Sean Welleck both in terms of capabilities and in terms of ease of use. I guess for both, it would be best if users play with it, but I’m not sure if either is designed for end users yet.

Jason Rute (Dec 11 2023 at 12:24):

And how it compares to published works like your own ReProver. Is it significantly nerfed?

Jason Rute (Dec 11 2023 at 12:45):

Lean 3 used to have the ability (maybe still does) to just do the tactic suggestions automatically in the infoview (using the HTPS model via an API). That would be a much better user interface than typing a tactic every time. That is, if it doesn’t churn through your CPU too much. @Gabriel Ebner added it to Lean 3.

Jason Rute (Dec 11 2023 at 13:06):

Although the Lean 3 infoview suggestions don’t run or rank the tactics which I think this does. That is even more important than putting them in the infoview. (I gave a demo of the Lean 3 tactic suggestions at the IPAM workshop once and the audience was really skeptical since all the first suggested tactics failed.)

Kaiyu Yang (Dec 13 2023 at 19:13):

Chris Birkbeck said:

This is probably a stupid question, but how far are we from being able to use something like this to "clean up" some code and get it PR ready or close to PR ready? I have a bunch of stuff in a repo I need to PR and while I don't expect something like this to be able to prove the results it contains, it would be nice if it could learn from the proofs that are in the repo and then just tidy it up (i.e. I say this is the main result I want you to prove, then go, find the proof in the repo, break it up into smaller results, remove unused results, fix spacing/layout, etc.). Perhaps I'm just way behind and this is already possible, or maybe this is sci-fi and it's actually hard to do and we are not there yet.

We think it would be interesting to have something that can learn from your repo (as long as it's public on GitHub and can be built via lake build), automatically fill in sorry and submit pull requests to your repo. I don't see a fundamental difficulty here. It's just that we have limited personnel and have to prioritize.

BTW, I was wondering if "PR" here means pull requests or public relations :rolling_on_the_floor_laughing:.

Kaiyu Yang (Dec 13 2023 at 19:14):

Jason Rute said:

worthy of a post in #announce with a link back here.

Thanks for the suggestion! I'll do that once I get a chance.

short paper on this at Math-AI 2023

That paper describes an earlier version of Lean Copilot (around September) and hasn't been updated. We plan to release a full paper in early 2024. We haven't decided whether to submit it to an ML conference or to a conference like ITP. Personally I lean towards ITP, since we want the main contribution to be a user-facing tool.

Kaiyu Yang (Dec 13 2023 at 19:21):

You only test on 50 theorems and they are from Mathematics in Lean. Why? This is just another incomparable baseline to all the others out there in the literature. I’m also curious about your test. Was it a full proof search? Was it a deterministic timeout? Either way, how long did it typically take to run per theorem? Did you use a GPU? And if so, how long would it take per theorem on a CPU?

And how it compares to published works like your own ReProver. Is it significantly nerfed?

Just to clarify, Lean Copilot does not introduce new models or algorithms for machine learning to prove theorems. You can think of it as a "frontend" that interfaces with existing methods. Under the hood, it uses the same ReProver (w/o retrieval) model as in LeanDojo (and you can bring your own models). Therefore, I don't think it's necessary to evaluate Lean Copilot empirically on benchmarks. The results should just be similar to what we described in the LeanDojo paper, aside from some minor differences, e.g., the different implementations of beam search in CTranslate2 and Hugging Face.

Jason Rute (Dec 13 2023 at 19:40):

From the user side, I’m wondering how much worse performance gets for a typical user computer versus your LeanDojo evaluation machine with eight GPUs. But it is also true the real test is what @Kevin Buzzard or other Lean users think.

Jason Rute (Dec 13 2023 at 19:40):

As for the ML side, we need better tools for AI researchers (like me) to experiment with new ideas without having to rebuild everything from scratch. Since you can plug in your own models here, this might be one such tool. (And in another thread you suggested that.) If so, it would be good to have baselines and ways to run those baselines. [Edit: One advantage of using this system as a research tool is that research experiments can be automatically distributed as user-facing tools right away.]

Jason Rute (Dec 13 2023 at 19:43):

But I also think I’ve been too negative. Great job, and I hope the Lean community finds this to be a valuable tool which is easy to use and powerful enough to be useful. I don’t know if there are any such tools yet in ITP. (Maybe SledgeHammer, CoqHammer, and Tactician come closest and I’d love an informal comparison with those from users familiar with the other tools.)

Jason Rute (Dec 13 2023 at 19:56):

Jason Rute said:

But it is also true the real test is what Kevin Buzzard or other Lean users think.

I playfully (!) pick on Kevin here since I think he is the most vocal critic of the current research, but I know @Heather Macbeth, @Johan Commelin, @Scott Morrison, @Jeremy Avigad and others have also expressed that they don't have a chance to play with the published AI work, so I hope this is not only a chance for them to play with it, but to tell us AI researchers what these systems still need to go the extra mile (or 10 or 100) both in terms of usability and capabilities.

Filippo A. E. Nuccio (Dec 13 2023 at 19:59):

When calling lake update LeanCopilot I get the error

error: no error (error code : 0)

that has the effect that the next command lake exe LeanCopilot/download fails.

Filippo A. E. Nuccio (Dec 13 2023 at 20:00):

Oh, I see; I must be in Win WSL and not in Win itself.

Notification Bot (Dec 13 2023 at 20:01):

2 messages were moved here from #announce > LeanCopilot by Rob Lewis.

Richard Copley (Dec 13 2023 at 20:04):

That message does get printed in Windows. It seems to happen when Lake and/or Lean is already running (as a language server in VS Code, for example).

Richard Copley (Dec 13 2023 at 20:07):

I reported that here. The conversation tailed off.

Kaiyu Yang (Dec 13 2023 at 20:08):

Filippo A. E. Nuccio said:

Oh, I see; I must be in Win WSL and not in Win itself.

Yep, Windows does not work yet. See https://github.com/lean-dojo/LeanCopilot/issues/31

Filippo A. E. Nuccio (Dec 13 2023 at 20:13):

I see, no problem. I have WSL installed (with an oldish version of lake and little time to update now), but I will be happy to test on Win whenever it becomes available.

Kevin Buzzard (Dec 13 2023 at 20:56):

I'll rise to the bait here. Come April (when teaching is out of the way) I hope to be spending a lot of time proving Fermat's Last Theorem. What can this system offer me? I tend to just ignore all the messages in this stream because my instinct is that right now they can offer me nothing. But I'd be very happy to be proved wrong. In practice I'm going to be translating technical mathematics from research papers into Lean.

Jason Rute (Dec 13 2023 at 21:20):

@Kevin Buzzard see the 2 min video at the top of this thread. Previous works like lean-gptf had tactic suggestions, but not full proof search or premise selection for end users. (I’m not involved in the work, but) I’m curious if you think the tactic predictions and/or proofs it finds are any good. I’m also curious whether you think tools like this would be useful enough if, say, they had better user interfaces (automatic tactic/premise suggestions, the ability to search for proofs in the background, etc.). In short, I’m curious where the state of this field is at from the point of view of end users.

Jason Rute (Dec 13 2023 at 21:21):

@Kaiyu Yang Does premise selection include new premises, or just those seen during training?

Kevin Buzzard (Dec 13 2023 at 21:28):

I guess I watch these videos and feel like they are so far away from my actual use case, but really what I should do is just try to use the software, I guess.

Jason Rute (Dec 13 2023 at 21:31):

And say what your use case is… maybe in its own thread, or even a blog post, after trying these tools.

Bolton Bailey (Dec 13 2023 at 21:32):

I am mildly disappointed that it doesn't seem to work like GitHub Copilot. Is there a way to just let my cursor sit where I need to complete my proof and have it automagically finish it for me if it can?

Bolton Bailey (Dec 13 2023 at 21:37):

Another minor complaint: The installation seems to indicate that I need to import the project as a Lean dependency. This makes it hard to contribute to Mathlib with this, since I generally just work in the Mathlib repository itself, and I would need to modify Mathlib's lakefile to import this, but then not commit those modifications.

Bolton Bailey (Dec 13 2023 at 21:39):

(I guess I'll just make some feature suggestions)

Jason Rute (Dec 13 2023 at 21:41):

OpenAI has spoiled everyone. Haha. But I imagine part of this is just UI and the Lean community could help build better UI for this sort of thing around the existing tech. Avi Shinnar had some good takes on what a good UI would look like at the recent National Academies meeting (at 38:50 mins in).

Jason Rute (Dec 13 2023 at 21:42):

The Mathlib comment was the number one reason no one ever used Lean GPT-f, so it is probably a big deal.

Jason Rute (Dec 13 2023 at 21:46):

Actually, maybe that wasn't quite the same, since lean-gptf required mathlib so it couldn't be used for mathlib development, whereas I guess you are just saying it is annoying to use for mathlib development.

Mario Carneiro (Dec 13 2023 at 21:56):

@Kaiyu Yang I remarked on this before, but I think it's a bit more important now that it's a tactic and not just a python API for lean: your tactic select_premises is misnamed, these are not 'premises', they are theorems and lemmas in lean terminology. (I'm aware that this terminology is used esp. in the ATP field, but it makes a bit more sense there given that all premises are taken as axioms or hypotheses when proving the conjecture in question. In the context of lean/mathlib, these are all proved theorems.)

Jason Rute (Dec 13 2023 at 22:24):

“Lemma selection” seems to be the second most common term for this in the literature, so maybe select_lemmas? But then technically you could be selecting a non-Prop definition as well…

Junyan Xu (Dec 14 2023 at 01:42):

The installation seems to indicate that I need to import the project as a Lean dependency. This makes it hard to contribute to Mathlib with this, since I generally just work in the Mathlib repository itself, and I would need to modify Mathlib's lakefile to import this, but then not commit those modifications.

I guess we wouldn't need to worry about that if LeanCopilot could become an official dependency of mathlib master ... I hope it doesn't mean every mathlib user has to download GBs of model weights; other than that, it's probably not too heavy a dependency.

In the meantime, maybe we could consider making LeanCopilot run in another copy of the mathlib repo (or of whatever project you're working on)? It's probably not very helpful to retrieve the new lemmas that you added in your branch that you want to PR to mathlib, as you know them well and know when to apply them; it's much more useful to retrieve unfamiliar lemmas in the wider existing mathlib; periodically merging master in the copy should be good enough. I don't think much of LeanCopilot's functionalities depend on running within the same project, but correct me if I'm wrong.

I think most existing LLM code assistants work by looking at the plain text of the current file and other relevant files in your repo, and maybe also communicating with the language server at times. Most are not written in the target language; in fact many support multiple languages. The Lean-native approach of LeanCopilot could have speed advantages, which may not be important to a human user, but may prove crucial if we go ahead to train automated agents, where faster interaction with the environment could lead to faster feedback loop, iteration and improvement, and big savings in resources.

Junyan Xu (Dec 14 2023 at 02:09):

making LeanCopilot run in another copy of the mathlib repo

To elaborate, I think a realistic proposal is to split LeanCopilot into a lightweight user-facing client side and a server side. The client just needs to implement the suggest_tactics, search_proof, and select_premises tactics (maybe in some generalized, extensible form that could support other assistants in the future), and know the server's address to send requests to and receive suggestions from. It should be relatively easy to get this merged into mathlib master. The server side will host the inference framework, retriever models, indices (are they there yet?), etc. It also needs a copy of mathlib to retrieve from, but that doesn't have to be the client's copy.

Jason Rute (Dec 14 2023 at 02:16):

But doesn't that ruin the generalizability of the bring-your-own-model part of the project? Or is it still possible? Do you now have to split the custom model into a client and server portion, where the client portion reads the current state and environment, and sends stuff back and forth to the server?

Jason Rute (Dec 14 2023 at 02:18):

And I don't understand the "also needs a copy of mathlib to retrieve from". I'm still not sure if retrieval is intended to be project independent. Can Kevin retrieve a lemma from his FLT project, or does it only work for lemmas which existed in mathlib when the model was trained (in which case maybe it is possible to have a separate server with its own copy of mathlib since that is all that matters).

Jason Rute (Dec 14 2023 at 02:30):

(Actually, if premise selection is only based on stuff seen during training, then the server doesn't need any access to Mathlib since it was already seen during training.)

llllvvuu (Dec 14 2023 at 02:51):

My experience with retrieval as a user (mostly when I tried Cursor) is that you typically need to click a button to refresh the vector index for whatever repository you have open. The speed is actually pretty fast even on a large repository. I could see there being a "LeanCopilot Server" VSCode extension that manages this stuff instead of Lean/Mathlib.

Junyan Xu (Dec 14 2023 at 02:55):

Decoupled server-client design should make it easier to customize the server side (including the model), no? There should be some protocol that the client and the server both follow, but that's all.

The LeanDojo docs talk about tracing the mathlib (or another project's) repo and caching the results; they host a number of traced repos on AWS. I'm not sure what the purpose is though; maybe it's just for extraction of training data, because I don't see anything about embedding and storing in a vector database for retrieval.

Jason Rute (Dec 14 2023 at 02:55):

Sorry, I looked now at the types of custom models it supports and they are all text models, so I think I was thinking of a different project (namely aesop) that could work with non-text-based models that can directly look at the environment.

Jason Rute (Dec 14 2023 at 02:55):

I know this project depends on aesop, so I guess the main parts are aesop, some llm-step stuff for the interface, and the reprover model in the backend. And I guess the suggestion is to move the reprover model to a separate server.

Jason Rute (Dec 14 2023 at 02:57):

I know of another (unpublished, but soon to be released) project that works like that. The ITP client sends a whole bunch of goal and environment data to the server to process. The server keeps a list of vector embeddings for each definition and also handles predicting new tactics. The client handles the tree search (which I think it would in this case too since aesop handles the search, but it could go the other way as well where the server does the search as in Lean gym).

Junyan Xu (Dec 14 2023 at 02:58):

The aesop integration is this PR. I think it may well serve as part of the client side.

Scott Morrison (Dec 14 2023 at 02:59):

I'm surprised there is still interface code from llm-step. We upstreamed the Try these widget to Std already, but perhaps LeanCopilot is not using that.

Junyan Xu (Dec 14 2023 at 03:06):

The word "embedding" appears twice in the LeanDojo paper, and they do say it can be precomputed. I agree it's not very useful to keep the mathlib repo around once you have extracted the cache including embeddings.

image.png

Jason Rute (Dec 14 2023 at 03:09):

Even if it is not precomputed, you can just still use the client to send the environment data to the server to compute lemma embeddings or to align with pre-computed embeddings. (At least that is what the other project I know about does.)

Chris Birkbeck (Dec 14 2023 at 11:43):

Kaiyu Yang said:

Chris Birkbeck said:

This is probably a stupid question, but how far are we from being able to use something like this to "clean up" some code and get it PR ready or close to PR ready? I have a bunch of stuff in a repo I need to PR and while I don't expect something like this to be able to prove the results it contains, it would be nice if it could learn from the proofs that are in the repo and then just tidy it up (i.e. I say this is the main result I want you to prove, then go, find the proof in the repo, break it up into smaller results, remove unused results, fix spacing/layout, etc.). Perhaps I'm just way behind and this is already possible, or maybe this is sci-fi and it's actually hard to do and we are not there yet.

We think it would be interesting to have something that can learn from your repo (as long as it's public on GitHub and can be built via lake build), automatically fill in sorry and submit pull requests to your repo. I don't see a fundamental difficulty here. It's just that we have limited personnel and have to prioritize.

BTW, I was wondering if "PR" here means pull requests or public relations :rolling_on_the_floor_laughing:.

Sorry, yes, PR meant pull request, to mathlib in this case. It's encouraging that such a thing could maybe be done soon. One of the things that worries me is that there are lots of bits of code out there in repos that people (including me) haven't had the time to add to mathlib, and it slowly gets lost. Something like this, even if it's not perfect, would really help.

Kevin Buzzard (Dec 14 2023 at 19:10):

OK so if I want to try this out, is it still the case that I can't use it in a project which depends on mathlib master?

Kaiyu Yang (Dec 14 2023 at 19:59):

Scott Morrison said:

I'm surprised there is still interface code from llm-step. We upstreamed the Try these widget to Std already, but perhaps LeanCopilot is not using that.

I'm not aware of it. Can you give me a pointer? Thx!

Kaiyu Yang (Dec 14 2023 at 20:00):

Kevin Buzzard said:

OK so if I want to try this out, is it still the case that I can't use it in a project which depends on mathlib master?

Lean Copilot only depends on aesop, so you can definitely use it in mathlib or other repos depending on mathlib. Here is an example of a repo depending on both mathlib and Lean Copilot.
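
For anyone wondering what that looks like in a lakefile, here is a rough sketch of a lakefile.lean requiring both; the revision pins, library names, and link arguments below are placeholders, and the authoritative instructions are in the LeanCopilot README:

import Lake
open Lake DSL

package «my_project» where
  -- Assumption: LeanCopilot's native inference code (CTranslate2) must be linked in;
  -- the exact flags and paths are those given in the LeanCopilot README.
  moreLinkArgs := #[
    "-L./.lake/packages/LeanCopilot/.lake/build/lib",
    "-lctranslate2"
  ]

require mathlib from git
  "https://github.com/leanprover-community/mathlib4.git"

-- Pin to whichever LeanCopilot revision matches your toolchain ("main" is a placeholder).
require LeanCopilot from git
  "https://github.com/lean-dojo/LeanCopilot.git" @ "main"

@[default_target]
lean_lib «MyProject»

After that, lake update LeanCopilot, lake exe LeanCopilot/download, and lake build as described earlier in this thread.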

Scott Morrison (Dec 14 2023 at 23:08):

Kaiyu Yang said:

Scott Morrison said:

I'm not aware of it. Can you give me a pointer? Thx!

(Sorry, my fault for not updating you on this!)

See https://github.com/leanprover/std4/blob/main/Std/Tactic/TryThis.lean#L386 for the main entry point addSuggestions. This supports multiple suggestions, and has customisation hooks for indicating tactics that do or don't succeed.

Also useful is the suggestion function provided by Mathlib's hint tactic, which takes a piece of syntax representing a tactic invocation and takes care of running it against the goal, checking whether it succeeded, and preparing the Suggestion with appropriate formatting that can be passed to the TryThis widget.

(To be clear, this is all inspired by llm-step's implementation, just with a bit more engineering work. :-)

It would be great if this can all be reused.

(The components that are still in Mathlib have open PRs moving them to Std; if it's helpful let me know and I can hurry those along. :-)

Kaiyu Yang (Dec 16 2023 at 17:55):

Thanks @Scott Morrison ! It's great to see these common frontend utilities moving upstream. We'll re-use them. In addition, do you have a similar function for displaying retrieved premises? For each premise, we have

  • fully qualified name
  • which module (file) it comes from
  • the code for defining it (optional)

Currently, we simply use the fully qualified name to find its type and doc-string (if available) and use logInfo to print it to the infoview panel. However, it would be great if it could be more interactive, e.g., displaying additional information when the user hovers the mouse over it (similar to hovering over the goals in the infoview panel), or allowing the user to "go to definition". Would that also be useful for the hammer the Lean FRO is developing?
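
For reference, the current behaviour could be approximated by meta code along these lines (purely illustrative, not our actual frontend code; showPremise is a made-up name):

import Lean
open Lean Meta

-- Illustrative sketch only: given a premise's fully qualified name, print its
-- type and doc-string (if any) to the infoview via logInfo.
def showPremise (premise : Name) : MetaM Unit := do
  let info ← getConstInfo premise
  let ty ← ppExpr info.type
  let doc? ← findDocString? (← getEnv) premise
  logInfo m!"{premise} : {ty}\n{doc?.getD "(no doc-string)"}"

A widget-based version could render the same data with hover and go-to-definition support.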

Kaiyu Yang (Dec 16 2023 at 18:00):

Jason Rute said:

But then technically you could be selecting a non-Prop definition as well…

@Mario Carneiro That was the main reason we used the term "premise" instead of "lemma", though informally (e.g., when explaining it to people coming to our poster), I actually use "lemma".

Mario Carneiro (Dec 16 2023 at 18:01):

I'm not sure that makes it better, people don't call definitions premises either

Kaiyu Yang (Dec 16 2023 at 18:01):

I guess "constant" is technically the right term for Lean? But people would be confused.

Scott Morrison (Dec 17 2023 at 00:30):

How about we just say "lemma suggestions"?

Scott Morrison (Dec 17 2023 at 00:32):

We do not have a function for displaying "lemma suggestions", but creating one sounds like a great idea. As @Thomas Murrills did such a great job with the Try these widget I wonder if they might be interested? :-)

Thomas Murrills (Dec 17 2023 at 01:45):

Thanks for thinking of me! You caught me at a good time; I’ve got a few days free! :D I’ll be happy to give it a shot! First I want to think a bit more deeply about the actual use cases of such a widget, and try to understand what information will (or might be) relevant to the user in these cases. I’ll probably ask some questions here as I think through it! :)

(I also want to consider whether it would make sense to integrate this with Try these or have a standalone widget. The answer might be readily apparent once I think about it…)

Min-Hsien Weng (Dec 17 2023 at 05:31):

@Kaiyu Yang

I recently came across the mistral-7b model and was surprised by its capability. I ran some initial experiments, including using it to detect whether texts were generated by an AI model. I found the fine-tuning efficient (fully trained on a TPU in just 2 hours with LoRA layers) and the performance impressive (0.85 ROC accuracy).

Has anyone explored using mistral-7b on LeanDojo? That would be a very interesting experiment. :grinning_face_with_smiling_eyes:

Eric Rodriguez (Dec 17 2023 at 15:54):

I'm currently trying to install this, and getting the error error: unknown executable «LeanCopilot/download»; has anyone else come across this?

Kaiyu Yang (Dec 17 2023 at 16:30):

Eric Rodriguez said:

I'm currently trying to install this, and getting the error error: unknown executable «LeanCopilot/download»; has anyone else come across this?

Maybe open an issue on GitHub?

Kaiyu Yang (Dec 19 2023 at 03:32):

Min-Hsien Weng said:

Kaiyu Yang

I recently came across the mistral-7b model and was surprised by its capability. I ran some initial experiments, including using it to detect whether texts were generated by an AI model. I found the fine-tuning efficient (fully trained on a TPU in just 2 hours with LoRA layers) and the performance impressive (0.85 ROC accuracy).

Has anyone explored using mistral-7b on LeanDojo? That would be a very interesting experiment. :grinning_face_with_smiling_eyes:

This is something I have always wanted to do but haven't had the chance to do yet :). I probably won't do it in the very near future, but I am interested to see others do it.

Kaiyu Yang (Dec 19 2023 at 03:35):

Thomas Murrills said:

Thanks for thinking of me! You caught me at a good time; I’ve got a few days free! :D I’ll be happy to give it a shot! First I want to think a bit more deeply about the actual use cases of such a widget, and try to understand what information will (or might be) relevant to the user in these cases. I’ll probably ask some questions here as I think through it! :)

(I also want to consider whether it would make sense to integrate this with Try these or have a standalone widget. The answer might be readily apparent once I think about it…)

Thank you! I'm happy to discuss more details when you set out to implement it!

namibj (Jan 06 2024 at 18:18):

It seems to me that the authors forgot to upload the Lean4 RAG model, which was initially unavailable (because their mathlib4 digestion scripts weren't able to dump premise details the way they did for Lean3, which I assume were used for the RAG fine-tuning). You can see it's missing here: https://github.com/lean-dojo/ReProver?tab=readme-ov-file#using-trained-models-on-hugging-face (I checked the Hugging Face account they uploaded under, it's not there, and it wasn't part of v1 of their arXiv submission, either).

Not that goal-conditioned auto-complete for tactics isn't already very useful, but there's a reason they went through the effort to set up a premise retrieval engine (the paper shows aesop (iirc best-first) proof search on top of the success-probability-annotated tactic suggestions (with RAG) to be state-of-the-art among auto-provers for Lean3, and they didn't change anything substantial for Lean4)...

Junyan Xu (Jan 06 2024 at 20:15):

@Kaiyu Yang and maybe this should be moved to the other thread.

namibj (Jan 06 2024 at 20:16):

Oh, thanks; how would/should I go about said move?

Junyan Xu (Jan 06 2024 at 20:20):

would need an administrator to move across streams, I think ...

namibj (Jan 06 2024 at 20:40):

Kaiyu Yang said:

  • Retrieval for premise selection

I believe I'm missing/you forgot to upload the fine-tuned RAG model for Lean4?

Also, CTranslate2 either crashes when attempting to use the GPU or refuses to build for me (at least from your CMake setup), so for now I've settled on the non-RAG generator running on my 5950X via OpenBLAS, as it's sufficiently interactive.

I'd maybe try to get a half-remote (so you can time-share the GPU with other GPU users/usage) autocomplete provider based on an (opportunistically) stateful prompt-completing beam search decoder: KV-cache the encoder results, re-steer beam search to respond to the latest typed prefix, and feed suggestions into the usual cursor-localized drop-down of regular autocomplete.

I'd have to try the RAG generator first, though, and hear feedback from others more familiar with Lean about its casual usefulness.

Notification Bot (Jan 06 2024 at 22:43):

4 messages were moved here from #general > LeanCopilot by Scott Morrison.

Kaiyu Yang (Jan 08 2024 at 20:07):

@namibj We're in the process of re-training a new version of the models for NeurIPS's Jan 15 deadline for final updates to the camera-ready version. We will update the ReProver repo with those models (including the Lean 4 w/ retrieval model). Feel free to DM me in case I forget.

For CTranslate2 or other problems with Lean Copilot, please open an issue in the LeanCopilot repo.

namibj (Jan 08 2024 at 20:09):

I'll ping you if they're not on HF or mentioned in the repo when I look on the 17th.
Thanks for the info!

namibj (Jan 17 2024 at 07:52):

@Kaiyu Yang It is the 17th, and the updates are neither on the ReProver repo nor are there updated models on your :hug: page.

Kaiyu Yang (Jan 17 2024 at 20:41):

namibj said:

Kaiyu Yang It is the 17th, and the updates are neither on the ReProver repo nor are there updated models on your :hug: page.

Updated!

namibj (Jan 17 2024 at 22:02):

Kaiyu Yang said:

namibj said:

Kaiyu Yang It is the 17th, and the updates are neither on the ReProver repo nor are there updated models on your :hug: page.

Updated!

Will LeanCopilot get the upgrade soon, or should I attempt to take care of it?

Kaiyu Yang (Jan 18 2024 at 18:07):

@namibj Currently Lean Copilot still uses the previous models. We have other plans for keeping Lean Copilot's models up to date (potentially better than the models in the LeanDojo paper). I'll send an update later.

namibj (Jan 25 2024 at 13:29):

Kaiyu Yang said:

namibj Currently Lean Copilot still uses the previous models. We have other plans for keeping Lean Copilot's models up to date (potentially better than the models in the LeanDojo paper). I'll send an update later.

If you don't expect this to be wasted work, I'd like to go ahead and try my best to unlock RAG in LeanCopilot, assuming I can get the retriever to work at all (currently it complains because the CTranslate2 CUDA support of the prebuilt LeanCopilot errors out on my system (I blame my old GTX 970); forcing CUDA doesn't seem to work for the retriever, just for the generator, which is how I've used it for the time being; I was able to build it without CUDA on my laptop at least). Edit: Reading the code made me realize how I can force CUDA off for the encode function, and thus for the select_premises tactic.

I'll assume the official guide to converting the model to CTranslate2 would work, or that there's code/documentation on how the current CTranslate2 model versions were created.

Worst-case, I'd attempt to use the ReProver repo as a base to cook up a local PyTorch-based RAG for use via the external model feature of LeanCopilot.

On a side note, while preparing to allow manually turning CUDA usage off, I noticed that the retrieve function in ct2.cpp lacks CUDA synchronization, which fits the theory that the crash happens from accessing unfinished/uninitialized output of the TopK operator (assuming the MatMul -> TopK sequence at least executes in a sequentially consistent manner).

If you don't make me hold off on this "upgrade LeanCopilot to RAG" work, I'd try to tackle that bug together with getting "current project" (and/or "in scope") premises included in the retrieval.

Kaiyu Yang (Jan 26 2024 at 03:14):

@namibj

there's code/documentation on how the current CTranslate2 model versions were created.

Please see the README of the CTranslate2 models on my Hugging Face.

Worst-case, I'd attempt to use the ReProver repo as a base to cook up a local PyTorch-based RAG for use via the external model feature of LeanCopilot.

I think that is a good start. You're less likely to run into problems when running the model in Python.

On a side note, while preparing to allow manually turning CUDA usage off, I noticed that the retrieve function in ct2.cpp lacks CUDA synchronization, which fits the theory that the crash happens from accessing unfinished/uninitialized output of the TopK operator (assuming the MatMul -> TopK sequence at least executes in a sequentially consistent manner).

That could be the reason for some CUDA errors we encountered when using the retrieval. Could you please open an issue to elaborate on that in LeanCopilot's GitHub repo? Thanks!

For incorporating retrieval into LeanCopilot, we also need a way to query and index the corpus of premises. Currently, the select_premises tactic in LeanCopilot only retrieves from a fixed snapshot of mathlib. Ideally, we want to go beyond that and make the corpus dynamic, though I'm not sure what's the best way to do that.
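
As a mental model of the retrieval step itself (the MatMul -> TopK sequence mentioned above), here is a tiny illustrative sketch: score every premise embedding against the goal embedding by dot product and keep the top k. The names and types below are made up; the real implementation lives in the CTranslate2 FFI code.

-- Illustrative only: rank premises by dot-product similarity with the goal embedding.
def dot (u v : Array Float) : Float :=
  (u.zip v).foldl (fun acc (a, b) => acc + a * b) 0.0

def topKPremises (goalEmb : Array Float)
    (premises : Array (String × Array Float)) (k : Nat) : Array String :=
  let scored := premises.map fun (name, emb) => (name, dot goalEmb emb)
  -- Sort by score, descending, and keep the first k names.
  let sorted := scored.qsort fun a b => decide (b.2 < a.2)
  (sorted.extract 0 k).map (·.1)

Making the corpus dynamic then mostly amounts to keeping the (name, embedding) array up to date as new premises appear.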

namibj (Jan 26 2024 at 03:56):

Kaiyu Yang said:

namibj

there's code/documentation on how the current CTranslate2 model versions were created.

Please see the README of the CTranslate2 models on my Hugging Face.

Ahh, yeah, that's about what I remembered; perfect then.

Worst-case, I'd attempt to use the ReProver repo as a base to cook up a local PyTorch-based RAG for use via the external model feature of LeanCopilot.

I think that is a good start. You're less likely to run into problems when running the model in Python.

It _is_ working, though, apart from one reproducible crash in init_premise_embeddings on the second select_premises invocation; I hope to have gotten a pointer on how to rr-record it without also recording VSCode itself, but if that's still unclear, I'll just attempt the latter tomorrow (ETA about 25 hours from now).

On a side note, while preparing to allow manually turning CUDA usage off, I noticed that the retrieve function in ct2.cpp lacks CUDA synchronization, which fits the theory that the crash happens from accessing unfinished/uninitialized output of the TopK operator (assuming the MatMul -> TopK sequence at least executes in a sequentially consistent manner).

That could be the reason for some CUDA errors we encountered when using the retrieval. Could you please open an issue to elaborate on that in LeanCopilot's GitHub repo? Thanks!

I'm actually planning to 1-up this and submit a PR directly; I expect to need to rework the function anyways to include non-mathlib premises in the initial ranking part of the RAG, so I might as well attempt to debug/fix this after learning enough of the CTranslate2 C++ API.

Either way, that operation doesn't need to run on GPU; the encode step is far more expensive.

For incorporating retrieval into LeanCopilot, we also need a way to query and index the corpus of premises. Currently, the select_premises tactic in LeanCopilot only retrieves from a fixed snapshot of mathlib. Ideally, we want to go beyond that and make the corpus dynamic, though I'm not sure what's the best way to do that.

I'm aware of what it currently does; ideas I have include upstreaming a hook suitable for dumping premises in the required format during a lake build, as well as more-or-less "just" digesting the local .olean files to understand what's locally relevant in terms of premises.
I'll ask how to do it in a way that probably won't break with, say, Lean 4.7.0 (~10 weeks from now) (nothing special about that version, just a hypothetical example).

If the premise dumping can run during a lake build, we could probably ask Reservoir to collect it across the ecosystem on the next build of each package.

Jason Rute (Jan 26 2024 at 16:11):

@namibj @Kaiyu Yang In case it helps, here is roughly how we do this in Graph2Tac. Graph2Tac stores vector embeddings for each definition (including theorems) in the global context. These are used for both argument selection (similar to premise selection) and for node embeddings in the graph model. The embeddings are stored inside the neural network weights, and there are roughly two categories of embeddings (both stored in the same table)

  • embeddings of definitions seen during training which already come stored in the network weights
  • embeddings of new definitions which are updated in the network weights in real time

Jason Rute (Jan 26 2024 at 16:11):

When synth or Suggest is run by the user, the Tactician API sends to the Python server all the definitions in the global environment that haven't already been sent over. If Tactician's auto-caching is on, then the Tactician API also sends over definitions after every new definition. This amortizes the computation of new embeddings, so the user doesn't notice any delay. We also have it set up so that if the user backs up in the context, then Tactician knows this as well, and it is reflected in the information Tactician sends to the Python server (because, again, Tactician only sends updates to context information, not the full context).

Jason Rute (Jan 26 2024 at 16:11):

Now when the Python server gets a new list of definitions, it checks the hashes of these definitions against the hashes for all embeddings it currently has in its table. (The hashes are based on the graph of the definition and its name, so they are invariant between different instances of running Coq.) If a definition hash is new, it calculates the embedding for that definition using the neural network and adds it to the next available spot in the embedding table. It also associates that hash with the index in the embedding table. Since the model comes already with embeddings for definitions seen during training, including all definitions in the standard library, it doesn't have to recalculate those. (As an implementation detail, we keep the table larger than the number of currently embedded definitions, so we can add new definitions without having to resize the array and recompile our model, except when we run out of room. This is like a resizable array.) (As another implementation detail, we don't delete embeddings, so there could be many embeddings seen during training which are not used as well as embeddings for definitions which the user wrote but then backtracked and edited. This isn't a big deal since we only do argument selection over the definitions available in the current global context.)

Jason Rute (Jan 26 2024 at 16:11):

I don't know how much of this is easily implementable in Lean, and it helps in our case that calculating the embeddings for a few definitions is really fast in our model. (But still, importing a large package can take a number of seconds to build the table for those definitions.)
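
To make the bookkeeping concrete in Lean terms, here is a small illustrative sketch of the hash-keyed embedding table idea (not Graph2Tac's or LeanCopilot's actual code; the encoder is stubbed out as a function argument):

import Lean
open Lean

-- Illustrative only: a growable table of definition embeddings keyed by a content
-- hash, so embeddings computed earlier (e.g. during training) are never recomputed.
structure EmbeddingTable where
  rows  : Array (Array Float) := #[]     -- one embedding vector per known definition
  index : HashMap UInt64 Nat := .empty   -- definition hash ↦ row in `rows`

namespace EmbeddingTable

def find? (t : EmbeddingTable) (h : UInt64) : Option (Array Float) :=
  (t.index.find? h).map fun i => t.rows[i]!

-- `encode` stands in for the (expensive) neural encoder; it is only called
-- when the definition hash has not been seen before.
def insertIfNew (t : EmbeddingTable) (h : UInt64)
    (encode : Unit → Array Float) : EmbeddingTable :=
  if t.index.contains h then t
  else { rows := t.rows.push (encode ()), index := t.index.insert h t.rows.size }

end EmbeddingTable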

Jason Rute (Jan 26 2024 at 16:11):

My coauthors @Lasse Blaauwbroek and @Mirek Olšák were more involved in some of these technical details and have more to add, but I also want to give them a shout-out for making a system which not only works in theory, but works in practice.

namibj (Jan 26 2024 at 16:14):

Jason Rute said:

namibj Kaiyu Yang In case it helps, here is roughly how we do this in Graph2Tac. Graph2Tac stores vector embeddings for each definition (including theorems) in the global context. These are used for both argument selection (similar to premise selection) and for node embeddings in the graph model. The embeddings are stored inside the neural network weights, and there are roughly two categories of embeddings (both stored in the same table)

  • embeddings of definitions seen during training which already come stored in the network weights
  • embeddings of new definitions which are updated in the network weights in real time

I mean, yeah, I expect something like that. The question is more about how to integrate with Lean/Lake (the build system/compiler) than about how to handle the entries once they've made their way to the vector database/retriever/encoder ("RAG machinery").

Jason Rute (Jan 26 2024 at 16:22):

I would think the best approach, but also the most engineering work, is to have three entry points:

  1. A model comes stored with embeddings for core, std, and mathlib.
  2. During local compilation, the embedding table is updated in the local model.
  3. During user interaction, the embedding table is updated in real time by exchanging incremental information with Lean about changes in the global context.

My point is mainly that 1. and 3. are possible, and we have experience doing it in Coq if you want to ask questions.

Jason Rute (Jan 26 2024 at 16:25):

(But I don't mean to presume we have all the answers, and you are correct that working with Lean/Lake may be very different from our work in Coq.)

namibj (Jan 26 2024 at 16:28):

Jason Rute said:

I would think the best approach, but also the most engineering is to have a three entry points:

  1. A model comes stored with embeddings for core, std, and mathlib.
  2. During local compilation, the embedding table is updated in the local model.
  3. During user interaction, the embedding table is updated in real time by exchanging incremental information with Lean about changes in the global context.

My point is mainly that 1. and 3. are possible, and we have experience doing it in Coq if you want to ask questions.

Well, those things ought to happen, maybe not quite structured like this.
The question is where and how exactly step 2 should happen.

I hope to get to this over the weekend/early next week; most of next week is blocked off in my schedule, though.
As I understand it, embedding expense would be roughly in the 10's of GFLOPs for common single-line theorems.
I.e., it's comparatively really expensive, so aggressive caching is critical. I just haven't figured out yet where and how best to do that, especially as there is some mild pretty-printing/normalization going on for symbol resolution (as I understand it), which might not be trivial at all the convenient points where these new definitions could be handled.

Kaiyu Yang (Jan 29 2024 at 01:34):

@namibj That sounds great. Please keep me posted. I think making this work for Lean is meaningful not only for RAG but also for selecting useful lemmas for humans.

namibj (Jan 29 2024 at 18:41):

Kaiyu Yang said:

namibj That sounds great. Please keep me posted. I think making this work for Lean is meaningful not only for RAG but also for selecting useful lemmas for humans.

Yeah; though IMO the RAG aspect should help humans interpret how those premises could be applied to the current goal/state.

Furthermore, may I request you upload the ct2 conversion of kaiyuy/leandojo-lean4-retriever-tacgen-byt5-small? It's clearer/easier from the licensing POV and would keep the commits to LeanCopilot from containing someone else's HF repo URLs.

Also, can you clarify the input spec for the Lean4 RAG generator model? I.e., does it need that <a></a> wrapping for the premise names that the Lean3 RAG generator model seems to need?

Furthermore, to my understanding, larger transformers are more efficient to train to a particular level of quality, but rather more expensive to run inference on: have you run any ablation studies whatsoever with a ByT5 larger than the smallest variant, which is the one used in the paper?
Do the 100 hours / 5 days of A100 training time for the fine-tuning of google/byt5-small refer to a single model, or do they cover both retriever and generator?

Min-Hsien Weng (Feb 01 2024 at 22:03):

namibj
Furthermore, to my understanding, larger transformers are more efficient to train to a particular level of quality, but rather more expensive to run inference on: have you run any ablation studies whatsoever with a ByT5 larger than the smallest variant, which is the one used in the paper?

I came across a Python library, vLLM, that can speed up the inference time of large language models. It's reported that it can generate text at over 1000 tokens/second using the Mistral-7B model with 128 parallel requests.

Currently, vLLM supports 20 models, but ByT5 isn't yet included. However, you can submit a request for its integration, and I believe they would be happy to incorporate Google's ByT5 model.

vLLM: https://github.com/vllm-project/vllm

Siddhartha Gadgil (Feb 02 2024 at 03:51):

Min-Hsien Weng said:

namibj
Furthermore, to my understanding, larger transformers are more efficient to train to a particular level of quality, but rather more expensive to run inference on: have you run any ablation studies whatsoever with a ByT5 larger than the smallest variant, which is the one used in the paper?

I came across a Python library, vLLM, that can speed up the inference time of large language models. It's reported that it can generate text at over 1000 tokens/second using the Mistral-7B model with 128 parallel requests.

Currently, vLLM supports 20 models, but ByT5 isn't yet included. However, you can submit a request for its integration, and I believe they would be happy to incorporate Google's ByT5 model.

vLLM: https://github.com/vllm-project/vllm

Any model similar enough to the listed ones and available on Hugging Face is automatically supported. For example, I run MorphProver using vLLM following the same instructions as for Mistral.

I don't know if this applies to ByT5 though.

Kaiyu Yang (Feb 02 2024 at 14:57):

@Min-Hsien Weng @Siddharth Bhat vLLM currently does not support encoder-decoder models, though I believe there is a pull request working on that. For decoder-only models, running them using vLLM may help significantly with speed.

namibj (Feb 02 2024 at 14:58):

With beam decoding, multi-request processing on ByT5 is not that high-impact.

namibj (Feb 02 2024 at 15:00):

1) Two-thirds of the effort is the encoder. 2) The output is shorter than the input. 3) You run output-token processing in parallel through the beam decoding anyway.

Kaiyu Yang (Feb 02 2024 at 15:07):

Furthermore, may I request you upload the ct2 conversion of kaiyuy/leandojo-lean4-retriever-tacgen-byt5-small

Uploading. Will be ready soon.

does it need that <a></a> wrapping

Yes. For details see format_augmented_state (with p_drop == 0)

have you run any ablation studies whatsoever with a larger ByT5 than the smallest that is widely used in the paper?

We tried google/byt5-base but the result was similar to google/byt5-small. There are other people currently trying Mistral 7B or CodeLLaMA: https://github.com/lean-dojo/LeanDojo/discussions/5.

Do the 100 hours/5 days of A100 training time for the fine tuning of google/byt5-small refer to single-model, or do they cover both retriever and generator?

That is retriever + generator. Training the generator needs ~3 days.

namibj (Feb 02 2024 at 15:08):

Thanks!

Jason Rute (Feb 08 2024 at 14:55):

@Kaiyu Yang I'm trying to run Lean Copilot's suggest_tactics (or search_proof) on a compute cluster via lake build. I'm getting the following error:

error: stderr:
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
error: external command `/u/jasonrute/.elan/toolchains/leanprover--lean4---v4.5.0-rc1/bin/lean` exited with code 134

Do you have any idea what could be causing this and how to fix it? (It works fine on my laptop.)

Jason Rute (Feb 08 2024 at 14:56):

Replacing suggest_tactics with aesop? doesn't give this error, so it is likely a Lean Copilot thing.

Kaiyu Yang (Feb 08 2024 at 15:00):

@Jason Rute Did you try restarting the file? Did it fix the error?

Jason Rute (Feb 08 2024 at 15:02):

I'm using lake build from the command line to compile the project (including the file which has suggest_tactics in it), so I don't think "restarting the file" applies here.

Jason Rute (Feb 08 2024 at 15:03):

I assume something is broken with running the model. Is there a simple way to test if that works?

Kaiyu Yang (Feb 08 2024 at 15:06):

Hmm, let me try to reproduce it. What are the exact steps?

Jason Rute (Feb 08 2024 at 15:16):

I doubt you will be able to reproduce it. I think it is system dependent. (It doesn't happen on my laptop for example.) I don't have a MWE yet, but it would look like this:

  • Make a new lean project.
  • Add LeanCopilot to it.
  • Do all the lake stuff to download the models:
lake update LeanCopilot
lake exe LeanCopilot/download
lake build
  • Modify a lean file already being built to run suggest_tactics, e.g.
import LeanCopilot

theorem foo : true := by suggest_tactics
  • lake build
  • Get error.

The only difference is that I'm using an existing project instead of a new one.

Jason Rute (Feb 08 2024 at 15:24):

Ok, I just followed those steps with a fresh project created by lake new test_project. I get the same error.

Jason Rute (Feb 08 2024 at 15:25):

I think what is happening is that there is an error running the model. Is there a way to run the model directly to get a less opaque error message?

Jason Rute (Feb 08 2024 at 15:27):

Maybe this python script?: https://github.com/lean-dojo/LeanCopilot/tree/main/python

Kaiyu Yang (Feb 08 2024 at 16:40):

@Jason Rute I tried but was unable to reproduce the error.

Yes, the error probably comes from the FFI part and can be avoided by running the model as an external model following these steps. First, create an external model and configure suggest_tactics to use it. The code may look like this:

import LeanCopilot

open LeanCopilot

def myModel : ExternalGenerator := {
  name := "kaiyuy/leandojo-lean4-tacgen-byt5-small"
  host := "localhost"
  port := 23337
}

#eval registerGenerator "kaiyuy/leandojo-lean4-tacgen-byt5-small" (.external myModel)

set_option LeanCopilot.suggest_tactics.model "kaiyuy/leandojo-lean4-tacgen-byt5-small"

theorem foo : true := by suggest_tactics

Second, launch the server that hosts the external model. You can use the Python script here: https://github.com/lean-dojo/LeanCopilot/tree/main/python. And you can comment out all models in this file except the model you're actually going to use ("kaiyuy/leandojo-lean4-tacgen-byt5-small"?). If successful, you will see something like below.
image.png

Third, lake build

Jason Rute (Feb 08 2024 at 17:07):

Thanks. I'll give this a try, but first, for some reason I am getting ld.lld: error: undefined symbol errors even just with import LeanCopilot, even on a fresh new project, so I have to figure that out first.

Jason Rute (Feb 08 2024 at 17:07):

(This wasn't happening earlier today.)

namibj (Feb 09 2024 at 01:45):

Jason Rute said:

Thanks. I'll give this a try, but first, for some reason I am getting ld.lld: error: undefined symbol errors even just with import LeanCopilot, even on a fresh new project, so I have to figure that out first.

No, well, the undefined symbol part is expected on a fresh project.
This is because you have to link against CTranslate2.

Once you're able to reproduce, I'd suggest setting up to get a core dump on the crash.
What sort of debugging access do you have on the cluster?
If you manage to set up a small reproducing example, and it's on the C/C++ side of the FFI boundary to Lean, I'd like to debug it.

Patrick Massot (Feb 20 2024 at 22:51):

@Kaiyu Yang there is a missing withContext in the LeanCopilot frontend. In version 1.1.1 try:

import LeanCopilot
import Mathlib.Topology.Instances.Real

def continuous_function_at (f : ℝ → ℝ) (x₀ : ℝ) :=
  ∀ ε > 0, ∃ δ > 0, ∀ x, |x - x₀| ≤ δ → |f x - f x₀| ≤ ε

def sequence_tendsto (u : ℕ → ℝ) (l : ℝ) :=
  ∀ ε > 0, ∃ N, ∀ n ≥ N, |u n - l| ≤ ε

example (f : ℝ → ℝ) (u : ℕ → ℝ) (x₀ : ℝ)
    (hu : sequence_tendsto u x₀) (hf : continuous_function_at f x₀) :
    sequence_tendsto (f ∘ u) (f x₀) := by
  unfold sequence_tendsto
  suggest_tactics
  sorry

and note all those _uniq in the remaining goals:

Try these:
refine' fun ε ε_pos =>  _
Remaining subgoals:
⊢ ∃ N, ∀ n ≥ N, |(f ∘ u) n - f x₀| ≤ _uniq.19267
intro ε
Remaining subgoals:
⊢ ∃ N, ∀ n ≥ N, |(f ∘ u) n - f x₀| ≤ _uniq.772
intro  ε
Remaining subgoals:
⊢ _uniq.782 > 0 → ∃ N, ∀ n ≥ N, |(f ∘ u) n - f x₀| ≤ _uniq.782
intro ε  εpos
Remaining subgoals:
⊢ ∃ N, ∀ n ≥ N, |(f ∘ u) n - f x₀| ≤ _uniq.790

Patrick Massot (Feb 20 2024 at 22:53):

The fix is very easy. In the Frontend file, replace

    let mut str := "\nRemaining subgoals:"
    for g in goals do
      let e ← PrettyPrinter.ppExpr (← instantiateMVars (← g.getType))
      str := str ++ Format.pretty ("\n⊢ " ++ e)

with

    let mut str := "\nRemaining subgoals:"
    for g in goals do
      let goalType ← instantiateMVars (← g.getType)
      let e ← g.withContext do (PrettyPrinter.ppExpr goalType)
      str := str ++ Format.pretty ("\n⊢ " ++ e)

Patrick Massot (Feb 20 2024 at 22:53):

I am sorry I don’t have much time for a proper bug report+PR right now.

Kaiyu Yang (Feb 21 2024 at 16:40):

@Patrick Massot Thanks! I'll take a look once I have a chance!

Esteban Estupinan (Apr 01 2024 at 17:18):

Hi all, I managed to install Lean Copilot with a lot of effort on Windows WSL; apparently there are no errors when importing LeanCopilot in the .lean file. The problem arises when I execute a command like "suggest_tactics": after a few seconds I get the following error. Maybe someone has already experienced this or knows how to fix it?

import LeanCopilot

theorem hello_world (a b c : Nat)
  : a + b + c = a + c + b := by
  suggest_tactics

imagen.png

David Renshaw (Apr 17 2024 at 12:32):

I've been trying LeanCopilot's search_proof on Compfiles via tryAtEachStep. Last night it found this nice simplification, shortening eleven lines to three lines.

(The literal thing it suggested was

  apply mul_nonneg
  · simp_all only [sub_nonneg]
  · simp_all only [sub_nonneg]
    nlinarith

and I touched that up a bit before committing.)

Kaiyu Yang (Apr 22 2024 at 13:37):

Lean Copilot's preprint is out: https://arxiv.org/abs/2404.12534

Jason Rute (Apr 22 2024 at 13:49):

Nice. I see you provide a benchmark, always appreciated. Do you have the benchmarking code or at least a list of test theorems so that others can compare their models on the same theorems?

Jason Rute (Apr 22 2024 at 14:01):

Also what were the specs and time limits for the automatic proofs?

Peiyang Song (Apr 22 2024 at 19:11):

Hi @Jason Rute, thanks! We detail the list of test theorems and their respective results in the Appendix (page 6) of our previous short paper. We will add the information to our arXiv version too in its next edition. For all the experiments, we stuck to the default configuration (e.g. maxHeartbeats) in aesop, without any customized configurations.


Last updated: May 02 2025 at 03:31 UTC