Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: llmstep


Sean Welleck (Aug 18 2023 at 00:31):

Hi all, check out llmstep, a tactic for suggesting proof steps using a language model: https://github.com/wellecks/llmstep

It uses a model running on your own device, and the suggestions usually take around a second or less with a GPU.

By default, llmstep uses an open-source language model fine-tuned on LeanDojo’s training set. @Rahul Saha and I are working on improving it; please let us know if you have any feedback!

llmstep.gif
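
For a sense of the interface, here is a minimal usage sketch; the goal is arbitrary, the import path is a guess based on the file name mentioned below, and it assumes llmstep is set up per its README with the local suggestion server running:

import LLMStep  -- import path assumed

example (a b : Nat) : a + b = b + a := by
  -- llmstep takes a prefix string; suggestions extending it (here perhaps
  -- `exact Nat.add_comm a b`) are displayed in the infoview.
  llmstep "exact"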

Scott Morrison (Aug 18 2023 at 01:07):

Cool! Excited to try it out.

Scott Morrison (Aug 18 2023 at 01:07):

Is there a mode which doesn't require the tactic type suggestion?

Scott Morrison (Aug 18 2023 at 01:08):

Could you write a wrapper that explores the proof tree itself?

Scott Morrison (Aug 18 2023 at 01:09):

If you can give me a "no hands" version, I would love to run it against the whole library.

Scott Morrison (Aug 18 2023 at 01:24):

It seems like lines 24-141 of LLMStep.lean could be immediately upstreamed to Std! (Someone was independently asking for this functionality recently, maybe Yury?)

Scott Morrison (Aug 18 2023 at 01:28):

It would be great to have the elab_rules : tactic | `(tactic | llmstep%$tac $pfx:str) => ... declaration that actually does the work separated out into a MetaM function (one that pretty-prints the goal, calls the model, and returns the parsed Syntax suggestions) and an elab_rules that pipes that to addSuggestions.

This would make it relatively easy to add a full self-driving mode. :-)
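
A rough sketch of the split being proposed, with all helper names and the server call as illustrative stand-ins rather than llmstep's actual code:

import Lean
open Lean Meta Elab Tactic

/-- Hypothetical stand-in for the call to the local suggestion server. -/
def queryModel (_pfx _goal : String) : IO (List String) :=
  pure []  -- placeholder; the real version would perform an HTTP request

/-- The MetaM-level worker: pretty-print the goal, call the model,
    and return the suggestion strings. -/
def llmStepCore (pfx : String) (g : MVarId) : MetaM (List String) := do
  let goal ← ppGoal g
  queryModel pfx (toString goal)

/-- A thin syntax front end; a real version would pipe the results to
    addSuggestions instead of just logging them. -/
elab "llmstep_sketch " pfx:str : tactic => do
  let suggestions ← llmStepCore pfx.getString (← getMainGoal)
  logInfo m!"{suggestions}"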

Scott Morrison (Aug 18 2023 at 01:30):

Are PRs welcome? :-)

Scott Morrison (Aug 18 2023 at 01:47):

Bug report: llmstep creates a widget, but does not put a blue squiggle under the llmstep call when it finishes. I think this is essential so you can see that something happened! (See how all the other suggestion functions work.)

Scott Morrison (Aug 18 2023 at 01:52):

Unfortunately on my (soon to be replaced) old machine, it takes 30-90s to get suggestions.

Scott Morrison (Aug 18 2023 at 01:52):

That aside, it works. :octopus:.

Scott Morrison (Aug 18 2023 at 02:49):

https://github.com/wellecks/llmstep/pull/2

Sean Welleck (Aug 18 2023 at 03:15):

Awesome, thanks!! PRs are definitely welcome; I'll take a look!

Zhangir Azerbayev (Aug 18 2023 at 12:44):

This is a really exciting release. I've had a lot of fun playing with it.

Zhangir Azerbayev (Aug 18 2023 at 12:47):

I created a PR that adds support for a llama.cpp inference server. This means that, in principle, anyone with a decent laptop should be able to run tactic prediction locally without GPUs.

Zhangir Azerbayev (Aug 18 2023 at 12:51):

Unfortunately, the default llmstep model has an architecture that isn't compatible with llama.cpp. I quickly finetuned a next-tactic prediction model that is compatible, but it's very undertrained (<1 GPU hour) and should really only be used as a proof of concept that the latency is acceptable.

Jason Rute (Aug 18 2023 at 13:36):

Note that Lean 3 has (or had, at least) a built-in interface for giving machine learning suggestions automatically in the info view: #general > api suggestion vscode integration. I have a number of questions, partly based on my experience with that:

  • can someone easily rig up a similar thing for Lean 4?
  • can this tactic (or better, an automatic suggestion giver) filter out tactics which fail, like Lean GPT-f did?

Jason Rute (Aug 18 2023 at 13:36):

Also, why this model over ReProver? Is it just that there is no ReProver for Lean 4?

Kaiyu Yang (Aug 18 2023 at 14:51):

This is awesome! We are also working on something that can be used to build tools for tactic suggestions (hopefully we will be able to share more later).

Jason Rute said:

Also, why this model over ReProver? Is it just that there is no ReProver for Lean 4?

ReProver (w/ retrieval) requires access to a corpus of all premises in the environment and their pre-computed embeddings, which llmstep does not currently provide.

It should be easy to use ReProver (w/o retrieval), though the input format has some minor differences. Note that Pythia is ~10x larger than ReProver (w/o retrieval); I'm curious about whether it performs much better. We're trying to replace the ByT5 backbone in ReProver with larger LLMs such as StarCoder, but no luck yet. Maybe we should try Pythia?

Zhangir Azerbayev (Aug 18 2023 at 15:07):

Kaiyu Yang said:

Maybe we should try Pythia?

Pythia was designed with interpretability research rather than performance in mind. By today's standards, the larger Pythia models are very undertrained, and you should expect much stronger theorem-proving ability from something like StarCoder.

Kaiyu Yang (Aug 18 2023 at 15:12):

Will the math-lm project produce models potentially useful for this purpose anytime soon?

Zhangir Azerbayev (Aug 18 2023 at 15:13):

Kaiyu Yang said:

Will the math-lm project produce models potentially useful for this purpose anytime soon?

Yes. The current timeline is early September.

Sean Welleck (Aug 18 2023 at 17:34):

Jason Rute said:

  • can this tactic (or better, an automatic suggestion giver) filter out tactics which fail, like Lean GPT-f did?

Yes, it checks whether the suggestion succeeds (and if so, highlights it in blue), and additionally whether it completes the proof (in which case it is highlighted in green).

Regarding the model, we finetuned a standard decoder-only model with the idea that we can always drop in a better one (e.g. see Zhangir's comments). As Kaiyu mentioned, it shouldn't be too hard to use the non-retrieval version of ReProver too. Adding support for a retrieval-augmented model is a great idea!
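
For illustration, one way such a check can be implemented (a sketch using the standard metaprogramming APIs, not llmstep's actual code): parse the candidate string in the tactic category, run it on a saved copy of the proof state, and record whether it ran and whether it closed every goal, which is exactly the information the blue/green highlighting needs.

import Lean
open Lean Elab Tactic

/-- Run a candidate tactic string against the current goals and report
    (ran successfully?, closed all goals?). The state is restored afterwards,
    so this is purely a check. -/
def checkSuggestion (s : String) : TacticM (Bool × Bool) := do
  match Parser.runParserCategory (← getEnv) `tactic s with
  | .error _ => return (false, false)
  | .ok stx =>
    let saved ← Tactic.saveState
    try
      evalTactic stx
      let closed := (← getGoals).isEmpty
      saved.restore
      return (true, closed)
    catch _ =>
      saved.restore
      return (false, false)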

Jason Rute (Aug 18 2023 at 22:43):

Scott Morrison said:

If you can give me a "no hands" version, I would love to run it against the whole library.

Just so it is clear: while there would be some value in these numbers, they would be a vast overestimate of the capability of this model. In particular, this model was trained on theorems from the library, and has likely memorized most of the proofs, especially the short ones. A better evaluation would be on theorems not used for training (assuming @Sean Welleck / @Kaiyu Yang held out some test theorems from the training set).

Even in that case there are two realistic concerns. (1) If the test theorems are randomly mixed into the library, that is very different from much of mathlib development, where one is working on a new file with new definitions and lemmas, so proving 50% of test theorems might not translate to anywhere near 50% of real-world theorems. This is worse for models like this one, which don't take the environment into account. (2) It is still not clear how much of mathlib already occurs in the pre-training data and contaminates any meaningful evaluation on test theorems. I know some people who think this is a minimal concern and some who feel it is a huge problem in this area.

Scott Morrison (Aug 19 2023 at 01:12):

Here's an idea for testing: benchmark against every theorem in Mathlib after running (non-terminal) aesop on the initial goal. This:

  • removes all of the "it's easy anyway" goals from the benchmark
  • hopefully mostly preserves the provability of the goals!
  • may shuffle things up just enough to prevent memorization
    (How true the third point is is unclear!)

Scott Morrison (Aug 19 2023 at 01:19):

I want to say something about my llmstep PR https://github.com/wellecks/llmstep/pull/2 above, which refactors the code (it's a tiny change) so that rather than just providing a syntactic tactic, it provides a MetaM-level function that can be run on any goal.

/--
Call the LLM on a goal, asking for suggestions beginning with a prefix.
-/
def llmStep (pre : String) (g : MVarId) : MetaM (List String) := do
   ...

I hope that everyone releasing Lean tools can make sure to do this as well --- it allows us to easily share benchmarking tools, use common frontends for displaying suggestions, layer best-first search on top in a generic way, etc.

Perhaps it's even worth agreeing what an API should look like!

The obvious thing missing from the signature above is that we should probably return a List (String × Float), to allow the LLM to say something about its confidence in each answer, to help rank answers and steer search.
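
A possible shape for such a shared signature, purely as an illustration rather than an agreed-upon API:

import Lean
open Lean Meta

/-- Each suggestion paired with a score from the model, so that different
    front ends can rank suggestions or use the scores to steer a
    best-first search. -/
def llmStepScored (pre : String) (g : MVarId) : MetaM (List (String × Float)) := do
  let goal ← ppGoal g
  -- a real implementation would send `pre` and `goal` to the model and return
  -- each suggestion with, e.g., a length-normalized log-probability
  let _ := (pre, goal)
  pure []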

Zhangir Azerbayev (Aug 19 2023 at 09:16):

Scott Morrison said:

Here's an idea for testing: benchmark against every theorem in Mathlib after running (non-terminal) aesop on the initial goal. This:

  • removes all of the "it's easy anyway" goals from the benchmark
  • hopefully mostly preserves the provability of the goals!
  • may shuffle things up just enough to prevent memorization
    (How true the third point is is unclear!)

I think that a past/future split is the best way to evaluate neural theorem provers. The OpenLLaMA 3B model I finetuned was released on June 7th, so rolling back the finetuning dataset to a mathlib commit from that date would mean all mathlib theorems merged between June 7th and now are fair game for testing.

This methodology also has the advantage of directly measuring a system's utility to mathlib developers.

Scott Morrison (Aug 21 2023 at 02:35):

@Zhangir Azerbayev, I like the past/future split, but this doesn't really allow for comparisons between different models, does it? (As they'll all have different cut-offs.)

Scott Morrison (Aug 21 2023 at 02:36):

I wonder if it's worth having a tool that lists the dates on which every declaration in a library landed in the git repository.

Junyan Xu (Aug 21 2023 at 02:44):

Does https://mathlib-changelog.org/ not do that?

Scott Morrison (Aug 21 2023 at 03:15):

As far as I'm aware that was never updated from Lean 3 to Lean 4. We'd also want machine-readable data.

Kaiyu Yang (Aug 21 2023 at 03:21):

LeanDojo has a function intended for getting the creation date of a theorem. It may not be very reliable since it just uses the file name and line number to query git log. I'm not sure what will happen if the theorem has undergone many modifications. Also, for mathlib4, the git log only reflects the porting progress instead of mathlib's development process?

Zhangir Azerbayev (Aug 21 2023 at 07:50):

Scott Morrison said:

Zhangir Azerbayev, I like the past/future split, but this doesn't really allow for comparisons between different models, does it? (As they'll all have different cut-offs.)

It would require retraining all baselines on the new finetuning dataset. Right now, that's not a problem, since the only strong baseline is ReProver and it's a very small language model. It may become an issue in the future.

Sean Welleck (Nov 01 2023 at 22:40):

Hi all, we posted a preprint about llmstep v1.0.0: https://arxiv.org/pdf/2310.18457.pdf

Notable additions:

  • ReProver (e.g. for CPU)
  • Free GPU via Google Colab
  • Runtime and theorem proving benchmarks

https://github.com/wellecks/llmstep

Sean Welleck (Nov 01 2023 at 22:43):

We also added experimental Llemma suggestions that use the preceding file content (or other language models via --hf-model). Currently it requires a GPU that supports vLLM. Here is an example from Examples.lean:
Screen-Shot-2023-11-01-at-3.25.54-PM.png


Last updated: Dec 20 2023 at 11:08 UTC