Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: sagredo


Scott Morrison (Apr 10 2023 at 05:04):

sagredo is a conversational tactic, which runs an automatic loop of GPT suggesting proof steps, and Lean providing feedback on errors and remaining goals.
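
In outline, the loop looks something like the following hypothetical Lean sketch; the names askGPT and tryInLean and the cap of ten round trips are illustrative, not the actual API on the branch.

    -- Hypothetical pseudocode for the loop described above; the real implementation
    -- (the Sagredo monad on the branch) differs in detail.
    def sagredoLoop (askGPT : String → IO String)
        (tryInLean : String → IO (Option String))  -- `none` = no errors and no goals left
        (initialGoal : String) : IO (Option String) := do
      let mut feedback := s!"Goal:\n{initialGoal}"
      for _ in [0:10] do                           -- cap the number of round trips
        let candidate ← askGPT feedback            -- GPT suggests a (partial) proof
        match ← tryInLean candidate with
        | none          => return some candidate   -- Lean accepts the proof: done
        | some problems => feedback := problems    -- feed errors/remaining goals back to GPT
      return none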

If anyone would like to play with it, I've pushed our branches sagredo and sagredo-widget to the main repository. This is joint work with (so far!) @Zhangir Azerbayev, @Wojciech Nawrocki, and @Zach Battleman.

Scott Morrison (Apr 10 2023 at 05:04):

The sagredo branch doesn't have a widget frontend, so you have to wait for the entire conversation to play out before you see it. However it is more or less mathlib-friendly, so it could be deployed for everyone quickly. The sagredo-widget branch has a nice frontend, but at the cost of some extra installation steps:

  • cd widget && npm install && npm run build
  • olean caching seems to be broken, so you may have to rebuild mathlib to play with it.

The three main files (these will likely move around) are Mathlib.Tactic.GPT.Sagredo.{Monad,Dialog,Widget}, and they provide the tactics sagredo (non-interactive) and sagredo! (interactive).
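
A minimal usage sketch (the example goal is made up, and which of the three files needs to be imported for each tactic is a guess; the bundled examples live in test/Sagredo):

    import Mathlib.Tactic.GPT.Sagredo.Dialog   -- guess: the file providing the `sagredo` tactic

    -- `sagredo` runs the GPT ↔ Lean loop non-interactively and reports what it finds;
    -- `sagredo!` (on the sagredo-widget branch) lets you watch each round trip.
    example (a b : Nat) : a + b = b + a := by
      sagredo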

Scott Morrison (Apr 10 2023 at 05:05):

There is obviously a lot of work that could be done on it in several directions. The first priority, however, would be to make a proper test suite that reports which problems it succeeds on, and in how many steps, so that changes to prompting and feedback can be evaluated. There are a handful of examples in test/Sagredo, but it's probably more fun to make things up yourself!

It's worth noting that GPT struggles with Lean 4 syntax. We get a lot of mileage out of careful prompting (our system message explains some of the syntax differences), but it would just be a lot better if it had seen more Lean 4. Maybe we don't have to wait long, who knows!

Scott Morrison (Apr 10 2023 at 05:05):

There's a demo video at https://youtu.be/CEwRMT0GpKo, which includes some fun examples; it may be easiest to start there.

Johan Commelin (Apr 10 2023 at 07:01):

Fascinating! Great work Scott, Zhangir, Wojciech and Zach!

I'm really curious whether GPT4 would improve a lot if it could be trained on such interactive sessions. Because otherwise it has to learn from polished end results. And my uninformed intuition says that it could learn a lot more and a lot faster if it could do such interactive training runs.

Mario Carneiro (Apr 10 2023 at 07:04):

you don't really "train" GPT4 though

Mario Carneiro (Apr 10 2023 at 07:04):

actual training happens behind closed doors and on the entire internet

Mario Carneiro (Apr 10 2023 at 07:05):

all you can do is push it in various directions using prompts

Johan Commelin (Apr 10 2023 at 07:09):

I know. But I'm still curious if it would make a significant difference.

Mario Carneiro (Apr 10 2023 at 07:17):

I'm saying that it is unlikely we can do this experiment unless we have an OpenAI insider

Johan Commelin (Apr 10 2023 at 07:21):

Yep. An alternative experiment would be to try LLaMa as backend. Then we do our own training on mathlib4, and also do such "interactive-session" training. But I have no idea if LLaMa is good enough, quality-wise.

Eric Wieser (Apr 10 2023 at 07:48):

Scott Morrison said:

It's worth noting that GPT struggles with Lean 4 syntax. We get a lot of mileage out of careful prompting (our system message explains some of the syntax differences), but it would just be a lot better if it had seen more Lean 4. Maybe we don't have to wait long, who knows!

I wonder if we could help it out here in the long run by teaching GitHub that lean4 is a different language to lean 3

Jason Rute (Apr 10 2023 at 10:30):

Johan Commelin said:

I'm really curious whether GPT4 would improve a lot if it could be trained on such interactive sessions. Because otherwise it has to learn from polished end results. And my uninformed intuition says that it could learn a lot more and a lot faster if it could do such interactive training runs.

I would look at Baldur ( #Machine Learning for Theorem Proving > Baldur: Whole-Proof Generation and Repair with Large Lang...) which uses an LLM comparable to the GPT-3 family and fine-tunes it on proof repair (which is basically a form of what you are saying about interactive runs). I’ve been meaning to write a review of that paper.

Jason Rute (Apr 10 2023 at 11:18):

@Scott Morrison I haven’t finished your video yet, so maybe this is answered there, but are you using the long context or short context version of GPT-4? If the long context version, do you find the additional context length helps, for say adding more examples in the prompt, or other stuff? Actually, are your prompts in the code? I’d love to see them if they are publicly available.

Jason Rute (Apr 10 2023 at 11:21):

I also look forward to the benchmarks when you have them. Benchmarks on ported Mathlib4 as well as miniF2F would be interesting (even though one has to consider that there is a high degree of data contamination for mathlib, since it is likely part of the training data for GPT-4; there is likely some miniF2F contamination as well, but not as much).

Scott Morrison (Apr 10 2023 at 11:21):

@Jason Rute, it just uses the 8k version at present, although I was thinking to switch over. The second example in the video actually hits the limit, so yes, a longer context length would be helpful. (One could also switch dynamically mid-conversation, or alternatively just drop the early parts of the conversation, keeping the system message.)

And yes, the prompting is all in the branch. It is interspersed with Lean code, because the point is to pass back error messages and goals in useful ways.
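
A minimal sketch of the "drop the early turns, keep the system message" idea mentioned above (Message, tokenCount, keepNewest and trimConversation are illustrative names, not taken from the branch, and the word-count "tokenizer" is only a stand-in for a real one):

    structure Message where
      role    : String   -- "system", "user", or "assistant"
      content : String

    /-- Very rough token estimate (whitespace-separated words); a real implementation
    would use the model's tokenizer. -/
    def tokenCount (m : Message) : Nat :=
      (m.content.splitOn " ").length

    /-- Walk from the newest message backwards, keeping messages while `budget` allows. -/
    def keepNewest (budget : Nat) : List Message → Nat → List Message
      | [], _ => []
      | m :: ms, used =>
        let c := tokenCount m
        if used + c ≤ budget then m :: keepNewest budget ms (used + c) else []

    /-- Keep the system message, then as many of the most recent messages as fit. -/
    def trimConversation (budget : Nat) : List Message → List Message
      | [] => []
      | system :: rest => system :: (keepNewest budget rest.reverse 0).reverse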

Jason Rute (Apr 10 2023 at 11:23):

But (referring to benchmarks) the best benchmark is always whether Lean users find this useful, so the fact that it is already available is a nice start!

David Renshaw (Apr 10 2023 at 11:45):

I tried sagredo last week and spent $0.07 over the course of two or three tactic executions. :sweat_smile:

David Renshaw (Apr 10 2023 at 12:35):

Thanks for releasing this! It's very cool, if kinda costly. (Welp, I just now spent another $1.20 on openai API calls.)

Adam Topaz (Apr 10 2023 at 13:41):

Has openai started letting people use the 32k model?

Matthew Ballard (Apr 10 2023 at 13:41):

It's in their pricing

Adam Topaz (Apr 10 2023 at 13:42):

I still only have 8k access. And I tried calling the 32k api yesterday with no success

Eric Rodriguez (Apr 10 2023 at 14:46):

Pay to Lean :)

Adam Topaz (Apr 10 2023 at 14:56):

Yeah, it seems like we will all have an "API fees" item in the budget sections of our future grant proposals.

Junyan Xu (Apr 10 2023 at 14:57):

Johan Commelin said:

Yep. An alternative experiment would be to try LLaMa as backend. Then we do our own training on mathlib4, and also do such "interactive-session" training. But I have no idea if LLaMa is good enough, quality-wise.

The most popular LLaMA derivative model at the moment is probably Vicuna, and GPT4All is probably next. Since our goal is to generate Lean code, we may look at how Code Alpaca does fine-tuning; LoRA fine-tuning is available too. Fine-tuning isn't yet available for the GPT-4 and GPT-3.5-turbo models, unfortunately, and when it becomes available you'd still need to pay OpenAI. Quantization + llama.cpp etc. allow LLaMA models to be run on consumer hardware, even CPU-only.

Junyan Xu (Apr 10 2023 at 15:40):

Vicuna claims to achieve 90% of ChatGPT's performance, but that would still be a lot below GPT-4. With fine-tuning, though, it's conceivable that it could outperform GPT-4 at writing Lean 4 code, especially if we can gather usage data from all Lean users. (Maybe upload the whole log each time sagredo is called, or store it locally so that users can choose to contribute.) It's common to train smaller open models on conversations generated by GPT-4 or ChatGPT (e.g. from ShareGPT) to boost their performance; even Google did that.

It may also be helpful to extend the sagredo agent's capabilities to enable active learning by reading mathlib documentation or even arXiv papers. (Copilot is already able to look into related files and maybe the whole repo in the future.) Agents like Auto-GPT are granted (see prompt) capabilities like web search, browsing, Python code execution, and long-term memory using Pinecone / Redis / Weaviate (in PR) (see also BabyAGI), and you can check out AgentGPT for a demo of another variant running on ChatGPT. Here's the output starting from the goal "Solve the Riemann hypothesis" before it hangs for some reason.

Johan Commelin (Apr 10 2023 at 16:27):

before it hangs for some reason

:rofl: :rofl: :rofl:

Zhangir Azerbayev (Apr 10 2023 at 16:30):

The explosion of progress on edge LMs is really encouraging for applications like Sagredo. It's likely that a LLaMA 7B or 13B model fine-tuned on mathlib4 next-tactic prediction would be good enough to drop into Sagredo. After all, gpt-f was only 700M parameters. Then you could run the tactic locally on cpu, and the setup would only involve installing a single binary.

Magnushammer-style premise selection models are tiny, and could also be run on cpu.

One drawback of LLaMA is that its pre-training data doesn't have that much math and code, which will hurt downstream performance on theorem proving. But I'm working on open-sourcing better math models, similar to Minerva.

Zhangir Azerbayev (Apr 10 2023 at 16:45):

In the coming years, there will be a lot of effort towards jointly optimizing algorithms and hardware for efficient transformer inference. Apple has been ahead of the curve in this regard. We should expect running transformers on our laptop to become as commonplace as, say, running video codecs.

Adam Topaz (Apr 10 2023 at 20:10):

Is there some list somewhere which includes all of these open source LLMs?

Adam Topaz (Apr 10 2023 at 20:42):

Johan Commelin said:

Yep. An alternative experiment would be to try LLaMa as backend. Then we do our own training on mathlib4, and also do such "interactive-session" training. But I have no idea if LLaMa is good enough, quality-wise.

I think it still takes multiple 10s of thousands of dollars to train LLaMa from scratch. Fine tuning should be much more doable though.

Junyan Xu (Apr 11 2023 at 02:59):

Adam Topaz said:

Is there some list somewhere which includes all of these open source LLMs?

https://github.com/nichtdax/awesome-totally-open-chatgpt#readme is such a list

Junyan Xu (Apr 11 2023 at 03:04):

Adam Topaz said:

I think it still takes multiple 10s of thousands of dollars to train LLaMa from scratch. Fine tuning should be much more doable though.

Yeah, Alpaca takes <$600, and Vicuna-13B can be trained (on top of LLaMA) on 8 A100 GPUs with 80GB memory for around $300.

Johan Commelin (Apr 11 2023 at 03:13):

@Adam Topaz Sure. When I wrote "training" I definitely meant "fine tuning".


Last updated: Dec 20 2023 at 11:08 UTC