Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: A chat interface for formalizing theorem statements


Zhangir Azerbayev (Jun 15 2022 at 22:22):

@Edward Ayers and I have developed an OpenAI Codex-powered chat interface for formalizing natural language theorem statements. The interface is a widget that runs in the Lean infoview. Here is an example of a multi-turn conversation with the bot:

Screenshot-from-2022-06-15-17-41-10.png

Our repo is here.

Unfortunately, you need an OpenAI API key to use the interface, but we hope that in a future version you won't. In the meantime, I'm happy to run any suggestions from the replies on my machine. I've tried this bot on some exercises from undergraduate textbooks, and in my experience thus far it gets about 25% of statements correct on the first try and about 75% correct with further suggestions.

If you're curious about what's under the hood, we're wrapping the user input in a few-shot prompt before querying the OpenAI API. The current version of the interface is still very much at the proof-of-concept stage, so please point out any bugs or glitches you find.
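To make the idea concrete, here is a minimal sketch of what such a few-shot prompt could look like. This is illustrative only (the real prompt lives in `src/prompting.lean`, and `example_prompt` is a made-up name, not from the repo): each example pairs a natural-language statement with a Lean formalization, the user's input is appended, and Codex completes the final `Lean:` line.

```lean
-- Illustrative sketch, not the actual lean-chat prompt.
-- Few-shot examples pair natural language with Lean; the user's
-- statement goes last, and the model completes the translation.
def example_prompt (user_input : string) : string :=
  "Natural language: The sum of two even numbers is even.\n" ++
  "Lean: theorem even_add (m n : ℕ) (hm : even m) (hn : even n) : even (m + n)\n\n" ++
  "Natural language: " ++ user_input ++ "\n" ++
  "Lean: "
```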

We were inspired by the Autoformalization with Large Language Models paper from the N2Formal group at Google.

Jason Rute (Jun 16 2022 at 02:11):

Very cool looking project! Can you try these hard examples that @Jeremy Avigad often uses to talk about type inference in natural mathematics? Screen-Shot-2022-06-15-at-10.09.18-PM.png

Albert Jiang (Jun 16 2022 at 17:14):

Zhangir Azerbayev said:

Edward Ayers and I have developed an OpenAI Codex-powered chat interface for formalizing natural language theorem statements. The interface is a widget that runs in the Lean infoview. Here is an example of a multi-turn conversation with the bot:

Screenshot-from-2022-06-15-17-41-10.png

Our repo is here.

Unfortunately, you need an OpenAI API key to use the interface, but we hope that in a future version you won't. In the meantime, I'm happy to run any suggestions from the replies on my machine. I've tried this bot on some exercises from undergraduate textbooks, and in my experience thus far it gets about 25% of statements correct on the first try and about 75% correct with further suggestions.

If you're curious about what's under the hood, we're wrapping the user input in a few-shot prompt before querying the OpenAI API. The current version of the interface is still very much at the proof-of-concept stage, so please point out any bugs or glitches you find.

We were inspired by the Autoformalization with Large Language Models paper from the N2Formal group at Google.

Very impressive stuff! I wonder if it is possible to extend this interface beyond Lean? I would have had a much more enjoyable time debugging Isabelle statements for the autoformalisation work if this interface had been around :)

Jason Rute (Jun 16 2022 at 17:21):

@Albert Jiang I think you would just need to replace the prompts in https://github.com/zhangir-azerbayev/lean-chat/blob/main/src/prompting.lean. It would still run as a Lean widget, but it would give Isabelle (or whatever) results.
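A hedged sketch of what that swap might look like: the few-shot examples would pair natural language with Isabelle output instead of Lean, while the widget machinery stays unchanged. The name `isabelle_example` and the example statement here are illustrative, not from the repo.

```lean
-- Hypothetical sketch: retarget the few-shot examples at Isabelle.
-- Only the example text changes; the Lean widget still drives the UI.
def isabelle_example : string :=
  "Natural language: The sum of two even numbers is even.\n" ++
  "Isabelle: lemma \"even m ⟹ even n ⟹ even (m + n)\"\n"
```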

Albert Jiang (Jun 16 2022 at 17:30):

Jason Rute said:

Albert Jiang I think you would just need to replace the prompts in https://github.com/zhangir-azerbayev/lean-chat/blob/main/src/prompting.lean. It would still run as a Lean widget, but it would give Isabelle (or whatever) results.

Thanks!

Zhangir Azerbayev (Jun 16 2022 at 18:40):

Jason Rute said:

Very cool looking project! Can you try these hard examples that Jeremy Avigad often uses to talk about type inference in natural mathematics? Screen-Shot-2022-06-15-at-10.09.18-PM.png

I managed to eventually get each of these three examples formalized. The first and the third took a few turns of back and forth, and the chatbot got the second on the first try (albeit with deprecated syntax).
Screenshot-from-2022-06-16-14-07-04.png
Screenshot-from-2022-06-16-14-10-50.png
Screenshot-from-2022-06-16-14-36-23.png

Yuhuai Tony Wu (Jun 16 2022 at 19:01):

Impressive work @Zhangir Azerbayev @Edward Ayers !

Leonardo de Moura (Jun 16 2022 at 20:29):

This is awesome. Nice work!

Junyan Xu (Jun 16 2022 at 21:00):

It has this habit of putting : at the start of the main statement. What can we infer about the age of Codex's training data?

Albert Jiang (Jun 16 2022 at 22:02):

Junyan Xu said:

It has this habit of putting : at the start of the main statement. What can we infer about the age of Codex's training data?

I think that might have to do with the prompt: https://github.com/zhangir-azerbayev/lean-chat/blob/main/src/prompting.lean. Maybe change the : in the prompt and see what happens?

Junyan Xu (Jun 17 2022 at 04:51):

Thanks! It didn't occur to me that the prompt might be the issue.

If a colon at the beginning of a line has never been mathlib style, then in the training data this style might correlate with the author being a new Lean user who has never contributed to mathlib, with unfamiliarity with mathlib APIs, and so on. So I think it's best to move colons to the end of the previous line in the official prompt.
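For concreteness, the two colon placements look like this (a toy example of mine, not taken from the bot's output):

```lean
-- The style the bot tends to produce, with the colon opening a line:
theorem add_comm₁ (a b : ℕ)
: a + b = b + a := nat.add_comm a b

-- mathlib style, with the colon ending the signature line:
theorem add_comm₂ (a b : ℕ) :
  a + b = b + a := nat.add_comm a b
```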

Mario Carneiro (Jun 17 2022 at 05:21):

Zhangir Azerbayev said:

Screenshot-from-2022-06-16-14-07-04.png

Lol @ trying to get to infinity by repeatedly applying nat.succ

Kevin Buzzard (Jun 17 2022 at 11:09):

Do you think the evaluation function goes up by 2^{-n} when you add the n-th succ because we're getting a bit closer?

Zhangir Azerbayev (Jun 17 2022 at 18:40):

Albert Jiang said:

Junyan Xu said:

It has this habit of putting : at the start of the main statement. What can we infer about the age of Codex's training data?

I think that might have to do with the prompt: https://github.com/zhangir-azerbayev/lean-chat/blob/main/src/prompting.lean. Maybe change the : in the prompt and see what happens?

I tried changing the prompt as you suggested and it does seem to improve performance. Here is the field characteristic p example with the new prompt.
Screenshot-from-2022-06-17-12-51-30.png

Jason Rute (Jun 17 2022 at 20:24):

Do the encouraging phrases, e.g. “you are almost there”, actually help? (Prompt engineering is so mysterious.)

Albert Jiang (Jun 17 2022 at 20:48):

Jason Rute said:

Do the encouraging phrases, e.g. “you are almost there”, actually help? (Prompt engineering is so mysterious.)

My guess is that they don't, since Lean code is usually contributed by experts already.

Although why not give it a try with something like "Translate in the style of Kevin Buzzard"

Junyan Xu (Jun 17 2022 at 22:03):

I've made a version in the style of mathlib docstrings (untested; I have a GPT API key but no Codex access) and I'm curious to see how it performs :) If I fill in the sorries in the prompt, will it spit out proofs?

Jason Rute (Jun 17 2022 at 23:07):

@Junyan Xu The Autoformalization with Large Language Models paper (mentioned at the top of this thread) noted that the choice of prompt matters a lot. If you have examples using the right vocabulary, that would probably help considerably. Of course, the burden is then on the user to come up with good example prompts for their use case (prompt engineering).

Jason Rute (Jun 17 2022 at 23:14):

If you fill in the sorries in the prompt with proofs it would indeed output “proofs”. I would also be curious what the proof outputs look like and if they are correct.

Junyan Xu (Jun 17 2022 at 23:22):

Yes, @Yao Liu already commented under the tweet advertising this chat interface and suggested using “let's think step by step”, which intuitively doesn't apply to the current task of formalization, which is more like translation. But he's alluding to this recent research, which shows that prompts can matter a lot and that it's important to experiment with many different prompts.

Julian Berman (Jun 19 2022 at 21:49):

Should one expect this to work with a key from the gpt-f beta? When I try to use mine, I just see codex: error. Is this a different model, meaning I need to re-apply at the link?

Zhangir Azerbayev (Jun 20 2022 at 15:44):

Julian Berman said:

Should one expect this to work with a key from the gpt-f beta? When I try to use mine, I just see codex: error. Is this a different model, meaning I need to re-apply at the link?

Yes, as far as I know Codex and the gpt-f beta are separate keys.

Julian Berman (Jun 20 2022 at 21:27):

Modulo waiting for Codex access, hacking support into lean.nvim was somewhat trivial :P, so I suspect this works there now too (possibly with the exception of the paste button needing a tweak).

Screen-Recording-2022-06-20-at-17.24.04.gif

Obviously the UX could be tweaked a bit more for the terminal, and if I get really evil I'll render the LaTeX to PNG and show that in the terminal... but yeah, excited to try this out there too.

Junyan Xu (Jun 21 2022 at 21:47):

GitHub Copilot just went public: https://twitter.com/github/status/1539283958441160712
$10/month, or a 2-month free trial followed by $100/year.
Since Copilot uses Codex for code completion under the hood, I think there's no loss in using Copilot instead of Codex directly for our task of autoformalization.
Would Microsoft make Lean code completion free to promote Lean?

Junyan Xu (Jun 21 2022 at 21:50):

Wow, "Free for verified students and maintainers of popular open source projects." I hope mathlib qualifies!

Junyan Xu (Jun 21 2022 at 21:55):

https://github.com/pricing#faq-copilot
People who maintain popular open source projects receive a credit to have 12 months of GitHub Copilot access for free. A maintainer of a popular open source project is defined as someone who has write or admin access to one or more of the most popular open source projects on GitHub. Simply visit the GitHub Copilot subscription page to see if you are one of the open source maintainers that meet our criteria for a complimentary subscription. If you do, you should see that you can add GitHub Copilot for no charge. If you see a charge on the purchase page then this means that you do not qualify at this time. Once awarded, if you are still a maintainer of a popular open source project when your initial 12 months subscription expires then you will be able to renew your subscription for free.

Junyan Xu (Jun 21 2022 at 21:58):

Apparently I don't have free access. Could any @maintainers try? https://github.com/github-copilot/signup

Mario Carneiro (Jun 21 2022 at 22:00):

I don't think I have free access either

Patrick Massot (Jun 21 2022 at 22:05):

There is no way mathlib could qualify as "one of the most popular open source projects on GitHub".

Patrick Massot (Jun 21 2022 at 22:06):

There is a precise measure here: each GitHub user can give one star to any project. mathlib has 1.2k stars; VS Code, for instance, has 133k stars.

Junyan Xu (Jun 21 2022 at 22:17):

Yes, mathlib probably wouldn't qualify based on popularity, but in terms of activity the gap is smaller. The number of PRs opened is about 1/10 that of vscode, 1/5 that of pytorch, and 1/4 that of tensorflow. (What if we counted lines of code?) The stated criterion is popularity, though.

Julian Berman (Jun 22 2022 at 12:19):

I'm not sure that's the metric they use; I have free access

Julian Berman (Jun 22 2022 at 12:20):

Though I can't tell if it's from the education side; they conflate the two a bit

Junyan Xu (Jun 22 2022 at 14:50):

If you’re a student and want to participate in the program, apply for the GitHub Student Pack to get started.

https://docs.github.com/en/education/explore-the-benefits-of-teaching-and-learning-with-github-education/use-github-for-your-schoolwork/apply-for-a-student-developer-pack

If you never applied, I'd assume it comes from one of your open source projects (you have a lot of them)! I've heard people claim they got access thanks to a 200+ star repo.

I suspect GitHub doesn't have an easy way to distinguish mathlib maintainers from normal contributors with non-master write access; is there a difference in their GitHub privileges (as opposed to bors privileges)? The announcement mentions "write or admin access"; can maintainers commit directly to master? I've also heard people suspect they didn't get access because their project is under an organization, as mathlib is.

Adam Topaz (Jun 22 2022 at 14:54):

@Junyan Xu I'm just curious, have you used copilot (with lean or other languages)? Is it worth $10/month?

Junyan Xu (Jun 22 2022 at 14:57):

No, I haven't, nor do I have Codex access; Copilot is now commercialized, but the Codex API is still in free private beta.
I reported the news here only because Copilot uses Codex, which is what the chat interface in this topic uses.

Zhangir Azerbayev (Jun 22 2022 at 17:15):

For all those discussing API keys/copilot, once we update lean-chat to comply with all of OpenAI's safety requirements, we plan to go through their app review process. If we are approved, lean-chat will no longer require the user to have their own API key.

Alex J. Best (Jun 22 2022 at 19:29):

I have used Copilot for Lean; it is occasionally very helpful, providing autocompletion for statements that are fairly nontrivial. I'm not sure I'd pay $10 a month for it, though.


Last updated: Dec 20 2023 at 11:08 UTC