Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Lean or Isabelle: Which Theorem Proving Tool for LLM?


Dongwei Jiang (May 13 2023 at 07:37):

Hello everyone,

I'm new here, and I've been going through several theorem proving studies, such as GPT-F, Curriculum Learning, Hyper Tree Search, Autoformalization, Thor, DSP, and MiniF2F. These works use either Isabelle or Lean.

Learning both Lean and Isabelle at once looks like a steep undertaking, and I believe focusing on one initially would be a better approach. Even though this group is Lean-oriented, I'd appreciate your advice on whether to begin with Lean or Isabelle. Here are some aspects I'm contemplating:

1. Community support and activity: I anticipate I'll have many questions as a beginner. While I don't have much exposure to the Isabelle community, this Lean group is quite active and supportive.
2. Compatibility of the tool with LLMs for theorem proving.
3. Ease of learning the tool.
4. Future development of the tool: I'm interested in a tool with an expanding, well-maintained community.

Johan Commelin (May 13 2023 at 07:44):

@Dongwei Jiang Welcome! Maybe this talk by Jason will help you get a bit of an overview https://www.youtube.com/watch?v=P5ew0BrRm_I&pp=ygUKamFzb24gcnV0ZQ%3D%3D ?

Dongwei Jiang (May 13 2023 at 15:13):

Johan Commelin said:

Dongwei Jiang Welcome! Maybe this talk by Jason will help you get a bit of an overview https://www.youtube.com/watch?v=P5ew0BrRm_I&pp=ygUKamFzb24gcnV0ZQ%3D%3D ?

Hi Johan, thanks for pointing me to this video, it's very informative and interesting :) One of my main impressions after watching it is that it's hard to compare results across different systems, and there are many factors that make fair comparisons even within the same system difficult...

However, I'm still seeking advice on which ITP to use alongside an LLM. Any additional advice would be appreciated.

Jason Rute (May 13 2023 at 15:20):

Here are my general thoughts. Ideally we would have good, interchangeable interfaces for all the major ITPs, including Lean, Isabelle, and Coq. But it rarely works out that well. Here are some things to think about:

  1. Library size. Coq has the largest amount of code, but it is also spread across numerous projects, on different versions of Coq and using different tactic languages. Isabelle/HOL's AFP is probably the largest unified library. I think Lean's mathlib will soon (in < 2 years) be larger, and it is growing quite rapidly (well, it was before the port).
  2. Tools. All the papers you mentioned for Isabelle use tools developed by @Albert Jiang, and he would be a great person to talk to. Having this set of tooling has made it easy for other folks (mostly Google, but also Meta and some academic groups) to try new ideas and compare results in a fairly standard way. Since Albert is also on all these papers, I don't know if their tools are easy to use without his assistance. Lean has an assorted collection of tools for both Lean 3 and Lean 4. Honestly, I think a lot of researchers have been waiting for Lean 4 before they do too much work in Lean, but at the same time Lean 4's meta-programming will make it easy to build new tools and customize existing ones. Indeed, a number of researchers have already gotten started on this. Coq also has some tools, including the CoqGym project and SerAPI (which is lower level but powerful). A lot of papers have been written using and extending models based on the CoqGym interface.
  3. LLM support. The world of AI is moving quickly. Previous papers were all about predicting the next tactic. While GPT-f and HyperTree Proof Search do use (medium-size) LLMs, they still do the next-tactic thing. The more recent papers have played around with completely different paradigms, like predicting whole proofs or proof sketches in one go with very large language models that can't easily be fine-tuned. Or just predicting premises instead of tactics, as in MagnusHammer (see the sketch after this list). Or focusing on other tasks like auto-formalization. So it is not even clear how useful the existing tools will be in that new paradigm. It may be better to just know how to work with Lean/Coq/Isabelle meta-programmatically to quickly apply new ideas. Indeed, one of the newest Isabelle papers, Baldur, uses the Minerva LLM to generate whole proofs, and I'm not even sure if they use Albert's tooling. They might just make their own. (I should really check on this. Do you know, @Albert Jiang?)
  4. Meta-programming. If the existing tools don't work for you, then meta-programming is the best bet. Isabelle does meta-programming in Standard ML and Coq in OCaml (although with Coq you can get pretty far with the SerAPI interface and tools built on top of it). Lean has its own (Lean-based) metaprogramming language that people really like. I think the best symbiosis is to get a metaprogramming expert to help you with your work (even if it is just helping you get started with existing tools).
  5. Community support. I think Isabelle, Coq, and Lean all have people actively working on theorem proving and tools. I think the Lean work is further behind, but also more in the open, with Zulip streams like this one. But other work is happening behind the scenes in all three.
  6. Benchmarks. It is hard to build a new tool and benchmark, because you can't compare to existing work, so your 40% and their 60% aren't comparable at all. This is one of the largest problems in this field. For Isabelle, there are two good benchmarks: one is MiniF2F, and the other is testing on a held-out part of the Isabelle AFP using the same setup as in the other papers. (Again, this is where it helps to plug into existing tooling.) There is also a MiniF2F benchmark for Lean, and that is what a lot of Lean papers use. While one can test on Lean's mathlib, that is hard, since mathlib changes so much and so fast. For Coq, there is no MiniF2F benchmark and I doubt there will be one anytime soon. But one can still test on library theorems, either new projects not used for training, or held-out theorems/files not seen during training. Also, given LLMs' ability to memorize the whole internet, data contamination is becoming a major problem, and we don't have a good solution for it (or even a clear sense of how much it matters for our benchmarks).
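To make the premise-selection idea from point 3 concrete, here is a minimal sketch (not MagnusHammer's actual method): embed the goal and the candidate premises with an off-the-shelf text encoder and rank premises by cosine similarity. The encoder name and the toy premise list are just illustrative placeholders.

```python
# Minimal premise-selection sketch: rank candidate premises against a goal
# by embedding similarity. Encoder and premises are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any text encoder works here

premises = [
    "theorem add_comm (a b : Nat) : a + b = b + a",
    "theorem mul_comm (a b : Nat) : a * b = b * a",
    "theorem add_assoc (a b c : Nat) : a + b + c = a + (b + c)",
]
goal = "⊢ x + y = y + x"

prem_emb = encoder.encode(premises, normalize_embeddings=True)
goal_emb = encoder.encode([goal], normalize_embeddings=True)[0]

scores = prem_emb @ goal_emb   # cosine similarity (embeddings are unit-norm)
for i in np.argsort(-scores):  # best premises first
    print(f"{scores[i]:.3f}  {premises[i]}")
```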

Albert Jiang (May 13 2023 at 15:31):

Dongwei and I had some brief email exchanges. Would be great to talk after NeurIPS!

  1. It’s not easy to use the Isabelle tooling at the moment without my assistance. I wrote PISA right before my PhD started and never had the time to maintain it afterwards - three-year PhD programmes don't reward software maintenance. But! Wenda Li and I now have an intern working on making it accessible and RL-friendly. So it might change quickly.
  2. I’m >99% sure that Baldur uses my Isabelle tooling, with modifications for Google’s infra.

Wenda Li (May 13 2023 at 18:34):

Jason Rute said:

  1. Library size. Coq has the largest amount of code, but it is also spread across numerous projects, on different versions of Coq and using different tactic languages. Isabelle/HOL's AFP is probably the largest unified library. I think Lean's mathlib will soon (in < 2 years) be larger, and it is growing quite rapidly (well, it was before the port).

Some statistics on the Isabelle side: the standard library has about 0.4M LoC, the AFP has 3.8M LoC, and the seL4 project has 1.5M LoC. Admittedly, proofs in Isabelle are a combination of logic, algorithm/program verification, and math.

Jason Rute (May 13 2023 at 22:19):

Yeah, I should be more careful with my words. On second thought, as I look at the statistics more closely, I'm not sure my 2-year prediction will hold. Isabelle and Coq will probably remain larger sources of theorem-proving data than Lean for a while yet, if one goes just by number of theorems, proof-state/tactic pairs, or lines of code.

Dongwei Jiang (May 14 2023 at 07:03):

Hi @Jason Rute, thanks for your detailed and informative answer! Your insights on library size, LLM support, meta-programming, and community support were truly helpful and eye-opening. As for the tools, I've been experimenting with Albert's Isabelle repo and Jesse's Lean repo. They're admittedly challenging, but I'm optimistic about their future improvements.

And from your answer, it seems like there is a near tie between Isabelle and Lean. It looks like I'll need to dive into the code and explore more, especially on the meta-programming aspect, to make a decision...

Also @Albert Jiang, thank you so much for your project! I've been following your suggestion and I'm currently in the process of building the AFP (the current AFP version is 2023, and I think it's already taken more than 150 hours...)

Lev Stambler (May 15 2023 at 20:16):

Has anyone looked into fine-tuning an open LLM with something like LoRA? It could be interesting to retrain the encoder to handle more Lean-friendly tokens.

Fabian Glöckle (May 15 2023 at 20:33):

Lev Stambler said:

Has anyone looked into fine-tuning an open LLM with something like LoRA? It could be interesting to retrain the encoder to handle more Lean-friendly tokens.

I think currently the question is less how to fine-tune language models and more how to put them into an interesting proving loop.
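For concreteness, here is a minimal sketch of what such a loop could look like: a best-first search in which an LLM proposes tactics and the ITP checks them. `suggest_tactics` (the model call) and `run_tactic` (the prover interface, e.g. something PISA-like for Isabelle or a Lean REPL) are hypothetical stand-ins, not real library APIs.

```python
# Best-first proof search sketch: expand the most promising proof state first,
# as scored by the LLM's log-probabilities. The callbacks are hypothetical.
import heapq

def best_first_proof_search(initial_state, suggest_tactics, run_tactic,
                            max_expansions=100):
    # Priority queue of (negative cumulative log-prob, tie-breaker, proof state).
    queue = [(0.0, 0, initial_state)]
    pushed, expansions = 1, 0
    while queue and expansions < max_expansions:
        neg_logprob, _, state = heapq.heappop(queue)
        expansions += 1
        for tactic, logprob in suggest_tactics(state):  # LLM proposes tactics
            result = run_tactic(state, tactic)          # ITP checks the tactic
            if result is None:                          # tactic failed to apply
                continue
            if result.is_solved:                        # no goals remain
                return result.proof
            # Prefer partial proofs whose tactic steps the model found likely.
            heapq.heappush(queue, (neg_logprob - logprob, pushed, result.state))
            pushed += 1
    return None  # search budget exhausted without finding a proof
```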

Dongwei Jiang (May 20 2023 at 15:35):

Lev Stambler said:

Has anyone looked into fine-tuning an open LLM with something like LoRA? It could be interesting to retrain the encoder to handle more Lean-friendly tokens.

I know it's a bit off topic, but I'm currently working on fine-tuning LLaMA with alpaca-lora on Lean data from the PACT paper. I'll be sure to provide more updates once I make some significant progress.
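For anyone curious, the setup looks roughly like this (a minimal sketch with the Hugging Face `peft` library, in the spirit of alpaca-lora rather than its exact script; the base-model id, the target modules, and the `lean_pact.jsonl` data file are placeholders):

```python
# LoRA fine-tuning sketch: train small low-rank adapters on a causal LM while
# the base weights stay frozen. Model id and data file are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "huggyllama/llama-7b"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters on the attention projections only.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Placeholder dataset: one proof-state/tactic example per line in a "text" field.
data = load_dataset("json", data_files="lean_pact.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024))

trainer = Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(output_dir="llama-lean-lora",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```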


Last updated: Dec 20 2023 at 11:08 UTC