Zulip Chat Archive

Stream: new members

Topic: Hacking in Lean3


Vincent Siles (Sep 21 2020 at 08:47):

Hi! I'm new to Lean and I'm getting familiar with the system. My team is currently working on some AI/ML models with Metamath and HOL Light, and we would like to see if Lean could be a new target too. To do so, I might end up hacking a bit on Lean (changing some output to fit our format, ...). Should I compile Lean from source and work with a local clone, or are there other options (plugins?) to tune Lean for our needs?

Johan Commelin (Sep 21 2020 at 08:49):

Hi, welcome!

Johan Commelin (Sep 21 2020 at 08:50):

@Gabriel Ebner @Edward Ayers are probably the people who know this best

Gabriel Ebner (Sep 21 2020 at 08:51):

You might also want to look in the #Machine Learning for Theorem Proving stream. @Jason Rute and @Patrick Massot have been doing similar stuff.

Vincent Siles (Sep 21 2020 at 09:02):

Will do, thank you !

Jason Rute (Sep 21 2020 at 11:57):

Welcome @Vincent Siles! Unfortunately, without knowing your needs or the approach you want to take, it is difficult to know the right answer. Let me try to lay out some options and some possible roadblocks, but, as Gabriel said, I suggest you look at a few things first:

  • The threads on the #Machine Learning for Theorem Proving stream. We discuss AI/ML for theorem proving in general there, but also Lean specifically.
  • Messages by me in the #general stream (and maybe #new members as well). I discuss many things there, but I field variations of this particular question fairly often.
  • You could see if the OpenAI team (@Stanislas Polu or @Auguste Poiroux), who are working on something similar, are open to discussing their approach.

Jason Rute (Sep 21 2020 at 11:57):

Now, you likely have two concerns: (1) interacting with Lean (sending partial proofs to Lean and getting back information from Lean), and (2) proof recording. As for the first, there is currently no good way to communicate with Lean programmatically and interactively, but there are a number of mediocre ways. One of the simplest approaches is to use the Lean language server (lean --server). I have an example here using the language server to search for proofs. There is also a rudimentary Python client which calls the Lean server. Neither of those two links is perfect. The most important thing to know about the server is that certain parallel requests (like sending two info commands at the same time) may cause one request to fail or get cut short. Also, the information sent back can sometimes be weird, especially if there is nested structure. Also, it is SLOW: there is a hard-coded delay of 200 ms. (If you are looking to modify the C++ code, you might try to remove that delay and see what happens. Search ".2 s" or "200 ms" in Zulip to find other messages of mine about this issue.)
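To make the server workflow concrete, here is a rough sketch of my own (not from the linked examples): the kind of file you would send to `lean --server` and then probe with an `info` request at the position of the `sorry`, to read back the goal state there.

```lean
-- A minimal file you could hand to the server.  An `info` query at the line of
-- the `sorry` should report the tactic state `n : ℕ ⊢ n + 0 = n`.
example (n : ℕ) : n + 0 = n :=
begin
  sorry  -- query here to see the goal, then replace `sorry` with a candidate tactic
end
```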

Jason Rute (Sep 21 2020 at 11:58):

Other options include working internally within Lean's tactic framework. Lean has an IO library which lets you communicate with Lean over stdin/stdout or with other processes. I'm working on a client which does this, but it is a lot of work and I haven't gotten far enough to show anything yet.
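As a rough illustration of the IO route (my own sketch, assuming core Lean 3's `system.io` API with `io.get_line`/`io.put_str_ln` and, if your version has it, the community Lean's `lean --run` entry point):

```lean
import system.io
open io

-- Echo a single line from stdin back to stdout.  A real client would instead
-- parse a request, run a tactic, and print the resulting goal state.
def main : io unit :=
do line ← get_line,
   put_str_ln ("received: " ++ line)

-- run with something like: lean --run echo.lean
```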

Jason Rute (Sep 21 2020 at 11:58):

You could certainly try to modify the C++ code. (I've shied away from this because I don't know C++ well.) What you modify probably depends on your needs and your willingness to maintain a fork of Lean.

Jason Rute (Sep 21 2020 at 11:58):

As for proof recording, that is tricky in Lean. First, Lean has three modes of proof: term mode, calc mode, and tactic mode. This is further complicated by the fact that these modes are regularly combined. I've seen (although rarely) tactic proofs inside term proofs inside calc blocks inside tactic proofs. I assume you are looking for tactic-mode data. If so, there isn't any good way to parse Lean files to get all the tactic statements. I do, however, have a bunch of hacks that can get part of the way there. Look for some recent messages of mine and feel free to hit me up for more advice. (One can try to use the language server, but it gets mixed results. Another option is to hack the Lean base library: I can print the start position of every tactic, but not the end position. With more work I could do better parsing, but I think I would have to modify every tactic. Again, you might be able to modify the C++ here, but I'm not sure.)

One thing to keep in mind is that it isn't even clear what it means to "record" a proof. If, like OpenAI, you are working with a language model, then you can record the high-level tactic code and the goal in its output format. If, like Google, you are working with a more restricted tactic grammar, you will need to parse the tactics and their arguments, which is more work but theoretically doable. Also, you may want to store the expressions in a more machine-parsable format.
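In case the modes are unfamiliar, here is a small illustration of my own (not from any dataset) of the same kind of fact proved in term mode, tactic mode, and with a calc block that nests tactic proofs inside a term proof:

```lean
-- term mode: the proof is a single expression
theorem term_mode : 2 + 2 = 4 := rfl

-- tactic mode: the proof is a begin ... end block of tactics
theorem tactic_mode : 2 + 2 = 4 :=
begin
  refl
end

-- calc inside a term proof, with tactic proofs (`by rw ...`) nested in each step
theorem mixed_modes (n : ℕ) : n + 0 + 0 = n :=
calc n + 0 + 0 = n + 0 : by rw nat.add_zero
           ... = n     : by rw nat.add_zero
```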

Jason Rute (Sep 21 2020 at 11:59):

Last, Lean3 is currently changing rapidly, with massive refactoring going on, so keep that in mind. Also, Lean4 is imminent, but it will still take a lot of time to port Lean3 over to Lean4. My strong guess is that the Lean4 situation won't be any better than Lean3 at first, but I could be wrong. The big promise of Lean4 is that it is implemented in Lean itself, which suggests one will have better tools for parsing Lean files into training data and running Lean programmatically. However, neither of these is really the purpose of Lean4, so I doubt such tools will come out of the box. One would have to build these tools, and I'm not sure it is any easier to build them in Lean than it has been in C++.

Jason Rute (Sep 21 2020 at 11:59):

I know this is discouraging, but I hope I haven't scared you away. I'd love to see Facebook research work on automating Lean proofs. The work you all are doing on code translation and symbolic reasoning is cool and I'd love to see it applied to theorem proving, including Lean.

Kevin Buzzard (Sep 21 2020 at 12:04):

I would imagine that the maintainers of the community fork of Lean 3 would be open to C++ PRs which made IO better.

Vincent Siles (Sep 21 2020 at 12:08):

Thank you for the very detailed answer! At the moment I'm gathering general info on Coq/Lean to see how we could adapt them to our tools (we talk to HOL using a JSON structure, so we might be able to read the LSP output, or learn how to make Lean output our format firsthand). Also, we try to make our training as parallel as possible, so we serialize some internal state of HOL so that we can move any subgoal to a new worker if possible. Is there any documentation on Lean's internals I could start to browse, or should I just dive into the C++ code and see if we can perform the same kind of tricks?

Vincent Siles (Sep 21 2020 at 12:08):

Kevin Buzzard said:

I would imagine that the maintainers of the community fork of Lean 3 would be open to C++ PRs which made IO better.

I was not aware of the community fork, will look into it, thanks!

Jason Rute (Sep 21 2020 at 12:10):

Yes, the community fork is way ahead of the "official" version. Also it has better documentation and is being actively maintained.

Jason Rute (Sep 21 2020 at 12:13):

Kevin Buzzard said:

I would imagine that the maintainers of the community fork of Lean 3 would be open to C++ PRs which made IO better.

I've asked about adding a flag to the server to remove the 200ms timeout, but you are right that a PR would be better. I just have to learn enough C++ to do it. :) Seriously though, I think there is a strong desire to have tools like this. For example, I used the Lean server to help Johan refactor Lean once. If anyone is willing to help me make that PR (which removes the 200ms timeout with a flag), I'd be happy to work with you.

Reid Barton (Sep 21 2020 at 12:17):

Jason Rute said:

Last, Lean3 is currently changing rapidly with massive refactoring going on, so keep that in mind.

I'd say we're fairly conservative with changes to the C++ part of Lean 3: mainly bug fixes, performance improvements, better interactivity and other quality of life improvements. It's true that we are more ruthless with the core library but this is mostly independent of changes on the C++ side. Sometimes bug fixes to the C++ part do require compensating changes to the core library (and to mathlib) of course.

Jason Rute (Sep 21 2020 at 12:23):

I'm referring to the math and core libraries. This matters somewhat for machine learning. Consider the rename from category to control. If an AI prover were trained on an old version of Lean 3, it would try to use category, while new versions expect control. I'm mostly warning that to get a stable benchmark one probably needs to pin a version of Lean and just work with that. It probably isn't a big deal for a research lab which is just trying to write a paper and establish a benchmark. It is a bigger deal if someone were trying to make a practical AI tool for Lean.

Kevin Buzzard (Sep 21 2020 at 12:36):

When Lean 4 hits, I would imagine that enthusiasm for maintaining a Lean 3 fork will slowly wane -- or at least I would hope so.

Vincent Siles (Sep 21 2020 at 12:39):

FWIW I'm only looking at Lean 3 because the Lean 4 page says so :) I think we would switch when 4 is out.

Vincent Siles (Sep 21 2020 at 14:52):

@Jason Rute so right now, on top of tuning the output format of HOL Light to our purpose (which is not that complex a task), we have the following architecture: a pool of workers, each running an instance of HOL Light. Each time one of the workers makes a step in a proof, it generates new subgoals. We serialize these subgoals and send them to other workers, to work on in parallel.
So we are exploiting the fact that it is easy to serialize a HOL Light goal state and restore it, to get aggressive parallelisation.

Do you think a similar pattern could work in Lean 3? Serializing intermediate subgoals and restoring them to resume a proof "in the middle" of it?

Jason Rute (Sep 21 2020 at 22:20):

Vincent Siles said:

Do you think a similar pattern could be done in Lean 3 ? Serializing intermediate sub-goals and restoring them to resume a proof "in the middle" of it ?

It may be possible, but it won’t be as easy as in HOL-Light. A “state” in a theorem prover like Lean is a combination of two things: the environment (what has come before) and the goal state (what you are currently working to prove).

Jason Rute (Sep 21 2020 at 22:20):

In HOL-Light, the environment is basically just all the theorems and definitions. (There are a few settings that can be turned on/off, but I think those are mostly for parsing and printing, so can be ignored if you are working with the expressions more directly.) For this reason, it is usually ok to just assume the environment is everything proven in the library (as long as you add some checks to make sure you don’t reference theorems out of order).

Jason Rute (Sep 21 2020 at 22:20):

In Lean the environment is more complicated. It still includes theorems and definitions, but it also contains more. Some of these things are more of an annoyance; others might be a problem:

  • In addition to definitions and theorems, one also has typeclasses and notation to determine how expressions get parsed, but if you print a goal with pp.all in Lean (or read the expression tree), they are fully elaborated, so you shouldn’t need to worry about those too much.
  • A theorem can be inside an open or namespace block which allows one to write, say, add instead of nat.add. Again, if you are careful, I think this isn’t a huge deal. For example, I think pp.all will print the full version nat.add and expressions won’t change based on this.
  • The behavior of some tactics changes as the environment changes. For example, you can add lemmas to the simp tactic by tagging them with the @[simp] attribute, and then those get applied automatically when using simp (see the sketch after this list). So if you mark a lemma as @[simp] and later try to prove that same lemma again, then by simp would suffice.
  • Some theorems and definitions are private. Hence they won’t be visible in all places in the code.
  • Some tactics can add theorems to the environment in the middle of a proof.
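Here is a small illustration of the simp point above (my own sketch, with a made-up definition; nothing here is from mathlib):

```lean
-- `double` is a throwaway definition; `simp` knows nothing about it yet.
def double (n : ℕ) : ℕ := n + n

-- Tagging this lemma with `@[simp]` adds it to the default simp set.
@[simp] lemma double_zero : double 0 = 0 := rfl

-- From this point on in the environment, `simp` can close the goal by itself;
-- in an environment without the attribute, plain `simp` would fail here.
example : double 0 = 0 := by simp
```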

Jason Rute (Sep 21 2020 at 22:21):

In many cases, it probably would work to just use a super-environment which imports everything. This can be done with import all (after you create all.lean, for which there is a script in mathlib). You would have access to all theorems and definitions. You would also get the most powerful version of the simp tactic (which is both good and bad, since every simp lemma then becomes trivial to prove). Also, import all would be fast, since mathlib comes with compiled .olean files.
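Concretely, the super-environment approach might look like this (a sketch; it assumes you have already generated all.lean with mathlib's script and that the compiled .olean files are on your path):

```lean
-- all.lean is a generated file that transitively imports every mathlib module.
import all

-- `ℝ` and its instances come from mathlib, and the full default simp set is
-- available, so plain `simp` is much stronger than in a bare file.
example (x : ℝ) : x + 0 = x := by simp
```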

Jason Rute (Sep 21 2020 at 22:21):

If instead you want to capture the environment exactly as it is before you prove a theorem, then that would be trickier. A lot of the environment would be captured by the imports at the top of the file, but not by the theorems and definitions earlier in the same file. (You could theoretically make and compile an .olean file for each environment checkpoint.)

Jason Rute (Sep 21 2020 at 22:21):

Now, as for the goal stack: in HOL-Light the goals are pretty easy to serialize; in Lean it is more difficult. Note that a Lean expression is a Lean object that you can work with in a number of ways. In theory, if you use pp.all to print an expression, the printed form should be parsable by Lean again.
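For example (a sketch of my own; the printed output in the comment is abbreviated and only indicative of the shape you get):

```lean
set_option pp.all true

#check 1 + 1
-- With `pp.all`, the output is fully elaborated, something like:
--   @has_add.add.{0} nat nat.has_add
--     (@has_one.one.{0} nat nat.has_one)
--     (@has_one.one.{0} nat nat.has_one) : nat
-- which, unlike the pretty-printed `1 + 1`, can be pasted back into Lean.
```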

Jason Rute (Sep 21 2020 at 22:21):

If you look at [my example here](https://github.com/jasonrute/communicating-with-lean/blob/master/communicate_with_lean.ipynb), you can see some examples of this approach.

Jason Rute (Sep 21 2020 at 22:21):

However, there are a number of common edge cases. Lean4 is supposed to fix many of those, but I think some will remain. One common edge case that isn’t really fixable is that the fully elaborated expression is sometimes just too long to print and Lean doesn’t print the whole thing. This often happens with type class elaboration. Another issue is that expressions with meta-variables may not print in such a way that they can be used in another Lean instance. (I’m less certain about this however.)

Jason Rute (Sep 21 2020 at 22:21):

Moreover, while Lean expressions are first-class citizens, goals, including the local context, are trickier. There is no direct way I know of to say “create a goal of this form”; there are just some hacks. The notebook above shows one hack, which is to enter the local context as hypotheses of a theorem statement in Lean. A different hack is to apply the revert_all tactic to fold the whole local context into the goal as a single Pi expression, which might serialize better. A final way might be to rebuild the goal state using Lean’s metaprogramming framework, but I don’t see a foolproof way to do that either.
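A sketch of the first two hacks side by side (my own example; the goal state shown is made up):

```lean
-- Suppose the serialized goal state was:
--   n : ℕ, h : 0 < n ⊢ 0 < n + 1

-- Hack 1: re-enter the local context as hypotheses of a fresh statement.
example (n : ℕ) (h : 0 < n) : 0 < n + 1 :=
begin
  exact nat.succ_pos n,  -- whatever candidate tactic you wanted to try
end

-- Hack 2 (the `revert_all` shape): serialize the whole state as one Pi-type goal.
example : ∀ (n : ℕ), 0 < n → 0 < n + 1 :=
begin
  intros n h,
  exact nat.succ_pos n,
end
```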

Jason Rute (Sep 21 2020 at 22:21):

Now, for all of the above, I’ve been assuming that you are using the Lean server in a way similar to my notebook where I try to serialize each goal and run one tactic on it in a new file. Another (trivial) approach would be to build the proof line by line. Obviously then to build a proof in parallel, you would have to re-run the proof up to the current goal (at least through that path in the proof tree) every time you want to expand the proof tree.

Jason Rute (Sep 21 2020 at 22:22):

Also, I’ve been experimenting with having Lean maintain the proof tree internally, letting you guide it with an external tool. This has the advantage that Lean can represent the tactic state exactly. However, that approach is not parallelizable, and parallelism is what you want.

Jason Rute (Sep 21 2020 at 22:22):

Now, it is possible there are other ways to do this. Lean can save tactic states and even backtrack, so I think internally the tactic state is a persistent data structure (although a big one if you include the whole environment). It might be possible to modify the C++ code (or the Lean4 code) to capture and serialize the state in such a way that you can return to it in another process, but I think you would have to build it.

Jason Rute (Sep 21 2020 at 22:28):

Of course, someone who knows more about the internals of Lean than I do might see an obvious way.

Jason Rute (Sep 21 2020 at 22:42):

@Stanislas Polu You might also be interested in this discussion. :point_up:

Vincent Siles (Sep 22 2020 at 06:46):

Thanks a lot, that's very valuable information. I'll get to know Lean better and see if your current path could be adapted to our needs, or if we need to start hacking on some C++ too. That's awesome!


Last updated: Dec 20 2023 at 11:08 UTC