Zulip Chat Archive

Stream: general

Topic: Segmenting a Lean 3 document


Clément (Dec 12 2020 at 00:53):

Hi all! I'm trying to extract a simplified AST of a Lean document, ideally using the --server API. IOW, I'd like to take a document like this:

#check nat
-- comment
example (p : Prop) : false -> p := begin intros H, apply false.elim, from H end

and fragment it like this (where […] indicates a "fragment"):

[#check nat]
-- comment
[example (p : Prop) : false -> p] := [begin] [intros H], [apply false.elim], [from H] [end]

I'm pretty flexible about exactly where the boundaries are, as long as individual tactics are separated from each other. While I can use the info query to get data at a given point, it doesn't give me the boundaries of each info span. I could look at widget boundaries returned by info, but each widget only includes a starting position, not a span. I guess I could dump the contents of the relevant info_manager on the C++ side, but I'm not sure whether there might be a simpler solution.

Jason Rute (Dec 12 2020 at 03:24):

My experience is that this is hard. I and others are working on a dataset which does this for proofs, but it required injecting code into lean to get the starts, (you could use the server, but I find it isn’t 100% reliable, and it is really slow). I also injected other more complicated code in every tactic to get the positions of the parameters. Finally I wrote a good-enough Lean parser to combine all this data to get the ends of the parameters. I’d be happy to share the dataset when it is ready for distribution.

Jason Rute (Dec 12 2020 at 03:25):

Of course I might have overlooked something obvious, so don’t let me stop you if you know what to do. I didn’t feel comfortable messing with the C++ but it might be the simplest approach in the end.

Jason Rute (Dec 12 2020 at 03:26):

Curious, what is your use case?

Jason Rute (Dec 12 2020 at 03:32):

My hope is that Lean 4 when it comes out is going to store all this nicely in some giant parse tree which can be inspected. (But that is just a hope, I have no idea what Lean 4 is capable of in this regard.)

Clément (Dec 13 2020 at 02:08):

Jason Rute said:

Curious, what is your use case?

I'm porting Alectryon to Lean (see the blog post for demos) ^^

Clément (Dec 13 2020 at 02:09):

Jason Rute said:

My hope is that Lean 4 when it comes out is going to store all this nicely in some giant parse tree which can be inspected. (But that is just a hope, I have no idea what Lean 4 is capable of in this regard.)

Yes, the team is working exactly that, so it should be much easier soon-ish

Clément (Dec 13 2020 at 02:09):

Jason Rute said:

Of course I might have overlooked something obvious, so don’t let me stop you if you know what to do. I didn’t feel comfortable messing with the C++ but it might be the simplest approach in the end.

AFAICT the way the C++ works for the info query is that it reparses all the way to the point that the query is for, so it would require hooking into the parser.

Clément (Dec 13 2020 at 02:10):

I've been spoiled by Coq's nice APIs ^^ https://github.com/ejgallego/coq-serapi

Alex J. Best (Dec 13 2020 at 02:36):

Clément said:

Jason Rute said:

Curious, what is your use case?

I'm porting Alectryon to Lean (see the blog post for demos) ^^

This sounds very cool! It's not exactly the same of course but some people have worked on html output within the goal view, e.g. you might be interested in https://github.com/leanprover-community/mathlib/pull/4718 and the "widgets" functionality more generally if you haven't seen it. The game of life output in that blog post seems similar to the lean sudoku output too https://github.com/TwoFX/sudoku

Rob Lewis (Dec 13 2020 at 02:37):

It would be really great to have something like Alectryon for Lean! But yeah, I suspect it will be a whole lot easier to implement in Lean 4. The closest Lean 3 comparison I know of is @Patrick Massot 's formatting tool (https://github.com/leanprover-community/format_lean), but it doesn't tokenize to the level you're looking for.

Jason Rute (Dec 13 2020 at 03:19):

The problem with server based approaches like format lean (and I think widgets) is that the lean server only records the start of the tactic and not the end. Also it doesn’t handle hierarchy and it can ignore the second of adjacent tokens if there is no space between them.

Bryan Gin-ge Chen (Dec 13 2020 at 04:11):

cc @Edward Ayers who created the widgets feature for Lean 3.

Patrick Massot (Dec 13 2020 at 09:50):

@Clément I think your work is extremely important. My format_lean project is a quick hack that was good enough for my immediate purpose but I'm fully aware it's not good enough for general purposes. I lack time (and courage) to work on it before the rise of Lean 4 since Lean 4 will open so many possibilities in a much nicer way. But I'm thrilled to see someone doing something for Lean 3.

You can also have a look at how the sphere eversion blueprint is setup. The main source here is a TeX document having a couple of TeX macros referencing Lean declarations (example). This is then compiled by plasTeX and produces things like this lemma statement (hover over the Lemma word to see a bunch of icons appearing). But right now the Lean links simply send you to the relevant line on GitHub where things are completely static. plaxTeX also generates the dependency graph/progress report but this is a somewhat orthogonal story. Of course I dream of a better integration, although there will never be a one to one correspondence between informal lemmas and the formalization.

Patrick Massot (Dec 13 2020 at 09:52):

If you can identity a C++ hook that would make your life much easier you can ask Gabriel Ebner. He is busy, we are all busy, but he made many dreams come true before. And he is currently the main maintainer of the community fork of Lean 3 (the official version of Lean 3 is completely frozen, for good reasons).

Edward Ayers (Dec 13 2020 at 11:25):

Hi not much more to add to this other than as far as I know there is no api for this and would require digging through the c++ but the tactic boundary information is in there somewhere but I could never figure out how the info request got it exactly. As mentioned above this will hopefully become really easy in Lean 4.

Clément (Dec 16 2020 at 08:15):

Thanks all for the great pointers! Here's what I've got at this point:

@Jason Rute your summary is excellent: in particular, if there's no space after a {, the focused goal list doesn't seem to be available through the --server API.

@Patrick Massot format_lean is great — I'm really sad I didn't hear about it two months ago when I was writing the Alectryon paper, it would have a great addition to the related work :/ I'm not sure how robust things can get in Lean3 without patching the C++. With Lean4, it should be essentially trivial. I'll get in touch with @Gabriel Ebner .

@Edward Ayers Thanks a lot for confirming that. Interestingly, it looks like the least-broken heuristic for Lean3 would be to use widget positions + look for specific markers. I'm used to complaining about Coq's parsing, but this is something else :laughter_tears:

Sebastien Gouezel (Dec 16 2020 at 08:43):

That's very nice!

(btw, the Lean proof in your file can be compressed to

lemma gauss (n : ) : 2 * Sum n = n * (n + 1) :=
begin
  induction n with n ih,
  { simp [Sum] },
  { simp [Sum, left_distrib, ih, succ_eq_add_one],
    ring }
end

if you are willing to use mathlib and its ring tactic)

Patrick Massot (Dec 16 2020 at 09:03):

Clément, I understand your frustration since I wrote format_lean, but I think we should be careful when expressing it. What we see is not a broken API, it's a non-existent API replaced by hacking something that was meant for a different purpose.

Patrick Massot (Dec 16 2020 at 09:06):

There is some issue with javascript by the way. If I click on two different bubbles then I can't close the first one without reloading the page..

Jason Rute (Dec 16 2020 at 13:23):

@Clément You wrote:

With Lean4, it should be essentially trivial.

I hope so, but I’m not sure yet if this is true. (For example, does it involve hooking into the Lean4 parser/compiler or is it stored already somewhere in the environment? Also, how does one handle macros?) When you learn how to do it, I hope you share what you find!

Sebastian Ullrich (Dec 16 2020 at 14:02):

It's not true yet because there is no step-through goal display at all yet, so no storage of that information either. But when that is implemented in the language server, it would be sensible to have an API function that compiles a single file, returning not only the Environment but also all metadata held by the elaborator for the language server. Then you should be able to do all this in a relatively short Lean 4 program.

Jason Rute (Dec 16 2020 at 14:04):

That would be awesome!

Kevin Buzzard (Dec 16 2020 at 14:27):

There will be big applications for this in teaching (both teaching Lean and teaching mathematics)

Gabriel Ebner (Dec 16 2020 at 14:34):

It would also be cool to have an alectron-rendered version of mathlib online.

Clément (Dec 16 2020 at 16:41):

Patrick Massot said:

Clément, I understand your frustration since I wrote format_lean, but I think we should be careful when expressing it. What we see is not a broken API, it's a non-existent API replaced by hacking something that was meant for a different purpose.

Wait, where did I talk about a broken API? Sorry if I seemed to be criticizing something! I meant to say that the hacks I was resorting to were ugly.

Rob Lewis (Dec 16 2020 at 17:24):

Don't worry, we're no strangers to ugly hacks here :smile: my favorite being the old implementation of src#where.where_cmd (it was improved when we switched to the community fork of Lean 3). It summoned information about the parser context out of nowhere.

Patrick Massot (Dec 16 2020 at 21:02):

Clément said:

Patrick Massot said:

Clément, I understand your frustration since I wrote format_lean, but I think we should be careful when expressing it. What we see is not a broken API, it's a non-existent API replaced by hacking something that was meant for a different purpose.

Wait, where did I talk about a broken API? Sorry if I seemed to be criticizing something! I meant to say that the hacks I was resorting to were ugly.

Great! No problem at all then.

Jason Rute (Jul 09 2021 at 00:23):

Given @Kevin Buzzard and @Clément's #announce > Alectryon port summer project, where they outline how the creation of Alectryon for Lean would go, I want to make some technical comments and share my experience. Quotes are @Clément's from the announcement.

Jason Rute (Jul 09 2021 at 00:23):

As expected, things should be easy with Lean 4 (because it will expose a proper abstract syntax tree with all relevant metadata).

See here for a prototype example by @Daniel Selsam of tactic proof recording to Lean 4. This example records goal states and tactics (as syntax). It might be easy (I don't know) to also include position information? Also, see the discussion in #Machine Learning for Theorem Proving > Lean gym for Lean 4 (and the end of #Machine Learning for Theorem Proving > Getting started with lean-gym), in particular the fact that in Lean 4, depending how you look at things, many tactic proofs will just be one outer tactic, with subproofs inside. (I don't know enough about Coq or Alectryon to know if this is a significant departure.)

Jason Rute (Jul 09 2021 at 00:23):

The key ingredients that Alectryon needs are 1: a way to query for the goals to display at a given point in a file; and 2: a way to segment a file into relevant "spans" or "sentences", each of which corresponds to one step in a proof.

Before I comment on Alectryon for Lean 3, I want to mention https://github.com/jasonrute/lean_proof_recording which was a big part of the lean-gptf project. It is currently the only way to fully record Lean 3 tactic proofs, doing both (1) and (2) . While for that project, we only need the goal and the corresponding tactic, we actually capture a lot more. We capture the entire tree (sort of an AST) of the tactic proof, taking into account nested tactics. We also capture position information (begin and end) as well as the tactic parameters. To be clear, I'm not suggesting you build Alectryon on top of lean_proof_recording. The latter is much too hacky. However, it may be helpful for comparison of results and for making mockups of Alectryon for Lean. It could also be helpful for finding interesting edge cases. Also, if you can get your proof recording changes into Lean, I might consider rebuilding lean_proof_recording to use those instead of the current Rube Goldberg machine that is lean_proof_recording.

Jason Rute (Jul 09 2021 at 00:24):

For 1, we propose to bypass the heuristics and make Lean dump its entire internal tables. For 2, we think it should (likely) work to make the parser dump the boundaries of each tactic as it processes a file. This way we get a well-parenthesized interval tree giving the boundaries of each tactic, which we can then use in conjunction with the internal tables to reconstruct the info that Alectryon needs. Then, it's just a matter of feeding that information to Alectryon, which is quite easy.

As for the boundaries, do you know if it is easy to get the end position out of the Lean C++ parser? I've tried looking (maybe not hard enough, since I'm not fluent in C++) and I am not sure I see where to get it? As for the internal tables, are you referring to the tables used by the language server?

Jason Rute (Jul 09 2021 at 00:24):

Here is a discussion of the different types of Lean tactics you will encounter. (I'm not as familiar with Coq, so I don't know how many are their as well.)

  • Most tactic commands are made of interactive tactics, like intros a b c.
  • There are also non-interactive tactics like tactic.swap which may or may not be handled a little differently in the backend.
  • There are two types of semicolon tactic combinators, e.g. intro; simp; refl and intro; [simp, tidy]. Note, that the Lean language server uses the first semicolon of the list as the location of the "start position" of the tactic for the purpose of displaying the goal. (That presented a big pain in lean_proof_recording.)
  • You can have solve1 tactics which focus on and solve the first goal, e.g. {simp, refl}. One can also use the begin...end syntax for this.
  • Alternative combinators, refl <|> simp try the first tactic, and if it fails applies the second. IIRC, the <|> doesn't show up in the Lean server position information, only the individual tactics (and only the second tactic if the first fails).
  • The calc tactic, is mostly syntactic sugar for other stuff and is handled differently. IIRC, it doesn't not get any position information in the Lean server.
  • There are also interactive conversion tactics, which are tactics in another monad (used with the conv tactic). Their state uses | instead |-.

Jason Rute (Jul 09 2021 at 00:24):

Also, there can be a lot of nesting of tactics. Here are a few discussion points:

  • For semicolon combinators, e.g. induction n n ih; simp, the second tactic can be applied to more than one goal state, so there isn't necessarily a single goal state associated with, say, simp in this case. (In other rare cases, the final tactic in the chain isn't even executed.)
  • It isn't uncommon to see a new tactic proof started in the middle of another tactic proof, like rw [by foo] where by foo is a new tactic proof. This is most apparent in the calc tactic, but you see it other places.
  • Of course there are tactic combinators like try and repeat and conv followed by {...} or begin...end blocks of tactics.
  • I assume you are not interested in parsing term proofs, but Lean mixes term proofs and tactic proof fairly readily, including inside the same proof. I don't know a good way to record term proofs with position information.

Jason Rute (Jul 09 2021 at 00:24):

Hope this helps!

Mario Carneiro (Jul 09 2021 at 00:56):

@Jason Rute I think the lean 3 AST export will be helpful for you (and other ML researchers) in particular. I see it as a souped up version of lean_proof_recording.

Clément (Jul 10 2021 at 13:09):

Jason Rute said:

Hope this helps!

It's incredibly helpful, yes! Thanks so much for taking the time to share these thoughts, it will help a whole lot ^^ Some brief answers below.

As for the boundaries, do you know if it is easy to get the end position out of the Lean C++ parser?

No, I think that info is not does no exist. For symbols it's OK (we can use word boundaries), and for tactics that's where we want to hook into the parser, so we can raise some sort of begin/end events every time we start/end to parse a tactic.

As for the internal tables, are you referring to the tables used by the language server?

I was thinking of the m_line_data of the info_manager.

Here is a discussion of the different types of Lean tactics you will encounter. (I'm not as familiar with Coq, so I don't know how many are their as well.) […]

Thanks, that's super helpful. Coq has all the facilities you listed, I think, though they are used a bit differently because of Coq's hard-baked notion of "sentence" (basically, one meaningful proof step, implemented as one or multiple tactics).

One basic design decision in Alectryon has been to assume that any goal that's not readily accessible from an IDE isn't super important, so e.g. for repeat we only show the final goal, not any of the intermediate ones. Same for a ;-sequence, where we show the end result but not the evolution of the goals through the pipeline. It's common in Coq to actually structure proofs in this way, with interesting intermediate states at the end of sentences (.-separated in Coq). So for example in Coq you would do induction x; simpl; intros; eauto. and Alectryon will only show the goals that were not solved by eauto, with the assumption that the others were not interesting. But if you care to show all the goals, then you could write induction x; simpl; intros. all: eauto. and then Alectryon will show all goals after intros and then only the remaining ones after eauto.

Jason Rute (Jul 10 2021 at 14:02):

As for end positions, I meant do you see where you can get it from the parser even? It seems very obfuscated. The fact that @Mario Carneiro ‘s project also is currently missing end positions reenforces that in my mind, but it must be possible in principle at least.

Mario Carneiro (Jul 10 2021 at 15:59):

The parser is already too late to get end positions, the data about where tokens end is known only to the scanner and discarded immediately. Adding end positions to everything would be quite a bit of work, but adding it only in a few places like the comma after a tactic wouldn't be so bad

Jason Rute (Jul 10 2021 at 17:29):

The hardest thing in my experience is getting the end of a by tactic block. Imagine by exact ….


Last updated: Dec 20 2023 at 11:08 UTC