Zulip Chat Archive

Stream: general

Topic: AST export for lean 3


Mario Carneiro (Jul 09 2021 at 00:53):

I've been working with @Daniel Selsam on mathport at MS, and as part of it there is now a complete AST export of lean 3 available at https://github.com/leanprover-community/lean/compare/master...digama0:ast . You can see the generated grammar at https://github.com/dselsam/mathport/blob/master/Mathport/AST3.lean . What makes this most interesting is its completeness: I made sure not to leave out anything, and it has been tested on lean core + mathlib. For example, here is the concrete syntax you can use in expr position in a lean file:

  inductive Expr
    | «...» : Expr
    | «sorry» : Expr
    | «_» : Expr
    | «()» : Expr
    | «{}» : Expr
    | ident : Name  Expr
    | const (ambiguous : Bool) : #Name  Levels  Expr
    | nat : Nat  Expr
    | decimal : Nat  Nat  Expr
    | string : String  Expr
    | char : Char  Expr
    | paren : #Expr  Expr
    | sort (isType isStar : Bool) : Option #Level  Expr
    | app : #Expr  #Expr  Expr
    | «fun» (isAssume : Bool) : Binders  #Expr  Expr
    | «→» : #Expr  #Expr  Expr
    | Pi : Binders  #Expr  Expr
    | «show» : #Expr  #Proof  Expr
    | «have» (suff : Bool) : Option #Name  #Expr  #Proof  #Expr  Expr
    | «.» (compact : Bool) : #Expr  #Proj  Expr
    | «if» : Option #Name  #Expr  #Expr  #Expr  Expr
    | «calc» : Array (#Expr × #Expr)  Expr
    | «@» («partial» : Bool) : #Expr  Expr
    | pattern : #Expr  Expr
    | «`()» (lazy expr : Bool) : #Expr  Expr
    | «%%» : #Expr  Expr
    | «`[]» : Array #Tactic  Expr
    | «`» (resolve : Bool) : Name  Expr
    | «⟨⟩» : Array #Expr  Expr
    | infix_fn : Choice  Option #Expr  Expr
    | «(,)» : Array #Expr  Expr
    | «:» : #Expr  #Expr  Expr
    | hole : Array #Expr  Expr
    | «#[]» : Array #Expr  Expr
    | «by» : #Tactic  Expr
    | begin : Block  Expr
    | «let» : Binders  #Expr  Expr
    | «match» : Array #Expr  Option #Expr  Array Arm  Expr
    | «do» (braces : Bool) : Array #DoElem  Expr
    | «{,}» : Array #Expr  Expr
    | subtype (setOf : Bool) : #Name  Option #Expr  #Expr  Expr
    | sep : #Name  #Expr  #Expr  Expr
    | setReplacement : #Expr  Binders  Expr
    | structInst (ty : Option #Name) (src : Option #Expr)
      (fields : Array (#Name × #Expr)) (srcs : Array #Expr) (catchall : Bool) : Expr
    | atPat : #Expr  #Expr  Expr
    | «.()» : #Expr  Expr
    | «notation» (n : Choice) : Array #Arg  Expr
    | userNotation (n : Name) : Array #Param  Expr

Jason Rute (Jul 09 2021 at 01:11):

Exciting!

  • So it parses lean 3 files into a JSON grammar which you then read into Lean 4?
  • How does one run it?
  • Will it be kept up to date with newer versions of Lean 3, or added as an option to a new release of Lean 3?
  • Does it capture internal state information? That is the big need for lean_proof_recording. (Although that is also the easier step. The harder step is getting the tactic commands which this seems to do.)
  • Does it capture the position information? (I currently need that for lean proof recording, to align internal state info that I've dumped, and it is also nice for a number of other reasons. I think Alectryon needs it for their project, but I don't know if this is directly useful to Alectryon or not.)

Jason Rute (Jul 09 2021 at 01:13):

*Also, is there already a dump of the data or should one run it to get the data?

Mario Carneiro (Jul 09 2021 at 01:18):

So it parses lean 3 files into a JSON grammar which you then read into Lean 4?

Yes. If you don't want to use lean 4 you can also use those mathport files (especially Parse.lean) as a specification of the JSON schema. It's mostly stable now but it might still change in the near future if any bugs are discovered.

How does one run it?

If you compile the linked ast fork of lean 3, you can use the --ast option in conjunction with --make to produce the AST of any file that is read from source. (Note that you will have to kill the oleans first, because files with oleans already are never read or parsed, so you don't get any AST output.) It produces foo.ast.json files alongside foo.lean and foo.olean.

Will it be kept up to date with newer versions of Lean 3, or added as an option to a new release of Lean 3?

I hope to upstream it to community lean at some point. It adds some to the memory usage, at least in theory, but nothing that I have noticed. But I haven't done any detailed profiling to be sure. It may need to be behind a compile flag in the worst case.

Does it capture internal state information? That is the big need for lean_proof_recording. (Although that is also the easier step. The harder step is getting the tactic commands which this seems to do.)

That is a planned extension. Currently it only captures the pexpr before and the expr after, when there is an expr attached to that piece of syntax.

Does it capture the position information? (I currently need that for lean proof recording, to align internal state info that I've dumped, and it is also nice for a number of other reasons. I think Alectryon needs it for their project, but I don't know if this is directly useful to Alectryon or not.)

Yes! All the # you see in the inductive type I just posted are spans (begin and end), although right now the end is always none (lean 3 does not capture any token end information and I only added a hook to support it eventually).

Mario Carneiro (Jul 09 2021 at 01:24):

There is not a dump available, but it's not very hard to run it. It's basically the same as regular compiling, I just don't like compiling mathlib unless I have to

Mario Carneiro (Jul 09 2021 at 01:39):

Does it capture internal state information? That is the big need for lean_proof_recording. (Although that is also the easier step. The harder step is getting the tactic commands which this seems to do.)

That is a planned extension. Currently it only captures the pexpr before and the expr after, when there is an expr attached to that piece of syntax.

Specifically, the reason for doing all this is so that we can have very accurate information about how a lean 3 proof was elaborated when reconstructing corresponding lean 4 syntax, so that we can do things like insert missing arguments, change argument order or other things in order to acommodate lean 4 elaboration changes. Ultimately we want to have lean 3 -> lean 4 porting which can achieve a high success rate even on syntax-ported tactic proofs by understanding what's going on on both sides

Gabriel Ebner (Jul 09 2021 at 07:21):

Great work! I should probably just run it myself, but I wonder what happens with the super-hacky stuff like open_locale that call the parser on a generated string in the middle of the command? Do you get both the user command and the local notation command (with dangling ranges)?

Mario Carneiro (Jul 09 2021 at 11:43):

We get the user command, but do not track additions coming from the called command. More specifically, the call will produce AST nodes but they are not attached to the tree so they are skipped during export

Johan Commelin (Jul 09 2021 at 11:49):

That seems like it might cause problems for the port, right? Without open_locale nat, there will be an error when parsing n!.

Mario Carneiro (Jul 09 2021 at 11:50):

Actually that's not quite true. Item parsers like small_nat will drop their AST nodes on the floor, but commands are added to a global data structure, so if you use with_input to read a full command it will appear after the user command, with messed up position info

Mario Carneiro (Jul 09 2021 at 11:51):

The port will add support for individual commands when the data we receive is not sufficient. We need this anyway since we aren't running a lean 3 VM, so tactics will need to be rewritten

Mario Carneiro (Jul 09 2021 at 11:51):

So if some user command does something really weird, we can just match on it and clean up the result

Johan Commelin (Jul 09 2021 at 11:52):

I'm really excited. This exporter seems like a pretty massive step forward!

Mario Carneiro (Jul 09 2021 at 11:54):

Johan Commelin said:

That seems like it might cause problems for the port, right? Without open_locale nat, there will be an error when parsing n!.

Actually this isn't as much of a problem as you might think. We already know it is the factorial function being applied, because lean 3 parsed it. So we can output n! if we see the notation in scope and nat.fact n otherwise, or we can play it fast and loose and always output n!, as well as the open_locale nat line and make sure there is a working lean 4 equivalent

Mario Carneiro (Jul 09 2021 at 11:56):

We'll probably want to convert uses of open_locale to scoped notations anyway

Gabriel Ebner (Jul 11 2021 at 13:01):

I hope to upstream it to community lean at some point.

Please PR as soon as you have time!

Jason Rute (Sep 30 2021 at 10:16):

@Mario Carneiro I noticed this was recently added to Lean (since it caused lean_proof_recording to break. :smile:) just want to check that this still doesn’t include the end position, right? If it did, it could be a more robust replacement to lean_proof_recording. (I guess even if it doesn’t have end positions we could try taking the AST and converting into Lean 3 code, similar to how you already covert it into Lean 4 code. It wouldn’t be character-for-character the same as the user inputted code, but would also have more syntactic hierarchical structure.). cc @Stanislas Polu

Mario Carneiro (Sep 30 2021 at 10:17):

I was just having a conversation about this with Stan. End positions have been added for tactics, at least, since lean#612

Eric Rodriguez (Sep 30 2021 at 10:18):

in my experience, those end positions are accurate, but I didn't check with every possible tactic

Eric Rodriguez (Sep 30 2021 at 10:18):

It works fairly well with your proof-recording.lean file though (that's been my major testcase with alectryon! very good file :] )

Jason Rute (Sep 30 2021 at 10:20):

Oh, Stan told me this in another convo and I misread it.

Stanislas Polu (Sep 30 2021 at 10:22):

@Jason Rute one potential missing piece to recover lean_proof_recording capabilities, as discussed with Mario, would be to add a pp_expr list along with the tactic state that provides pretty-printed version of the expressions in the tactic state, so that its reconstruction for our ML purposes is straightforward. We could have default options + named specific set of options as needed (as an example for us I think we would just want to add pp.full_names).

Stanislas Polu (Sep 30 2021 at 10:24):

(Won't be able to get to this myself before early November, but definitely plan to do it by then)


Last updated: Dec 20 2023 at 11:08 UTC