Zulip Chat Archive

Stream: lean4

Topic: Lean fails to parse simple language


Jason Ganz (Nov 25 2023 at 13:08):

I'm defining a DSL, but Lean is having trouble parsing it. I've reduced the problem to a minimal test case:

declare_syntax_cat thing
syntax "thing% " thing* : term
syntax "a" thing "b" thing : thing
syntax thing "b" thing : thing
syntax ident : thing
example := thing% a x b x x

I get an expected 'b' error on the third x. But there is a correct parse as (a x b x) x. What's going on? Can this be fixed, to have it correctly parse my context-free language?

James Gallicchio (Nov 25 2023 at 13:24):

I think this language is still ambiguous if you have something like a x b x b x, right? If I make it unambiguous by making x b x too low a precedence to occur in the left of a _ b _, Lean parses fine:

declare_syntax_cat thing
syntax "thing% " thing : term
syntax:100 "a" thing:100 "b" thing : thing
syntax:90 thing "b" thing : thing
syntax:100 ident : thing

James Gallicchio (Nov 25 2023 at 13:31):

although I'm not sure how to get the left recursive variant of this working...

declare_syntax_cat thing
syntax "thing% " thing : term
syntax:100 "a" thing:90 "b" thing:100 : thing
syntax:90 thing:100 "b" thing:90 : thing
syntax:100 ident : thing

Jason Ganz (Nov 25 2023 at 13:33):

I see, the language is in fact ambiguous, but I didn't realize this would prevent Lean from parsing even unambiguous expressions. Thanks for the info!

James Gallicchio (Nov 25 2023 at 13:39):

I'm not sure how ambiguity has to do with what grammars Lean can/can't parse. You can definitely parse ambiguous grammars, and unambiguous grammars can definitely not parse the way you want (see above). So someone with more knowledge than me should chime in.

I think the _ b _ rule needs some extra combinators or something. I'm not sure how left recursion is handled.

Mac Malone (Nov 28 2023 at 00:36):

@Jason Ganz For your original example, the way Lean will end up parsing a x b x x as a thing is as follows:

  • As thing is a syntax category, it will select the longest matching parse of its alternatives (if equal, chooses the successful parse with the highest priority). Lean parses category syntax via a single leading parser and many optional trailing parsers. Trailing parser are those whose LHS is recursion (e.g., thing "b" thing). This approach is Pratt parsing.
  • As the expression starts with the symbol a, the only possible leading parser is "a" thing "b" thing.
  • It will then recurse on the first thing in "a" **thing** "b" thing.
  • Since the next token is x, the only possible leading parser is ident.
  • Finished with the level 2 leadingthing, Lean will then attempt to match zero or more trailing "b" thing (i.e., parse many thing "b" thing) .
  • It matches the b and recurses on the following thing.
  • Once again, since the next token is x, the only possible leading parser is ident.
  • As there are no longer any matching trailing parsers (i.e., more trailing "b" thing), Lean will stop and work back up the parse call stack.
  • Prior to this at failure this point, Lean has parsed a x b x and the parse call stack is currently: "a" **thing** "b" thing -> thing "b" **thing** (LHS parsed as the ident, x) -> ident.
  • It thus works back up to "a" **thing** "b" thing with the finished thing being x b x.
  • Now, the next required token is b, but the next token in the expression is x, hence the error.

Mac Malone (Nov 28 2023 at 00:46):

More visually:

 a    x     b    x       x
----------------------------------
"a" thing              ~"b" thing~
      |
    ident ["b" thing] [~"b" thing~]
                 |
               ident  [~"b" thing~]

[..] is the RHS of the attempted trailing parser, and ~...~ is a failed match.


Last updated: Dec 20 2023 at 11:08 UTC