Zulip Chat Archive
Stream: lean4
Topic: Lean fails to parse simple language
Jason Ganz (Nov 25 2023 at 13:08):
I'm defining a DSL, but Lean is having trouble parsing it. I've reduced the problem to a minimal test case:
declare_syntax_cat thing
syntax "thing% " thing* : term
syntax "a" thing "b" thing : thing
syntax thing "b" thing : thing
syntax ident : thing
example := thing% a x b x x
I get an expected 'b'
error on the third x
. But there is a correct parse as (a x b x) x
. What's going on? Can this be fixed, to have it correctly parse my context-free language?
James Gallicchio (Nov 25 2023 at 13:24):
I think this language is still ambiguous if you have something like a x b x b x
, right? If I make it unambiguous by making x b x
too low a precedence to occur in the left of a _ b _
, Lean parses fine:
declare_syntax_cat thing
syntax "thing% " thing : term
syntax:100 "a" thing:100 "b" thing : thing
syntax:90 thing "b" thing : thing
syntax:100 ident : thing
James Gallicchio (Nov 25 2023 at 13:31):
although I'm not sure how to get the left recursive variant of this working...
declare_syntax_cat thing
syntax "thing% " thing : term
syntax:100 "a" thing:90 "b" thing:100 : thing
syntax:90 thing:100 "b" thing:90 : thing
syntax:100 ident : thing
Jason Ganz (Nov 25 2023 at 13:33):
I see, the language is in fact ambiguous, but I didn't realize this would prevent Lean from parsing even unambiguous expressions. Thanks for the info!
James Gallicchio (Nov 25 2023 at 13:39):
I'm not sure how ambiguity has to do with what grammars Lean can/can't parse. You can definitely parse ambiguous grammars, and unambiguous grammars can definitely not parse the way you want (see above). So someone with more knowledge than me should chime in.
I think the _ b _
rule needs some extra combinators or something. I'm not sure how left recursion is handled.
Mac Malone (Nov 28 2023 at 00:36):
@Jason Ganz For your original example, the way Lean will end up parsing a x b x x
as a thing
is as follows:
- As
thing
is a syntax category, it will select the longest matching parse of its alternatives (if equal, chooses the successful parse with the highest priority). Lean parses category syntax via a single leading parser and many optional trailing parsers. Trailing parser are those whose LHS is recursion (e.g.,thing "b" thing
). This approach is Pratt parsing. - As the expression starts with the symbol
a
, the only possible leading parser is"a" thing "b" thing
. - It will then recurse on the first thing in
"a" **thing** "b" thing
. - Since the next token is
x
, the only possible leading parser isident
. - Finished with the level 2 leading
thing
, Lean will then attempt to match zero or more trailing"b" thing
(i.e., parse manything "b" thing
) . - It matches the
b
and recurses on the followingthing
. - Once again, since the next token is
x
, the only possible leading parser isident
. - As there are no longer any matching trailing parsers (i.e., more trailing
"b" thing
), Lean will stop and work back up the parse call stack. - Prior to this at failure this point, Lean has parsed
a x b x
and the parse call stack is currently:"a" **thing** "b" thing
->thing "b" **thing**
(LHS parsed as theident
,x
) ->ident
. - It thus works back up to
"a" **thing** "b" thing
with the finishedthing
beingx b x
. - Now, the next required token is
b
, but the next token in the expression isx
, hence the error.
Mac Malone (Nov 28 2023 at 00:46):
More visually:
a x b x x
----------------------------------
"a" thing ~"b" thing~
|
ident ["b" thing] [~"b" thing~]
|
ident [~"b" thing~]
[..]
is the RHS of the attempted trailing parser, and ~...~
is a failed match.
Last updated: Dec 20 2023 at 11:08 UTC