Zulip Chat Archive

Stream: lean4

Topic: Error with Digits in Custom Syntax

Mac (Apr 20 2021 at 18:04):

The following example errors unexpectedly:

syntax bit := "0" <|> "1"
syntax "bor" bit bit : term
macro_rules
  | `(bor 0 1) := `(true) -- Errors unexpectedly on the 0
/-
  expected '0' or '1'
-/

What is going on here? (I am running the latest Lean nightly: commit aaca889bea02.)
The problem is also not with macro_rules as #check bor 0 1 also errors in the same way.

Sebastian Ullrich (Apr 20 2021 at 19:54):

User-given tokens are registered in a (currently global) token trie; however, that trie is tried (heh) only after built-in token types such as numerals, so 0 is never parsed as your new token. I suppose the other way around wouldn't be great either since then (0 : Nat) wouldn't work anymore below your syntax declaration.

Sebastian Ullrich (Apr 20 2021 at 20:00):

For great DSL support we will eventually need local token sets, including the ability to disable built-in token kinds. There are also parsers such as raw that can circumvent the tokenizer, but these are not exposed to the syntax abstraction yet.

Mac (Apr 21 2021 at 21:21):

Sebastian Ullrich said:

User-given tokens are registered in a (currently global) token trie; however, that trie is tried (heh) only after built-in token types such as numerals, so 0 is never parsed as your new token.

Why not have the user-given tokens and built-in tokens in the same (global) trie? Or at least merge them into the same trie after bootstrapping? That sounds to me like the easiest fix, but I imagine I am missing some big problem with that.

Sebastian Ullrich (Apr 22 2021 at 07:41):

Tokenization needs to be unambiguous, in the end it has to return either the old token kind or the new one, breaking the syntax using the other one. I wouldn't know how to support overloaded tokens, it doesn't really make sense in our architecture.

Mac (Apr 22 2021 at 16:09):

Sebastian Ullrich said:

Tokenization needs to be unambiguous, in the end it has to return either the old token kind or the new one, breaking the syntax using the other one. I wouldn't know how to support overloaded tokens, it doesn't really make sense in our architecture.

Maybe I am greatly misunderstanding how all of this works under the hood, but aren't tokens already overloaded/overridden? For instance, in my previous example the token bor is now processed as a keyword instead of as identifier, correct? Doesn't that involve some form of token overriding? Why can't the same process be applied to numerals (and potentially other builtin syntax)?

Sebastian Ullrich (Apr 22 2021 at 16:19):

Yes, overriding is fine. After you've declared bor, you cannot use it as an identifier anymore because we prioritize keywords over identifiers. Just like you wouldn't be able to use 0 as a numeral literal anymore after your syntax if we prioritized symbols over them, instead of the reverse as it is right now. In either case, one syntax becomes inaccessible. There is no overloading on the token level.

Mac (Apr 22 2021 at 17:23):

Sebastian Ullrich said:

Yes, overriding is fine. After you've declared bor, you cannot use it as an identifier anymore because we prioritize keywords over identifiers. Just like you wouldn't be able to use 0 as a numeral literal anymore after your syntax if we prioritized symbols over them, instead of the reverse as it is right now. In either case, one syntax becomes inaccessible. There is no overloading on the token level.

Ah, well then, why not prioritize symbols over numeral literals? It would only come in to play when the user defines a numeral token, in which case they probably do want to override the standard syntax. And, for instance, if I want to recover the numeric literal behavior for bit in my example, all I would need do is:

syntax bit : term
macro_rules
  | `($b:bit) => `(OfNat.ofNat (nat_lit $b))

Correct?

Mac (Apr 22 2021 at 18:13):

Also, there does seem to be at least some token overloading occurring. The tokens name, kind, priority for parameters in things like notation and macro_rules do not eat up prevent those symbols from being also tokenized as identifiers.

Sebastian Ullrich (Apr 22 2021 at 20:26):

Mac said:

And, for instance, if I want to recover the numeric literal behavior for bit in my example, all I would need do is:
syntax bit : term
macro_rules
  | `($b:bit) => `(OfNat.ofNat (nat_lit $b))
Correct?

No, a bit is not a valid num. It could be made to work, but it doesn't help because every nested use of the num parser such as in id.{0} would still break.

Mac said:

Also, there does seem to be at least some token overloading occurring. The tokens name, kind, priority for parameters in things like notation and macro_rules do not eat up prevent those symbols from being also tokenized as identifiers.

Yes, these are using the nonReserved parser: https://github.com/leanprover/lean4/blob/5902fe17460cf816fe3d8b470f806504569803c4/src/Lean/Parser/Basic.lean#L1146-L1152. The analogue in your case would be a parser that parses a num and then raises a parse error if the value of that token is neither 0 nor 1. But you cannot write this in the syntax DSL.

Sebastian Ullrich (Apr 22 2021 at 20:43):

By the way, the main reason for using a separate parser type in syntax is that we do not want to import half of the Lean lib into each file. This may be solved with https://github.com/leanprover/lean4/issues/416, but that is still a pretty big "may".

Mac (Apr 22 2021 at 23:11):

Interesting! Thanks for helping me with my dumb questions. I learned a lot!

Sebastian Ullrich (Apr 23 2021 at 07:15):

Oh, these were very justified questions! I wished this part of the system was as flexible and open to users as e.g. the macro system, though I also hope that most people's syntaxes will not run into such edge cases :) .

Last updated: May 02 2025 at 03:31 UTC