Zulip Chat Archive

Stream: lean4

Topic: About Tokens (and the Token Cache)


Mac (May 14 2021 at 23:59):

In digging through the Lean 4 source code and trying to lean from it, I have stumble upon the token cache. It caches the mostly recently parsed Token so that it can be reused again later (for performance reasons). It also used to do parser selection in the Pratt parser. This is described in detail in the sources here: https://github.com/leanprover/lean4/blob/7ca2f70c2f977ca42c9df1acea86a9aa4e872928/src/Lean/Parser/Basic.lean#L34-L41.

In standard Lean, (from what I can tell), a token is one of the following primitives:

  • A string literal (e.g. "hello!\n")
  • A character literal (e.g., 'a' or '\n')
  • A numeric literal (e.g., 123, 0b0101, 0oAC, or 0xAF)
  • A scientific literal (e.g., 1e4)
  • A name literal (e.g., foo , «+» )
  • An identifier (e.g., foo, «+»)
  • A user-defined custom symbol (e.g., )

My question is: Why not just cache the head Syntax element (e.g., atom/ident/node) instead of having a separate notion of a Token? Alternatively, why not have Token be a category parser so it can be easily altered/extended?

I imagine there are a performance and complexity reasons for this choice and I would like to learn what they are. Also I am curious whether or not having more fine grained control over tokens is something the broader Lean community would be interested in or if this a more niche desire.

Mario Carneiro (May 15 2021 at 00:03):

From what I can tell from the sources, a token is a Syntax

Mac (May 15 2021 at 00:12):

@Mario Carneiro ????

The type Token is an abbreviation for String: https://github.com/leanprover/lean4/blob/7ca2f70c2f977ca42c9df1acea86a9aa4e872928/src/Lean/Parser/Basic.lean#L94

If you mean that tokens are parsed as a Syntax --- this is true. But only a subset of Syntax is a Token and that subset is hard coded into the way the parsers work, creating trouble when custom syntax clashes with it.

Mac (May 15 2021 at 00:14):

See one of my previous Zulip threads on Digits in Custom Syntax or Lean 4 GitHub issue #242 for examples of where this problem pops up.

Mario Carneiro (May 15 2021 at 00:24):

I was referring to tokenFn, which doesn't seem to deal directly with the Token type and just parses a Syntax

Mac (May 15 2021 at 00:30):

Ah, but tokenFn doesn't necessarily parse a syntax. It first tries to pop one from the token cache. Only if there is none does it parse one (through tokenFnAux). And tokenFnAux is hard-coded to parse the specific set of Syntaxes that constitutes a token. There is no way to alter this set, and all higher parsers and combinators boil down to tokens (outside of a few specific ones).

Mario Carneiro (May 15 2021 at 00:31):

yes, that sounds right

Mario Carneiro (May 15 2021 at 00:32):

I imagine that there is some performance cost to making the tokenizer dynamic

Mario Carneiro (May 15 2021 at 00:33):

but you will have to wait for @Sebastian Ullrich to give an authoritative answer

Mario Carneiro (May 15 2021 at 00:33):

I don't think the token cache matters that much, it's just memoizing the function

Mac (May 15 2021 at 00:35):

Mario Carneiro said:

I imagine that there is some performance cost to making the tokenizer dynamic

That's probably true. I'm curious as to what those are, especially considering that from the discussion in Digits in Custom Syntax that local token sets on the roadmap somewhere in the future. So it doesn't seem the burden is too insurmountable or the idea too far off the beaten path.
If it's not. It's something I might want to take a look into playing with (and perhaps implementing a way myself to make it more dynamic).

Mac (May 15 2021 at 00:38):

Mario Carneiro said:

I don't think the token cache matters that much, it's just memoizing the function

True, but its clearly important for performance reasons and for the Pratt parser. Thus preserving the notion of a cacheable head Token seems important. That is why I would think one possible way of making the cache more dynamic is generalizing it to cache a general head Syntax instead of restricted subset of a Token.

Mario Carneiro (May 15 2021 at 00:38):

But there is no such constraint

Mario Carneiro (May 15 2021 at 00:39):

the token cache stores a Syntax

Mac (May 15 2021 at 00:39):

True, the cache, by its definition already supports this.

Mario Carneiro (May 15 2021 at 00:39):

it's literally just the return value of tokenFn at some particular recent argument

Mac (May 15 2021 at 00:39):

The question is making the the parser that uses it more dynamic.

Mac (May 15 2021 at 00:40):

i.e. making what tokenFnAux parses dynamic.

Mario Carneiro (May 15 2021 at 00:40):

Sure, that would entail putting an additional indirection in front of tokenFnAux to use parserCategory

Mac (May 15 2021 at 00:40):

Yep, that is what I'm thinking. I'm curious what the performance concerns there would be.

Mario Carneiro (May 15 2021 at 00:41):

I'm 100% sure this is not something they would trust anyone else to implement though

Mac (May 15 2021 at 00:42):

That's fair. Regardless, though, if its a viable route I might play around with it on my own.

Mac (May 15 2021 at 00:47):

Also, I am not going ahead a toying with it myself right now as I imagined others might (like you) might have good insight as to complications and considerations that would probably be useful to hear before prototyping.

Mac (May 15 2021 at 00:48):

Furthermore, this discussion might prove useful to the Lean 4 developers whenever they get around to similar features in the future.

Mac (May 15 2021 at 00:50):

In essence, I am trying to model this discussion of the first step of the new Contribution Guidelines. Mostly because I suspect this feature is far on the backburner, but something at least I am very interested in.

Mario Carneiro (May 15 2021 at 00:50):

As far as I can tell there isn't any fundamental issue with making tokenFn dynamic. There isn't even really a well defined tokenization step in the parser, tokenFn is just one particularly low level parser and all parsers are ultimately operating on the source string

Mac (May 15 2021 at 00:54):

Mario Carneiro said:

As far as I can tell there isn't any fundamental issue with making tokenFn dynamic.

I was getting a similar feeling, but I am asking here and to see if there were things I was missing or if I was right, and also to see what recommendations there were for approaching this issue.

Thanks, by the way, for your input! :smile:

Mac (May 15 2021 at 00:57):

Mario Carneiro said:

Sure, that would entail putting an additional indirection in front of tokenFnAux to use parserCategory

One thing to note here though is that category parsers are Pratt parsers which in Lean, according to the documentation, use the head token to select alternatives. Thus parsing tokens using the current category parser implementation would cause a cycle (I believe).

Mario Carneiro (May 15 2021 at 01:26):

true, you would need a more simple minded selection procedure like first [p1, p2, ...] (maybe with priorities) because there is nothing to index a token parser on

Mac (May 15 2021 at 01:28):

Luckily, this does not need to be done entirely from scratch, the current Pratt parser implementation already supports non-indexed parser (in a list) alongside token-indexed ones, so it would just need to be a stripped down version that doesn't have the token-indexed option.

Mac (May 15 2021 at 01:34):

Whether or not the Pratt parser uses the index could even be an (optional) parameter to parser (that is passed on down)

Mario Carneiro (May 15 2021 at 01:35):

you probably want to skip the pratt parser stuff though, since I don't think leading / trailing parsers makes a lot of sense

Mac (May 15 2021 at 01:39):

true, it could probably just use the longest match parser for disambiguation

Mac (May 15 2021 at 02:00):

Using a non-Pratt parser for category would require augmenting ParserCategory to have two alternatives pratt and longest (or token/simple). Luckily, Lean already seems designed for such an option and this seems like a feasible extension.

Mac (May 15 2021 at 02:14):

though, thinking about it more, it should definitely somehow integrate the token trie.


Last updated: Dec 20 2023 at 11:08 UTC