Zulip Chat Archive

Stream: lean4

Topic: char trie data structure


Joachim Breitner (Sep 08 2023 at 16:49):

In the context of #find I think I can use a Char trie data structure, and I noticed that lean already has one: https://leanprover-community.github.io/mathlib4_docs/Lean/Data/Trie.html#Lean.Parser.Trie
At first glance it seems less optimized for queries and memory compactness, and more for frequent updates, than I'd expect.
It seems it's used by the Lean parser. Does anyone have a sense whether its query performance is on the hot path, and whether it may be useful to optimize for query speed and memory size (at the expense of insertion speed)? (I'd avoid the nested search tree on Char at each node and instead go for a flat unboxed sorted array of bytes to do binary search on. And investigate whether path compression is worth it.)
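
Roughly the kind of node layout I have in mind (just a sketch, not the existing Lean.Parser.Trie; all names here are made up):

```lean
/-- Sketch of a byte-keyed trie node: the outgoing edge labels live in a
sorted, unboxed `ByteArray`, with the children in a parallel `Array`, so
choosing a branch is a binary search over packed bytes rather than a descent
into a nested search tree. -/
inductive ByteTrie (α : Type) where
  | node (val : Option α) (keys : ByteArray) (children : Array (ByteTrie α))

namespace ByteTrie

/-- Binary search for `b` in the sorted range `[lo, hi)` of `keys`. -/
partial def findKey? (keys : ByteArray) (b : UInt8) (lo hi : Nat) : Option Nat :=
  if lo < hi then
    let mid := (lo + hi) / 2
    let kb := keys.get! mid
    if kb == b then some mid
    else if kb < b then findKey? keys b (mid + 1) hi
    else findKey? keys b lo mid
  else none

/-- Follow the bytes of `k` starting at index `i`; return the stored value, if any. -/
partial def find? (t : ByteTrie α) (k : ByteArray) (i : Nat := 0) : Option α :=
  match t with
  | .node val keys children =>
    if i < k.size then do
      let j ← findKey? keys (k.get! i) 0 keys.size
      let c ← children[j]?
      c.find? k (i + 1)
    else
      val

end ByteTrie
```

Path compression would then amount to storing a byte string per edge instead of a single byte, at the cost of a slightly more involved lookup.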

Henrik Böving (Sep 08 2023 at 17:24):

I'd guess if someone knows, it's @Sebastian Ullrich

Sebastian Ullrich (Sep 09 2023 at 05:15):

I'd have to measure again but I believe the token cache is quite effective at minimizing the time in the tokenizer

Joachim Breitner (Sep 09 2023 at 05:55):

And the trie data structure is used by the tokenizer?

Sebastian Ullrich (Sep 09 2023 at 06:06):

Yes, it stores the set of keywords

Joachim Breitner (Sep 09 2023 at 16:06):

Ok, so it is on the hot path? Then I'll see if I can speed it up a bit when I'm adapting it for Loogle; there might be some lowish-hanging fruit there (branch on bytes, not chars; use flat unpacked arrays instead of RB trees; possibly even linear search instead of binary search, given that the branching factor should be low and rumor has it that linear branchless searches are faster than binary search for short arrays). It will be a fun exercise learning more about the kind of C code that Lean generates.
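
The linear-search part for small nodes could be as simple as something like this (again just a sketch to illustrate the idea; whether it actually beats binary search here would have to be measured):

```lean
/-- Sketch of the "linear scan for small nodes" idea: with low branching
factors, a straight left-to-right scan over the packed bytes is a plausible
alternative to a binary search. -/
def linearFindKey? (keys : ByteArray) (b : UInt8) : Option Nat := Id.run do
  for i in [0:keys.size] do
    if keys.get! i == b then
      return some i
  return none
```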

Joachim Breitner (Sep 09 2023 at 16:26):

(ok, not sure if switching to bytes helps, it seems ByteArray isn't a packed byte array in lean. Anyways, I'll just play around with it.)

Henrik Böving (Sep 09 2023 at 16:34):

Joachim Breitner said:

(ok, not sure if switching to bytes helps, it seems ByteArray isn't a packed byte array in lean. Anyways, I'll just play around with it.)

I was under the impression that it is; what makes you think this?

Joachim Breitner (Sep 09 2023 at 16:36):

Reading this on the phone, so I may be wrong, but it seems all arrays are backed by a (void *):
https://github.com/leanprover/lean4/blob/3aa1cfcceabf7d091a3b2e5d4330df76767336ac/src/include/lean/lean.h#L871

Joachim Breitner (Sep 09 2023 at 16:37):

Hmm, let me look again

Joachim Breitner (Sep 09 2023 at 16:39):

I stand corrected. Great! (And sorry for the noise)

Joachim Breitner (Sep 10 2023 at 17:06):

I ran some first experiments, where I put the name suffixes of all mathlib definitions into one trie.
Switching from RBTrees to flat unsorted arrays inside the nodes reduced the overall size less than I expected (255 MB to 219 MB – I guess the Trie itself and the Names make up most of that).

Lookup speed seems to have improved noticeably – 1000000 queries of Nat.iterate take 11s instead of 16.5s. And it can probably be improved even more (add something like ByteArray.strchr for branchless finding of a byte; avoid Nat for indexing).
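
The ByteArray.strchr-like helper could look roughly as follows (hypothetical, no such function exists today; ByteArray.uget with a USize index is what I mean by avoiding Nat):

```lean
/-- Hypothetical `ByteArray.strchr`-style helper: index of the first occurrence
of `b` in `a`, walking the array with a `USize` index via `ByteArray.uget`
instead of `Nat` indexing. A real version would presumably be a tight
`@[extern]` C loop. -/
partial def ByteArray.strchr? (a : ByteArray) (b : UInt8) (i : USize := 0) : Option USize :=
  if h : i.toNat < a.size then
    if a.uget i h == b then some i
    else ByteArray.strchr? a b (i + 1)
  else none
```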

For Loogle/#find I am unsure if I want to use this – for interactive use the index is far too slow to construct.
For the online version it wouldn’t be too bad, given that the index is precomputed, but I wonder if the two should diverge like this.
The lookup improvement might be useful for Lean itself, but there I am facing the problem that Trie.matchPrefix wants to index into a String at a certain position; I don't want to copy the whole String via String.toUTF8 just for that, but currently the String API doesn't allow direct iterative access to the UTF-8 bytes. I guess a String.getUTF8Byte could be added… maybe I'll try.
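
To make the problem concrete (byteAtViaCopy is only for illustration, and String.getUTF8Byte is the hypothetical addition, not an existing function):

```lean
-- What byte-level access has to do today: materialize all of the string's
-- bytes via `String.toUTF8` (a full copy) just to read them one at a time.
def byteAtViaCopy (s : String) (i : Nat) : UInt8 :=
  s.toUTF8.get! i

-- The accessor I would want instead (hypothetical signature):
-- read the i-th UTF-8 byte of `s` in place, without copying.
-- String.getUTF8Byte : String → Nat → UInt8
```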

Is there profiling data readily available to see how much time is spent in Trie operations?

James Gallicchio (Sep 10 2023 at 20:51):

I think perf with a frontend (e.g. hotspot) is the way people profile?

Joachim Breitner (Sep 10 2023 at 21:12):

I was hoping that maybe a profile is created by CI and pushed online :-D

I did another variant, where I added a constructor to the Trie for the case of only one child node (which is very common). This brings the size down to 148 MB (plausible) and the lookup time to 2s … but something is suddenly wrong with my testing.
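
The shape of that variant is roughly the following (names made up, not the actual Lean.Parser.Trie definition):

```lean
/-- Sketch of a trie with a dedicated constructor for the very common case of
exactly one child, so such nodes don't pay for a one-element `ByteArray` plus a
one-element `Array` of children. -/
inductive CompactTrie (α : Type) where
  | leaf  (val : Option α)
  | node1 (val : Option α) (k : UInt8) (child : CompactTrie α)
  | node  (val : Option α) (keys : ByteArray) (children : Array (CompactTrie α))
```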

James Gallicchio (Sep 10 2023 at 21:16):

ah yeah I have no idea how the speedcenter works :big_smile:

David Renshaw (Sep 10 2023 at 21:21):

speedcenter scripts are linked here: https://leanprover.zulipchat.com/#narrow/stream/287929-mathlib4/topic/mathlib4.20speedcenter/near/346669256

David Renshaw (Sep 10 2023 at 21:21):

https://github.com/Kha/mathlib4-bench

Scott Morrison (Sep 10 2023 at 23:27):

Joachim Breitner said:

For Loogle/#find I am unsure if I want to use this – for interactive use the index is far too slow to construct.

Why can't interactive use also use a precomputed index, like exact? does?

Joachim Breitner (Sep 11 2023 at 09:14):

It could, if the benefit is worth the cost.
Benefit:
- #find supports queries with _just_ lemma name fragments (as soon as any term pattern is used, a different index can be used)

Costs:
- ~200 MB extra download
- if the cache isn't there (e.g. you are modifying mathlib and rebuilding everything), you spend a few extra minutes rebuilding the cache.

Hard to tell…

Joachim Breitner (Sep 11 2023 at 09:19):

On the plane I did some more testing.
Indexing a slice of mathlib, the trie size goes down from 26 MB to 22 MB (flat byte array) to 16 MB (separate constructor for nodes with degree 1).
And looking up replicate in that trie 10M times (traversing nodes with degrees 92, 60, 43, 4, 2, 1, 1, 1, 1) takes 64s → 28s → 17s. This looks like it is worth taking a shot at getting that into Lean.

Sebastian Ullrich (Sep 11 2023 at 12:53):

@Joachim Breitner In the parser benchmark, Trie.matchPrefix accounts for 6.9% of the run time. But that benchmark spends 100% of its time in the parser; in mathlib, parsing is only ~2% of the total run time. So while tokenizer optimizations might be interesting for some use cases, I would argue their priority is not terribly high at the moment.

Joachim Breitner (Sep 11 2023 at 16:20):

Yeah; still a fun exercise and since I might want to optimize that data structure for #find/loogle anyways… we’ll see. Right now I still have to find a bug somewhere :-)

Joachim Breitner (Sep 11 2023 at 16:58):

matchPrefix returns String.Pos × Option α, but it seems all uses in Lean ignore the first field and only care about the Option α result. So probably that code can be simplified (wip @ https://github.com/leanprover/lean4/pull/2531)
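
In other words, every call site effectively does this projection (sketch with a stand-in function, not the actual parser code):

```lean
/-- If a `matchPrefix`-style function returns `String.Pos × Option α` but every
caller only projects out the second component, the `String.Pos` can be dropped
from the return type altogether. `anyMatchPrefix` is a stand-in here. -/
def dropPos (anyMatchPrefix : String → String.Pos → String.Pos × Option α)
    (s : String) (i : String.Pos) : Option α :=
  (anyMatchPrefix s i).2
```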


Last updated: Dec 20 2023 at 11:08 UTC