Zulip Chat Archive
Stream: lean4
Topic: PSA: Verified Regex Matching with PikeVM
pandaman (Jan 05 2024 at 11:37):
Let me introduce a regex library for which I have just completed the correctness proof in Lean :big_smile: The library uses an NFA-based PikeVM, which guarantees linear worst-case running time. I'm also planning to further optimize the performance by profiling and reducing allocations.
This is my first serious Lean project, and I'd appreciate any help or suggestions you can provide. Thanks!
https://github.com/pandaman64/lean-regex/tree/main
Martin Dvořák (Jan 05 2024 at 11:55):
Does it intentionally avoid docs#NFA from Mathlib?
pandaman (Jan 05 2024 at 12:03):
Yes. I wanted to explicitly control how the data is laid out in memory.
Martin Dvořák (Jan 05 2024 at 12:23):
Where is it done?
Kyle Miller (Jan 05 2024 at 12:29):
@Martin Dvořák Are there examples of docs#NFA being evaluated? That type seems good for purely theoretical arguments, which is great, but `Set` has no runtime representation.
Here's the NFA type: https://github.com/pandaman64/lean-regex/blob/main/Regex/NFA/Basic.lean#L72
Here's the NFA state: https://github.com/pandaman64/lean-regex/blob/main/Regex/NFA/VM/Basic.lean#L23
pandaman (Jan 05 2024 at 12:31):
You can find state transitions here: https://github.com/pandaman64/lean-regex/blob/9e6270b46436f7e147d1396bf67616fe8d86fd29/Regex/NFA/VM/Basic.lean#L337
Martin Dvořák (Jan 05 2024 at 12:32):
Oic now.
https://github.com/pandaman64/lean-regex/blob/main/Regex/NFA/Tests.lean
The regex is specified as a tree (with an array inside), and the input is specified as a string.
Kyle Miller (Jan 05 2024 at 12:37):
@pandaman Could you change the `.char` regex to take a `Char -> Bool` predicate instead of a `Char`? I'm not sure what all the implications would be for what you've done, but that would let you handle character ranges easily, which could otherwise be intractable to represent as a regex if it's a big Unicode range. I'm assuming you could modify the VM similarly.
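A predicate-based constructor might look like the following sketch. This is illustrative only, not the library's actual AST: `PRegex` and its constructors are hypothetical names.

```lean
-- Illustrative sketch: a regex AST whose character constructor takes a
-- predicate instead of a concrete `Char`. All names here are hypothetical.
inductive PRegex where
  | empty
  | epsilon
  | char (p : Char → Bool)        -- matches any character satisfying `p`
  | concat (r₁ r₂ : PRegex)
  | alternate (r₁ r₂ : PRegex)
  | star (r : PRegex)

-- A big Unicode range collapses to one cheap predicate instead of a huge
-- alternation of single-character regexes:
def lowerAscii : PRegex :=
  .char (fun c => decide ('a'.val ≤ c.val) && decide (c.val ≤ 'z'.val))
```

The VM would then test `p c` at each transition instead of comparing characters for equality.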
Kyle Miller (Jan 05 2024 at 12:41):
(For the NFA state, I saw you mentioned bitvec, which would be a really good improvement to its efficiency. It looks like you were already careful to create an API around `NodeSet`, which hopefully makes the transition fairly painless! One first step you could do is change that from an `abbrev` to a `def` to see where you're accidentally using its definition.)
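The `abbrev`-to-`def` trick can be seen in a small sketch (the `NodeSet` definition below is illustrative, not the library's exact one):

```lean
-- An `abbrev` is reducible, so code can silently see the underlying
-- representation:
abbrev NodeSetA (n : Nat) := { a : Array Bool // a.size = n }

-- This typechecks because `NodeSetA` unfolds to the subtype:
def usesRepr (s : NodeSetA 3) : Bool := s.val[0]!

-- With a `def`, the same body is rejected until it goes through the API,
-- revealing every place the representation leaks:
def NodeSetD (n : Nat) := { a : Array Bool // a.size = n }
-- def usesRepr' (s : NodeSetD 3) : Bool := s.val[0]!  -- error: `val` unknown
```

Once all leaks are routed through the `NodeSet` API, swapping the representation (e.g. for a bitvector) only touches the API's implementation.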
pandaman (Jan 05 2024 at 12:51):
Thank you for the suggestions! I'm planning to do some benchmarking via rebar to measure the baseline performance first, but these suggestions totally make sense.
One thing I'm unsure about with bitvec is how to reuse allocations. It uses `Nat` under the hood, and I'm wondering if we need to reallocate the bigint on each iteration (whereas we can reuse an array by clearing it): https://github.com/leanprover/std4/blob/0f6bc5b32bf5b0498902d3b5f0806c75530539d5/Std/Data/BitVec/Basic.lean#L29
Kyle Miller (Jan 05 2024 at 13:09):
Hmm, I'm not sure whether Lean can re-use Nats like it can an Array. (I also don't see any API for setting bits for bitvec)
Kyle Miller (Jan 05 2024 at 13:13):
I suppose `Array Bool` might not be that bad of a representation -- if I remember correctly, while `Array Bool` might be an array of pointers, at least the `Bool` values are packed inside the pointers, so there's no indirection. (I could be misremembering this detail about the Lean runtime.)
Henrik Böving (Jan 05 2024 at 13:17):
Kyle Miller said:
I suppose `Array Bool` might not be that bad of a representation -- if I remember correctly, while `Array Bool` might be an array of pointers, at least the `Bool` values are packed inside the pointers, so there's no indirection. (I could be misremembering this detail about the Lean runtime.)
This is correct. `Array Bool` will be represented as an array of tagged pointers, so if you access the bool it has to be shifted etc. If you want an even more efficient representation, you should use docs#ByteArray
James Gallicchio (Jan 06 2024 at 09:09):
I'm putting together a bitvec implementation based on ByteArrays over in LeanColls, if it is of use to anyone :stuck_out_tongue_closed_eyes:
Joe Hendrix (Jan 11 2024 at 23:08):
I'd guess ByteArray is most efficient for dense bit arrays and either BitVec or a hash map is more efficient for sparse bit arrays depending on layout. It'd be great to get benchmarks on this though.
David Michael Roberts (Jan 12 2024 at 01:12):
Nice! How does it go on HTML?
Mac Malone (Jan 12 2024 at 01:12):
Henrik Böving said:
This is correct. `Array Bool` will be represented as an array of tagged pointers, so if you access the bool it has to be shifted etc. If you want an even more efficient representation, you should use docs#ByteArray
Are you sure `Array Bool` is an array of tagged pointers? `Bool` is represented as a `UInt8`, so I believe `Array Bool` would just be an array of boxed `UInt8` / scalar objects (where boxing in this case means a shift `<< 1` is applied to the value to turn it into a pseudo-`lean_object*`). As such, the overhead over `ByteArray` would be that each entry is represented as a `USize` rather than a `UInt8` and has to be unboxed from the `USize` to `UInt8` after loading via a single shift.
Mac Malone (Jan 12 2024 at 01:13):
Oh, and there is one more level of indirection because the `Array` value is stored as a pointer in the `lean_object`, whereas the scalar `ByteArray` data is stored inline in the `lean_object`. EDIT: They are both inline.
Henrik Böving (Jan 12 2024 at 07:14):
Mac Malone said:
Henrik Böving said:
This is correct. `Array Bool` will be represented as an array of tagged pointers, so if you access the bool it has to be shifted etc. If you want an even more efficient representation, you should use docs#ByteArray
Are you sure `Array Bool` is an array of tagged pointers? `Bool` is represented as a `UInt8`, so I believe `Array Bool` would just be an array of boxed `UInt8` / scalar objects (where boxing in this case means a shift `<< 1` is applied to the value to turn it into a pseudo-`lean_object*`). As such, the overhead over `ByteArray` would be that each entry is represented as a `USize` rather than a `UInt8` and has to be unboxed from the `USize` to `UInt8` after loading via a single shift.
I was under the impression that that's what we mean by a tagged pointer. But yes, what you are saying is the correct representation.
pandaman (Jan 12 2024 at 09:18):
I'm going to perform a benchmark with `Array Bool` and `ByteArray` (and possibly `Nat`) next weekend. Stay tuned :+1:
pandaman (Jan 13 2024 at 12:53):
I did a quick benchmark, and it turned out that `Array Bool` was 10% faster than `ByteArray`. You can find the code and more results here: https://github.com/pandaman64/lean-regex/pull/1
$ hyperfine --warmup 10 "./array-bool 'lean|rust' /usr/share/dict/american-english" "./bytearray 'lean|rust' /usr/share/dict/american-english"
Benchmark 1: ./array-bool 'lean|rust' /usr/share/dict/american-english
Time (mean ± σ): 270.7 ms ± 10.4 ms [User: 262.2 ms, System: 8.2 ms]
Range (min … max): 261.3 ms … 296.2 ms 11 runs
Benchmark 2: ./bytearray 'lean|rust' /usr/share/dict/american-english
Time (mean ± σ): 298.1 ms ± 5.3 ms [User: 290.1 ms, System: 7.9 ms]
Range (min … max): 291.0 ms … 306.1 ms 10 runs
Summary
./array-bool 'lean|rust' /usr/share/dict/american-english ran
1.10 ± 0.05 times faster than ./bytearray 'lean|rust' /usr/share/dict/american-english
pandaman (Jan 13 2024 at 12:54):
I originally wanted to try rebar for more comprehensive benchmarking, but rebar is search-oriented while I only implemented matching for the whole string...
Henrik Böving (Jan 13 2024 at 12:57):
pandaman said:
I did a quick benchmark, and it turned out that `Array Bool` was 10% faster than `ByteArray`. You can find the code and more results here: https://github.com/pandaman64/lean-regex/pull/1
$ hyperfine --warmup 10 "./array-bool 'lean|rust' /usr/share/dict/american-english" "./bytearray 'lean|rust' /usr/share/dict/american-english"
Benchmark 1: ./array-bool 'lean|rust' /usr/share/dict/american-english
Time (mean ± σ): 270.7 ms ± 10.4 ms [User: 262.2 ms, System: 8.2 ms]
Range (min … max): 261.3 ms … 296.2 ms 11 runs
Benchmark 2: ./bytearray 'lean|rust' /usr/share/dict/american-english
Time (mean ± σ): 298.1 ms ± 5.3 ms [User: 290.1 ms, System: 7.9 ms]
Range (min … max): 291.0 ms … 306.1 ms 10 runs
Summary
./array-bool 'lean|rust' /usr/share/dict/american-english ran
1.10 ± 0.05 times faster than ./bytearray 'lean|rust' /usr/share/dict/american-english
That PR doesn't really look like a proper ByteArray implementation to me? Once you have a ByteArray, you can pack 8 Bools into a single slot of the ByteArray instead of just one Bool per slot; I don't see this packing happening in your PR. I would expect that this packing alone should give drastically better cache hits on non-trivial regexes, which should easily outrun `Array Bool`. If that's not the case, I would be very interested in digging into why.
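The packing described above could be sketched like this (hypothetical names and definitions, not the library's `NodeSet` API):

```lean
-- Hypothetical sketch of 8-bools-per-byte packing in a ByteArray.
structure BitSet where
  data : ByteArray

def BitSet.ofSize (n : Nat) : BitSet :=
  -- one byte covers 8 states (allocation kept simple for the sketch)
  ⟨ByteArray.mk (mkArray ((n + 7) / 8) 0)⟩

-- Test bit `i`: load the containing byte, shift the bit down, mask it out.
def BitSet.get (s : BitSet) (i : Nat) : Bool :=
  (s.data.get! (i / 8) >>> UInt8.ofNat (i % 8)) &&& 1 == 1

-- Set bit `i`: OR the corresponding mask into the containing byte.
def BitSet.set (s : BitSet) (i : Nat) : BitSet :=
  ⟨s.data.set! (i / 8) (s.data.get! (i / 8) ||| (1 <<< UInt8.ofNat (i % 8)))⟩
```

With 8x fewer bytes touched per membership test, more of the visited-state set stays in cache, which is where the expected speedup on non-trivial regexes would come from.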
pandaman (Jan 13 2024 at 13:05):
That PR doesn't really look like a proper ByteArray implementation to me?
I just wanted to see if I could get a quick win by switching the underlying array. While bit packing could be advantageous, it would also increase the verification burden, and that's not the path I'm currently pursuing.
Henrik Böving (Jan 13 2024 at 13:19):
pandaman said:
That PR doesn't really look like a proper ByteArray implementation to me?
I just wanted to see if I could get a quick win by switching the underlying array. While bit packing could be advantageous, it would also increase the verification burden, and that's not the path I'm currently pursuing.
Right, that's fair. Also, just for fun, I tried to reproduce your benchmark on a file of mine (approx. 2k lines):
λ hyperfine --warmup 10 ".lake/build/bin/RunRegex 'lean|rust|' /var/log/dnf.log"
Benchmark 1: .lake/build/bin/RunRegex 'lean|rust|' /var/log/dnf.log
Time (mean ± σ): 121.4 ms ± 3.3 ms [User: 112.8 ms, System: 8.0 ms]
Range (min … max): 118.3 ms … 130.7 ms 24 runs
florence:~/Desktop/formal_verification/lean/lean-regex|main
λ git checkout test/bytearray
branch 'test/bytearray' set up to track 'origin/test/bytearray'.
Switched to a new branch 'test/bytearray'
florence:~/Desktop/formal_verification/lean/lean-regex|test/bytearray
λ lake build RunRegex
...
florence:~/Desktop/formal_verification/lean/lean-regex|test/bytearray
λ hyperfine --warmup 10 ".lake/build/bin/RunRegex 'lean|rust|' /var/log/dnf.log"
Benchmark 1: .lake/build/bin/RunRegex 'lean|rust|' /var/log/dnf.log
Time (mean ± σ): 121.6 ms ± 1.5 ms [User: 112.5 ms, System: 8.6 ms]
Range (min … max): 119.9 ms … 126.5 ms 24 runs
That looks like it's approximately within noise to me. How big is your dataset? I'd like to figure out where exactly the performance difference is coming from.
pandaman (Jan 13 2024 at 13:38):
I used Ubuntu's American English dictionary: https://manpages.ubuntu.com/manpages/trusty/man5/american-english.5.html
I already left my machine, but I can check the line count tomorrow.
I ran the benchmark on WSL, which might or might not affect the result.
Mario Carneiro (Jan 13 2024 at 23:17):
This line looks bad:
def NodeSet.empty {n : Nat} : NodeSet n :=
- ⟨mkArray n false, by simp⟩
+ ⟨⟨mkArray n 0⟩, by simp [ByteArray.size]⟩
You are creating an `Array UInt8` and converting it to a `ByteArray` there, so you lose most of the performance win of the 8x smaller memory allocation that `ByteArray` needs.
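One way around the intermediate `Array UInt8` is to build the `ByteArray` directly with its capacity reserved up front. A sketch (the `zeros` helper is hypothetical, not a library function):

```lean
-- Hypothetical helper: allocate a zero-filled ByteArray without first
-- materializing an `Array UInt8`. `ByteArray.mkEmpty n` reserves capacity,
-- so the pushes below never reallocate.
def ByteArray.zeros (n : Nat) : ByteArray := Id.run do
  let mut out := ByteArray.mkEmpty n
  for _ in [0:n] do
    out := out.push 0
  return out
```

This allocates one n-byte buffer instead of an n-slot boxed array plus a copy into a fresh `ByteArray`.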
François G. Dorais (Jan 15 2024 at 20:42):
@pandaman The speed benefits usually come from using `USize` for indexing instead of `Nat` (i.e., `uget` vs `get`/`fget`). Using `ByteArray` gives an 8x improvement in memory allocation, and the access time should be equal to or better than `Array UInt8` when using `USize` instead of `Nat`.
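A sketch of the kind of hot loop this suggests (the function is hypothetical; it assumes `ByteArray.uget`, whose in-bounds proof argument is erased at runtime):

```lean
-- Hypothetical example: summing bytes with `uget`. The proof `h` exists only
-- at elaboration time; the compiled loop indexes with a raw machine word,
-- with no `Nat` boxing and no runtime bounds re-check inside `uget`.
partial def sumBytes (data : ByteArray) : UInt32 :=
  go 0 0
where
  go (i : USize) (acc : UInt32) : UInt32 :=
    if h : i.toNat < data.size then
      go (i + 1) (acc + (data.uget i h).toUInt32)
    else
      acc
```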
Last updated: May 02 2025 at 03:31 UTC