Zulip Chat Archive

Stream: lean4

Topic: PSA: Verified Regex Matching with PikeVM


pandaman (Jan 05 2024 at 11:37):

Let me introduce a regex library whose correctness proof I have just completed in Lean :big_smile: The library uses an NFA-based PikeVM, which guarantees linear worst-case running time. I'm also planning to further optimize performance by profiling and reducing allocations.

This is my first serious Lean project, and I'd appreciate any help or suggestions you can provide. Thanks!

https://github.com/pandaman64/lean-regex/tree/main

Martin Dvořák (Jan 05 2024 at 11:55):

Does it intentionally avoid docs#NFA from Mathlib?

pandaman (Jan 05 2024 at 12:03):

Yes. I wanted to explicitly control how the data is laid out in memory.

Martin Dvořák (Jan 05 2024 at 12:23):

Where is it done?

Kyle Miller (Jan 05 2024 at 12:29):

@Martin Dvořák Are there examples of docs#NFA being evaluated? That type seems good for purely theoretical arguments, which is great, but Set has no runtime representation.

Here's the NFA type: https://github.com/pandaman64/lean-regex/blob/main/Regex/NFA/Basic.lean#L72

Here's the NFA state: https://github.com/pandaman64/lean-regex/blob/main/Regex/NFA/VM/Basic.lean#L23

pandaman (Jan 05 2024 at 12:31):

You can find state transitions here: https://github.com/pandaman64/lean-regex/blob/9e6270b46436f7e147d1396bf67616fe8d86fd29/Regex/NFA/VM/Basic.lean#L337

Martin Dvořák (Jan 05 2024 at 12:32):

Oic now.
https://github.com/pandaman64/lean-regex/blob/main/Regex/NFA/Tests.lean
The regex is specified as a tree (internally there is an array), and the input is specified as a string.

Kyle Miller (Jan 05 2024 at 12:37):

@pandaman Could you change the .char regex to take a Char -> Bool predicate instead of a Char? I'm not sure what all the implications would be for what you've done, but that would let you handle character ranges easily, which could otherwise be intractable to represent as a regex if it's a big Unicode range. I'm assuming you could modify the VM similarly.
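For concreteness, here is a hypothetical sketch of what that change could look like (the constructor names are made up; the library's actual datatype may differ):

inductive Regex where
  | empty
  | epsilon
  | char (p : Char → Bool)  -- previously something like: char (c : Char)
  | alternate (r₁ r₂ : Regex)
  | concat (r₁ r₂ : Regex)
  | star (r : Regex)

A range like [a-z] then becomes a single char node whose predicate checks the code point, instead of an alternation over every character in the range.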

Kyle Miller (Jan 05 2024 at 12:41):

(For the NFA state, I saw you mentioned bitvec, which would be a really good improvement to its efficiency. It looks like you already were careful to create an API around NodeSet, which hopefully makes the transition fairly painless! One first step you could do is change that from an abbrev to a def to see where you're accidentally using its definition.)
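In case it helps, a hypothetical illustration of that abbrev-to-def trick (the real NodeSet in the library may be defined differently; only the shape of the idea matters here):

-- abbrev NodeSet (n : Nat) := { array : Array Bool // array.size = n }
def NodeSet (n : Nat) : Type :=
  { array : Array Bool // array.size = n }

Once NodeSet is an opaque def, any code that silently relied on it unfolding to the underlying subtype stops elaborating, which points out exactly where the concrete representation leaks through.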

pandaman (Jan 05 2024 at 12:51):

Thank you for the suggestions! I'm planning to do some benchmarking via rebar to measure the baseline performance first, but these suggestions totally make sense.
One thing I'm unsure about with bitvec is how to reuse allocations. It uses Nat under the hood, and I'm wondering if we need to reallocate the bigint for each iteration (while we can reuse arrays by clearing them): https://github.com/leanprover/std4/blob/0f6bc5b32bf5b0498902d3b5f0806c75530539d5/Std/Data/BitVec/Basic.lean#L29

Kyle Miller (Jan 05 2024 at 13:09):

Hmm, I'm not sure whether Lean can re-use Nats like it can an Array. (I also don't see any API for setting bits for bitvec)

Kyle Miller (Jan 05 2024 at 13:13):

I suppose Array Bool might not be that bad of a representation -- if I remember correctly, while Array Bool might be an array of pointers, at least the Bool values are packed inside the pointers, so there's no indirection. (I could be misremembering this detail about the Lean runtime.)

Henrik Böving (Jan 05 2024 at 13:17):

Kyle Miller said:

I suppose Array Bool might not be that bad of a representation -- if I remember correctly, while Array Bool might be an array of pointers, at least the Bool values are packed inside the pointers, so there's no indirection. (I could be misremembering this detail about the Lean runtime.)

This is correct; Array Bool will be represented as an array of tagged pointers, so if you access the Bool it has to be shifted etc. If you want an even more efficient representation you should use docs#ByteArray

James Gallicchio (Jan 06 2024 at 09:09):

I'm putting together a bitvec implementation based on ByteArrays over in LeanColls, if it is of use to anyone :stuck_out_tongue_closed_eyes:

Joe Hendrix (Jan 11 2024 at 23:08):

I'd guess ByteArray is most efficient for dense bit arrays and either BitVec or a hash map is more efficient for sparse bit arrays depending on layout. It'd be great to get benchmarks on this though.

David Michael Roberts (Jan 12 2024 at 01:12):

Nice! How does it go on HTML?

Mac Malone (Jan 12 2024 at 01:12):

Henrik Böving said:

This is correct; Array Bool will be represented as an array of tagged pointers, so if you access the Bool it has to be shifted etc. If you want an even more efficient representation you should use docs#ByteArray

Are you sure Array Bool is an array of tagged pointers? Bool is represented as a UInt8, so I believe Array Bool would just be an array of boxed UInt8 / scalar objects (where boxing in this case means a shift << 1 is applied to the value to turn it into a pseudo-lean_object*). As such, the overhead over ByteArray would be that each entry is represented as a USize rather than a UInt8 and has to be unboxed from the USize to UInt8 after loading via a single shift.

Mac Malone (Jan 12 2024 at 01:13):

Oh, and there is one more level of indirection because the Array value is stored as a pointer in the lean_object, whereas the scalar ByteArray object is stored inline in the lean_object. EDIT: They are both inline.

Henrik Böving (Jan 12 2024 at 07:14):

Mac Malone said:

Henrik Böving said:

This is correct; Array Bool will be represented as an array of tagged pointers, so if you access the Bool it has to be shifted etc. If you want an even more efficient representation you should use docs#ByteArray

Are you sure Array Bool is an array of tagged pointers? Bool is represented as a UInt8, so I believe Array Bool would just be an array of boxed UInt8 / scalar objects (where boxing in this case means a shift << 1 is applied to the value to turn it into a pseudo-lean_object*). As such, the overhead over ByteArray would be that each entry is represented as a USize rather than a UInt8 and has to be unboxed from the USize to UInt8 after loading via a single shift.

I was under the impression that that's what we mean by a tagged pointer? But yes, what you are saying is the correct representation.

pandaman (Jan 12 2024 at 09:18):

I'm going to perform a benchmark with Array Bool and ByteArray (and possibly Nat) next weekend. Stay tuned :+1:

pandaman (Jan 13 2024 at 12:53):

I did a quick benchmark, and it turned out that Array Bool was 10% faster than ByteArray. You can find the code and more results here: https://github.com/pandaman64/lean-regex/pull/1

$ hyperfine --warmup 10 "./array-bool 'lean|rust' /usr/share/dict/american-english"  "./bytearray 'lean|rust' /usr/share/dict/american-english"
Benchmark 1: ./array-bool 'lean|rust' /usr/share/dict/american-english
  Time (mean ± σ):     270.7 ms ±  10.4 ms    [User: 262.2 ms, System: 8.2 ms]
  Range (min … max):   261.3 ms … 296.2 ms    11 runs

Benchmark 2: ./bytearray 'lean|rust' /usr/share/dict/american-english
  Time (mean ± σ):     298.1 ms ±   5.3 ms    [User: 290.1 ms, System: 7.9 ms]
  Range (min … max):   291.0 ms … 306.1 ms    10 runs

Summary
  ./array-bool 'lean|rust' /usr/share/dict/american-english ran
    1.10 ± 0.05 times faster than ./bytearray 'lean|rust' /usr/share/dict/american-english

pandaman (Jan 13 2024 at 12:54):

I originally wanted to try rebar for more comprehensive benchmarking, but rebar is search-oriented while I have only implemented matching against the whole string...

Henrik Böving (Jan 13 2024 at 12:57):

pandaman said:

I did a quick benchmark, and it turned out that Array Bool was 10% faster than ByteArray. You can find the code and more results here: https://github.com/pandaman64/lean-regex/pull/1

$ hyperfine --warmup 10 "./array-bool 'lean|rust' /usr/share/dict/american-english"  "./bytearray 'lean|rust' /usr/share/dict/american-english"
Benchmark 1: ./array-bool 'lean|rust' /usr/share/dict/american-english
  Time (mean ± σ):     270.7 ms ±  10.4 ms    [User: 262.2 ms, System: 8.2 ms]
  Range (min … max):   261.3 ms … 296.2 ms    11 runs

Benchmark 2: ./bytearray 'lean|rust' /usr/share/dict/american-english
  Time (mean ± σ):     298.1 ms ±   5.3 ms    [User: 290.1 ms, System: 7.9 ms]
  Range (min … max):   291.0 ms … 306.1 ms    10 runs

Summary
  ./array-bool 'lean|rust' /usr/share/dict/american-english ran
    1.10 ± 0.05 times faster than ./bytearray 'lean|rust' /usr/share/dict/american-english

That PR doesn't really look like a proper ByteArray implementation to me? Once you have a ByteArray you can pack 8 Bools into a single slot of the ByteArray instead of just 1 Bool per slot, and I don't see this packing happening in your PR. I would expect that this packing alone should give drastically better cache behavior on non-trivial regexes, which should easily be able to outrun Array Bool. If that's not the case I would be very interested in digging into why that is.
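To make the packing concrete, here is a minimal hypothetical sketch (not code from the PR): bit i lives in byte i / 8 at position i % 8, so n states need (n + 7) / 8 bytes rather than n slots.

def testBit (bytes : ByteArray) (i : Nat) : Bool :=
  -- extract bit i % 8 of byte i / 8
  (bytes.get! (i / 8) >>> UInt8.ofNat (i % 8)) &&& 1 == 1

def setBit (bytes : ByteArray) (i : Nat) : ByteArray :=
  -- set bit i % 8 of byte i / 8
  bytes.set! (i / 8) (bytes.get! (i / 8) ||| ((1 : UInt8) <<< UInt8.ofNat (i % 8)))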

pandaman (Jan 13 2024 at 13:05):

That PR doesn't really look like a proper ByteArray implementation to me?

I just wanted to see if I could get a quick win by switching the underlying array. While bit packing could be advantageous, it would also increase the verification burden, and that's not the path I'm currently pursuing.

Henrik Böving (Jan 13 2024 at 13:19):

pandaman said:

That PR doesn't really look like a proper ByteArray implementation to me?

I just wanted to see if I could get a quick win by switching the underlying array. While bit packing could be advantageous, it would also increase the verification burden, and that's not the path I'm currently pursuing.

Right, that's fair. Also, just for fun, I tried to reproduce your benchmark on a file of my own (approx. 2k lines):

λ hyperfine --warmup 10 ".lake/build/bin/RunRegex 'lean|rust|' /var/log/dnf.log"
Benchmark 1: .lake/build/bin/RunRegex 'lean|rust|' /var/log/dnf.log
  Time (mean ± σ):     121.4 ms ±   3.3 ms    [User: 112.8 ms, System: 8.0 ms]
  Range (min … max):   118.3 ms … 130.7 ms    24 runs

florence:~/Desktop/formal_verification/lean/lean-regex|main
λ git checkout  test/bytearray
branch 'test/bytearray' set up to track 'origin/test/bytearray'.
Switched to a new branch 'test/bytearray'
florence:~/Desktop/formal_verification/lean/lean-regex|test/bytearray
λ lake build RunRegex
...
florence:~/Desktop/formal_verification/lean/lean-regex|test/bytearray
λ hyperfine --warmup 10 ".lake/build/bin/RunRegex 'lean|rust|' /var/log/dnf.log"
Benchmark 1: .lake/build/bin/RunRegex 'lean|rust|' /var/log/dnf.log
  Time (mean ± σ):     121.6 ms ±   1.5 ms    [User: 112.5 ms, System: 8.6 ms]
  Range (min … max):   119.9 ms … 126.5 ms    24 runs

That looks approximately within noise to me? How big is your dataset? I'd like to figure out exactly where the performance difference is coming from.

pandaman (Jan 13 2024 at 13:38):

I used Ubuntu's American English dictionary: https://manpages.ubuntu.com/manpages/trusty/man5/american-english.5.html
I've already left my machine, but I can check the line count tomorrow.

I ran the benchmark on WSL, which might or might not affect the result.

Mario Carneiro (Jan 13 2024 at 23:17):

This line looks bad:

 def NodeSet.empty {n : Nat} : NodeSet n :=
-  ⟨mkArray n false, by simp⟩
+  ⟨⟨mkArray n 0⟩, by simp [ByteArray.size]⟩

You are creating an Array UInt8 and converting it to a ByteArray there, so you lose most of the performance win from the 8x smaller memory allocation that ByteArray needs.
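A hypothetical way to avoid the intermediate boxed array would be to build the ByteArray directly (this is only a sketch; the size invariant of NodeSet would still need to be re-proved for it):

def zeroByteArray (n : Nat) : ByteArray := Id.run do
  -- mkEmpty reserves capacity n without ever allocating an Array UInt8
  let mut bs := ByteArray.mkEmpty n
  for _ in [0:n] do
    bs := bs.push 0
  return bs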

François G. Dorais (Jan 15 2024 at 20:42):

@pandaman The speed benefits usually come from using USize for indexing instead of Nat (i.e., uget vs get/fget). Using ByteArray gives an 8x improvement in memory allocation, and the access time should be equal to or better than Array UInt8 when using USize instead of Nat.
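As a rough illustration of that point (a generic sketch, not the library's code), a USize-indexed loop over a dense Array UInt8 never has to box a Nat index on the hot path:

partial def countNonZero (a : Array UInt8) (i : USize := 0) (acc : Nat := 0) : Nat :=
  if h : i.toNat < a.size then
    -- Array.uget takes a machine-word index plus a bounds proof
    countNonZero a (i + 1) (if a.uget i h != 0 then acc + 1 else acc)
  else
    acc

ByteArray has an analogous USize-indexed accessor, so the same pattern should apply there.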


Last updated: May 02 2025 at 03:31 UTC