Zulip Chat Archive
Stream: general
Topic: How to speed up Lean programs
pandaman (Sep 03 2025 at 13:10):
I've found that the lean-regex library runs about 7–10x slower than the equivalent Rust implementation (depending on the benchmark). While I don't expect Lean to match Rust, I'd like to narrow the gap as much as possible. I've profiled and fixed some bottlenecks, but I'm running out of ideas, so I'd love to get advice on optimization strategies.
Findings so far:
- The hottest function is εClosure, as expected.
- Array linearity issues are resolved.
- Inlining some functions in εClosure gave mixed results.
An interesting observation is that functions like lean_dec_ref_cold, mi_free, lean_dec_ref, and mi_malloc_small dominate the inverted stack profile. Does this suggest I should focus on reducing allocations, or is this just a general Lean performance characteristic?
Example stack trace: https://share.firefox.dev/47VeUkD
I'd appreciate any input. Thanks!
Henrik Böving (Sep 03 2025 at 13:45):
> Array linearity issues are resolved.
One thing that is quite clear: in Regex.BufferStrategy.lam_0 you have a non-linearity on some array. You are calling fset but end up with a call to lean_copy_expand_array; that should be fixed to become linear, which will likely drop some of the allocation traffic along the way.
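(For illustration, a minimal sketch of such a non-linearity; the names are invented and this is not the actual lean-regex code:)

```lean
-- Hypothetical example. If a second reference to the array is still live at
-- the `set!`, the reference count is > 1, so the runtime has to copy the
-- whole array before writing (the copy shows up as `lean_copy_expand_array`
-- in the profile).
def updateNonLinear (buf : Array Nat) (i v : Nat) : Array Nat × Array Nat :=
  let old := buf            -- keeps a second reference to `buf` alive
  (buf.set! i v, old)       -- RC > 1 here, so `set!` copies

-- If the array is used linearly (exactly one live reference), `set!` mutates
-- it in place with no copy and no extra allocation.
def updateLinear (buf : Array Nat) (i v : Nat) : Array Nat :=
  buf.set! i v
```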
If after that you still see such high allocation traffic, I would suggest figuring out which objects you are allocating so many of and whether you can reduce pressure on the allocator somehow.
pandaman (Sep 03 2025 at 14:05):
Thank you for taking a look!
The BufferStrategy one results from a copy-on-write path and is expected. Sorry for not mentioning this.
Does Lean have a Lean-level memory profiler? I wonder what the best way to track allocations is.
Thales Fragoso (Sep 03 2025 at 22:21):
Rust's regex crate is heavily optimized; it uses SIMD and quite a few other tricks.
pandaman (Sep 03 2025 at 23:07):
I know. It's more than 1000 times faster than my implementation, and I don't plan to match that :joy: What I'm comparing against is [regex-lite](https://docs.rs/regex-lite/latest/regex_lite/), which doesn't have many optimizations and uses the standard NFA simulation. Sorry for the confusion.
Mr Proof (Sep 04 2025 at 01:47):
My 2-cents: :smiley:
Is lean-regex written in Lean?
If so, is there any reason not to use a fast regex library DLL behind the scenes? (Much like big integers in Lean are not implemented in Lean but use an external library, with conversions to and from it for calculations.) Yes, this would add extra unproven code, but aren't we already past that point?
Or is the problem in converting things to and from regex expressions?
IDK. Sorry, that was more like 1 cent's worth :pensive:
pandaman (Sep 04 2025 at 03:09):
Yes, and the code is verified correct with respect to a formal semantics in Lean (to a reasonable extent). That's the point of the project and why I'm asking how to speed up Lean code.
pandaman (Sep 04 2025 at 03:44):
Anyway, I think the next action is to figure out which part is allocating the most. Thank you for the insights and feel free to write if you have any ideas!
pandaman (Sep 16 2025 at 12:01):
Unbundling a structure helped a bit because it reduced the allocations from packing/unpacking the structure.
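(For illustration, a minimal sketch of structure unbundling with invented names; this is not the actual lean-regex code:)

```lean
-- Bundled: the state is threaded as a record, so each recursive call may
-- repack the constructor cell (depending on whether the compiler's
-- reset/reuse optimization fires).
structure LoopState where
  stack   : Array Nat
  visited : Array Bool

def stepBundled (s : LoopState) (fuel : Nat) : LoopState :=
  match fuel with
  | 0 => s
  | fuel + 1 => stepBundled { s with stack := s.stack.push fuel } fuel

-- Unbundled: the fields are separate parameters, so the hot loop only touches
-- the arrays (linearly, in place) and builds no record cells until the end.
def stepUnbundled (stack : Array Nat) (visited : Array Bool) (fuel : Nat) :
    Array Nat × Array Bool :=
  match fuel with
  | 0 => (stack, visited)
  | fuel + 1 => stepUnbundled (stack.push fuel) visited fuel
```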
MIMALLOC_SHOW_STATS=1 was helpful. The optimized program showed

```
heap stats:    peak       total      current     block     total#
  reserved:    1.0 GiB    1.0 GiB    1.0 GiB
  committed:   1.0 GiB    1.0 GiB    967.1 MiB
  reset:       0
  purged:      56.8 MiB
  touched:     64.2 KiB   64.2 KiB   -23.6 MiB
```
while the unoptimized one resulted in

```
heap stats:    peak       total      current     block     total#
  reserved:    1.0 GiB    1.0 GiB    1.0 GiB
  committed:   1.0 GiB    1.0 GiB    986.0 MiB
  reset:       0
  purged:      38.0 MiB
  touched:     64.2 KiB   64.2 KiB   -32.2 MiB
```
If we believe what Claude Sonnet said, the -23.6 MiB part corresponds to the amount of deallocated memory (which should equal the memory allocated during the run), so it's a 26.7% reduction in allocation.
Across the benchmarks, it got 5-10% faster, but that's not as large an improvement as I hoped :sweat_smile:
pandaman (Sep 16 2025 at 12:05):
I went further and tried to avoid the allocation when the algorithm pops an element from the stack immediately after pushing it, but that regressed performance by 30% :smiling_face_with_tear: It's quite tricky to make this fast...
Mac Malone (Sep 18 2025 at 03:33):
@pandaman Products with more than 2 elements (e.g., a × b × c) are represented as nested pairs, so you should likely refactor the εClosure result into a proper 3-field structure to reduce allocations.
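(A small illustration of the representation difference; the field names are invented:)

```lean
-- A triple `α × β × γ` is really `α × (β × γ)`, so building one allocates
-- two `Prod` cells:
def asTriple (states updates : Array Nat) (matched : Option Nat) :
    Array Nat × Array Nat × Option Nat :=
  (states, updates, matched)   -- Prod.mk states (Prod.mk updates matched)

-- A flat 3-field structure is a single constructor cell with three fields:
structure EpsilonResult where
  states  : Array Nat
  updates : Array Nat
  matched : Option Nat

def asStruct (states updates : Array Nat) (matched : Option Nat) : EpsilonResult :=
  ⟨states, updates, matched⟩
```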
Mac Malone (Sep 18 2025 at 03:36):
You could further optimize the result by making it an inductive (rather than a structure) with the Option field unbundled (e.g., have one constructor with the matched value and one without it). However, this may or may not improve performance, depending on how often the code converts to/from the Option representation.
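(Sketch of this suggestion with invented names; as noted, the benefit depends on how much conversion traffic there is:)

```lean
-- Structure version: a hit is stored behind an extra `Option.some` cell.
structure ResultS where
  states  : Array Nat
  matched : Option Nat

-- Inductive version: one constructor per case, so the matched case stores the
-- position directly and the no-match case carries nothing extra.
inductive ResultI where
  | noMatch (states : Array Nat)
  | matched (states : Array Nat) (pos : Nat)
```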
pandaman (Sep 19 2025 at 12:24):
Thank you for the insights! The product representation is a good callout and might explain the slowdown in this version, as the hottest function returns a four-element product.
The Option return value comes from an argument that is already an Option, so unbundling it might not be beneficial. Maybe I should consider using a sentinel value instead of Option to get rid of one indirection.
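(One possible shape of the sentinel approach, purely illustrative and assuming real positions never reach the sentinel value:)

```lean
-- Purely illustrative: encode "no match" as a sentinel position instead of
-- `Option Nat`, avoiding the `Option.some` cell.
def noMatchPos : Nat := 0xFFFFFFFFFFFF

def encodeMatch : Option Nat → Nat
  | none     => noMatchPos
  | some pos => pos

def decodeMatch (pos : Nat) : Option Nat :=
  if pos == noMatchPos then none else some pos
```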
pandaman (Sep 25 2025 at 13:17):
Introducing structures instead of triples ended up improving the throughput by 3-20%. Thank you!
https://github.com/pandaman64/lean-regex/pull/131
Last updated: Dec 20 2025 at 21:32 UTC