Zulip Chat Archive
Stream: lean4
Topic: slow mut variables
Tomas Skrivan (Dec 06 2023 at 15:17):
I just found out that having multiple mutable variables impacts performance quite badly.
I did a simple test loop with three different variants of mutable variables:
let mut a : USize := 0
let mut b : USize := 1
let mut s : USize × USize := (0,1)
let mut s' : USize2 := ⟨0,1⟩
where USize2 is a specialized struct holding two USize.
The times are 1197ms, 438ms and 137ms. I'm surprised that the the variant 1 and 2 are so different as you are paying the price of generic product type. Using the specialized struct USize2 makes it almost as fast as possible as equivalent c program takes 52ms.
So clearly if I want to get speed and have multiple mutable variables I should define a custom struct that holds the values instead of using generic Prod type. Is there anything else(hopefully less tedious) that I can do to make it fater?
code
Mario Carneiro (Dec 06 2023 at 15:27):
So clearly if I want to get speed and have multiple mutable variables I should define a custom struct that holds the values instead of using generic Prod type. Is there anything else(hopefully less tedious) that I can do to make it fater?
Yes, don't use for loops :sad: Use tail recursion instead
Mario Carneiro (Dec 06 2023 at 15:30):
what do you get with this variation?
let start ← IO.monoMsNow
let rec tailFib : Nat → USize → USize → USize
| 0, _, b => b
| n+1, a, b => tailFib n b (a+b)
let b := tailFib N 0 1
IO.println s!"{N}-th 'Fibonacci' was computed in {(← IO.monoMsNow) - start}ms"
IO.println b
Tomas Skrivan (Dec 06 2023 at 15:34):
C compiler optimization kicks in and I get 0ms :)
Mario Carneiro (Dec 06 2023 at 15:34):
actually I think it's moving the call past the monoMs
Tomas Skrivan (Dec 06 2023 at 15:34):
(I had to change Nat to USize as do not want to test additiong on Nat)
Mario Carneiro (Dec 06 2023 at 15:34):
because it's pure
Tomas Skrivan (Dec 06 2023 at 15:35):
.. ahh right ... but c equivalent with O2 and O3 gives 0ms too but the purity argument is more likely
Tomas Skrivan (Dec 06 2023 at 15:35):
That explains why
Nat.iterate (fun (x,y) => let tmp := x+y; (y,tmp)) N ((0:USize),(1:USize))
also gives 0ms
Mario Carneiro (Dec 06 2023 at 15:37):
with @[noinline] def blackbox : α → IO α := pure and let b := tailFib (← blackbox N) 0 1 I get 87ms
Mario Carneiro (Dec 06 2023 at 15:38):
Oh yeah I think the specific optimization here is extract_closed: since N is a constant tailFib N 0 1 is also a constant, so it is hoisted into a run-once computation at initialization time
Tomas Skrivan (Dec 06 2023 at 15:40):
I just shoved it into IO
let rec tailFib : Nat → USize → USize → IO USize
| 0, _, b => pure b
| n+1, a, b => tailFib n b (a+b)
let s''' ← tailFib N 0 1
and I get 47ms. Your variant with blackbox gives the same time.
Mario Carneiro (Dec 06 2023 at 15:41):
I believe this loop has no allocations, which is difficult to achieve in lean code
Mario Carneiro (Dec 06 2023 at 15:42):
this is especially important because lean's allocator is not marked up as an allocator to LLVM, so it destroys all kinds of optimizations in C land
Tomas Skrivan (Dec 06 2023 at 15:47):
The issue is that I'm usually doing iterations over some generic finite type I for which I have function iterate {α} : (I → α → α) → α → α. This means that I have to write curried specialization for every α that is of the form _ × _.
Mario Carneiro (Dec 06 2023 at 15:53):
I think even if you have a tuple it will be faster to do tail recursion in this way
Mario Carneiro (Dec 06 2023 at 15:53):
it won't be as fast as uncurried though
Last updated: May 02 2025 at 03:31 UTC