Zulip Chat Archive
Stream: lean4
Topic: slow mut variables
Tomas Skrivan (Dec 06 2023 at 15:17):
I just found out that having multiple mutable variables impacts performance quite badly.
I did a simple test loop with three different variants of mutable variables:
let mut a : USize := 0
let mut b : USize := 1
let mut s : USize × USize := (0,1)
let mut s' : USize2 := ⟨0,1⟩
where USize2
is a specialized struct holding two USize
.
The times are 1197ms
, 438ms
and 137ms
. I'm surprised that the the variant 1 and 2 are so different as you are paying the price of generic product type. Using the specialized struct USize2
makes it almost as fast as possible as equivalent c program takes 52ms
.
So clearly if I want to get speed and have multiple mutable variables I should define a custom struct that holds the values instead of using generic Prod
type. Is there anything else(hopefully less tedious) that I can do to make it fater?
code
Mario Carneiro (Dec 06 2023 at 15:27):
So clearly if I want to get speed and have multiple mutable variables I should define a custom struct that holds the values instead of using generic Prod type. Is there anything else(hopefully less tedious) that I can do to make it fater?
Yes, don't use for loops :sad: Use tail recursion instead
Mario Carneiro (Dec 06 2023 at 15:30):
what do you get with this variation?
let start ← IO.monoMsNow
let rec tailFib : Nat → USize → USize → USize
| 0, _, b => b
| n+1, a, b => tailFib n b (a+b)
let b := tailFib N 0 1
IO.println s!"{N}-th 'Fibonacci' was computed in {(← IO.monoMsNow) - start}ms"
IO.println b
Tomas Skrivan (Dec 06 2023 at 15:34):
C compiler optimization kicks in and I get 0ms
:)
Mario Carneiro (Dec 06 2023 at 15:34):
actually I think it's moving the call past the monoMs
Tomas Skrivan (Dec 06 2023 at 15:34):
(I had to change Nat
to USize
as do not want to test additiong on Nat
)
Mario Carneiro (Dec 06 2023 at 15:34):
because it's pure
Tomas Skrivan (Dec 06 2023 at 15:35):
.. ahh right ... but c equivalent with O2
and O3
gives 0ms
too but the purity argument is more likely
Tomas Skrivan (Dec 06 2023 at 15:35):
That explains why
Nat.iterate (fun (x,y) => let tmp := x+y; (y,tmp)) N ((0:USize),(1:USize))
also gives 0ms
Mario Carneiro (Dec 06 2023 at 15:37):
with @[noinline] def blackbox : α → IO α := pure
and let b := tailFib (← blackbox N) 0 1
I get 87ms
Mario Carneiro (Dec 06 2023 at 15:38):
Oh yeah I think the specific optimization here is extract_closed
: since N
is a constant tailFib N 0 1
is also a constant, so it is hoisted into a run-once computation at initialization time
Tomas Skrivan (Dec 06 2023 at 15:40):
I just shoved it into IO
let rec tailFib : Nat → USize → USize → IO USize
| 0, _, b => pure b
| n+1, a, b => tailFib n b (a+b)
let s''' ← tailFib N 0 1
and I get 47ms
. Your variant with blackbox
gives the same time.
Mario Carneiro (Dec 06 2023 at 15:41):
I believe this loop has no allocations, which is difficult to achieve in lean code
Mario Carneiro (Dec 06 2023 at 15:42):
this is especially important because lean's allocator is not marked up as an allocator to LLVM, so it destroys all kinds of optimizations in C land
Tomas Skrivan (Dec 06 2023 at 15:47):
The issue is that I'm usually doing iterations over some generic finite type I
for which I have function iterate {α} : (I → α → α) → α → α
. This means that I have to write curried specialization for every α
that is of the form _ × _
.
Mario Carneiro (Dec 06 2023 at 15:53):
I think even if you have a tuple it will be faster to do tail recursion in this way
Mario Carneiro (Dec 06 2023 at 15:53):
it won't be as fast as uncurried though
Last updated: Dec 20 2023 at 11:08 UTC