Zulip Chat Archive

Stream: lean4

Topic: slow mut variables

Tomas Skrivan (Dec 06 2023 at 15:17):

I just found out that having multiple mutable variables impacts performance quite badly.

I did a simple test loop with three different variants of mutable variables:

  let mut a : USize := 0
  let mut b : USize := 1

  let mut s : USize × USize := (0,1)

  let mut s' : USize2 := ⟨0,1⟩

where USize2 is a specialized struct holding two USize.

The times are 1197ms, 438ms and 137ms. I'm surprised that the the variant 1 and 2 are so different as you are paying the price of generic product type. Using the specialized struct USize2 makes it almost as fast as possible as equivalent c program takes 52ms.

So clearly if I want to get speed and have multiple mutable variables I should define a custom struct that holds the values instead of using generic Prod type. Is there anything else(hopefully less tedious) that I can do to make it fater?

code

structure USize2 where
  (x y : USize)

def main : IO Unit := do

  let N := 100000000

  let mut a : USize := 0
  let mut b : USize := 1
  let start ← IO.monoMsNow
  for _ in [0:N] do
    let tmp := a+b
    a := b
    b := tmp
  IO.println s!"{N}-th 'Fibonacci' was computed in {(← IO.monoMsNow) - start}ms"
  IO.println b

  IO.println ""

  let mut s : USize × USize := (0,1)
  let start ← IO.monoMsNow
  for _ in [0:N] do
    let (x,y) := s
    s := (y, x+y)
  IO.println s!"{N}-th 'Fibonacci' was computed in {(← IO.monoMsNow) - start}ms"
  IO.println s.2

  IO.println ""

  let mut s' : USize2 := ⟨0,1⟩
  let start ← IO.monoMsNow
  for _ in [0:N] do
    let ⟨x,y⟩ := s'
    s' := ⟨y, x+y⟩
  IO.println s!"{N}-th Fibonacci was computed in {(← IO.monoMsNow) - start}ms"
  IO.println s'.y

  IO.println ""

Mario Carneiro (Dec 06 2023 at 15:27):

So clearly if I want to get speed and have multiple mutable variables I should define a custom struct that holds the values instead of using generic Prod type. Is there anything else(hopefully less tedious) that I can do to make it fater?

Yes, don't use for loops :sad: Use tail recursion instead

Mario Carneiro (Dec 06 2023 at 15:30):

what do you get with this variation?

  let start ← IO.monoMsNow
  let rec tailFib : Nat → USize → USize → USize
    | 0, _, b => b
    | n+1, a, b => tailFib n b (a+b)
  let b := tailFib N 0 1
  IO.println s!"{N}-th 'Fibonacci' was computed in {(← IO.monoMsNow) - start}ms"
  IO.println b

Tomas Skrivan (Dec 06 2023 at 15:34):

C compiler optimization kicks in and I get 0ms :)

Mario Carneiro (Dec 06 2023 at 15:34):

actually I think it's moving the call past the monoMs

Tomas Skrivan (Dec 06 2023 at 15:34):

(I had to change Nat to USize as do not want to test additiong on Nat)

Mario Carneiro (Dec 06 2023 at 15:34):

because it's pure

Tomas Skrivan (Dec 06 2023 at 15:35):

.. ahh right ... but c equivalent with O2 and O3 gives 0ms too but the purity argument is more likely

Tomas Skrivan (Dec 06 2023 at 15:35):

That explains why

Nat.iterate (fun (x,y) => let tmp := x+y; (y,tmp)) N ((0:USize),(1:USize))

also gives 0ms

Mario Carneiro (Dec 06 2023 at 15:37):

with @[noinline] def blackbox : α → IO α := pure and let b := tailFib (← blackbox N) 0 1 I get 87ms

Mario Carneiro (Dec 06 2023 at 15:38):

Oh yeah I think the specific optimization here is extract_closed: since N is a constant tailFib N 0 1 is also a constant, so it is hoisted into a run-once computation at initialization time

Tomas Skrivan (Dec 06 2023 at 15:40):

I just shoved it into IO

  let rec tailFib : Nat → USize → USize → IO USize
    | 0, _, b => pure b
    | n+1, a, b => tailFib n b (a+b)
  let s''' ← tailFib N 0 1

and I get 47ms. Your variant with blackbox gives the same time.

Mario Carneiro (Dec 06 2023 at 15:41):

I believe this loop has no allocations, which is difficult to achieve in lean code

Mario Carneiro (Dec 06 2023 at 15:42):

this is especially important because lean's allocator is not marked up as an allocator to LLVM, so it destroys all kinds of optimizations in C land

Tomas Skrivan (Dec 06 2023 at 15:47):

The issue is that I'm usually doing iterations over some generic finite type I for which I have function iterate {α} : (I → α → α) → α → α. This means that I have to write curried specialization for every α that is of the form _ × _.

Mario Carneiro (Dec 06 2023 at 15:53):

I think even if you have a tuple it will be faster to do tail recursion in this way

Mario Carneiro (Dec 06 2023 at 15:53):

it won't be as fast as uncurried though

Last updated: May 02 2025 at 03:31 UTC