Zulip Chat Archive

Stream: lean4

Topic: Summing intermediate list slower than summing directly?


Kyle Miller (Jan 24 2022 at 19:02):

I'm wondering about this curious difference in performance between two ways of summing up the first million positive natural numbers. The first is to create a List of a million+1 elements using List.rangeand then doing a fold, and the second is to essentially do the list fusion by hand, writing a recursive function that sums those same elements (though in reverse order) but without creating an intermediate list.

The fold over List.range takes about 210ms, and the specialized recursive procedure takes about 630ms (or 551ms if written slightly differently, avoiding an additional addition). It also takes more than 600ms to sum the same elements in the same order as List.range.

def List.foldl' {α β} (f : α  β  α) : (init : α)  List β  α
  | a, nil      => a
  | a, cons b l => foldl' f (f a b) l

def List.sum [Add α] [OfNat α 0] (xs : List α) : α := xs.foldl' (·+·) 0

def sumRangeCore (acc : Nat) : Nat  Nat
| 0 => acc
| (n+1) => sumRangeCore (acc + n + 1) n

def sumRange (n : Nat) : Nat := sumRangeCore 0 n

-- has the same performance as `sumRangeCore` if it uses `n+1` rather than `m`.
def sumRangeCore' : Nat  Nat  Nat
| acc, 0 => acc
| acc, m@(n+1) => sumRangeCore' (acc + m) n

def sumRange' (n : Nat) : Nat := sumRangeCore' 0 n

def sumRangeCore'' : Nat  Nat  Nat  Nat
| 0, acc, m => acc
| (n+1), acc, m => sumRangeCore'' n (acc + m) (m + 1)

def sumRange'' (n : Nat) : Nat := sumRangeCore'' n 0 0

def test1 : IO Nat := do
  pure $ (List.range (1000000 + 1)).sum

def test2 : IO Nat := do
  pure $ sumRange 1000000

def test3 : IO Nat := do
  pure $ sumRange' 1000000

def test4 : IO Nat := do
  pure $ sumRange'' (1000000 + 1)

#eval timeit "fold" test1
-- fold 215ms 500000500000
#eval timeit "recursive" test2
-- recursive 632ms 500000500000
#eval timeit "recursive2" test3
-- recursive2 551ms 500000500000
#eval timeit "recursive3" test4
-- recursive3 670ms 500000500000

Sebastian Ullrich (Jan 24 2022 at 19:19):

Comparing and optimizing the interpreted performance is not really meaningful when the compiled version is up to 100x as fast anyway and may have completely different performance behavior

Kyle Miller (Jan 24 2022 at 19:32):

I get that this is just interpreter performance, but I'm finding it to be surprising that somehow building this intermediate list and going through more indirection to do a fold is somehow 1/3 the runtime of a single recursive procedure that is specialized to Nat.

Kyle Miller (Jan 24 2022 at 19:33):

When compiled, the fold is 9x slower, which is more what I would expect. (34.3ms vs 3.85ms)

I'm aware that when compiled it can use faster arithmetic routines, but still, when interpreted, both the fold and recursive versions are using the same implementation of Nat.add.

Sebastian Ullrich (Jan 24 2022 at 19:43):

The time spent on Nat.add should be immaterial compared to the pure overhead of traversing and interpreting the IR. The size of the generated IR may very well be the dominating factor.

Kyle Miller (Jan 24 2022 at 20:05):

The IR is essentially the same between them, though the recursive version uses Nat.decEq rather than being able to case split directly.

It turns out that replacing List.range with an interpreted version makes the performance difference completely disappear. I'm guessing that it's because much of overhead for the loop (checking whether a number is zero, incrementing/decrementing numbers) was hidden in List.range.


Last updated: Dec 20 2023 at 11:08 UTC