Zulip Chat Archive
Stream: lean4
Topic: Summing intermediate list slower than summing directly?
Kyle Miller (Jan 24 2022 at 19:02):
I'm wondering about this curious difference in performance between two ways of summing up the first million positive natural numbers. The first is to create a List of a million and one elements using List.range and then do a fold; the second is essentially to do the list fusion by hand, writing a recursive function that sums those same elements (though in reverse order) without creating an intermediate list.
The fold over List.range takes about 210ms, while the specialized recursive procedure takes about 630ms (or 551ms if written slightly differently, avoiding an extra addition). It also takes more than 600ms to sum the same elements in the same order as List.range.
def List.foldl' {α β} (f : α → β → α) : (init : α) → List β → α
  | a, nil => a
  | a, cons b l => foldl' f (f a b) l

def List.sum [Add α] [OfNat α 0] (xs : List α) : α := xs.foldl' (·+·) 0

def sumRangeCore (acc : Nat) : Nat → Nat
  | 0 => acc
  | (n+1) => sumRangeCore (acc + n + 1) n

def sumRange (n : Nat) : Nat := sumRangeCore 0 n

-- has the same performance as `sumRangeCore` if it uses `n+1` rather than `m`.
def sumRangeCore' : Nat → Nat → Nat
  | acc, 0 => acc
  | acc, m@(n+1) => sumRangeCore' (acc + m) n

def sumRange' (n : Nat) : Nat := sumRangeCore' 0 n

def sumRangeCore'' : Nat → Nat → Nat → Nat
  | 0, acc, m => acc
  | (n+1), acc, m => sumRangeCore'' n (acc + m) (m + 1)

def sumRange'' (n : Nat) : Nat := sumRangeCore'' n 0 0

def test1 : IO Nat := do
  pure $ (List.range (1000000 + 1)).sum

def test2 : IO Nat := do
  pure $ sumRange 1000000

def test3 : IO Nat := do
  pure $ sumRange' 1000000

def test4 : IO Nat := do
  pure $ sumRange'' (1000000 + 1)
#eval timeit "fold" test1
-- fold 215ms 500000500000
#eval timeit "recursive" test2
-- recursive 632ms 500000500000
#eval timeit "recursive2" test3
-- recursive2 551ms 500000500000
#eval timeit "recursive3" test4
-- recursive3 670ms 500000500000
Sebastian Ullrich (Jan 24 2022 at 19:19):
Comparing and optimizing the interpreted performance is not really meaningful when the compiled version is up to 100x as fast anyway and may have completely different performance behavior.
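A minimal sketch of what benchmarking the compiled code could look like (not from the thread; it assumes the definitions above sit in a file that is built as an executable, e.g. with lake build):
-- Sketch: run the same timings from a compiled `main` instead of `#eval`.
-- Assumes the definitions above are in the same file and the file is built
-- as an executable (e.g. via `lake build`), so nothing runs in the interpreter.
def main : IO Unit := do
  let s1 ← timeit "fold" test1
  let s2 ← timeit "recursive" test2
  let s3 ← timeit "recursive2" test3
  let s4 ← timeit "recursive3" test4
  IO.println s!"{s1} {s2} {s3} {s4}"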
Kyle Miller (Jan 24 2022 at 19:32):
I get that this is just interpreter performance, but I find it surprising that building this intermediate list and going through more indirection to do a fold somehow takes only a third of the runtime of a single recursive procedure that is specialized to Nat.
Kyle Miller (Jan 24 2022 at 19:33):
When compiled, the fold is 9x slower, which is more like what I would expect (34.3ms vs 3.85ms).
I'm aware that when compiled it can use faster arithmetic routines, but still: when interpreted, both the fold and the recursive versions are using the same implementation of Nat.add.
Sebastian Ullrich (Jan 24 2022 at 19:43):
The time spent on Nat.add should be immaterial compared to the pure overhead of traversing and interpreting the IR. The size of the generated IR may very well be the dominating factor.
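One way to look at the generated IR being discussed here, as a sketch (the copy of sumRangeCore under a new name is just so the trace fires for a fresh definition; the option used is trace.compiler.ir.result):
-- Sketch: ask the compiler to print the IR it generates for a definition.
-- `sumRangeCoreIR` is a copy of `sumRangeCore` under a new name, so the
-- option below applies to a fresh definition.
set_option trace.compiler.ir.result true in
def sumRangeCoreIR (acc : Nat) : Nat → Nat
  | 0 => acc
  | (n+1) => sumRangeCoreIR (acc + n + 1) n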
Kyle Miller (Jan 24 2022 at 20:05):
The IR is essentially the same between them, though the recursive version uses Nat.decEq rather than being able to case split directly.
It turns out that replacing List.range with an interpreted version makes the performance difference completely disappear. I'm guessing that's because much of the overhead for the loop (checking whether a number is zero, incrementing/decrementing numbers) was hidden in List.range.
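The exact replacement isn't shown in the thread; a sketch of what a hand-written, interpreted range could look like (the names myRangeAux, myRange, and test1' are made up for illustration):
-- Sketch: a hand-written range so that building the list runs in the
-- interpreter instead of calling the precompiled List.range.
-- Produces [0, 1, ..., n-1], the same list as List.range n.
def myRangeAux : Nat → List Nat → List Nat
  | 0, acc => acc
  | (n+1), acc => myRangeAux n (n :: acc)

def myRange (n : Nat) : List Nat := myRangeAux n []

def test1' : IO Nat := do
  pure $ (myRange (1000000 + 1)).sum

#eval timeit "fold (interpreted range)" test1'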