Zulip Chat Archive
Stream: general
Topic: Implementation of List.foldrTR using Array
Simon Dima (Jun 13 2025 at 09:42):
I was quite surprised to find out that the implementation of List.foldr's csimp twin List.foldrTR uses Array.foldr under the hood:
@[specialize] def foldrTR (f : α → β → β) (init : β) (l : List α) : β := l.toArray.foldr f init
It looks like this was added relatively recently, in this commit.
This implementation looks like it will always use O(n) heap space to allocate the array, traversing the list once to do so and then walking back down the array to compute Array.foldr.
I'd have expected it to be written as a reversal followed by a foldl:
def myfoldrTR (f : α → β → β) (init : β) (l : List α) : β := l.reverse.foldl (fun b a => f a b) init
This version has two list traversals, since the "backwards pass" of foldl on the reversed list is done by chasing pointers instead of walking down a contiguous array; I'm not sure by how much this affects performance. An advantage would be that, in the case of a uniquely owned list, the FBIP optimization will apply to List.reverse and reuse the memory used for l.
Why is the implementation of foldrTR written the way it is?
Robin Arnez (Jun 13 2025 at 10:13):
My assumption is that allocating Arrays and iterating through them is faster than allocating tons of Lists. For one, an Array gives you a contiguous block of memory, so the CPU cache speeds up the operation. Also, the number of allocations you have to do for an Array is O(log n), not O(n) as for lists.
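(As a sketch of where the O(log n) figure comes from: `List.toArray` is an optimized builtin, but behaviorally it amounts to pushing each element onto an `Array`, and `Array.push` grows the backing storage geometrically, so only logarithmically many reallocations happen. This is an illustrative reconstruction, not the actual implementation:)

```lean
-- Sketch only: List.toArray is a builtin, but it behaves like pushing
-- each element onto an Array. Array.push grows the backing storage
-- geometrically, so building the array performs O(log n) reallocations
-- on top of the single traversal of the list.
def toArraySketch (l : List α) : Array α :=
  l.foldl (fun acc a => acc.push a) #[]
```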
Simon Dima (Jun 13 2025 at 14:26):
Robin Arnez said:
My assumption is that allocating Arrays and iterating through them is faster than allocating tons of Lists.
I benchmarked three ways of counting up to test this out:
-- List.sum has a bug which causes the csimp List.foldr -> List.foldrTR not to apply, which leads to stack overflows.
-- https://github.com/leanprover/lean4/issues/7750
def list_sum n := List.replicate n 1 |>.foldl Nat.add 0
def list_sum_foldr n := List.replicate n 1 |>.foldr Nat.add 0
def list_sum_rev n := List.replicate n 1 |>.reverse |>.foldl Nat.add 0
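(For completeness, the binaries were presumably built from a small driver along these lines; the `main` and its argument handling are my reconstruction, not something posted in the thread:)

```lean
def list_sum (n : Nat) : Nat := List.replicate n 1 |>.foldl Nat.add 0

-- Hypothetical driver: parses the element count from the first command-line
-- argument (defaulting to 0) and prints the sum.
def main (args : List String) : IO Unit := do
  let n := (args.head?.bind String.toNat?).getD 0
  IO.println (list_sum n)
```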
The version with a left fold is unsurprisingly fastest, but on my system "reverse-then-foldl" ran faster than the version using a right fold csimped into array allocation!
❯ hyperfine -N --warmup 5 -r 20 "bin/list_sum 10000000" "bin/list_sum_rev 10000000" "bin/list_sum_foldr 10000000"
Benchmark 1: bin/list_sum 10000000
Time (mean ± σ): 187.9 ms ± 5.3 ms [User: 165.5 ms, System: 20.9 ms]
Range (min … max): 184.0 ms … 207.3 ms 20 runs
Benchmark 2: bin/list_sum_rev 10000000
Time (mean ± σ): 219.9 ms ± 2.4 ms [User: 198.4 ms, System: 20.1 ms]
Range (min … max): 216.0 ms … 225.6 ms 20 runs
Benchmark 3: bin/list_sum_foldr 10000000
Time (mean ± σ): 285.7 ms ± 3.4 ms [User: 258.4 ms, System: 25.2 ms]
Range (min … max): 280.7 ms … 294.1 ms 20 runs
Summary
bin/list_sum 10000000 ran
1.17 ± 0.04 times faster than bin/list_sum_rev 10000000
1.52 ± 0.05 times faster than bin/list_sum_foldr 10000000
Those performance gaps are quite large, and the results were very consistent on my machine (after disabling swap and making space in memory), so I'd expect you to be able to reproduce them. The question of whether this is an interesting use case to benchmark is of course another one, since here there is unique ownership.
Simon Dima (Jun 13 2025 at 14:58):
I modified the tested functions to keep a reference to the list in order to prevent FBIP reuse optimizations:
def list_sum n :=
  let l := List.replicate n 1
  let res := l.foldl Nat.add 0
  res + l[0]!

def list_sum_foldr n :=
  let l := List.replicate n 1
  let res := l.foldr Nat.add 0
  res + l[0]!

def list_sum_rev n :=
  let l := List.replicate n 1
  let res := l.reverse.foldl Nat.add 0
  res + l[0]!
This made basically no difference to the runtime of list_sum (using List.foldl) and list_sum_foldr (using List.foldrTR), but the runtime of list_sum_rev increased a lot:
❯ hyperfine -N --warmup 5 -r 20 "bin/list_sum 10000000" "bin/list_sum_rev 10000000" "bin/list_sum_foldr 10000000"
Benchmark 1: bin/list_sum 10000000
Time (mean ± σ): 186.9 ms ± 7.3 ms [User: 164.9 ms, System: 20.6 ms]
Range (min … max): 181.8 ms … 216.2 ms 20 runs
Benchmark 2: bin/list_sum_rev 10000000
Time (mean ± σ): 351.2 ms ± 8.4 ms [User: 309.5 ms, System: 39.1 ms]
Range (min … max): 343.2 ms … 373.2 ms 20 runs
Benchmark 3: bin/list_sum_foldr 10000000
Time (mean ± σ): 270.3 ms ± 4.7 ms [User: 242.2 ms, System: 25.9 ms]
Range (min … max): 265.9 ms … 287.4 ms 20 runs
Summary
bin/list_sum 10000000 ran
1.45 ± 0.06 times faster than bin/list_sum_foldr 10000000
1.88 ± 0.09 times faster than bin/list_sum_rev 10000000
The performance of list_sum_rev especially was quite sensitive to the other programs running, and it got a lot better once I closed additional memory-hungry programs. Without a proper benchmarking setup these numbers are mostly garbage, but it seems like the takeaway here is that "reverse then foldl" is better than "convert to array then foldr" when the list is unshared, but worse (and highly sensitive to the system load) when the list is shared.
Mario Carneiro (Jun 19 2025 at 18:33):
When I was looking into this I came to similar conclusions, but ended up blocked on the fact that there is no way in Lean to branch on whether the input is shared or not (except using dbgTraceIfShared, but this doesn't report its result to Lean code). I tried to introduce a function to expose this functionality (lean4#1586) but the PR was never merged.
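(For concreteness, here is a sketch of how `dbgTraceIfShared` is used today; the tag string and the function are illustrative:)

```lean
-- dbgTraceIfShared prints its tag to stderr when the value's reference
-- count is greater than one, but it always returns the value unchanged,
-- so the sharedness cannot be observed by the surrounding Lean code.
def sumWithSharedCheck (l : List Nat) : Nat :=
  (dbgTraceIfShared "l is shared" l).foldl Nat.add 0
```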
Robin Arnez (Jun 19 2025 at 18:35):
I'm pretty sure the one with Array is actually faster if the memory is sufficiently fragmented, so I don't see a reason to change this (even if we had a function to expose whether it is shared)
Mario Carneiro (Jun 19 2025 at 18:37):
the reason the unshared case is fast is because it will just flip the linked list in place. This uses less memory but it's possible that if the list is very long and fragmented then the second pass will be just as slow as the first
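(For context, `List.reverse` boils down to a tail-recursive accumulator loop like the following sketch; when the input is uniquely owned, the reuse optimization lets each cons cell of the input be recycled for the output, which is the "flip in place" above:)

```lean
-- Sketch of the reversal loop; the actual List.reverse goes through an
-- internal reverseAux of the same shape. When `l` is uniquely owned, the
-- compiler can reuse each `a :: as` cell to build `a :: acc` in place.
def reverseSketch (l : List α) : List α := go l []
where
  go : List α → List α → List α
  | [], acc => acc
  | a :: as, acc => go as (a :: acc)
```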
Robin Arnez (Jun 19 2025 at 18:42):
Yeah, right, it's the difference between chasing pointers twice versus allocating an array and chasing pointers once.
Last updated: Dec 20 2025 at 21:32 UTC