Zulip Chat Archive

Stream: general

Topic: Unexpected performance issue

Jukka Suomela (Dec 12 2024 at 20:52):

Hi all, here is something I don't understand. On my computer with Lean 4.14.0 running this (with something like lake build foo && time .lake/build/bin/foo) takes something like 22 seconds:

def foo (n : Nat) : Nat := Id.run do
  let mut x : Array (Option Nat) := Array.mkArray0
  for _ in [0:n] do
    x := x.push (some 0)
  for i in [0:n] do
    let e := x[i]!
    -- if e.isNone then
    --   unreachable!
    x := x.set! i none
    x := x.set! i e
  x.size

def main : IO Unit := do
  let r := foo 100000
  IO.println s!"{r}"

But if I un-comment the two commented-out lines, it'll be something like 0.01s:

def foo (n : Nat) : Nat := Id.run do
  let mut x : Array (Option Nat) := Array.mkArray0
  for _ in [0:n] do
    x := x.push (some 0)
  for i in [0:n] do
    let e := x[i]!
    if e.isNone then
      unreachable!
    x := x.set! i none
    x := x.set! i e
  x.size

def main : IO Unit := do
  let r := foo 100000
  IO.println s!"{r}"

The code doesn't do anything meaningful, but I originally noticed exactly the same phenomenon in some code that was actually doing useful stuff, and I had to add if … then unreachable! as a workaround to make it not-painfully-slow.

Any thoughts? I tried to look at the C code generated by lean but I didn't really get what's causing this.

Jukka Suomela (Dec 12 2024 at 21:09):

The first version seems to have a running time roughly quadratic in n, while the second one seems to be roughly linear in n, so my guess is that the first one is doing something like copying the array in the loop, but I don't see why (and esp. why does this specific change fix it)?

Shreyas Srinivas (Dec 13 2024 at 00:05):

This is a guess: because calling e.isNone is causing x to be copied

Shreyas Srinivas (Dec 13 2024 at 00:05):

Calling e.isNone is incrementing the ref count of x to 1

Shreyas Srinivas (Dec 13 2024 at 00:05):

(deleted)

Shreyas Srinivas (Dec 13 2024 at 00:06):

(deleted)

Kim Morrison (Dec 13 2024 at 00:13):

Without explaining the performance problem, it seems:

def foo (n : Nat) : Nat := Id.run do
  let mut x : Vector (Option Nat) n := Vector.mkVector n (some 0)
  for h : i in [0:n] do
    let e := x[i]
    x := x.set i none
    x := x.set i e
  x.size

def main : IO Unit := do
  let r := foo 100000
  IO.println s!"{r}"

(note the absence of !) is fast.

Kim Morrison (Dec 13 2024 at 00:14):

@Shreyas Srinivas, you have it exactly backwards: add the e.isNone line makes the code fast, rather than slow.

Eric Wieser (Dec 13 2024 at 00:16):

I think this is because []! returns none if the index is out of bounds (as your array is an Option _ type), and so adding the unreachable! amounts to stating the index is always in bounds?

Kim Morrison (Dec 13 2024 at 00:16):

I'm not going to investigate this, but presumably set_option trace.compiler.ir true will reveal (with patient reading!) what is going on.

Shreyas Srinivas (Dec 13 2024 at 00:32):

Eric Wieser said:

I think this is because []! returns none if the index is out of bounds (as your array is an Option _ type), and so adding the unreachable! amounts to stating the index is always in bounds?

The doc string of []! says it panics when out of bounds :

The syntax arr[i]! gets the i'th element of the collection arr and panics i is out of bounds.

Shreyas Srinivas (Dec 13 2024 at 00:36):

ah i see. you mean e gets the default value of None

Shreyas Srinivas (Dec 13 2024 at 00:36):

But in this example this out of bounds access never seems to occur.

Jukka Suomela (Dec 13 2024 at 13:19):

@Kim Morrison Looking at the IR code, can the problem be hidden inside e.g. a call to Array.set!? I mean, can it be the case that in one version some reference count will be zero and in another version it will be one, and depending on that Array.set! will either do a copy or not? Put otherwise, is it possible that in both versions the IR code does essentially the same operations, just in a slightly different order, and this results in a very different run-time performance because of something that happens inside Array.set! depends on some reference counts?

(I was already earlier trying to see if one version is doing different operations, looking for something like an explicit call to some expensive-sounding copy operation, but failed to find anything fundamentally different between the two versions.)

Shreyas Srinivas (Dec 13 2024 at 13:20):

I looked at the C code. The main loop seems to be essentially the same (except the checks)

Shreyas Srinivas (Dec 13 2024 at 13:24):

I do wonder if the C compiler is behaving differently. After all in both cases, the loop can be optimized away

Jukka Suomela (Dec 13 2024 at 13:26):

No it's not about the compiler optimizing the loop away, the same thing happened in real code in which the operations are very much non-redundant and the compiler can't optimize it away (what happens depends on input that is only known at run time).

Jukka Suomela (Dec 13 2024 at 13:30):

(I'm sorry that the example is "bad" in the sense that a clever compiler could do crazy stuff, but the original code had all kinds of extra operations that made it much more difficult to see what might be relevant and what is not.)

Shreyas Srinivas (Dec 13 2024 at 13:31):

CC: @Sebastian Ullrich or @Mario : is there a way to know, looking at the generated C code, which objects are mutated in-place and which ones will be copied? In the main function I see many global declarations being called "persistent" but I don't know if that has any connection at all to FBIP. In this instance what should we be looking at to test Jukka's guess?

Jukka Suomela (Dec 13 2024 at 13:34):

In any case, the fundamental question is why the "slow" version seems to have quadratic running time as a function of n? And especially, is there some way to catch this issue either at compilation time or at run time?

For example, if the slowness is coming from Array.set! possibly doing copies depending on some reference counts (?), is there some special version of Array.set! that I could call that says "don't compile this unless you can prove that copying is not needed" or "please panic at run time if you'd need to do copying"? So that this kind of a performance issue wouldn't be unnoticed in development/testing?

Shreyas Srinivas (Dec 13 2024 at 13:35):

For the latter question I know that the answer is partially "no" because there have been experiments adding what type theorists called "linearity" to lean's type system in a nice masters thesis, but the language itself doesn't offer it.

Shreyas Srinivas (Dec 13 2024 at 13:38):

Also I don't think there is any syntactic way of detecting whether an object is persistent or mutated in place. I encountered this problem when writing the vector API

Henrik Böving (Dec 13 2024 at 13:39):

Shreyas Srinivas said:

For the latter question I know that the answer is partially "no" because there have been experiments adding what type theorists called "linearity" to lean's type system in a nice masters thesis, but the language itself doesn't offer it.

it's https://pp.ipd.kit.edu/uploads/publikationen/huisinga23masterarbeit.pdf by @Marc Huisinga btw

Henrik Böving (Dec 13 2024 at 13:49):

Shreyas Srinivas said:

CC: Sebastian Ullrich or @Mario : is there a way to know, looking at the generated C code, which objects are mutated in-place and which ones will be copied? In the main function I see many global declarations being called "persistent" but I don't know if that has any connection at all to FBIP. In this instance what should we be looking at to test Jukka's guess?

i would suggest reading the IR not the C code and looking at the ref counting behavior of each function to see how Array is ref counted. For objects that don't do internal mutation but are actually FBIP-ed you would look for branches over the object being shared and then manually figure out if an object is shared or not at that point by inspecting the IR.

If it was possible to "just figure this out by looking at the C code" Marc's thesis wouldn't be really interesting would it

Shreyas Srinivas (Dec 13 2024 at 13:51):

How does one get the lean IR? I'd like to read it outside infoview, with indentations if possible

Henrik Böving (Dec 13 2024 at 13:52):

well for one you can just copy it out of the infoview, you can also just set trace.compiler.ir.result true and look at the output on the CLI (or pipe that wherever and look at it with whatever tool, it's just text)

Shreyas Srinivas (Dec 13 2024 at 14:07):

what does an argument annotated with@& mean in the IR?

Shreyas Srinivas (Dec 13 2024 at 14:08):

Nvm the docstring makes sense

Jukka Suomela (Dec 13 2024 at 15:21):

Apparently there is dbgTraceIfShared that might help here (I just hope I understood correctly how it's supposed to work).

For example, this (fast version) only prints out shared RC 1, i.e., x is apparently shared after let e := x[i]! but not shared anymore after if e.isNone then unreachable!:

import Init.Util

def foo (n : Nat) : Nat := Id.run do
  let mut x : Array (Option Nat) := Array.mkArray0
  for _ in [0:n] do
    x := x.push (some 0)
  for i in [0:n] do
    x := dbgTraceIfShared "0" x
    let e := x[i]!
    x := dbgTraceIfShared "1" x
    if e.isNone then
      unreachable!
    x := dbgTraceIfShared "2" x
    x := x.set! i none
    x := dbgTraceIfShared "3" x
    x := x.set! i e
    x := dbgTraceIfShared "4" x
  x.size

def main : IO Unit := do
  let r := foo 100000
  IO.println s!"{r}"

Also, it doesn't really matter how I inspect e, this will also do (ref count of x drops to 1 after if e == ... and this is fast):

import Init.Util

def foo (n : Nat) : Nat := Id.run do
  let mut x : Array (Option Nat) := Array.mkArray0
  for _ in [0:n] do
    x := x.push (some 0)
  let mut c : Nat := 0
  for i in [0:n] do
    x := dbgTraceIfShared "0" x
    let e := x[i]!
    x := dbgTraceIfShared "1" x
    if e == some 123 then
      c := c + 1
    x := dbgTraceIfShared "2" x
    x := x.set! i none
    x := dbgTraceIfShared "3" x
    x := x.set! i e
    x := dbgTraceIfShared "4" x
  c + x.size

def main : IO Unit := do
  let r := foo 100000
  IO.println s!"{r}"

Sebastian Ullrich (Dec 13 2024 at 15:28):

A cases floating is creating a nonlinear use of the array, a_1 in this part of trace.compiler.simp:

    let _x_139 := a_1.set! a _x_138;
    Decidable.casesOn _x_137
      (fun h =>
        let _x_140 := outOfBounds;
        let _x_141 := _x_139.set! a _x_140;
        ForInStep.yield _x_141)
      fun h =>
      let _x_142 := a_1.get a h;
      let _x_143 := _x_139.set! a _x_142;
      ForInStep.yield _x_143;

We should avoid doing that. As far as I can see, this optimization does not (yet) exist in the new compiler, something to keep in mind for @Cameron Zwarich :) .

Shreyas Srinivas (Dec 13 2024 at 15:42):

But why is it bad that it is faster?

Sebastian Ullrich (Dec 13 2024 at 15:47):

I'm saying this is the old code generator pessimizing the first example

Shreyas Srinivas (Dec 13 2024 at 15:47):

Oh okay

Shreyas Srinivas (Dec 13 2024 at 15:48):

I am grateful for this question since this resulted me going down the rabbit hole of FFI, code generation, and all that

Last updated: May 02 2025 at 03:31 UTC