Zulip Chat Archive

Stream: lean4

Topic: Mysterious array access performance

Mario Carneiro (Aug 27 2021 at 03:34):

I haven't yet managed to isolate the problem into an MWE, but I can roughly describe it: For some particular array arr I have observed that readArray is slow (~7 s) here:

def generateArray : Array Expr := sorry
def readArray (arr : Array Expr) : IO Unit := sorry

#eval show IO Unit from do
  let arr := generateArray
  timeit "slow" (readArray arr)

but fast (~0.2 ms) here:

def freshen [Inhabited α] (arr : Array α) : IO (Array α) := do
  let mut out := arr
  for i in [0:out.size] do
    out := out.setD i out[i]
  out

#eval show IO Unit from do
  let arr := generateArray
  let arr ← freshen arr
  timeit "fast" (readArray arr)

What could cause this behavior? readArray is supposed to be fast and basically calls arr[i] for various values of i (it does not modify the array). freshen is a value level no-op but may do something interesting to the reference counts, but I don't see how that would affect the performance of Array.get so dramatically.

Mario Carneiro (Aug 27 2021 at 03:59):

Oh, even adding let arr <- if false then freshen arr else arr makes it fast, so this seems like a compiler bug

Kyle Miller (Aug 27 2021 at 05:28):

I seem to remember that the implementation of Perceus in Lean 4 takes a conservative approach for global variables (i.e., ones that might be shared between threads), where these always take the slow path for the reference count check. The freshen function certainly makes a local copy of generateArray, so it would avoid this, and perhaps if arr <- if false then freshen arr else arr does too (though I'm somewhat dubious). Maybe it'd be worth taking a look at the generated C code?

Mario Carneiro (Aug 27 2021 at 05:47):

Finally got a self contained example:

def freshen (arr : Array Unit) : IO (Array Unit) := arr.setD 0 ()

def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out

def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let out ← freshen $ Array.mkArray 50000 ()
  let mut arr := build out
  -- arr ← freshen arr
  timeit "time" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]

#eval bench

Mario Carneiro (Aug 27 2021 at 05:53):

Note that all of the tests I've been doing so far have been in the interpreter, not the generated C code. For all I know things are completely different in the compiled version

Kyle Miller (Aug 27 2021 at 05:54):

Here's a weird data point to add. If you change freshen to not be in the IO monad, then it runs fast again:

def freshen [Inhabited α] (arr : Array α) : Array α := do
  let mut out := arr
  for i in [0:out.size] do
    out := out.setD i out[i]
  out

Mario Carneiro (Aug 27 2021 at 05:55):

Here out <- freshen ... is just there to ensure that it and arr aren't hoisted into globals

Kyle Miller (Aug 27 2021 at 06:00):

(It's kind of a nice surprise that let out <- freshen ... or let out := freshen ... both work when freshen is using the identity monad.) I wonder why the IO monad slows things down so much later...

Kyle Miller (Aug 27 2021 at 06:06):

This version of freshen also shows this slow/fast behavior depending on whether or not the IO $ is present:

def freshen [Inhabited α] (arr : Array α) : IO $ Array α := do
  let mut out := Array.empty
  for i in [0:arr.size] do
    out := out.push arr[i]
  out

Mario Carneiro (Aug 27 2021 at 06:08):

I think the only constraints on freshen to reproduce the bug are that it modifies the array and it is in the IO monad

Kyle Miller (Aug 27 2021 at 06:12):

This is a version where freshen doesn't modify the array, but the original array is global. Removing IO makes it run fast.

def freshen (arr : Array Unit) : IO (Array Unit) := arr

def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out

def arr := freshen $ Array.mkArray 50000 ()

def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let out <- arr
  let mut arr := build out
  -- arr ← freshen arr
  timeit "time" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]

#eval bench

Mario Carneiro (Aug 27 2021 at 06:14):

in that version uncommenting the arr ← freshen arr doesn't make it fast though

Mario Carneiro (Aug 27 2021 at 06:14):

but I guess that's not super relevant, there are a bunch of things that make it fast (i.e. dodge the bug conditions)

Kyle Miller (Aug 27 2021 at 06:15):

I didn't mean to leave that in. I just thought it's weird that removing IO makes it fast, but this is still enough to make it slow.

Mario Carneiro (Aug 27 2021 at 06:16):

it looks like @[inline] on arr in your version also makes it fast

Kyle Miller (Aug 27 2021 at 06:19):

With this one, removing timeit "time1" makes it slow. Nice to see a Heisenbug...

def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out

def arr : IO (Array Unit) := Array.mkArray 50000 ()

def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- arr
  let arr <- timeit "time1" $ build arr
  timeit "time2" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]

#eval bench

Mario Carneiro (Aug 27 2021 at 06:22):

I also can't seem to outline any part of the time2 body without making it fast

Mac (Aug 27 2021 at 06:30):

Kyle Miller said:

With this one, removing timeit "time1" makes it slow. Nice to see a Heisenbug...

This one I can explain.

Mac (Aug 27 2021 at 06:31):

Removing the timeit makes the line --- i.e.,let arr <- build arr -- reduce to let arr := build arr.

Mac (Aug 27 2021 at 06:32):

Whereas otherwise, the line reduces to timeit "time1" $ build arr >>= fun arr => ... (i.e., the rest of the code is in a function and is thus lazy).

Mac (Aug 27 2021 at 06:33):

I suspect that all these performances issues probably have to do with the absence/presence of laziness.

Mario Carneiro (Aug 27 2021 at 06:34):

Even making the most pessimistic assumptions, I don't see how 120 array accesses could take 2-6 seconds

Mac (Aug 27 2021 at 06:35):

Because the let arr := build arr line results in the next line becoming:

timeit "time2" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => (build arr)[ty]

Mac (Aug 27 2021 at 06:35):

i..e., it is rebuilding the array each step instead of using the same array across all steps.

Mario Carneiro (Aug 27 2021 at 06:35):

whoa, it does that? That seems like a really bad optimization under most circumstances

Mario Carneiro (Aug 27 2021 at 06:36):

It would be great if we could spot that in the IR

Mac (Aug 27 2021 at 06:36):

This isn't an optimization. Lean is a strict language, that is what that means in a strict language.

Mac (Aug 27 2021 at 06:37):

A let definition is just an alias if its definition and use are in the same closure.

Mac (Aug 27 2021 at 06:39):

Though one could argue the problem is that let arr <- build arr (monadic) becomes let arr := build arr (non-monadic), which is not great.

Mac (Aug 27 2021 at 06:39):

I believe @Sebastian Ullrich mentioned that the Id monad may become lazy at some point. If so, this problem would vanish.

Kyle Miller (Aug 27 2021 at 06:39):

Mac said:

Removing the timeit makes the line --- i.e.,let arr <- build arr -- reduce to let arr := build arr.

Could you explain this again? I'm not sure I follow. Wouldn't this predict that the following would run fast?

def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out

def arr : IO (Array Unit) := Array.mkArray 50000 ()

def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- arr
  let arr <- pure $ build arr -- slow even if you add the IO annotations
  timeit "time2" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]

#eval bench

Mario Carneiro (Aug 27 2021 at 06:40):

Even for non-monadic values, I don't think there is any situation that mandates that a computation should be duplicated

Mario Carneiro (Aug 27 2021 at 06:40):

the compiler is permitted to duplicate computation (in the sense that this will still produce correct results), but I would generally consider that a (performance) bug

Mac (Aug 27 2021 at 06:42):

@Kyle Miller The IO annotation means nothing, IO.pure is not noninline (and IO is not opaque currently).

Mac (Aug 27 2021 at 06:42):

IO is just short for EStateM IO.Error Unit which can easily be reduced away.

Mario Carneiro (Aug 27 2021 at 06:43):

Mac said:

I believe Sebastian Ullrich mentioned that the Id monad may become lazy at some point. If so, this problem would vanish.

Well no, not if it's hoisting := let statements in a monadic block

Mac (Aug 27 2021 at 06:43):

In fact, the pure of EStateM is marked @[inline] so it will be forcibly inlined.

Kyle Miller (Aug 27 2021 at 06:44):

Then shouldn't this noinline version of pure circumvent that?

def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out

def arr : IO (Array Unit) := Array.mkArray 50000 ()

@[noinline]
def mypure (x : α) : IO α := pure x

def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- arr
  let arr <- mypure $ build arr
  timeit "time2" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]

#eval bench

Mac (Aug 27 2021 at 06:46):

It actually does for me

Mario Carneiro (Aug 27 2021 at 06:46):

I've confirmed @Mac 's theory: starting from this version, among the outputs of set_option trace.compiler.ir.result true is:

def Array.mapMUnsafe.map._at.bench._spec_2 (x_1 : obj) (x_2 : usize) (x_3 : usize) (x_4 : obj) : obj :=
  inc x_1;
  let x_5 : obj := build x_1;
  let x_6 : u8 := USize.decLt x_3 x_2;
  case x_6 : u8 of
  Bool.false →
    dec x_5;
    dec x_1;
    let x_7 : obj := unsafeCast ◾ ◾ x_4;
    ret x_7
  Bool.true →
    let x_8 : obj := Array.uget ◾ x_4 x_3 ◾;
    let x_9 : obj := 0;
    let x_10 : obj := Array.uset ◾ x_4 x_3 x_9 ◾;
    let x_11 : obj := unsafeCast ◾ ◾ x_8;
    let x_12 : obj := instInhabitedPUnit;
    let x_13 : obj := Array.get! ◾ x_12 x_5 x_11;
    dec x_11;
    dec x_5;
    let x_14 : usize := 1;
    let x_15 : usize := USize.add x_3 x_14;
    let x_16 : obj := unsafeCast ◾ ◾ x_13;
    let x_17 : obj := Array.uset ◾ x_10 x_3 x_16 ◾;
    let x_18 : obj := Array.mapMUnsafe.map._at.bench._spec_2 x_1 x_2 x_15 x_17;
    ret x_18

This is the inner loop for the map operation in #[1, 2, 3, 4].map fun ty => arr[ty], and you can see let x_5 : obj := build x_1, so build will get called 120 times.

Kyle Miller (Aug 27 2021 at 06:48):

(It looks like I had something like 16 versions running simultaneously after editing, and of course the slow ones reported their times last. My last example does run in less than a second.)

Mac (Aug 27 2021 at 06:51):

@Kyle Miller Also note that mypure requires the @[noinline] annotation, it will run slow without it (i.e., it will get inlined).

Kyle Miller (Aug 27 2021 at 06:52):

@Mac Yes, I'm agreeing with you. That's why I put @[noinline] there in the first place, and I misinterpreted how VS Code outputs to the console.

Mac (Aug 27 2021 at 06:54):

@Kyle Miller Sorry, I realized that :laughing: . I was just elaborating that simply removing the @[inline] from the existing pure is insufficient, it will still get inlined.

Mac (Aug 27 2021 at 06:54):

Note that, I believe, the following works (fast):

def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- arr
  let arr <- do build arr
  timeit "time2" do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]

Mario Carneiro (Aug 27 2021 at 06:55):

From what I can tell, it goes wrong somewhere around when mapMUnsafe is specialized, where the function argument fun ty => arr[ty] has its dependencies inlined to fun ty => (build n)[ty]

Mac (Aug 27 2021 at 06:58):

@Mario Carneiro imo, this is the expected behavior.

Mac (Aug 27 2021 at 06:59):

Computations should be wrapped in monads, that is the point of them.

Mario Carneiro (Aug 27 2021 at 06:59):

executing let statements multiple times when they were not written that way is not expected behavior. Executing them not at all is okay, as well as lowering them over case statements, but lowering them into a loop is not okay

Kyle Miller (Aug 27 2021 at 07:00):

Going back to @Mario Carneiro's earlier example, inserting a do makes it fast, as @Mac would expect. But also does replacing the line with let arr := build 50000.

def build (n : Nat) : Array Unit := do
  let mut out := #[]
  for _ in [0:n] do
    out := out.push ()
  out

@[noinline] def size : IO Nat := pure 50000

def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- do build (← size)
  timeit "time" $
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]

#eval bench

Mac (Aug 27 2021 at 07:00):

@Mario Carneiro To me, a let x := should be inlined at every x -- that is the expected behavior to me. I think of let as an alias.

Mario Carneiro (Aug 27 2021 at 07:00):

There are lots of optimizations that can lead to a monadic computation being turned into a pure computation after inlining, so using <- does not help

Kyle Miller (Aug 27 2021 at 07:00):

Re what Mario is saying, this is the basis for Haskell's monomorphism restriction, to prevent runtime blowup like this.

Kyle Miller (Aug 27 2021 at 07:01):

So at least Philip Wadler finds this sort of thing to be unexpected (I believe he was the one who pushed for it)

Mario Carneiro (Aug 27 2021 at 07:01):

Mac said:

Mario Carneiro To me, a let x := should be inlined at every x, that is the expected behavior to me. I think of let as an alias.

That is how things work in a lazy language, not a strict language. In ML let statements are executed in the order you wrote them

Mac (Aug 27 2021 at 07:02):

@Kyle Miller And the monomorphism restriction is probably one of the most frustrating (and probably hated) parts of Haskell. :laughing: Many of the language extensions in GHC are explicitly designed to get around it.

Kyle Miller (Aug 27 2021 at 07:03):

@Mario Carneiro Wait, not quite, in a lazy language like Haskell each variable has a secret reference cell that gets overwritten with the evaluated value. So it preserves the evaluate-once semantics.

Mario Carneiro (Aug 27 2021 at 07:04):

If you want alias-like behavior, you should use let f () := ... and then use f () where needed

Mac (Aug 27 2021 at 07:05):

@Mario Carneiro You literally inverted that.

Mac (Aug 27 2021 at 07:05):

let f () := ... is how you create a lazy reference.

Mario Carneiro (Aug 27 2021 at 07:05):

yes

Mac (Aug 27 2021 at 07:05):

A lazy reference is not an alias. It is the exact opposite of an alias.

Mario Carneiro (Aug 27 2021 at 07:06):

although as kyle says, this differs from a haskell style lazy value because it isn't memoizing

Mario Carneiro (Aug 27 2021 at 07:07):

If by alias you mean the entire body of the definition is executed when the value is used, then f () := has the right behavior for this in lean

Mac (Aug 27 2021 at 07:08):

In fact, I would argue the point of the invention of let expressions in most programming languages was to create aliases (as previous kinds of assignment created stack variables).

Mario Carneiro (Aug 27 2021 at 07:08):

I'm very confused by your usage of the word 'alias'

Mario Carneiro (Aug 27 2021 at 07:09):

most languages have a let x := e statement that actually runs the code on the right and stores it on the left

Mario Carneiro (Aug 27 2021 at 07:09):

I haven't seen true call by name outside algol and haskell

Mac (Aug 27 2021 at 07:09):

To me, an alias x := y means that everywhere the syntax x appears it is replaced with y (i.e., substituted, macro expanded, inline, whatever term you like).

Mario Carneiro (Aug 27 2021 at 07:09):

yes, that's not what let means in almost any language

Sebastian Ullrich (Aug 27 2021 at 07:10):

What Mario says

Mac (Aug 27 2021 at 07:10):

That is, x is just an convenient abbreviation for y

Kyle Miller (Aug 27 2021 at 07:10):

@Mac By bringing up the monomorphism restriction, I just mean to say that having lets that evaluate multiple times is surprising. The point is that typeclass-polymorphic functions have an implicit instance argument, which inhibits being able to memoize the value of a typeclass-polymorphic variable. One way or another you have to make a decision how you're going to handle that (either way is fine, so long as it's documented).

Mario's example of let f () := ... is sort of simulating what the monomorphism restriction is supposed to prevent. Even when there's the basic sort of Haskell-style memoization, this ensures that each use of f () is evaluated afresh every time, so it's like the body of the let was substituted in at each point.

Mac (Aug 27 2021 at 07:12):

@Mario Carneiro Maybe I was a bit strict in my definition, I don't mean to say it has to be replaced with exactly y. It can also be replaced with some reduced/optimized form of y, but it shouldn't guarantee new storage.

Mario Carneiro (Aug 27 2021 at 07:12):

It doesn't guarantee new storage, but it should at least guarantee that the right hand side is evaluated at most once for every evaluation of the surrounding context

Mac (Aug 27 2021 at 07:14):

@Mario Carneiro I disagree. I think the amount of times it evaluates its expression should be undefined.

Kyle Miller (Aug 27 2021 at 07:15):

@Mac How do you do any practical functional programming in the pure part of a program that way?

Mario Carneiro (Aug 27 2021 at 07:15):

well, the number of times a pure expression is evaluated is not an "observable property", so it isn't subject to any guarantees really, but as a quality-of-implementation matter it seems like a good principle to stick to to ensure predictable performance

Mac (Aug 27 2021 at 07:15):

That is, I don't think let should be use for computationally sensitive tasks (that should be the domain of monads).

Mario Carneiro (Aug 27 2021 at 07:16):

that sounds perverse for a pure functional language

Mac (Aug 27 2021 at 07:16):

@Kyle Miller I am not saying it can't, I am just saying it shouldn't be required.

Leonardo de Moura (Aug 27 2021 at 07:17):

This is a bug. Could you please create an issue using the smallest example that exposes the problem?

Mac (Aug 27 2021 at 07:18):

Well, there is the second verdict (after Sebastian's). :laughing:

Leonardo de Moura (Aug 27 2021 at 07:18):

@Mario Carneiro is the bug impacting mathport?

Mario Carneiro (Aug 27 2021 at 07:19):

it is, but the discussion here has unearthed a half dozen workarounds (it's a fragile bug), so I don't think it will be a problem

Mac (Aug 27 2021 at 07:20):

@Kyle Miller Anyway, to me, pure code should be pure -- it should not care about side effects (and I consider runtime / number of computations to be a side-effect). Obviously, the goal is to none-the-less run it as fast as possible, but the computational aspect should not be part of the pure code's specification.

Mac (Aug 27 2021 at 07:21):

That is, pure code should not have defined runtime behaviors.

Mario Carneiro (Aug 27 2021 at 07:22):

leanprover/lean4#646

Mac (Aug 27 2021 at 07:22):

This is already to an extent true for Lean as the interpreter, kernel, and compiler all produce very different code with different runtime behaviors.

Leonardo de Moura (Aug 27 2021 at 07:23):

Mario Carneiro said:

leanprover/lean4#646

Thanks. I will take a look when I wake up

Kyle Miller (Aug 27 2021 at 07:35):

Mario Carneiro said:

I haven't seen true call by name outside algol and haskell

It took me a while to remember what it was called, but Kernel (a scheme variant) has $vau, a primitive that lets you implement special forms, though it's not quite call-by-name since you do have to use an eval form to evaluate the passed-in expressions. The interesting thing is that it makes it so that things that are traditionally macros can be used as first-class functions.

Mario Carneiro (Aug 27 2021 at 07:39):

it's true that in languages with a macro functionality like C / Rust / Lisp / Scheme you can implement call by name as a call to a macro (where "let" is a macro definition and use is a call to the macro)

Leonardo de Moura (Aug 27 2021 at 18:16):

Pushed a temporary fix for issue #646.
The perfect fix will only happen when we port the remaining parts of the compiler (still written in C++) to Lean.

Last updated: May 02 2025 at 03:31 UTC