Zulip Chat Archive
Stream: lean4
Topic: Mysterious array access performance
Mario Carneiro (Aug 27 2021 at 03:34):
I haven't yet managed to isolate the problem into an MWE, but I can roughly describe it: For some particular array arr I have observed that readArray is slow (~7 s) here:
def generateArray : Array Expr := sorry
def readArray (arr : Array Expr) : IO Unit := sorry
#eval show IO Unit from do
  let arr := generateArray
  timeit "slow" (readArray arr)
but fast (~0.2 ms) here:
def freshen [Inhabited α] (arr : Array α) : IO (Array α) := do
  let mut out := arr
  for i in [0:out.size] do
    out := out.setD i out[i]
  out
#eval show IO Unit from do
  let arr := generateArray
  let arr ← freshen arr
  timeit "fast" (readArray arr)
What could cause this behavior? readArray is supposed to be fast and basically calls arr[i] for various values of i (it does not modify the array). freshen is a value level no-op but may do something interesting to the reference counts, but I don't see how that would affect the performance of Array.get so dramatically.
Mario Carneiro (Aug 27 2021 at 03:59):
Oh, even adding let arr <- if false then freshen arr else arr makes it fast, so this seems like a compiler bug.
Kyle Miller (Aug 27 2021 at 05:28):
I seem to remember that the implementation of Perceus in Lean 4 takes a conservative approach for global variables (i.e., ones that might be shared between threads), where these always take the slow path for the reference count check. The freshen function certainly makes a local copy of generateArray, so it would avoid this, and perhaps let arr <- if false then freshen arr else arr does too (though I'm somewhat dubious). Maybe it'd be worth taking a look at the generated C code?
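For reference, dumping the compiler's IR is one way to check this; a minimal sketch (readTwice is just a made-up example, and this is the same trace option Mario uses further down):
-- Print the compiler IR for the next definition:
set_option trace.compiler.ir.result true in
def readTwice (arr : Array Nat) : Nat :=
  arr[0]! + arr[0]!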
Mario Carneiro (Aug 27 2021 at 05:47):
Finally got a self-contained example:
def freshen (arr : Array Unit) : IO (Array Unit) := arr.setD 0 ()
def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out
def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let out ← freshen $ Array.mkArray 50000 ()
  let mut arr := build out
  -- arr ← freshen arr
  timeit "time" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]
#eval bench
Mario Carneiro (Aug 27 2021 at 05:53):
Note that all of the tests I've been doing so far have been in the interpreter, not the generated C code. For all I know things are completely different in the compiled version
Kyle Miller (Aug 27 2021 at 05:54):
Here's a weird data point to add. If you change freshen to not be in the IO monad, then it runs fast again:
def freshen [Inhabited α] (arr : Array α) : Array α := do
  let mut out := arr
  for i in [0:out.size] do
    out := out.setD i out[i]
  out
Mario Carneiro (Aug 27 2021 at 05:55):
Here out <- freshen ... is just there to ensure that it and arr aren't hoisted into globals.
Kyle Miller (Aug 27 2021 at 06:00):
(It's kind of a nice surprise that let out <- freshen ... or let out := freshen ... both work when freshen is using the identity monad.) I wonder why the IO monad slows things down so much later...
Kyle Miller (Aug 27 2021 at 06:06):
This version of freshen also shows this slow/fast behavior depending on whether or not the IO $ is present:
def freshen [Inhabited α] (arr : Array α) : IO $ Array α := do
  let mut out := Array.empty
  for i in [0:arr.size] do
    out := out.push arr[i]
  out
Mario Carneiro (Aug 27 2021 at 06:08):
I think the only constraints on freshen to reproduce the bug are that it modifies the array and it is in the IO monad.
Kyle Miller (Aug 27 2021 at 06:12):
This is a version where freshen doesn't modify the array, but the original array is global. Removing IO makes it run fast.
def freshen (arr : Array Unit) : IO (Array Unit) := arr
def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out
def arr := freshen $ Array.mkArray 50000 ()
def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let out <- arr
  let mut arr := build out
  -- arr ← freshen arr
  timeit "time" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]
#eval bench
Mario Carneiro (Aug 27 2021 at 06:14):
in that version uncommenting the arr ← freshen arr doesn't make it fast though
Mario Carneiro (Aug 27 2021 at 06:14):
but I guess that's not super relevant, there are a bunch of things that make it fast (i.e. dodge the bug conditions)
Kyle Miller (Aug 27 2021 at 06:15):
I didn't mean to leave that in. I just thought it's weird that removing IO makes it fast, but this is still enough to make it slow.
Mario Carneiro (Aug 27 2021 at 06:16):
it looks like @[inline] on arr in your version also makes it fast
Kyle Miller (Aug 27 2021 at 06:19):
With this one, removing timeit "time1" makes it slow. Nice to see a Heisenbug...
def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out
def arr : IO (Array Unit) := Array.mkArray 50000 ()
def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- arr
  let arr <- timeit "time1" $ build arr
  timeit "time2" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]
#eval bench
Mario Carneiro (Aug 27 2021 at 06:22):
I also can't seem to outline any part of the time2 body without making it fast.
Mac (Aug 27 2021 at 06:30):
Kyle Miller said:
With this one, removing timeit "time1" makes it slow. Nice to see a Heisenbug...
This one I can explain.
Mac (Aug 27 2021 at 06:31):
Removing the timeit makes the line -- i.e., let arr <- build arr -- reduce to let arr := build arr.
Mac (Aug 27 2021 at 06:32):
Whereas otherwise, the line reduces to timeit "time1" $ build arr >>= fun arr => ... (i.e., the rest of the code is in a function and is thus lazy).
Mac (Aug 27 2021 at 06:33):
I suspect that all these performance issues probably have to do with the absence/presence of laziness.
Mario Carneiro (Aug 27 2021 at 06:34):
Even making the most pessimistic assumptions, I don't see how 120 array accesses could take 2-6 seconds
Mac (Aug 27 2021 at 06:35):
Because the let arr := build arr line results in the next line becoming:
timeit "time2" $ do
  for _ in [:30] do
    f $ #[1, 2, 3, 4].map fun ty => (build arr)[ty]
Mac (Aug 27 2021 at 06:35):
i.e., it is rebuilding the array each step instead of using the same array across all steps.
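Schematically, the two shapes being described look like this (a hand-written sketch in current Lean, not actual elaborator output; build is restated here and withTimeit/withoutTimeit/arr₀ are made-up names):
-- `build` from the examples above, restated so this sketch stands alone:
def build (es : Array Unit) : Array Unit := Id.run do
  let mut out := #[]
  for _ in es do
    out := out.push ()
  return out

-- With `timeit`, the do-block elaborates to an explicit bind, so `arr` is a lambda
-- parameter and its definition cannot be substituted into the loop body:
def withTimeit (arr₀ : Array Unit) : IO Unit :=
  timeit "time1" (pure (build arr₀)) >>= fun (arr : Array Unit) =>
    timeit "time2" do
      for _ in [:30] do
        let _ := #[1, 2, 3, 4].map fun ty => arr[ty]!
        pure ()

-- Without it, once `pure`/`>>=` are inlined away, all that is left is a plain `let`;
-- in the original repro this is the `let` that ends up substituted into the loop,
-- re-running `build` on every use:
def withoutTimeit (arr₀ : Array Unit) : IO Unit :=
  let arr := build arr₀
  timeit "time2" do
    for _ in [:30] do
      let _ := #[1, 2, 3, 4].map fun ty => arr[ty]!
      pure ()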
Mario Carneiro (Aug 27 2021 at 06:35):
whoa, it does that? That seems like a really bad optimization under most circumstances
Mario Carneiro (Aug 27 2021 at 06:36):
It would be great if we could spot that in the IR
Mac (Aug 27 2021 at 06:36):
This isn't an optimization. Lean is a strict language, that is what that means in a strict language.
Mac (Aug 27 2021 at 06:37):
A let definition is just an alias if its definition and use are in the same closure.
Mac (Aug 27 2021 at 06:39):
Though one could argue the problem is that let arr <- build arr (monadic) becomes let arr := build arr (non-monadic), which is not great.
Mac (Aug 27 2021 at 06:39):
I believe @Sebastian Ullrich mentioned that the Id monad may become lazy at some point. If so, this problem would vanish.
Kyle Miller (Aug 27 2021 at 06:39):
Mac said:
Removing the timeit makes the line -- i.e., let arr <- build arr -- reduce to let arr := build arr.
Could you explain this again? I'm not sure I follow. Wouldn't this predict that the following would run fast?
def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out
def arr : IO (Array Unit) := Array.mkArray 50000 ()
def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- arr
  let arr <- pure $ build arr -- slow even if you add the IO annotations
  timeit "time2" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]
#eval bench
Mario Carneiro (Aug 27 2021 at 06:40):
Even for non-monadic values, I don't think there is any situation that mandates that a computation should be duplicated
Mario Carneiro (Aug 27 2021 at 06:40):
the compiler is permitted to duplicate computation (in the sense that this will still produce correct results), but I would generally consider that a (performance) bug
Mac (Aug 27 2021 at 06:42):
@Kyle Miller The IO annotation means nothing; IO.pure is not noinline (and IO is not opaque currently).
Mac (Aug 27 2021 at 06:42):
IO is just short for EStateM IO.Error Unit, which can easily be reduced away.
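Roughly, the core definitions being referred to look like this (paraphrased with a My prefix so the sketch stands alone; the real declarations may differ in detail):
-- A paraphrase of core's EStateM/IO setup, not the verbatim source:
inductive MyResult (ε σ α : Type) where
  | ok    : α → σ → MyResult ε σ α
  | error : ε → σ → MyResult ε σ α

def MyEStateM (ε σ α : Type) : Type := σ → MyResult ε σ α

-- `pure` is marked @[inline], so the compiler always expands it away:
@[inline] def MyEStateM.pure {ε σ α : Type} (a : α) : MyEStateM ε σ α :=
  fun s => MyResult.ok a s

-- `IO` is essentially `EStateM IO.Error IO.RealWorld`, with the "real world" state
-- being (morally) `Unit`, so once `pure`/`bind` are inlined there is no opaque
-- monadic structure left to act as a sequencing barrier:
def MyRealWorld : Type := Unit
def MyIO (α : Type) : Type := MyEStateM IO.Error MyRealWorld α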
Mario Carneiro (Aug 27 2021 at 06:43):
Mac said:
I believe Sebastian Ullrich mentioned that the Id monad may become lazy at some point. If so, this problem would vanish.
Well no, not if it's hoisting := let statements in a monadic block.
Mac (Aug 27 2021 at 06:43):
In fact, the pure of EStateM is marked @[inline] so it will be forcibly inlined.
Kyle Miller (Aug 27 2021 at 06:44):
Then shouldn't this noinline version of pure circumvent that?
def build (es : Array Unit) : Array Unit := do
  let mut out := #[]
  for e in es do
    out := out.push ()
  out
def arr : IO (Array Unit) := Array.mkArray 50000 ()
@[noinline]
def mypure (x : α) : IO α := pure x
def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- arr
  let arr <- mypure $ build arr
  timeit "time2" $ do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]
#eval bench
Mac (Aug 27 2021 at 06:46):
It actually does for me
Mario Carneiro (Aug 27 2021 at 06:46):
I've confirmed @Mac's theory: starting from this version, among the outputs of set_option trace.compiler.ir.result true is:
def Array.mapMUnsafe.map._at.bench._spec_2 (x_1 : obj) (x_2 : usize) (x_3 : usize) (x_4 : obj) : obj :=
  inc x_1;
  let x_5 : obj := build x_1;
  let x_6 : u8 := USize.decLt x_3 x_2;
  case x_6 : u8 of
  Bool.false →
    dec x_5;
    dec x_1;
    let x_7 : obj := unsafeCast ◾ ◾ x_4;
    ret x_7
  Bool.true →
    let x_8 : obj := Array.uget ◾ x_4 x_3 ◾;
    let x_9 : obj := 0;
    let x_10 : obj := Array.uset ◾ x_4 x_3 x_9 ◾;
    let x_11 : obj := unsafeCast ◾ ◾ x_8;
    let x_12 : obj := instInhabitedPUnit;
    let x_13 : obj := Array.get! ◾ x_12 x_5 x_11;
    dec x_11;
    dec x_5;
    let x_14 : usize := 1;
    let x_15 : usize := USize.add x_3 x_14;
    let x_16 : obj := unsafeCast ◾ ◾ x_13;
    let x_17 : obj := Array.uset ◾ x_10 x_3 x_16 ◾;
    let x_18 : obj := Array.mapMUnsafe.map._at.bench._spec_2 x_1 x_2 x_15 x_17;
    ret x_18
This is the inner loop for the map operation in #[1, 2, 3, 4].map fun ty => arr[ty], and you can see let x_5 : obj := build x_1, so build will get called 120 times (4 map entries × 30 loop iterations).
Kyle Miller (Aug 27 2021 at 06:48):
(It looks like I had something like 16 versions running simultaneously after editing, and of course the slow ones reported their times last. My last example does run in less than a second.)
Mac (Aug 27 2021 at 06:51):
@Kyle Miller Also note that mypure requires the @[noinline] annotation, it will run slow without it (i.e., it will get inlined).
Kyle Miller (Aug 27 2021 at 06:52):
@Mac Yes, I'm agreeing with you. That's why I put @[noinline] there in the first place, and I misinterpreted how VS Code outputs to the console.
Mac (Aug 27 2021 at 06:54):
@Kyle Miller Sorry, I realized that :laughing:. I was just elaborating that simply removing the @[inline] from the existing pure is insufficient, it will still get inlined.
Mac (Aug 27 2021 at 06:54):
Note that, I believe, the following works (fast):
def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- arr
  let arr <- do build arr
  timeit "time2" do
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]
Mario Carneiro (Aug 27 2021 at 06:55):
From what I can tell, it goes wrong somewhere around when mapMUnsafe is specialized, where the function argument fun ty => arr[ty] has its dependencies inlined to fun ty => (build n)[ty].
Mac (Aug 27 2021 at 06:58):
@Mario Carneiro imo, this is the expected behavior.
Mac (Aug 27 2021 at 06:59):
Computations should be wrapped in monads, that is the point of them.
Mario Carneiro (Aug 27 2021 at 06:59):
executing let statements multiple times when they were not written that way is not expected behavior. Executing them not at all is okay, as well as lowering them over case statements, but lowering them into a loop is not okay.
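A hand-written illustration of that distinction (costly and the other names are made up; this shows the intent, not a repro of the bug):
-- A deliberately expensive pure function, just for illustration:
def costly (n : Nat) : Nat := (List.range n).foldl (· + ·) 0

-- As written: `x` is expected to be computed at most once per call.
def asWritten (b : Bool) (n : Nat) : Nat :=
  let x := costly n
  if b then x else 0

-- Sunk into the branch that uses it: still at most once per call, so this is fine.
def sunkIntoBranch (b : Bool) (n : Nat) : Nat :=
  if b then costly n else 0

-- Sunk into a loop body: now evaluated once per iteration, which is the behavior
-- being objected to here.
def sunkIntoLoop (n : Nat) : Nat := Id.run do
  let mut acc := 0
  for _ in [:30] do
    acc := acc + costly n
  return acc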
Kyle Miller (Aug 27 2021 at 07:00):
Going back to @Mario Carneiro's earlier example, inserting a do makes it fast, as @Mac would expect. But so does replacing the line with let arr := build 50000.
def build (n : Nat) : Array Unit := do
  let mut out := #[]
  for _ in [0:n] do
    out := out.push ()
  out
@[noinline] def size : IO Nat := pure 50000
def bench (f : ∀ {α : Type}, α → IO Unit := fun _ => ()) : IO Unit := do
  let arr <- do build (← size)
  timeit "time" $
    for _ in [:30] do
      f $ #[1, 2, 3, 4].map fun ty => arr[ty]
#eval bench
Mac (Aug 27 2021 at 07:00):
@Mario Carneiro To me, a let x := should be inlined at every x -- that is the expected behavior to me. I think of let as an alias.
Mario Carneiro (Aug 27 2021 at 07:00):
There are lots of optimizations that can lead to a monadic computation being turned into a pure computation after inlining, so using <- does not help.
Kyle Miller (Aug 27 2021 at 07:00):
Re what Mario is saying, this is the basis for Haskell's monomorphism restriction, to prevent runtime blowup like this.
Kyle Miller (Aug 27 2021 at 07:01):
So at least Philip Wadler finds this sort of thing to be unexpected (I believe he was the one who pushed for it)
Mario Carneiro (Aug 27 2021 at 07:01):
Mac said:
Mario Carneiro To me, a let x := should be inlined at every x, that is the expected behavior to me. I think of let as an alias.
That is how things work in a lazy language, not a strict language. In ML let statements are executed in the order you wrote them.
Mac (Aug 27 2021 at 07:02):
@Kyle Miller And the monomorphism restriction is probably one of the most frustrating (and probably hated) parts of Haskell. :laughing: Many of the language extensions in GHC are explicitly designed to get around it.
Kyle Miller (Aug 27 2021 at 07:03):
@Mario Carneiro Wait, not quite, in a lazy language like Haskell each variable has a secret reference cell that gets overwritten with the evaluated value. So it preserves the evaluate-once semantics.
Mario Carneiro (Aug 27 2021 at 07:04):
If you want alias-like behavior, you should use let f () := ... and then use f () where needed.
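A small sketch of the contrast (slowSum and the other names are made up, and of course this whole thread is about the first form's expectation being violated):
-- Stand-in for any expensive pure computation:
def slowSum (n : Nat) : Nat := (List.range n).foldl (· + ·) 0

-- Ordinary let: the right-hand side is expected to run at most once and be reused.
def sharedOnce : Nat :=
  let x := slowSum 100000
  x + x

-- Thunk (`let f () := ...` spelled out as a Unit-taking function): the body runs
-- again at every call, i.e. genuine alias/by-name behavior.
def perUse : Nat :=
  let f := fun (_ : Unit) => slowSum 100000
  f () + f ()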
Mac (Aug 27 2021 at 07:05):
@Mario Carneiro You literally inverted that.
Mac (Aug 27 2021 at 07:05):
let f () := ... is how you create a lazy reference.
Mario Carneiro (Aug 27 2021 at 07:05):
yes
Mac (Aug 27 2021 at 07:05):
A lazy reference is not an alias. It is the exact opposite of an alias.
Mario Carneiro (Aug 27 2021 at 07:06):
although as Kyle says, this differs from a Haskell-style lazy value because it isn't memoizing
Mario Carneiro (Aug 27 2021 at 07:07):
If by alias you mean the entire body of the definition is executed when the value is used, then f () := has the right behavior for this in Lean.
Mac (Aug 27 2021 at 07:08):
In fact, I would argue the point of the invention of let expressions in most programming languages was to create aliases (as previous kinds of assignment created stack variables).
Mario Carneiro (Aug 27 2021 at 07:08):
I'm very confused by your usage of the word 'alias'
Mario Carneiro (Aug 27 2021 at 07:09):
most languages have a let x := e statement that actually runs the code on the right and stores it on the left.
Mario Carneiro (Aug 27 2021 at 07:09):
I haven't seen true call by name outside Algol and Haskell
Mac (Aug 27 2021 at 07:09):
To me, an alias x := y means that everywhere the syntax x appears it is replaced with y (i.e., substituted, macro expanded, inlined, whatever term you like).
Mario Carneiro (Aug 27 2021 at 07:09):
yes, that's not what let means in almost any language
Sebastian Ullrich (Aug 27 2021 at 07:10):
What Mario says
Mac (Aug 27 2021 at 07:10):
That is, x is just a convenient abbreviation for y.
Kyle Miller (Aug 27 2021 at 07:10):
@Mac By bringing up the monomorphism restriction, I just mean to say that having lets that evaluate multiple times is surprising. The point is that typeclass-polymorphic functions have an implicit instance argument, which inhibits being able to memoize the value of a typeclass-polymorphic variable. One way or another you have to make a decision how you're going to handle that (either way is fine, so long as it's documented).
Mario's example of let f () := ... is sort of simulating what the monomorphism restriction is supposed to prevent. Even when there's the basic sort of Haskell-style memoization, this ensures that each use of f () is evaluated afresh every time, so it's like the body of the let was substituted in at each point.
Mac (Aug 27 2021 at 07:12):
@Mario Carneiro Maybe I was a bit strict in my definition, I don't mean to say it has to be replaced with exactly y. It can also be replaced with some reduced/optimized form of y, but it shouldn't guarantee new storage.
Mario Carneiro (Aug 27 2021 at 07:12):
It doesn't guarantee new storage, but it should at least guarantee that the right hand side is evaluated at most once for every evaluation of the surrounding context
Mac (Aug 27 2021 at 07:14):
@Mario Carneiro I disagree. I think the amount of times it evaluates its expression should be undefined.
Kyle Miller (Aug 27 2021 at 07:15):
@Mac How do you do any practical functional programming in the pure part of a program that way?
Mario Carneiro (Aug 27 2021 at 07:15):
well, the number of times a pure expression is evaluated is not an "observable property", so it isn't subject to any guarantees really, but as a quality-of-implementation matter it seems like a good principle to stick to in order to ensure predictable performance
Mac (Aug 27 2021 at 07:15):
That is, I don't think let should be used for computationally sensitive tasks (that should be the domain of monads).
Mario Carneiro (Aug 27 2021 at 07:16):
that sounds perverse for a pure functional language
Mac (Aug 27 2021 at 07:16):
@Kyle Miller I am not saying it can't, I am just saying it shouldn't be required.
Leonardo de Moura (Aug 27 2021 at 07:17):
This is a bug. Could you please create an issue using the smallest example that exposes the problem?
Mac (Aug 27 2021 at 07:18):
Well, there is the second verdict (after Sebastian's). :laughing:
Leonardo de Moura (Aug 27 2021 at 07:18):
@Mario Carneiro is the bug impacting mathport?
Mario Carneiro (Aug 27 2021 at 07:19):
it is, but the discussion here has unearthed a half dozen workarounds (it's a fragile bug), so I don't think it will be a problem
Mac (Aug 27 2021 at 07:20):
@Kyle Miller Anyway, to me, pure code should be pure -- it should not care about side effects (and I consider runtime / number of computations to be a side effect). Obviously, the goal is to nonetheless run it as fast as possible, but the computational aspect should not be part of the pure code's specification.
Mac (Aug 27 2021 at 07:21):
That is, pure code should not have defined runtime behaviors.
Mario Carneiro (Aug 27 2021 at 07:22):
Mac (Aug 27 2021 at 07:22):
This is already to an extent true for Lean as the interpreter, kernel, and compiler all produce very different code with different runtime behaviors.
Leonardo de Moura (Aug 27 2021 at 07:23):
Mario Carneiro said:
Thanks. I will take a look when I wake up
Kyle Miller (Aug 27 2021 at 07:35):
Mario Carneiro said:
I haven't seen true call by name outside Algol and Haskell
It took me a while to remember what it was called, but Kernel (a Scheme variant) has $vau, a primitive that lets you implement special forms, though it's not quite call-by-name since you do have to use an eval form to evaluate the passed-in expressions. The interesting thing is that it makes it so that things that are traditionally macros can be used as first-class functions.
Mario Carneiro (Aug 27 2021 at 07:39):
it's true that in languages with a macro functionality like C / Rust / Lisp / Scheme you can implement call by name as a call to a macro (where "let" is a macro definition and use is a call to the macro)
Leonardo de Moura (Aug 27 2021 at 18:16):
Pushed a temporary fix for issue #646.
The perfect fix will only happen when we port the remaining parts of the compiler (still written in C++) to Lean.
Last updated: Dec 20 2023 at 11:08 UTC