Zulip Chat Archive

Stream: lean4

Topic: Data-level parallelism


Andrés Goens (Aug 03 2022 at 11:35):

Is there anything out there for data-level parallelism? E.g. some kind of parallel map (map/reduce style) or similar? The only way I've found so far is to use Task, and from this discussion it sounds like spawning a new task for every single computation in a map would carry a lot of overhead.

Sebastian Ullrich (Aug 03 2022 at 11:40):

That's a pretty long thread, can you be more specific?

Andrés Goens (Aug 03 2022 at 11:49):

Sebastian Ullrich said:

That's a pretty long thread, can you be more specific?

Haha, indeed it is! I can try to be more specific: from the discussion it seems that if we have n Tasks, then the runtime system will spawn n worker threads. My first thought (without testing it, though) was that this would not be very efficient if we wanted to run a computation (like a map) over a very large Array, as the overhead of spawning a new thread for every value would not be worth it. Is that wrong? E.g., does the runtime system not spawn OS threads for that, but rather something much more lightweight, so it might be worth it?

There are obviously other ways around this, like splitting the computation manually into m Tasks, but I wanted to ask if there was any code out there that had already done something like this

Sebastian Ullrich (Aug 03 2022 at 11:52):

That's not true: non-dedicated tasks (the default) are backed by a thread pool bounded by the number of hardware threads
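
For reference, a minimal sketch of the distinction Sebastian describes. Task.spawn takes an optional Task.Priority, and the Task.Priority.dedicated constant in core is what requests a fresh OS thread; the function names here are illustrative only.

def onPool {α : Type} (f : Unit → α) : Task α :=
  Task.spawn f  -- default priority: queued on the pool bounded by the hardware thread count

def onDedicatedThread {α : Type} (f : Unit → α) : Task α :=
  Task.spawn f Task.Priority.dedicated  -- escapes the pool: gets its own OS thread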

James Gallicchio (Aug 03 2022 at 12:34):

for the collections library, I plan to implement parallelized versions of map/whatever that chunk the data appropriately to minimize the overhead; I think even with a dedicated thread pool you'd want some granularity
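
A minimal sketch of such a chunked parallel map, assuming nothing beyond core Task and Array (all names are illustrative, not the planned collections-library API): split the array into roughly even slices, spawn one task per slice, and append the results.

def chunkedParMap {α β : Type} (f : α → β) (as : Array α) (numChunks : Nat := 8) : Array β :=
  let chunkSize := (as.size + numChunks - 1) / numChunks  -- ceiling division
  let tasks := (List.range numChunks).map fun i =>
    -- one task per contiguous slice; Array.extract clamps out-of-range bounds
    Task.spawn fun _ => (as.extract (i * chunkSize) ((i + 1) * chunkSize)).map f
  tasks.foldl (fun acc t => acc ++ t.get) #[]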

James Gallicchio (Aug 03 2022 at 12:34):

(but, very happy to hear that the overhead will not be the same as spawning a system thread)

Yuri de Wit (Aug 03 2022 at 13:34):

Curious: is there a Channel abstraction to go with Task?

Andrés Goens (Aug 03 2022 at 17:19):

For future reference, I did some experimenting/measuring, and the performance is pretty decent even for a lot of tasks and small-ish computations! Here's a version of the naive parallel map that I described

def parallelMap : (α → β) → Array α → IO (Array β)
  | f, as => do
    let ts := as.map (λ a => Task.spawn (λ _ => f a))
    let rs := ts.map Task.get
    return rs

Using that I tested the following setup on an x86 machine with 24 HW threads and compared the two variants:

def someComputation : Nat → Nat
  | n => List.range n |>.foldl (init := 42) (λ x y => (x+1) * (y+1))

def main : IO Unit := do
  let foo := List.range 10000 |>.toArray |>.map someComputation
  /-
  real  1m21.293s
  user  1m21.244s
  sys   0m0.036s
  -/
  let foo <- List.range 10000 |>.toArray |> parallelMap someComputation
  /-
  real  0m6.240s
  user  2m22.260s
  sys   0m0.372s
  -/
  IO.println s!"res: {foo.size}"

That's a speedup of about 13x (on a 24-thread machine; for this compute-bound problem we can probably think of HW threads as cores), which is surprisingly good. Or put differently, there's a factor of ~1.75 in CPU (user) time, which for this embarrassingly parallel problem should roughly correspond to the task overhead (~75%). With a nicely chunked version like @James Gallicchio 's upcoming one this should be even better :wink:
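
For reference, the arithmetic behind those figures, taken from the timings above:

$$\frac{81.29\,\mathrm{s}\ \text{(sequential, real)}}{6.24\,\mathrm{s}\ \text{(parallel, real)}} \approx 13.0 \qquad\qquad \frac{142.26\,\mathrm{s}\ \text{(parallel, user)}}{81.24\,\mathrm{s}\ \text{(sequential, user)}} \approx 1.75$$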

Tomas Skrivan (Aug 04 2022 at 08:38):

Would it be possible to have parallelMap outside of the IO monad? And other parallel algorithms too? If you are careful, you can make sure they are fully deterministic, i.e. the result does not depend on the order of thread execution.

Sebastian Ullrich (Aug 04 2022 at 08:40):

It does look like the code is using the monad for return only :) ...

Sebastian Ullrich (Aug 04 2022 at 08:43):

For the sake of completeness, the main optimization we're currently missing regarding task overhead is local queues with work stealing. All scheduling inside the current thread pool goes through a big global mutex.

Andrés Goens (Aug 04 2022 at 08:43):

Oh, I've no idea why I thought Task lived in IO; dropping the monad indeed works, thanks for that observation!

Tomas Skrivan (Aug 04 2022 at 08:46):

Ohh I also thought that Task lives in IO :)
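
For reference, the monad-free version that follows from Sebastian's observation: Task.spawn and Task.get are pure functions in core, so the IO wrapper can simply be dropped (the name parallelMap' is illustrative).

def parallelMap' {α β : Type} (f : α → β) (as : Array α) : Array β :=
  (as.map fun a => Task.spawn fun _ => f a).map Task.get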

Gabriel Ebner (Aug 04 2022 at 09:31):

Yuri de Wit said:

Curious: is there a Channel abstraction to go with Task?

Soon.. https://github.com/gebner/lean4/blob/10983d956cca532fcd3bdd6e7cd30ee1b87532de/src/Lean/Mutex.lean

Yuri de Wit (Aug 04 2022 at 16:23):

@Gabriel Ebner re: Mutex.lean, I notice a few patterns (?) I don't understand.

Consider:

private opaque BaseMutexImpl : NonemptyType.{0}

def BaseMutex : Type := BaseMutexImpl.type

1. What is BaseMutexImpl.type? It seems that elaboration is adding this to the environment when using opaque, but why?
2. Why the indirection from BaseMutex to BaseMutexImpl.type? The same is used in Condvar.
3. I see that BaseMutexImpl is generated as a lean_box(0): is this just so it can be threaded through the API calls like "World" is in main?

(sorry for the tangent in this thread)

Gabriel Ebner (Aug 04 2022 at 17:49):

That's just the boilerplate required for FFI types. What opaque BaseMutexImpl : NonemptyType does is declare a constant of type { α : Type // Nonempty α }, i.e. a type about which we only know a single thing, namely that it is nonempty. (We need this so that we can declare opaque BaseMutex.new : BaseIO BaseMutex later: for consistency Lean needs to check that BaseIO BaseMutex is nonempty, which is only true if BaseMutex is nonempty.)
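
Spelled out, the full pattern looks like this; NonemptyType, .type, and .property are from core, while the @[extern] symbol name below is purely hypothetical:

private opaque BaseMutexImpl : NonemptyType.{0}

/-- The user-facing type is definitionally the opaque nonempty type. -/
def BaseMutex : Type := BaseMutexImpl.type

/-- The nonemptiness carried by `NonemptyType` is what lets Lean accept
    opaque constants producing a `BaseMutex`. -/
instance : Nonempty BaseMutex := BaseMutexImpl.property

@[extern "lean_base_mutex_new"] -- hypothetical C symbol
opaque BaseMutex.new : BaseIO BaseMutex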

Gabriel Ebner (Aug 04 2022 at 17:49):

The reason why we don't write opaque BaseMutex : NonemptyType is because we don't want to see/type BaseMutex.type afterwards.

Gabriel Ebner (Aug 04 2022 at 17:52):

I see that BaseMutexImpl is generated as a lean_box(0)

This is true for every type, and it's called erasure. For example if you write def foo := (Nat, 42) then the first component will also be the scalar value 0 (which is generated with the C code lean_box(0)).

Yuri de Wit (Aug 04 2022 at 17:59):

Gabriel Ebner said:

I see that BaseMutexImpl is generated as a lean_box(0)

This is true for every type, and it's called erasure. For example if you write def foo := (Nat, 42) then the first component will also be the scalar value 0 (which is generated with the C code lean_box(0)).

Ok. So there is still a tuple with two slots at runtime, it is just that the first one will be lean_box(0)? I was naively assuming that erasure meant it would completely disappear from the runtime (e.g. a function with 2 parameters, one of which is erased, would become a function with 1 parameter at runtime).

Gabriel Ebner (Aug 04 2022 at 18:09):

No, the memory representation of an inductive does not depend on its type parameters (Prod α β has the same memory representation no matter what α and β are). Otherwise you'd run into huge and ugly issues with polymorphism. (Should def blah : α × β → α be compiled differently depending on α/β, or get a flag? And what about def hmm : (α : Type) × (α × Nat) → Nat := fun ⟨_, _, n⟩ => n?)

Phil Nguyen (May 06 2023 at 08:06):

Andrés Goens said:

For future reference, I did some experimenting/measuring, and the performance is pretty decent even for a lot of tasks and small-ish computations! Here's a version of the naive parallel map that I described

def parallelMap : (α → β) → Array α → IO (Array β)
  | f, as => do
    let ts := as.map (λ a => Task.spawn (λ _ => f a))
    let rs := ts.map Task.get
    return rs

Using that I tested the following setup on an x86 machine with 24 HW threads and compared the two variants:

def someComputation : Nat → Nat
  | n => List.range n |>.foldl (init := 42) (λ x y => (x+1) * (y+1))

def main : IO Unit := do
  let foo := List.range 10000 |>.toArray |>.map someComputation
  /-
  real  1m21.293s
  user  1m21.244s
  sys   0m0.036s
  -/
  let foo <- List.range 10000 |>.toArray |> parallelMap someComputation
  /-
  real  0m6.240s
  user  2m22.260s
  sys   0m0.372s
  -/
  IO.println s!"res: {foo.size}"

That's a speedup of about 13x (on a 24-thread machine; for this compute-bound problem we can probably think of HW threads as cores), which is surprisingly good. Or put differently, there's a factor of ~1.75 in CPU (user) time, which for this embarrassingly parallel problem should roughly correspond to the task overhead (~75%). With a nicely chunked version like James Gallicchio 's upcoming one this should be even better :wink:

I tried a variation of this program and discovered a bug in the current build (https://github.com/leanprover/lean4/issues/2208). But it's exciting to see how well Lean's parallelism scales (on 128 threads) when it happens.

Tomas Skrivan (May 06 2023 at 15:36):

How feasible would it be to write a parallel map that splits the array into chunks and runs map on each chunk in a separate thread?

This should be done in place; if you allow copying, then it is easy to implement.

Mario Carneiro (May 06 2023 at 15:42):

I don't think it is possible without a C wrapper. All in-place array ops require that the array is unshared and owned by the calling thread

Tomas Skrivan (May 06 2023 at 15:48):

Yeah, that is what I thought. I was just wondering if there is some unsafe hackery that would allow me to get around it.

Mario Carneiro (May 06 2023 at 15:52):

I wouldn't recommend it; any unsafe hackery on the Lean side would have to play well with the reference counting system, and I don't think this application can

Mario Carneiro (May 06 2023 at 15:52):

better to do the unsafe hackery in C

James Gallicchio (May 06 2023 at 15:52):

(at least, until we have linear type annotations in Lean :wink:)

Mario Carneiro (May 06 2023 at 15:53):

I'm not sure that would help, the array is actually shared here

Mario Carneiro (May 06 2023 at 15:53):

this is like trying to implement split_at_mut in safe Rust

Mario Carneiro (May 06 2023 at 15:54):

the only thing that is preventing things from crashing and burning is the promise that the threads will not reach outside of their subarray (despite having access to do so)

James Gallicchio (May 06 2023 at 15:59):

Mario Carneiro said:

the only thing that is preventing things from crashing and burning is the promise that the threads will not reach outside of their subarray (despite having access to do so)

but we can enforce the bounds with some dependent type hackery, no? The underlying array type can be an array pointer along with a predicate on the indices that says whether you own each element; the get method requires proof that you own the index; there are split and join operations that are no-ops but which ensure you split/join ownership safely

James Gallicchio (May 06 2023 at 15:59):

(I will read about split_at_mut in rust, though)

Mario Carneiro (May 06 2023 at 16:01):

but we can enforce the bounds with some dependent type hackery, no?

Yes, but the dependent type system doesn't play well with the refcount reasoning (exactly because we don't have a linear type system)

Mario Carneiro (May 06 2023 at 16:02):

it's not obvious to me how you would type the functions even if there were linear annotations though

Mario Carneiro (May 06 2023 at 16:03):

along with a predicate on the indices that says whether you own each element

ownership isn't a predicate, because it has to be linear and proofs of predicates can be copied

Tomas Skrivan (May 06 2023 at 16:03):

I was thinking about having a monad parameterized by the index bounds of each subarray, providing operations for setting and getting elements. You would not have access to any Array/Subarray object.

James Gallicchio (May 06 2023 at 16:04):

and then the primitive would be to combine two such monadic computations if their bounds are contiguous/disjoint?
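
A rough model of that idea, with every name hypothetical: the bounds live in the type, get/set demand in-range proofs, and James's combining primitive only typechecks when the two bounds share an endpoint. A real implementation would mutate a shared buffer; here state is threaded purely, just to illustrate the typing.

structure ChunkM (A : Type) (lo hi : Nat) (β : Type) where
  run : Array A → β × Array A

def ChunkM.get {A : Type} [Inhabited A] {lo hi : Nat}
    (i : Nat) (_ : lo ≤ i) (_ : i < hi) : ChunkM A lo hi A :=
  ⟨fun arr => (arr.get! i, arr)⟩

def ChunkM.set {A : Type} {lo hi : Nat}
    (i : Nat) (_ : lo ≤ i) (_ : i < hi) (v : A) : ChunkM A lo hi Unit :=
  ⟨fun arr => ((), arr.set! i v)⟩

-- contiguous composition: the shared `mid` forces the bounds to line up
def ChunkM.combine {A : Type} {lo mid hi : Nat} {β γ : Type}
    (m₁ : ChunkM A lo mid β) (m₂ : ChunkM A mid hi γ) : ChunkM A lo hi (β × γ) :=
  ⟨fun arr =>
    let (b, arr) := m₁.run arr
    let (c, arr) := m₂.run arr
    ((b, c), arr)⟩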

Mario Carneiro (May 06 2023 at 16:06):

I think you could do it with a wrapper around Array written in C (which acquires unique access to the array on construction), and you can call a function to split it into pieces (handles to the same array object but with known subarray bounds), and each piece is separately ref-counted and can do in-place mutation when the piece is uniquely owned

Tomas Skrivan (May 06 2023 at 16:06):

I was thinking more of a parallelMap that splits the array into chunks and lets you modify each chunk by providing a function in that monad.

Mario Carneiro (May 06 2023 at 16:06):

parallelMap is the high level interface, which could be built on top

Tomas Skrivan (May 06 2023 at 16:07):

Maybe I should have said parallelMapChunk or something like that.

Mario Carneiro (May 06 2023 at 16:13):

here's a plausible interface:

/-- A slice of an array, similar to `Subarray A` but more opaque -/
opaque ArrayChunk (A : Type u) : Type u
/-- Consumes the array and returns an `ArrayChunk` spanning the whole array -/
opaque ArrayChunk.mk : Array A -> ArrayChunk A
/-- The number of elements in this chunk -/
opaque ArrayChunk.size : ArrayChunk A -> Nat
/-- Consumes the `ArrayChunk` and returns pieces split at index `n` -/
opaque ArrayChunk.splitAt (n : Nat) : ArrayChunk A -> ArrayChunk A × ArrayChunk A
/-- Return the element at index `i`-/
opaque ArrayChunk.get (a : ArrayChunk A) (i : Nat) (h : i < a.size) : A
/-- Set the element at index `i` -/
opaque ArrayChunk.set (a : ArrayChunk A) (i : Nat) (h : i < a.size) (v : A) : ArrayChunk A

You could also make ArrayChunk A a structure around List A for proving purposes (and extern implement all the ops similarly to arrays)
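
For illustration, the List-based model mentioned in the last line, under the assumption that each opaque op above would be @[extern]-implemented against it (the Model names are hypothetical):

universe u

structure ArrayChunkModel (A : Type u) where
  data : List A

def ArrayChunkModel.size {A : Type u} (c : ArrayChunkModel A) : Nat := c.data.length

-- splitting is just take/drop at the logical level
def ArrayChunkModel.splitAt {A : Type u} (n : Nat) (c : ArrayChunkModel A) :
    ArrayChunkModel A × ArrayChunkModel A :=
  (⟨c.data.take n⟩, ⟨c.data.drop n⟩)

def ArrayChunkModel.get {A : Type u} (c : ArrayChunkModel A) (i : Nat) (h : i < c.size) : A :=
  c.data.get ⟨i, h⟩

def ArrayChunkModel.set {A : Type u} (c : ArrayChunkModel A) (i : Nat) (v : A) : ArrayChunkModel A :=
  ⟨c.data.set i v⟩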

Mario Carneiro (May 06 2023 at 16:13):

you actually don't need anything here to be monadic

Tomas Skrivan (May 06 2023 at 16:33):

Nice, but it is not clear to me when the original array would be cleaned up. Is the last ArrayChunk standing responsible for cleaning up the original array? How would they keep track of that?

Mario Carneiro (May 06 2023 at 16:34):

it would, and you can also have another function for getting the array back

opaque ArrayChunk.toArray : ArrayChunk A -> Array A

Mario Carneiro (May 06 2023 at 16:34):

all the ArrayChunk pieces are holding on to refs of the Array, so it gets cleaned up by the usual refcounting approach

Mario Carneiro (May 06 2023 at 16:35):

the unsafe hackery is that ArrayChunk.set does a destructive update of the array if the ArrayChunk itself is unshared, not the Array it is holding a reference to

Mario Carneiro (May 06 2023 at 16:36):

the invariant is that the array in an ArrayChunk only has direct references from other ArrayChunks; no one can get access to it directly

Tomas Skrivan (May 06 2023 at 16:36):

And ArrayChunk.mk ensures that the ref count is one or it makes a copy, right?

Mario Carneiro (May 06 2023 at 16:36):

exactly

Mario Carneiro (May 06 2023 at 16:37):

same thing for ArrayChunk.toArray: it ensures the chunk is unshared and also covers the whole array, or else it copies the slice into a new array

Tomas Skrivan (May 06 2023 at 16:40):

This is a bit unclear to me. Take an array, split it into a bunch of chunks, modify those. Then how do I recover the original array? Do I need to destroy all but one chunk and call toArray on that chunk?

Mario Carneiro (May 06 2023 at 16:43):

I was thinking about that. I think you need to hold on to the first chunk, but this will break linearity for the pieces. To fix this the chunk should hold a "disabled" flag which is set when you call splitAt on it; after that point destructive updates are not allowed if you use set on it (since ownership of this chunk has been given away to the pieces)

Mario Carneiro (May 06 2023 at 16:45):

even after the main chunk has been disabled, you can still use toArray destructively (once all the child chunks are cleaned up) because you can look at the refcount of the array itself to see that there are no other ArrayChunks floating around

Tomas Skrivan (May 06 2023 at 16:48):

What if ArrayChunk.mk also returned an OpaqueArray, which has only one method, toArray, used to recover the original array? To me that sounds a bit cleaner than setting up some flags.

Mario Carneiro (May 06 2023 at 16:50):

actually you need the flags in all the chunks because you can hold on to any chunk after calling splitAt

Mario Carneiro (May 06 2023 at 16:51):

so the top level chunk isn't exactly being special cased here

James Gallicchio (May 06 2023 at 16:51):

alternatively, have an API that doesn't let you split an array more than one time :P

Tomas Skrivan (May 06 2023 at 16:51):

Ohh I see the problem

Mario Carneiro (May 06 2023 at 16:51):

except that toArray on a chunk only works destructively if that chunk owns the whole array

Mario Carneiro (May 06 2023 at 16:52):

but you could conceivably splitAt 0 and then use one of the children, seems a bit niche though

Tomas Skrivan (May 06 2023 at 16:54):

Mario Carneiro said:

except that toArray on a chunk only works destructively if that chunk owns the whole array

Does that mean the ref counter of the whole array is one? Or do you mean something else?

Mario Carneiro (May 06 2023 at 16:55):

two things: the ref counter of the array is one, and also the chunk's bounds (it holds a start and stop, or maybe start and size) match those of the array

Tomas Skrivan (May 06 2023 at 16:55):

I don't see why you can't use any chunk

Mario Carneiro (May 06 2023 at 16:56):

The chunk only "knows" about its subarray at the logical level. If we were modeling it in lean it would be List A where the list is just that subslice of the original array

Mario Carneiro (May 06 2023 at 16:57):

so it can't reconstruct the original array unless it has all the elements

Mario Carneiro (May 06 2023 at 16:57):

it can reconstruct its slice, but that's a copy

Mario Carneiro (May 06 2023 at 16:58):

I think this is a requirement, because those other parts of the array can change concurrently

Tomas Skrivan (May 06 2023 at 17:06):

Right, makes sense.

Wait, will toArray have to block the thread until the ref counter goes down to one?

Tomas Skrivan (May 06 2023 at 17:07):

Otherwise it might still be modified concurrently and you can't make a sensible copy out of it.

Mario Carneiro (May 06 2023 at 17:09):

Oh good point

Mario Carneiro (May 06 2023 at 17:10):

actually I think you need to collect the child chunks with some kind of join operator

Mario Carneiro (May 06 2023 at 17:12):

because as far as Lean is concerned, the value of the original chunk can't change if you just try to use it directly

Mario Carneiro (May 06 2023 at 17:15):

A slightly simpler but more limiting interface would be a callback, where the child chunks are passed to the callback and collected once it is done

Tomas Skrivan (May 06 2023 at 17:16):

How would the type signature of that look like?

Mario Carneiro (May 06 2023 at 17:33):

which one? join would just be ArrayChunk A -> ArrayChunk A -> ArrayChunk A but with a somewhat complicated condition for when the joining is legal (not requiring a copy)

Tomas Skrivan (May 06 2023 at 17:34):

And the callback?

Mario Carneiro (May 06 2023 at 17:35):

withSplitAt : ArrayChunk A -> Nat -> (ArrayChunk A -> ArrayChunk A -> B) -> B
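
A hypothetical use, combining withSplitAt with the join from above (mapChunk stands in for a sequential in-place map over one chunk; nothing here is a real API). Returning the joined chunk through B is how the new values of the pieces flow back out:

def parMapHalves {A : Type} (f : A → A) (c : ArrayChunk A) : ArrayChunk A :=
  withSplitAt c (c.size / 2) fun left right =>
    let tl := Task.spawn fun _ => mapChunk f left   -- left half in its own task
    let tr := Task.spawn fun _ => mapChunk f right  -- right half in its own task
    join tl.get tr.get                              -- contiguous pieces, so no copy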

Tomas Skrivan (May 06 2023 at 17:36):

But can't you return those chunks in B and cause problems?

Mario Carneiro (May 06 2023 at 17:36):

basically any "invalid" use will cause lots of copies but no logical breakage

Mario Carneiro (May 06 2023 at 17:39):

with the callback version you don't need the "disabled" flags, since withSplitAt will consume the chunk. If you try to hold on to it then a copy is made

Mario Carneiro (May 06 2023 at 17:44):

actually the callback version doesn't solve the collection problem: you still need the new values of the chunks to be passed back in, or else it is logically impossible for the array's value to depend on the new values of the chunks (you would just have an unused chunk variable after the modification and nowhere to put it)

