Zulip Chat Archive

Stream: lean4

Topic: Task code not running in parallel

Geoffrey Irving (Feb 08 2024 at 15:58):

If I'm doing embarrassingly parallel computation of a bunch of function calls f 0, f1, ..., f (n-1), is there a recommended best strategy for creating Task objects for maximum parallelism? Currently I have some code using binary splitting (chopping the interval in half until it is below a chunk size), but it doesn't seem to be parallelizing in practice.

Here is my current binary splitting routine:

https://github.com/girving/ray/blob/834ba75ba274a95834147b155d1d87f12896b3af/Ray/Render/Image.lean#L141

Geoffrey Irving (Feb 08 2024 at 15:59):

In particular, is this "make two Tasks", then combine them via bind the right approach?

/-- Build a color array out of a function, parallelizing down to chunks of size `≤ chunk` -/
def parallel_colors' (f : ℕ → Color UInt8) (n o chunk : ℕ) : Task ByteArray :=
  if n ≤ max 1 chunk then
    Task.spawn (fun _ ↦ push_colors f n o (.mkEmpty n))
  else
    let n0 := n / 2
    let t0 := parallel_colors' f n0 o chunk
    let t1 := parallel_colors' f (n - n0) (o + n0) chunk
    t0.bind fun b0 ↦ t1.bind fun b1 ↦ Task.pure (b0 ++ b1)
termination_by n decreasing_by
  · exact n0_lt_n (by assumption)
  · exact n1_lt_n (by assumption)

Geoffrey Irving (Feb 08 2024 at 16:04):

Is there a lake flag or environment variable to set the number of threads? I don't see anything mentioned in the Task documentation, unfortunately:

https://lean-lang.org/lean4/doc/task.html

Henrik Böving (Feb 08 2024 at 16:23):

Re environment variable: LEAN_NUM_THREADS

Geoffrey Irving (Feb 08 2024 at 16:24):

^ No effect, unfortunately: it still pins exactly one core.

Geoffrey Irving (Feb 08 2024 at 22:20):

Here's a self-contained example, which similarly seems to run on one thread:

import Mathlib.Algebra.Order.Floor.Div

/-!
## Try to do trial division for primes in parallel
-/

def slow_prime_loop (n : ℕ) : ℕ → Bool
| 0 => True
| 1 => True
| k+1 => n % (k+1) != 0 && slow_prime_loop n k

def slow_prime (n : ℕ) : Bool :=
  1 < n && slow_prime_loop n n.sqrt

def count_range (lo hi : ℕ) : ℕ :=
  (hi - lo).fold (fun n t ↦ t + if slow_prime (lo + n) then 1 else 0) 0

def count (n : ℕ) (chunk : ℕ) :=
  let tasks := (Array.range (n ⌊/⌋ chunk)).map (fun c ↦ Task.spawn (fun _ ↦
    count_range (c * chunk) ((c+1) * chunk)))
  tasks.foldl (fun n t ↦ n + t.get) 0

def main : IO Unit := do
  let n := 1000 * 1000 * 1000
  IO.println ("n = " ++ repr n)
  IO.println ("π(n) = " ++ repr (count n 1000))

Geoffrey Irving (Feb 09 2024 at 09:30):

@Kyle Miller Can I ask about how you do parallelism in https://github.com/kmill/lean4-raytracer/blob/d45a2d3cff9e66eb0296435beda3f147fa5dc5b6/Main.lean#L264?

Ah, I think I just figured it out: I was confused since it seems like all threads are just rendering the same image, but in fact that's exactly what they are doing: they're just doing different point samples and then averaging.

Geoffrey Irving (Feb 09 2024 at 10:10):

@Sebastian Ullrich While I have you, any chance I could get a hint what might be failing to parallelize here?

Sebastian Ullrich (Feb 09 2024 at 10:12):

I'm not sure, sorry. You might want to look at the IR!

Geoffrey Irving (Feb 09 2024 at 10:12):

Yes, that's a good idea now that I can do it. Thanks!

Geoffrey Irving (Feb 09 2024 at 10:35):

Alas, the IR seems perfectly reasonable, and I don't see any problems: it does a specialized Array.map to make the tasks, which look appropriately lazy, then a specialized Array.foldl. So the problem is elsewhere. I'll keep tinkering (though probably not today).

Names changed below from the original IR for readability:

def map_lambda (x_1 : @& obj) (x_2 : @& obj) (x_3 : @& obj) : obj :=
  let x_4 : obj := Nat.mul x_1 x_2;
  let x_5 : obj := 1;
  let x_6 : obj := Nat.add x_1 x_5;
  let x_7 : obj := Nat.mul x_6 x_2;
  dec x_6;
  let x_8 : obj := count_range x_4 x_7;
  dec x_4;
  ret x_8

def map_lambda_boxed (x_1 : obj) (x_2 : obj) (x_3 : obj) : obj :=
  let x_4 : obj := map_lambda x_1 x_2 x_3;
  dec x_3;
  dec x_2;
  dec x_1;
  ret x_4

def map_count (x_1 : obj) (x_2 : usize) (x_3 : usize) (x_4 : obj) : obj :=
  let x_5 : u8 := USize.decLt x_3 x_2;
  case x_5 : u8 of
  Bool.false →
    dec x_1;
    ret x_4
  Bool.true →
    let x_6 : obj := Array.uget ◾ x_4 x_3 ◾;
    let x_7 : obj := 0;
    let x_8 : obj := Array.uset ◾ x_4 x_3 x_7 ◾;
    inc x_1;
    let x_9 : obj := pap map_lambda_boxed x_6 x_1;
    let x_10 : obj := Task.Priority.default;
    let x_11 : obj := Task.spawn ◾ x_9 x_10;
    let x_12 : usize := 1;
    let x_13 : usize := USize.add x_3 x_12;
    let x_14 : obj := Array.uset ◾ x_8 x_3 x_11 ◾;
    let x_15 : obj := map_count x_1 x_2 x_13 x_14;
    ret x_15

def count_foldl (x_1 : @& obj) (x_2 : usize) (x_3 : usize) (x_4 : obj) : obj :=
  let x_5 : u8 := USize.decEq x_2 x_3;
  case x_5 : u8 of
  Bool.false →
    let x_6 : obj := Array.uget ◾ x_1 x_2 ◾;
    let x_7 : obj := Task.get ◾ x_6;
    let x_8 : obj := Nat.add x_4 x_7;
    dec x_7;
    dec x_4;
    let x_9 : usize := 1;
    let x_10 : usize := USize.add x_2 x_9;
    let x_11 : obj := count_foldl x_1 x_10 x_3 x_8;
    ret x_11
  Bool.true →
    ret x_4

def count (x_1 : obj) (x_2 : obj) : obj :=
  let x_3 : obj := Nat.div x_1 x_2;
  dec x_1;
  let x_4 : obj := Array.range x_3;
  let x_5 : obj := Array.size ◾ x_4;
  let x_6 : usize := USize.ofNat x_5;
  dec x_5;
  let x_7 : usize := 0;
  let x_8 : obj := map_count x_2 x_6 x_7 x_4;
  let x_9 : obj := Array.size ◾ x_8;
  let x_10 : obj := 0;
  let x_11 : u8 := Nat.decLt x_10 x_9;
  case x_11 : u8 of
  Bool.false →
    dec x_9;
    dec x_8;
    let x_12 : obj := 0;
    ret x_12
  Bool.true →
    let x_13 : u8 := Nat.decLe x_9 x_9;
    case x_13 : u8 of
    Bool.false →
      dec x_9;
      dec x_8;
      let x_14 : obj := 0;
      ret x_14
    Bool.true →
      let x_15 : usize := USize.ofNat x_9;
      dec x_9;
      let x_16 : obj := count_foldl x_8 x_7 x_15 x_10;
      dec x_8;
      ret x_16

Kyle Miller (Feb 09 2024 at 16:56):

@Geoffrey Irving I have another parallelism example that splits work across multiple threads too. Here's where tasks are created and then waited upon: https://github.com/kmill/arrow_poly/blob/master/Main.lean#L117-L123

Geoffrey Irving (Feb 09 2024 at 20:05):

Interesting, the systematic difference is that your code has "less functional" tasks, in that the tasks themselves are inside the IO monad too (and do IO monad loops). I'll try that, though it's unfortunate that the non-IO version doesn't work.

Kyle Miller (Feb 09 2024 at 20:07):

Oh right, a gotcha in your example is that count_range (c * chunk) ((c+1) * chunk) is free to be hoisted out of Task.spawn

Geoffrey Irving (Feb 09 2024 at 20:08):

Though as noted above the IR doesn't show that hoisting.

Last updated: May 02 2025 at 03:31 UTC