Zulip Chat Archive

Stream: lean4

Topic: Parallelism without printing things


Geoffrey Irving (Mar 01 2024 at 16:25):

So far, the only examples of parallelized Lean code that I've seen print something in each thread. That's because the only such example I've seen is @Kyle Miller's https://github.com/kmill/lean4-raytracer/blob/master/Main.lean.

Is there a way to parallelize Lean code without printing (at least until after the parallelism completes)? E.g., here is some code that runs nicely in parallel (it pins 2 CPUs), but if I remove the print in loud_count it runs in serial (presumably because of hoisting):

import Mathlib.Data.Nat.Basic
import Std.Data.Nat.Basic

/-!
## Try to do trial division for primes in parallel
-/

variable {α : Type}

def slow_prime_loop (n : ℕ) : ℕ → Bool
| 0 => true
| 1 => true
| k+1 => n % (k+1) != 0 && slow_prime_loop n k

def slow_prime (n : ℕ) : Bool :=
  1 < n && slow_prime_loop n n.sqrt

def count (lo hi : ℕ) : ℕ :=
  (hi - lo).fold (fun n t => t + if slow_prime (lo + n) then 1 else 0) 0

def loud_count (lo hi : ℕ) : IO ℕ := do
  let x := IO.lazyPure (fun _ => count lo hi)
  IO.println s!"count {lo} {hi}"  -- Removing this breaks parallelism
  x

def IO.wait' (x : Task (Except IO.Error α)) : IO α := do
  IO.ofExcept (← IO.wait x)

def main : IO Unit := do
  let n := 1000 * 1000 * 10
  let k := n / 2
  let t0 ← IO.asTask <| loud_count 0 k
  let t1 ← IO.asTask <| loud_count k n
  let r := (← IO.wait' t0) + (← IO.wait' t1)
  IO.println s!"π(n) = {r}"

Geoffrey Irving (Mar 01 2024 at 16:27):

Separately, it seems like Task.spawn is intended to allow parallelism within purely functional code, but after some trying I've failed to construct an example where it parallelizes (I can only get parallelism out of IO.asTask, and there only if I have some interleaved side effect like printing).

Andrés Goens (Mar 01 2024 at 16:34):

I used a pattern like this with Task:

let mut workers : Array (Task MyComputation) := #[]
   let task := Task.spawn λ _ => myComputation
   workers := workers.push task
for worker in workers do
   let myResult := worker.get

(this is inside Id.run, to get do notation in a pure context).

Here's the full version of that in case you want to see an example
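
A minimal self-contained sketch of that pattern (heavy is just a made-up CPU-bound stand-in for myComputation):

-- Arbitrary CPU-bound work, purely for illustration
def heavy (i : Nat) : Nat := Id.run do
  let mut a := 0
  for j in [0:10000000] do
    a := a + i + j
  a

def runAll (k : Nat) : Array Nat := Id.run do
  -- spawn one task per work item
  let mut workers : Array (Task Nat) := #[]
  for i in [0:k] do
    workers := workers.push (Task.spawn fun _ => heavy i)
  -- then block on each task in turn, collecting results
  let mut results : Array Nat := #[]
  for worker in workers do
    results := results.push worker.get
  results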

Tomas Skrivan (Mar 01 2024 at 16:41):

This gets all my 20 CPUs working:

/-- `m` number of threads
    `n` number of operations per thread -/
def parallelReduce (n m : Nat) (f : Fin (n*m) → α) (op : α → α → α) (unit : α) : α := Id.run do
  let mut tasks : Array (Task α) := Array.mkEmpty m

  for i in [0:m] do
    let partialReduce : Unit → α := λ _ => Id.run do
      let mut a := unit
      for j in [i*n : (i+1)*n] do
        a := op a (f ⟨j,sorry⟩)
      a
    tasks := tasks.push (Task.spawn partialReduce)

  let mut a := unit
  for t in tasks do
    a := op a t.get
  a

def n := 1000000
def m := 10

def s := (n*m * (n*m-1))/2 -- expected result

#eval parallelReduce 10000000 20 (fun x => x.1) (·+·) 0

Geoffrey Irving (Mar 01 2024 at 16:42):

Hmm, so is the principle just that you have to make things complicated enough that the compiler doesn't see through them? Is there a way to fix the simple case of parallelizing two tasks more directly?

Geoffrey Irving (Mar 01 2024 at 16:43):

Indeed, if I mask what's going on with a mutable array of tasks, it works without the printing and with Task.spawn. But this seems a bit unsatisfying.

Geoffrey Irving (Mar 01 2024 at 16:45):

To elaborate: the reason I was hitting the simple two-task case in my original, more complicated application is that I was doing recursive splitting, so each function was dealing with one or two subtasks with clear control flow.
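
Roughly this shape, say (untested sketch, reusing count from above; the cutoff is arbitrary):

-- Below a cutoff, spawn a single task; otherwise split the range in half,
-- recurse, and combine the two subtask results with Task.bind / Task.map.
partial def splitCount (lo hi : Nat) : Task Nat :=
  if hi - lo ≤ 100000 then
    Task.spawn fun _ => count lo hi
  else
    let mid := lo + (hi - lo) / 2
    let left := splitCount lo mid
    let right := splitCount mid hi
    left.bind fun a => right.map (a + ·)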

Geoffrey Irving (Mar 01 2024 at 16:49):

In any case, I suppose I'm unblocked, since I can rearchitect my code to hide what's going on and then it will work. :)

Thank you!

Mario Carneiro (Mar 02 2024 at 05:12):

Geoffrey Irving said:

Hmm, so is the principle just that you have to make things complicated enough that the compiler doesn't see through them? Is there a way to fix the simple case of parallelizing two tasks more directly?

@Geoffrey Irving I was a bit skeptical of this claim because the compiler isn't that smart, and indeed I can't replicate the original example: running your test with the println commented out results in a program that uses 2 CPUs like I would expect. The main thing to watch out for is that if both arguments to count are constants (closed terms) such that the count call itself is a closed term, then it will be calculated at initialization time, and this is not parallelized. The original code is at risk of this because you have hard-coded n and k instead of taking them via stdin or laundering them through let n ← pure 100 etc. Having loud_count as an intermediate also blocks this though, as long as it is not inlined (which you can ensure using @[noinline]).
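
A sketch of the original example adjusted along those lines (quiet_count is just an illustrative name; IO.wait' as defined above):

-- @[noinline] keeps the count call from being folded into a closed term
@[noinline] def quiet_count (lo hi : Nat) : Nat :=
  count lo hi

def main : IO Unit := do
  let n ← pure (1000 * 1000 * 10)  -- launder the constant so nothing is a closed term
  let k := n / 2
  let t0 ← IO.asTask <| IO.lazyPure fun _ => quiet_count 0 k
  let t1 ← IO.asTask <| IO.lazyPure fun _ => quiet_count k n
  let r := (← IO.wait' t0) + (← IO.wait' t1)
  IO.println s!"π(n) = {r}"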

Geoffrey Irving (Mar 02 2024 at 12:40):

Yep, that makes sense, and @[noinline] will help!

Geoffrey Irving (Mar 02 2024 at 12:52):

Ah, to clarify: initialization is not parallelized even if Task is used? I think this would explain my initial difficulties as well. Is there a bug tracking that, or a reason why it's unfixable?

Mario Carneiro (Mar 03 2024 at 18:14):

Geoffrey Irving said:

Ah, to clarify: initialization is not parallelized even if Task is used? I think this would explain my initial difficulties as well. Is there a bug tracking that, or a reason why it's unfixable?

IO things are never hoisted to initialization time. Pure tasks using Task.mk could be hoisted, and they will be parallelized the same way as they would be if run at runtime because they are evaluated in basically the same way, just before main. But this example uses IO.asTask and that will have to run at runtime.
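
For example, a rough sketch of the distinction (illustrative values only):

-- A closed term: this task can be created at initialization time, before
-- main runs, though it still executes on its own thread in the same way.
def hoisted : Task Nat :=
  Task.spawn fun _ => (List.range 1000000).foldl (·+·) 0

def main : IO Unit := do
  let m ← pure 1000000  -- runtime value, so this task is created at runtime
  let t := Task.spawn fun _ => (List.range m).foldl (·+·) 0
  IO.println s!"{hoisted.get} {t.get}"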

Geoffrey Irving (Mar 03 2024 at 18:53):

Will the arguments of Task.spawn also get hoisted out of the spawn if they are constant?


Last updated: May 02 2025 at 03:31 UTC