Zulip Chat Archive

Stream: lean4

Topic: avoiding boxing alternatives other than inline

Jeff Packer (Apr 03 2024 at 04:34):

Hi. I am testing Lean4 performance as a general purpose programming language for scientific computing as Lean4 itself intrigues me along with the ability to prove things about code, but I'm also kind of a performance nut.

Can anything to be done, that doesn't create an abomination, to increase the performance of this code? I am a beginner in Lean, so I feel I am likely missing something.

(The code approximates pi via Monte Carlo simulation using Xoshitor256+.)

@[inline]
def calcPi (hits trials : UInt64) : Float :=
  4.0 * hits.toFloat / trials.toFloat

structure Xoshiro256plus where
  s0 : UInt64
  s1 : UInt64
  s2 : UInt64
  s3 : UInt64


@[inline] def rotl (x k : UInt64) : UInt64 := (x <<< k) ||| (x >>> (64 - k))

def Xoshiro256plus.next (state : Xoshiro256plus) : UInt64 × Xoshiro256plus :=
  let t := state.s1 <<< 17
  let s2' := state.s2 ^^^ state.s0
  let s3' := state.s3 ^^^ state.s1
  let s1' := state.s1 ^^^ state.s2
  let s0' := state.s0 ^^^ state.s3
  let state' :=
  {
    s0 := s0',
    s1 := s1',
    s2 := s2' ^^^ t,
    s3 := rotl s3' 45
  }

  (state.s0 + state.s3, state')

def seededXoshiro256plus : IO Xoshiro256plus := do
  let s0 := ByteArray.toUInt64LE! (← IO.getRandomBytes 8)
  let s1 := ByteArray.toUInt64LE! (← IO.getRandomBytes 8)
  let s2 := ByteArray.toUInt64LE! (← IO.getRandomBytes 8)
  let s3 := ByteArray.toUInt64LE! (← IO.getRandomBytes 8)
  pure { s0 := s0, s1 := s1, s2 := s2, s3 := s3  }

@[inline]
def randFloat (gen : Xoshiro256plus) : (Float × Xoshiro256plus) :=
  let (x, gen') := inline gen.next
  let x' := (x >>> 11).toFloat / ((1:UInt64) <<< (53:UInt64)).toFloat
  (x', gen')

partial def doHits (gen : Xoshiro256plus) (numTrials : UInt64) : UInt64 :=
  match numTrials with
  | 0 => 0
  | _ =>
    let (x, gen') := randFloat gen
    let (y, gen') := randFloat gen'
    let b : Bool := (x*x + y*y <= 1.0)
    match b with
    | true => 1 + doHits gen' (numTrials - 1)
    | false => doHits gen' (numTrials - 1)
-- termination_by numTrials

def main : IO Unit := do
  let mut gen ← seededXoshiro256plus
  let trials := 2000000
  let hits := doHits gen trials
  let stdout ← IO.getStdout
  stdout.putStrLn s!"pi is approx {calcPi hits trials}"

Notably, I had to inline the call to Xoshiro256plus.next in randFloat to achieve code 2x slower than C++. That is impressive!

But if Xoshiro256plus.next is not inlined then it is 10x slower. Granted, that is not really a good apples to apples comparison, since the C++ is aggressively inlined... but all that boxing and unboxing also really hurts. The issue may also be exacerbated by function call overhead since the C compiler is unlikely to inline calls to the huge functions generated by the Lean compiler.

By the way, 10x slower is also still very impressive.

Mac Malone (Apr 07 2024 at 20:59):

@Jeff Packer To avoid inlining the whole of Xoshiro256plus.next, you could split the state advancement from the value generation like so:

def Xoshiro256plus.nextState (state : Xoshiro256plus) : Xoshiro256plus :=
  let t := state.s1 <<< 17
  let s2' := state.s2 ^^^ state.s0
  let s3' := state.s3 ^^^ state.s1
  let s1' := state.s1 ^^^ state.s2
  let s0' := state.s0 ^^^ state.s3
  {
    s0 := s0',
    s1 := s1',
    s2 := s2' ^^^ t,
    s3 := rotl s3' 45
  }

@[inline] def Xoshiro256plus.next (state : Xoshiro256plus) : UInt64 × Xoshiro256plus :=
  (state.s0 + state.s3, state.nextState)

This would avoid any UInt64 boxing.

Mac Malone (Apr 07 2024 at 21:02):

Also, as an aside, my style for writing Xoshiro256plus.nextState would be:

def Xoshiro256plus.nextState (state : Xoshiro256plus) : Xoshiro256plus where
  s0 := state.s0 ^^^ state.s3
  s1 := state.s1 ^^^ state.s2
  s2 := (state.s2 ^^^ state.s0) ^^^ (state.s1 <<< 17)
  s3 := rotl (state.s3 ^^^ state.s1) 45

Last updated: May 02 2025 at 03:31 UTC