Zulip Chat Archive
Stream: lean4
Topic: avoiding boxing alternatives other than inline
Jeff Packer (Apr 03 2024 at 04:34):
Hi. I am testing Lean4 performance as a general purpose programming language for scientific computing as Lean4 itself intrigues me along with the ability to prove things about code, but I'm also kind of a performance nut.
Can anything to be done, that doesn't create an abomination, to increase the performance of this code? I am a beginner in Lean, so I feel I am likely missing something.
(The code approximates pi via Monte Carlo simulation using Xoshitor256+.)
@[inline]
def calcPi (hits trials : UInt64) : Float :=
4.0 * hits.toFloat / trials.toFloat
structure Xoshiro256plus where
s0 : UInt64
s1 : UInt64
s2 : UInt64
s3 : UInt64
@[inline] def rotl (x k : UInt64) : UInt64 := (x <<< k) ||| (x >>> (64 - k))
def Xoshiro256plus.next (state : Xoshiro256plus) : UInt64 × Xoshiro256plus :=
let t := state.s1 <<< 17
let s2' := state.s2 ^^^ state.s0
let s3' := state.s3 ^^^ state.s1
let s1' := state.s1 ^^^ state.s2
let s0' := state.s0 ^^^ state.s3
let state' :=
{
s0 := s0',
s1 := s1',
s2 := s2' ^^^ t,
s3 := rotl s3' 45
}
(state.s0 + state.s3, state')
def seededXoshiro256plus : IO Xoshiro256plus := do
let s0 := ByteArray.toUInt64LE! (← IO.getRandomBytes 8)
let s1 := ByteArray.toUInt64LE! (← IO.getRandomBytes 8)
let s2 := ByteArray.toUInt64LE! (← IO.getRandomBytes 8)
let s3 := ByteArray.toUInt64LE! (← IO.getRandomBytes 8)
pure { s0 := s0, s1 := s1, s2 := s2, s3 := s3 }
@[inline]
def randFloat (gen : Xoshiro256plus) : (Float × Xoshiro256plus) :=
let (x, gen') := inline gen.next
let x' := (x >>> 11).toFloat / ((1:UInt64) <<< (53:UInt64)).toFloat
(x', gen')
partial def doHits (gen : Xoshiro256plus) (numTrials : UInt64) : UInt64 :=
match numTrials with
| 0 => 0
| _ =>
let (x, gen') := randFloat gen
let (y, gen') := randFloat gen'
let b : Bool := (x*x + y*y <= 1.0)
match b with
| true => 1 + doHits gen' (numTrials - 1)
| false => doHits gen' (numTrials - 1)
-- termination_by numTrials
def main : IO Unit := do
let mut gen ← seededXoshiro256plus
let trials := 2000000
let hits := doHits gen trials
let stdout ← IO.getStdout
stdout.putStrLn s!"pi is approx {calcPi hits trials}"
Notably, I had to inline the call to Xoshiro256plus.next
in randFloat
to achieve code 2x slower than C++. That is impressive!
But if Xoshiro256plus.next
is not inlined then it is 10x slower. Granted, that is not really a good apples to apples comparison, since the C++ is aggressively inlined... but all that boxing and unboxing also really hurts. The issue may also be exacerbated by function call overhead since the C compiler is unlikely to inline calls to the huge functions generated by the Lean compiler.
By the way, 10x slower is also still very impressive.
Mac Malone (Apr 07 2024 at 20:59):
@Jeff Packer To avoid inlining the whole of Xoshiro256plus.next
, you could split the state advancement from the value generation like so:
def Xoshiro256plus.nextState (state : Xoshiro256plus) : Xoshiro256plus :=
let t := state.s1 <<< 17
let s2' := state.s2 ^^^ state.s0
let s3' := state.s3 ^^^ state.s1
let s1' := state.s1 ^^^ state.s2
let s0' := state.s0 ^^^ state.s3
{
s0 := s0',
s1 := s1',
s2 := s2' ^^^ t,
s3 := rotl s3' 45
}
@[inline] def Xoshiro256plus.next (state : Xoshiro256plus) : UInt64 × Xoshiro256plus :=
(state.s0 + state.s3, state.nextState)
This would avoid any UInt64
boxing.
Mac Malone (Apr 07 2024 at 21:02):
Also, as an aside, my style for writing Xoshiro256plus.nextState
would be:
def Xoshiro256plus.nextState (state : Xoshiro256plus) : Xoshiro256plus where
s0 := state.s0 ^^^ state.s3
s1 := state.s1 ^^^ state.s2
s2 := (state.s2 ^^^ state.s0) ^^^ (state.s1 <<< 17)
s3 := rotl (state.s3 ^^^ state.s1) 45
Last updated: May 02 2025 at 03:31 UTC