Zulip Chat Archive
Stream: mathlib4
Topic: Ordered merge slower than append-then-sort
Sofia (Jan 02 2022 at 15:36):
Hmm. I have arrays on the order of 1000s of elements, which are already ordered and I want to merge them into an ordered array. As a temporary hack I was appending them together, then sorting. This was too slow. I've tried to speed it up, but my code is quite a bit slower.
I suspect the append
is better optimized and dwarfing my code. Any suggestions?
The obviously correct reference.
as.append bs |>.heapSort (hash . < hash .)
The should-be-faster code. @[specialize]
helped a little. Adding the bounds proofs did not have any effect.
@[specialize] def ordered_merge (as bs : Array α) (f : α -> α -> Bool) : Array α := Id.run do
let mut cs := Array.mkEmpty (as.size + bs.size)
let mut left : USize := 0
let mut right : USize := 0
for _ in [: as.size + bs.size] do
if h : left.toNat < as.size then
if h' : right.toNat < bs.size then
if f (as.uget left h) (bs.uget right h') then
cs := cs.push <| as.uget left h
left := left + 1
else
cs := cs.push <| bs.uget right h'
right := right + 1
else
cs := cs.append as[left.toNat :]
break
else
cs := cs.append bs[right.toNat :]
break
cs
Henrik Böving (Jan 02 2022 at 15:58):
append
on Array
is indeed better optimized because it uses a C implementation under the hood so you will have a hard time keeping up with that in Lean i suppose
Out of curiosity, how are you measuring performance here? Do you have some cool setup for that or do you just time the execution of the binary?
Sofia (Jan 02 2022 at 16:01):
I'm using timeit
in Lean and looking at the process in htop. When it has many threads, it is building those arrays. When it uses only one, it is doing the merges.
new <- timeit "join" do
let mut new := #[]
for t in tasks do
new := new.ordered_merge (<- IO.ofExcept (<- IO.wait t)) (hash . < hash .)
new
Sofia (Jan 02 2022 at 16:01):
Hmm. Looking at that part of the code again, I recall I was just doing append before in that loop, then doing the sort on the outside.
Sofia (Jan 02 2022 at 16:02):
That'd further amplify the append vs. iteration significantly.
Sofia (Jan 02 2022 at 16:05):
I don't think the append
is implemented in C? Following back the implementation, it is all Lean. A little implemented with unsafe, but that is still Lean.
Henrik Böving (Jan 02 2022 at 16:07):
Ahhh it was Array.push
that is implemented in C right...I always mix those two up.
Sofia (Jan 02 2022 at 16:18):
Getting a nice speed up by just appending them and doing in-place swaps instead.
Sofia (Jan 02 2022 at 16:18):
But I think there is a bug so...
Sofia (Jan 02 2022 at 16:20):
Yeah.. it isn't interleaving them o.o
Sofia (Jan 02 2022 at 16:21):
Ah.
Sofia (Jan 02 2022 at 17:41):
Tired.. should really be sleeping and not doing silly things like figuring out how to merge sorted lists.. Anyway, I already have a "gallop" function (stolen shamelessly from https://docs.rs/datafrog) so I'm using that to find runs, then merge over runs. This improves performance at least.
@[specialize] def ordered_merge (as bs : Array α) (f : α -> α -> Bool) : Array α := Id.run do
let mut cs := Array.mkEmpty (as.size + bs.size)
let mut l := 0
let mut r := 0
for _ in [0 : as.size + bs.size] do
if l >= as.size then
cs := cs.append bs[r:]
break
if r >= bs.size then
cs := cs.append as[l:]
break
let l' := as.gallop (f . bs[r.min (bs.size - 1)]) l
let r' := bs.gallop (f . as[l'.min (as.size - 1)]) r
cs := cs.append as[l : l'] |>.append bs[r : r']
l := l'
r := r'
cs
Sofia (Jan 02 2022 at 17:44):
By faster I meant compared to my other code. Yet this is still slower than the "naive" cs : = as.append bs |>.heapSort
which doesn't even exploit the fact these are sorted. But at least gets to append many steps first to amortize it.
Sofia (Jan 02 2022 at 17:48):
Will try making that a k-way ordered merge tomorrow... or another time. Unless anyone else wants to take a stab. <3
Sofia (Jan 02 2022 at 17:49):
This should suffice to complete the above code.
instance [Hashable α] : Hashable (Array α) := ⟨fun ar => ar.foldl (fun a b => hash (a, b)) 0⟩
partial def gallop (arr : @& Array α) (f : α -> Bool) (start := 0) (stop := arr.size) : Nat :=
let stop := stop.min arr.size
let rec leap_left step index :=
if step > 0 then
if index + step < stop ∧ f arr[index + step] then
leap_left (step >>> 1) (index + step)
else
leap_left (step >>> 1) index
else
index + 1
let rec leap_right step index :=
if index + step < stop ∧ f arr[index + step] then
leap_right (step <<< 1) (index + step)
else
leap_left (step >>> 1) index
if arr.size > start ∧ f arr[start] then
leap_right 1 start
else
start
Sofia (Jan 02 2022 at 17:51):
An invariant for that order function in the last merge function. (hash . <= hash .)
is necessary, not just <
.
Gabriel Ebner (Jan 02 2022 at 17:55):
I can't reproduce your performance problems at all. You're probably getting non-deterministic results from using tasks and then attribute it to the merging algorithm. This is the benchmark I've run:
import Mathlib.Data.BinaryHeap
@[specialize] def ordered_merge (as bs : Array α) (f : α -> α -> Bool) : Array α := Id.run do
let mut cs := Array.mkEmpty (as.size + bs.size)
let mut left : USize := 0
let mut right : USize := 0
for _ in [: as.size + bs.size] do
if h : left.toNat < as.size then
if h' : right.toNat < bs.size then
if f (as.uget left h) (bs.uget right h') then
cs := cs.push <| as.uget left h
left := left + 1
else
cs := cs.push <| bs.uget right h'
right := right + 1
else
cs := cs.append as[left.toNat :]
break
else
cs := cs.append bs[right.toNat :]
break
cs
@[specialize]
def heapSortMerge (as bs : Array α) (f : α → α → Bool) : Array α :=
as ++ bs |>.heapSort f
@[specialize]
def quickSortMerge [Inhabited α] (as bs : Array α) (f : α → α → Bool) : Array α :=
as ++ bs |>.qsort f
def as : Array Nat := List.range 100000 |>.toArray
def bs : Array Nat := List.range 100000 |>.toArray
#eval timeit "heapSortMerge" do (); heapSortMerge as bs (· < ·) |>.size -- 2.82s
#eval timeit "ordered_merge" do (); ordered_merge as bs (· < ·) |>.size -- 185ms
#eval timeit "quickSortMerge" do (); quickSortMerge as bs (· < ·) |>.size -- stack overflow
Sofia (Jan 02 2022 at 17:56):
@Gabriel Ebner Nice benchmark! However as mentioned above I noticed the real problem with the comparison. The heap sort method was deferred after appending many of these arrays. Thus the fair comparison would be a k-way merge.
Gabriel Ebner (Jan 02 2022 at 17:57):
Other issues that I could see are:
1) Maybe your hash
function is expensive
2) You're repeatedly merging arrays in a loop. This is quadratic in the size of the tasks
array.
Sofia (Jan 02 2022 at 17:58):
(2) is yes, until the k-way merge. The hash is Lean's normal hash which is cheap.
Sofia (Jan 02 2022 at 18:00):
@Gabriel Ebner For my comparison, I see the deferred heap sort at 24s and the quadratic 2-way merge as 30s. I expect the k-way merge to be a lot better here.
Sofia (Jan 02 2022 at 18:02):
Either way I really need to get some sleep... I'd appreciate any further suggestions. <3
Last updated: Dec 20 2023 at 11:08 UTC