Zulip Chat Archive

Stream: mathlib4

Topic: Ordered merge slower than append-then-sort


Sofia (Jan 02 2022 at 15:36):

Hmm. I have arrays on the order of 1000s of elements, which are already ordered, and I want to merge them into a single ordered array. As a temporary hack I was appending them together, then sorting. This was too slow. I've tried to speed it up, but my code is quite a bit slower.

I suspect the append is simply better optimized, and that is what is dwarfing my code. Any suggestions?

The obviously correct reference:

as.append bs |>.heapSort (hash . < hash .)

The should-be-faster code. @[specialize] helped a little. Adding the bounds proofs did not have any effect.

@[specialize] def ordered_merge (as bs : Array α) (f : α -> α -> Bool) : Array α := Id.run do
    let mut cs := Array.mkEmpty (as.size + bs.size)
    let mut left : USize := 0
    let mut right : USize := 0

    for _ in [: as.size + bs.size] do
        if h : left.toNat < as.size then
            if h' : right.toNat < bs.size then
                if f (as.uget left h) (bs.uget right h') then
                    cs := cs.push <| as.uget left h
                    left := left + 1
                else
                    cs := cs.push <| bs.uget right h'
                    right := right + 1
            else
                cs := cs.append as[left.toNat :]
                break
        else
            cs := cs.append bs[right.toNat :]
            break

    cs

Henrik Böving (Jan 02 2022 at 15:58):

append on Array is indeed better optimized because it uses a C implementation under the hood, so you will have a hard time keeping up with that in Lean, I suppose.

Out of curiosity, how are you measuring performance here? Do you have some cool setup for that or do you just time the execution of the binary?

Sofia (Jan 02 2022 at 16:01):

I'm using timeit in Lean and looking at the process in htop. When it has many threads, it is building those arrays. When it uses only one, it is doing the merges.

new <- timeit "join" do
    let mut new := #[]
    for t in tasks do
        new := new.ordered_merge (<- IO.ofExcept (<- IO.wait t)) (hash . < hash .)
    pure new

Sofia (Jan 02 2022 at 16:01):

Hmm. Looking at that part of the code again, I recall I was just doing append before in that loop, then doing the sort on the outside.
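
For comparison, a rough reconstruction (mine, not the original code, and untested) of that earlier pattern, appending inside the loop and sorting once at the end, using the same tasks setup as the snippet above and Array.heapSort from Mathlib.Data.BinaryHeap:

new <- timeit "join" do
    let mut all := #[]
    for t in tasks do
        all := all.append (<- IO.ofExcept (<- IO.wait t))
    -- sort once, only after everything has been appended
    pure <| all.heapSort (fun a b => hash a < hash b)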

Sofia (Jan 02 2022 at 16:02):

That'd further amplify the append vs. iteration gap significantly.

Sofia (Jan 02 2022 at 16:05):

I don't think the append is implemented in C? Following the implementation back, it is all Lean. A little of it is implemented with unsafe, but that is still Lean.

Henrik Böving (Jan 02 2022 at 16:07):

Ahhh, it was Array.push that is implemented in C, right... I always mix those two up.
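
(For reference, and worth double-checking against the current core source: Array.push is the @[extern "lean_array_push"] primitive, while Array.append is ordinary Lean code that folds push over its second argument, so append ultimately bottoms out in the same C push, one element at a time.)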

Sofia (Jan 02 2022 at 16:18):

Getting a nice speed up by just appending them and doing in-place swaps instead.

Sofia (Jan 02 2022 at 16:18):

But I think there is a bug so...

Sofia (Jan 02 2022 at 16:20):

Yeah.. it isn't interleaving them o.o

Sofia (Jan 02 2022 at 16:21):

Ah.

Sofia (Jan 02 2022 at 17:41):

Tired.. should really be sleeping and not doing silly things like figuring out how to merge sorted lists.. Anyway, I already have a "gallop" function (stolen shamelessly from https://docs.rs/datafrog) so I'm using that to find runs, then merge over runs. This improves performance at least.

@[specialize] def ordered_merge (as bs : Array α) (f : α -> α -> Bool) : Array α := Id.run do
    let mut cs := Array.mkEmpty (as.size + bs.size)

    let mut l := 0
    let mut r := 0

    for _ in [0 : as.size + bs.size] do
        if l >= as.size then
            cs := cs.append bs[r:]
            break
        if r >= bs.size then
            cs := cs.append as[l:]
            break

        let l' := as.gallop (f . bs[r.min (bs.size - 1)]) l
        let r' := bs.gallop (f . as[l'.min (as.size - 1)]) r
        cs := cs.append as[l : l'] |>.append bs[r : r']
        l := l'
        r := r'

    cs

Sofia (Jan 02 2022 at 17:44):

By faster I meant compared to my other code. Yet this is still slower than the "naive" cs := as.append bs |>.heapSort, which doesn't even exploit the fact that these are sorted. But at least that version gets to batch many appends before the sort, which amortizes it.

Sofia (Jan 02 2022 at 17:48):

Will try making that a k-way ordered merge tomorrow... or another time. Unless anyone else wants to take a stab. <3
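
For anyone who does: a rough, untested sketch of one way a k-way merge could look, merging adjacent pairs in rounds and reusing the ordered_merge above (k_way_merge and its shape are my invention, not code from this thread). Each round halves the number of arrays, so every element is copied only O(log k) times instead of O(k) times:

@[specialize] partial def k_way_merge (arrays : Array (Array α)) (f : α -> α -> Bool) : Array α :=
    if arrays.size = 0 then #[]
    else if arrays.size = 1 then arrays[0]!
    else Id.run do
        -- merge adjacent pairs; each round halves the number of arrays
        let mut next := Array.mkEmpty ((arrays.size + 1) / 2)
        for i in [0 : arrays.size / 2] do
            next := next.push (ordered_merge arrays[2*i]! arrays[2*i+1]! f)
        -- an odd leftover array is carried over unchanged
        if arrays.size % 2 = 1 then
            next := next.push arrays[arrays.size - 1]!
        return k_way_merge next f

The join loop would then only collect the task results into an Array (Array _) and call k_way_merge once at the end.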

Sofia (Jan 02 2022 at 17:49):

This should suffice to complete the above code.

instance [Hashable α] : Hashable (Array α) := ⟨fun ar => ar.foldl (fun a b => hash (a, b)) 0⟩

partial def gallop (arr : @& Array α) (f : α -> Bool) (start := 0) (stop := arr.size) : Nat :=
    let stop := stop.min arr.size

    let rec leap_left step index :=
        if step > 0 then
            if index + step < stop ∧ f arr[index + step] then
                leap_left (step >>> 1) (index + step)
            else
                leap_left (step >>> 1) index
        else
            index + 1

    let rec leap_right step index :=
        if index + step < stop ∧ f arr[index + step] then
            leap_right (step <<< 1) (index + step)
        else
            leap_left (step >>> 1) index

    if arr.size > start ∧ f arr[start] then
        leap_right 1 start
    else
        start
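
(If I'm reading gallop right, it returns the index just past the run of consecutive elements, starting at start, for which f holds; for example, the following should evaluate to 3, assuming the code above compiles in your setup:)

#eval gallop #[1, 2, 3, 10, 20] (· < 10) -- expected: 3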

Sofia (Jan 02 2022 at 17:51):

An invariant for the order function in the last merge function: (hash . <= hash .) is necessary, not just <. With strict <, two elements of equal hash would stop both gallops and the loop would stop making progress.

Gabriel Ebner (Jan 02 2022 at 17:55):

I can't reproduce your performance problems at all. You're probably getting non-deterministic results from using tasks and then attributing them to the merging algorithm. This is the benchmark I ran:

import Mathlib.Data.BinaryHeap

@[specialize] def ordered_merge (as bs : Array α) (f : α -> α -> Bool) : Array α := Id.run do
    let mut cs := Array.mkEmpty (as.size + bs.size)
    let mut left : USize := 0
    let mut right : USize := 0

    for _ in [: as.size + bs.size] do
        if h : left.toNat < as.size then
            if h' : right.toNat < bs.size then
                if f (as.uget left h) (bs.uget right h') then
                    cs := cs.push <| as.uget left h
                    left := left + 1
                else
                    cs := cs.push <| bs.uget right h'
                    right := right + 1
            else
                cs := cs.append as[left.toNat :]
                break
        else
            cs := cs.append bs[right.toNat :]
            break

    cs

@[specialize]
def heapSortMerge (as bs : Array α) (f : α → α → Bool) : Array α :=
  as ++ bs |>.heapSort f

@[specialize]
def quickSortMerge [Inhabited α] (as bs : Array α) (f : α → α → Bool) : Array α :=
  as ++ bs |>.qsort f

def as : Array Nat := List.range 100000 |>.toArray
def bs : Array Nat := List.range 100000 |>.toArray

#eval timeit "heapSortMerge" do (); heapSortMerge as bs (· < ·) |>.size -- 2.82s
#eval timeit "ordered_merge" do (); ordered_merge as bs (· < ·) |>.size -- 185ms
#eval timeit "quickSortMerge" do (); quickSortMerge as bs (· < ·) |>.size -- stack overflow

Sofia (Jan 02 2022 at 17:56):

@Gabriel Ebner Nice benchmark! However, as mentioned above, I noticed the real problem with the comparison: the heap sort method was deferred until after appending many of these arrays. Thus the fair comparison would be a k-way merge.

Gabriel Ebner (Jan 02 2022 at 17:57):

Other issues that I could see are:
1) Maybe your hash function is expensive
2) You're repeatedly merging arrays in a loop. This is quadratic in the size of the tasks array.
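
For concreteness on (2), a back-of-the-envelope count (assuming k tasks of roughly n elements each): merging them into the accumulator one at a time copies about n·(1 + 2 + ... + k) ≈ n·k²/2 elements in total, while merging them pairwise in rounds (or with a proper k-way merge) copies only about n·k·log₂ k.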

Sofia (Jan 02 2022 at 17:58):

(2) is yes, until the k-way merge. The hash is Lean's normal hash, which is cheap.

Sofia (Jan 02 2022 at 18:00):

@Gabriel Ebner For my comparison, I see the deferred heap sort at 24s and the quadratic 2-way merge at 30s. I expect the k-way merge to be a lot better here.

Sofia (Jan 02 2022 at 18:02):

Either way I really need to get some sleep... I'd appreciate any further suggestions. <3


Last updated: Dec 20 2023 at 11:08 UTC