Zulip Chat Archive

Stream: lean4

Topic: Insertion Sort

Marcus Rossel (Aug 22 2024 at 09:48):

In lean4#5092 @Kim Morrison adds an implementation of merge sort. As future work, they state that insertion sort could be used for small chunks to improve performance.
It seems that insertion sort is currently only defined in Mathlib. Would it be a reasonable PR to move this implementation to core Lean first?

Kim Morrison (Aug 22 2024 at 10:48):

I think I'd want to have evidence first that it really can speed things up.

Kim Morrison (Aug 22 2024 at 10:50):

If you're interested in looking at sorting, working out a clean way to get rid of Mathlib's mergeSort', and allowing us to proceed with just one version, would be great.

The basic problem I didn't solve is that in Mathlib we want a Prop valued comparator, while in Lean we want a Bool valued comparator.

Kim Morrison (Aug 22 2024 at 10:51):

I am open to changing the design of mergeSort in the Lean repository to help resolve this difficulty.

Kim Morrison (Aug 22 2024 at 10:52):

(Also, I'm not actually sure that Mathlib's insertion stable is stable sort --- you do have to do the comparison in the right way for this to happen. It might, but I haven't looked. Proving/making-it-true in Mathlib's insertion sort would be another helpful project.)

Eric Wieser (Aug 22 2024 at 10:58):

Kim Morrison said:

The basic problem I didn't solve is that in Mathlib we want a Prop valued comparator, while in Lean we want a Bool valued comparator.

Is this a cosmetic choice or a performance one? I realize it's consistent with docs#List.filter taking a bool-valued predicate, but the general pattern is at odds with mathlib

Eric Wieser (Aug 22 2024 at 10:59):

(what it's not consistent with is how if works, not that I would advocate for bif to become the default!)

Eric Wieser (Aug 22 2024 at 11:00):

docs#List.mergeSort is still the mathlib one; I guess the primed one is currently in a bump branch.

Marcus Rossel (Aug 22 2024 at 11:03):

Kim Morrison said:

(Also, I'm not actually sure that Mathlib's insertion stable is stable sort --- you do have to do the comparison in the right way for this to happen. It might, but I haven't looked. Proving/making-it-true in Mathlib's insertion sort would be another helpful project.)

At a glance, it looks like Mathlib's insertion sort is stable. For merge sort you defined three notions of stability:

-- 1
theorem mergeSort_enum {l : List α} : (mergeSort (enumLE le) (l.enum)).map (·.2) = mergeSort le l

-- 2
theorem mergeSort_stable
    (trans : ∀ (a b c : α), le a b → le b c → le a c)
    (total : ∀ (a b : α), !le a b → le b a) :
    ∀ {c : List α} (_ : c.Pairwise le) (_ : c <+ l),
    c <+ mergeSort le l

-- 3
theorem mergeSort_stable_pair
    (trans : ∀ (a b c : α), le a b → le b c → le a c)
    (total : ∀ (a b : α), !le a b → le b a)
    (hab : le a b) (h : [a, b] <+ l) : [a, b] <+ mergeSort le l

Is one of these variants preferred, or is prudent to just show all three? (I guess at least 3 follows easily from 2)

Marcus Rossel (Aug 22 2024 at 12:34):

#16065

Marcus Rossel (Aug 29 2024 at 16:16):

Kim Morrison said:

I think I'd want to have evidence first that it really can speed things up.

On that note, I've been trying out what happens when we just naively unroll the cases for list lengths 2 and 3:

Partially Unrolled Merge Sort

def mergeSortTR₃ (le : α → α → Bool) (l : List α) : List α :=
  run ⟨l, rfl⟩
where
  run : {n : Nat} → { l : List α // l.length = n } → List α
  | 0, ⟨[], _⟩     => []
  | 1, ⟨[a], _⟩    => [a]
  | 2, ⟨[a, b], _⟩ => if le a b then [a, b] else [b, a]
  | 3, xs          => sort₃ xs
  | _ + 4, xs      =>
    let (l, r) := splitRevInTwo xs
    mergeTR le (run' l) (run r)
  run' : {n : Nat} → { l : List α // l.length = n } → List α
  | 0, ⟨[], _⟩     => []
  | 1, ⟨[a], _⟩    => [a]
  | 2, ⟨[b, a], _⟩ => if le a b then [a, b] else [b, a]
  | 3, xs          => sort₃' xs
  | _ + 4, xs      =>
    let (l, r) := splitRevInTwo' xs
    mergeTR le (run' r) (run l)
  sort₃ : { l : List α // l.length = 3 } → List α
    | ⟨l@([a, b, c]), _⟩ =>
      match le a b, le b c with
      | true,  true  => l
      | false, false => [c, b, a]
      | true,  false => if le a c then [a, c, b] else [c, a, b]
      | false, true  => if le a c then [b, a, c] else [b, c, a]
  sort₃' : { l : List α // l.length = 3 } → List α
    | ⟨l@([c, b, a]), _⟩ =>
      match le a b, le b c with
      | true,  true  => l
      | false, false => [c, b, a]
      | true,  false => if le a c then [a, c, b] else [c, a, b]
      | false, true  => if le a c then [b, a, c] else [b, c, a]

Using lean4/tests/bench/mergeSort/bench.py I get the following for the current version of mergeSort:

current-mergeSort.png

... and the following for the version with unrolling:

unrolled-mergeSort.png

So I guess that means there's no use in even trying to use insertion sort as a subroutine?

Kevin Buzzard (Aug 29 2024 at 16:32):

What does i indicate in the graph?

Henrik Böving (Aug 29 2024 at 17:30):

Kevin Buzzard said:

What does i indicate in the graph?

normalized length of the list afaict from the script

Marcus Rossel (Aug 29 2024 at 20:04):

Kim Morrison said:

working out a clean way to get rid of Mathlib's mergeSort', and allowing us to proceed with just one version, would be great.

I've opened a separate topic for this.

Kim Morrison (Aug 29 2024 at 23:52):

@Marcus Rossel, two things:

Be very cautious about believing the results from the benchmarking script. It is only sampling over 1 order of magnitude variation in list sizes, and trying to fit a function involving log. Dubious statistics. :-)
Those results are suspiciously close --- have you run a benchmark that actually gave obviously different results? Might be worth doing so just to be sure you are really benchmarking the change you've made. (You have to rebuild all of lean, and be using the rebuilt toolchain?)

Kim Morrison (Aug 29 2024 at 23:53):

And a third thing:

Great that you're looking into this!

Marcus Rossel (Aug 30 2024 at 08:16):

@Kim Morrison I am indeed running into some sort of issue. My results don't even show a difference in performance between plain mergeSort and mergeSortTR₂. I performed the following steps:

Build the current version of Lean using:

cmake --preset release
make -C build/release -j$(nproc || sysctl -n hw.logicalcpu)

Run the test script for k from 1 to 30 with three runs per k:

cd tests/bench/mergeSort

for k in $(seq 1 30);
do
echo "$k: $(lake exec mergeSort $k), $(lake exec mergeSort $k), $(lake exec mergeSort $k)"
done

Remove the @[csimp] tag from mergeSort_eq_mergeSortTR₂.
Rebuild Lean, again using:

cmake --preset release
make -C build/release -j$(nproc || sysctl -n hw.logicalcpu)

Again run the test script for k from 1 to 30 with three runs per k:

for k in $(seq 1 30);
do
echo "$k: $(lake exec mergeSort $k), $(lake exec mergeSort $k), $(lake exec mergeSort $k)"
done

These are the results, so something must be off.

Kim Morrison (Aug 30 2024 at 11:32):

@Marcus Rossel have you set up your toolchain correctly? I suspect you are running the default toolchain provided by elan, rather than the one you have just built. Try this:

elan toolchain link lean4 build/release/stage1
elan override set lean4

Marcus Rossel (Aug 30 2024 at 11:41):

Kim Morrison said:

Marcus Rossel have you set up your toolchain correctly?

That might be it - I only ran:

elan toolchain link lean4 build/release/stage1
elan toolchain link lean4-stage0 build/release/stage0

Marcus Rossel (Aug 30 2024 at 11:54):

Oooooh. The benchmark file explicitly uses mergeSortTR₂ :see_no_evil::

def main (args : List String) : IO Unit := do
  let k := 5
  let some arg := args[0]? | throw <| IO.userError s!"specify length of test data in multiples of 10^{k}"
  let some i := arg.toNat? | throw <| IO.userError s!"specify length of test data in multiples of 10^{k}"
  let n := i * (10^k)
  let i₁ := List.iota n
  let i₂ := List.range n
  let i₃ ← (List.range n).mapM (fun _ => IO.rand 0 1000)
  let i₄ := (List.range (i * (10^(k-3)))).bind (fun k => (k * 1000 + 1) :: (k * 1000) :: List.range' (k * 1000 + 2) 998)
  let start ← IO.monoMsNow
  let o₁ := (mergeSortTR₂ (· ≤ ·) i₁).length == n
  let o₂ := (mergeSortTR₂ (· ≤ ·) i₂).length == n
  let o₃ := (mergeSortTR₂ (· ≤ ·) i₃).length == n
  let o₄ := (mergeSortTR₂ (· ≤ ·) i₄).length == n
  IO.println (((← IO.monoMsNow) - start)/4)
  IO.Process.exit (if o₁ && o₂ && o₃ && o₄ then 0 else 1)

Marcus Rossel (Aug 30 2024 at 11:55):

So we do get an improvement!

Marcus Rossel (Aug 30 2024 at 15:06):

Using the following (temporary) insertion sort based implementation of mergeSortTR₃:

Implementation

def mergeSortTR₃ (cutoff : Nat) (le : α → α → Bool) (l : List α) : List α :=
  run ⟨l, rfl⟩
where
  run : {n : Nat} → { l : List α // l.length = n } → List α
  | 0, ⟨[], _⟩    => []
  | 1, ⟨[a], _⟩   => [a]
  | n@(_ + 2), xs =>
    if n < cutoff then
      insertionSort xs
    else
      let (l, r) := splitRevInTwo xs
      mergeTR le (run' l) (run r)
  run' : {n : Nat} → { l : List α // l.length = n } → List α
  | 0, ⟨[], _⟩  => []
  | 1, ⟨[a], _⟩ => [a]
  | n@(_ + 2), xs   =>
    if n < cutoff then
      insertionSort' xs
    else
      let (l, r) := splitRevInTwo' xs
      mergeTR le (run' r) (run l)
  insertionSort : List α → List α
  | []     => []
  | [a]    => [a]
  | b :: l => orderedInsert b (insertionSort l)
  orderedInsert (a : α) : List α → List α
  | []          => [a]
  | bl@(b :: l) => if le a b then a :: bl else b :: orderedInsert a l
  insertionSort' : List α → List α
  | []     => []
  | [a]    => [a]
  | b :: l => orderedInsert' b (insertionSort' l)
  orderedInsert' (a : α) : List α → List α
  | []          => [a]
  | bl@(b :: l) => if le a b then b :: orderedInsert a l else a :: bl -- TODO: Is this correct?

... I tried different cutoffs for when a list should be sorted by insertion sort. The following cutoffs yield the following average speedups over mergeSortTR₂ when benchmarked with the above script and 1 <= i <= 30:

Cutoff    16 |  20 |    24 |    28 |    32 |    40
Speedup  10% | 11% | 12.4% | 12.2% | 12.5% | 12.3%

So first of all it seems to be worth it. And second, I'm guessing the sweet spot will end up somewhere around 30.

Jannis Limperg (Aug 30 2024 at 16:08):

The optimal cutoff also depends on the type of objects stored, right? Because if the comparison is expensive, insertion sort is relatively more costly. So perhaps this should be tested with something like lists of short strings.

François G. Dorais (Aug 30 2024 at 22:11):

An one-line tail-recursive insertion sort is described at batteries#569

Kim Morrison (Aug 30 2024 at 23:58):

All I know here is that Java at one point in time used a cut-off of 8. Copying someone seems a good approach here relative to trying to work this out from first principles. :-)

Marcus Rossel (Aug 31 2024 at 08:47):

On that note, does this now justify moving insertionSort into core, or should the insertion sorty part of mergeSort remain an implementation detail?

Kim Morrison (Sep 04 2024 at 05:55):

I think I'm not ready to add the extra complexity of the insertion sort layer to the implementation in Lean.

Last updated: May 02 2025 at 03:31 UTC