Zulip Chat Archive

Stream: lean4

Topic: UnionFind proof assistance

Sofia (Aug 11 2022 at 03:00):

Hello. I'd like some assistance writing proofs for a Union-Find implementation. I've figured out how to write all the proofs except for growing the collection, which is Fin-specific and I couldn't find the right rewrites to apply. That aside, I feel there are better ways to write the proofs. Any suggestions?

Oh and I feel like it'd be nice to have a proof there cannot exist cycles, but for now I've just gone with a for-loop for worst-case fuel and exit early. Not even sure where I'd start to express such an acyclic proof, can do without, just would be a very nice property to have proven.

structure UnionFind (size : Nat) :=
  (data : Array (Fin size))
  (correct_size : size = data.size)

namespace UnionFind
  def parent (u : @& UnionFind size) (i : Fin size) : Fin size :=
    u.data[by rw [u.correct_size] at i; exact i]

  def grow (u : UnionFind size) : UnionFind (size + 1) :=
    {
      data := u.data.map (fun x => Fin.mk x.val sorry) |>.push (Fin.mk size sorry),
      correct_size := sorry
    }

  private def set_parent (u : UnionFind size) (i : Fin size) (p : Fin size) : UnionFind size :=
    let i := by rw [u.correct_size] at i; exact i
    {
      data := u.data.set i p,
      correct_size := by have q := u.data.size_set i p; have s := u.correct_size; rw [q.symm] at s; exact s,
    }

  def find (u : UnionFind size) (i : Fin size) : Prod (UnionFind size) (Fin size) := Id.run do
    let mut u := u
    let mut i := i
    for _ in [0 : size] do
      let p := u.parent i
      let g := u.parent p
      if p = g then
        break
      u := u.set_parent i g
      i := g
    (u, i)

  def union (u : UnionFind size) (a b : Fin size) : UnionFind size :=
    let (u, a) := u.find a
    let (u, b) := u.find b
    u.set_parent a b
end UnionFind

Sofia (Aug 11 2022 at 03:03):

@Mario Carneiro I noted your UnionFind implementation in Mathlib. 300 lines, returns proofs, not quite what I expected. Was there any specific goal with this approach not met by my approach? https://github.com/leanprover-community/mathlib4/blob/master/Mathlib/Data/UnionFind.lean

Sofia (Aug 11 2022 at 03:05):

I will add ranks to my implementation.

Sofia (Aug 11 2022 at 03:06):

I have written very few proofs so far. Was happy to get as far as I did here. :)

Sofia (Aug 11 2022 at 04:29):

Updated towards adding the rank, but stuck with the proofs still.

structure UFNode (size : Nat) :=
  (parent rank : Fin size)

structure UnionFind (size : Nat) :=
  (data : Array (UFNode size))
  (correct_size : size = data.size)
-- Iff a node is not its own parent, the rank must increase.
--(rank_proof : forall s : Fin data.size, s != data[s].parent <-> data[s].rank < data[data[s].parent].rank))

namespace UnionFind
  def empty : UnionFind 0 where
    data := #[]
    correct_size := by decide

  def grow (u : UnionFind size) : UnionFind (size + 1) where
    data := sorry -- u.data.map (fun x => Fin.mk x.val sorry) |>.push (Fin.mk size sorry),
    correct_size := sorry --by decide

  def lookup (u : @& UnionFind size) (i : Fin size) : UFNode size :=
    u.data[by rw [u.correct_size] at i; exact i]

  def set_parent (u : UnionFind size) (i p : Fin size) : UnionFind size :=
    let i := by rw [u.correct_size] at i; exact i
    let n := { parent := p, rank := u.lookup p |>.rank /- + 1 -/ }
    {
      data := u.data.set i n
      correct_size := by have q := u.data.size_set i n; have s := u.correct_size; rw [q.symm] at s; exact s,
    }

  -- find with path halving.
  def find (u : UnionFind size) (i : Fin size) : Prod (UnionFind size) (UFNode size) :=
    let node := u.lookup i
    if node.parent = i then
      (u, node)
    else
      let g := u.lookup node.parent |>.parent
      find (u.set_parent i g) g
  decreasing_by sorry -- rank is monotonic

  def union (u : UnionFind size) (a b : Fin size) : UnionFind size :=
    let (u, a) := u.find a
    let (u, b) := u.find b
    if _h : a.rank < b.rank then
      u.set_parent a.parent b.parent -- h
    else
      u.set_parent b.parent a.parent -- h.symm
end UnionFind

Sofia (Aug 11 2022 at 04:40):

Whoops, there are 3 orders, not 2. Should be an extra branch in that union. Also.. (not h).symm ?

Mario Carneiro (Aug 11 2022 at 05:25):

Sofia said:

Mario Carneiro I noted your UnionFind implementation in Mathlib. 300 lines, returns proofs, not quite what I expected. Was there any specific goal with this approach not met by my approach? https://github.com/leanprover-community/mathlib4/blob/master/Mathlib/Data/UnionFind.lean

The UnionFind type itself does not depend on size
It stores data associated to the nodes, although that could arguably be separated from the data structure itself
It does path compression, which is required to get the famed inverse-ackermann runtime bound
It supports grow without remapping the array (which is O(n))

I can concede that putting the size in the type leads to an easier interface though, since it's easier to track the size that way.

Sofia (Aug 11 2022 at 05:36):

'Full' path compression has more memory accesses due to the second iteration. 'Splitting' updates the same number of nodes and 'halving' updates half the nodes. Full path compression is known to be slower, and splitting and halving perform the same. I've chosen halving to reduce write throughput regardless. https://arxiv.org/abs/1911.06347

Sofia (Aug 11 2022 at 05:40):

I might add the data in, but have a few other details to figure out before deciding quite how everything is laid out. For context, I'm implementing UnionFind towards an E-graph implementation which compiles to a Datalog incremental relational database for worst-case optimal queries, with multi-patterns and adaptive query scheduling. https://arxiv.org/abs/2108.02290

Sofia (Aug 11 2022 at 05:46):

Also curious about implementing an SMT solver using the E-graph with these incremental worst-case optimal queries.

Sofia (Aug 11 2022 at 06:12):

@Mario Carneiro Oh I forgot to ask. Does the map over Fin values to update their bound not become free? IIUC Lean will understand the value is used once, thus do an in-place map, then realize the only changes are in proof terms, thus are no-ops, right?

Mario Carneiro (Aug 11 2022 at 06:15):

One would hope so, but you have to look at the generated code to now for sure. I'm fairly confident that this is still too magical for lean ATM

Mario Carneiro (Aug 11 2022 at 06:17):

In general I'm dubious about all these changes without actual performance numbers for comparison

Sofia (Aug 11 2022 at 06:17):

If you mean the path compression methods? That paper I linked is all about those numbers.

Mario Carneiro (Aug 11 2022 at 06:18):

I mean in lean

Mario Carneiro (Aug 11 2022 at 06:18):

the paper seems to be about concurrent unionfind, which is completely different

Mario Carneiro (Aug 11 2022 at 06:18):

all the unionfind functions are inherently serial, they mutate the data structure

Mario Carneiro (Aug 11 2022 at 06:20):

if you used them with a shared pointer to the UnionFind value it would have terrible performance, since it would have to copy the whole array first

Sofia (Aug 11 2022 at 06:20):

It discusses both. An interesting detail of the concurrent union find is you can use plain reads/writes to update the structure and you only need memory barriers to check if a self-parent candidate is actually true. No atomics, just a single fence at the end of find whose result is likely true.

Sofia (Aug 11 2022 at 06:21):

This is because stale or not, all invariants hold.

Mario Carneiro (Aug 11 2022 at 06:21):

This isn't C / Rust, we don't have that level of access to memory in the first place

Mario Carneiro (Aug 11 2022 at 06:22):

Optimization is very dependent on the compiler and runtime here

Sofia (Aug 11 2022 at 06:22):

Indeed. Would be nice though. Either way, the path compression methods were benchmarked serially in addition to later for the concurrent case.

Mario Carneiro (Aug 11 2022 at 06:22):

again, it needs to be lean numbers

Sofia (Aug 11 2022 at 06:23):

I think the bigger problem would be all the pointer-chasing. Ex. An array of fin or nat aren't inlined IIUC.

Sofia (Aug 11 2022 at 06:23):

Sure

Mario Carneiro (Aug 11 2022 at 06:25):

Note that the implementation of unionfind didn't start out 300 lines. That's only after doing all the proofs. It was only around 30 lines before all the proofs and invariants were added

Sofia (Aug 11 2022 at 06:26):

Aye. It was the interface which returns the size proof that I found surprising at first. But it does feel like that was a lot of proof work and I wonder if there would be a better way to represent it.

Mario Carneiro (Aug 11 2022 at 06:26):

you know it didn't start out looking like that, it got added because you need it to do a later part of the proof

Sofia (Aug 11 2022 at 06:27):

Right.

Mario Carneiro (Aug 11 2022 at 06:27):

I think there could be another way to represent it, but it's tricky to avoid passing unnecessary arguments

Mario Carneiro (Aug 11 2022 at 06:27):

in particular size

Sofia (Aug 11 2022 at 06:28):

Speaking of, I was a little surprised when I started playing with Lean 4 that Vect a k isn't implemented (except in comments), everything seems to use Array a.

Mario Carneiro (Aug 11 2022 at 06:28):

dependent types are not good for the compiler

Sofia (Aug 11 2022 at 06:29):

Heh. Well, I hope this will not be true for my own.

Mario Carneiro (Aug 11 2022 at 06:30):

They block various kinds of code motion optimizations because of having to introduce cast

Mario Carneiro (Aug 11 2022 at 06:30):

I don't think yours is an exception, unless you have reason to believe that the blocked optimizations aren't important for you

Sofia (Aug 11 2022 at 06:32):

Not quite sure which block you mean. Either way my design is largely based on equality saturation, and reasoning with symbolic terms.

Sofia (Aug 11 2022 at 06:37):

In the case of storing a vector, letting it grow, then shrink, if the compiler can see the upper bound (and it isn't too high), then it can allocate that region without re-allocating and copying the data around to make it fit. While in the earlier or intermediate stages, any unused allocated area can be repurposed for ephemeral local scratch space. Similarly any padding can be used for sentinels.

Sofia (Aug 11 2022 at 06:39):

Most of the time arrays/vectors can have statically known upper bounds through a little symbolic analysis. When this analysis fails, then we have to pay for the re-allocations and copies if necessary. Further we don't just want to consider the worst-case, as the average may be much friendlier and we don't want to waste memory just to be conservative.

Sofia (Aug 11 2022 at 06:44):

Linear memory management is a really good starting point.

Sofia (Aug 11 2022 at 12:44):

Huh. I updated Lean, now I get this error on one of my proofs which was working before.

  def lookup (u : @& UnionFind size) (i : Fin size) : UFNode size :=
    u.data[by rw [u.correct_size] at i; exact i]

UnionFind.lean:60:4: error: failed to prove index is valid, possible solutions:
  - Use `have`-expressions to prove the index is valid
  - Use `a[i]!` notation instead, runtime check is perfomed, and 'Panic' error message is produced if index is not valid
  - Use `a[i]?` notation instead, result is an `Option` type
  - Use `a[i]'h` notation instead, where `h` is a proof that index is valid

Mario Carneiro (Aug 11 2022 at 13:23):

you could use .get there

Mario Carneiro (Aug 11 2022 at 13:24):

the a[i] syntax now expects a Nat in the first position

Mario Carneiro (Aug 11 2022 at 13:24):

or a Fin _ but you would have to type ascribe it

Sofia (Aug 11 2022 at 13:40):

Ah. First it was Nat, then Fin, now Nat again. o.o

Sofia (Aug 11 2022 at 13:40):

Thanks, can confirm using .get worked.

Sofia (Aug 11 2022 at 13:42):

Why would the explicit type ascription work if not the type just being itself?

Mario Carneiro (Aug 11 2022 at 13:45):

It now accepts multiple types there, so a by block won't work because the type is unconstrained

Sofia (Aug 11 2022 at 13:46):

Ah, that bit. Okay. Huh.

Mario Carneiro (Aug 11 2022 at 13:47):

If you give it a Nat, then it expects a proof to be supplied, if you give it Fin _ then it doesn't need the proof

Sofia (Aug 11 2022 at 13:47):

Not even.. u.data[by rw [u.correct_size] at i; exact (i : Fin _)]

Mario Carneiro (Aug 11 2022 at 13:47):

try u.data[(by rw [u.correct_size] at i; exact i : Fin _)]

Sofia (Aug 11 2022 at 13:48):

Or u.data[(by rw [u.correct_size] at i; exact i) : Fin _]

Sofia (Aug 11 2022 at 13:48):

Your suggestion worked.

Sofia (Aug 11 2022 at 13:49):

Means the .get is shortest. shrug

Mario Carneiro (Aug 11 2022 at 13:53):

you could also do u.data[i.1]'(by rw [u.correct_size] at i; exact i.2) IIUC

Sofia (Aug 11 2022 at 13:57):

I tried a variant, with .val and .isLt which didn't work because the value is detached from the rewrite it seems.

Sofia (Aug 11 2022 at 13:58):

error: type mismatch
  i.isLt
has type
  i.val < Array.size u.data : Prop
but is expected to have type
  i✝.val < Array.size u.data : Prop

Mario Carneiro (Aug 11 2022 at 14:03):

MWE?

Sofia (Aug 11 2022 at 14:08):

structure Foo (size : Nat) where
  data : Array Unit
  size_proof : size = data.size

def wme (f : Foo size) (i : Fin size) :=
  f.data[i.1]'(by rw [f.size_proof] at i; exact i.2)