Zulip Chat Archive

Stream: lean4

Topic: Finite Sets


Marcus Rossel (May 22 2022 at 11:36):

{What is the, Is there a} type used for programming with finite sets in Lean 4?
I'm currently using HashSet, but AFAIK that's lacking certain basic operations like map, so I'm unsure of whether this is the type intended for (computable) finite sets in general.

Kevin Buzzard (May 22 2022 at 11:43):

My MSc student @Hubert Ostoja-Petkowski has ported mathlib3's finset to Lean 4 -- see https://github.com/Hop311/M4R/blob/main/M4R/Set/Finite/Finset.lean . I should say that this is not in the context of mathlib4, this is an independent project.

Marcus Rossel (May 22 2022 at 11:44):

Kevin Buzzard said:

My MSc student Hubert Ostoja-Petkowski has ported mathlib3's finset to Lean 4 -- see https://github.com/Hop311/M4R/blob/main/M4R/Set/Finite/Finset.lean . I should say that this is not in the context of mathlib4, this is an independent project.

Oh perfect, thanks! And it doesn't use any quotients :)

Kyle Miller (May 22 2022 at 16:39):

That looks like a specifically noncomputable Finset implementation, which probably wouldn't work for a programming application.

@Marcus Rossel If you have a computable Finset, beware they have O(n) lookup, O(n) insert, and O(n^2) map (or O(n) if the function being mapped is injective). This is because the only operation on the underlying type they use is Eq. If you want the properties of a HashSet, it probably would be better to use it instead and implement missing functionality.

Kyle Miller (May 22 2022 at 16:46):

I've thought about finset as a "least common denominator" implementation for computable finite sets -- decidable Eq is usually not hard to come by. But with additional operations you can make much more efficient data structures. For example, if you have a total order you can implement finite sets by using sorted lists (that's O(ln n) lookup, O(n) insert, and O(n) map), or if you have a hash function you can use some hash table structure (that's on average O(1) lookup, O(1) insert, and probably O(n) map).

Marcus Rossel (May 22 2022 at 16:46):

@Kyle Miller: Thanks for the feedback! I did end up going with HashSet + add missing functionality, so your suggestion is reassuring :innocent:
Do you think basic functions like HashSet.union and HashSet.map would be welcome as contributions to Lean 4?

Kyle Miller (May 22 2022 at 16:52):

I'm not sure what contributions Lean 4 is accepting right now, so someone else should answer that question. (I myself have a small library of Array theorems, but I haven't looked into whether anyone would want them yet.)

Henrik Böving (May 22 2022 at 17:13):

Marcus Rossel said:

Kyle Miller: Thanks for the feedback! I did end up going with HashSet + add missing functionality, so your suggestion is reassuring :innocent:
Do you think basic functions like HashSet.union and HashSet.map would be welcome as contributions to Lean 4?

You can just stuff everything into Mathlib right now if you want to, at some point there will be a "real" stdlib that lives outside of the compiler.

James Gallicchio (May 23 2022 at 05:48):

I'm planning to add HashSet + more functionality to the collections library I'm making. Is your use case public? I'd love to try adapting it to the collections library!

James Gallicchio (May 23 2022 at 05:49):

(And I'd be happy to have the code there instead of in Mathlib if it seems particularly out of place in Mathlib)

Marcus Rossel (May 23 2022 at 08:50):

James Gallicchio said:

I'm planning to add HashSet + more functionality to the collections library I'm making. Is your use case public? I'd love to try adapting it to the collections library!

It's just this, so far:

import Std
open Std

instance {α β} [BEq α] [BEq β] : BEq (α  β) where
  beq
    | .inl x₁, .inl x₂
    | .inr x₁, .inr x₂ => BEq.beq x₁ x₂
    | _, _ => false

instance {α β} [Hashable α] [Hashable β] : Hashable (α  β) where
  hash
    | .inl x | .inr x => Hashable.hash x

namespace Std.HashSet

variable {α β} [BEq α] [BEq β] [Hashable α] [Hashable β]

def map (s : HashSet α) (f : α  β) : HashSet β :=
  s.fold (fun res a => res.insert <| f a) (mkHashSet s.numBuckets)

def union (s₁ s₂ : HashSet α) : HashSet α :=
  s₁.fold (·.insert ·) s₂

def bUnion (s : HashSet α) (f : α  HashSet β) : HashSet β :=
  s.fold (fun res a => (f a).union res) (mkHashSet s.numBuckets)

def taggedUnion (s₁ : HashSet α) (s₂ : HashSet β) : HashSet (α  β) :=
  s₂.fold (fun res b => res.insert <| .inr b) (s₁.map Sum.inl)

def prod (s₁ : HashSet α) (s₂ : HashSet β) : HashSet (α × β) :=
  s₁.bUnion fun a => s₂.map fun b => (a, b)

def filter (s : HashSet α) (f : α  Bool) : HashSet α :=
  s.fold (fun res a => if f a then res else res.erase a) s

end Std.HashSet

def List.toHashSet [BEq α] [Hashable α] (l : List α) : HashSet α :=
  l.foldl (·.insert ·) (mkHashSet <| l.length / 2)

macro "{" elems:term,* "}" : term => `(List.toHashSet [$elems,*])

instance [ToString α] [BEq α] [Hashable α] : Repr (HashSet α) where
  reprPrec s _ :=
    match s.toList.reverse with
    | [] => "∅"
    | l => "{" ++ (l.toStringAux true) ++ "}"

instance {α} [BEq α] [Hashable α] : ForIn m (HashSet α) α where
  forIn s := s.toList.forIn

Yakov Pechersky (May 23 2022 at 11:30):

The sum type instance will likely cause problems. If you have nat + nat, the values of inl 1 and inr 1 will have a collision.

Marcus Rossel (May 23 2022 at 11:52):

Yakov Pechersky said:

The sum type instance will likely cause problems. If you have nat + nat, the values of inl 1 and inr 1 will have a collision.

Yeah, I wasn't sure how to fix that initially. But it shouldn't cause critical problems, should it? The only consequence I see is worse performance as a result of more hash collisions.

Do you think this would fix it?:

instance {α β} [Hashable α] [Hashable β] : Hashable (α  β) where
  hash
    | .inl x => mixHash 0 (hash x)
    | .inr x => mixHash 1 (hash x)

Kyle Miller (May 23 2022 at 12:12):

I think you're right that it's not a critical problem -- worst case you're forced to have two objects per hash bucket even with unlimited buckets. Still, it's nice knowing your hash functions don't have likely collisions.

It seems using mixHash the way you have is the way to fix the hash function. You might want to choose numbers more interesting than 0 and 1 though to get https://github.com/leanprover/lean4/blob/master/src/runtime/hash.h#L24 to mix things up more (that's the C implementation of mixHash I believe). I noticed these tend to be 2-4 digit numbers elsewhere in the library.

Jannis Limperg (May 23 2022 at 14:47):

I've also seen small primes used as constants for mixHash, e.g. for Expr. But I wasn't able to find any general info on what good constants for this hash function are.

James Gallicchio (May 23 2022 at 15:04):

(I can't find this particular hash function online, does anyone know what it was pulled from?)

Jannis Limperg (May 23 2022 at 15:21):

hash.cpp points here: http://burtleburtle.net/bob/hash/doobs.html

Marcus Rossel (May 23 2022 at 16:54):

@Jannis Limperg I saw 11 and 13 being used by Option's hashing implementation.


Last updated: Dec 20 2023 at 11:08 UTC