Zulip Chat Archive

Stream: batteries

Topic: Sets


Julian Berman (Mar 29 2024 at 14:36):

If I want a set-like datatype but where I want to be able to prove things about it without doing it all myself, I have to depend on Mathlib at this point right? Std4 doesn't seem to have a set type other than HashSet (which doesn't look like it has anything proven about it)?

Henrik Böving (Mar 29 2024 at 16:08):

there is no set-like datatype that is useful for programming and verified right now. Indeed HashSet has no verification but Mathlib's Set is not meant for any computation.

So it depends on whether you want to program with it or not.

Julian Berman (Mar 29 2024 at 16:29):

Yeah, I want both "in theory" but the proving part is more important than the performance characteristics at this very moment, since I'd rather be able to tell whether our design is right and worry about performance later I think. Thanks for the confirmation.

Henrik Böving (Mar 29 2024 at 16:32):

What I'm currently doing in LeanSAT is usually to define some sort of invariant on the HashMap that, if induced on I am sure I could derive some properties from if I had a tiny bit of HashMap theory. And then I propose the things that I know should easily be true as axioms with a proof sketch until the point where I can't continue anymore.

Julian Berman (Mar 29 2024 at 16:36):

Aha, interesting, OK, I suppose I should consider that as well, I mostly haven't touched HashMap itself at all because Lean.Json seems to be encouraging me to use RBMaps, so really I'm a bit haphazard already and trying to see where I can make most progress with the small amount I know.

Julian Berman (Mar 29 2024 at 16:36):

(I realize of course "which map type to use" being distinct from my same question about sets, but yeah good you mentioned that.)

Henrik Böving (Mar 29 2024 at 16:41):

https://github.com/leanprover/leansat/pull/37/files#diff-6e67bbbc7359ab88f7e957125fcdfed3a2af1a80d69cac4f46670cc34e8fb67bR9 for example like this

Kyle Miller (Mar 29 2024 at 17:06):

Something else about this problem is that mathematically we want sets to satisfy an extensionality law (sets are equal iff they have the same elements). Many data structures have multiple representations for the same underlying set.

Mathlib has docs#Finset for finite sets you can program with and prove stuff about, but they've got bad run time characteristics (it's best to think of them as being unordered Lists, so pretty much every operation is O(n) or O(n^2)).

Julian Berman (Mar 29 2024 at 17:10):

I'm missing the nuance of what you added I think Kyle.

Julian Berman (Mar 29 2024 at 17:11):

I'd surely expect any set-like type to satisfy that "mathematical" property, right?

Julian Berman (Mar 29 2024 at 17:11):

Unless by "same" you don't mean "equal"?

Kyle Miller (Mar 29 2024 at 17:12):

I'm saying that, without fancy features like quotients or other cleverness, many set data structures don't have the property that they are = if they have the same elements.

Kyle Miller (Mar 29 2024 at 17:14):

For example, a hash set's underlying Array might be various lengths, and depending on the order of insertion, depending on the hash set implementation, you could get non-= hash sets even though they have the same elements.

Kyle Miller (Mar 29 2024 at 17:15):

The point is, it might be too much to ask for an efficient data structure that has the same properties of a mathematical set. (You can fix this with quotients though, at the cost of needing to work with quotients.)

Kyle Miller (Mar 29 2024 at 17:16):

It might be worth thinking about docs#Set as being the abstract interface for a set. You can make the function HashSet X -> Set X to go from implementation to abstract interface, and maybe with enough lemmas that explain how all the HashSet operations interact with this function, that could help with proving things from the Set point of view. (I'm sort of thinking Java interface when I say "interface".)

Julian Berman (Mar 29 2024 at 17:16):

I guess I still don't know enough to understand where the issue is.

Julian Berman (Mar 29 2024 at 17:17):

Right exactly, I sort of expected some efficient data structure with a "line" in its interface, below which is some implementation detail, and above which are the things I'd expect about a set.

Julian Berman (Mar 29 2024 at 17:17):

E.g. surely Python's set satisfies that mathematical property, but I don't need to worry about how.

Kyle Miller (Mar 29 2024 at 17:19):

Here is a simple set data structure, and the ext theorem is false for it:

structure MySet (α : Type) where
  elts : List α

def MySet.contains {α : Type} [BEq α] (s : MySet α) (x : α) : Bool :=
  s.elts.contains x

theorem MySet.ext {α : Type} [BEq α] (s t : MySet α) :
    s = t   x, s.contains x  t.contains x := sorry

Julian Berman (Mar 29 2024 at 17:20):

Right I guess I would definitely not call that a set-like data structure, but maybe I'm just avoiding the point.

Kyle Miller (Mar 29 2024 at 17:21):

Why not? It has a membership operation

Julian Berman (Mar 29 2024 at 17:21):

(Thanks for humoring me / explaining of course!). I think because programming-wise I definitely assume a set is order-invariant? Not just a container.

Julian Berman (Mar 29 2024 at 17:22):

I'd expect any programming set to satisfy your ext theorem.

Kyle Miller (Mar 29 2024 at 17:22):

What about the hash set example? There are plenty of hash set implementations that don't satisfy it.

Julian Berman (Mar 29 2024 at 17:23):

I haven't actually tried to use HashSet. It doesn't satisfy that property? As I say I guess my mental image is Python's sets (though I think any other language I've used probably is similar). They're hash sets, but still satisfy that property.

Kyle Miller (Mar 29 2024 at 17:25):

Python's set definitely does not satisfy that property, because you can distinguish sets that have the same elements:

>>> s = set()
>>> t = set()
>>> s == t
True
>>> id(s) == id(t)
False

That may seem like cheating, but the point of mathematical = is that they'd be indistinguishable, and you can substitute one for the other.

I can cook up a less cheating example too. (Edit: maybe not, I thought I remembered that sets preserved insertion order, but that might be way back in python 2!)

Julian Berman (Mar 29 2024 at 17:27):

Aha. Sorry so probably this is a more basic Lean thing I should know by now, but = is not a typeclass then? I.e. you can't redefine = arbitrarily?

Kyle Miller (Mar 29 2024 at 17:28):

docs#Eq

Julian Berman (Mar 29 2024 at 17:28):

(I was indeed mapping = to Python's ==.)

Kyle Miller (Mar 29 2024 at 17:29):

The way you override = in Lean is you use quotient types to create a new type with the new equality.

Julian Berman (Mar 29 2024 at 17:29):

Aha, right...

Kyle Miller (Mar 29 2024 at 17:31):

Here's the Python set implementation: https://github.com/python/cpython/blob/main/Include/cpython/setobject.h#L36 and https://github.com/python/cpython/blob/main/Objects/setobject.c

Let's say we translated that to a structure in Lean somehow.

The big issue is that these use linear probing, so the underlying data structure really is represented differently with different insertion orders. https://github.com/python/cpython/blob/main/Objects/setobject.c#L219

Kyle Miller (Mar 29 2024 at 17:33):

There's also logic for the table getting too full or too empty, and then resizing the table. That means the "same" set can even be represented with different table sizes in practice.

Julian Berman (Mar 29 2024 at 17:33):

Right I understand that there's an issue as soon as we have to start talking about the underlying data representation to do anything.

Kyle Miller (Mar 29 2024 at 17:37):

In the design here, there seem to be two different strategies:

  1. Work to make a practical data type that has extensionality. This might be from using quotients to impose an extensional equality, or from being clever with the underlying representation.
  2. Use Set as the abstract interface, and prove everything in terms of the "cast" to Set (and maybe have automation to help with the translation?)

Maybe also

  1. Realize that your application doesn't need extensionality (i.e., you don't need to substitute one set for another in equalities, justified by them having the same elements), so it's ok that it's not a mathematical set.

Alex Keizer (Mar 29 2024 at 17:56):

What is the problem with quotients, from an API user's perspective?

It of course adds quite a big burden on the implementor, which now has to prove the quotient relation is respected by every operation, but this in some sense the "right thing", if you want to show that the Set implementation does not actually depend on the insertion order

Marc Huisinga (Mar 29 2024 at 18:00):

Julian Berman said:

Aha, interesting, OK, I suppose I should consider that as well, I mostly haven't touched HashMap itself at all because Lean.Json seems to be encouraging me to use RBMaps, so really I'm a bit haphazard already and trying to see where I can make most progress with the small amount I know.

IIRC Lean.Json was written before deriving handlers were a thing, and implementing Ord seemed less annoying than implementing Hashable.

For programming purposes, I personally tend to stick with RBMap unless I need the better performance of HashMap because linearity bugs when using the former have a much smaller fallout than linearity bugs when using the latter.

Edit: Ah, this restriction may have also played a role in why I ended up using RBMaps.

Kyle Miller (Mar 29 2024 at 18:01):

@Alex Keizer I believe it's just a library implementation burden.

Timo Carlin-Burns (Mar 29 2024 at 19:54):

I made a thread about making HashMap quotient earlier, and one of the problems is that it would mean that HashMap.toList cannot be implemented, and it wasn't clear that extensionality was worth this cost.

Kyle Miller (Mar 29 2024 at 20:07):

@Alex Keizer I take it back. Marc reminded me that if you define a type using a quotient then it can't appear in nested inductive types. (More precisely, it can't have the type-being-defined as one of its arguments.)

Julian Berman (Mar 29 2024 at 20:19):

Timo Carlin-Burns said:

I made a thread about making HashMap quotient earlier, and one of the problems is that it would mean that HashMap.toList cannot be implemented, and it wasn't clear that extensionality was worth this cost.

I'll have to think about your toList implications (if that means it can't be implemented because only some arbitrary ordering is possible? That seems ok to me so I guess I have to understand why that matters?)
But I'm glad you linked that as I also left the thread thinking "does this mean == is all I should ever care about" which I see you muse about there

Kyle Miller (Mar 29 2024 at 20:38):

By the way, yet another way to relax equality, which pre-dates BEq, is docs#Setoid, which gives you a Prop-valued relation x ≈ y. These have to be lawful in the sense that they're reflexive, symmetric, and transitive, which BEq doesn't require at all.

This also plugs into docs#Quotient, which gives you a type whose = is this .

Mentioning this because you could also decide that maybe is all you should care about. I find it to be confusing how == sometimes is meant to be a Bool-valued =, and sometimes it's a Bool-valued , depending on the type!

Kyle Miller (Mar 29 2024 at 20:41):

The word "setoid" doesn't really help with advertising this typeclass, and how it's sort of like "the typeclass for Eq" that people tend to think that Eq is.

I think it would be better if there were a HasApprox or HasRel notation typeclass, and then Setoid adds the equivalence relation laws (LawfulApprox or LawfulRel?).

Julian Berman (Mar 29 2024 at 20:45):

I think what I need to do to start is make more clear (to myself) what I even mean by "care about" -- what I really mean is "I/a collaborator are slightly worried we will do something wrong and make lots of existing Lean tactics not work". The thing we are proof-of-concepting is a JSON Schema implementation, and morally the return type of validation is something like a slightly enriched Set ValidationError. Order is not relevant, and in any other language I'd use the obvious set type for the language. But of course the point of Lean is being able to prove things about this set, and if no tactics are going to work with what we choose then we are in bad shape

Eric Wieser (Mar 29 2024 at 21:38):

One thing that might make sense in Std; "sets" implemented as (strictly)ordered lists?

Yakov Pechersky (Mar 29 2024 at 22:50):

One additional python trivia bit, since python dicts and sets are now by spec insert order iterable, you can have two equal sets s and t but list(s) and list(t) are not equal. I see Kyle mentioned this already and I likely have made the same memory mistake. Further edit: dicts are insertion order as a feature, not just implementation detail, and so, carry it as a guarantee. Sets are not guaranteed to be insertion order iterable.

Mario Carneiro (Mar 30 2024 at 00:00):

For what it's worth, I have definitely considered upstreaming mathlib's Set type to Std. We already have the notation, and it would not be much work to maintain the basic boolean operators on set, maybe image and range too, but dropping all the lattice theory. Basically the Data.Set.Basic and Data.Set.Lattice files

Mario Carneiro (Mar 30 2024 at 00:01):

I have occasionally found uses for it in software verification contexts when talking about invariants

Mario Carneiro (Mar 30 2024 at 00:03):

I would make sure to clearly separate this from data structures for maintaining a finite set of items, like RBSet and HashSet. I don't think we need too much more in this regard in std, other than filling out the API of the existing data structures. We may want to revisit Finset in the future but for now I think we don't need it.

Mario Carneiro (Mar 30 2024 at 00:09):

Alex Keizer said:

What is the problem with quotients, from an API user's perspective?

It of course adds quite a big burden on the implementor, which now has to prove the quotient relation is respected by every operation, but this in some sense the "right thing", if you want to show that the Set implementation does not actually depend on the insertion order

It's not just about the implementation burden. Not all operations will actually satisfy the quotient properties, and it is bad to make those other operations not expressible. (See the horrible hack used to make Repr ℝ work in mathlib.) The main ones will, but for example Rust's Vec<T> has a capacity() method which returns different values on "equal" vectors. Iterating a HashMap will be incredibly painful for the user code if the HashMap is a quotient type.

Mario Carneiro (Mar 30 2024 at 00:14):

The implementation burden isn't even really any different, because all the lemmas will have to be there in the end either way. But if it's explicitly represented as map1 ~ map2 -> (map1.find? x <-> map2.find? x), then you have the flexibility to consider different equivalence relations in different lemmas, to reflect how sensitive the individual operation is to internal details.

James Gallicchio (Apr 03 2024 at 00:32):

fwiw, LeanColls approaches this by separating data from its theory models. there's a typeclass for finite sets, which requires a ToFinset function. then all correctness can be stated in terms of toFinset, which has the nice properties of a quotient here, but you can still iterate over the elements, and there's some theorem linking iteration over the elements to the finset model

James Gallicchio (Apr 03 2024 at 00:41):

(though that theorem linking iteration to the finset model is missing because I haven't gotten around to it yet :sweat_smile:)


Last updated: May 02 2025 at 03:31 UTC