Zulip Chat Archive

Stream: general

Topic: fast data structures?

Kevin Lacker (Oct 27 2020 at 18:31):

(Re-asking here after crickets in new-members.) While trying to clean up this implementation of rewrite_search I am trying to figure out which data structures are the fastest and had a couple questions. I would expect that it's the ones implemented in the VM. It looks like d_array is supposed to be, for example. My first question is, does this work quickly as I'd expect? When I look at the code some of the methods don't seem like they would be efficient. For example from https://github.com/leanprover-community/lean/blob/0ec2a19/library/init/data/array/basic.lean#L11:

/-- `write a i v` sets the `i`th member of `a` to be `v`. Has builtin VM implementation. -/
def write (a : d_array n α) (i : fin n) (v : α i) : d_array n α :=
{data := λ j, if h : i = j then eq.rec_on h v else a.read j}

That implementation looks like it will be O(n) to evaluate a read after doing n writes. Is it actually being replaced with something else more efficient by the VM?

My second question is whether there are other efficient data structures. In particular a good map structure like a hashmap or a treemap would be nice. There are a bunch of different tree-like structures lying around and I was wondering if there's conventional wisdom on which are the best to use.

Johan Commelin (Oct 27 2020 at 18:32):

I think that indeed the definition of write is replaced in the VM

Gabriel Ebner (Oct 27 2020 at 18:33):

https://github.com/leanprover-community/lean/blob/016b855d61d4e8959a2ad689678b0d7dc80138bb/src/library/vm/vm_array.cpp#L189

Kevin Lacker (Oct 27 2020 at 18:35):

why is it even defined in lean at all if it's replaced by the vm? I guess that part is just used as formalization?

Kevin Lacker (Oct 27 2020 at 18:36):

in that directory i see rb_map, that looks promising

Mario Carneiro (Oct 27 2020 at 18:47):

There is native.rb_map for a good rb_map implementation

Mario Carneiro (Oct 27 2020 at 18:47):

The pure rb_map in core is kind of broken

Kevin Lacker (Oct 27 2020 at 18:47):

yikes. how is it "kind of" broken?

Mario Carneiro (Oct 27 2020 at 18:48):

There is also docs#hash_map if you want a hash map

Gabriel Ebner (Oct 27 2020 at 18:48):

Isn't the hash map even slower?

Mario Carneiro (Oct 27 2020 at 18:48):

than what?

Mario Carneiro (Oct 27 2020 at 18:49):

The operations of rb_map are fine, but the invariant is wrong so you can't construct the full API

Kevin Lacker (Oct 27 2020 at 18:49):

this hash_map uses a hash function written in lean iiuc, so it sort of can't be all that fast

Gabriel Ebner (Oct 27 2020 at 18:49):

Than native.rb_map.

Mario Carneiro (Oct 27 2020 at 18:50):

Oh sure, anything written in C++ is going to be faster just because it's compiled

Mario Carneiro (Oct 27 2020 at 18:50):

I wasn't comparing those anyway

Mario Carneiro (Oct 27 2020 at 18:51):

This is sort of the reason my plan to formalize Okasaki's pure functional data structures book fell through

Kevin Lacker (Oct 27 2020 at 18:51):

lean has its own implementation of an array, huh. https://github.com/leanprover-community/lean/blob/016b855d61d4e8959a2ad689678b0d7dc80138bb/src/library/parray.h i wonder why not use STL data structures

Mario Carneiro (Oct 27 2020 at 18:52):

besides the fact that we still have no memoization primitive, the performance of the lean 3 VM is sufficiently bad that it's kind of pointless

Mario Carneiro (Oct 27 2020 at 18:52):

I wonder why not use STL data structures

Which one?

Mario Carneiro (Oct 27 2020 at 18:52):

parray is a persistent array, std::array would not be a replacement

Kevin Lacker (Oct 27 2020 at 18:53):

oh i see. can I get a regular old non-persistent array, or is that philosophically forbidden?

Mario Carneiro (Oct 27 2020 at 18:54):

I think it would be useful to have such a thing, but it is very difficult to use properly in lean

Mario Carneiro (Oct 27 2020 at 18:55):

specifically, you have to use the array linearly, and unfortunately lean sometimes holds on to extra references in places where it probably shouldn't and this kills your performance

Mario Carneiro (Oct 27 2020 at 18:55):

If you use a parray completely linearly, I believe it acts basically like an array

Mario Carneiro (Oct 27 2020 at 18:56):

IIRC there is an option you can set to see when non-destructive updates are being performed

Kevin Lacker (Oct 27 2020 at 19:00):

ok. so when you say the native.rb_map is good and the pure rb_map in core is kind of broken, which one is the good one? I think I only see https://github.com/leanprover-community/lean/blob/0ec2a19/library/init/meta/rb_map.lean#L15 in the lean codebase

Gabriel Ebner (Oct 27 2020 at 19:34):

The other one is called rbmap (without the underscore).

Kevin Lacker (Oct 27 2020 at 19:37):

so rb_map good, rbmap bad? which is which here

Mario Carneiro (Oct 27 2020 at 19:38):

the C++ one has a native in the name

Kevin Lacker (Oct 27 2020 at 19:40):

ok cool. yet another question - what's the difference between d_array and array? They look the same to me. A comment at https://github.com/leanprover-community/lean/blob/0ec2a19/library/init/data/array/basic.lean#L138 says that array is "a non-dependent array", but I have never heard the phrase "dependent array" before and google does not enlighten me

Mario Carneiro (Oct 27 2020 at 19:40):

Look at the types

Mario Carneiro (Oct 27 2020 at 19:41):

d_array allows the type of each element of the array to be different

Mario Carneiro (Oct 27 2020 at 19:41):

You won't find such a feature in most programming languages because they lack dependent types

Kevin Lacker (Oct 27 2020 at 19:43):

i am familiar with this feature, from such obscure language as python, javascript, php, ruby... ;-)

Mario Carneiro (Oct 27 2020 at 19:43):

for example, d_array 100 (λ i, array i.1 ℕ) is a two-dimensional triangular array

Mario Carneiro (Oct 27 2020 at 19:44):

python, javascript, etc cannot express this

Mario Carneiro (Oct 27 2020 at 19:44):

they can say "this is an array of things" but they can't have a strongly typed version

Kevin Lacker (Oct 27 2020 at 19:45):

OK i get it - so probably array is saner for most applications

Mario Carneiro (Oct 27 2020 at 19:45):

yes

Mario Carneiro (Oct 27 2020 at 19:46):

it's useful for proving properties though, for example if you want the elements of the array to satisfy some invariant that depends on the index

Kevin Lacker (Oct 27 2020 at 20:25):

yet another array question. the array's length is part of its type signature. is there a way to have something akin to a java "arraylist" where it can be of any length, but its implementation is an array rather than a list, so that random access is efficient? i just want to do something like, pass arrays around as arguments without describing what the length is all the time

Mario Carneiro (Oct 27 2020 at 20:26):

See the buffer type

Johan Commelin (Oct 27 2020 at 20:43):

Why is there a has_lt instance in

protected meta def mk (key : Type) [has_lt key] [decidable_rel ((<) : key → key → Prop)] (data : Type) : rb_lmap key data :=
rb_map.mk key (list data)

Because rb_map.mk doesn't need it...

Gabriel Ebner (Oct 28 2020 at 08:20):

There's a has_lt key in docs#native.rb_map.mk

Johan Commelin (Oct 28 2020 at 08:48):

Ooh, you are right

Last updated: May 02 2025 at 03:31 UTC