Zulip Chat Archive

Stream: general

Topic: Counting the size of a structure


Sorawee Porncharoenwase (Jun 27 2021 at 03:35):

Given a Lean data structure, is there a way to count the size of the value, taken reference sharing into account?

As a concrete example, consider:

inductive A : Type
| a : A
| b : A  A  A

open A

def c := b a a -- 3 / 2
def d := b c c -- 7 / 3
def e := b d d -- 15 / 4
def f := b e e  -- 31 / 5

If I walk over the tree and count every node one, I will get that f has size 31. However, in most functional languages, references can be shared, so a more accurate size would be 5.

Does Lean share references? And if so, is it possible to compute the size somehow?

Yaël Dillies (Jun 27 2021 at 09:33):

Would something like that do what you want?

import data.finset.basic

inductive A : Type
| a : A
| b : A  A  A

open A

instance : decidable_eq A := sorry

def without_ref : A  
| a       := 1
| (b l r) := without_ref l + without_ref r + 1

def with_ref_aux : A  finset A
| a       := 
| (b l r) := insert (b l r) (with_ref_aux l  with_ref_aux r)

def with_ref (x : A) :  :=
(with_ref_aux x).card + 1

def c := b a a -- 3 / 2
def d := b c c -- 7 / 3
def e := b d d -- 15 / 4
def f := b e e  -- 31 / 5

#eval without_ref f
#eval with_ref_aux c
#eval with_ref c

Yaël Dillies (Jun 27 2021 at 09:36):

You'll have to define to define what you mean by size anyway. If you really mean reference and not equality, then no that's impossible. If c = d = b a a, then you can't expect size (b c c) = 3 and size (b c d) = 4, because then substituting c = d would give 3 = 4.

Sorawee Porncharoenwase (Jun 27 2021 at 10:31):

I really do mean reference, unfortunately.

Can you confirm though that Lean does share reference internally? Intuitively it should be, since _not_ sharing seems harder to implement, but I want to be sure.

Yaël Dillies (Jun 27 2021 at 10:32):

I don't know. Ask Gabriel Ebner or Mario Carneiro.

Eric Wieser (Jun 27 2021 at 10:39):

I assume what you mean here is "size when run in the VM"?

Yakov Pechersky (Jun 27 2021 at 12:00):

There's docs#sizeof that is used, inter alia, by the equation compiler to infer that terms aren't growing without bound in recursive definitions. Perhaps it is related to what you need.

Jason Rute (Jun 27 2021 at 12:00):

I think it might help for you to give some discussion about what you are looking to do with this information. Also there are different levels here.

There is the pure mathematical level. In that case, as Yael pointed out, since Lean as a math language, can’t behave differently on equal objects, there is no way to have a math function which takes a value as input and can say anything about its implementation. (This goes as well for sizeof mentioned above.)

Then there is the expression level. Your f definition is stored in the environment, and using meta code, you can examine its expression, recursing over the tree. In that case, I know references are shared since e in the expression for f is just basically a reference to the previous e declaration. You would have to do it as something like refsizeof “e” where refsizeof is a meta function which looks up a name from the environment, gets its expression, and traverses the tree. (Edit: the refsizeof doesn’t exist. You would have to write it.)

Finally, there is the execution level, where this datastructure is turned into computer code and uses pointers to other data structures. It would be silly not to share references, but I don’t know how the VM works exactly and if there is any way to probe it. Also, for Lean 4 (every thing I’ve said so far is for Lean 3) you might find this paper interesting where they show optimizations to recursive computations exploiting shared references: https://dl.acm.org/doi/10.1145/3408997

Stephen Butler (Jun 27 2021 at 13:44):

Can I ask the dumb question - why do you want/care about this? What is it you are ultimately trying to achieve?

If what you want is not possible, it's possible there's another way to achieve what you actually want without having to resolve this particular issue.

Kyle Miller (Jun 27 2021 at 15:45):

Leo wrote some code to calculate the size of a term for a Lean 4 test case: https://github.com/leanprover/lean4/blob/master/tests/lean/run/KyleAlg.lean (see dagSize) -- not sure if it can be ported to Lean 3, but maybe it's useful to see.

Jason Rute (Jun 27 2021 at 20:15):

I can't seem to find any meta function in Lean 3 that gives access to the pointer address of an object (this is what makes that Lean 4 code work as well as the optimizations in the paper). However, if you really want to be certain that Lean 3 shares references (in the VM), here is a simple proof by example:

inductive A : Type
| a : A
| b : A  A  A

/-- Binary tree of given height (not counting root).  Implemented via sharing references. -/
def binary_tree :   A
| 0 := A.a
| (n+1) := let t := binary_tree n in A.b t t

def left_path_size : A  
| A.a := 0
| (A.b t1 _) := 1 + (left_path_size t1)

#eval left_path_size (binary_tree 2) -- 2

/- This tree if expanded is larger than the number of atoms in the universe -/
#eval left_path_size (binary_tree 1000) -- 1000

Mario Carneiro (Jun 28 2021 at 15:12):

As others have already surmised, this information is not exposed to lean code, although the VM is keeping track of it. It would not be hard to expose though, with a signature like this:

meta constant reference_count {A B} (x : A) (f : A -> nat -> B) : B

This is used by lean to make some behavioral changes for unshared arrays, for example. But knowing that something has 6 references or 5 is not necessarily so useful, and also you will have to be careful about off by one errors because the reference_count function itself will count as an extra reference (that's why it is being passed through the function in this example instead of just consumed)

Mario Carneiro (Jun 28 2021 at 15:15):

expr.replace is an example of a function that exploits dag sharing in order to efficiently traverse a highly shared expr data structure. Unfortunately I don't know a way to do the same thing for user types, so for example with the binary_tree type in Jason's example I don't know how you could write a size function that doesn't take exponential time in the highly shared case

Mario Carneiro (Jun 28 2021 at 15:17):

we could just do something similar to lean 4 and expose pointer positions so that you can stick them in a hash table. Of course you will lose any ability to prove the correctness of such a function, and there is no way to avoid the unsafety of these accessors because they don't respect equality

Sorawee Porncharoenwase (Jun 29 2021 at 03:29):

Oh, I didn't necessarily want a function like reference_count that "just works" (although it would be cool to have one nonetheless). Having a function that takes a value and return an address (perhaps virtual one) would suffice for me. I can walk the structure and do the counting myself.

Sorawee Porncharoenwase (Jun 29 2021 at 03:31):

But as Jason said, this doesn't seem to be possible to Lean 3. But it is possible in Lean 4, as Kyle's link shows.

Jason Rute (Jun 29 2021 at 03:36):

As Mario said it would be possible to add access to the pointer as a meta function, but it would require modifying Lean 3 to expose it. (It would have to be meta since it doesn’t respect equality. Similarly the Lean 4 version is unsafe.)


Last updated: Dec 20 2023 at 11:08 UTC