Zulip Chat Archive

Stream: lean4 dev

Topic: Unboxed structure types

Simon Dima (Jul 01 2025 at 08:38):

It would be nice if inductive types with only one constructor (such as Prod and structures) could be represented unboxed in the runtime and passed to/returned from functions without allocating on the heap and following a pointer. It looks like the .struct constructor of IRType is intended for this purpose, but it seems unused currently.

Of course, structs can't be passed unboxed everywhere, since polymorphic functions need values to have the same size whatever their type (though I do wonder whether the specialize pass monomorphizes polymorphic functions called on statically-known types, which would increase the number of places where struct unboxing is possible).

@Cameron Zwarich, I assume you've already considered this. How desirable/realistic would it be to implement?

Cameron Zwarich (Jul 01 2025 at 13:51):

This is one of the two big things that I'm focusing on in the compiler this quarter (the other one being join point improvements like recursive join points, making all passes be more optimistic with respect to join points, etc.). The simplest form is returning a simple structure; it gets a bit more complicated if you consider other inductive types or throw parameters in as well.

I'm actually doing a bit of a warm-up exercise and attempting another optimization mentioned in the IR comments. There are both object and tobject types, but at the moment we only use object. The original intent was that tobject would represent a value that is either an object pointer or a tagged scalar, and object would represent a value that is definitely an object pointer. This was never implemented, and we actually made everything object but with tobject semantics. If we did this correctly, we could use lean_inc_ref/lean_dec_ref directly on object, while using the current lean_inc/lean_dec on tobject.

Simon Dima (Jul 01 2025 at 14:02):

The distinction being that lean_inc always checks whether its argument is a shifted-and-lsb-tagged scalar or a pointer, and that lean_inc_ref omits that check?

Simon Dima (Jul 01 2025 at 14:29):

This is one of the two big things that I'm focusing on in the compiler this quarter

Would that also include passing fixed-width integer types as regular C uintN_t instead of shifting-and-tagging (which lean.h seems to be calling lean_box and lean_unbox)?

Robin Arnez (Jul 01 2025 at 14:30):

Simon Dima schrieb:

This is one of the two big things that I'm focusing on in the compiler this quarter

Would that also include passing fixed-width integer types as regular C uintN_t instead of shifting-and-tagging (which lean.h seems to be calling lean_box and `lean_unbox)?

This should already be done..?

Robin Arnez (Jul 01 2025 at 14:32):

To be exact, UInt8 etc. values in return types and parameters are represented as uint8_t and are read from/written to inductives using lean_ctor_get_uint8 / lean_ctor_set_uint8 etc.

Simon Dima (Jul 01 2025 at 14:39):

Robin Arnez said:

To be exact, UInt8 etc. values in return types and parameters are represented as uint8_t and are read from/written to inductives using lean_ctor_get_uint8 / lean_ctor_set_uint8 etc.

I've seen https://lean-lang.org/doc/reference/latest/The-Type-System/Inductive-Types/#inductive-types-ffi which seems to be describing the latter part of what you're saying (fixed-width integer types in constructor arguments), but if I'm reading https://lean-lang.org/doc/reference/latest/The-Type-System/Inductive-Types/#inductive-types-runtime-special-support right, it says that standalone UInt32 et al are shifted-and-tagged.

Simon Dima (Jul 01 2025 at 14:44):

By the way, is there an inconsistency in terminology between the names of functions in the C runtime and the docs? The latter link refers to shifting-and-tagging as a way of storing small scalar values "unboxed", which seems to imply that "boxed" means "behind a pointer" in the reference, but lean_box and lean_unbox are exactly about doing and undoing this encoding

Robin Arnez (Jul 01 2025 at 14:48):

That's really misleading documentation. UInt8, …, UInt64, Int8, …, Int64, and USize are always represented as uint8_t, uint16_t etc in parameters and return values. Only when they need to be boxed (e.g. to put them into Arrays / Lists) they are represented as a (x << 1) & 1 or a pointer. In other words, the first few sentences describe what happens when you have a generic data structure and store a fixed-length integer in it.

Simon Dima (Jul 01 2025 at 14:59):

https://lean-lang.org/doc/reference/latest/Basic-Types/Fixed-Precision-Integers/#fixed-ints seems to have the correct information on this then

Simon Dima (Jul 01 2025 at 15:32):

Pinging @David Thrane Christiansen, since the confusing bit was introduced here.
I think it would also make sense to add "boxed" as a tech term to the reference, since it's not immediately obvious what it means.

Mario Carneiro (Jul 01 2025 at 15:58):

I think this is a clash of nomenclatures; "boxed" is frequently used outside lean to talk about introducing a pointer indirection

Mario Carneiro (Jul 01 2025 at 16:00):

lean has "boxing" that sometimes but not always introduces a pointer indirection

David Thrane Christiansen (Jul 02 2025 at 07:30):

Thanks for catching this. I added "boxed" as a hyperlinkable term, removed the incorrect text, and added a cross-reference to the correct text from where the incorrect text was.

Much appreciated!

Cameron Zwarich (Jul 03 2025 at 06:10):

I actually have a related naming problem now that I’m working on the IR code. Some inductive types that are enum-like get converted to an actual scalar type like u8 (actually, that’s the only one due to the constructor limit). We also have other inductive types where some constructors are represented as tagged scalars (which are “boxed” in the sense mentioned above). The code currently calls these “scalar” constructors, but it is a bit confusing given the inductive types that actually get converted to scalars.

Another confusing naming distinction is the difference between object and tobject. The t stands for “tagged” (referring to the tagged or boxed scalar values mentioned before), but all objects have a tag stored in them. I would like to introduce a third thing (taggedScalar or the like) so that tobject is the lattice join of object and this new thing. This would simplify some analyses and make them more type-directed.

Does anyone have a better naming scheme for all of these concepts?

Srayan Jana (Sep 01 2025 at 22:18):

I've been curious actually, does Lean box (i.e store on the heap) all of its structs. That's what Java does, and the Java devs regard it as a huge mistake (which is why Valhalla exists as a project)

Srayan Jana (Sep 01 2025 at 22:24):

Okay going through this channel and reading it a bit more, it seems that even though stuff like Strings are represented in the code as Linked List of Chars, they are infact an array of chars.

But, there isn't a way to force Lean to make your struct unboxed?

Kiran (Nov 30 2025 at 00:48):

I can't say for the specific motivations in the case of Lean, but the general motivation for this memory representation in functional programming languages is to have a unified representation of all values at runtime, which is required if you want to support polymorphic functions without having to monomorphise them (and in the case of polymorphic recursion, the set of concrete types that a function will be instantiated with may be nontrivial/undecidable to compute).

Consider: def id (a: A) : A if types could have variable sizes (if structs were unboxed), then you'd have to generate multiple versions of id which take all possible sizes of values (monomorphise it). With the boxing of structs, runtime values all take the same size, so handling polymorphic functions are easy.

Last updated: Feb 28 2026 at 14:05 UTC