Zulip Chat Archive

Stream: lean4 dev

Topic: Type and universe information kept during compilation


Simon Dima (Apr 30 2025 at 16:39):

I'm trying to understand the new compiler. My current understanding of what happens to types is that

  • After elaboration, we know the type of every expression because everything is fully annotated; types at this stage are just regular Exprs
  • During translation of the code to LCNF, the types are translated to LCNF types, which erase dependent type information but try to keep most other type information, including universe levels, for the first pass of LCNF program transformations.
  • At the start of the mono phase, universe levels are erased from LCNF types
  • During translation to IR, the types get translated to IR types, which give information about the low-level memory representation of values in the runtime.
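
For reference, here is how I have been peeking at the output of the later stages (a minimal sketch; `trace.compiler.ir.result` is just the option I found for dumping the final IR, and the set of available trace options may differ between toolchain versions):

```lean
-- Dump the IR generated for `double`; this is where the low-level IR types
-- (boxed object vs. unboxed scalar) become visible.
set_option trace.compiler.ir.result true in
def double (x : Nat) : Nat := x + x
```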

I have several questions about this process; answers of any length to any subset of these would be much appreciated :)

  • The Rocq/Coq extraction pipeline, which goes to OCaml (more precisely, to an untyped lambda calculus IR used in the OCaml compiler), has full type erasure as one of the first steps. Why does the Lean compiler do things differently and keep type information? Is it because the OCaml runtime has uniform memory representation for values and Lean's does not?
  • I found a comment mentioning that keeping as much type information as possible in LCNF types enables more optimizations. Which ones are those? The statement that type information is necessary for optimization is surprising to me, as the OCaml compiler (if I understand it correctly) discards type information after typechecking and before code optimization.
  • Why is universe level information kept in LCNF types? Intuitively, it would make sense for information about universe levels to be only important to keep the system sound as a prover, and completely irrelevant for any computations.
  • In LCNF types, there seem to be two placeholders for erased types, lcAny and lcErased; I found a comment stating the former is for a dependent type which has been erased and the latter for an irrelevant type, but I'm not sure what that means or why distinguishing these cases is important later. Aren't all types computationally irrelevant?

Cameron Zwarich (Apr 30 2025 at 17:41):

The Rocq/Coq extraction pipeline, which goes to OCaml (more precisely, to an untyped lambda calculus IR used in the OCaml compiler), has full type erasure as one of the first steps. Why does the Lean compiler do things differently and keep type information? Is it because the OCaml runtime has uniform memory representation for values and Lean's does not?

Yes, we are in the interesting position of trying to erase most type information, while also preserving enough information to support an unboxed representation of primitive types. This means that while we are ultimately based on the same principle as Rocq's extraction pipeline, Lean is doing something slightly different (and a bit more subtle).
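
As a minimal sketch (the definitions are mine, purely for illustration): a function over a primitive type such as `UInt32` can keep its arguments unboxed, while a polymorphic function has to fall back to the uniform boxed object representation, which is essentially all that remains after Rocq-style full erasure.

```lean
-- `UInt32` values can be passed and returned unboxed, so the compiler has to
-- remember this type through the pipeline.
def addU (a b : UInt32) : UInt32 :=
  a + b

-- A polymorphic argument must use the uniform boxed object representation.
def firstOf {α : Type} (a _b : α) : α :=
  a
```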

Originally, I think there was an idea to preserve typeability before toMono (a strategy which has been very successful in GHC's Core representation, for example), but the difficulties of doing so in practice with dependent types led to this goal being dropped, and optimizations can produce partially ill-typed code. After the new compiler is enabled on master, I think we'll probably want to find a way to refactor all of this code to reflect the present reality and more precisely preserve the information we need.

I found a comment mentioning that keeping as much type information as possible in LCNF types enables more optimizations. Which ones are those? The statement that type information is necessary for optimization is surprising to me, as the OCaml compiler (if I understand it correctly) discards type information after typechecking and before code optimization.

Type classes are elaborated into otherwise normal uses of dependent records. The compiler uses type information to specialize and inline type classes. This is especially important for Lean because we implement effects using monads, and the performance difference between specialized monadic code and generic code can be gigantic.
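
As a rough illustration (the definitions below are made up, not taken from the compiler): the generic function goes through the `Monad`/`MonadState` dictionaries on every operation, whereas the concrete call site can be specialized into direct state updates once the compiler knows the monad.

```lean
-- Generic monadic code: without specialization, every state update goes
-- through the `Monad`/`MonadState` dictionaries.
def addAllM {m : Type → Type} [Monad m] [MonadState Nat m]
    (xs : List Nat) : m Unit :=
  xs.forM fun x => modify (· + x)

-- At this call site the monad is the concrete `StateT Nat Id`, so the
-- compiler can specialize `addAllM` and inline the state operations.
def addAll (xs : List Nat) : Nat :=
  let (_, total) := Id.run (StateT.run (addAllM xs) 0)
  total
```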

The compiler also uses type information to optimize case expressions, but arguably these optimizations could be done in a slightly more pessimistic form on an untyped representation by just tracking which constructors flow to which case expressions.
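
A made-up example of the case-expression side: after `tryDouble` is inlined into `doubleKnown`, the `match` scrutinizes `some x`, whose constructor is known, so the whole case expression and the `some` allocation can be eliminated.

```lean
@[inline] def tryDouble (x? : Option Nat) : Nat :=
  match x? with
  | some x => 2 * x
  | none   => 0

-- After inlining, this reduces to `2 * x`: case-of-known-constructor.
def doubleKnown (x : Nat) : Nat :=
  tryDouble (some x)
```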

Why is universe level information kept in LCNF types? Intuitively, it would make sense for information about universe levels to be only important to keep the system sound as a prover, and completely irrelevant for any computations.

The level information is only preserved so that we can reuse all of the standard functions for manipulating Exprs, which assume that universe levels are provided. They are only present to enforce logical consistency, which doesn't matter in the compiler. If we wanted, we could accept some mild duplication of code and have a completely separate representation of LCNF types that ignores universe levels entirely.
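
For example (a trivial sketch of my own), a universe-polymorphic identity function compiles to the same erased code regardless of how `u` is instantiated; the level genuinely carries no information the compiler uses.

```lean
-- `u` only matters for logical consistency; both uses below compile to the
-- same erased code.
def idU.{u} {α : Sort u} (a : α) : α := a

example : Nat  := idU 5
example : Type := idU Nat
```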

In LCNF types, there seem to be two placeholders for erased types, lcAny and lcErased; I found a comment stating the former is for a dependent type which has been erased and the latter for an irrelevant type, but I'm not sure what that means or why distinguishing these cases is important later. Aren't all types computationally irrelevant?

Perhaps the names or descriptions are bad (please feel free to suggest alternatives), but here goes:

  • lcErased reflects a term that is irrelevant (like the ■ in Rocq's erasure work)
  • lcAny reflects a term with a dependency that has been erased, and also other cases where we can no longer precisely represent the type (like ⊤ in a lattice). In practical terms, this means we know that it will get compiled to an object type rather than a primitive.
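
A minimal sketch of the two situations (the names are mine, not taken from that test file):

```lean
-- `h` is a proof, i.e. computationally irrelevant: in LCNF its type becomes
-- `lcErased`.
def nthOfProof (xs : List Nat) (i : Nat) (h : i < xs.length) : Nat :=
  xs[i]'h

-- The result type of `mkPayload` depends on the runtime value `b`; once that
-- dependency is erased there is no precise LCNF type left, so it becomes
-- `lcAny`, i.e. the compiler only knows the value is a boxed object.
def Payload : Bool → Type
  | true  => Nat
  | false => String

def mkPayload : (b : Bool) → Payload b
  | true  => (42 : Nat)
  | false => "hi"
```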

The test cases in tests/lean/run/lcnfErasure.lean have some good examples of the behavior of `toMonoType`.


Last updated: May 02 2025 at 03:31 UTC