Zulip Chat Archive
Stream: lean4
Topic: Vector.cast preventing tail call optimization
cmlsharp (Jan 29 2026 at 02:55):
Consider the following two functions
def loop1 (n : Nat) (acc : Vector Nat k) : IO (Vector Nat (k + n)) :=
if h : n = 0 then
pure (h ▸ acc)
else
have h1 : k + n = (k + 1) + (n - 1) := by omega
h1 ▸ loop1 (n - 1) (acc.push 1)
def loop2 (n : Nat) (acc : Vector Nat k) : IO (Vector Nat (k + n)) :=
if h : n = 0 then
pure (acc.cast (by omega))
else
(Vector.cast (by omega)) <$> loop2 (n - 1) (acc.push 1)
For arbitrary monads, I understand why 'loop2' could not be tail recursive in general. However, for typical lawful monads, with sufficient inlining I would have expected 'loop2' to be made tail recursive. Indeed, if IO is changed to Id, the compiler produces equivalent tail-call-optimized code for both loop1 and loop2. With IO, however, the compiler produces the following C code:
LEAN_EXPORT lean_object* l_loop1(lean_object* x_1, lean_object* x_2, lean_object* x_3) {
_start: {
lean_object* x_5;
uint8_t x_6;
x_5 = lean_unsigned_to_nat(0u);
x_6 = lean_nat_dec_eq(x_2, x_5);
if (x_6 == 0) {
lean_object* x_7;
lean_object* x_8;
lean_object* x_9;
lean_object* x_10;
x_7 = lean_unsigned_to_nat(1u);
x_8 = lean_nat_add(x_1, x_7);
lean_dec(x_1);
x_9 = lean_nat_sub(x_2, x_7);
lean_dec(x_2);
x_10 = lean_array_push(x_3, x_7);
x_1 = x_8;
x_2 = x_9;
x_3 = x_10;
goto _start;
} else {
lean_object* x_12;
lean_dec(x_2);
lean_dec(x_1);
x_12 = lean_alloc_ctor(0, 1, 0);
lean_ctor_set(x_12, 0, x_3);
return x_12;
}
}}
LEAN_EXPORT lean_object* l_loop2(lean_object* x_1, lean_object* x_2, lean_object* x_3) {
_start: {
lean_object* x_5;
uint8_t x_6;
x_5 = lean_unsigned_to_nat(0u);
x_6 = lean_nat_dec_eq(x_2, x_5);
if (x_6 == 0) {
lean_object* x_7;
lean_object* x_8;
lean_object* x_9;
lean_object* x_10;
lean_object* x_11;
uint8_t x_12;
x_7 = lean_unsigned_to_nat(1u);
x_8 = lean_nat_add(x_1, x_7);
x_9 = lean_nat_sub(x_2, x_7);
x_10 = lean_array_push(x_3, x_7);
x_11 = l_loop2(x_8, x_9, x_10);
lean_dec(x_9);
lean_dec(x_8);
x_12 = !lean_is_exclusive(x_11);
if (x_12 == 0) {
return x_11;
} else {
lean_object* x_13;
lean_object* x_14;
x_13 = lean_ctor_get(x_11, 0);
lean_inc(x_13);
lean_dec(x_11);
x_14 = lean_alloc_ctor(0, 1, 0);
lean_ctor_set(x_14, 0, x_13);
return x_14;
}
} else {
lean_object* x_15;
x_15 = lean_alloc_ctor(0, 1, 0);
lean_ctor_set(x_15, 0, x_3);
return x_15;
}
}}
Is this expected behavior?
cmlsharp (Jan 29 2026 at 03:17):
(Also, this is not IO specific. This seems to happen for any monad except Id.)
Henrik Böving (Jan 29 2026 at 08:21):
Given the kinds of optimizations that we have in the compiler, it is at least expected to me that this happens, and I do not have a pass in mind that would capture this. Btw, it does of course work for the reader monad as well. I can think of a pass that would detect this kind of behavior for IO and many IO-like monads, but for some monads this is always going to be something the compiler cannot tell.
Robin Arnez (Jan 29 2026 at 09:43):
Isn't this basically lean4#7374?
Jakub Nowak (Jan 29 2026 at 09:48):
Btw, you can fix that locally:
@[inline]
def cast_impl (h : n = m) (xs : Vector α n) : Vector α m :=
h ▸ xs
@[csimp]
theorem foo : @Vector.cast = @cast_impl := by
unfold Vector.cast cast_impl
funext _ _ _ h
subst h
simp
def loop1 (n : Nat) (acc : Vector Nat k) : IO (Vector Nat (k + n)) :=
if h : n = 0 then
pure (h ▸ acc)
else
have h1 : k + n = (k + 1) + (n - 1) := by omega
h1 ▸ loop1 (n - 1) (acc.push 1)
def loop2 (n : Nat) (acc : Vector Nat k) : IO (Vector Nat (k + n)) :=
if h : n = 0 then
pure (acc.cast (by omega))
else
(Vector.cast (by omega)) <$> loop2 (n - 1) (acc.push 1)
Robin Arnez (Jan 29 2026 at 09:52):
You can do csimp... but the compiler should be able to handle this itself
Jakub Nowak (Jan 29 2026 at 09:55):
I agree. Though, is it a problem with bind? Shouldn't the compiler be able to realize that Vector.cast is just id at the runtime representation level?
Robin Arnez (Jan 29 2026 at 10:04):
Hmm, why does this not simp more?
[] compiler phase: mono, pass: simp
[simp] size: 15
def loop2 k n acc a.1 : EST.Out IO.Error lcAny (Array Nat) :=
let _x.2 := 0;
let _x.3 := Nat.decEq n _x.2;
cases _x.3 : EST.Out IO.Error lcAny (Array Nat)
| Bool.false =>
let _x.4 := 1;
let _x.5 := Nat.add k _x.4;
let _x.6 := Nat.sub n _x.4;
let _x.7 := Array.push ◾ acc _x.4;
let _x.8 := @loop2 _x.5 _x.6 _x.7 a.1;
cases _x.8 : EST.Out IO.Error lcAny (Array Nat)
| EST.Out.ok a.9 a.10 =>
let _x.11 := @EST.Out.ok ◾ ◾ ◾ a.9 a.10;
return _x.11
| EST.Out.error a.12 a.13 =>
return _x.8
| Bool.true =>
let _x.14 := @EST.Out.ok ◾ ◾ ◾ acc a.1;
return _x.14
_x.11 should be simped to _x.8 and then the match should be made redundant
cmlsharp (Jan 29 2026 at 10:20):
Yeah iirc (not at my laptop atm) the compiler is capable of performing the tail call optimization on e.g. (fun x => h ▸ x) <$> loop1 …
Both this and Vector.cast (once it is inlined) are just the identity function, so I don’t understand why they lead to different outcomes.
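For concreteness, a sketch of that variant (`loop3` is a hypothetical name; the equality is stated in the direction that makes the mapped cast land on `Vector Nat (k + n)`):

```lean
-- Sketch (untested): same recursion as loop2, but mapping `h1 ▸ ·`
-- directly instead of going through `Vector.cast`. Per the discussion
-- above, this is the form the compiler reportedly recognizes as the
-- identity and turns into a tail call.
def loop3 (n : Nat) (acc : Vector Nat k) : IO (Vector Nat (k + n)) :=
  if h : n = 0 then
    pure (h ▸ acc)
  else
    have h1 : (k + 1) + (n - 1) = k + n := by omega
    (fun x => h1 ▸ x) <$> loop3 (n - 1) (acc.push 1)
```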
Jakub Nowak (Jan 29 2026 at 10:23):
cmlsharp said:
Yeah iirc (not at my laptop atm) the compiler is capable of performing the tail call optimization on e.g.
(fun x => h ▸ x) <$> loop1 …
Yes, that's why it works if you use csimp to tell the compiler to use h ▸ x instead of destructuring and constructing back, which is what Vector.cast does. From what @Robin Arnez posted, as I understand it, it looks like the problem is that the compiler didn't realize that Vector.cast is the identity.
Henrik Böving (Jan 29 2026 at 10:23):
Robin Arnez said:
Hmm, why does this not simp more? [...]
_x.11 should be simped to _x.8 and then the match should be made redundant
The simplifier does not have this machinery at the moment
Henrik Böving (Jan 29 2026 at 10:24):
At least not explicitly; it could happen as a sequence of let value optimization, then default case optimization, and then trivial case erasure. However, the compiler is careful about erasing cases at the LCNF stage because doing so can lead to less optimal reuse behavior further down the line. In general, removing any case statement at the LCNF level can be dangerous for FBIP. So it should happen at the IR level after reuse, though this of course makes it much more finicky to figure out.
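(To illustrate how cases and reuse interact: the reset/reuse pass looks for a cases whose scrutinee's memory cell can be handed to a constructor application in the branch. A minimal sketch, not the actual compiler pass:)

```lean
-- Sketch: when the input list is uniquely referenced at runtime, the
-- compiler's reset/reuse insertion can rewrite the `::` allocation in
-- the second arm to update the matched cons cell in place. The `cases`
-- on the list (followed by its field projections) is exactly the shape
-- the reuse heuristic keys on, which is why erasing or merging `cases`
-- too early is risky.
def mapInc : List Nat → List Nat
  | []      => []
  | x :: xs => (x + 1) :: mapInc xs
```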
Henrik Böving (Jan 29 2026 at 10:35):
Having written passes that actually pessimize build times significantly because they drop cases, the fact that doing cast <$> works at the LCNF level is more worrying to me than the fact that Vector.cast <$> doesn't work, really :upside_down:
Henrik Böving (Jan 29 2026 at 10:37):
See for example https://github.com/leanprover/lean4/pull/11078
Jakub Nowak (Jan 29 2026 at 10:41):
If I understood correctly, you're saying that the compiler is already able to simplify
cases _x.8 : EST.Out IO.Error lcAny (Array Nat)
| EST.Out.ok a.9 a.10 =>
let _x.11 := x_8;
return _x.11
| EST.Out.error a.12 a.13 =>
return _x.8
to
return _x.8
?
So it seems like the missing kind of optimization that we would wish to have is replacing the "constructed-back" expression inside cases statement with the variable being matched.
Henrik Böving (Jan 29 2026 at 10:42):
Yes but as I have explained this is not something we want to happen this early in the pipeline, it needs to happen after insertResetReuse. Almost any touching of cases before doing this can have pessimization effects.
Henrik Böving (Jan 29 2026 at 10:43):
The PR above only tries to detect more situations where you case on the same variable twice and remove the second case and that was already an FBIP pessimization, one needs to be really careful with any sort of case touching.
Jakub Nowak (Jan 29 2026 at 10:46):
Hm, that makes me wonder whether removing the trivial case erasure might have a positive effect in some cases? It might be worth investigating why removal of cases causes FBIP pessimization.
Henrik Böving (Jan 29 2026 at 10:52):
Hm, that makes me wonder whether removing the trivial case erasure might have a positive effect in some cases?
Almost certainly yes; there are also several tensions here. For example, we have the default case optimization, which collects the most common case in the simplifier already (it happens later in IR as well), and this might of course obscure which constructors are in flight in certain branches, which can in turn inhibit reuse (and potentially other optimizations if they don't work hard enough to analyze the situation). On the other hand, removing this optimization would mean that you might repeatedly inspect the same (potentially non-trivially sized) code arm over and over again, so it is still beneficial.
It might be worth investigating why removal of cases causes FBIP pessimization.
Because the heuristic is very sensitive to a case statement followed by several projections being present in a certain way in order to even consider reuse (for performance reasons). If you inspect the IR pipeline you will see that it only ever considers removing dead projections and certain trivial cases after inserting reuse:
if compiler.reuse.get (← getOptions) then
decls := decls.map (Decl.insertResetReuse (← getEnv))
logDecls `reset_reuse decls
decls := decls.map Decl.elimDead
logDecls `elim_dead decls
decls := decls.map Decl.simpCase
simpCase here does:
/-- Simplify `case`
- Remove unreachable branches.
- Remove `case` if there is only one branch.
- Merge most common branches using `Alt.default`. -/
doing any of these before inserting reuse instructions can and does cause problems.
Henrik Böving (Jan 29 2026 at 10:56):
It's indeed quite an unfortunate situation that LCNF's pure pipeline has to consider the optimizations of the impure pipeline instead of just optimizing for its simple cost model of "do fewer branches if possible", but that's just how it is. If this optimization should happen, it should be done by simpCase in IR. I might get to that this quarter as I migrate the legacy IR optimizations to LCNF.
Jakub Nowak (Jan 29 2026 at 11:21):
I don't want to pester you further, but could you point me to where I could read about reuse/FBIP to understand what it does?
Henrik Böving (Jan 29 2026 at 12:14):
The original paper is here https://arxiv.org/pdf/1908.05647 though the implementation in Lean does deviate somewhat noticeably from it by now.
cmlsharp (Jan 29 2026 at 15:11):
In this case, would it make sense to upstream the above csimp lemma? Unless it is indeed a mistake that cast <$> works.
Henrik Böving (Jan 29 2026 at 15:15):
I don't see why this would need to be done as a csimp lemma; this could also just be the implementation of Vector.cast.
In any case, I would recommend not relying on the compiler to figure out a tail call for you for anything but a direct h ▸ f args tail application. This is the only one where the compiler absolutely guarantees that it becomes an id; everything else is subject to heuristics etc. that can blow up at any moment.
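In other words, the guaranteed shape is the one loop1 already uses: the recursive call sits syntactically in tail position directly under ▸. A minimal pure sketch (`fillOnes` is a hypothetical name):

```lean
-- Sketch: the `h1 ▸ …` wraps the tail call directly, so the cast is
-- guaranteed to erase at compile time and the function compiles to a
-- loop, with no map over the result needed.
def fillOnes (n : Nat) (acc : Vector Nat k) : Vector Nat (k + n) :=
  if h : n = 0 then
    h ▸ acc
  else
    have h1 : k + n = (k + 1) + (n - 1) := by omega
    h1 ▸ fillOnes (n - 1) (acc.push 1)
```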
cmlsharp (Jan 29 2026 at 15:23):
Oh yeah, good point. This does make me wonder if something like OCaml's @[tailcall] attribute (which just causes a warning if a call is not actually tail call optimized) would be useful.
Henrik Böving (Jan 29 2026 at 15:26):
Yes, a @[tailcall] attribute can be useful. However, it would have to have a more creative meaning than in OCaml. For example, Lean produces lots of specializations for most of the code it works on. Should @[tailcall] also ensure these specializations are tail recursive? The semantics are unfortunately not so clear; some real thought has to be put into things like this.
cmlsharp (Jan 29 2026 at 15:55):
Yeah, I was wondering about specializations also.
I think Scala also has such an attribute, and they do have statically dispatched generics. I should look at what they do.
cmlsharp (Jan 29 2026 at 16:10):
I suspect what happens is that their tailrec attribute is checked prior to specialization
Henrik Böving (Jan 29 2026 at 17:11):
cmlsharp said:
I suspect what happens is that their tailrec attribute is checked prior to specialization
If you do that, your cast_impl <$> would not be recognized as a tail call btw
cmlsharp (Jan 29 2026 at 19:39):
Yeah, I understand that. The other obvious alternative would of course be to check after the specialization has been dispatched, but that would presumably lead to unintuitive error messages at usage sites. You're right that it's a difficult problem! (A middle ground could be something like @[tailcall(IO, Id, Except)] or something, where you check against a closed list.)
Last updated: Feb 28 2026 at 14:05 UTC