Zulip Chat Archive
Stream: lean4
Topic: (nevermind)
cmlsharp (Dec 31 2025 at 23:44):
Currently it seems that, in order to write maximally efficient operations on arrays, one is essentially forced to write a reference implementation plus an "efficient" version that uses USize for indexing (moreover, the two must be tied together by something unsafe such as @[implemented_by], since correctness here relies on something unprovable). This is obviously somewhat annoying: you have two roughly identical functions, and proving them equivalent is awkward as it requires an unsafe assumption. But it also makes a very significant performance difference in my own benchmarking (and performance is usually the reason you're using Arrays in the first place).
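To make the pattern concrete, here is a minimal sketch of what I mean (names are hypothetical; this assumes core's Array.usize, Array.uget, and the lcProof axiom are available for the unsafe bounds proof):

```lean
-- "Efficient" variant: USize indexing, bounds proof discharged by
-- `lcProof`, which is why this has to be `unsafe`.
unsafe def sumImpl (a : Array Nat) : Nat :=
  go 0 0
where
  go (i : USize) (acc : Nat) : Nat :=
    if i < a.usize then
      go (i + 1) (acc + a.uget i lcProof)
    else
      acc

-- Reference implementation with ordinary `Nat` indexing, tied to the
-- unsafe version via `@[implemented_by]`.
@[implemented_by sumImpl]
def sum (a : Array Nat) : Nat :=
  a.foldl (· + ·) 0
```

Everything downstream reasons about `sum`, while the compiled code runs `sumImpl`; nothing checks that the two actually agree.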
One might hope that you could do the annoying thing just once (say, in the implementation of fold or ForIn) and then have other functions on arrays build on those. While this is usually faster than an approach that uses Nat indexing, it's still usually meaningfully slower than the unsafe manual recursive version. (I think) this is because if you have compound state carried over between iterations, you end up paying the cost of allocating a tuple that's immediately unpacked in the next iteration.
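A hypothetical illustration of the compound-state point (function names are my own):

```lean
-- Compound state threaded through `foldl` as a pair: each iteration
-- builds a fresh `(sum, count)` pair that the next one immediately
-- destructures.
def sumAndCount (a : Array Nat) : Nat × Nat :=
  a.foldl (fun (s, c) x => (s + x, c + 1)) (0, 0)

-- Hand-written recursion keeps the two state components as separate
-- arguments, so no pair needs to be built inside the loop.
def sumAndCount' (a : Array Nat) : Nat × Nat :=
  go 0 0 0
where
  go (i s c : Nat) : Nat × Nat :=
    if h : i < a.size then
      go (i + 1) (s + a[i]) (c + 1)
    else
      (s, c)
  termination_by a.size - i
```

Whether the first version actually allocates per iteration depends on what the compiler manages to optimize, but this is the shape of the code I was benchmarking.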
I'm curious whether this is a problem others have noticed. Obviously some kind of general unboxing optimization would solve this, but short of that, would it be possible to automate any part of it? For instance, if one used Fin for indexing into an Array, could the compiler tell that the index has to fit in a USize (at least in principle)?
Henrik Böving (Jan 01 2026 at 01:07):
First of all, these USize optimizations are very low-yield micro-optimizations and don't matter in 99% of the use cases that we have. The only additional thing that happens if you don't use the USize version (as opposed to a properly optimized Nat version) is that you pay a few extra shift instructions across an array loop, which are basically free (the relevant instructions on a modern x86 CPU have a latency of 1 cycle and a throughput of 0.5 cycles). Compared to all the other overhead that we experience, this is barely noticeable in practice; a single allocation in your loop is going to cost you more than that overhead combined. You will of course be able to observe it in microbenchmarks that have purely arithmetic workloads.
Somewhat amusingly using Nat instead of USize can apparently even improve performance sometimes as you can see here :upside_down: not quite sure why that is.
That being said, one thing that does matter is that we still emit reference-counting instructions for these kinds of values most of the time (this is where the "properly optimized" part comes into play). This is relevant because it fills up both our icache and branch predictor with code that, most of the time, merely determines that we do not have to ref-count anyway. A single lean_dec already emits:
test dil, 1
jne .LBB0_4
mov eax, dword ptr [rdi]
cmp eax, 2
jl .LBB0_3
dec eax
mov dword ptr [rdi], eax
ret
.LBB0_3:
test eax, eax
jne lean_dec_ref_cold@PLT
.LBB0_4:
ret
out of which only the first two instructions actually matter. I do have a dataflow analysis in mind that's going to detect situations such as looping over an array and insert the necessary information to optimize away the ref counts. We might also be able to use this information to shave one or two instructions off arithmetic operations, I guess. Unfortunately that dataflow analysis is going to have to wait until we do some further refactoring work on the compiler in the upcoming quarter. I did already put some infrastructure in place to get it started though: lean4#11530 and lean4#11549. One could take this a step further and use the same information to turn these values into USize as well, though that feels a little brittle to me. Consider a situation where our loop index is passed to another, tertiary function in our loop body. We can only detect that the USize transformation is legal if either:
- the function has a USize-y version available as well, or
- the function gets inlined to the point where we can detect that it is USize-y in our particular caller context.
Given that very few functions will fall into category 1 but quite a few might fall into category 2, we could suddenly generate completely different code based on the inlining choice for this one function; that seems a bit dangerous to me.
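A hypothetical sketch of the situation I mean (both functions are made up for illustration):

```lean
-- Some third function that the loop body calls with the index.
def step (i acc : Nat) : Nat :=
  acc + i

def sumIndices (a : Array Nat) : Nat := Id.run do
  let mut acc := 0
  for i in [0:a.size] do
    -- `i` escapes into `step` here. Without a USize version of `step`,
    -- or without inlining it into this caller, the compiler cannot
    -- tell that rewriting `i` to USize would be legal.
    acc := step i acc
  return acc
```

Whether `step` ends up inlined is exactly the choice that would flip which code we generate.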
(I think) this is because if you have compound state carried over between iterations, you end up paying the cost of allocating a tuple that’s immediately unpacked in the next iteration.
Could you expand on this? I don't see how it is relevant to the USize optimization.
cmlsharp (Jan 01 2026 at 02:57):
Thank you for the reply! Actually upon going back and re-checking, I'm realizing that there was a flaw in my original benchmarking code that was leading me to significantly overestimate the effects of USize indexing which makes my entire question mostly irrelevant. :upside_down:
I still don't fully understand why it impacted the figures I saw as much as it did, but I will investigate further.
Notification Bot (Jan 01 2026 at 02:58):
This topic was moved here from #lean4 > Efficient array indexing by cmlsharp.
Last updated: Feb 28 2026 at 14:05 UTC