Zulip Chat Archive

Stream: lean4

Topic: Highly nondeterministic instruction count in radar?

Thomas Murrills (Dec 08 2025 at 17:04):

I'm doing some benching, and re-benched an earlier commit to see if the earlier bench was a fluke. It seems it was, but I'm not sure how it could be: there seem to be ~400G extra instructions in one run compared to another, and I would have thought instructions are mostly deterministic. However, the contents of the commits are exactly the same.

The benches in question are:

fast: benches 3085d730fc9d5f89d63b830f819add2371eef099
slow: benches 332ece0991aae90921658ce7fa682b1973554f4c
comparison of slow to fast

Note that git diff 3085d730fc9d5f89d63b830f819add2371eef099 332ece0991aae90921658ce7fa682b1973554f4c is empty. (This is on #32440 if anyone would like to pull the branch; it consists of a modification to a linter.)

This seems unusual to me. Is it?

If so, how might I find out what's causing the jump in instruction count between these two runs?

Henrik Böving (Dec 08 2025 at 18:47):

Instruction count is not deterministic, lean is a concurrent system with lots of reliances on caches, some of which are based on memory addresses etc. A small amount of variance is to be expected

Thomas Murrills (Dec 08 2025 at 18:48):

Is this small, though? It produces major and minor changes according to radar.

Henrik Böving (Dec 08 2025 at 18:49):

What is small and isn't highly depends on the workload. The major and minor stuff in radar has to be hand tuned to account for that sometimes.

Thomas Murrills (Dec 08 2025 at 18:50):

By workload, you mean e.g. the other runs currently being performed by the benching machine?

Henrik Böving (Dec 08 2025 at 18:51):

What program you are benchmarking. The runners only benchmark one thing at a time

Thomas Murrills (Dec 08 2025 at 18:52):

But the commits are exactly the same; shouldn't it be the same program?

Thomas Murrills (Dec 08 2025 at 18:53):

Oh wait; I suppose you mean building mathlib vs. something else, and that radar might not yet be fine-tuned to mathlib?

Junyan Xu (Dec 08 2025 at 19:12):

I also observed extra ~400G instructions on two of my PRs following radar's recent recovery from corrupted mathlib clone.

Henrik Böving (Dec 08 2025 at 19:36):

Thomas Murrills said:

But the commits are exactly the same; shouldn't it be the same program?

As I have explained running lean inherently has a variance of instructions due to things like concurrency and pointer based caches. Running lean twice on the same hardware without anything else running can thus still have variance. Perfectly deterministic execution is basically impossible. If you want that you are approximately 50-60 years too late :P

Thomas Murrills (Dec 08 2025 at 19:42):

Makes sense—sorry, I think that confusion came from me misinterpreting what you meant by "depends on the workload" etc. initially... :upside_down:

Thomas Murrills (Dec 08 2025 at 19:43):

(deleted)

Joscha Mennicken (Dec 08 2025 at 19:46):

Hmm, I was aware that open-mathlib//instructions was a bit more volatile but build//instructions apparently is also more noisy than I expected. The linked bench results don't show any specific culprit, just more red than green for the individual modules (but within expected limits for each module).

I've turned down the sensitivity on both of these metrics for now. We're also expecting faster benchmark runner hardware with more cores soon, which may behave differently yet again.

Thomas Murrills (Dec 08 2025 at 20:06):

Thanks! :)

Thomas Murrills (Dec 08 2025 at 20:07):

I don't know if this is possible, but it strikes me that maybe this is even a bit of an xy problem: I'm trying to judge the mathlib-wide performance of my linter modification, but the difference I care about gets bundled in with the performance of everything else, and therefore can get covered by noise. This seems common to many meta code modifications.

One argument is that if it's covered by noise, don't worry about it, but sometimes it's hard to tell—especially if you're going to gradually expand its scope.

I wonder if it would be possible to have some sort of meta combinator I could include in a temporary commit that singled out the performance of the linter in radar—e.g. if I set profiler to true and used profileit(M) within my linter (which is a trick I've used locally, but is a bit too uninformative on single files). No idea how big of an ask this is, though, so feel free to ignore this feature request. :)

Kevin Buzzard (Dec 08 2025 at 21:36):

I was extremely surprised by this radar output for a change which looked to me essentially like a no-op: https://github.com/leanprover-community/mathlib4/pull/31040#issuecomment-3621974001 . build//instructions + 13.8T (+8.2%).

Matthew Ballard (Dec 08 2025 at 21:40):

Hmm. I should have paid more attention to your previous posts.

Kevin Buzzard (Dec 08 2025 at 21:44):

I should add that I don't know if post-merge bench's work with radar, but I've been fighting off deadlines for references today so haven't taken the time to check out the before/after commits and compare yet.

Matthew Ballard (Dec 08 2025 at 21:45):

We traded 910 for 1000 priority right?

Matthew Ballard (Dec 08 2025 at 21:46):

Oh maybe not

Eric Wieser (Dec 08 2025 at 21:47):

Henrik Böving said:

Instruction count is not deterministic, lean is a concurrent system with lots of reliances on caches, some of which are based on memory addresses etc. A small amount of variance is to be expected

Would doing benchmarking builds with Elab.async set to false help with this?

Henrik Böving (Dec 08 2025 at 22:40):

To a degree sure. But after all you want to benchmark the actual experience of users building the software. There could be meta programs around that hinder parallelism for example

Joscha Mennicken (Dec 09 2025 at 13:34):

Kevin Buzzard said:

I should add that I don't know if post-merge bench's work with radar, but I've been fighting off deadlines for references today so haven't taken the time to check out the before/after commits and compare yet.

If you !bench a PR after a bors merge, the result should be identical to pre bors merge since bors doesn't affect the PR and its commits, but instead creates a new commit on master.

Kevin Buzzard (Dec 10 2025 at 23:45):

Are these changes real? Both PRs seem quite innocuous.

#rss > Significant commits to mathlib4 @ 💬 says `open-mathlib//instructions: +437.4M (+0.6%) and then the next post #rss > Significant commits to mathlib4 @ 💬 goes down again.

Floris van Doorn (Dec 11 2025 at 13:05):

That really looks like it's noise. It's also only +/- 0.6%

Last updated: Feb 28 2026 at 14:05 UTC