Zulip Chat Archive
Stream: mathlib4
Topic: Benchmark instability
Jireh Loreaux (Jul 13 2025 at 18:18):
Recently, I've witnessed on several occasions higher-than-normal instability in benchmarking runs. The fluctuations generally don't register as "significant" according to our metrics, but they can approach 100B instructions over all of Mathlib, with effects on files that should not (or could not) have been affected by the changes in a given PR.
Do we have any idea what's causing this lack of reproducibility?
Jireh Loreaux (Jul 18 2025 at 02:29):
@Sebastian Ullrich do you have any idea / have you noticed this?
Sebastian Ullrich (Jul 18 2025 at 08:10):
It's on my mental list but I haven't found time to dive into it so far. We're planning to overhaul and refine our whole benchmarking story starting in the coming months.
Sebastian Ullrich (Jul 18 2025 at 08:11):
Is the most plausible assumption that this started with the toolchain bump?
Eric Wieser (Jul 25 2025 at 08:31):
What do we think the current noise threshold is? I've seen a bunch of PR reviews where the sentiment is either "great, this is faster" or "we should look into why this is slower", and almost all of the results seem to be in unrelated files, suggesting there is very little signal.
Eric Wieser (Jul 25 2025 at 08:32):
Should we update the bot to include a warning that the results are currently very noisy and link to this thread?
Sebastian Ullrich (Jul 25 2025 at 09:13):
We can increase the threshold from 10G to 20G. Alternatively, if someone wants to help with investigating the root cause, replicating the noise locally using `perf stat` would be the first step.
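(For anyone picking this up: a minimal sketch of that first step, assuming Linux `perf` is installed and counting user-space instructions. The target command below is a hypothetical single-file Lean build; substitute whatever command exhibits the noise, and make sure any caching is cleared between runs so the work is actually repeated.)

```python
#!/usr/bin/env python3
"""Sample the run-to-run spread of instruction counts via `perf stat`,
to check locally how reproducible a given build step is."""

import statistics
import subprocess

# Hypothetical target command; replace with the build step under test.
CMD = ["lake", "env", "lean", "Mathlib/Order/Basic.lean"]
RUNS = 10

counts = []
for _ in range(RUNS):
    # -x, emits machine-readable CSV on stderr; instructions:u counts
    # user-space instructions only, which is less scheduler-dependent.
    proc = subprocess.run(
        ["perf", "stat", "-x,", "-e", "instructions:u", "--"] + CMD,
        capture_output=True, text=True, check=True,
    )
    for line in proc.stderr.splitlines():
        if "instructions" in line:
            counts.append(int(line.split(",")[0]))
            break

mean = statistics.mean(counts)
spread = max(counts) - min(counts)
print(f"mean:   {mean:.3e} instructions")
print(f"spread: {spread:.3e} ({100 * spread / mean:.2f}% of mean)")
```

If the spread on an unchanged file is already a noticeable fraction of the mean, that would point to environment-level noise (ASLR, allocator behavior, CPU frequency scaling) rather than anything PR-specific.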
Last updated: Dec 20 2025 at 21:32 UTC