Zulip Chat Archive

Stream: general

Topic: Preview of new build time benchmark


Jannis Limperg (Oct 30 2020 at 16:08):

At Rob's request, I'm setting up a new build time benchmarking server that'll hopefully be a little more consistent than Scott's bot. You can see its measurements at mathlib-bench.limperg.de.

At the moment, the thing is very bare-bones. If you have any additional features you think would be useful, that's what this thread is for! In particular, Scott's bot generates a lot of profiling output; if you need that, I can probably set it up.

One thing I'll certainly improve: the server doesn't benchmark every commit, so the time difference is not "time difference caused by this commit" but rather "time difference caused by all commits since the last commit we benchmarked". This needs to be clearer in the UI.

Rob Lewis (Oct 30 2020 at 16:11):

This is awesome, thanks Jannis!

Rob Lewis (Oct 30 2020 at 16:11):

Saving the profile output would be great. I think it's a useful feature of Scott's script.

Kevin Lacker (Oct 30 2020 at 16:12):

There are three of those lines that are colored red or green; does that indicate that you think the build was broken on the red ones and fixed on the green ones?

Rob Lewis (Oct 30 2020 at 16:12):

We should also make it announce results to Zulip. (Maybe time to rename the #travis stream, heh.)

Kevin Lacker (Oct 30 2020 at 16:12):

or is +/- 8% within typical expected noise

Rob Lewis (Oct 30 2020 at 16:13):

When you link to GitHub diffs, using three ... instead of two .. gives a nicer output for this kind of thing. I discovered this when I was setting up CI scripts. Compare https://github.com/leanprover-community/mathlib/compare/581b2af3eb36e4eb460cce3e3d7842e32e7a3349...58f8817e6aff0abdf7bc294ef72c3e441db151ab (three) and https://github.com/leanprover-community/mathlib/compare/581b2af3eb36e4eb460cce3e3d7842e32e7a3349..58f8817e6aff0abdf7bc294ef72c3e441db151ab (two)

Rob Lewis (Oct 30 2020 at 16:16):

It would be nice to display the date/time the build started, possibly the date/time of the commit.

Rob Lewis (Oct 30 2020 at 16:18):

Are these builds running single threaded? Those are pretty fast times. What kind of VM are you using?

Rob Lewis (Oct 30 2020 at 16:18):

I guess single threaded isn't really important as long as the times are consistent.

Rob Lewis (Oct 30 2020 at 16:21):

Your experimenting has cost $65 of our Azure credits. A lot in comparison to the $100 we've spent so far on olean serving, but a small fraction of the bucket we need to use up before June :p

Rob Lewis (Oct 30 2020 at 16:28):

An advanced feature: it would be really great if we could trigger a benchmark build from a PR with a comment on GitHub. The GitHub side is easy enough to do with an action. It would just need a hook on the server side and a way to display the result differently from a master build in the UI.
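
For the server side, a rough sketch of what such a hook could look like, assuming GitHub is set up to deliver issue_comment webhooks and that a comment containing something like `!bench` should queue a run; the endpoint, port, trigger phrase, and `queue_benchmark` helper are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

TRIGGER = "!bench"  # hypothetical trigger phrase for a PR comment

def queue_benchmark(pr_number):
    # Placeholder: enqueue a benchmark of the PR's head commit and mark
    # the result as a PR build rather than a master build in the UI.
    print(f"queued benchmark for PR #{pr_number}")

class Hook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        comment = event.get("comment", {}).get("body", "")
        issue = event.get("issue", {})
        # GitHub's issue_comment payload carries a "pull_request" key inside
        # "issue" only when the comment was made on a pull request.
        if TRIGGER in comment and "pull_request" in issue:
            queue_benchmark(issue.get("number"))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Hook).serve_forever()
```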

Kevin Lacker (Oct 30 2020 at 16:30):

A couple of minor suggestions. First, I suggest measuring the time change not from the previous build but from the average of the previous 10 builds. Right now the chart shows three "outlier"-colored listings, when really what happened was that a couple of consecutive runs were a bit off in different directions; none of those are really time outliers.
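
A minimal sketch of that rolling-baseline idea, assuming build times are stored as a chronological list of (commit, seconds) pairs; the function name, window size, and data shape are illustrative rather than the bot's actual code.

```python
from statistics import mean

def relative_changes(history, window=10):
    """history: chronological list of (commit_sha, seconds).

    Returns (commit_sha, change) pairs, where change is the relative
    difference against the mean of the previous `window` builds, or
    None when there is no earlier build to compare against."""
    out = []
    for i, (sha, secs) in enumerate(history):
        baseline = [s for _, s in history[max(0, i - window):i]]
        if baseline:
            out.append((sha, (secs - mean(baseline)) / mean(baseline)))
        else:
            out.append((sha, None))
    return out
```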

Kevin Lacker (Oct 30 2020 at 16:32):

Second, I suggest putting a time on each row for when the snapshot of the repo was taken; it's easier to reason about "oh, the problem started yesterday morning, hmm" than "oh, the problem started at diff 62b3".

Gabriel Ebner (Oct 30 2020 at 16:35):

The HTML report is gorgeous!
I'd really like to see more detailed profiling output, that is, runtime per file. (I believe Scott had that.) Ideally there'd be a nice table where you could see how the runtime of each file changes between mathlib commits.

Gabriel Ebner (Oct 30 2020 at 16:36):

Something like this: https://leanprover.zulipchat.com/#narrow/stream/113538-travis/topic/build.20time.20bot/near/203363514

Gabriel Ebner (Oct 30 2020 at 16:38):

This is the script Bryan cooked up (requires precompiled oleans): https://leanprover.zulipchat.com/#narrow/stream/113538-travis/topic/build.20time.20bot/near/199713332
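
That script isn't reproduced here, but the idea behind per-file timings can be sketched roughly as follows: with all dependency .olean files already built, each file is rebuilt on its own and timed. The `lean --make` invocation and the `src` layout are assumptions based on the Lean 3 / mathlib setup.

```python
import subprocess
import time
from pathlib import Path

def time_per_file(src_dir="src"):
    """Assumes a checkout where `lean --make <file>` rebuilds a single
    file against precompiled .olean dependencies (Lean 3)."""
    timings = {}
    for f in sorted(Path(src_dir).rglob("*.lean")):
        olean = f.with_suffix(".olean")
        if olean.exists():
            olean.unlink()  # force only this file to be rebuilt
        start = time.perf_counter()
        subprocess.run(["lean", "--make", str(f)], check=True)
        timings[str(f)] = time.perf_counter() - start
    return timings
```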

Gabriel Ebner (Oct 30 2020 at 16:41):

If you add --profile to the Lean command-line, we could also get some reports on tactic runtime: https://leanprover.zulipchat.com/#narrow/stream/113538-travis/topic/Analyzing.20build.20logs/near/197523478

Jannis Limperg (Oct 30 2020 at 16:56):

Lots of good suggestions, thanks!

When you link to GitHub diffs, using three ... instead of two .. gives a nicer output for this kind of thing.

Heh, wouldn't have guessed that. Thanks!

It would be nice to display the date/time the build started, possibly the date/time of the commit.

This is also easy.

Are these builds running single threaded? Those are pretty fast times. What kind of VM are you using?

These are four threads on E4as_v4 VMs. In my tests, these were the VMs that introduced the least variance. Four vs one thread didn't seem to make a huge difference in terms of relative variance, but that experiment is still running so maybe I'll have to go back to one thread when the results are in.

An advanced feature: it would be really great if we could trigger a benchmark build from a PR with a comment on GitHub. The GitHub side is easy enough to do with an action. It would just need a hook on the server side and a way to display the result differently from a master build in the UI.

Can do, but this requires some architectural changes, so it'll take a bit.

First, I suggest measuring the time change not from the previous build but from the average of the previous 10 builds. Right now the chart shows three "outlier"-colored listings, when really what happened was that a couple of consecutive runs were a bit off in different directions; none of those are really time outliers.

Good idea, will do.

I'd really like to see more detailed profiling output, that is, runtime per file. (I believe Scott had that.) Ideally there'd be a nice table where you could see how the runtime of each file changes between mathlib commits.

Will do; it seems like this feature is in high demand.

or is +/- 8% within typical expected noise

Currently, changes of more than +/- 3% (actually 2.65%, for reasons) are marked red (worse) or green (better); a rough sketch of this rule follows after this message. It looks like the VM/Lean/OS introduces about this amount of error.

We should also make it announce results to Zulip. (Maybe time to rename the #travis stream, heh.)

If people want this, I can certainly do it. Do you actually look at #travis though? I always found it too spammy to be subscribed. Maybe just announce outliers?
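
Concretely, the red/green rule mentioned above amounts to something like the following; only the 2.65% figure comes from the thread, the rest is illustrative.

```python
THRESHOLD = 0.0265  # relative change beyond which a row is coloured

def colour(current_secs, previous_secs):
    """Classify a build time against the previous benchmarked build."""
    rel = (current_secs - previous_secs) / previous_secs
    if rel > THRESHOLD:
        return "red"    # slower than before
    if rel < -THRESHOLD:
        return "green"  # faster than before
    return "none"
```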

Rob Lewis (Oct 30 2020 at 17:17):

Jannis Limperg said:

If people want this, I can certainly do it. Do you actually look at #travis though? I always found it too spammy to be subscribed. Maybe just announce outliers?

I do, but just announcing outliers is probably plenty, since we know where to look for the full output.

Johan Commelin (Oct 30 2020 at 17:28):

Ooh, I love the #travis spam.
We should really rename that stream though

Julian Berman (Oct 30 2020 at 17:35):

Not sure it's helpful (you may all know these things better than we/I do), but just on the general topic of benchmarks and reducing error/noise in timing -- the top three articles here are nice things from Python-land which will be of at least some relevance: https://vstinner.readthedocs.io/benchmark.html

Kevin Lacker (Oct 30 2020 at 17:55):

I think the main issue with consistency is just using a VM for benchmarking; there will always be some contention with whatever else is using the machine. But it's way better than nothing! I just wouldn't put much optimization work into benchmarking consistency as long as it's on a VM; I would not be surprised by random 10-20% fluctuations that are entirely out of our control.

Julian Berman (Oct 30 2020 at 19:04):

Yeah, agreed.

Jannis Limperg (Oct 30 2020 at 19:54):

Julian, thank you for the benchmarking notes. I'll go through them to see which parts are applicable.

Wrt noise, the VM is certainly not great and any micro-optimisations to reduce noise will likely be a drop in the bucket. But it's not horrible either. I did an experiment where I built the same commit 30 times and the biggest deviation from the mean was 3.4%. Standard deviation 49s, which is 1.5% of the mean. So the bot can't detect small optimisations/regressions, but bigger fish will show up.
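
For reference, a small sketch of the computation behind those figures, assuming the 30 wall-clock timings are available in seconds; a 49 s standard deviation at 1.5% of the mean corresponds to a mean build time of roughly 49 / 0.015 ≈ 3270 s, i.e. about 55 minutes.

```python
from statistics import mean, stdev

def variance_report(timings):
    """timings: wall-clock build times (seconds) for repeated builds
    of the same commit."""
    m = mean(timings)
    return {
        "mean_s": m,
        "stdev_s": stdev(timings),
        "stdev_rel": stdev(timings) / m,
        "max_rel_deviation": max(abs(t - m) / m for t in timings),
    }
```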

Johan Commelin (Oct 30 2020 at 19:58):

I guess we can also track an "average of the last 5 builds" parameter.

Rob Lewis (Oct 30 2020 at 19:59):

And for now, the VMs are free, reasonably fast, and more reliable than Scott's server pushing to #travis.

Kevin Lacker (Oct 30 2020 at 20:07):

So, the way VMs are typically inconsistent depends on how other tenants are scheduled around you. If your VM is co-scheduled with a bunch of light users, performance will look pretty stable, but at any time the cloud provider may reallocate resources, your job ends up coexisting with a heavier user, and performance drops. It's possible for performance to seem totally consistent for a week and then be 20% lower the next week while you make no changes, and measuring a low standard deviation won't really capture that. All this is not to say this is a bad idea, but to point out for future consideration that an apparent large performance regression on a VM is a "necessary but not sufficient" condition for an actual performance regression :smile:

Jannis Limperg (Oct 30 2020 at 20:09):

That's fair. We'll have to see what the numbers look like over longer periods of time.

Rob Lewis (Oct 30 2020 at 20:10):

Kevin, since you seem to know something about this: we have ~$10k credits on MS Azure and are projecting to use ~$300 for olean serving, which is why we went with this setup for benchmarking. Do you know if there's a way to do more reliable benchmarking using Azure?

Rob Lewis (Oct 30 2020 at 20:11):

We don't need perfect reliability and Jannis' setup is a clear improvement over what we had before.

Rob Lewis (Oct 30 2020 at 20:12):

But it's quite possible there's a better approach using the same resources and we're just not aware.

Kevin Lacker (Oct 30 2020 at 20:14):

I think it's a good idea to keep this running. Just don't be surprised if half the time, when it looks like things got slow, actually it was just VM flakiness

Jannis Limperg (Oct 30 2020 at 20:18):

I just found out that Azure also offers dedicated servers. :) They cost between 2000 and 87000 Euro/month. :(

Sebastian Ullrich (Oct 31 2020 at 11:51):

Have you tried profiling instructions instead of wall-clock time, using e.g. perf stat? It doesn't measure the speedup from parallelism, though.
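
A sketch of how instruction counts could be wrapped around the build, assuming Linux perf is installed and that `perf stat`'s CSV mode (`-x,`) puts the counter value in the first field on stderr; the build command at the end is only a placeholder.

```python
import subprocess

def count_instructions(cmd):
    """Run `cmd` under `perf stat` and return the retired-instruction count,
    or None if the counter is not found in perf's CSV output."""
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "instructions", "--"] + cmd,
        capture_output=True, text=True, check=True,
    )
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if "instructions" in fields:
            return int(fields[0])
    return None

# e.g. count_instructions(["leanpkg", "build"])  # placeholder build command
```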

Jannis Limperg (Oct 31 2020 at 12:40):

No; in fact I've never used perf stat before. Are these numbers useful? If they are, I can report them as well.

Sebastian Ullrich (Oct 31 2020 at 12:48):

They should be pretty much proportional to task-clock time while being less noisy, so they could be a strictly better regression indicator; I've never benchmarked on a VM, however.


Last updated: Dec 20 2023 at 11:08 UTC