Zulip Chat Archive
Stream: mathlib4
Topic: Build time tracking
Jireh Loreaux (Apr 07 2025 at 13:05):
Should we add a relatively small collection of metrics from the speed center to the list? I think this would help us catch Mathlib-wide changes with significant negative impact.
Jovan Gerbscheid (Apr 07 2025 at 13:07):
The main ones are `simp` and type class search (and the whole build of course)
Anne Baanen (Apr 07 2025 at 13:14):
I have also been thinking about other ways to integrate the speed center with Zulip. I'm a bit worried that a weekly report will be hard to act upon since it doesn't really show the cause. If Mathlib is slower to compile because there is more available in Mathlib, then there is nothing to worry about. And in the course of a week we have a lot of additions...
Anne Baanen (Apr 07 2025 at 13:16):
The direction I was thinking is a bot in the #rss stream posting every significant difference, like on the speedcenter homepage. Either as a separate thread in that stream, or editing the benchmark results into the mathlib4 commit message. (Not sure which one would be better: the first one should be easier to implement and ensures we don't miss anything, the second one would be easier to follow along the whole history with.)
Robin Carlier (Apr 07 2025 at 13:20):
Speaking of the speedcenter: is there a relatively easy way for a user to obtain similar data from a local setup? I'd love to be able to benchmark some of my stuff locally.
Michael Rothgang (Apr 07 2025 at 13:57):
I like the idea of using the #rss stream.
Jovan Gerbscheid (Apr 07 2025 at 14:08):
I like the idea of posting every significant difference. However, note that on the speedcenter homepage there are a lot of false positives.
Anne Baanen (Apr 07 2025 at 14:35):
True! I expect we'll decide not to act upon most of the messages. And I recall we can set a significance threshold for all of the reported stats in the speedcenter itself so that both the homepage and the Zulip thread get cleaned up for the price of one :)
Maja Kądziołka (Apr 08 2025 at 10:58):
Anne Baanen said:
I have also been thinking about other ways to integrate the speed center with Zulip. I'm a bit worried that a weekly report will be hard to act upon since it doesn't really show the cause. If Mathlib is slower to compile because there is more available in Mathlib, then there is nothing to worry about. And in the course of a week we have a lot of additions...
I suppose that one naive way to think about this could be the compile time to lines of code ratio :thinking:
Anne Baanen (Apr 08 2025 at 10:59):
Some years and Lean versions ago I ran a regression on this and I recall the best fit was that compile time is cubic in lines of code.
Anne Baanen (Apr 08 2025 at 11:00):
(More precisely, compile time in seconds for the whole library is cubic in `cloc` output.)
Anne Baanen (Apr 08 2025 at 11:03):
Here are some graphs I made a while ago:
I took the part of the graph before the 13th of June, to skip the big jump. According to LibreOffice, the growth is roughly cubic up to then, x^2.934 to be precise, with R² ≥ 0.98.
If I take all the data, the growth is x^2.31 with R² ≥ 0.95.
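A power-law fit like the one described can be sketched as a linear regression in log-log space; the data below is synthetic (not the actual 2021 measurements), chosen to follow a known exponent:

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x^b via linear regression in log-log space.

    Returns (a, b). Assumes all xs and ys are positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b

# Synthetic (lines-of-code, build-seconds) pairs following time ∝ loc^2.9,
# roughly matching the x^2.934 fit mentioned above
locs = [200_000, 400_000, 600_000, 800_000, 1_000_000]
times = [1e-12 * loc ** 2.9 for loc in locs]
a, b = fit_power_law(locs, times)
print(round(b, 2))  # recovers the exponent of the synthetic data
```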
Anne Baanen (Apr 08 2025 at 11:06):
So yeah, it's data from 2021. I would not trust these numbers at all nowadays, and the big recent changes to parallelism mean there's not going to be a lot of recent data to run the regression on either.
Maja Kądziołka (Apr 08 2025 at 11:08):
Anne Baanen said:
Some years and Lean versions ago I ran a regression on this and I recall the best fit was that compile time is cubic in lines of code.
that's... troubling. not a good trend to see, IMHO
Anne Baanen (Apr 08 2025 at 11:12):
(Shall we move the build time discussion so Damiano's `maxHeartbeats` work doesn't get buried?)
Anne Baanen (Apr 08 2025 at 13:10):
Here's a fit to the data of last March, showing a much more linear growth: 2025-03.png (Trying a power law or similar didn't give any noticeably better R² values.)
Edward van de Meent (Apr 08 2025 at 13:11):
it's interesting how it looks like it locally has a much smaller slope...
Anne Baanen (Apr 08 2025 at 13:30):
jq code for parsing the speedcenter JSON output into a CSV: `jq '.commits[].values | map(floor) | @csv' -r speedcenter-march.json > speedcenter-march.csv`. Loaded it into Apple Numbers and fiddled with it until I got a decent scatter plot.
Anne Baanen (Apr 08 2025 at 13:30):
And the API endpoint to get this JSON is https://speed.lean-lang.org/mathlib4/api/graph/detail/e7b27246-a3e6-496a-b552-ff4b45c7236e?start_time=1740805200&end_time=1744905600&dimensions=size:.lean::build:instructions
(but getting the data is quite slow and timed out a few times for me...)
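For reference, a Python equivalent of that jq filter, run against a toy payload (the shape — `.commits[].values` — follows the jq command above; the real API response may contain more fields):

```python
import json
import math

# Toy payload shaped like the jq filter expects: .commits[].values
payload = json.loads("""
{"commits": [
  {"values": [123.7, 456.2]},
  {"values": [124.9, 455.1]}
]}
""")

# Equivalent of: jq '.commits[].values | map(floor) | @csv' -r
rows = [
    ",".join(str(math.floor(v)) for v in commit["values"])
    for commit in payload["commits"]
]
csv_text = "\n".join(rows)
print(csv_text)
```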
Anne Baanen (Apr 08 2025 at 13:37):
Edward van de Meent said:
it's interesting how it looks like it locally has a much smaller slope...
True! I'm not sure how best to deal with the fact that this is very much a piecewise function...
Anne Baanen (Apr 08 2025 at 15:47):
I now have a prototype benchmark analysis script. Example output:
refactor(Interval/Finset/Fin): use `Finset.attachFin` (#23738)
e37082f608ffd1ef0cc7cfe5ba34eb7bec578ca5
Author: Yury G. Kudryashov <urkud@urkud.name>
Significant benchmark differences:
~Mathlib.Data.Nat.Factorial.SuperFactorial --- instructions: -14.1 G (-55.4 %)
It also works for the lean4 speedcenter, e.g.:
chore: panic on blocking waits in `sync` tasks (#7853)
1b40c46ab1eeec1e1bdc1c8526067b9af8efb7ac
Author: Sebastian Ullrich <sebasti@nullri.ch>
Significant benchmark differences:
binarytrees.st --- task-clock: -0.188 s (-6.4 %)
binarytrees.st --- wall-clock: -0.189 s (-6.4 %)
lake env --- task-clock: -0.00854 s (-5.5 %)
lake env --- wall-clock: -0.0085 s (-5.45 %)
lake startup --- maxrss: 1.24 KKB (1.66 %)
liasolver --- maxrss: 1.73 KKB (2.32 %)
nat_repr --- maxrss: 1.75 KKB (2.35 %)
parser --- maxrss: 1.75 KKB (2.35 %)
qsort --- maxrss: 1.75 KKB (2.35 %)
Anne Baanen (Apr 08 2025 at 15:52):
Here's a fun bikeshedding question: how to indicate that a benchmark improved or worsened? Maybe an emoji like :green_circle: / :cross_mark:?
Anne Baanen (Apr 08 2025 at 15:53):
I will also need to figure out how to link to the correct repo (mathlib4/lean4/???).
Mario Carneiro (Apr 08 2025 at 15:55):
Anne Baanen said:
Here's a fun bikeshedding question: how to indicate that a benchmark improved or worsened? Maybe an emoji like :green_circle: / :cross_mark:?
Where is this being indicated?
Anne Baanen (Apr 08 2025 at 15:56):
I was thinking per-dimension, so e.g.:
chore: panic on blocking waits in `sync` tasks (#7853)
1b40c46ab1eeec1e1bdc1c8526067b9af8efb7ac
Author: Sebastian Ullrich <sebasti@nullri.ch>
Significant benchmark differences:
:green_circle: binarytrees.st --- task-clock: -0.188 s (-6.4 %)
:green_circle: binarytrees.st --- wall-clock: -0.189 s (-6.4 %)
:green_circle: lake env --- task-clock: -0.00854 s (-5.5 %)
:green_circle: lake env --- wall-clock: -0.0085 s (-5.45 %)
:cross_mark: lake startup --- maxrss: 1.24 KKB (1.66 %)
:cross_mark: liasolver --- maxrss: 1.73 KKB (2.32 %)
:cross_mark: nat_repr --- maxrss: 1.75 KKB (2.35 %)
:cross_mark: parser --- maxrss: 1.75 KKB (2.35 %)
:cross_mark: qsort --- maxrss: 1.75 KKB (2.35 %)
Mario Carneiro (Apr 08 2025 at 15:56):
I would suggest something like a red up triangle and blue down triangle if possible
Anne Baanen (Apr 08 2025 at 15:57):
The issue is that sometimes higher numbers are better.
Anne Baanen (Apr 08 2025 at 15:57):
So :upwards_trend: and :downwards_trend: wouldn't work either.
Mario Carneiro (Apr 08 2025 at 15:57):
then a blue up triangle and red down triangle
Anne Baanen (Apr 08 2025 at 15:59):
I can't find any other triangles than red, but maybe:
:red_triangle_up: :down: and :red_triangle_down: / :up:?
Anne Baanen (Apr 08 2025 at 16:00):
Hmm, doesn't pair very well visually.
:up_button: :down: and :down_button: :up:?
Anne Baanen (Apr 08 2025 at 16:12):
With icons and PR/commit links:
feat: a finite crystallographic reduced irreducible root pairing containing two roots with Coxeter weight three is spanned by this pair (mathlib4#23634)
fa4b09af1251c0b743ac5b3aba0aeea7f85ad9a2
Author: Oliver Nash <github@olivernash.org>
Significant benchmark differences:
:up_button: ~Mathlib.LinearAlgebra.RootSystem.Finite.Lemmas --- instructions: 34.8 G (18.3 %)
refactor(Interval/Finset/Fin): use `Finset.attachFin` (mathlib4#23738)
e37082f608ffd1ef0cc7cfe5ba34eb7bec578ca5
Author: Yury G. Kudryashov <urkud@urkud.name>
Significant benchmark differences:
:down: ~Mathlib.Data.Nat.Factorial.SuperFactorial --- instructions: -14.1 G (-55.4 %)
Anne Baanen (Apr 08 2025 at 16:27):
Posting to Zulip also seems to work: see #junk > Benchmark results for mathlib4
Anne Baanen (Apr 08 2025 at 16:27):
I'll leave it at this for right now to get some feedback, let me know what you think!
Anne Baanen (Apr 08 2025 at 16:30):
Code is available at https://github.com/Vierkantor/lean-infra/blob/velcom-bot/velcom-bot/velcom-bot.py
Johan Commelin (Apr 09 2025 at 04:04):
So :up_button: :down_button: are "bad" and :up: :down: are "good"?
Mario Carneiro (Apr 09 2025 at 05:46):
Anne Baanen said:
Posting to Zulip also seems to work: see #junk > Benchmark results for mathlib4
this is a private stream, you should post it here instead
Anne's test bot (Apr 09 2025 at 07:35):
feat: make TryThis work in widget messages (lean4#7610)
e6ce55ffd44e206cc5210f457694d8fef176945d
Author: Wojciech Nawrocki <wjnawrocki@protonmail.com>
Significant benchmark differences:
:down: bv_decide_large_aig.lean --- task-clock: -0.148 s (-3.16 %)
:up_button: bv_decide_mod --- task-clock: 0.456 s (1.7 %)
:up_button: bv_decide_mod --- wall-clock: 0.442 s (1.65 %)
:up_button: ilean roundtrip --- compress: 0.0153 (1.08 %)
:down: ilean roundtrip --- parse: -0.0267 (-4.26 %)
:down: stdlib --- maxrss: -10.8 KKB (-1.19 %)
:up_button: workspaceSymbols --- task-clock: 0.303 s (2.95 %)
:up_button: workspaceSymbols --- wall-clock: 0.304 s (2.96 %)
Anne's test bot (Apr 09 2025 at 07:35):
chore: split out small chunks from MeasureTheory.Integral.Lebesgue.Basic (mathlib4#23819)
f07b6a302a773f2794fbde911eb5582aba2f6146
Author: Jeremy Tan Jie Rui <reddeloostw@gmail.com>
Significant benchmark differences:
:down: ~Mathlib.MeasureTheory.Integral.Lebesgue.Basic --- instructions: -20.4 G (-18.2 %)
Anne's test bot (Apr 09 2025 at 07:35):
feat: pre-composing by an essentially surjective functor doesn't change the essential image (mathlib4#23840)
afe5ac93dbf8e74db404b12aa51ab7839a5d6dbd
Author: Yaël Dillies <yael.dillies@gmail.com>
Significant benchmark differences:
:up_button: build --- fix level params: 2.19 s (7.54 %)
Anne Baanen (Apr 09 2025 at 07:37):
(In order to not spam the thread too much, I only moved over a few example messages.)
Sebastian Ullrich (Apr 09 2025 at 07:52):
If some metrics are found to be too noisy (potentially everything but instruction counts), I can exclude them in the DB so that they are hidden from the speedcenter homepage as well
Anne Baanen (Apr 09 2025 at 08:01):
For Mathlib I think `share common exprs`, `.olean serialization`, `fix level params` and `instantiate metavars` are the main sources of noise right now. They seem to commonly vary 5%-8% between commits. I would suggest a threshold of at least 10% (or completely silenced).
Mario Carneiro (Apr 09 2025 at 08:20):
can we adaptively determine standard deviations and post when they are above the threshold?
Anne Baanen (Apr 09 2025 at 08:22):
There are a few metrics that report "change/stddev". Not sure how that works with the rest of the system though.
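Mario's adaptive idea could be sketched roughly like this (a sketch only, not the speedcenter's actual implementation; the window size and z-score cutoff here are made-up parameters):

```python
import statistics

def flag_outliers(series, window=5, z_cutoff=3.0):
    """Flag indices whose value deviates from the trailing-window mean
    by more than z_cutoff standard deviations of that window."""
    flagged = []
    for i in range(window, len(series)):
        prev = series[i - window:i]
        mu = statistics.mean(prev)
        sd = statistics.stdev(prev)
        if sd > 0 and abs(series[i] - mu) / sd > z_cutoff:
            flagged.append(i)
    return flagged

# Stable noise followed by a genuine regression in the last commit
history = [100, 101, 99, 100, 102, 101, 100, 150]
print(flag_outliers(history))  # only the final jump is flagged
```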
Sebastian Ullrich (Apr 09 2025 at 08:24):
They are derived from multiple runs on the same commit but Mathlib itself is too expensive to run multiple times
Mario Carneiro (Apr 09 2025 at 08:25):
is it possible to use e.g. the past 3-5 commits instead?
Sebastian Ullrich (Apr 09 2025 at 08:26):
Not without changing the code, which no-one is working on or even familiar with
Sebastian Ullrich (Apr 09 2025 at 08:26):
Anne Baanen said:
For Mathlib I think `share common exprs`, `.olean serialization`, `fix level params` and `instantiate metavars` are the main sources of noise right now. They seem to commonly vary 5%-8% between commits. I would suggest a threshold of at least 10% (or completely silenced).
silenced
Anne Baanen (Apr 09 2025 at 08:29):
Thanks!
Eric Wieser (Apr 09 2025 at 09:06):
Anne Baanen said:
I can't find any other triangles than red, but maybe:
We could always add custom emojis
Anne Baanen (Apr 09 2025 at 12:56):
Minor change: now older runs should be processed earlier than newer ones, so they appear in the stream in the right order.
Anne Baanen (Apr 09 2025 at 13:09):
Slightly silly suggestions:
down-good.png
down-sad.png
up-good.png
up-sad.png
Edward van de Meent (Apr 09 2025 at 13:15):
better than nothing
Ruben Van de Velde (Apr 09 2025 at 13:16):
A ringing endorsement
Edward van de Meent (Apr 09 2025 at 13:16):
i guess we can have the bot also just post both emoji rather than some custom combo
Edward van de Meent (Apr 09 2025 at 13:16):
that seems to me like a less silly way to go about it
Edward van de Meent (Apr 09 2025 at 13:17):
that being said, silly is also nice
Mario Carneiro (Apr 09 2025 at 13:19):
so {:red_circle: :up:, :blue_circle: :up:, :red_circle: :down:, :blue_circle: :down:}?
Edward van de Meent (Apr 09 2025 at 13:30):
well, i was thinking { :sad: :up:, :smiley: :up:, :sad: :down:, :smiley: :down:}
Edward van de Meent (Apr 09 2025 at 13:31):
in particular that might be more recognisable to any colorblind people?
Matthew Ballard (Apr 09 2025 at 13:32):
I think reporting any PR where the change in total instructions is > 100B (or < -100B) would be an improvement for tracking. Generally, I've found that > 1B/line indicates a problem, but that might require touching the other code. Does it?
Anne Baanen (Apr 09 2025 at 13:39):
You mean lines of code for the whole of Mathlib or per file? The first one should not be too hard to implement since we already have total `.lean` size. I don't see any per-file line counts in VelCom, so the current bot implementation wouldn't be able to get that information.
Mario Carneiro (Apr 09 2025 at 13:42):
Edward van de Meent said:
well, i was thinking { :sad: :up:, :smiley: :up:, :sad: :down:, :smiley: :down:}
I think it is better to have something very visually obvious, including a color change. If this was HTML there would be more options for being able to style the text and so on so it looks like a stock ticker
Mario Carneiro (Apr 09 2025 at 13:42):
another possibility is something like { :red_circle: , :small_blue_diamond: } for increasing the visual differentiation
Mario Carneiro (Apr 09 2025 at 13:43):
although this biases toward regression reporting, while big improvements might also be something to call out
Edward van de Meent (Apr 09 2025 at 13:43):
how about :check_mark:/:cross_mark: ?
Matthew Ballard (Apr 09 2025 at 13:44):
Anne Baanen said:
You mean lines of code for the whole of Mathlib or per file? The first one should not be too hard to implement since we already have total `.lean` size. I don't see any per-file line counts in VelCom, so the current bot implementation wouldn't be able to get that information.
The ratio of total instructions change to total lines added in the PR.
Michael Rothgang (Apr 09 2025 at 15:58):
FWIW, the Rust project has a bot post automated comments with benchmarking results (similar to our bot). It uses a (green) check-mark and a red cross for improvements and regressions.
Michael Rothgang (Apr 09 2025 at 16:00):
So, I'd tend to keep it simple: add a checkmark for "improvements" and cross for regressions. If we want to denote "large" and "somewhat large" differently, just go for two checkmarks/crosses?
Michael Rothgang (Apr 09 2025 at 16:01):
Orwell's "plusgood, good, ungood, plusungood" was of course (purposefully) simplistic - but imho it captures the essence in this context quite well.
Anne's test bot (Apr 09 2025 at 16:31):
chore: use mixin ordered algebraic typeclasses (part 2) (mathlib4#20595)
d7d140ed9486cbd76db4f6c9aad255a2c1f230e5
Author: Yuyang Zhao 赵雨扬 <zhao.yu-yang@foxmail.com>
Significant benchmark differences:
:cross_mark: ~Mathlib.Algebra.Order.Archimedean.Basic --- instructions: 10.8 G (17.6 %)
:cross_mark: ~Mathlib.Algebra.Order.CauSeq.Basic --- instructions: 14.5 G (16.6 %)
:cross_mark: ~Mathlib.Algebra.Order.Floor.Ring --- instructions: 30.1 G (46.7 %)
:cross_mark: ~Mathlib.Algebra.Order.Floor.Semiring --- instructions: 15.2 G (36.2 %)
:cross_mark: ~Mathlib.Algebra.Order.ToIntervalMod --- instructions: 43 G (53.8 %)
:cross_mark: ~Mathlib.Analysis.Convex.Deriv --- instructions: 13.1 G (20.7 %)
:cross_mark: ~Mathlib.Analysis.Convex.Jensen --- instructions: 10.9 G (20.2 %)
:cross_mark: ~Mathlib.Analysis.Convex.Mul --- instructions: 15.3 G (20.8 %)
:cross_mark: ~Mathlib.Analysis.Convex.Side --- instructions: 11.4 G (12.5 %)
:check_mark: ~Mathlib.Analysis.MeanInequalitiesPow --- instructions: -12 G (-20.5 %)
:cross_mark: ~Mathlib.Combinatorics.SimpleGraph.Regularity.Uniform --- instructions: 11.3 G (25 %)
:cross_mark: ~Mathlib.GroupTheory.ArchimedeanDensely --- instructions: 11.6 G (18.5 %)
:check_mark: ~Mathlib.LinearAlgebra.Orientation --- instructions: -13.5 G (-15.7 %)
:check_mark: ~Mathlib.Topology.Algebra.Field --- instructions: -11.2 G (-29.7 %)
:cross_mark: ~Mathlib.Topology.Instances.AddCircle --- instructions: 10.9 G (19.8 %)
Anne's test bot (Apr 09 2025 at 16:34):
chore: use mixin ordered algebraic typeclasses (part 2) (mathlib4#20595)
d7d140ed9486cbd76db4f6c9aad255a2c1f230e5
Author: Yuyang Zhao 赵雨扬 <zhao.yu-yang@foxmail.com>
Significant benchmark differences:
:red_circle: :up: ~Mathlib.Algebra.Order.Archimedean.Basic --- instructions: 10.8 G (17.6 %)
:red_circle: :up: ~Mathlib.Algebra.Order.CauSeq.Basic --- instructions: 14.5 G (16.6 %)
:red_circle: :up: ~Mathlib.Algebra.Order.Floor.Ring --- instructions: 30.1 G (46.7 %)
:red_circle: :up: ~Mathlib.Algebra.Order.Floor.Semiring --- instructions: 15.2 G (36.2 %)
:red_circle: :up: ~Mathlib.Algebra.Order.ToIntervalMod --- instructions: 43 G (53.8 %)
:red_circle: :up: ~Mathlib.Analysis.Convex.Deriv --- instructions: 13.1 G (20.7 %)
:red_circle: :up: ~Mathlib.Analysis.Convex.Jensen --- instructions: 10.9 G (20.2 %)
:red_circle: :up: ~Mathlib.Analysis.Convex.Mul --- instructions: 15.3 G (20.8 %)
:red_circle: :up: ~Mathlib.Analysis.Convex.Side --- instructions: 11.4 G (12.5 %)
:large_blue_diamond: :down: ~Mathlib.Analysis.MeanInequalitiesPow --- instructions: -12 G (-20.5 %)
:red_circle: :up: ~Mathlib.Combinatorics.SimpleGraph.Regularity.Uniform --- instructions: 11.3 G (25 %)
:red_circle: :up: ~Mathlib.GroupTheory.ArchimedeanDensely --- instructions: 11.6 G (18.5 %)
:large_blue_diamond: :down: ~Mathlib.LinearAlgebra.Orientation --- instructions: -13.5 G (-15.7 %)
:large_blue_diamond: :down: ~Mathlib.Topology.Algebra.Field --- instructions: -11.2 G (-29.7 %)
:red_circle: :up: ~Mathlib.Topology.Instances.AddCircle --- instructions: 10.9 G (19.8 %)
Anne Baanen (Apr 09 2025 at 16:37):
I think I like the :cross_mark: / :check_mark: approach better; the circles seem a bit too visually heavy. One drawback is that :cross_mark: might be interpreted to mean the build for that file failed...
Aaron Liu (Apr 09 2025 at 16:40):
For me, the arrows are a bit unreadable, being white on a light-blue background.
Mario Carneiro (Apr 09 2025 at 16:41):
maybe just a non-emoji up/down triangle?
Eric Wieser (Apr 09 2025 at 16:42):
Should we table-ify as
:cross_mark: | ~Mathlib.Algebra.Order.Archimedean.Basic | instructions: 10.8 G | 17.6 % |
:cross_mark: | ~Mathlib.Algebra.Order.CauSeq.Basic | instructions: 14.5 G | 16.6 % |
:cross_mark: | ~Mathlib.Algebra.Order.Floor.Ring | instructions: 30.1 G | 46.7 % |
:cross_mark: | ~Mathlib.Algebra.Order.Floor.Semiring | instructions: 15.2 G | 36.2 % |
:cross_mark: | ~Mathlib.Algebra.Order.ToIntervalMod | instructions: 43 G | 53.8 % |
:cross_mark: | ~Mathlib.Analysis.Convex.Deriv | instructions: 13.1 G | 20.7 % |
:cross_mark: | ~Mathlib.Analysis.Convex.Jensen | instructions: 10.9 G | 20.2 % |
:cross_mark: | ~Mathlib.Analysis.Convex.Mul | instructions: 15.3 G | 20.8 % |
:cross_mark: | ~Mathlib.Analysis.Convex.Side | instructions: 11.4 G | 12.5 % |
:check_mark: | ~Mathlib.Analysis.MeanInequalitiesPow | instructions: -12 G | -20.5 % |
:cross_mark: | ~Mathlib.Combinatorics.SimpleGraph.Regularity.Uniform | instructions: 11.3 G | 25 % |
:cross_mark: | ~Mathlib.GroupTheory.ArchimedeanDensely | instructions: 11.6 G | 18.5 % |
:check_mark: | ~Mathlib.LinearAlgebra.Orientation | instructions: -13.5 G | -15.7 % |
:check_mark: | ~Mathlib.Topology.Algebra.Field | instructions: -11.2 G | -29.7 % |
:cross_mark: | ~Mathlib.Topology.Instances.AddCircle | instructions: 10.9 G | 19.8 % |
Kevin Buzzard (Apr 09 2025 at 18:01):
(off-topic: in case anyone is wondering what the heck I'm doing merging PRs with slowdowns like that, they should check out the benchmarks in the follow-up PR which I'm about 2 hours into reviewing right now!)
Anne's test bot (Apr 10 2025 at 13:51):
chore: update stage0
91c245663b7006c7c01f51f8891bb45c32a9cc67
Author: Lean stage0 autoupdater <>
Significant benchmark differences:
:large_blue_diamond: ▼ | Init.Data.BitVec.Lemmas re-elab | maxrss: -27.5 K KB | -1.85 % |
:large_blue_diamond: ▼ | big_do | task-clock: -0.0627 s | -3.5 % |
:large_blue_diamond: ▼ | big_do | wall-clock: -0.066 s | -3.79 % |
:large_blue_diamond: ▼ | binarytrees.st | task-clock: -0.231 s | -7.33 % |
:large_blue_diamond: ▼ | binarytrees.st | wall-clock: -0.231 s | -7.32 % |
:red_circle: ▲ | lake build no-op | maxrss: 9.45 K KB | 1.77 % |
:large_blue_diamond: ▼ | stdlib | blocked (unaccounted): -4.56 | -2.5 % |
:large_blue_diamond: ▼ | stdlib | instantiate metavars: -0.0893 | -2.21 % |
:large_blue_diamond: ▼ | stdlib | tactic execution: -3.08 | -2.54 % |
:large_blue_diamond: ▼ | stdlib | task-clock: -40.4 s | -2.65 % |
:large_blue_diamond: ▼ | stdlib | wall-clock: -3.6 s | -2.32 % |
Anne Baanen (Apr 10 2025 at 13:57):
Here's an example tableificated benchmark output. Are we happy enough with this to ask Sebastian to install the bot? Any later tweaks could then be done via PRs to https://github.com/Kha/lean-infra :)
Mario Carneiro (Apr 10 2025 at 15:04):
I'm a bit worried about having this on every mathlib commit in #rss, it will make the benchmark data look more important than the actual content (and the mathlib4 rss already goes by quite fast). Can we limit the number of differences to 3 or so?
Mario Carneiro (Apr 10 2025 at 15:05):
if it's another topic then at least people can mute it
Anne Baanen (Apr 10 2025 at 15:08):
It will be in a different topic called Benchmark results for {repo_name}
(where repo_name
is either lean4 or mathlib4 currently). There have been 8 benchmarks with significant differences in Mathlib in the last 24 hours, out of 37 total commits.
Anne Baanen (Apr 10 2025 at 15:09):
To be clear, a typical benchmark result will look more like [screenshot: a report with one or two changes] than [screenshot: the long report above].
Mario Carneiro (Apr 10 2025 at 15:17):
Even so, I imagine that some commits will cause significant improvements/regressions on almost every file or on a bunch of correlated metrics, so I expect it to be common that the number of significant deviations on a commit is very large. So it makes sense to have both a lower threshold on showing the benchmark as well as a maximum number of benchmarks to show if they are all just saying the same thing. Maybe add "...and N more" at the end to indicate the report is incomplete?
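The truncation suggested here is straightforward to sketch (the limit of 3 and the exact message wording are placeholders, not what the bot actually does):

```python
def truncate_report(lines, limit=3):
    """Keep at most `limit` benchmark lines, appending a count of the rest."""
    if len(lines) <= limit:
        return lines
    return lines[:limit] + [f"...and {len(lines) - limit} more"]

# A commit touching many files produces a long list of significant diffs
diffs = [f"~Mathlib.File{i} --- instructions: +12 G" for i in range(5)]
print("\n".join(truncate_report(diffs)))
```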
Michael Rothgang (Apr 10 2025 at 15:19):
Bikeshedding first: I personally don't find the latest output self-explanatory. Using blue and red is good, but circle and diamond... is not obvious to me.
Mario Carneiro (Apr 10 2025 at 15:20):
Square and heart?
Mario Carneiro (Apr 10 2025 at 15:21):
:red_square: :blue_heart:
Michael Rothgang (Apr 10 2025 at 15:22):
Better: but what about checkmark and cross?
Michael Rothgang (Apr 10 2025 at 15:22):
(I really don't want this thread to drown in bikeshedding - but at the same time, having readable output is important.)
Mario Carneiro (Apr 10 2025 at 15:23):
:check: :cross_mark:
Anne Baanen (Apr 10 2025 at 15:23):
I don't like the cross too much since it suggests a build failure. :/
Michael Stoll (Apr 10 2025 at 15:24):
:check: :red_square:
Michael Rothgang (Apr 10 2025 at 15:25):
Regarding long output, we could borrow another idea from the rust project's presentation: display a summary of changes, and allow expanding for details. Here's a screenshot to illustrate:
Bildschirmfoto vom 2025-04-10 17-25-00.png
Michael Rothgang (Apr 10 2025 at 15:25):
Rust also categorises benchmarks as primary and secondary (the latter are mostly stress-tests to ensure some aspect doesn't accidentally regress); we could omit that distinction.
Michael Rothgang (Apr 10 2025 at 15:26):
"mean" is a geometric mean of the percentage changes.
Michael Rothgang (Apr 10 2025 at 15:26):
Basically, I would propose that for e.g. >=5 changes, we'd just display such a table with improvements, regressions and everything summarised this way, and have expandable output with the details.
Mario Carneiro (Apr 10 2025 at 15:27):
our file benchmarks are basically secondary benchmarks by that definition
Michael Rothgang (Apr 10 2025 at 15:27):
(Or perhaps, separate tables for instructions, wall-time, rss, etc.)
Michael Rothgang (Apr 10 2025 at 15:28):
Rust has separate tables for all of these, but only shows instruction counts by default, as these are the most stable. The other tables also exist, but are hidden by default...
Mario Carneiro (Apr 10 2025 at 15:28):
I do like the idea of summarizing benchmarks, I've had benchmark overload issues ever since we added all those file benchmarks
Michael Rothgang (Apr 10 2025 at 15:30):
@Damiano Testa What do you think about changing the benchmarking bot's summary comments to adopt such a table format? (I can volunteer to make such a PR, in due time.)
Anne's test bot (Apr 10 2025 at 15:32):
chore: bump toolchain to v4.19.0-rc2 (mathlib4#23614)
24d8de9da5cf8a1a39a8e7f618548973e4bbe75d
Author: Kim Morrison <kim@tqft.net>
Significant benchmark differences:
:check: ▼ | build | added 365 packages, and audited 366 packages in: -1 s | -14.3 % |
:red_square: ▲ | build | aesop: 92.5 s | 42.6 % |
:red_square: ▲ | build | compilation: 316 s | 32.6 % |
:red_square: ▲ | build | elaboration: 2.19 K s | 116 % |
:red_square: ▲ | build | import: 1.63 K s | 32.5 % |
Anne Baanen (Apr 10 2025 at 15:36):
Alright, changed the emoji and added a truncation message. This is pretty much the amount of work I can put into the project until after Easter (so around the 22nd of April). Feel free to tweak the script here until you're happy https://github.com/Vierkantor/lean-infra/blob/velcom-bot/velcom-bot/velcom-bot.py in the meantime :)
Mario Carneiro (Apr 14 2025 at 03:35):
#24020 was opened to ask about why `import Mathlib` has gotten so slow of late, and I agree. Do we have benchmark data to support this observation?
Last updated: May 02 2025 at 03:31 UTC