Zulip Chat Archive

Stream: mathlib4

Topic: flamegraphs


Sebastian Ullrich (Apr 14 2023 at 14:48):

Using https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/Introducing.20FlameTC/near/348664964, here are interactive flamegraphs of the three most time-consuming mathlib4 files:

  1. Mathlib.Data.Polynomial.FieldDivision
  2. Mathlib.Topology.Algebra.Module.Basic
  3. Mathlib.GroupTheory.MonoidLocalization

Patrick Massot (Apr 14 2023 at 14:50):

This looks very nice, but do you also have explanation about how to read this and use it to make things faster?

Sebastian Ullrich (Apr 14 2023 at 14:55):

Interestingly, the graph structure is quite different between the three modules. No. 1 is visibly dominated by big unification problems (of diamonds, presumably), some but not all of them contained in etaExperiment. For no. 2 I don't see a clear pattern; there is one expensive declaration but also many cheaper ones, and at 2.8k lines the file is much longer than the other ones. No. 3 is notable for the absence of tracked time: it is probably spent interpreting [to_additive], which you can see being used in many top-level frames.

Sebastian Ullrich (Apr 14 2023 at 14:57):

It would be nice if different components could be colored differently, but that doesn't seem possible at least with speedscope

Mauricio Collares (Apr 14 2023 at 14:58):

!4#3414 improves the first one by 78.5% and the second one by 55% (measured in instructions)

Patrick Massot (Apr 14 2023 at 14:59):

Sebastian, I'm asking a much more basic question. I have no idea what I'm looking at.

Sebastian Ullrich (Apr 14 2023 at 15:04):

Well, the second part of your original question couldn't be more complex! But as to the first part, it is basically a timeline (in microseconds, showing components taking >10ms) of Lean processing the respective file. At the top you can see the first line of each declaration, and nested below them you can mostly see tactic, TC, and defeq invocations.

Sebastian Ullrich (Apr 14 2023 at 15:14):

@Henrik Böving Actually it seems to be a reverse timeline: derivative_rootMultiplicity_of_root should be the first theorem in the first file

Henrik Böving (Apr 14 2023 at 15:14):

Ah! I knew the fact that I just :: new events onto the list in addChild would come back to bite me at some point :D

Patrick Massot (Apr 14 2023 at 15:16):

Ok, so for instance it says theorem divByMonic_add_X_sub_C_mul_derivate_divByMonic_eq_derivative used 7.8 seconds, right? And then how do I interpret the next lines?

Patrick Massot (Apr 14 2023 at 15:17):

This example is the second one in the graph

Sebastian Ullrich (Apr 14 2023 at 15:21):

Most of the time in that theorem seems to be spent in a tactic block starting with have (you need to be a bit careful here if text is elided with [...] since it might hide further tactics apart from that have). If you go further down, you can see that time is mostly spent in two rw calls in that block, each of which is mostly made up of one big unification problem (when emojis come in, it's either TC or unification).

Mario Carneiro (Apr 14 2023 at 15:23):

The [...] seems to be unfortunate in some places; for example, the biggest top-level theorem when sorting by "left-heavy" in the first graph is set_option synthInstance.etaExperiment true in [...]

Mario Carneiro (Apr 14 2023 at 15:24):

sure would be nice to know what the next line is

Henrik Böving (Apr 14 2023 at 15:24):

Patrick Massot said:

This looks very nice, but do you also have explanation about how to read this and use it to make things faster?

The flamegraph format originates from here: https://github.com/brendangregg/FlameGraph/ (in general, Brendan is the performance guy: he has a million tools, methodology guides and whatnot that you can use to make things fast, but flamegraphs are by far his most popular invention)

Now if you look at this slightly less chaotic example in the README: https://www.brendangregg.com/FlameGraphs/cpu-bash-flamegraph.svg (you do not have to understand the program it profiles; it's just that I like this visualization more than the one from speedscope :D)

Basically, each of the lines here is a function; if a function calls another function, it is displayed on top of the previous function's line. The width of a line tells you how much run time it consumed. This is still not enough to do performance engineering, though. If you look at the example flamegraph, you can see that __libc_start_main seems to consume all the time, right? But if you were to look at it, it's actually just a few CPU instructions before jumping into main. The things that are actually consuming time are the sections of lines above which no function can be seen. These are of course the most top-level functions, but for example if we zoom in on this inode function: image.png you can see that it does consume some time itself but also calls other functions. Most notably ext4_mark_iloc_dirty, which has the largest "free space above the line" and is thus the function in the call stack that is actually consuming the most time.
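
For concreteness, the input to Brendan Gregg's flamegraph.pl script is the "folded stacks" text format: one call stack per line, frames separated by ;, followed by a sample count. With made-up frame names, it looks roughly like this:

```
main;parse;tokenize 80
main;parse 20
main;eval;lookup 250
main;eval 50
```

Here main spans all 400 samples, but the actual work is in the frames with nothing above them: tokenize, lookup, and the self-time lines of parse and eval.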

Now, going back to the visualization from above: my tool does not look at nested function calls but instead at the components of Lean. So for example, the elaborator starts working on a theorem, then it does a tactic call that maybe does a type class thing and a unification, and so on. However, the same principle still applies: width is time, if one bar sits above another then the lower one caused the upper one, and free space is actual computation time.

Now, how can you use it to make stuff faster? Mostly in two ways:

  1. Reduce the "space without stuff above it" as much as possible. This is probably not something you can do yourself but rather something a Lean compiler / tactic developer can do by improving their algorithm.
  2. Use it to find stuff that is consuming time in general (this can just be wide bars, regardless of whether they have much open space above them) and then
    a) get rid of it by replacing it with something that consumes less time, or
    b) report it as an issue because you don't think that Lean should be doing such an expensive thing here.

Patrick Massot (Apr 14 2023 at 15:27):

I'm sorry, I still don't understand. My naive reading gives a total greater than 7.8 seconds. To me it looks like the have takes 7.4s, then the first rw takes 3s, the second rw takes 0.2s, the third one 0.3s, the fourth one 3.7s, and then assumption takes 0.04s. How does that add up to 7.8s?

Patrick Massot (Apr 14 2023 at 15:27):

Is the have processed in parallel?

Sebastian Ullrich (Apr 14 2023 at 15:30):

The bars form a tree; any bars below a specific bar are included in its execution time. The entire tactic block starting with have takes ~7.4s, and within that the two big rws take ~3s each. The have itself must be too quick to even be recorded.
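
In other words, with your numbers: the nested bars 3 + 0.2 + 0.3 + 3.7 + 0.04 ≈ 7.2s all sit inside the ~7.4s have block, which itself sits inside the 7.8s theorem frame; the small remainders are steps that fall below the 10ms threshold and therefore get no bar of their own.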

Sebastian Ullrich (Apr 14 2023 at 15:31):

The have line really is something like have ...; rw ...; ...
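
Schematically (with a made-up example, not the actual mathlib proof), such a block looks like the following; the whole by block shows up as a single frame whose label is its one-line rendering, and the individual rw calls get their own nested bars below it:

```lean
example (a b : Nat) (h : a = b) : a + 0 = b := by
  -- the whole block is one frame, labelled roughly "have h' ...; rw [h']; rw [h]"
  have h' : a + 0 = a := Nat.add_zero a
  rw [h']  -- nested bar
  rw [h]   -- nested bar
```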

Patrick Massot (Apr 14 2023 at 15:32):

Ok, thanks to both of you this is now a lot clearer.

Patrick Massot (Apr 14 2023 at 15:33):

This is really a very unintuitive tool, but I can believe it could be very useful once people get used to it.

Sebastian Ullrich (Apr 14 2023 at 15:35):

Sebastian Ullrich said:

It would be nice if different components could be colored differently, but that doesn't seem possible at least with speedscope

There are still many things we could improve on the export side: record missing parts like attribute application in the third example, probably drop docstrings, improve one-line formatting of classes like term elaboration and tactic blocks, ...

Henrik Böving (Apr 14 2023 at 15:38):

Sebastian Ullrich said:

record missing parts

https://github.com/leanprover/lean4/pull/2187 wink wink :P

Patrick Massot (Apr 14 2023 at 15:46):

Let me check I understand what this graph is telling us. Is it really telling me, still in the same proof, that Lean spends 3.6s unifying ↑Polynomial.C (Polynomial.eval a f) =?= ↑Polynomial.C ?m.2043350?

Sebastian Ullrich (Apr 14 2023 at 15:49):

I'm afraid we do currently, though it is hard to say what's actually being done without looking at implicit arguments as well, for which it can be easier to switch to the editor and use trace.profiler directly to get the usual trace widgets
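
For reference, a minimal sketch of doing that locally (the declaration name and proof are placeholders, not from the mathlib file):

```lean
-- Turn on the profiler trace for just this declaration; elaboration, tactic,
-- TC, and unification timings then appear as expandable trace nodes in the
-- infoview, where implicit arguments can also be inspected.
set_option trace.profiler true in
theorem my_slow_lemma : 2 + 2 = 4 := by
  decide
```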

Sebastian Ullrich (Apr 14 2023 at 15:58):

@Henrik Böving Oh, I just realized ; is probably not escaped in the flamegraph export, this should be one frame image.png
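
(To illustrate with made-up frame names: in the folded text format a line like

```
theorem foo : P;have h := e; rw [h] 120
```

is split on every ;, so the one-line rendering of the tactic block ends up as two frames, have h := e and rw [h], instead of a single have h := e; rw [h] frame.)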

Henrik Böving (Apr 14 2023 at 15:59):

Does it allow for escaping with a backslash?

Henrik Böving (Apr 14 2023 at 16:00):

image.png that's a no ^^

Sebastian Ullrich (Apr 14 2023 at 16:01):

This flamegraph format may be a bit too limited after all


Last updated: Dec 20 2023 at 11:08 UTC