Zulip Chat Archive

Stream: mathlib4

Topic: Frustrating debugging problem


Kevin Buzzard (May 07 2025 at 17:09):

I'm no computer scientist -- does anyone have any hints about how to proceed with the problem I describe below?

I want to analyse PR #24003 . It causes minor slowdowns but I think it's a step in the right direction and I would be prepared to take these slowdowns on the chin and merge (these files are fast anyway). But I would at least like to understand what the slowdowns are, because they might just be something dumb which is easily fixable.

The file with the most instructions added to it by the changes in the PR is Mathlib.Analysis.Analytic.Basic which is +46*10^9 instructions, an increase of 17%. This file is 1400 lines long and was not touched by the PR. The 17% increase does not seem like noise, the PR has been benchmarked several times against two different commits of master and always comes out slower. So the big question is: what exactly is slower? Is every declaration a bit slower, or is it just two declarations which are 20 times slower?

I have a nice two-monitor setup on a fast PC, with the two commits of mathlib checked out which are being compared by the speedcenter here. It's true that the PR makes the file compile more slowly; indeed I can insert a character at the start of both files as quickly as I can and then watch them compile. Both of them compile very quickly but the change definitely causes a slowdown.

Ideally I could stick #count_heartbeats just after the module docstring and compare the outputs manually to get a feeling for what's happening, but right now #count_heartbeats gives spurious results (see issue #23905).

Can people suggest other ways in which I can try and see what's going on? I've never been in this position before. When analysing slow alg geom or comm alg files, it's always manifestly obvious where the problem is, it's in the declaration which says set_option maxHeartbeats 800000 in in front of it. But that's not happening here, these files are super-fast. If there's no simple way to do this I might just give up because I am fussing about essentially nothing here. To put this +46*10^9 instructions figure into some context: a typical PR just adding API might add this many instructions to mathlib, it is a tiny figure. I was just curious.

Damiano Testa (May 07 2025 at 17:35):

Kevin, I am responsible for having produced a wrong #count_heartbeats function. Let me give it a second try. Would you mind trying out creating a new file with the content of the code below and then importing that file into wherever you want to measure?

Damiano Testa (May 07 2025 at 17:35):

import Lean.Elab.Command
import Mathlib.Util.CountHeartbeats
open Lean Elab Command

/-!
# The "countHeartbeats'" linter

The "countHeartbeats'" linter reports the output of `#count_heartbeats in` for every declaration.
-/

namespace Mathlib.Linter

/--
The "countHeartbeats'" linter reports the output of `#count_heartbeats in` for every declaration.
-/
register_option linter.countHeartbeats' : Bool := {
  defValue := true
  descr := "enable the countHeartbeats' linter"
}

initialize warnIfRef : IO.Ref Nat  IO.mkRef 0

open Lean Elab Command in
/--
Using `#count_at_least n` makes the `linter.countHeartbeats'` report a warning on commands that
take at least `n` heartbeats.
-/
elab "#count_at_least " n:num : command => do
  logInfo m!"Warn when the heartbeats exceed {n}"
  warnIfRef.set n.getNat

/-- Add string `a` at the start of the first component of the name `n`. -/
private partial def prepend (n : Name) (a : String := "_") : Name :=
  n.modifyBase fun
    | .anonymous => .str .anonymous a
    | .str .anonymous s => .str .anonymous (a ++ s)
    | .str p s => .str (prepend p a) s
    | .num p n => .num (.str p a) n

namespace CountHeartbeats'

@[inherit_doc Mathlib.Linter.linter.countHeartbeats']
def countHeartbeatsLinter : Linter where run := withSetOptionIn fun stx  do
  unless Linter.getLinterValue linter.countHeartbeats' ( getOptions) do
    return
  if ( get).messages.hasErrors then
    return
  let some ID := stx.find? (·.isOfKind ``Parser.Command.declId) | return
  let newID := prepend ID[0].getId
  let stx' : Syntax := stx.replaceM (m := Id) fun s =>
    if s.getKind == ``Parser.Command.declId then
      some (s.modifyArg 0 (mkIdentFrom · newID))
    else
      none

  let oldState  get
  let (errors?, report, chb) :=  do
    elabCommand ( `(command|#count_heartbeats in $(stx')))
    let msgs := ( get).messages
    let allErrors := ( msgs.reported.toArray.mapM (·.toString))
    let (unknownId, chb) := ( msgs.reportedPlusUnreported.toArray.mapM (·.toString)).partition
      (if let [_, id] := ·.splitOn "unknown identifier" then true else false)
    return (msgs.hasErrors, unknownId, chb)
  set oldState
  let nm  liftCoreM do realizeGlobalConstNoOverloadWithInfo ID[0] <|> return ID[0].getId
  for d in chb do
    if d.startsWith "Used " then
      let num := d.drop "Used ".length |>.takeWhile (!·.isWhitespace) |>.toNat!
      if num  ( warnIfRef.get) then
        logInfoAt ID m!"{.ofConstName nm}: {d}"
      else
        logWarningAt ID m!"{.ofConstName nm}: {d}"
  if !report.isEmpty then
    logErrorAt stx[0] "Probably unreliable results!"

initialize addLinter countHeartbeatsLinter

end CountHeartbeats'

end Mathlib.Linter

Damiano Testa (May 07 2025 at 17:36):

I did not create the #count_heartbeat command, but the linter is always on, so simply importing it in a file should give you the timing.

Damiano Testa (May 07 2025 at 17:36):

I hope that this is "mostly" right now.

Damiano Testa (May 07 2025 at 17:42):

Here is the beginning of its reports on the file that you mentioned in the PR:

sum: Used 872 heartbeats, which is less than the current maximum of 200000.
Basic.lean:94:4
partialSum: Used 873 heartbeats, which is less than the current maximum of 200000.
Basic.lean:98:8
partialSum_continuous: Used 842 heartbeats, which is less than the current maximum of 200000.
Basic.lean:117:4
radius: Used 4428 heartbeats, which is less than the current maximum of 200000.
Basic.lean:121:8
le_radius_of_bound: Used 2483 heartbeats, which is less than the current maximum of 200000.
Basic.lean:126:8
le_radius_of_bound_nnreal: Used 2635 heartbeats, which is less than the current maximum of 200000.
Basic.lean:131:8
le_radius_of_isBigO: Used 2494 heartbeats, which is less than the current maximum of 200000.
Basic.lean:136:8
le_radius_of_eventually_le: Used 2399 heartbeats, which is less than the current maximum of 200000.
Basic.lean:140:8
le_radius_of_summable_nnnorm: Used 2682 heartbeats, which is less than the current maximum of 200000.
Basic.lean:143:8
le_radius_of_summable: Used 2398 heartbeats, which is less than the current maximum of 200000.

Damiano Testa (May 07 2025 at 18:00):

I also added the command #count_at_least n, so that writing #count_at_least 2000 will make the linter emit an info (blue) when the heartbeat count is < 2000 and emit a warning (yellow) otherwise.

Damiano Testa (May 07 2025 at 18:05):

If the linter results are correct, there are exactly 2 declarations (structures, in fact) in that file that take more than 7000 heartbeats and they are the consecutive docs#HasFPowerSeriesOnBall (7389) and docs#HasFPowerSeriesWithinOnBall (7548).

Damiano Testa (May 07 2025 at 18:06):

(And trying #count_heartbeats in manually on these declarations yields essentially the same numbers.)

Kevin Buzzard (May 07 2025 at 18:53):

@Damiano Testa my understanding was that it was changes to core Lean (the work on parallelism within files?), made after your work on #count_heartbeats, which was what stopped it working. I can confirm that this version above, written in a different file and imported, makes #count_heartbeats perform exactly how I wanted! Many thanks!

Kevin Buzzard (May 07 2025 at 19:02):

Can you explain this output (compiling the file on the comnmand line):

info: ././././Mathlib/Analysis/Analytic/Basic.lean:91:14: 'sum' used 765 heartbeats, which is less than the current maximum of 200000.
warning: ././././Mathlib/Analysis/Analytic/Basic.lean:91:14: sum: Used 872 heartbeats, which is less than the current maximum of 200000.
info: ././././Mathlib/Analysis/Analytic/Basic.lean:96:4: 'partialSum' used 765 heartbeats, which is less than the current maximum of 200000.
warning: ././././Mathlib/Analysis/Analytic/Basic.lean:96:4: partialSum: Used 873 heartbeats, which is less than the current maximum of 200000.
info: ././././Mathlib/Analysis/Analytic/Basic.lean:100:8: 'partialSum_continuous' used 765 heartbeats, which is less than the current maximum of 200000.
warning: ././././Mathlib/Analysis/Analytic/Basic.lean:100:8: partialSum_continuous: Used 842 heartbeats, which is less than the current maximum of 200000.
info: ././././Mathlib/Analysis/Analytic/Basic.lean:119:4: 'radius' used 2150 heartbeats, which is less than the current maximum of 200000.
warning: ././././Mathlib/Analysis/Analytic/Basic.lean:119:4: radius: Used 4428 heartbeats, which is less than the current maximum of 200000.
...

I get "warning" and "info" about each declaration, with different numbers (oh is info the old output or something?)

Eric Wieser (May 07 2025 at 19:07):

Ideally we'd use the Firefox profiler diff view here

Eric Wieser (May 07 2025 at 19:08):

Unfortunately Lean writes (slightly) invalid profile files, and the diff view crashes

Eric Wieser (May 07 2025 at 19:09):

(I half fixed this a while ago, but the diff viewer got more picky around the same time it landed in Lean!)

Kevin Buzzard (May 07 2025 at 19:12):

(oh I realised I need to remove #count_heartbeats with Damiano's import!)

Damiano Testa (May 07 2025 at 19:25):

I'm not at a computer now, but, was the double reporting issue due to #count_heartbeats?

Damiano Testa (May 07 2025 at 19:26):

Regarding the failure of the old command, it was never clear to me what the root cause was, and there are usually workarounds for most issues anyways. Hopefully this one will work for longer!


Last updated: Dec 20 2025 at 21:32 UTC