Zulip Chat Archive

Stream: lean4

Topic: instance construction patterns


Matthew Ballard (Oct 03 2023 at 12:55):

We can speed up mathlib by modifying instance construction patterns to align more closely with the inheritance pattern of structures in Lean 4. In particular, we want compact terms which Lean can unify without unfolding. If unfolding is required, we want to make sure we unfold as little as possible.

Small changes can have major impact; for example, #7430 and #7434 by @Chris Hughes are significant speedups.

Chris has a library note in #7432 to explicate changes like these. Reading it, I realize that calls to Function.Injective/Surjective.x make for more unfolding than directly extending existing instances, but my precise understanding is hazy. What is the main driver for the performance penalty we are paying during unification with these?

For example, here is Function.Injective.addMonoidWithOne

@[reducible]
def addMonoidWithOne {M₁} [Zero M₁] [One M₁] [Add M₁] [SMul ℕ M₁] [NatCast M₁]
    [AddMonoidWithOne M₂] (f : M₁ → M₂) (hf : Injective f) (zero : f 0 = 0) (one : f 1 = 1)
    (add : ∀ x y, f (x + y) = f x + f y) (nsmul : ∀ (x) (n : ℕ), f (n • x) = n • f x)
    (nat_cast : ∀ n : ℕ, f n = n) : AddMonoidWithOne M₁ :=
  { hf.addMonoid f zero add nsmul with
    natCast := Nat.cast,
    natCast_zero := hf (by erw [nat_cast, Nat.cast_zero, zero]),
    natCast_succ := fun n => hf (by erw [nat_cast, Nat.cast_succ, add, one, nat_cast]), one := 1 }

Is it just that we need to repeatedly unify all the data-carrying instances like One M₁, Zero M₁, etc., versus them being hidden inside terms that get unified without unfolding under direct inheritance? This aligns with my trace diving, where these seemed to be at the bottom and to cost a nontrivial amount.
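
For concreteness, here is a toy version of the contrast I have in mind (all names made up; this is a sketch of my mental model, not a claim about what the elaborator actually does):

class One' (α : Type) where one : α
class Mul' (α : Type) where mul : α → α → α
class Monoid' (α : Type) extends One' α, Mul' α

instance instMonoidNat : Monoid' Nat := { one := 1, mul := (· * ·) }

def T₁ := Nat
def T₂ := Nat

-- "packed": a single constant, so `packed =?= packed` closes with no unfolding
instance packed : Monoid' T₁ := instMonoidNat

-- "spread": a structure-instance term; unification that meets it is invited to
-- unfold into the parent fields `toOne'` and `toMul'` and compare piecewise
instance spread : Monoid' T₂ :=
  { toOne' := instMonoidNat.toOne', toMul' := instMonoidNat.toMul' }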

@Eric Wieser and @Sebastian Ullrich

Chris Hughes (Oct 03 2023 at 16:38):

The first PR where I noticed these sorts of changes making a significant difference was #6803. It's a fairly good case study, in particular the lemma inf_of_range_eq_base_range. I don't think this lemma unfolds the type PushoutI anywhere in its proof.

Changing the group instance from

instance : Group (PushoutI φ) :=
  { Con.group (PushoutI.con φ) with
    toMonoid := PushoutI.monoid }

to

instance : Group (PushoutI φ) :=
  Con.group (PushoutI.con φ)

worsens the speed of inf_of_range_eq_base_range from around 2s to timing out. Note that the Con.group instance is defined in an optimised way, with the toMonoid field being given by Con.monoid.

It's also interesting that changing the definition of the monoid instance from

protected instance mul : Mul (PushoutI φ) := by
  delta PushoutI; infer_instance

protected instance one : One (PushoutI φ) := by
  delta PushoutI; infer_instance

instance monoid : Monoid (PushoutI φ) :=
  { Con.monoid _ with
    toMul := PushoutI.mul
    toOne := PushoutI.one }

to

instance monoid : Monoid (PushoutI φ) :=
   Con.monoid _

worsens the speed of inf_of_range_eq_base_range from around 2s to around 5s. This is particularly surprising to me, as there are no other instances on PushoutI other than Group, Monoid, Mul and One. I don't really see why this one would make a difference.

I guessed that it was something to do with unfolding Con.group that was slowing things down, but I don't really know. Also, this does not explain why changing the Monoid instance makes a difference.

Matthew Ballard (Oct 03 2023 at 16:51):

A guess for monoid: when you write toMul and toOne explicitly then Lean has to apply enough constructors to get down to those two fields. Perhaps unification sees this "open path" and prioritizes it when it appears on one side of =?= and it has to decide how to unfold things.

Matthew Ballard (Oct 03 2023 at 16:58):

Another possibility: whatever it has to unify monoid with is already expanded in this way.

Chris Hughes (Oct 03 2023 at 19:24):

What trace settings are you using to analyze this?

Matthew Ballard (Oct 03 2023 at 19:25):

I am just guessing

Chris Hughes (Oct 03 2023 at 19:36):

You mentioned trace diving earlier. Which settings did you use for that?

Matthew Ballard (Oct 03 2023 at 20:32):

set_option profiler true in
set_option trace.Meta.synthInstance true in
set_option trace.Meta.isDefEq true in
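-- e.g. chained onto the declaration being investigated (toy target, made up):
example : (1 : Nat) + 1 = 2 := rfl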

Kevin Buzzard (Oct 03 2023 at 21:58):

that's usually what I'm tracing too.

Anatole Dedecker (Oct 03 2023 at 22:05):

It's a different issue but it seems to fall under the name of the thread, so let me report it here: when investigating some slowness of typeclass inference regarding docs#BoundedContinuousFunction, I realized that definitions like docs#BoundedContinuousFunction.instAddBoundedContinuousFunction which use docs#BoundedContinuousFunction.mkOfBound to construct data are bad, because the constant appears as data and Lean doesn't see that it won't be useful in the end, so it spends a lot of time trying to unify the constants... I suppose the same thing could also explain some of the performance problems around docs#ContinuousLinearMap and docs#ContinuousMultilinearMap, where we use this pattern quite frequently too.
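
To make that concrete, here is a toy model (a made-up structure; the real BoundedContinuousFunction keeps the bound inside a proof rather than as a field, but storing it as literal data makes the failure mode easy to see):

structure BddSeq where
  toFun : Nat → Nat
  bound : Nat
  le_bound : ∀ n, toFun n ≤ bound

def mkOfBound' (f : Nat → Nat) (b : Nat) (h : ∀ n, f n ≤ b) : BddSeq :=
  ⟨f, b, h⟩

-- the same function packaged with two different (both valid) bounds:
def s₁ : BddSeq := mkOfBound' (fun _ => 0) 0 (fun _ => Nat.le_refl 0)
def s₂ : BddSeq := mkOfBound' (fun _ => 0) 1 (fun _ => Nat.zero_le 1)
-- `s₁.toFun` and `s₂.toFun` are defeq, but unifying `s₁ =?= s₂` unfolds both
-- and gets stuck comparing `0 =?= 1`: wasted work when only `toFun` matters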

Chris Hughes (Oct 04 2023 at 11:34):

One thing that I didn't realise until today is that the kernel still typechecks the declaration inf_of_range_eq_base_range very fast (0.02s).

Chris Hughes (Oct 04 2023 at 11:49):

So one way of speeding up mathlib might be to basically assume all terms of type Group G are equal during elaboration and tactic execution, and then maybe you get a proof that fails in the kernel if you happen to have two different terms of type Group G. If not exactly that, there are surely optimisations in unification that rely on the assumption that having two different terms of the same class is unusual.

Matthew Ballard (Oct 04 2023 at 12:16):

The observation about the use of docs#BoundedContinuousFunction.mkOfBound is nice and squares with “expose only the minimal amount of data you need” from Function.InjSurj.

Implicit in this question is the fact that “things were fine in Lean 3” (right?). Are there lessons there?

Matthew Ballard (Oct 04 2023 at 12:19):

I guess a refinement is “expose only the minimal amount of data you need to construct data”

Matthew Ballard (Oct 04 2023 at 12:38):

I still don’t understand #6803 though, since it goes in the opposite direction. In the best of all possible worlds, something else is already exposed that shouldn’t be, and fixing that would give better performance overall.

Matthew Ballard (Oct 04 2023 at 17:03):

A data point for the “misalignment” of instances causing slowdowns is RingTheory.Jacobson here. I combined the changes from lean4#2478 and the ones moving away from Function.Injective/Surjective.x. Without the latter, Jacobson slowed significantly with lean4#2478 (~ +40%). With them, it speeds up even more (~ -40%).

Optimistic me hopes that this is the cause of other regressions for changes here.

Chris Hughes (Oct 09 2023 at 13:03):

Matthew Ballard said:

A data point for the “misalignment” of instances causing slowdowns is RingTheory.Jacobson here. I combined the changes from lean4#2478 and the ones moving away from Function.Injective/Surjective.x. Without the latter, Jacobson slowed significantly with lean4#2478 (~ +40%). With them, it speeds up even more (~ -40%).

Optimistic me hopes that this is the cause of other regressions for changes here.

What do you mean by misalignment of instances?

Matthew Ballard (Oct 09 2023 at 13:16):

First, a caveat: my understanding of unification is akin to the apes' understanding of the monolith in 2001.

Given that: lean4#2478 makes a very mild change: in some cases, it replaces A.mk ... with a, where def a : A := sorry. This results in some very surprising speedups for unification. My guess is that unification sees A.mk ... and says "oh hey, this is already unfolded, I should unfold the other side to continue". But the other side is just a itself. Once it unfolds a, it has to recheck all the constituent pieces. With the change it becomes a = a, done.

So misalignment here means two quickly defeq terms appearing in unification with different levels of unfolding.
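
A toy version of the pattern (names made up):

structure A where
  x : Nat

-- before: the use site contains the constructor application itself, so the
-- other side of any `=?=` gets unfolded to match it piece by piece
def before2478 : A := A.mk 0

-- after: the constructor application hides behind one extra name, and
-- `a =?= a` closes syntactically with no unfolding at all
def a : A := A.mk 0
def after2478 : A := a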

Matthew Ballard (Oct 09 2023 at 13:18):

From this experience, I default to "pack the instance as tightly as possible" for constructions (unless there is good reason not to). Then I hope that makes unification happy. :monkey_face:

Chris Hughes (Oct 09 2023 at 13:40):

Well, I'll add another data point. Building all the instances up to Semiring in Polynomial did give a significant speedup in #7520 (with slowdowns in some files). Doing the same thing in FGModuleCat in #7563 made things worse.

Anne Baanen (Oct 11 2023 at 09:41):

Can I steer this discussion in the direction of a conclusion? Right now I am suffering from painful slowdowns in my code, which are fixed by applying these changes. So I would like to advocate for the following pragmatic approach:

  • Finish the library note as best we can. We can make clear this is the current state of knowledge and may be updated or invalidated as we better understand and upgrade Lean 4.
  • In all the speedup PRs, annotate the changes with this library note (see the sketch after this list).
  • Merge the speedup PRs that give a notable improvement in the majority of the cases, while not regressing too badly on the remaining cases.
  • I am happy, because it doesn't take minutes to put a Basis structure on my module.
  • If we know better how to speed up things, we can search for references to the library note and optimize further.
  • If Lean 4 is upgraded so everything just goes fast, we can search for references to the library note and put things back in the "natural" way.

The danger of going this way is that we go for a greedy approach to optimizing when a more reasoned solution can be better overall. However, I believe the current gains are from barely working to working fine, which is a lot more profit than going from working fine to working optimally. Moreover, by being careful to annotate our changes with the library note, we gain a way to back out if we learn new things.
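
For the annotation mechanism, mathlib already has the library_note command; here is a sketch of what the declaration and the references could look like (the note name is a placeholder for whatever #7432 settles on, and the instance is a toy):

import Mathlib

-- declared once (this would live in the file from #7432):
library_note "packed instances" /--
The current state of knowledge on fast instance construction goes here.
-/

-- each optimised instance then carries a reference back to the note:
-- See note [packed instances]
instance : Inhabited Nat := ⟨0⟩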

Eric Wieser (Oct 11 2023 at 09:54):

My fear is that by going for a greedy approach, we lose the ability to evaluate other solutions that come from a change in core, as to test them we first have to revert all the mathlib changes.

Sebastian Ullrich (Oct 11 2023 at 10:10):

change in core

Indeed, lean4#2644 may or may not significantly speed up these kinds of declarations. The current problem with it is that it actually uncovered existing bugs in Lean that mathlib sadly depends on. We will see how much work it is to fix mathlib, but if someone has a completely, or at least somewhat, freestanding example of this kind of issue, it might be good to test it under that PR first.

Anne Baanen (Oct 11 2023 at 10:27):

The problem seems to be that the slowdowns only really appear at scale, so reducing examples is hard. Maybe I should spend some time on an auto-MWE script...

Sebastian Ullrich (Oct 11 2023 at 11:37):

I completely understand, and such a script would be incredibly useful, I'd say. In this case, we can "already" compile at least 400 files in mathlib, so an example from early in mathlib might be sufficient.

Matthew Ballard (Oct 11 2023 at 13:42):

If we want to speed up mathlib now, there is also lean4#2478. Currently we are waiting until we can fix AG to make everything green, but that might take quite a while since the problems run deeper than the changes there. But it does improve performance in many of the same places these changes do (and drops instructions overall by ~4% or so). It is hard to gauge how independent things are when you have to update a separate PR with that toolchain each time in addition to master.

Also, I worry that we are improving things at the expense of creating bigger blockers to more systemic improvements in the future, leaving problems we just "live with". In particular, expanding terms while improving speed seems like something that could bite us.

Matthew Ballard (Oct 11 2023 at 13:43):

lean4#2644 is a great idea!

Anne Baanen (Oct 11 2023 at 15:08):

Anne Baanen said:

The problem seems to be that the slowdowns only really appear at scale, so reducing examples is hard. Maybe I should spend some time on an auto-MWE script...

Okay, I spent some time on an auto-MWE script: https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Minimizer.20script Let's see if we can reduce the DirectLimit field instance...

Matthew Ballard (Oct 11 2023 at 19:17):

Re lean4#2644: I've only seen things become less reducible in the errors. A good place to focus for an MWE would be the docs#WithTop.aux failures in Order.LocallyFinite.lean.

Matthew Ballard (Oct 12 2023 at 00:06):

It looks like lean4#2644 fixes some very broken things in RingTheory.Kaehler

Matthew Ballard (Oct 12 2023 at 00:26):

(benchmark screenshots: before and after)

Matthew Ballard (Oct 12 2023 at 01:40):

Cuts significantly into the build time for AlgebraicGeometry.Morphisms.RingHomProperties. It goes from 7:45 to 6:33 on my machine.

Matthew Ballard (Oct 12 2023 at 02:30):

And RepresentationTheory.GroupCohomology.Basic is free of bumps to heartbeats

Matthew Ballard (Oct 12 2023 at 04:58):

Bench

Matthew Ballard (Oct 12 2023 at 05:00):

Hilarious. Some files have 90% drops

Scott Morrison (Oct 12 2023 at 06:02):

So the explanation for the regressions (requiring erw instead of rw or simp) that we're seeing in either #7606 or @Matthew Ballard's version at #7634, is that the DefEq cache has up to now been cheating. It has been reusing cached isDefEq results obtained with transparency := .default for transparency := .reducible. And Mathlib has been taking advantage of this!

I posted a mostly-minimised example of a simp -> erw regression at https://github.com/leanprover/lean4/pull/2644#issuecomment-1758955555. If anyone would like to try further minimising that it would be helpful!
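
For anyone wanting to see the transparency distinction at stake in isolation, a toy (made-up def):

def n : Nat := 0   -- a plain def, not @[reducible]

-- succeeds: term-mode `rfl` unifies at default transparency, which unfolds `n`
example : n = 0 := rfl

-- fails if uncommented: at reducible transparency `n` does not unfold, so
-- reusing the cached default-transparency success here was the "cheating"
-- example : n = 0 := by with_reducible rfl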

Matthew Ballard (Oct 12 2023 at 12:32):

Reading more carefully, for RingTheory.Kaehler we have compilation 114ms before and compilation 13s after.

Eric Wieser (Oct 12 2023 at 12:35):

@Scott Morrison, could you use ```lean for that code block on github so that it highlights?

Sebastien Gouezel (Oct 12 2023 at 12:50):

If you merge master on the branch, then you will benefit from suppress_compilation in RingTheory.Kaehler, and these 13s of compilation should go away.
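
For reference, using it is a single command near the top of the file; a sketch (blanket Mathlib import for simplicity, toy def below):

import Mathlib

suppress_compilation
-- subsequent defs in this file are now elaborated as if `noncomputable`,
-- so the compiler skips generating code for them
def noCodeGenerated : Nat → Nat := fun n => n + 1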

Matthew Ballard (Oct 12 2023 at 13:11):

I would also like to know more precisely why this is happening

Sebastien Gouezel (Oct 12 2023 at 13:22):

Usually, it's because a declaration becomes fast enough that the compiler doesn't give up right away, and instead compiles things even if it is costly. So, it's a clear sign that things got better!

Matthew Ballard (Oct 12 2023 at 13:26):

I follow it until the C++ code and then I run away. There is a try branch, and it seems the compiler could theoretically just try forever.

Matthew Ballard (Oct 12 2023 at 19:20):

Good news everyone. It looks like lean4#2644 fixes AG enough to make lean4#2478 now improve the wall clock. Here is a benchmark for a toolchain containing both fixes vs current master. We go from an 11% reduction in instructions and a 14.5% reduction in wall clock with lean4#2644 alone to a 14% reduction in instructions and 15% in wall clock with both.

Yury G. Kudryashov (Nov 28 2023 at 20:45):

So, what's the conclusion about instances using Function.Injective.addCommGroup etc?


Last updated: Dec 20 2023 at 11:08 UTC