Zulip Chat Archive

Stream: mathlib4

Topic: Ways to speed up Mathlib


Michael Stoll (Apr 25 2024 at 17:56):

In my own recent PRs and also when reviewing, I started looking at how fast the code builds. I noticed a few potential bottlenecks like

  • slow typeclass inference
  • slow invocations of convert (see here)

To get a clearer picture and to see what can be done to speed things up, I ran the following experiments.

  1. I added (`profiler, true) to the leanOptions in Mathlib's lakefile.lean, did lake exe cache clean and then lake build Mathlib, saving the output to a file (a sketch of this setup follows below the list). This produces a long list of timings (summary timings for various components per file and individual timings for, e.g., typeclass inference if it took at least 100ms on my machine), which I then looked at to extract information about where potential gains could be had.
  2. I picked one of the top ten slow Mathlib files (NumberTheory.RamificationInertia; chosen because it is in NumberTheory, so I figured I could more easily fix stuff if necessary) and tried to speed it up.
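A minimal sketch of the setup in step 1 (only the added option is shown; the other fields of Mathlib's actual lakefile.lean are elided, so treat the exact placement as an assumption):

import Lake
open Lake DSL

package mathlib where
  leanOptions := #[
    ⟨`profiler, true⟩  -- have every build emit the profiler's timing report
  ]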

The result of 2. can be seen at #12412: by a combination of

  • adding local instances
  • squeezing simps
  • filling in _s
  • replacing slow tactics by faster ones

I was able to get a speed-up by a factor of 3.

What I did was to add set_option profiler true at the beginning of the file; then I edited things so that most lines in its output reporting that something took 100ms or more would disappear. If it was a slow typeclass search, I added set_option trace.Meta.synthInstance true in before the relevant declaration to find out exactly which instance it was looking for, then added a shortcut instance for that (with the term that #synth found for me).
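In code, the per-file workflow looked roughly like this (the Module ℝ (ℝ → ℝ) query is only an illustration, not one of the actual searches from #12412):

import Mathlib

set_option profiler true  -- report every step above profiler.threshold

-- for a declaration that the profiler flags as slow, trace its instance searches:
set_option trace.Meta.synthInstance true in
example : Module ℝ (ℝ → ℝ) := inferInstance

-- the trace shows which query was slow; ask for its term once ...
#synth Module ℝ (ℝ → ℝ)

-- ... and record it as a shortcut instance, so later searches in the file are immediate
-- (in the actual edits, the term printed by #synth was pasted in place of inferInstance)
local instance : Module ℝ (ℝ → ℝ) := inferInstance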

The take-away is that by investing some work, slow files can be made quite a bit faster, but this is not an efficient process that can be used on a larger scale (say, to speed up the slowest 200 files by 50% or so).

Michael Stoll (Apr 25 2024 at 18:00):

When I looked at the output of 1., I noticed lots of lines like

interpretation of Mathlib.Prelude.Rename.initFn._@.Mathlib.Mathport.Rename._hyg.247._lambda_3 took 670ms

(with varying timings that had an increasing trend while the build was going on). I figured that this likely comes from the #align_import lines at the beginning of most files. So I wanted to find out what the performance cost of all the #aligns, #noaligns and #align_imports actually is. So I made a branch where I replaced the elaborators (if that's the right term) for these commands by no-ops and benchmarked the result. This is #12410. The result is that we get an overall speed-up of about 3.7%.

Matthew Ballard (Apr 25 2024 at 18:03):

Great! I really appreciate the effort and time put into this.

I do think a world where we have to restate the instances in each file is one where the typeclass system is fundamentally broken.

Michael Stoll (Apr 25 2024 at 18:05):

Today, I did some first data mining on the output of 1. I will give some of the results in the following posts.
The absolute times are (I think) by a factor of maybe 4 or so slower than on the fast machines CI runs on.

I'm first looking at the distribution of build times per file and of the proportions of total time for the various separate tasks the profiler reports.

number of files taking between 1/2 and 1 seconds: 23
number of files taking between 1 and 2 seconds: 237
number of files taking between 2 and 4 seconds: 649
number of files taking between 4 and 8 seconds: 888
number of files taking between 8 and 16 seconds: 838
number of files taking between 16 and 32 seconds: 601
number of files taking between 32 and 64 seconds: 388
number of files taking between 64 and 128 seconds: 144
number of files taking between 128 and 256 seconds: 33
total time for typeclass inference: 24348.2 seconds  (36.55%)
total time for simp: 10286.9 seconds  (15.44%)
total time for interpretation: 7201.30 seconds  (10.81%)
total time for elaboration: 5818.16 seconds  (8.734%)
total time for tactic execution: 5509.02 seconds  (8.271%)
total time for import: 4732.64 seconds  (7.105%)
total time for type checking: 3119.07 seconds  (4.683%)
total time for compilation: 2721.79 seconds  (4.086%)
total time for aesop: 682.934 seconds  (1.025%)
total time for linting: 624.481 seconds  (0.9375%)
total time for dsimp: 468.961 seconds  (0.7040%)
total time for .olean serialization: 385.102 seconds  (0.5781%)
total time for initialization: 321.888 seconds  (0.4832%)
total time for parsing: 181.773 seconds  (0.2729%)
total time for attribute application: 83.3917 seconds  (0.1252%)
total time for ring: 62.1803 seconds  (0.09335%)
total time for norm_num: 42.9837 seconds  (0.06453%)
total time for C code generation: 18.5345 seconds  (0.02783%)

Mario Carneiro (Apr 25 2024 at 18:06):

Michael Stoll said:

When I looked at the output of 1., I noticed lots of lines like

interpretation of Mathlib.Prelude.Rename.initFn._@.Mathlib.Mathport.Rename._hyg.247._lambda_3 took 670ms

(with varying timings that had an increasing trend while the build was going on). I figured that this likely comes from the #align_import lines at the beginning of most files. So I wanted to find out what the performance cost of all the #aligns, #noaligns and #align_imports actually is. So I made a branch where I replaced the elaborators (if that's the right term) for these commands by no-ops and benchmarked the result. This is #12410. The result is that we get an overall speed-up of about 3.7%.

This is most likely the cost of initializing the #align mapping, used to report duplicate alignment errors

Mario Carneiro (Apr 25 2024 at 18:06):

it's lazily initialized so it gets triggered the first time you use one of the #align commands

Michael Stoll (Apr 25 2024 at 18:09):

So typeclass inference takes more than a third of the total time (and simp is next with between 1/7 and 1/6).

Therefore I was looking next at slow typeclass searches. I list the top twenty or so instance types, first according to the total time recorded for all searches that took at least 100ms.

total time for typeclass inference of Subsingleton (> 0.1s): 813.943s
total time for typeclass inference of Nonempty (> 0.1s): 329.758s
total time for typeclass inference of Module (> 0.1s): 291.393s
total time for typeclass inference of SMul (> 0.1s): 234.497s
total time for typeclass inference of ContinuousConstSMul (> 0.1s): 232.057s
total time for typeclass inference of HSMul (> 0.1s): 196.080s
total time for typeclass inference of ZeroHomClass (> 0.1s): 175.237s
total time for typeclass inference of CoeT (> 0.1s): 169.718s
total time for typeclass inference of OneHomClass (> 0.1s): 150.614s
total time for typeclass inference of Algebra (> 0.1s): 149.607s
total time for typeclass inference of CoeFun (> 0.1s): 132.219s
total time for typeclass inference of MulHomClass (> 0.1s): 131.083s
total time for typeclass inference of AddHomClass (> 0.1s): 125.526s
total time for typeclass inference of MulAction (> 0.1s): 123.861s
total time for typeclass inference of AddMonoidHomClass (> 0.1s): 112.021s
total time for typeclass inference of AddCommMonoid (> 0.1s): 84.6660s
total time for typeclass inference of Fintype (> 0.1s): 70.7670s
total time for typeclass inference of CategoryTheory.Limits.HasColimit (> 0.1s): 68.7030s
total time for typeclass inference of CovariantClass (> 0.1s): 53.6410s
total time for typeclass inference of IsScalarTower (> 0.1s): 48.9460s

As a percentage of the total time for all slow instance searches:

proportion for typeclass inference of Subsingleton (> 0.1s): 15.6282%
proportion for typeclass inference of Nonempty (> 0.1s): 6.33154%
proportion for typeclass inference of Module (> 0.1s): 5.59491%
proportion for typeclass inference of SMul (> 0.1s): 4.50247%
proportion for typeclass inference of ContinuousConstSMul (> 0.1s): 4.45563%
proportion for typeclass inference of HSMul (> 0.1s): 3.76485%
proportion for typeclass inference of ZeroHomClass (> 0.1s): 3.36465%
proportion for typeclass inference of CoeT (> 0.1s): 3.25868%
proportion for typeclass inference of OneHomClass (> 0.1s): 2.89187%
proportion for typeclass inference of Algebra (> 0.1s): 2.87254%
proportion for typeclass inference of CoeFun (> 0.1s): 2.53868%
proportion for typeclass inference of MulHomClass (> 0.1s): 2.51687%
proportion for typeclass inference of AddHomClass (> 0.1s): 2.41017%
proportion for typeclass inference of MulAction (> 0.1s): 2.37820%
proportion for typeclass inference of AddMonoidHomClass (> 0.1s): 2.15087%
proportion for typeclass inference of AddCommMonoid (> 0.1s): 1.62564%
proportion for typeclass inference of Fintype (> 0.1s): 1.35877%
proportion for typeclass inference of CategoryTheory.Limits.HasColimit (> 0.1s): 1.31914%
proportion for typeclass inference of CovariantClass (> 0.1s): 1.02994%
proportion for typeclass inference of IsScalarTower (> 0.1s): 0.939791%

Michael Stoll (Apr 25 2024 at 18:10):

And here sorted according to the number of slow instance searches:

total number of typeclass inferences of Subsingleton (> 0.1s): 3487
total number of typeclass inferences of ContinuousConstSMul (> 0.1s): 1919
total number of typeclass inferences of Nonempty (> 0.1s): 1807
total number of typeclass inferences of Module (> 0.1s): 1416
total number of typeclass inferences of SMul (> 0.1s): 1397
total number of typeclass inferences of HSMul (> 0.1s): 1113
total number of typeclass inferences of Algebra (> 0.1s): 875
total number of typeclass inferences of CoeT (> 0.1s): 702
total number of typeclass inferences of ZeroHomClass (> 0.1s): 574
total number of typeclass inferences of CoeFun (> 0.1s): 570
total number of typeclass inferences of MulHomClass (> 0.1s): 527
total number of typeclass inferences of OneHomClass (> 0.1s): 471
total number of typeclass inferences of MulAction (> 0.1s): 464
total number of typeclass inferences of AddMonoidHomClass (> 0.1s): 452
total number of typeclass inferences of AddHomClass (> 0.1s): 440
total number of typeclass inferences of Fintype (> 0.1s): 315
total number of typeclass inferences of AddCommMonoid (> 0.1s): 313
total number of typeclass inferences of CovariantClass (> 0.1s): 237
total number of typeclass inferences of TopologicalSpace (> 0.1s): 223
total number of typeclass inferences of IsScalarTower (> 0.1s): 203

Michael Stoll (Apr 25 2024 at 18:11):

... and by the time for the slowest search:

max time for typeclass inferences of DivisionRing: 6.78s
max time for typeclass inferences of Decidable: 4.74s
max time for typeclass inferences of Module: 4.10s
max time for typeclass inferences of NormedAddCommGroup: 4.07s
max time for typeclass inferences of TopologicalSpace: 3.93s
max time for typeclass inferences of CommSemiring: 3.93s
max time for typeclass inferences of QuasiSober: 3.92s
max time for typeclass inferences of Algebra: 3.87s
max time for typeclass inferences of DecidablePred: 3.87s
max time for typeclass inferences of SeminormedAddGroup: 3.84s
max time for typeclass inferences of AddCommMonoid: 3.79s
max time for typeclass inferences of Inner: 3.78s
max time for typeclass inferences of Norm: 3.76s
max time for typeclass inferences of Sup: 3.73s
max time for typeclass inferences of Semiring: 3.66s
max time for typeclass inferences of CommRing: 3.62s
max time for typeclass inferences of Subsingleton: 3.61s
max time for typeclass inferences of Star: 3.45s
max time for typeclass inferences of AddMonoid: 3.41s
max time for typeclass inferences of AddHomClass: 2.99s

Henrik Böving (Apr 25 2024 at 18:11):

@Sebastian Ullrich has done some work that will hopefully generate a lot more performance insights in a much more visually rich way on the nightly branch if you are interested.

Michael Stoll (Apr 25 2024 at 18:17):

Clearly, Subsingleton dominates the two most relevant measures here. So I looked at this case in more detail:

Typeclass inference of Subsingleton:
number of instances taking between 0.1000 and 0.2000 seconds: 2275
     total time: 342.89s
number of instances taking between 0.2000 and 0.4000 seconds: 885
     total time: 230.85s
number of instances taking between 0.4000 and 0.8000 seconds: 255
     total time: 133.44s
number of instances taking between 0.8000 and 1.600 seconds: 47
     total time: 53.822s
number of instances taking between 1.600 and 3.200 seconds: 23
     total time: 46.040s
number of instances taking between 3.200 and 6.400 seconds: 2
     total time: 6.9000s

I am pretty sure that 99% of all Subsingleton instance searches are triggered by convert (or perhaps congr!); see the thread mentioned already at the beginning. (This is corroborated by the fact that immediately after a bunch of typeclass inference of Subsingleton took ... lines, there is often a line reporting that convert took more than 100ms.) If we assume that Subsingleton instance searches also account for about 15% of the total instance search time (the slow ones on which I have data make up a bit more than 1/5 of the total instance search time), then we end up at an estimate of 5-6% of the build time of Mathlib being taken up by (unsuccessfully) looking for instances of Subsingleton.
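Spelled out, the arithmetic behind that estimate: typeclass inference accounts for about 36.5% of the total build time (see the table above), and Subsingleton is assumed to make up about 15% of that, so roughly 0.15 × 36.5% ≈ 5.5% of the build.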

I think this is very strong motivation for changing congr!/convert so that these searches are avoided.

Michael Stoll (Apr 25 2024 at 18:26):

From the list of top slow instances, it appears that algebra is responsible for a large part:

  • SMul, ContinuousConstSMul and HSMul together: 12.7%
  • ZeroHomClass, OneHomClass, MulHomClass, AddHomClass, AddMonoidHomClass together: 13.3%
  • Module, Algebra, AddCommMonoid and IsScalarTower together: 11.0%

Then there is

  • Nonempty (6.3%), where some tactics might also be responsible
  • CoeT and CoeFun with 5.8%

Michael Stoll (Apr 25 2024 at 18:37):

Matthew Ballard said:

I do think a world where we have to restate the instances in each file is one where the typeclass system is fundamentally broken.

Clearly, Mathlib relies heavily on type classes, so it provides a stress test for the system, one which I think it does not really pass. I think (short of curating instances manually for each file separately) something can be gained by

  • tweaking priorities (this needs a clear picture of which instances are likely to be useful more often than which others),
  • scoping some instances (as in #12080), so that they can be made available if needed, but don't mess up the search by default (this could be an option for instances that are tailored to fairly specific situations or maybe also for instances that relate, say, the algebraic and order hierarchies).

But I don't think there is a solution that makes instance search uniformly fast without some targeted tinkering in a fairly large number of files. IIRC, there is even a paper about this.

Sebastian Ullrich (Apr 25 2024 at 19:32):

Henrik Böving said:

Sebastian Ullrich has done some work that will hopefully generate a lot more performance insights in a much more visually rich way on the nightly branch if you are interested.

Indeed; according to https://profiler.firefox.com/public/d8vedqevhsp763pgpd66afqrz8fabd3zxwp60dr/calltree/?globalTrackOrder=0&invertCallstack&thread=0&timelineType=category&transforms=fg-1&v=10, convert is the 8th most expensive elaborator in Mathlib with 3% of the total time.
image.png

And according to https://profiler.firefox.com/public/d8vedqevhsp763pgpd66afqrz8fabd3zxwp60dr/calltree/?globalTrackOrder=0&thread=0&timelineType=category&transforms=ff-26&v=10, indeed 49% of that time is spent in an app builder's instance synth, though this does not currently print the specific involved instances.
image.png

This is from https://github.com/leanprover/lean4/pull/3801#issuecomment-2034662660

Michael Stoll (Apr 25 2024 at 19:46):

Nice! Is there documentation somewhere on how to extract specific information from it?

Sebastian Ullrich (Apr 25 2024 at 19:58):

The underlying data is simply the cumulative tree of trace nodes over 10ms; the challenge is how to make sense of that data using the generic transformations Firefox Profiler supports. I'm still learning about that part myself; the above is the first time I realized I could use "Focus category" and invert the stack to find the most expensive elaborators.

Matthew Ballard (Apr 25 2024 at 21:25):

aesop is # 4

Filippo A. E. Nuccio (Apr 26 2024 at 01:34):

Michael Stoll said:

  • scoping some instances (as in #12080), so that they can be made available if needed, but don't mess up the search by default (this could be an option for instances that are tailored to fairly specific situations or maybe also for instances that relate, say, the algebraic and order hierarchies).

I think that scoping instances should be triggered by mathematical rather than performance reasons. Otherwise we end up mixing design choices that are somewhat inconsistent, and I find that gaining a factor of 2 or 3 in speed is not worth the loss of design coherence.

Kim Morrison (Apr 26 2024 at 07:14):

Yes, while recognising that this is difficult, it's always better if we can speed these things up in tactic implementations or language implementations rather than filling files with workarounds.

Kim Morrison (Apr 26 2024 at 07:15):

Working out how to make the subsingleton search cheaper would be great. If there were some way to prune branches of the type class search...

Sébastien Gouëzel (Apr 26 2024 at 09:02):

Is a discrimination tree used in typeclass search? I mean, if you're looking for an instance of the form [Subsingleton α] where α is just a generic type then there is nothing to discriminate against, so all instances that always apply have to be searched, but you could eliminate right away all the instances of the form Subsingleton (Foo m n). And conversely if you're looking for an instance of the form Subsingleton (Bar a b), then you could also eliminate right away all instances like Subsingleton (Foo m n) if Bar is not Foo, i.e., you could use a discrimination tree one step further.

(I'm aware that this is very naive, because you would maybe need to reduce Foo or Bar before making this kind of search, and this can be expensive, but I would argue that in this kind of context one shouldn't reduce at all, everything should be purely syntactic).

Eric Wieser (Apr 26 2024 at 09:04):

Is the issue here perhaps that TC search in convert ends up looking for Subsingleton ?m where ?m is an unresolved metavariable? That would then be forced to do the slow wildcard match rather than the fast special case one.

Sebastian Ullrich (Apr 26 2024 at 09:05):

Yes, of course a discrimination tree is used. That's why instances rarely need to use no_index

Sébastien Gouëzel (Apr 26 2024 at 09:06):

If your search is Subsingleton ?m then you can stop right away, right?

Sebastian Ullrich (Apr 26 2024 at 09:08):

If someone could locate a file that has slow converts, I can upload a more detailed profile that shows the specific TC queries

David Loeffler (Apr 26 2024 at 09:13):

Mathlib.NumberTheory.ZetaValues has loads of slow converts

Sébastien Gouëzel (Apr 26 2024 at 09:18):

When trying to find a subsingleton instance on a generic type in a new file importing Mathlib, Lean tries to apply docs#CharP.CharOne.subsingleton and then goes on a wild goose chase trying to find a ring structure on α. Locally deactivating this instance gives me a 10x speedup on this (failing) instance search.

Sébastien Gouëzel (Apr 26 2024 at 09:18):

Let me try to deactivate the instance in mathlib to see if anything breaks.

David Loeffler (Apr 26 2024 at 09:22):

The first slow convert in ZetaValues is at line 69, where it spends 400ms trying to prove Subsingleton (ℝ → ℝ). Switching off Sebastien's CharOne instance speeds this up to about 250ms but that's still slower than one would like.

Jannis Limperg (Apr 26 2024 at 09:32):

Matthew Ballard said:

aesop is # 4

Just FYI, Aesop performance is dominated by simp, so this is sort of another 8 percentage points for simp. But I'm looking into how to reduce the amount of work simp has to do in Aesop.

Michael Stoll (Apr 26 2024 at 09:40):

Let me mention again the thread on slowness of convert in case someone missed it.

I think the better solution would be for convert to only look for Subsingleton instances when there is a reasonable chance it will find one (or make this configurable).

Sebastian Ullrich (Apr 26 2024 at 09:43):

David Loeffler said:

The first slow convert in ZetaValues is at line 69, where it spends 400ms trying to prove Subsingleton (ℝ → ℝ). Switching off Sebastien's CharOne instance speeds this up to about 250ms but that's still slower than one would like.

Indeed, here it is in the detailed profile: https://profiler.firefox.com/public/dr7kkc48x0x4ntt2npx1me6hn0hyjggf3hadpmg/calltree/?globalTrackOrder=0&search=convert&thread=0&timelineType=category&v=10
image.png

Interestingly most of the time is not spent on trying to apply some specific instance so that would require further profiling. That's as far as I'll indulge for now.

Michael Stoll (Apr 26 2024 at 09:48):

Filippo A. E. Nuccio said:

I think that scoping instances should be triggered by mathematical rather than performances reasons. Otherwise we end up mixing design choices that are somewhat inconsistent and I find that a gaining a factor of 2 or 3 in speed is not worth the loss of design coherence.

I would expect that there is not that much difference between the two. In my understanding, slow typeclass searches are caused by failing to find instances that the system tries to find because there are some "relative" instances (i.e., of the kind "if X is an instance of Y, then it is also an instance of Z") that a mathematician would immediately see cannot lead to anything.

Independently, I'm not sure I agree with the last statement: an overall factor of 2 or 3 does make a significant difference that might be worth considering as part of a trade-off, for CI runs as well as when editing Lean files.

Michael Stoll (Apr 26 2024 at 09:56):

With

set_option profiler true
set_option profiler.threshold 1

before hasDerivAt_bernoulliFun, I get the output

typeclass inference of NontriviallyNormedField took 1.38ms
typeclass inference of NormedSpace took 1.54ms
typeclass inference of HSub took 1.89ms
typeclass inference of HSub took 1.91ms
typeclass inference of CoeT took 4.11ms
typeclass inference of HMul took 1.36ms
typeclass inference of CoeT took 1.93ms
typeclass inference of HMul took 1.47ms
typeclass inference of NormedSpace took 8.7ms
typeclass inference of Algebra took 6.5ms
typeclass inference of Subsingleton took 10.4ms
typeclass inference of Subsingleton took 7.3ms
typeclass inference of Subsingleton took 7.62ms
typeclass inference of Subsingleton took 60.8ms
typeclass inference of Subsingleton took 5.84ms
typeclass inference of Subsingleton took 7.03ms
typeclass inference of Subsingleton took 7.14ms
typeclass inference of Algebra took 4.96ms
typeclass inference of Algebra took 2.79ms
typeclass inference of Subsingleton took 166ms
typeclass inference of Subsingleton took 9.67ms
tactic execution of Mathlib.Tactic.convert took 29.7ms

Without setting the threshold to 1, and building from the command line, I get:

typeclass inference of Subsingleton took 136ms
cumulative profiling times:
        attribute application 0.000951ms
        elaboration 4.64ms
        linting 3.35ms
        parsing 0.334ms
        simp 12.6ms
        tactic execution 23ms
        type checking 8.28ms
        typeclass inference 255ms

With set_option trace.Meta.synthInstance true in addition, the convert line produces (expanding the slowest one by one level)

[Meta.synthInstance] [0.012711s] ❌ Subsingleton Prop ▶
[Meta.synthInstance] [0.007315s] ❌ Subsingleton (NormedAddCommGroup F) ▶
[Meta.synthInstance] [0.008023s] ❌ Subsingleton (NormedSpace 𝕜 F) ▶
[Meta.synthInstance] [0.063941s] ❌ Subsingleton (𝕜 → F) ▶
[Meta.synthInstance] [0.006815s] ❌ Subsingleton F ▶
[Meta.synthInstance] [0.008203s] ❌ Subsingleton (NormedAddCommGroup ℝ) ▶
[Meta.synthInstance] [0.007834s] ❌ Subsingleton (NormedSpace ℝ ℝ) ▶
[Meta.synthInstance] [0.204544s] ❌ Subsingleton (ℝ → ℝ) ▼
  [] [0.000017s] new goal Subsingleton (ℝ → ℝ) ▶
  [] [0.000147s] ✅ apply @instSubsingletonForAll to Subsingleton (ℝ → ℝ) ▶
  [] [0.000058s] ❌ apply instSubsingleton to Subsingleton ℝ ▶
  [] [0.000101s] ✅ apply @CharP.CharOne.subsingleton to Subsingleton ℝ ▶
  [] [0.000113s] ✅ apply @Semiring.toNonAssocSemiring to NonAssocSemiring ℝ ▶
  [] [0.000062s] ✅ apply Real.semiring to Semiring ℝ ▶
  [resume] [0.000011s] propagating Semiring ℝ to subgoal Semiring ℝ of NonAssocSemiring ℝ ▶
  [resume] [0.000038s] propagating NonAssocSemiring ℝ to subgoal NonAssocSemiring ℝ of Subsingleton ℝ ▶
  [] [0.000101s] ✅ apply @IdemSemiring.toSemiring to Semiring ℝ ▶
  [] [0.000094s] ✅ apply @KleeneAlgebra.toIdemSemiring to IdemSemiring ℝ ▶
  [] [0.000100s] ✅ apply @IdemCommSemiring.toIdemSemiring to IdemSemiring ℝ ▶
  [] [0.000091s] ✅ apply @StrictOrderedSemiring.toSemiring to Semiring ℝ ▶
  [] [0.000062s] ✅ apply Real.strictOrderedSemiring to StrictOrderedSemiring ℝ ▶
  [resume] [0.000011s] propagating StrictOrderedSemiring ℝ to subgoal StrictOrderedSemiring ℝ of Semiring ℝ ▶
  [] [0.000093s] ✅ apply @LinearOrderedSemiring.toStrictOrderedSemiring to StrictOrderedSemiring ℝ ▶
  [] [0.000066s] ✅ apply Real.instLinearOrderedSemiringReal to LinearOrderedSemiring ℝ ▶
  [resume] [0.000010s] propagating LinearOrderedSemiring ℝ to subgoal LinearOrderedSemiring ℝ of StrictOrderedSemiring ℝ ▶
  [] [0.000093s] ✅ apply @LinearOrderedCommSemiring.toLinearOrderedSemiring to LinearOrderedSemiring ℝ ▶
  [] [0.000131s] ✅ apply @LinearOrderedSemifield.toLinearOrderedCommSemiring to LinearOrderedCommSemiring ℝ ▶
  [] [0.000100s] ✅ apply @CanonicallyLinearOrderedSemifield.toLinearOrderedSemifield to LinearOrderedSemifield ℝ ▶
  [] [0.000101s] ✅ apply @LinearOrderedField.toLinearOrderedSemifield to LinearOrderedSemifield ℝ ▶
  [] [0.000056s] ✅ apply Real.instLinearOrderedField to LinearOrderedField ℝ ▶
  [resume] [0.000010s] propagating LinearOrderedField ℝ to subgoal LinearOrderedField ℝ of LinearOrderedSemifield ℝ ▶
  [resume] [0.000013s] propagating LinearOrderedSemifield ℝ to subgoal LinearOrderedSemifield ℝ of LinearOrderedCommSemiring ℝ ▶
  [resume] [0.000009s] propagating LinearOrderedCommSemiring ℝ to subgoal LinearOrderedCommSemiring ℝ of LinearOrderedSemiring ℝ ▶
  [] [0.000088s] ✅ apply @NormedLinearOrderedField.toLinearOrderedField to LinearOrderedField ℝ ▶
  [] [0.000075s] ✅ apply Real.normedLinearOrderedField to NormedLinearOrderedField ℝ ▶
  [resume] [0.000019s] propagating NormedLinearOrderedField ℝ to subgoal NormedLinearOrderedField ℝ of LinearOrderedField ℝ ▶
  [] [0.000100s] ✅ apply @LinearOrderedCommRing.toLinearOrderedCommSemiring to LinearOrderedCommSemiring ℝ ▶
  [] [0.000056s] ✅ apply Real.linearOrderedCommRing to LinearOrderedCommRing ℝ ▶
  [resume] [0.000010s] propagating LinearOrderedCommRing ℝ to subgoal LinearOrderedCommRing ℝ of LinearOrderedCommSemiring ℝ ▶
  [] [0.000072s] ✅ apply @LinearOrderedField.toLinearOrderedCommRing to LinearOrderedCommRing ℝ ▶
  [resume] [0.000014s] propagating LinearOrderedField ℝ to subgoal LinearOrderedField ℝ of LinearOrderedCommRing ℝ ▶
  [] [0.000091s] ✅ apply @LinearOrderedRing.toLinearOrderedSemiring to LinearOrderedSemiring ℝ ▶
  [] [0.000056s] ✅ apply Real.instLinearOrderedRingReal to LinearOrderedRing ℝ ▶
  [resume] [0.000019s] propagating LinearOrderedRing ℝ to subgoal LinearOrderedRing ℝ of LinearOrderedSemiring ℝ ▶
  [] [0.000104s] ✅ apply @LinearOrderedCommRing.toLinearOrderedRing to LinearOrderedRing ℝ ▶
  [resume] [0.000010s] propagating LinearOrderedCommRing ℝ to subgoal LinearOrderedCommRing ℝ of LinearOrderedRing ℝ ▶
  [] [0.000109s] ✅ apply @StrictOrderedCommSemiring.toStrictOrderedSemiring to StrictOrderedSemiring ℝ ▶
  [] [0.000056s] ✅ apply Real.strictOrderedCommSemiring to StrictOrderedCommSemiring ℝ ▶
  [resume] [0.000010s] propagating StrictOrderedCommSemiring ℝ to subgoal StrictOrderedCommSemiring ℝ of StrictOrderedSemiring ℝ ▶
  [] [0.000073s] ✅ apply @LinearOrderedCommSemiring.toStrictOrderedCommSemiring to StrictOrderedCommSemiring ℝ ▶
  [resume] [0.000010s] propagating LinearOrderedCommSemiring
        ℝ to subgoal LinearOrderedCommSemiring ℝ of StrictOrderedCommSemiring ℝ ▶
  [] [0.000098s] ✅ apply @StrictOrderedCommRing.toStrictOrderedCommSemiring to StrictOrderedCommSemiring ℝ ▶
  [] [0.000055s] ✅ apply Real.instStrictOrderedCommRingReal to StrictOrderedCommRing ℝ ▶
  [resume] [0.000014s] propagating StrictOrderedCommRing ℝ to subgoal StrictOrderedCommRing ℝ of StrictOrderedCommSemiring ℝ ▶
  [] [0.000071s] ✅ apply @LinearOrderedCommRing.toStrictOrderedCommRing to StrictOrderedCommRing ℝ ▶
  [resume] [0.000009s] propagating LinearOrderedCommRing ℝ to subgoal LinearOrderedCommRing ℝ of StrictOrderedCommRing ℝ ▶
  [] [0.000093s] ✅ apply @StrictOrderedRing.toStrictOrderedSemiring to StrictOrderedSemiring ℝ ▶
  [] [0.000056s] ✅ apply Real.strictOrderedRing to StrictOrderedRing ℝ ▶
  [] 1229 more entries... ▶
[Meta.synthInstance] [0.011365s] ❌ Subsingleton ℝ ▶
[Meta.synthInstance] [0.000009s] ❌ Subsingleton ℝ ▶

Sebastian Ullrich (Apr 26 2024 at 10:02):

Oh, then the >10ms filter in my profile was misleading. "1229 more entries" is, of course, not great.

Michael Stoll (Apr 26 2024 at 10:03):

(BTW, when enabling the instance traces, I still get pop-ups saying "Lean server printed an error: 3 53" or similar.)

Kevin Buzzard (Apr 26 2024 at 10:05):

The arithmetic progression error? Yes I still get those too when debugging instance traces. I was just assuming someone would fix it at some point, I thought it was well-known.

Frédéric Dupuis (Apr 26 2024 at 10:05):

Michael Stoll said:

I would expect that there is not that much difference between the two. In my understanding, slow typeclass searches are caused by failing to find instances that the system tries to find because there are some "relative" instances (i.e., of the kind "if X is an instance of Y, then it is also an instance of Z") it follows that a mathematician would immediately see cannot lead to anything.

In the particular case of Subsingleton α, one could argue that what humans do to see that the search can't succeed is to search for Nontrivial α instead, which succeeds very quickly. I wonder if we could somehow include something like this in typeclass search, i.e. under appropriate conditions (not sure what that would be), try to refute the hypothesis by searching for the "opposite" typeclass.
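A quick way to see the asymmetry (illustrative only; ℝ just stands in for a concrete type):

import Mathlib

-- the "opposite" query succeeds almost immediately ...
#synth Nontrivial ℝ

-- ... while the corresponding Subsingleton query fails, and failing means first
-- exhausting every applicable branch of the hierarchy
-- (left commented out, since a failing #synth reports an error):
-- #synth Subsingleton ℝ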

Michael Stoll (Apr 26 2024 at 10:05):

Is there a way to expand one full level of the trace search without cut-off at 50 entries?

Michael Stoll (Apr 26 2024 at 10:07):

Frédéric Dupuis said:

I wonder if we could somehow include something like this in typeclass search, i.e. under appropriate conditions (not sure what that would be), try to refute the hypothesis by searching for the "opposite" typeclass.

There were some discussions on "non-instances": here and here.

Michael Stoll (Apr 26 2024 at 10:10):

I was looking through about half the 1279 or so top-level entries in the trace for ❌ Subsingleton (ℝ → ℝ) and my impression is that each of them is quite fast (on the order of tenths of ms), but 1000 fast things are slow when taken together.

Floris van Doorn (Apr 26 2024 at 10:21):

Sébastien Gouëzel said:

Is a discrimination tree used in typeclass search? I mean, if you're looking for an instance of the form [Subsingleton α] where α is just a generic type then there is nothing to discriminate against, so all instances that always apply have to be searched, but you could eliminate right away all the instances of the form Subsingleton (Foo m n). And conversely if you're looking for an instance of the form Subsingleton (Bar a b), then you could also eliminate right away all instances like Subsingleton (Foo m n) if Bar is not Foo, i.e., you could use a discrimination tree one step further.

(I'm aware that this if very naive, because you would maybe need to reduce Foo or Bar before making this kind of search, and this can be expensive, but I would argue that in this kind of context one shouldn't reduce at all, everything should be purely syntactic).

I'm guessing that most of the time is spent on failing type-class searches, to see if certain goals don't have to be generated. For a failing type-class search, you unfortunately have to search the whole tree of instances.

Kevin Buzzard (Apr 26 2024 at 10:24):

This sounds like the slow failing SMul ℂ ℝ and SMul K (𝓞 K) issues -- the latter has a trace with 3500 omitted lines and there was nothing you could point your finger at; it was just a silly question which no mathematician would ever ask but which took a long time to fail. Setting trace.profiler true just gave a bunch of numbers which didn't add up, so I concluded that the issue just must be typeclass inference asking a huge number of silly questions. It would be amazing to be able to add failing typeclass searches to the instance cache somehow.

Michael Stoll (Apr 26 2024 at 10:27):

I'm looking at this right now:

import Mathlib

set_option profiler true
set_option profiler.threshold 1
set_option trace.Meta.synthInstance true

#synth Subsingleton ℝ

In the trace I see

[Meta.synthInstance] [0.078250s] ❌ Subsingleton ℝ ▼
  [] [0.000041s] new goal Subsingleton ℝ ▼
    [instances] #[@IsEmpty.instSubsingleton, @Unique.instSubsingleton, @CharP.CharOne.subsingleton, instSubsingleton]
  [] [0.000155s] ❌ apply instSubsingleton to Subsingleton ℝ ▶
  [] [0.000262s] ✅ apply @CharP.CharOne.subsingleton to Subsingleton ℝ ▶
  [] [0.000299s] ✅ apply @Semiring.toNonAssocSemiring to NonAssocSemiring ℝ ▶
  [] [0.000202s] ✅ apply Real.semiring to Semiring ℝ ▶
  [resume] [0.000023s] propagating Semiring ℝ to subgoal Semiring ℝ of NonAssocSemiring ℝ ▶
  [resume] [0.000078s] propagating NonAssocSemiring ℝ to subgoal NonAssocSemiring ℝ of Subsingleton ℝ ▼
    [] size: 2
    [] [0.000048s] no instances for CharP ℝ 1 ▶
  [] [0.000246s] ❌ apply @DirectSum.GradeZero.semiring to Semiring ℝ ▶
  [] [0.000244s] ✅ apply @IdemSemiring.toSemiring to Semiring ℝ ▶
  [] [0.000269s] ✅ apply @KleeneAlgebra.toIdemSemiring to IdemSemiring ℝ ▶
  [] [0.000245s] ✅ apply @IdemCommSemiring.toIdemSemiring to IdemSemiring ℝ ▶
  [] [0.000265s] ✅ apply @DivisionSemiring.toSemiring to Semiring ℝ ▶
  [] [0.000290s] ✅ apply @Semifield.toDivisionSemiring to DivisionSemiring ℝ ▶
  [] [0.000302s] ✅ apply @LinearOrderedSemifield.toSemifield to Semifield ℝ ▶
...

Note that ✅ apply Real.semiring to Semiring ℝ establishes Semiring ℝ and therefore also NonAssocSemiring ℝ. Also,

CharP.CharOne.subsingleton {R : Type u}  [NonAssocSemiring R] [CharP R 1] : Subsingleton R

so it now tries to look for an instance for CharP ℝ 1 and fails. But then it seems to continue with Semiring related stuff. Why is that? I.e., why does this not cause the consideration of CharP.CharOne.subsingleton to end and proceeding to see if it can apply Unique.instSubsingleton instead?

Note that the trace continues for a good 400 further entries that appear to visit large parts of the algebraic and order hierarchies before ending with the two lines

  [] [0.000089s] ✅ apply @Unique.instSubsingleton to Subsingleton ℝ ▶
  [] [0.000082s] ✅ apply @IsEmpty.instSubsingleton to Subsingleton ℝ ▶

that I would have expected to immediately follow

    [] [0.000048s] no instances for CharP ℝ 1 ▶

Michael Stoll (Apr 26 2024 at 10:30):

Kevin Buzzard said:

This sounds like the slow failing SMul ℂ ℝ and SMul K (𝓞 K) issues

SMul and friends seem to cause slow typeclass searches in many places (see the data I posted earlier).

Maybe this could be a kind of testing ground for seeing what can be gained by changing priorities and/or scoping?

Floris van Doorn (Apr 26 2024 at 10:33):

I also noticed CharP.CharOne.subsingleton, and I think it looks suspicious as a global instance: it means that we have to navigate a large part of the algebraic hierarchy to check that something is not a subsingleton.

The statement of this lemma is

CharP.CharOne.subsingleton {R : Type*} [NonAssocSemiring R] [CharP R 1] : Subsingleton R

Also, in the Lean 3 community version, I believe we changed the order of search for such instances, from left-to-right to right-to-left, i.e. searching first for CharP R 1 and only then for NonAssocSemiring R (and checking that it's compatible with the implicit class argument on R in CharP). The reason is that the more constraining typeclasses are generally the right-most ones, so the search will fail more quickly.

Was it a conscious decision in Lean 4 to do instance search from left-to-right?

Michael Stoll (Apr 26 2024 at 10:46):

Here is an #mwe for the phenomenon I observed above.

class A (α : Type) where

class B (α : Type) where

class C (α : Type) where

class D (α : Type) where

instance test (α : Type) [B α] [C α] : A α where

instance B_nat : B Nat where

instance DtoB (α : Type) [D α] : B α where

set_option trace.Meta.synthInstance true

#synth A Nat

This gives the fully expanded trace

[Meta.synthInstance] ❌ A Nat ▼
  [] new goal A Nat ▼
    [instances] #[test]
  [] ✅ apply test to A Nat ▼
    [tryResolve] ✅ A Nat ≟ A Nat
    [] new goal B Nat ▼
      [instances] #[DtoB, B_nat]
  [] ✅ apply B_nat to B Nat ▼
    [tryResolve] ✅ B Nat ≟ B Nat
  [resume] propagating B Nat to subgoal B Nat of A Nat ▼
    [] size: 1
    [] no instances for C Nat ▼
      [instances] #[]
  [] ✅ apply DtoB to B Nat ▼
    [tryResolve] ✅ B Nat ≟ B Nat
    [] no instances for D Nat ▼
      [instances] #[]

To find an instance for A Nat, it looks at test and so tries to find instances for B Nat and C Nat. It finds the instance B_nat for the former and then fails to find an instance for the latter. So far, so good.

But why is it then trying again to find an instance for B Nat?

Michael Stoll (Apr 26 2024 at 10:50):

It looks like failing to find a later required instance needed to apply something like test above (think CharP.CharOne.subsingleton) causes a continuation of the (previously already successful!) search for earlier instances.

Michael Stoll (Apr 26 2024 at 10:52):

My understanding of the "philosophy" of the typeclass system is that in any given case, either there is an instance, or else there is none. Which implies that when an instance has been found, there is no need whatsoever to look for another one, and in the example above, the search should stop as soon as no instance for C Nat is found (rather than trying to find another instance for B Nat).

Filippo A. E. Nuccio (Apr 26 2024 at 10:54):

Michael Stoll said:

Filippo A. E. Nuccio said:

I think that scoping instances should be triggered by mathematical rather than performances reasons. Otherwise we end up mixing design choices that are somewhat inconsistent and I find that a gaining a factor of 2 or 3 in speed is not worth the loss of design coherence.

I would expect that there is not that much difference between the two. In my understanding, slow typeclass searches are caused by failing to find instances that the system tries to find because there are some "relative" instances (i.e., of the kind "if X is an instance of Y, then it is also an instance of Z") it follows that a mathematician would immediately see cannot lead to anything.

Independently, I'm not sure I agree with the last statement: an overall factor of 2 or 3 does make a significant difference that might be worth conisdering as part of a trade-off, for CI runs as well as when editing Lean files.

What I mean by "mathematical reasons" is for instance the following. We do not want a global instance of metric space on ℝ^n because there are several possible metrics, but in a certain file we might want to trigger a scoped instance to access all results needing a [MetricSpace X]. An analogous argument applies to not having a Valued instance on ℚ, but using a scoped instance [Valued ℚ] when working with a fixed prime p if we need it.
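For readers who haven't used the mechanism, here is a toy sketch of how such a scoped instance behaves (all names are invented; this is not the actual Mathlib setup):

class Dist' (X : Type) where  -- a stand-in for a metric-like class
  dist : X → X → Nat

namespace TaxicabNat
-- invisible by default; only available after `open scoped TaxicabNat`
scoped instance : Dist' Nat := ⟨fun a b => max a b - min a b⟩
end TaxicabNat

-- #eval Dist'.dist 3 7      -- fails here: no Dist' Nat instance in scope
open scoped TaxicabNat in
#eval Dist'.dist 3 7         -- 4, found via the scoped instance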

Sébastien Gouëzel (Apr 26 2024 at 11:00):

Sébastien Gouëzel said:

Let me try to desactivate the instance in mathlib to see if anything breaks.

Done in #12445: green overall, -1.4% wall clock, -0.94% instructions.

Michael Stoll (Apr 26 2024 at 11:01):

I'm opening a new thread in the lean4 stream on the (what I consider a) bug demonstrated above.

David Loeffler (Apr 26 2024 at 11:01):

We do not want a global instance of metric space on R^n

@FilippoAENuccio I think we do actually have one, and it’s the L1 norm, which is a frequent cause of confusion for users expecting to get the Euclidean distance. But that’s orthogonal (pun not intended!) to the present discussion.

Michael Stoll (Apr 26 2024 at 11:06):

The new thread is here.

Michael Stoll (Apr 26 2024 at 11:07):

My hunch is that fixing this bug will go a loooooong way towards speeding up typeclass search in Mathlib.

Floris van Doorn (Apr 26 2024 at 13:00):

I profiled the following example:

import Mathlib

set_option trace.Meta.synthInstance true
set_option trace.profiler true
set_option profiler true
example (A : Type*) : Subsingleton (Fin 2 → ℝ) := by
  infer_instance

This takes ~0.6s on my laptop. The type-class trace can be found here.

Floris van Doorn (Apr 26 2024 at 13:00):

Some observations:

  • The longest individual steps are 10-30ms, for instances like docs#Field.isDomain, where Lean has to check by unification whether the implicit argument is IsDomain _.
  • I think the type-class inference algorithm is holding up very well despite Mathlib's large type-class hierarchy (even though Michael and I have pointed out some potential improvements). I think it's just unreasonable that, when you ask Lean whether Fin 2 → ℝ has at most one element, Lean has to enumerate all algebraic, order-theoretic and categorical information it has about Fin 2, and even ℝ, just to answer this question. So I think for a large part we need to be more hygienic with instances.

Floris van Doorn (Apr 26 2024 at 13:00):

I think the instances that conclude a non-algebraic result (Finite, Subsingleton, Nontrivial) from an algebraic (or order-theoretic or category-theoretic) one are a bit suspicious, and the reverse is sometimes also suspicious. I would be in favor of scoping many of these instances.
Examples:

Michael Stoll (Apr 26 2024 at 13:05):

Floris van Doorn said:

I think it's just unreasonable that, when you ask Lean whether Fin 2 → ℝ has at most one element, Lean has to enumerate all algebraic, order-theoretic and categorical information it has about Fin 2, and even ℝ, just to answer this question.

This unreasonable behavior is caused by continuing the search for earlier instance arguments after a later one could not be found. I.e., in a situation like

instance foo [A] [B] : C

when looking for an instance C and no instance B is available, it traverses the complete search tree for instances A (say it finds one quickly initially, then does not find B, then it backtracks to find another instance of A etc.).

Sébastien Gouëzel (Apr 26 2024 at 13:15):

Just disabling docs#CharP.CharOne.subsingleton gives an overall 1% improvement on mathlib build time (which is a lot!), see #12445. It would probably be way less if the search were more reasonable as advocated rightly by @Michael Stoll , but still I find that completely crazy.

Michael Stoll (Apr 26 2024 at 13:19):

class A where

class B (n : Nat) where

class C where

instance test [B 10000] [C] : A where

instance Bsucc {n : Nat} [B n] : B n.succ where

instance instB0 : B 0 where

instance instB10000 : B 10000 where

set_option profiler true

set_option trace.Meta.synthInstance true

#synth A

and by increasing 10000 you can make it arbitrarily slow...

Floris van Doorn (Apr 26 2024 at 13:35):

I tried to modify the example to remove all instances that conclude nontrivial (or similar classes) from algebraic, order-theoretic or categorical information. This speeds up the instance search from ~500ms to ~20ms and it goes from ~2500 steps to 40 steps. However, the full list of these instances is very long (partially due to lean4#2905).

Floris van Doorn (Apr 26 2024 at 13:36):

Here is the full list

import Mathlib

set_option trace.Meta.synthInstance true
set_option trace.profiler true
set_option profiler true

attribute [-instance]
  -- miscellaneous suspicious instances
  CharP.CharOne.subsingleton
  littleWedderburn
  CategoryTheory.FinCategory.fintypeObj
  NonemptyFiniteLinearOrder.toFintype
  IsDomain.toCancelCommMonoidWithZero
  ZeroLEOneClass.neZero.two
  NeZero.of_gt'
  Invertible.ne_zero
  -- nontrivial instances
  StrictOrderedRing.toNontrivial
  IsSimpleGroup.toNontrivial
  LocalRing.toNontrivial
  IsSimpleAddGroup.toNontrivial
  IsSimpleOrder.toNontrivial
  EuclideanDomain.toNontrivial
  Field.toNontrivial
  StrictOrderedSemiring.toNontrivial
  instNontrivial
  CanonicallyLinearOrderedSemifield.toNontrivial
  Semifield.toNontrivial
  DivisionRing.toNontrivial
  DivisionSemiring.toNontrivial
  LinearOrderedCommGroupWithZero.toNontrivial
  LinearOrderedAddCommGroupWithTop.toNontrivial
  IsDomain.toNontrivial
  CommGroupWithZero.toNontrivial
  GroupWithZero.toNontrivial
  instNontrivial_1
  -- nonempty instances
  CategoryTheory.IsConnected.is_nonempty
  AddTorsor.nonempty
  bot_nonempty
  top_nonempty
  NonemptyFiniteLinearOrder.Nonempty
  ConnectedSpace.toNonempty
  IrreducibleSpace.toNonempty
  supSet_to_nonempty
  infSet_to_nonempty
  One.instNonempty
  Zero.instNonempty
  -- infinite instances
  IsAlgClosed.instInfinite
  CharZero.infinite
  NontriviallyNormedField.infinite

example (A : Type*) : Subsingleton (Fin 2 → ℝ) := by
  infer_instance

Michael Stoll (Apr 26 2024 at 13:37):

I have made an issue: lean4#3996

Floris van Doorn (Apr 26 2024 at 13:38):

Michael Stoll said:

This unreasonable behavior is caused by continuing the search for instances after later instances have not been found. I.e., in a situation like

instance foo [A] [B] : C

when looking for an instance C and no instance B is available, it traverses the complete search tree for instances A (say it finds one quickly initially, then does not find B, then it backtracks to find another instance of A etc.).

I think this is only a very small part. Instances like GroupWithZero.toNontrivial or One.instNonempty will require you to navigate the whole algebraic type class hierarchy to know that something is not nontrivial/nonempty.

Michael Stoll (Apr 26 2024 at 13:39):

Possibly. It would be interesting to see how much a modification of the typeclass search algorithm would improve things on its own.

Floris van Doorn (Apr 26 2024 at 13:40):

I'm not sure what to do with my giant list of instances. Is it possible to scope them all? (in namespaces like Algebra/Order/CategoryTheory?) Or is that too unintuitive?

Michael Stoll (Apr 26 2024 at 13:41):

Did you have to add them back in in many places?

Floris van Doorn (Apr 26 2024 at 13:41):

I didn't try anything outside my test file.

Floris van Doorn (Apr 26 2024 at 13:41):

I suspect so.

Floris van Doorn (Apr 26 2024 at 13:42):

Michael Stoll said:

Possibly. It would be interesting to see how much a modification of the typeclass search algorithm would improve things on its own.

I think it will be a nice improvement, but not huge. Note that many type-class searches are supposed to fail, and those have to navigate the whole type-class hierarchy anyway. This issue causes some succeeding searches to also navigate the whole type-class hierarchy.
But maybe I'm wrong.

Michael Stoll (Apr 26 2024 at 13:43):

I think with this approach, one has to see which of these potentially problematic instances are needed where, so that one can decide whether it is a good idea to remove them from the list that is available by default.
I.e., if a specific instance is only needed very few times, then this would be a good candidate to get scoped.

Michael Stoll (Apr 26 2024 at 13:44):

My hunch is that without the "bad" backtracking, most of these failing searches would fail much more quickly. But we'll have to see.

Sébastien Gouëzel (Apr 26 2024 at 13:47):

In the Subsingleton (ℝ → ℝ) example, I have seen a lot of backtracking, trying all the ways to synthesize the same instances on the real line again and again. Like @Michael Stoll, I expect that removing the bad backtracking would make things faster by one or two orders of magnitude, because you would need to navigate the whole algebraic hierarchy just once instead of 100 times.

Sébastien Gouëzel (Apr 26 2024 at 13:48):

TLDR: please upvote lean4#3996 :-)

Sébastien Gouëzel (Apr 26 2024 at 13:55):

And in parallel I agree with @Floris van Doorn that all the instances he is flagging are bad -- but it's not clear to me if they are used a lot or not. (Although I guess that the toNontrivial parent instances can not be disabled currently if we keep the extends Nontrivial idiom)

Yuyang Zhao (Apr 26 2024 at 14:36):

I've found many similar examples (like this) before. They definitely shouldn't be this slow (I think they should be several times faster).

Michael Stoll (Apr 26 2024 at 14:41):

You could upvote lean4#3996 if you think this is important :smile:

Yuyang Zhao (Apr 26 2024 at 14:47):

Sure, I have upvoted. :blush:

Anne Baanen (Apr 26 2024 at 15:22):

Floris van Doorn said:

Also, in the Lean 3 community version, I believe we changed the order of search for such instances, from left-to-right to right-to-left

Indeed, Lean 3 works right-to-left:

class A : Type

class B (n : nat) : Type

class C : Type

instance test [B 10000] [C] : A := ⟨⟩

instance Bsucc {n : nat} [B n] : B n.succ := ⟨⟩

instance instB0 : B 0 := ⟨⟩

instance instB10000 : B 10000 := ⟨⟩

set_option trace.class_instances true

example : A := infer_instance
/-
[class_instances]  class-instance resolution trace
[class_instances] (0) ?x_0 : A := @test ?x_1 ?x_2
[class_instances] caching failure for A
[class_instances] caching failure for A
-/

instance foo : C := ⟨⟩
example : A := infer_instance
/-
[class_instances]  class-instance resolution trace
[class_instances] (0) ?x_0 : A := @test ?x_1 ?x_2
[class_instances] (1) ?x_2 : C := foo
[class_instances] caching instance for C
foo
[class_instances] (1) ?x_1 : B 10000 := instB10000
[class_instances] caching instance for B 10000
instB10000
[class_instances] caching instance for A
@test instB10000 foo
-/

Anne Baanen (Apr 26 2024 at 15:24):

I would be very happy if we could tell Lean that searching for a parameter should be delayed until after the others have been found. So for example we could write

class A where

class B (n : Nat) where

class C where

instance test [delay (B 10000)] [C] : A where -- ← Delay the `B` parameter until a `C` instance is found.

instance Bsucc {n : Nat} [B n] : B n.succ where

instance instB0 : B 0 where

instance instB10000 : B 10000 where

set_option trace.Meta.synthInstance true

#synth A

In an even more beautiful solution, we might even write delay (n := 2), delay (n := 3), etc., if we want very precise control over the amount of delaying.

Anne Baanen (Apr 26 2024 at 15:30):

Controlling argument synthesis order would have been very helpful for the FunLike unbundling refactor: https://leanprover.zulipchat.com/#narrow/stream/287929-mathlib4/topic/.60example.20.28p.20.3A.20P.29.20.3A.20Q.20.3A.3D.20p.60.20takes.200.2E25s.20to.20fail!/near/404440825

Mario Carneiro (Apr 26 2024 at 15:32):

The typeclass search order is not just left-to-right, it is customized per class and determined based on the usage of outParam and semiOutParam. You may have seen the error "cannot find synthesization order for instance" before which is the error message when this mechanism fails

Mario Carneiro (Apr 26 2024 at 15:33):

So it may be that we just need additional hints to tweak this mechanism

Anne Baanen (Apr 26 2024 at 15:34):

The most important difference is that I would like to place hints on the instances, while (semi)OutParam operates on classes.

Mario Carneiro (Apr 26 2024 at 15:36):

the synthOrder is per-instance, not per-class

Mario Carneiro (Apr 26 2024 at 15:37):

which is to say, it's possible to implement this hint without changing the main algorithm

Matthew Ballard (Apr 26 2024 at 15:38):

As is curating the state before calling the algorithm, I assume

Anne Baanen (Apr 26 2024 at 15:47):

Mario Carneiro said:

which is to say, it's possible to implement this hint without changing the main algorithm

Oh right, this is done in docs#Lean.Meta.computeSynthOrder. From a quick inspection, it looks like we can sort the array toSynth differently to tweak the order. Thanks!

Anne Baanen (Apr 26 2024 at 16:14):

Okay, doesn't look like I can implement this change and also compile Lean in time to catch my train. Maybe if I get bored the next days I'll come back to this, otherwise please feel free to take up the synth order.

Michael Stoll (Apr 29 2024 at 09:05):

I have tried to summarize my experience with trying to speed up a Mathlib file in #12412 here.

Should (something like) this go somewhere on the community website?

Michael Rothgang (Apr 29 2024 at 09:10):

I would love to have such a resource somewhere - I don't know where to place it, though.

Johan Commelin (Apr 29 2024 at 09:15):

Community website would be good. Or the github wiki of the mathlib repo

Yaël Dillies (Apr 29 2024 at 09:16):

As a blog post?

Johan Commelin (Apr 29 2024 at 09:17):

Yep, another option

Kim Morrison (Apr 29 2024 at 09:32):

Community website is most durable. We should get in the habit of putting more stuff there (and updating more of the out-of-date stuff there!)

Michael Stoll (Apr 29 2024 at 12:04):

I would also think of this as something that should be around (and easy to find) permanently.
What should I do to add it to the website?

Michael Rothgang (Apr 29 2024 at 18:05):

Send a PR to https://github.com/leanprover-community/leanprover-community.github.io, I think.

Michael Stoll (May 01 2024 at 17:56):

Made PR 472, which adds a file speedup.md under templates/contribute.

Michael Stoll (May 03 2024 at 08:48):

I have repeated the data-gathering experiment I described near the beginning of this thread with Mathlib on Lean v4.8.0-rc1. For whatever reason, it took about three times as long as when I did it the first time. (The two runs are not entirely comparable: the first one missed about 370 Mathlib files, whereas the second one also built (at least part of) the dependencies.) I'm going to post the results here and add some comments.

Michael Stoll (May 03 2024 at 08:54):

Here is the data corresponding to what I posted here.

number of files taking between 1/8 and 1/4 seconds: 4
number of files taking between 1/4 and 1/2 seconds: 5
number of files taking between 1/2 and 1 seconds: 9
number of files taking between 1 and 2 seconds: 73
number of files taking between 2 and 4 seconds: 159
number of files taking between 4 and 8 seconds: 581
number of files taking between 8 and 16 seconds: 892
number of files taking between 16 and 32 seconds: 914
number of files taking between 32 and 64 seconds: 712
number of files taking between 64 and 128 seconds: 502
number of files taking between 128 and 256 seconds: 254
number of files taking between 256 and 512 seconds: 62
number of files taking between 512 and 1024 seconds: 8

Comment: The spread got quite a bit larger, but I assume this comes from a bunch of fast files from Std etc. being now included.

total time for typeclass inference: 68071.8 seconds  (36.94%)
total time for simp: 28814.2 seconds  (15.63%)
total time for interpretation: 19346.4 seconds  (10.50%)
total time for elaboration: 16364.4 seconds  (8.879%)
total time for tactic execution: 14899.6 seconds  (8.085%)
total time for import: 12648.0 seconds  (6.863%)
total time for type checking: 8193.12 seconds  (4.446%)
total time for compilation: 7957.97 seconds  (4.318%)
total time for aesop: 1862.84 seconds  (1.011%)
total time for linting: 1685.03 seconds  (0.9143%)
total time for dsimp: 1302.39 seconds  (0.7067%)
total time for .olean serialization: 1051.08 seconds  (0.5703%)
total time for initialization: 997.431 seconds  (0.5412%)
total time for parsing: 510.201 seconds  (0.2768%)
total time for attribute application: 238.467 seconds  (0.1294%)
total time for ring: 176.933 seconds  (0.09600%)
total time for norm_num: 114.681 seconds  (0.06223%)
total time for C code generation: 62.4177 seconds  (0.03387%)

Comment: These proportions have been remarkably stable (differences in the percentages are less than half a percentage point, and the order is the same as before). This is somewhat disappointing, as I expected to see a drop in the fraction of time taken by typeclass inference.

Michael Stoll (May 03 2024 at 09:02):

Now the top 20 typeclass searches, in terms of percentages of the total time spent in typeclass searches taking at least 100ms. (Note that because the overall build was slower, this threshold is relatively lower than before: the fraction of searches above the threshold accounts for 44.32% of the total typeclass search time now, as compared to 21.39% earlier.) Compare with the previous list:

proportion for typeclass inference of SMul (> 0.1s): 12.6338%
proportion for typeclass inference of AddCommMonoid (> 0.1s): 9.28545%
proportion for typeclass inference of Subsingleton (> 0.1s): 8.41692%
proportion for typeclass inference of HSMul (> 0.1s): 5.92720%
proportion for typeclass inference of Nonempty (> 0.1s): 5.00168%
proportion for typeclass inference of MulAction (> 0.1s): 4.33766%
proportion for typeclass inference of CoeT (> 0.1s): 3.83403%
proportion for typeclass inference of Module (> 0.1s): 3.71696%
proportion for typeclass inference of ContinuousConstSMul (> 0.1s): 2.28490%
proportion for typeclass inference of ZeroHomClass (> 0.1s): 1.93563%
proportion for typeclass inference of CoeFun (> 0.1s): 1.88482%
proportion for typeclass inference of Algebra (> 0.1s): 1.63880%
proportion for typeclass inference of OneHomClass (> 0.1s): 1.62345%
proportion for typeclass inference of OfNat (> 0.1s): 1.53243%
proportion for typeclass inference of MulHomClass (> 0.1s): 1.48795%
proportion for typeclass inference of Zero (> 0.1s): 1.44422%
proportion for typeclass inference of HAdd (> 0.1s): 1.39063%
proportion for typeclass inference of AddHomClass (> 0.1s): 1.33502%
proportion for typeclass inference of AddMonoidHomClass (> 0.1s): 1.17227%
proportion for typeclass inference of Fintype (> 0.1s): 1.04527%

Comment: Subsingleton has lost pole position and is now no. 3 with a bit more than half the previous percentage. I would have expected it to drop more. SMul, AddCommMonoid, HSMul, MulAction are taking more now, Module, ContinuousConstSMul and ZeroHomClass less than before. Recall that the overall relative time for typeclass search has stayed essentially the same, so this indicates that certain typeclass searches are now significantly slower.

Kevin Buzzard (May 03 2024 at 10:23):

How long is our favourite example SMul ℂ ℝ taking to fail nowadays? If this has gone up then it wouldn't surprise me at all if mathlib4 is slower.

Michael Stoll (May 03 2024 at 11:30):

This is reasonably fast now (639 heartbeats), but still takes a few detours.
(There should be shortcut instances for things like Zero ℝ ...)
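
For readers who want to see what such a shortcut looks like, here is a minimal sketch (illustrative only; whether Zero ℝ specifically still needs one depends on the instances Mathlib currently has):

import Mathlib

-- `inferInstance` runs the (possibly slow) search once, at this declaration, and records
-- the result as a new top-level instance; later searches for `Zero ℝ` can then hit this
-- instance directly instead of taking the detours mentioned above.
instance : Zero ℝ := inferInstance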

Michael Stoll (May 10 2024 at 08:37):

Prompted by the observation that instance synthesis for algebraic classes like Semiring frequently goes down a rabbit hole chasing order structures, I tried to find out what causes this. My conclusion was that this comes from class definitions like

class OrderedSemiring (α : Type u) extends Semiring α, OrderedAddCommMonoid α where ...

(docs#OrderedSemiring) where an algebraic structure with order is defined by extending an algebraic structure. Among many other things, this has the effect of generating an instance OrderedSemiring.toSemiring with default priority 1000. So all these instances tend to come early in the list of instances that are tried when looking for, say, a Semiring structure (note that the files under Mathlib.Algebra.Order, where these classes are defined, seem to be imported by a very large part of Mathlib), leading the algorithm to go on a hunt for order structures.

The idea now is to try to decouple the algebra+order hierarchy and the pure algebra hierarchy by (preferably) making these instances scoped, so that they do not cause problems in the cases where they slow down instance synthesis, but can be made available when needed. However, this does not seem to work: while the instance attribute can be removed in the file defining the instance, this has no effect on downstream modules. So the next best solution is to reduce the priority to something small (like 50) instead; this persists in downstream modules. In addition, we add the high-priority versions, scoped to AlgebraOrderInstances.
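
For concreteness, here is a self-contained toy model of this approach (the class names are illustrative; the actual changes in #12778 concern the real Mathlib classes and may differ in detail):

class MySemiring (α : Type)
class MyOrderedSemiring (α : Type) extends MySemiring α

-- `extends` generates `MyOrderedSemiring.toMySemiring` as an instance with the default
-- priority 1000; in the file defining the class, demote it to a low priority:
attribute [instance 50] MyOrderedSemiring.toMySemiring

-- ... and provide a high-priority copy behind a scope that downstream files can opt into:
namespace AlgebraOrderInstances

scoped instance (priority := 1000) {α : Type} [MyOrderedSemiring α] : MySemiring α :=
  MyOrderedSemiring.toMySemiring

end AlgebraOrderInstances

-- a downstream file that prefers the old search order writes:
open scoped AlgebraOrderInstances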

I have done this in #12778 to quite good effect: many files build significantly faster; the overall build instruction count goes down by more than 2.5% (still less than what we could save by making #align... a no-op...), and time for typeclass inference goes down by 6.6%. This PR is now awaiting-review; comments are very much appreciated!

Kim Morrison (May 10 2024 at 08:47):

Oh, cool, 2.5% instructions reduction is a big deal!

Yaël Dillies (May 10 2024 at 09:22):

This seems like a really bad way to fix the issue, I must say

Michael Stoll (May 10 2024 at 09:23):

Why?

Yaël Dillies (May 10 2024 at 09:25):

Reading the start of your message, I thought you were going to say that you've turned the algebraic order typeclasses into mixins, as this too would stop wild typeclass goose chases down the algebraic order hierarchy when looking at purely algebraic goals. This is the solution I've come up with but I haven't said anything about it here on Zulip because I first want to try it out myself once my exams are over.

Yaël Dillies (May 10 2024 at 09:26):

But instead you're scoping the paths from the algebraic order hierarchy to the algebraic hierarchy, meaning that users would now need to know the magic open scoped AlgebraOrderInstances invocation in order to use the algebraic order hierarchy at all.

Kim Morrison (May 10 2024 at 09:27):

There's no harm in lowering the priority of these ordered -> algebraic instances in the meantime, is there? When/if someone gets around to trying mixins, the custom priorities will disappear.

Kim Morrison (May 10 2024 at 09:27):

Do we need the scoped instances at all? I'm hoping just lowering priorities suffices.

Kim Morrison (May 10 2024 at 09:27):

(In my reading of Michael's message he didn't actually turn off any instances.)

Michael Stoll (May 10 2024 at 09:27):

Yaël Dillies said:

users would now need to know the magic open scoped AlgebraOrderInstances invocation in order to use the algebraic order hierarchy at all.

As it is, the instances are still available, but with low priority.

Yaël Dillies (May 10 2024 at 09:28):

Wait, I thought we couldn't lower priorities after the fact? Has that changed?

Michael Stoll (May 10 2024 at 09:28):

Kim Morrison said:

Do we need the scoped instances at all? I'm hoping just lowering priorities suffices.

Yes, at least to not make a bunch of files noticeably slower.

Yaël Dillies (May 10 2024 at 09:28):

Then why do we even need the higher priority instances scoped? We should just go back to the way Lean 3 worked by making all always-applicable instances have low priority

Michael Stoll (May 10 2024 at 09:29):

Yaël Dillies said:

Wait, I thought we couldn't lower priorities after the fact? Has that changed?

I don't know if something has changed, but I found out by experiment that you can lower the priority in the file that defines the instance, with lasting effect.

Yaël Dillies (May 10 2024 at 09:29):

Hmm, that's annoying

Kim Morrison (May 10 2024 at 09:29):

What's annoying about that?

Michael Stoll (May 10 2024 at 09:29):

Yaël Dillies said:

Hmm, that's annoying

Not really, because otherwise there would be no way of changing the priority of the automatically generated instances.

Yaël Dillies (May 10 2024 at 09:30):

Yaël Dillies said:

users would now need to know the magic open scoped AlgebraOrderInstances invocation in order to use the algebraic order hierarchy at all.

So it seems :point_of_information: still holds, not because the global instances don't exist but because they have poor performance

Yaël Dillies (May 10 2024 at 09:31):

Kim Morrison said:

What's annoying about that?

This was a reply to
Michael Stoll said:

Yes, at least to not make a bunch of files noticeably slower.

Yaël Dillies (May 10 2024 at 09:32):

Hence I am strongly against merging #12778 before we have done the experiment of turning algebraic order typeclasses into mixins

Michael Stoll (May 10 2024 at 09:32):

Yaël Dillies said:

Yaël Dillies said:

users would now need to know the magic open scoped AlgebraOrderInstances invocation in order to use the algebraic order hierarchy at all.

So it seems :point_of_information: still holds, not because the global instances don't exist but because they have poor performance

I think the problem is that in some cases, you don't want to have the instances (but still need to import the files defining them) and in other cases you do want them. The question is how one can get the algorithm to make the correct selection automatically in all (or most) cases.

Yaël Dillies (May 10 2024 at 09:33):

Yes so mixins do exactly that

Michael Stoll (May 10 2024 at 09:33):

Yaël Dillies said:

Hence I am strongly against merging #12778 before we have done the experiment of turning algebraic order typeclasses into mixins

I'm fine with trying to first find out what works better.

Yaël Dillies (May 10 2024 at 09:34):

You have an algebraic layer, an order layer and an algebraic order layer. Typeclass synthesis will only search through one layer and never try moving to the next layer

Yaël Dillies (May 10 2024 at 09:35):

The cost is that the user needs to declare one instance per layer

Michael Stoll (May 10 2024 at 09:36):

I.e., an algebraic structure, an order structure, and some compatibility statements?
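
For concreteness, a minimal self-contained sketch of that layering (the class and field names below are purely illustrative, not an actual Mathlib proposal):

-- stand-ins for the purely algebraic and purely order-theoretic layers
class MyAddMonoid (α : Type) extends Add α
class MyPartialOrder (α : Type) extends LE α

-- the compatibility layer is a `Prop`-valued mixin taking the other two layers as
-- arguments, so a purely algebraic instance search never wanders into the order hierarchy
class AddLeCompat (α : Type) [Add α] [LE α] : Prop where
  add_le_add_left : ∀ a b : α, a ≤ b → ∀ c : α, c + a ≤ c + b

-- the cost: users write one instance argument per layer instead of one bundled class
example {α : Type} [MyAddMonoid α] [MyPartialOrder α] [AddLeCompat α]
    (a b c : α) (h : a ≤ b) : c + a ≤ c + b :=
  AddLeCompat.add_le_add_left a b h c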

Michael Stoll (May 10 2024 at 09:38):

Can we have notation that would expand [LinearOrderedField F] to [Field F] [LinearOrder F] [??? F], where [??? F] denotes the compatibility class?

Yaël Dillies (May 10 2024 at 09:39):

Yes, this was discussed plenty already. Look for [[ ]] notation

Andrew Yang (May 10 2024 at 09:48):

Is there a link? I can’t figure out how to search for special symbols in Zulip.

Michael Stoll (May 10 2024 at 10:06):

In any case, my experiment gives an indication of what can be gained by decoupling the hierarchies, so I hope it provides additional motivation for looking into the mix-in refactor :smile:

Riccardo Brasca (May 10 2024 at 10:34):

Andrew Yang said:

Is there a link? I can’t figure out how to search for special symbols in Zulip.

It started here, but there is no link to the actual PR.

Riccardo Brasca (May 10 2024 at 10:36):

#3162

Damiano Testa (May 10 2024 at 11:57):

Note that there is also a parallel and related proposal of splitting typeclasses that carry data from typeclasses that express properties of the data.

The latest time this came up was at some monthly meeting, I think.

Edit: this is the summary.

Yuyang Zhao (May 10 2024 at 12:51):

I tried turning them into mixins, but failed in #6326. I've recently been looking into the reason why mixins become slower.

Matthew Ballard (May 10 2024 at 13:11):

It is pretty simple for me to re-run the downgrade of the parent-instance priorities, for either or both of the preferred and non-preferred ones.

Should I do so with the changes since the last attempt? If so, what do people want to see for priorities?

Michael Stoll (May 10 2024 at 18:49):

Here is my summary on what I have learned from #12778.

  • Decoupling the algebra and order hierarchies has significant potential for speeding up Mathlib. Even the somewhat ham-handed approach used in #12778 leads to a larger than 2.5% reduction of the instruction count for building Mathlib and a gain of about 8% in typeclass search.
  • As Yaël has said earlier, the approach taken with scoping high-priority versions of the instances relating the hierarchies is a bit annoying to use: I added the line open scoped AlgebraOrderInstances to about 100 files (chosen based on the speed center output, where it looked like it might help).
  • Having a more targeted solution that always (or almost always) does the right thing should give even more of a speed-up. The version in #12778 decides on a per-file basis whether the connecting instances should have high or low priority, but it is quite possible that the best choice varies within a file. (One could, of course, apply the open scoped in a more local way, but finding out precisely where this should be done is more work than I would be willing to do.)

The upshot is, I am very much looking forward to the refactor Yaël proposes! @Yaël Dillies How would it differ from the attempt in #6326 mentioned by @Yuyang Zhao above?

Yaël Dillies (May 10 2024 at 21:02):

#6326 only applies the change to one typeclass. It's not clear to me that you can reap the benefits of the approach without swapping over all the classes at once

Jovan Gerbscheid (May 11 2024 at 18:34):

I wonder if it would help to cache all intermediate type-class search results? For example:

set_option trace.Meta.synthInstance true
run_meta do
  let i ← trySynthInstance q(AddGroup Nat)
  let i ← trySynthInstance q(AddCommGroup Nat)

After the first type class search that took 5ms, we store in the cache that AddGroup Nat has no instances. Along the way we figure out that there is also no instance for AddCommGroup Nat (and 100 other non-instances), but this is not cached. So when checking any of these instances, the goose chase needs to be played from the start, in this case taking another 4ms.

Michael Stoll (May 11 2024 at 18:41):

I had asked a similar question in a previous discussion (I think, but I can't find it easily). I think it would be interesting to try this. I imagine that it may not be so easy to implement, as instances may need the local context to make sense.

Jovan Gerbscheid (May 11 2024 at 18:44):

This is already dealt with in the current caching:

abbrev SynthInstanceCache := PersistentHashMap (LocalInstances × Expr) (Option Expr)

Jovan Gerbscheid (May 11 2024 at 18:59):

I think it should be quite straightforward to implement. The set of local instances is fixed at the start and doesn't change during type class resolution.

For each generator node in the tabled type class resolution, if the goal does not contain any assignable metavariables (as of two weeks ago this is tracked in the field typeHasMVars), then we can look it up in the cache. Along the way we keep an array of these nodes together with their outcomes. At the end we simply add them all to the global cache.

Michael Stoll (May 11 2024 at 19:01):

Ah, OK; my question was asking a bit more: save the cached results for later typeclass searches even outside the current declaration, within a file ("module"). This may be more involved, perhaps...

Jovan Gerbscheid (May 11 2024 at 19:03):

Ah right, you weren't talking about local instances in the sense of instance arguments, but in the sense of instances defined in the current file

Michael Stoll (May 11 2024 at 19:05):

Not necessarily defined in the current file. Say we have variable {F} [Field F]; then we will likely often need something like Semiring F or Zero F, and it would be nice if the search needed to be done only once.

But I guess that this wouldn't work, since variables are looked at afresh each time they are used...
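
To make the scenario concrete, a small sketch (nothing new here; the point is just that each example below runs its own search starting from the Field F hypothesis):

import Mathlib

variable {F : Type*} [Field F]

-- With per-declaration caching only, each of these declarations repeats the search
-- from `[Field F]` down to the requested class.
example : Semiring F := inferInstance
example : Zero F := inferInstance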

Jovan Gerbscheid (May 11 2024 at 19:09):

An obvious way to extend the cache would be to keep the cache between declarations IF no instances have been added in the meantime, which is orthogonal to whether we cache intermediate results

Jovan Gerbscheid (May 11 2024 at 19:12):

And if variable statements yielded the same free variables each time, then this would work even better, but apparently that is not the case.

Jovan Gerbscheid (May 11 2024 at 19:23):

I think the aggressive caching that I'm suggesting will be particularly useful for my library rewrite tactic rw??, because it has to try a large number of type class problems with different type classes, but the same types. For example, when running it on a + b + c for natural numbers, it will try (among many others) the lemmas add_comm and add_assoc. This means it will try to synthesize many different type classes for Nat, and so it does a lot of duplicate work.

The only downside could be that the extra caching may or may not cost too much time/memory.

Michael Stoll (May 11 2024 at 19:50):

I guess the only way to find out is to try it...

Michael Stoll (May 11 2024 at 20:10):

Here is a Mathlib-free example that shows that also intermediate instances that have been found (as opposed to non-instances) are not cached.

set_option profiler true
set_option profiler.threshold 1

class A (n : Nat) where

instance foo {n : Nat} [A n.succ] : A n where

def baz (k : Nat) [A k] : Nat := 37

set_option trace.Meta.synthInstance true
theorem xyz [A 20] : baz 0 = baz 1 := rfl

This first synthesizes A 0 (by backtracking up to the provided instance A 20); on the way it sees A 1, but then A 1 gets synthesized from scratch again for baz 1.

Replacing baz 1 by baz 0 shows that the end result of the first search does get cached and re-used.

Jovan Gerbscheid (May 13 2024 at 01:22):

It seems to me lines 726-728 do redundant work: https://github.com/leanprover/lean4/blob/f74980ccee82ca2abdae65dcbc5571d4640ed076/src/Lean/Meta/SynthInstance.lean#L726
If an instance has been retrieved from the cache, then there is no need to check again that it applies. (also assignOutParams is marked as let rec without being recursive)

Jovan Gerbscheid (May 15 2024 at 20:25):

How do I get to benchmark my Lean PR on Mathlib? lean4#4152

Matthew Ballard (May 15 2024 at 20:26):

Do you want to benchmark it against master or nightly-testing?

Matthew Ballard (May 15 2024 at 20:32):

For the former, cherry pick your commits on top of v4.8.0-rc1 (or whatever is the current mathlib toolchain). Release copies of your custom toolchain on your fork. Update the lean-toolchain on your mathlib branch to that toolchain. Fix errors and clear lints. Push and create a draft PR. Comment !bench.

For the latter, just rebase on nightly-with-mathlib and let the automation take over. Create a draft PR from the generated mathlib branch after fixing errors and clearing lints. Comment !bench.

Matthew Ballard (May 15 2024 at 20:34):

Though, right now, the benchmark server is taking a nap

Jovan Gerbscheid (May 15 2024 at 21:01):

Does it make a difference which one I choose? I believe I had rebased to nightly-with-mathlib at the start. Where do I find the generated mathlib branch?

Matthew Ballard (May 15 2024 at 21:27):

branch#lean-pr-testing-4152 is where it would be

Matthew Ballard (May 15 2024 at 21:31):

The bot still wants you to rebase on another commit to branch off nightly-with-mathlib

Jovan Gerbscheid (May 16 2024 at 13:00):

So I've now tried to get this mathlib branch on my laptop, but I get error: binary package was not provided for 'windows'. How do I get around this?

Jovan Gerbscheid (May 16 2024 at 13:13):

Ah, I seem to be able to do this on Gitpod

Matthew Ballard (May 16 2024 at 13:15):

The full-ci tag should be applied to the Lean PR if you want builds for all architectures/OSes.

Matthew Ballard (May 16 2024 at 13:28):

From the manual

It is also possible to generate releases that others can use, simply by pushing a tag to your fork of the Lean 4 github repository (and waiting about an hour; check the Actions tab for completion). If you push my-tag to a fork in your github account my_name, you can then put my_name/lean4:my-tag in your lean-toolchain file in a project using lake. (You must use a tag name that does not start with a numeral, or contain _).

If you don't want to keep bugging the core team for things like this

Matthew Ballard (May 16 2024 at 13:29):

Just make sure you have permissions set correctly for the actions or they will fail after an hour and you will be annoyed

Joachim Breitner (May 16 2024 at 14:42):

Added full-ci to that PR so that the next PR release contains windows binaries. If you use these PR releases locally,

elan toolchain uninstall $(cat lean-toolchain)

is a magic incantation after CI has built a new one :-)

Jovan Gerbscheid (May 16 2024 at 22:15):

At this application of the function fetchOrCreate, it synthesizes an instance of MonadStore1. This type class has an outParam, but there are multiple instances available that give different values for the outParam. The current algorithm is very lucky to find the right one, because there are two instances directly applicable (call them inst1 and inst2), such that inst1 is what is in the final solution, but if we set inst1 to a high priority, then the synthesis fails. So the existence of inst2 pushes the type class search in the "right" direction without actually being used.

My version failed here, causing the build to fail :melting_face:

Jovan Gerbscheid (May 16 2024 at 22:15):

I tried replacing the outParam by a semiOutParam, but this didn't work

Jovan Gerbscheid (May 16 2024 at 22:22):

The two instances in question are Lake.instMonadStore1OfMonadDStore and Lake.instMonadStore1OfMonadDStoreOfFamilyOut respectively

Matthew Ballard (May 16 2024 at 22:32):

Can one be deleted in master?

Jovan Gerbscheid (May 16 2024 at 23:06):

Well, deleting the (here) unused instance will make the type class synthesis fail

Jovan Gerbscheid (May 17 2024 at 00:06):

The reason that there are multiple possible instances is that there is a stack of multiple StateT monad transformers, and the type class search is expected to figure out which one to use.

Jovan Gerbscheid (May 17 2024 at 01:11):

Here's an interesting minimalization:

open Lean

class MonadStore1 {κ : Type u} (k : κ) (α : outParam $ Type v) (m : Type v → Type w) where

class MonadDStore (κ : Type u) (β : outParam $ κ → Type v) (m : Type v → Type w) where

instance inst [i : MonadDStore κ β m] : MonadStore1 k (β k) m where

variable [i : MonadDStore Name (fun _ => β) m]

set_option trace.Meta.isDefEq true

#synth MonadStore1 `hoihoi β m

#check inst (k := `hoihoi) (i := i) -- inst : MonadStore1 `hoihoi β m

this synthesis succeeds, but if you remove the outParam, or replace it by a semiOutParam, then it fails, with a very surprising unification failure. This must have been the reason why this outParam was put there in the first place, even though it doesn't really belong there.

Jovan Gerbscheid (May 17 2024 at 02:24):

I solved the problem by changing the outParams into semiOutParams, and adding the following two instances:

instance (priority := high) [MonadStore κ β m] : MonadStore1 (k : κ) β m where
  fetch? := MonadDStore.fetch? (β := fun _ => β) k
  store o := MonadDStore.store (β := fun _ => β) k o

instance [MonadLift m n] [MonadStore κ β m] : MonadStore κ β n where
  fetch? k := liftM (m := m) <| fetch? k
  store k a := liftM (m := m) <| store k a

These are non-dependent/constant copies of equivalent instances of the dependent MonadDStore. (Recall abbrev MonadStore κ α m := MonadDStore κ (fun _ => α) m). Unification is having trouble applying the general MonadDStore instances in the non-dependent case. It would be nice if unification in type class search could be configured in a way where the original instances would apply automatically, e.g. in the example above.

Matthew Ballard (May 17 2024 at 09:07):

Reminiscent of #7905.

Matthew Ballard (May 17 2024 at 09:08):

In particular, https://github.com/leanprover-community/mathlib4/pull/7905#issuecomment-1793562341

Jovan Gerbscheid (May 17 2024 at 10:44):

I'm not sure if it is the same. In that case it was just about speedup, right? In this case the dependent instance will simply not unify (unless we illegally mark an outParam)

Matthew Ballard (May 17 2024 at 10:49):

Sure but the underlying cause was ineffective beta reduction to go from the dependent version to the non-dependent one

Jovan Gerbscheid (May 17 2024 at 11:22):

Here the cause was a failed unification when applying the dependent version

[Meta.isDefEq]  MonadStore1 `hoihoi β m =?= MonadStore1 ?m.206 (?m.204 ?m.206) ?m.205 
  []  Lean.Name =?= ?m.203 
  []  `hoihoi =?= ?m.206 
  []  β =?= ?m.204 `hoihoi 
    [] β [nonassignable] =?= ?m.204 `hoihoi [assignable]
    [foApprox] ?m.204 [`hoihoi] := β
    [constApprox] ?m.204 [`hoihoi] := β
  [onFailure]  MonadStore1 `hoihoi β m =?= MonadStore1 ?m.206 (?m.204 ?m.206) ?m.205
  [onFailure]  MonadStore1 `hoihoi β m =?= MonadStore1 ?m.206 (?m.204 ?m.206) ?m.205

Jovan Gerbscheid (May 17 2024 at 13:07):

I encountered a very surprising slow failing type class search in Mathlib. The function synthInstance? runs withNewMCtxDepth, which means that metavariables that are already in the type class expression will not be instantiated during the type class search. However, in this example, this rule is somehow broken. It tries to synthesize CommRing ?m.2404, and somehow manages to instantiate the metavariable. It then goes on a wild search for 1 second until failing.

I've minimized the example a bit:

import Mathlib.AlgebraicGeometry.Morphisms.RingHomProperties
universe u

open CategoryTheory Opposite TopologicalSpace CategoryTheory.Limits AlgebraicGeometry

namespace RingHom.PropertyIsLocal

variable {P} (hP : RingHom.PropertyIsLocal @P)

set_option trace.Meta.synthInstance true
set_option profiler true

theorem sourceAffineLocally_of_source_openCover' {X Y : Scheme.{u}} (f : X ⟶ Y) [IsAffine Y]
    (𝒰 : X.OpenCover) [∀ i, IsAffine (𝒰.obj i)] (H : ∀ i, P (Scheme.Γ.map (𝒰.map i ≫ f).op)) :
    sourceAffineLocally (@P) f := by
  let S i := (⟨⟨Set.range (𝒰.map i).1.base, (𝒰.IsOpen i).base_open.isOpen_range⟩,
    rangeIsAffineOpenOfOpenImmersion (𝒰.map i)⟩ : X.affineOpens)
  intro U
  -- Porting note: here is what we are eliminating into Lean
  apply of_affine_open_cover
    (P := fun V => P (Scheme.Γ.map (X.ofRestrict (Opens.openEmbedding V.val) ≫ f).op)) U
  · intro U r H
    -- Porting note: failing on instance synthesis for an (unspecified) meta variable
    -- made φ explicit and forced to use dsimp in the proof
    convert hP.StableUnderComposition
      (S := Scheme.Γ.obj (Opposite.op (X.restrict <| Opens.openEmbedding U.val)))
      (T := Scheme.Γ.obj (Opposite.op (X.restrict <| Opens.openEmbedding (X.basicOpen r))))
      ?_ ?_ H ?_ using 1
    sorry
    sorry
    -- Porting note: need to pass Algebra through explicitly
    convert @HoldsForLocalizationAway _ hP _
      (Scheme.Γ.obj (Opposite.op (X.restrict (X.basicOpen r).openEmbedding))) _ _ ?_
      (X.presheaf.map (eqToHom U.1.openEmbedding_obj_top).op r) ?_
    sorry
    sorry
    sorry
  sorry
  sorry
  sorry
  sorry

Jovan Gerbscheid (May 17 2024 at 13:23):

It is possible that this problem is caused by an oversight in this function:

/--
Return true if `mvarId.isReadOnly` return true or if `mvarId` is a synthetic opaque metavariable.

Recall `isDefEq` will not assign a value to `mvarId` if `mvarId.isReadOnlyOrSyntheticOpaque`.
-/
def _root_.Lean.MVarId.isReadOnlyOrSyntheticOpaque (mvarId : MVarId) : MetaM Bool := do
  let mvarDecl ← mvarId.getDecl
  match mvarDecl.kind with
  | MetavarKind.syntheticOpaque => return !(← getConfig).assignSyntheticOpaque
  | _ => return mvarDecl.depth != (← getMCtx).depth

When a variable is syntheticOpaque, and lives at a lower depth, and assignSyntheticOpaque := true, then the variable is read-only, but this function will return false.
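
For concreteness, a sketch of the check with the depth test done first (only an illustration of the suggested fix, not necessarily the actual change made in lean4#4206):

import Lean
open Lean Meta

def isReadOnlyOrSyntheticOpaque' (mvarId : MVarId) : MetaM Bool := do
  let mvarDecl ← mvarId.getDecl
  -- a metavariable from an outer metavariable-context depth is read-only regardless of
  -- its kind and of `assignSyntheticOpaque`
  if mvarDecl.depth != (← getMCtx).depth then
    return true
  match mvarDecl.kind with
  | .syntheticOpaque => return !(← getConfig).assignSyntheticOpaque
  | _ => return false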

Matthew Ballard (May 17 2024 at 13:26):

Is assignSyntheticOpaque := true at that point in the example?

Jovan Gerbscheid (May 17 2024 at 13:26):

I'm not sure how to check that

Matthew Ballard (May 17 2024 at 13:28):

Modify the toolchain to dump the config in the trace also?

Jovan Gerbscheid (May 17 2024 at 13:45):

Ok, I'll make a PR for it

Matthew Ballard (May 17 2024 at 13:46):

You can usually do it much faster on your local machine using elan override

Jovan Gerbscheid (May 17 2024 at 13:57):

I haven't yet figured out how to build Lean on my laptop

Matthew Ballard (May 17 2024 at 13:58):

OS?

Jovan Gerbscheid (May 17 2024 at 13:59):

Windows

Matthew Ballard (May 17 2024 at 13:59):

WSL?

Jovan Gerbscheid (May 17 2024 at 14:00):

I don't think I have it

Richard Osborn (May 17 2024 at 14:01):

Should be as simple as typing wsl --install in the terminal

Jovan Gerbscheid (May 17 2024 at 14:03):

Cool, I've installed it

Jovan Gerbscheid (May 17 2024 at 14:05):

How do I use elan override?

Matthew Ballard (May 17 2024 at 14:06):

Brb in 15

Richard Osborn (May 17 2024 at 14:06):

I should also mention there's a wsl vscode extension which allows you to run your project inside wsl. Afterwards, you can install the lean vscode extension in wsl and you should be able to build from the command line within vscode

Matthew Ballard (May 17 2024 at 14:32):

  • Fork leanprover/lean4 and pull a local copy.
  • You probably want to git switch --detach v4.8.0-rc1 and create a new branch so that the changes are built on top of something that mathlib runs on.
  • mkdir -p build/release from the root of the repo.
  • cd build/release
  • cmake ../..
  • make -jN where N is the number of cores for your processor. You can just make also but this speeds it up quite a bit.
  • (optional) make test ARGS=-jN for running the test suite
  • elan toolchain link some_name stage1
  • cd back into a local copy of mathlib
  • elan override set some_name
  • lake build

Matthew Ballard (May 17 2024 at 14:33):

After this you should be using the local toolchain to re-build mathlib

Matthew Ballard (May 17 2024 at 14:34):

Someone who develops on Windows might be able to give a similarly nitty-gritty set of steps. But I haven't tried Windows proper (= non-WSL)

Jovan Gerbscheid (May 17 2024 at 19:07):

After a lot of perseverance, I managed to do it. But I still think making a PR is faster/more efficient, because building mathlib locally is very slow. Anyways, I can confirm that the metavariable in question has MetavarKind.syntheticOpaque and that the Config has assignSyntheticOpaque := true, confirming my hypothesis. The whole config is:

{ foApprox := true,
  ctxApprox := true,
  quasiPatternApprox := false,
  constApprox := false,
  isDefEqStuckEx := false,
  unificationHints := true,
  proofIrrelevance := true,
  assignSyntheticOpaque := true,
  offsetCnstrs := true,
  transparency := Lean.Meta.TransparencyMode.default,
  trackZetaDelta := false,
  etaStruct := Lean.Meta.EtaStructMode.all,
  univApprox := true }

Matthew Ballard (May 17 2024 at 19:10):

Yeah, that is certainly a bug

Matthew Ballard (May 17 2024 at 19:11):

You can also push a tag to your fork and use GitHub CI to build the toolchains. Then you can use that toolchain in Mathlib CI if your local machine takes too much time.

Jovan Gerbscheid (May 17 2024 at 19:15):

How do tags work?

Matthew Ballard (May 17 2024 at 19:18):

  • git tag your_tag $(git rev-parse --short HEAD) makes the tag with the commit hash at your head
  • git push origin your_tag pushes the tag to GitHub
  • Then GitHub CI (inherited from the leanprover/lean4) takes over and builds toolchains, assuming the permissions for actions are set
  • Once this is done, you can use github_username/lean4:your_tag in the lean-toolchain
  • Pushing this to your mathlib branch after your fork's CI finishes and releases the toolchains will start mathlib CI

Matthew Ballard (May 17 2024 at 19:25):

But at this point, things are about as fast as making a core PR branching off nightly-with-mathlib assuming you don't need all the toolchains.

Jovan Gerbscheid (May 17 2024 at 20:32):

I've created pull requests for the two problems that I solved today: lean4#4205 and lean4#4206.

Patrick Massot (May 17 2024 at 20:35):

This is great but I fear the form of those PRs will not be very successful. Did you carefully read https://github.com/leanprover/lean4/blob/master/CONTRIBUTING.md?

Patrick Massot (May 17 2024 at 20:35):

Note that the process is very different from what is used by Mathlib.

Matthew Ballard (May 17 2024 at 20:37):

They are documented better than lean#4126 :)

Patrick Massot (May 17 2024 at 20:38):

You should probably at least add tests (that fail on current master) to improve them.

Patrick Massot (May 17 2024 at 20:39):

lean#4126 probably counts as being self-documenting.

Jovan Gerbscheid (May 17 2024 at 20:39):

Should I also make an issue for the pull requests?

Matthew Ballard (May 17 2024 at 20:52):

I would make issues because I can imagine other possible fixes, eg finding the underlying issue with dependent vs non-dependent signature unification or breaking up readOnlyOrSyntheticOpaque into separate functions.

Patrick Massot (May 17 2024 at 20:52):

Note that I am not part of the Lean core dev team. I am only worried that valuable contributions (and potentially lots of future valuable contributions) could be lost if you don’t follow the contributing guidelines.

Patrick Massot (May 17 2024 at 20:54):

And there are good reasons why the external contributions guidelines are so demanding. Writing tests in particular is a very standard requirement in any software project.

Jovan Gerbscheid (May 17 2024 at 21:44):

I have now added a small test for both PR's

Patrick Massot (May 17 2024 at 21:49):

FYI, in the MonadStore test, only the second one fails on 4.8.0-rc1 on my computer.

Jovan Gerbscheid (May 17 2024 at 21:49):

Yes, that is correct, maybe I should rephrase the comment. It is able to find exactly one instance due to the outParam

Patrick Massot (May 17 2024 at 21:50):

Ok, I think the comment is indeed misleading (or maybe my knowledge of the English language is misleading me).

Jovan Gerbscheid (May 17 2024 at 22:38):

It actually turns out that duplicating these specific instances isn't necessary, and the following instance can be added instead to help with the non-dependent case:

/-- The constant type family -/
instance : FamilyDef (fun _ => β) a β where
  family_key_eq_type := rfl

Jovan Gerbscheid (May 17 2024 at 22:52):

I've updated the PR description explaining this

Mac Malone (May 18 2024 at 02:16):

I was just pinged by lean4#4205 and discovered this thread by searching for MonadStore on Zulip. :laughing: Could I get a run-down of what the problem is? Unfortunately, this thread is quite long and my skim through it did not enlighten me sufficiently about how this issue with Monad(D)Store(1) is causing trouble. I am quite surprised anyone but me is interacting with the Lake internals! :shocked:

Jovan Gerbscheid (May 18 2024 at 02:19):

It originated from me editing the instance synthesis algorithm. My version failed on a synthesis of MonadStore1, and then I found out that this wasn't a problem with my implementation, but that this type class has an outParam that can have multiple different values.

Jovan Gerbscheid (May 18 2024 at 02:22):

It then turned out that simply turning it into a semiOutParam didn't work due to unification issues, but adding the above extra instance solves it.

Mac Malone (May 18 2024 at 02:23):

What I am confused about here is why that is troublesome. Having multiple resolution paths for an outParam is quite common. For instance, all the Monad* classes (MonadExcept, MonadState, etc.) have a [Monad*Of] : Monad* instance. Thus, if there are multiple valid Monad*Of instances, there will be different values for the outParam.

For example, StateT S1 <| StateT S2 Id will have both a MonadState S1 and a MonadState S2 possible resolution and the synthesis algorithm always chooses the outermost, latest appearing one (which is relied upon quite heavily in monad code).

Jovan Gerbscheid (May 18 2024 at 02:25):

The funny thing is that in this particular case, the instance that was generated and used corresponds to the innermost StateT in a stack of StateTs

Jovan Gerbscheid (May 18 2024 at 02:27):

The Monad*Of classes all have semiOutParams, exactly because they can have multiple different instances.

Jovan Gerbscheid (May 18 2024 at 02:29):

At the very start of type class search, outParams are replaced by metavariables, and at the end it checks that these are instantiated correctly. Thus having multiple different possibilities is problematic.

Mac Malone (May 18 2024 at 02:29):

Jovan Gerbscheid said:

The Monad*Of classes all have semiOutParams, exactly because they can have multiple different instances.

Yes, but they are used to produce the outParam Monad* instances, thus there are multiple possibilities there as well.

Mac Malone (May 18 2024 at 02:30):

A mistake I have not yet had the time to correct with the MonadStore classes is that they should also follow the pattern of the other Monad* classes and have both a semiOutParam *Of version and an outParam version. It had not been a pressing refactor, though.

Mac Malone (May 18 2024 at 02:33):

To demonstrate how this multiple-possibility problem also shows up for the other monad classes, consider these examples:

example [Monad m] : StateT σ₁ (StateT σₐ m) σ₁ := MonadState.get -- works
example [Monad m] : StateT σ₁ (StateT σ₂ m) σ₂ := MonadState.get -- errors
example [Monad m] : StateT σ₁ (StateT σₐ m) σ₁ := MonadStateOf.get -- works

Jovan Gerbscheid (May 18 2024 at 02:36):

fetchOrCreate is the example that I encountered, and it needs the Monad*Of variant, but it got lucky that the type class synthesis decided to choose the innermost StateT.

Jovan Gerbscheid (May 18 2024 at 02:37):

(Like in your second example)

Jovan Gerbscheid (May 18 2024 at 02:39):

But I believe that when you get around to doing this refactor, you will run into the same issue I did, and then this extra instance for the constant type family will be necessary.

Mac Malone (May 18 2024 at 02:40):

@Jovan Gerbscheid I agree with the extra FamilyDef instance. The problem is that we do want to preserve an outParam variant for inferring the type of the value when using the fetch?/store methods (e.g., just like get does).

Jovan Gerbscheid (May 18 2024 at 02:40):

The reason why it works with the outParam, is because of the metavariables that are put in place for outParams, and this affects the unification during type class search.

Mac Malone (May 18 2024 at 02:53):

@Jovan Gerbscheid I think my confusion is that, as far as I understand, there is supposed to be a defined order for which instance gets synthesized first, even in the case of outParam. Thus, this should always reliably produce the same instance. As in the example above, where the outermost state is always chosen due to the ordering of MonadState instances for lifting and transformers.

Jovan Gerbscheid (May 18 2024 at 02:57):

the language is deterministic, so in some sense it can be relied upon; however, in this particular case something weird happens: we have instances

instance [MonadDStore κ β m] : MonadStore1 k (β k) m
instance [MonadDStore κ β m] [FamilyOut β k α] : MonadStore1 k α m

and the second one is tried first, and the first one eventually succeeds. If we remove the second one or reduce its priority, then the order magically changes, and the wrong instance is synthesised.

Jovan Gerbscheid (May 18 2024 at 03:00):

And if we were to rely on an order I'd want it to be outer-first just like in StateT

Jovan Gerbscheid (May 18 2024 at 03:03):

By the way, the order in the case of MonadExcept seems to be a bit inconsistent. In a stack of ExceptTs, it picks the outer one, but if I wrap it around MetaM, it picks up the instance from MetaM:

example (e : ε₁) : ExceptT ε₁ MetaM PUnit := throw e -- error

Mac Malone (May 18 2024 at 03:05):

Yes, that was the point I was trying to make. :wink: It has similar conceptual problems to MonadStore; they are just usually avoided in core because the instances have been ordered to work as desired. They can pop up in downstream use, though.

Mac Malone (May 18 2024 at 03:06):

I agree, however, that these are real problems.

Mac Malone (May 18 2024 at 03:08):

For example, the current outermost ordering for transformer instances is just a by-product of how instance synthesis happens to decide between the lift and transformer instances. If we change this (via priorities), we can switch the ordering:

namespace MyLean

class MonadStateOf (σ : semiOutParam (Type u)) (m : Type u → Type v) where
  get : m σ
  set : σ → m PUnit
  modifyGet {α : Type u} : (σ → Prod α σ) → m α

export MonadStateOf (set)

/-- Similar to `MonadStateOf`, but `σ` is an `outParam` for convenience. -/
class MonadState (σ : outParam (Type u)) (m : Type u → Type v) where
  get : m σ
  set : σ → m PUnit
  modifyGet {α : Type u} : (σ → Prod α σ) → m α

export MonadState (get modifyGet)

instance [MonadStateOf σ m] : MonadState σ m where
  set         := MonadStateOf.set
  get         := MonadStateOf.get
  modifyGet f := MonadStateOf.modifyGet f

section Outermost

instance [MonadLift m n] [MonadStateOf σ m] : MonadStateOf σ n where
  get         := liftM (m := m) MonadStateOf.get
  set       s := liftM (m := m) (MonadStateOf.set s)
  modifyGet f := monadLift (m := m) (MonadState.modifyGet f)

instance [Monad m] : MonadStateOf σ (StateT σ m) where
  get       := StateT.get
  set       := StateT.set
  modifyGet := StateT.modifyGet

example [Monad m] : StateT σ₁ (StateT σₐ m) σ₁ := get -- works
example [Monad m] : StateT σ₁ (StateT σ₂ m) σ₂ := get -- errors

end

section Innermost

instance [Monad m] : MonadStateOf σ (StateT σ m) where
  get       := StateT.get
  set       := StateT.set
  modifyGet := StateT.modifyGet

instance (priority := high) [MonadLift m n] [MonadStateOf σ m] : MonadStateOf σ n where
  get         := liftM (m := m) MonadStateOf.get
  set       s := liftM (m := m) (MonadStateOf.set s)
  modifyGet f := monadLift (m := m) (MonadState.modifyGet f)

example [Monad m] : StateT σ₁ (StateT σₐ m) σ₁ := get -- errors
example [Monad m] : StateT σ₁ (StateT σ₂ m) σ₂ := get -- works

end

Jovan Gerbscheid (May 18 2024 at 03:09):

I have been frustrated myself about this before. This is caused by MonadError.toMonadExceptOf being tried first. I think it would be a fix to put a high instance priority on the relevant instance.

Edit: instead, there should be a low priority for the instance that allows you to pass through an ExceptT.

Jovan Gerbscheid (May 18 2024 at 03:15):

So why is it not the case that the lift instance has a low priority?

Mac Malone (May 18 2024 at 03:18):

Jovan Gerbscheid said:

So why is it not the case that the lift instance has a low priority?

I unfortunately do not know, the other Lean developers would have to chime in. If you think it should be, it might be worth making an issue about this. It seems like it would be a good idea to properly clarify the intended design around outParams, instance priorites, and synthesis order.

Mac Malone (May 18 2024 at 03:20):

My two main points about the current design, from my understanding of it:

  • It is expected that there will be multiple resolution paths for an outParam. However, it is also expected that only one will be consistently followed based on the current ordering rules.
  • If you are proposing to change the synthesis order, this will break many things that can rely on this. Ideally, there should be a way to recover the existing order users relied on (e.g., via priority annotations) without losing existing features in the process (such as downgrading to a semiOutParam and losing the inference provided by outParam).

Mac Malone (May 18 2024 at 03:22):

Also, I am perfectly happy to change fetchOrCreate / MonadStore to follow a new order. I have no particular love for the current one, so I personally have no complaints with any of this.

Jovan Gerbscheid (May 18 2024 at 03:24):

It is only under very specific circumstances that going from outParam to semiOutParam is a downgrade in terms of synthesis, namely when isDefEq fails to unify two expressions that can be unified.

Jovan Gerbscheid (May 18 2024 at 03:26):

Hmm, that's not true, the other benefit of outParams is that they also succeed in synthesizing when the parameter is still a metavariable.

Mac Malone (May 18 2024 at 03:29):

Yes, they are used to assign metavariables via synthesis.

Mac Malone (May 18 2024 at 03:32):

For example:

example [Monad m] : StateT IO.Error m String := do
  return toString (← get) -- works due to `outParam` in `MonadState`

example [Monad m] : StateT IO.Error m String := do
  return toString (← MonadStateOf.get) -- errors due to metavariable

Jovan Gerbscheid (May 18 2024 at 04:00):

I've made an issue for it: lean4#4212

Jovan Gerbscheid (May 18 2024 at 13:13):

lean4#4206 breaks a couple of proofs in Mathlib, and I've fixed those on the corresponding mathlib branch. The benchmarking shows a decrease of 0.3% in the amount of build instructions. I've extended the documentation on the Lean PR. Should I still make an issue where I just copy this comment?

Jovan Gerbscheid (May 18 2024 at 13:19):

one proof was fixed by removing 2 lines that were added during the port :)

Richard Osborn (May 18 2024 at 13:22):

It does seem like convert and refine' are fairly fragile tactics. Do we know why they need extra unification hints after the change? Also, does refine do a better job figuring things out?

Jovan Gerbscheid (May 18 2024 at 13:30):

Both tactics call elabTermEnsuringType, which in turn calls withAssignableSyntheticOpaque <| isDefEq eType expectedType, and this bug appears when assignSyntheticOpaque := true. In the congr tactic, called by convert, there is also try withAssignableSyntheticOpaque mvarId.refl. The bug is basically that it would instantiate metavariables that it wasn't supposed to, which sometimes can be helpful :sweat_smile:

Jovan Gerbscheid (May 18 2024 at 13:39):

I looked at the refine' proofs again, and I've now replaced them all with either exact or apply

Jovan Gerbscheid (May 18 2024 at 13:51):

For convert, in most cases the issue was that there were type class problems in which the metavariables weren't instantiated yet, so adding these explicitly allows the instances to be synthesized. I don't know if it is possible to modify the tactic to improve this. But there were also some more complicated cases where I had no clue how the original tactic managed to prove the goal, and I had to add a non-trivial amount to the proof

Jovan Gerbscheid (May 18 2024 at 22:37):

@Mac Malone I found out instead of the constant type family, it also suffices to have the general type family instance:

/-- The general type family -/
instance : FamilyDef Fam a (Fam a) where
  family_key_eq_type := rfl

The cool thing about this is that this means we can get rid of the non-type-family instance

instance [MonadDStore κ β m] : MonadStore1 k (β k) m

simplifying the situation.

This type class FamilyDef Fam a β strongly reminds me of Equality constraints in Haskell, which allow you to impose that two types are equal. That is exactly what we're doing here: enforcing that β k and α are equal. I wonder if such a design would make sense to use in Lean. I think it is much more intuitive to read this:

instance [MonadDStore κ β m] [β k ~ α] : MonadStore1 k α m

compared to this:

instance [MonadDStore κ β m] [FamilyOut β k α] : MonadStore1 k α m

Mac Malone (May 19 2024 at 02:18):

Jovan Gerbscheid said:

The cool thing about this is that this means we can get rid of the non-type-family instance

instance [MonadDStore κ β m] : MonadStore1 k (β k) m

It is my understanding that it is best to avoid cast where possible in Lean, so being able to use the instance without it when possible is desirable.

Jovan Gerbscheid (May 19 2024 at 02:27):

Ooh, interesting, so in that case it would actually be better to go back to not even having the constant instance for FamilyDef, and instead have the direct instance

instance [MonadDStore κ (fun _ => α) m] : MonadStore1 k α m

Jovan Gerbscheid (May 19 2024 at 02:30):

I suppose that is then also a reason why implementing equality constraints à la Haskell isn't ideal in Lean, because the cast function causes some overhead? Do you know what this overhead/drawback is?

Mario Carneiro (May 19 2024 at 10:21):

I don't think there is any such overhead, cast is compiled to a no-op. It makes proofs harder but I don't think this is relevant for Lake.

Jovan Gerbscheid (May 19 2024 at 12:49):

Also, is there a reason to use cast over ▸ (in the case that the equality needs to be substituted)? h ▸ x seems a lot simpler than cast (by rw [h]) x

Mario Carneiro (May 19 2024 at 12:54):

Not that I know of. Again, it makes a difference for proofs about the resulting object but lake doesn't do any such proofs so it doesn't matter

Ruben Van de Velde (May 19 2024 at 12:54):

cast works better with simp, as I understand it. But that might not be relevant in your case

Jovan Gerbscheid (May 19 2024 at 21:35):

@Mac Malone , I copied the definition of MonadStore1 into MonadStore1Of. MonadStore1 has α as an outParam and exports the fetch? function; MonadStore1Of has α as a semiOutParam, exports store, and is used in fetchOrCreate. This follows the design of get being the only thing exported from MonadState. This is because we only care about the outParam being synthesized when it can't be synthesized by normal type inference, such as in get. (So in the current situation, MonadStore1 isn't used anywhere because fetch? isn't used anywhere)

Mac Malone (May 19 2024 at 22:26):

@Jovan Gerbscheid fetchOrCreate uses fetch? (and store as you mentioned)

Jovan Gerbscheid (May 19 2024 at 22:28):

Sorry you're right, I was thinking of fetchOrCreate as being sort of part of the definition. fetch? is not used anywhere else.

Mac Malone (May 19 2024 at 22:30):

Correct, and fetchOrCreate uses the outParam type class to help infer α.

Mac Malone (May 19 2024 at 22:33):

Relatedly, fetchOrCreate is a good example of a place where you sometimes want help inferring the result and a place where you sometimes already know it. Depending on the use case you either want an outParam or semiOutParam, which is sadly not possible in the current inference setup (just like with get).

Jovan Gerbscheid (May 19 2024 at 22:33):

I thought the outParam is not so useful there because it can be inferred from either the output type or the argument

Mac Malone (May 19 2024 at 22:34):

It can be, assuming it is not a metavariable already in both of those cases.

Jovan Gerbscheid (May 19 2024 at 22:35):

In the case of get, your code can look like this:

def .. := do
  let s ← get
  ...

and you want to know the type of s before looking at the stuff below it.

Jovan Gerbscheid (May 19 2024 at 22:37):

But with fetchOrCreate, I don't see a scenario where it is really something you want. That is assuming that you write out the type of your function, which is always good to do.

Mac Malone (May 19 2024 at 22:38):

Similarly, with fetchOrCreate your code could look like

def .. := do
  let v <- fetchOrCreate k (fromJson? ...)
  ...

In which case you want fetchOrCreate to determine what the type fromJson? should return.

Jovan Gerbscheid (May 19 2024 at 22:39):

Hmm, the same problem appears in set, which also uses a semiOutParam

def .. := do
  set (fromJson? ...)
  ...

Mac Malone (May 19 2024 at 22:40):

Yeah, it does.

Jovan Gerbscheid (May 19 2024 at 22:42):

So in an ideal world, for set/fetchOrCreate, Lean would attempt type class synthesis with either an outParam or a semiOutParam and return the result of whichever one works?

Mac Malone (May 19 2024 at 22:46):

I think, ideally, outParam (or some variant) would work like outParam in the presence of metavariables but like semiOutParam when the type is known. I think that would fix the problem. However, I am not the authority on type class synthesis within the team, so such a change/addition would likely need to go through an RFC process.

Mac Malone (May 19 2024 at 22:46):

In the meantime, I think you are right, that according to standards of the Lean core, fetchOrCreate should use the semiOutParam variant of the class.

Jovan Gerbscheid (May 19 2024 at 22:48):

In the current situation, type class synthesis isn't allowed to instantiate any metavariables that appear in the query, except for arguments that are outParams. But for outParams it goes quite far, namely completely ignoring what was there originally. We need a mix: allowing metavariables in the outParam to be instantiated without ignoring what was already there.

Jovan Gerbscheid (May 19 2024 at 22:49):

So I agree with your point

Jovan Gerbscheid (May 19 2024 at 22:52):

It gets trickier when there are shared metavariables between the different arguments, like the goal MonadState ?s (StateT ?s ?m). In that case the ?s should "not count" as a metavariable because it appears in the other argument.

Jovan Gerbscheid (May 19 2024 at 22:56):

Although even the current system doesn't work well with that: I think that MonadState ?s (StateT Nat (StateT ?s ?m)) will instantiate ?s := Nat :rolling_on_the_floor_laughing:

Jovan Gerbscheid (May 19 2024 at 23:04):

Maybe the notions of outParam and semiOutParam could even be unified with this mix that has the best of both worlds.
If so, then that would also be helpful for new people confused about the difference.

Jovan Gerbscheid (May 20 2024 at 02:18):

I've made an issue on the topic of merging outParam and semiOutParam: lean4#4225

Jovan Gerbscheid (May 21 2024 at 12:51):

After fixing the metavariable assignment bug and the MonadStore instance, here (#13079) is the first result of the effect of caching intermediate type class results. Type class synthesis decreased by 8%, and Mathlib as a whole by 3%.

  Benchmark                                                          Metric                Change
  ===============================================================================================
+ build                                                              typeclass inference    -8.3%
+ ~Mathlib.Algebra.Algebra.NonUnitalSubalgebra                       instructions           -7.4%
+ ~Mathlib.Algebra.Algebra.Unitization                               instructions          -18.4%
+ ~Mathlib.Algebra.BigOperators.Finsupp                              instructions          -19.0%
+ ~Mathlib.Algebra.Lie.Weights.Killing                               instructions           -6.4%
+ ~Mathlib.Algebra.Module.LinearMap.Basic                            instructions           -9.6%
+ ~Mathlib.Algebra.Module.LocalizedModule                            instructions           -8.8%
+ ~Mathlib.Algebra.Ring.CentroidHom                                  instructions           -8.5%
+ ~Mathlib.Algebra.Star.NonUnitalSubalgebra                          instructions          -10.4%
+ ~Mathlib.Algebra.TrivSqZeroExt                                     instructions          -20.4%
+ ~Mathlib.Analysis.Analytic.Basic                                   instructions           -5.1%
+ ~Mathlib.Analysis.Calculus.ContDiff.Basic                          instructions           -6.0%
+ ~Mathlib.Analysis.Calculus.ContDiff.Defs                           instructions           -5.7%
+ ~Mathlib.Analysis.Calculus.Deriv.Mul                               instructions           -8.1%
+ ~Mathlib.Analysis.Calculus.FDeriv.Mul                              instructions           -5.6%
+ ~Mathlib.Analysis.Calculus.FDeriv.Symmetric                        instructions          -18.4%
+ ~Mathlib.Analysis.Calculus.Implicit                                instructions          -11.3%
+ ~Mathlib.Analysis.Convex.Function                                  instructions          -25.5%
+ ~Mathlib.Analysis.Convolution                                      instructions           -3.2%
+ ~Mathlib.Analysis.Fourier.FourierTransformDeriv                    instructions           -5.3%
+ ~Mathlib.Analysis.InnerProductSpace.Basic                          instructions           -6.2%
+ ~Mathlib.Analysis.InnerProductSpace.OfNorm                         instructions          -17.7%
+ ~Mathlib.Analysis.InnerProductSpace.Projection                     instructions           -6.3%
+ ~Mathlib.Analysis.NormedSpace.Banach                               instructions          -10.6%
+ ~Mathlib.Analysis.NormedSpace.Multilinear.Basic                    instructions           -8.5%
+ ~Mathlib.Analysis.NormedSpace.Multilinear.Curry                    instructions           -8.3%
+ ~Mathlib.Analysis.NormedSpace.OperatorNorm.Basic                   instructions          -11.2%
+ ~Mathlib.Analysis.NormedSpace.OperatorNorm.Bilinear                instructions          -13.4%
+ ~Mathlib.Analysis.NormedSpace.Star.Multiplier                      instructions           -8.0%
+ ~Mathlib.Analysis.NormedSpace.TrivSqZeroExt                        instructions          -13.4%
+ ~Mathlib.Analysis.NormedSpace.lpSpace                              instructions           -4.2%
+ ~Mathlib.Analysis.Seminorm                                         instructions           -9.1%
+ ~Mathlib.Analysis.SpecialFunctions.Pow.Deriv                       instructions           -8.5%
+ ~Mathlib.CategoryTheory.Bicategory.Adjunction                      instructions          -19.0%
+ ~Mathlib.CategoryTheory.Monoidal.Braided.Basic                     instructions          -20.4%
+ ~Mathlib.CategoryTheory.Monoidal.Rigid.Basic                       instructions          -10.7%
+ ~Mathlib.Data.Num.Lemmas                                           instructions          -13.9%
+ ~Mathlib.Geometry.Manifold.ContMDiff.NormedSpace                   instructions          -11.2%
+ ~Mathlib.GroupTheory.MonoidLocalization                            instructions           -9.6%
+ ~Mathlib.LinearAlgebra.BilinearMap                                 instructions          -14.5%
+ ~Mathlib.LinearAlgebra.Matrix.SchurComplement                      instructions          -22.0%
+ ~Mathlib.LinearAlgebra.Matrix.ToLin                                instructions          -18.1%
+ ~Mathlib.LinearAlgebra.TensorProduct.Basic                         instructions           -6.4%
+ ~Mathlib.LinearAlgebra.TensorProduct.Tower                         instructions          -14.7%
+ ~Mathlib.MeasureTheory.Group.FundamentalDomain                     instructions          -13.1%
+ ~Mathlib.MeasureTheory.Integral.SetToL1                            instructions           -3.8%
+ ~Mathlib.MeasureTheory.Measure.Lebesgue.EqHaar                     instructions          -12.0%
+ ~Mathlib.NumberTheory.RamificationInertia                          instructions           -5.4%
+ ~Mathlib.RingTheory.IsTensorProduct                                instructions          -10.6%
+ ~Mathlib.RingTheory.Kaehler                                        instructions           -5.6%
+ ~Mathlib.RingTheory.Localization.Basic                             instructions           -9.4%
+ ~Mathlib.RingTheory.TensorProduct.Basic                            instructions           -4.7%
+ ~Mathlib.Topology.Algebra.Module.Basic                             instructions           -5.5%
+ ~Mathlib.Topology.ContinuousFunction.NonUnitalFunctionalCalculus   instructions           -5.4%

Michael Stoll (May 21 2024 at 12:57):

There are a few files that got slower:

  Mathlib.CategoryTheory.Category.Cat.Limit:           +5.0626 * 10⁹ (+5.09%)
  Mathlib.CategoryTheory.Monoidal.Internal.FunctorCategory:
                                                       +3.5426 * 10⁹ (+1.48%)
  Mathlib.Algebra.Homology.HomotopyCategory.Pretriangulated:
                                                       +2.9497 * 10⁹ (+1.13%)
  Mathlib.CategoryTheory.Monad.Basic:                  +2.4588 * 10⁹ (+4.72%)
  Mathlib.Geometry.RingedSpace.PresheafedSpace.Gluing:
                                                       +2.0459 * 10⁹ (+0.492%)
  Mathlib.CategoryTheory.Limits.Final:                 +1.4472 * 10⁹ (+1.07%)
  Mathlib.Algebra.Homology.HomotopyCategory.Triangulated:
                                                       +1.3580 * 10⁹ (+0.692%)
  Mathlib.CategoryTheory.Preadditive.EilenbergMoore:   +1.3519 * 10⁹ (+0.570%)
  Mathlib.CategoryTheory.Monoidal.Discrete:            +1.2469 * 10⁹ (+1.11%)
  Mathlib.CategoryTheory.Bicategory.Kan.Adjunction:    +1.2169 * 10⁹ (+1.78%)
  Mathlib.CategoryTheory.Monoidal.Mon_:                +1.1204 * 10⁹ (+0.644%)

but many more that got faster.

The files above are in CategoryTheory (and in Algebra.Homology); however, the two files that gained the most are also in CategoryTheory :shrug:

Matthew Ballard (May 21 2024 at 13:41):

Out of curiosity, what if you cache only those instances containing Expr meta-variables and disregard Level meta-variables?

Jovan Gerbscheid (May 21 2024 at 13:54):

Compared to regular type class search, instead of immediately returning when a solution is found, my implementation continues to construct instances for other goals by propagating solutions that are already available. This adds a small cost to the search, which I suspect is what causes the slow-down in those few files. A better design might be to store partial solutions, i.e. which instances have been tried so far and what the results, if any, were.

Jovan Gerbscheid (May 21 2024 at 13:57):

Another improvement would be to keep a cache between different theorems/definitions. This cache could be cleared whenever a new instance is added to the environment, but kept otherwise. I suspect that this could also make a significant improvement in efficiency.

Jovan Gerbscheid (May 21 2024 at 13:59):

Matthew Ballard said:

What if you cache only those instances containing Expr meta-variables and disregard Level meta-variables?

What do you mean? Currently there is no global cache for goals with assignable metavariables (but this is cached within a single type class search)

Jovan Gerbscheid (May 21 2024 at 14:12):

But this could be improved upon by unifying the local and global cache into one cache.

Jovan Gerbscheid (May 21 2024 at 14:13):

@Michael Stoll how do you find the files that got slower?

Patrick Massot (May 21 2024 at 14:14):

Jovan Gerbscheid said:

Another improvement would be to keep cache between different theorems/definitions. This cache could be cleared whenever a new instance is added to the environment, but is kept otherwise. I suspect that this could also make a significant improvement in efficiency.

This sounds nice but one must keep in mind the upcoming work on parallelism inside files that makes everything like this more complicated.

Jovan Gerbscheid (May 21 2024 at 14:17):

How would this work with instances? You can't elaborate a theorem if you haven't elaborated the instance defined directly above it.

Michael Stoll (May 21 2024 at 14:17):

Jovan Gerbscheid said:

Michael Stoll how do you find the files that got slower?

I have written a shell script that downloads the json file with the relevant information from http://speed.lean-fro.org, filters it to extract files that have their instruction count changed by 10⁹ or more (separately for positive and negative changes), saves the information to a couple of files, and then calls Magma with a program that does the sorting and formatting. (There are certainly shell commands that could do that as well, but I'm comfortable using Magma...)

Michael Stoll (May 21 2024 at 14:18):

significant.bash:

#!/bin/bash

# usage: significant commit1 commit2
#
# gets the json file for the comparison from http://speed.lean-fro.org/mathlib4
# and prints the files / categories with an instruction change of at least 10^9,
# first the ones that got slower, then the ones that got faster

curl http://speed.lean-fro.org/mathlib4/api/compare/$1/to/$2?all_values=true > temp.json

jq '.differences | .[] | select(.dimension.metric == "instructions") |  select(.diff >= 1000000000) | [.dimension.benchmark, .diff, (.reldiff * 100)]' temp.json > temp1
jq '.differences | .[] | select(.dimension.metric == "instructions") |  select(.diff <= -1000000000) | [.dimension.benchmark, .diff, (.reldiff * 100)]' temp.json > temp2

magma -b ~/lean4/significant.magma

significant.magma:

// read in the two files with the data
// format is
// [
//   "~Mathlib.Analysis.NormedSpace.Star.Matrix",
//   1177909609,
//   0.86267198115728
// ]

lines1 := Split(Read("temp1"));
lines2 := Split(Read("temp2"));

// parse into triples <"file", diff, reldiff>
assert IsDivisibleBy(#lines1, 5) and IsDivisibleBy(#lines2, 5);

for i -> l in lines1 do
  if l[#l] eq "," then Prune(~lines1[i]); end if;
end for;
for i -> l in lines2 do
  if l[#l] eq "," then Prune(~lines2[i]); end if;
end for;

function parse(lseq, i)
  name := Split(lseq[5*i+2], " ~\"")[1]; //"
  d := StringToInteger(lseq[5*i+3]);
  rd := eval lseq[5*i+4];
  return <name, d, rd>;
end function;

seq1 := [parse(lines1, i) : i in [0..(#lines1 div 5)-1]];
seq2 := [parse(lines2, i) : i in [0..(#lines2 div 5)-1]];

seq := [e : e in seq1 cat seq2 | e[1,1] ne "M"];
seq1 := [e : e in seq1 | e notin seq];
seq2 := [e : e in seq2 | e notin seq];

// sort
Sort(~seq1, func<x,y | y[2]-x[2]>);
Sort(~seq2, func<x,y | x[2]-y[2]>);

// print
procedure print_entry(e)
  if #e[1] gt 50 then
    printf "  %o:\n"*" "^55*"%o%o * 10⁹ (%o%o%%)\n", e[1], e[2] gt 0 select "+" else "",
           ChangePrecision(e[2]*1e-9, 5), e[3] gt 0 select "+" else "", ChangePrecision(e[3], 3);
  else
    printf "  %-51o  %o%o * 10⁹ (%o%o%%)\n", e[1]*":", e[2] gt 0 select "+" else "",
           ChangePrecision(e[2]*1e-9, 5), e[3] gt 0 select "+" else "", ChangePrecision(e[3], 3);
  end if;
end procedure;

printf "\n"*"="^78*"\n";
printf "General information:\n";
for e in seq do
  print_entry(e);
end for;
printf "\nFiles that got slower:\n";
for e in seq1 do
  print_entry(e);
end for;
printf "\nFiles that got faster:\n";
for e in seq2 do
  print_entry(e);
end for;
printf "="^78*"\n";

quit;

Michael Stoll (May 21 2024 at 14:19):

The commits can be extracted from the link URL that gets posted by the bot (the parts left and right of /to/).

Jovan Gerbscheid (May 21 2024 at 14:20):

Can you run the same check on instances that you did at the start of this thread? Then we could see what instances are affected most by this change.

Michael Stoll (May 21 2024 at 14:21):

Maybe later today.

Patrick Massot (May 21 2024 at 14:21):

Jovan, this is a complicated story, but it worked in Lean 3. I am sure Leo and Sebastian want to make it even better in Lean 4, but it is complicated, so it was postponed when developing Lean 4.

Matthew Ballard (May 21 2024 at 14:22):

Jovan Gerbscheid said:

Matthew Ballard said:

What if you cache only those instances containing Expr meta-variables and disregard Level meta-variables?

What do you mean? Currently there is no global cache for goals with assignable metavariables (but this is cached within a single type class search)

Hmm. Disregard. This is probably not a good idea for anything global.

Jovan Gerbscheid (May 21 2024 at 14:45):

I didn't actually mean global, I meant in-declaration, which is more global than the local cache within a single call to synthInstance.

Michael Stoll (May 21 2024 at 14:46):

Though a more global cache (if feasible) could have an even more pronounced effect, as usually there are many similar instance searches in a given file.

Jovan Gerbscheid (May 21 2024 at 14:55):

Yes, I think this could be done with something like an environment extension. A detail here is that the current "global" cache is stored in a persistent hash map inside the state of MetaM, meaning that it can be reverted. I don't know why you'd want to do this, but the reverting won't happen if you have an environment extension with a regular hashmap.

Jovan Gerbscheid (May 21 2024 at 15:02):

The risk here is that when using the MonadBacktrack instance of MetaM or TacticM to reset the state, the name generator gets reset, meaning that you'll get new meta/free variables with names that had already appeared, and this invalidates the cache.

So maybe in the more global cache, we'd want to not store anything with meta/free variables.

Matthew Ballard (May 21 2024 at 15:43):

Moving things from the temporary cache to the permanent cache used in unification if they only have level metavariables (instead of the status quo, where they may have either level or expr metavariables) surprisingly

  • didn't break anything
  • sped up mathlib mildly

This is just what I was thinking of, see https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/universe.20issues/near/438023843

Jovan Gerbscheid (May 21 2024 at 16:15):

Looking at the isDefEq cache, none of it is cached in between independent isDefEq calls, as isDefEq clears both the transient and permanent cache at the start. There is a TODO there to figure out why keeping the permanent cache doesn't work. So not very permanent...

Jannis Limperg (May 21 2024 at 16:22):

Jovan Gerbscheid said:

The risk here is that when using the MonadBacktrack instance of MetaM or TacticM to reset the state, the name generator gets reset, meaning that you'll get new meta/free variables with names that had already appeared, and this invalidates the cache.

The name generator does not get reset by backtracking. See this function, which is used in the MonadBacktrack instance for MetaM.

However, local instances might be a problem for any global caching strategy.

Jovan Gerbscheid (May 21 2024 at 16:25):

Local instances are already part of the lookup key for instance caching, so this is already being dealt with.

Jovan Gerbscheid (May 21 2024 at 16:29):

So in that case, why do we use persistent data structures for caching? And how big is the performance difference between HashMap and PersistentHashMap?

Jovan Gerbscheid (May 21 2024 at 16:32):

Another question I have is whether the speed from the benchmarking is the only metric we should try to improve, or whether maybe excessive memory usage during caching might be problematic somehow?

Jovan Gerbscheid (May 21 2024 at 16:40):

I ran a second benchmark, after removing a line of code. The total of build instructions decreased by 0.006%, but the build time is 1% faster. Is it true that build time has a much higher standard deviation than build instructions? And is the build instruction count deterministic?

Matthew Ballard (May 21 2024 at 16:41):

The build time can vary +-3% in my experience

Jannis Limperg (May 21 2024 at 16:51):

Jovan Gerbscheid said:

So in that case, why do we use persistent data structures for caching? And how big is the performance difference between HashMap and PersistentHashMap?

The MonadBacktrack functionality is also used for snapshots, made by the server after every declaration(?). If the cache should be part of these snapshots, it needs to be a persistent data structure (but I don't know whether it should).

Jannis Limperg (May 21 2024 at 16:52):

I would expect the performance difference between persistent and non-persistent hash maps to be fairly big in a microbenchmark. But Sebastian once told me that operations on these hash maps basically don't show up during profiling, so apparently they're not relevant in the grand scheme of things.
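
As a rough illustration of how one might measure that difference, here is an unscientific micro-benchmark sketch. It assumes Lean's HashMap/PersistentHashMap API (HashMap.empty, PersistentHashMap.empty, insert, size) and only times repeated inserts via #eval; real workloads inside MetaM will behave differently.

import Lean

open Lean

-- Insert `n` keys into a `HashMap` and into a `PersistentHashMap` and compare wall-clock times.
-- The `HashMap` is used linearly, so its updates can happen in place; the `PersistentHashMap`
-- pays for persistence on every insert. Interpreted `#eval` timings are noisy, so treat the
-- numbers as a rough indication only.
def benchHashMaps (n : Nat) : IO Unit := do
  let t0 ← IO.monoMsNow
  let mut m : HashMap Nat Nat := HashMap.empty
  for i in [0:n] do
    m := m.insert i i
  let t1 ← IO.monoMsNow
  IO.println s!"HashMap:           {t1 - t0} ms ({m.size} entries)"
  let t2 ← IO.monoMsNow
  let mut p : PersistentHashMap Nat Nat := PersistentHashMap.empty
  for i in [0:n] do
    p := p.insert i i
  let t3 ← IO.monoMsNow
  IO.println s!"PersistentHashMap: {t3 - t2} ms ({p.size} entries)"

#eval benchHashMaps 100000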

Michael Stoll (May 22 2024 at 10:38):

Jovan Gerbscheid said:

Can you run the same check on instances that you did at the start of this thread? Then we could see what instances are affected most by this change.

Here are the results of building Mathlib with set_option profiler true on lean-pr-testing-4152.

total time for typeclass inference: 16428.3 seconds  (33.06%)
total time for simp: 7605.18 seconds  (15.30%)
total time for interpretation: 5244.79 seconds  (10.55%)
total time for elaboration: 4975.45 seconds  (10.01%)
total time for tactic execution: 4682.08 seconds  (9.422%)
total time for import: 4206.50 seconds  (8.465%)
total time for type checking: 2280.80 seconds  (4.590%)
total time for compilation: 1927.37 seconds  (3.879%)
total time for aesop: 533.690 seconds  (1.074%)
total time for dsimp: 436.107 seconds  (0.8776%)
total time for linting: 399.036 seconds  (0.8030%)
total time for .olean serialization: 353.838 seconds  (0.7120%)
total time for initialization: 273.796 seconds  (0.5510%)
total time for parsing: 166.096 seconds  (0.3343%)
total time for attribute application: 71.3646 seconds  (0.1436%)
total time for ring: 52.2042 seconds  (0.1051%)
total time for norm_num: 40.1689 seconds  (0.08083%)
total time for C code generation: 15.2958 seconds  (0.03078%)

Note that the percentage for typeclass inference is down by 3.5 points.

The top twenty typeclass searches:

proportion for typeclass inference of Module (> 0.1s): 7.75307%
proportion for typeclass inference of Nonempty (> 0.1s): 7.06358%
proportion for typeclass inference of Algebra (> 0.1s): 6.11472%
proportion for typeclass inference of AddMonoidHomClass (> 0.1s): 4.81564%
proportion for typeclass inference of HSMul (> 0.1s): 4.25272%
proportion for typeclass inference of CoeT (> 0.1s): 4.08880%
proportion for typeclass inference of CoeFun (> 0.1s): 3.97967%
proportion for typeclass inference of ZeroHomClass (> 0.1s): 3.54000%
proportion for typeclass inference of SMul (> 0.1s): 3.27658%
proportion for typeclass inference of ContinuousConstSMul (> 0.1s): 3.01271%
proportion for typeclass inference of AddHomClass (> 0.1s): 2.79991%
proportion for typeclass inference of AddCommMonoid (> 0.1s): 2.57972%
proportion for typeclass inference of CovariantClass (> 0.1s): 2.29647%
proportion for typeclass inference of MulHomClass (> 0.1s): 2.03958%
proportion for typeclass inference of FiniteDimensional (> 0.1s): 1.99274%
proportion for typeclass inference of HasOrthogonalProjection (> 0.1s): 1.63697%
proportion for typeclass inference of CategoryTheory.Limits.HasColimit (> 0.1s): 1.56189%
proportion for typeclass inference of NormedSpace (> 0.1s): 1.48793%
proportion for typeclass inference of Fintype (> 0.1s): 1.36851%
proportion for typeclass inference of IsScalarTower (> 0.1s): 1.30739%

Note that I have a faster machine now, so the 100ms threshold cuts off more (the proportion of typeclass searches over 100ms is 11.94% in this run).

Subsingleton is down to position 50. The top part is still dominated by algebra (and CoeT/CoeFun), with SMul related stuff still fairly prominent. (This should get better once Yaël gets to refactoring the algebra and order hierarchies). Nonempty is also very high up; this might be a target for closer inspection.

Yuyang Zhao (May 22 2024 at 21:06):

#9252 needs a hack to find instances, which might also be related to the metavariable assignment bug.

Jovan Gerbscheid (May 22 2024 at 22:04):

What's the hack?

Yuyang Zhao (May 23 2024 at 12:51):

See https://github.com/leanprover-community/mathlib4/pull/9252/commits/3d60bc9f316da42f6d6cbf27a8e5908541eef330.

Yuyang Zhao (May 23 2024 at 12:52):

Build failed without it.

Yuyang Zhao (May 23 2024 at 17:24):

Could we make Lean 4 not check instance parameters during TC search (do something like https://github.com/negiizhao/lean4/pull/7/files)? It doesn't seem to accelerate much overall, but it can accelerate some specific instances. See https://leanprover.zulipchat.com/#narrow/stream/287929-mathlib4/topic/TC.20slowness

Jovan Gerbscheid (May 23 2024 at 17:29):

I'm looking at the failure, and am a bit confused: without the hack, the instance search returns no instance. With the hack, the instance search errors. But I don't understand why this causes the thing to fail...

Jovan Gerbscheid (May 23 2024 at 17:33):

I don't understand what you mean

Yuyang Zhao (May 23 2024 at 17:50):

Look at this #mwe:

class A (t : Nat)
class B (n : Nat) [A 0] [A 1] [A 2] [A 3] [A 4] [A 5] [A 6] [A 7]

instance a1 : A t where

instance b [A 0] [A 1] [A 2] [A 3] [A 4] [A 5] [A 6] [A 7] [B n] : B n.succ where
instance b0 [A 1] [A 2] [A 3] [A 4] [A 5] [A 6] [A 7] [B n] [A 0] : B n.succ where
instance b1 [A 0] [A 2] [A 3] [A 4] [A 5] [A 6] [A 7] [B n] [A 1] : B n.succ where
instance b2 [A 0] [A 1] [A 3] [A 4] [A 5] [A 6] [A 7] [B n] [A 2] : B n.succ where
instance b3 [A 0] [A 1] [A 2] [A 4] [A 5] [A 6] [A 7] [B n] [A 3] : B n.succ where
instance b4 [A 0] [A 1] [A 2] [A 3] [A 5] [A 6] [A 7] [B n] [A 4] : B n.succ where
instance b5 [A 0] [A 1] [A 2] [A 3] [A 4] [A 6] [A 7] [B n] [A 5] : B n.succ where
instance b6 [A 0] [A 1] [A 2] [A 3] [A 4] [A 5] [A 7] [B n] [A 6] : B n.succ where
instance b7 [A 0] [A 1] [A 2] [A 3] [A 4] [A 5] [A 6] [B n] [A 7] : B n.succ where

instance a2 : A t where

#synth B 7 -- (deterministic) timeout at `typeclass`

Lean 4 treats @B 0 a1 ... and @B 0 a2 ... as different instances, but we need to make sure that the different instance parameters we find are defeq, so there are actually some redundant attempts here. Mathlib also builds successfully in #12810 (the diff is broken now).

Jovan Gerbscheid (May 23 2024 at 17:53):

Does this explain the failure that you solved with the hack?

Yuyang Zhao (May 23 2024 at 18:00):

No, I believe this should be an unrelated topic. (But they were both found while resolving the failure of #6326 IIRC.)

Jovan Gerbscheid (May 23 2024 at 18:38):

In https://github.com/negiizhao/lean4/pull/7/files, this solution is only safe under the assumption that different instances of the same type are always isDefEq, but that isn't necessarily a safe assumption. Although I think this assumption is supposed to hold in Mathlib

Jovan Gerbscheid (May 23 2024 at 20:04):

I've found out what causes the mysterious failure, and why the hack helps. This is the failing theorem:

theorem inv_le_one_of_one_le : 1 ≤ a → a⁻¹ ≤ 1 :=
  inv_le_one'.mpr

The code that throws the error does this:

  let result ← trySynthInstance type
  match result with
  | ...
  | .undef => return false -- we will try later
  | .none  =>
      throwError "failed to synthesize{indentExpr type}\n{useDiagnosticMsg}"

So the problem is that we want trySynthInstance to return .undef instead of .none. .undef is returned when a read-only metavariable (such as a type class variable) would need to be instantiated during unification. During the type class search, the only instance that is found is OrderedCommMonoid.toCovariantClassLeft, which has a multiplication operation derived from an OrderedCommMonoid instance. But the multiplication in inv_le_one' is derived from a Group instance. So unification isn't able to relate the Group and OrderedCommMonoid instances. Therefore, synthesis returns .none, implying that no instance will ever be applicable. But by hacking in another instance that causes type class search to return .undef, we try synthesizing again later when the metavariables are instantiated. And at that point we know that our type has an OrderedCommGroup instance.

Jovan Gerbscheid (May 23 2024 at 20:06):

While figuring this out, I noticed that there were many identical type class searches for Group ?m with a fixed metavariable ?m. It turns out that .undef results of type class search are not cached. I think that this is another point at which efficiency can be improved.

Jovan Gerbscheid (May 23 2024 at 20:22):

Can someone explain the reason for having OrderedCommMonoid be a class instead of CommMonoid and Ordered being separate type classes? Because that is what is causing the problem here.

Yaël Dillies (May 23 2024 at 20:27):

Simplicity. But I'll try making the algebra and order hierarchies more separate at some point this summer

Kevin Buzzard (May 23 2024 at 21:01):

Surely an OrderedCommMonoid is not just a CommMonoid with an order? There'll be some axiom relating the structures, right? So you'll need OrderedCommMonoid somehow right?

Kevin Buzzard (May 23 2024 at 21:02):

Oh you could have IsOrderedCommMonoid I guess

Michael Stoll (May 23 2024 at 21:02):

I think Yaël's idea is to have data-carrying typeclasses for algebra and for order and then mix-ins for the compatibilities.

Kevin Buzzard (May 23 2024 at 21:04):

Will that solve the problem @Jovan Gerbscheid is flagging?

Kevin Buzzard (May 23 2024 at 21:05):

By the way, my understanding is that Group ?m should fail instantly because it's obviously a bad idea.

Jovan Gerbscheid (May 23 2024 at 21:10):

Kevin Buzzard said:

Will that solve the problem Jovan Gerbscheid is flagging?

I wanted to say yes, but it turns out no: CommMonoid and Group are still not well-related in the hierarchy :oh_no:

Yuyang Zhao (May 23 2024 at 21:14):

So it's the lack of separation of the algebra and order hierarchies itself that prevents attempts to separate them...

Kevin Buzzard (May 23 2024 at 21:26):

There are two ways to get from CommGroup to Monoid and all four of the vertices of the square are useful mathematical things...

Yuyang Zhao (May 23 2024 at 21:40):

Jovan Gerbscheid said:

In https://github.com/negiizhao/lean4/pull/7/files, this solution is only safe under the assumption that different instances of the same type are always isDefEq, but that isn't necessarily a safe assumption. Although I think this assumption is supposed to hold in Mathlib

I'm still wondering if we can just believe they are defeq. Even if we do not use CovariantClass, there are still some duplicate attempts. Although after a couple of fixes to the TC search, it probably will not stop us from separating the algebra and order hierarchies anymore...

Yuyang Zhao (May 23 2024 at 21:44):

Not necessarily using the method in https://github.com/negiizhao/lean4/pull/7/files, that was just a test.

Jovan Gerbscheid (May 23 2024 at 22:24):

If you want to get type class search to not duplicate its search on isDefEq instances of the same type, then for the lookup table you'll have to do some trick like replacing all instance arguments with free variables, and after each lookup use isDefEq to check that these instances are indeed isDefEq. I'm not sure that doing this will improve overall efficiency because of the cost of modifying the lookup key.

Jovan Gerbscheid (May 23 2024 at 22:56):

I was trying to minimize the failure to a small example, and I discovered that replacing inv_le_one'.mpr by Iff.mpr inv_le_one' solves the problem. So this failure only occurs when the expected type is not propagated correctly. So maybe it is more of an elaboration problem instead?

Jovan Gerbscheid (May 23 2024 at 23:10):

#mwe:

class AA (α : Type) where
  asNum : α → Nat
  p : α → Prop

class AB (α : Type) extends AA α
class BA (α : Type) extends AA α
class BB (α : Type) extends AB α, BA α

class X (α : Type) (f : α → Nat)

instance [AB α] : X α AA.asNum where

theorem X_rfl [BA α] [X α AA.asNum] {a : α} : AA.p a ↔ AA.p a := Iff.rfl

/-
failed to synthesize instance
  X ?m.234 AA.asNum
-/
example {α : Type} [BB α] (a : α) : AA.p a ↔ AA.p a := X_rfl.mpr

I don't know much about how expected types are passed around, but maybe this can be an easy fix?

Jovan Gerbscheid (May 23 2024 at 23:11):

On the topic of caching .undef results of TC, yes these are supposed to fail quickly, but seeing this trace, I still think a noticeable amount of time could be saved:

Edit: This turns out to not be valid to cache, because metavariable instantiations in the types of local instances can change in the meantime, which may make a previously .undef result now succeed or fail.

[0.030101]  CovariantClass ?m.96723 ?m.96723 (fun x x_1 ↦ x * x_1) fun x x_1 ↦ x ≤ x_1
  [] [0.000033] new goal CovariantClass ?m.96723 ?m.96723 (fun x x_1 ↦ x * x_1) fun x x_1 ↦ x ≤ x_1
  [] [0.030044]  apply @OrderedCommMonoid.toCovariantClassLeft to CovariantClass ?m.96723 ?m.96723 (fun x x_1 ↦ x * x_1)
        fun x x_1 ↦ x ≤ x_1
    [tryResolve] [0.030005]  CovariantClass ?m.96723 ?m.96723 (fun x x_1 ↦ x * x_1) fun x x_1 ↦
          x ≤ x_1 ≟ CovariantClass ?m.96757 ?m.96757 (fun x x_1 ↦ x * x_1) fun x x_1 ↦ x ≤ x_1
      [] [0.000175] 💥 Group ?m.96723 
      [] [0.000387] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000167] 💥 Group ?m.96723 
      [] [0.000337] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000171] 💥 Group ?m.96723 
      [] [0.000367] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000211] 💥 Group ?m.96723 
      [] [0.000410] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000150] 💥 Group ?m.96723 
      [] [0.000348] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000181] 💥 Group ?m.96723 
      [] [0.000437] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000156] 💥 Group ?m.96723 
      [] [0.000410] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000136] 💥 Group ?m.96723 
      [] [0.000336] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000144] 💥 Group ?m.96723 
      [] [0.000337] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000144] 💥 Group ?m.96723 
      [] [0.000331] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000144] 💥 Group ?m.96723 
      [] [0.000341] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000177] 💥 Group ?m.96723 
      [] [0.000340] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000158] 💥 Group ?m.96723 
      [] [0.000347] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000143] 💥 Group ?m.96723 
      [] [0.000339] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000142] 💥 Group ?m.96723 
      [] [0.000319] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000157] 💥 Group ?m.96723 
      [] [0.000391] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000173] 💥 Group ?m.96723 
      [] [0.000371] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000155] 💥 Group ?m.96723 
      [] [0.000365] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000158] 💥 Group ?m.96723 
      [] [0.000357] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000149] 💥 Group ?m.96723 
      [] [0.000528] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000169] 💥 Group ?m.96723 
      [] [0.000376] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000159] 💥 Group ?m.96723 
      [] [0.000385] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000158] 💥 Group ?m.96723 
      [] [0.000377] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000155] 💥 Group ?m.96723 
      [] [0.000358] 💥 OrderedCommMonoid ?m.96723 
      [] [0.000149] 💥 Group ?m.96723 
      [] [0.000356] 💥 OrderedCommMonoid ?m.96723 
      [] 44 more entries... 

Yuyang Zhao (May 23 2024 at 23:53):

Jovan Gerbscheid said:

In https://github.com/negiizhao/lean4/pull/7/files, this solution is only safe under the assumption that different instances of the same type are always isDefEq, but that isn't necessarily a safe assumption. Although I think this assumption is supposed to hold in Mathlib

Isn't this the same assumption as lean4#4003?

Jovan Gerbscheid (May 23 2024 at 23:57):

Hmm, that's true

Jovan Gerbscheid (May 24 2024 at 00:28):

Actually, isn't it the case for SMul Nat Nat and SMul Nat M for an additive monoid M that the instances are definitionally not equal if you substitute Nat for M?

For this example I think lean4#4003 could cause problems here, and with your change this could also result in a failed type class search in the wrong circumstances.

Johan Commelin (May 24 2024 at 03:02):

@Jovan Gerbscheid mathlib should have only one SMul Nat Nat up to defeq. That is why our definition of AddCommMonoid comes with an nsmul field.

Jovan Gerbscheid (May 24 2024 at 13:50):

Ah, I see, and the same for zsmul as a field of SubNegMonoid.

Kevin Buzzard (May 24 2024 at 14:07):

Don't forget nnqsmul and qsmul in Field :-)

Jovan Gerbscheid (May 25 2024 at 20:42):

I've made more progress on speeding up type-class search for Mathlib.

  • I made a big oversight when implementing the caching of intermediate type class results: I only stored positive results. Caching intermediate failures is also useful. This is a bit more complicated than it seems at first, because of the loop detection: a search branch can "fail" due to a detected loop, but that doesn't mean that there will not eventually be an instance found for it. So I keep a noLoops : Bool in the state and only cache failures when noLoops := true. In the first version I just set noLoops to false whenever the search tree becomes a non-tree. This gave another speedup of around 8% to type class search in mathlib. I then replaced this with a more sophisticated algorithm that can tell apart a diamond from a loop, so that negative caching can happen even more often, but the speedup from this was relatively negligible. See #13191 and see below.
  • The recent optimization of canonical instances is only applied when the subgoal doesn't have metavariables. But I believe that it should also apply when the subgoal only has level metavariables. This gives a marginal overall speedup of 0.011% (I think that is statistically significant) #13194.
  • The implementation of the TC algorithm turns out to be a bit sloppy, so by cleaning it up, I managed to get another speedup: mathlib overall becomes 0.8% faster (so ~2.8% speedup for TC resolution) (#13207). This includes things like deduplicating a computation that is done twice, and making sure that computations meant for trace messages are only executed when the trace option is set to true. This performance improvement is orthogonal to the other one.

This is the benchmark for the improved intermediate TC result caching:

  Benchmark                                                                   Metric                Change
  ========================================================================================================
+ build                                                                       instructions           -5.1%
+ build                                                                       task-clock             -5.4%
+ build                                                                       typeclass inference   -16.2%
+ lint                                                                        wall-clock             -5.7%
+ ~Mathlib.Algebra.Algebra.NonUnitalSubalgebra                                instructions          -14.0%
+ ~Mathlib.Algebra.Algebra.Quasispectrum                                      instructions          -16.2%
+ ~Mathlib.Algebra.Algebra.Subalgebra.Unitization                             instructions          -17.0%
+ ~Mathlib.Algebra.Algebra.Unitization                                        instructions          -25.4%
+ ~Mathlib.Algebra.BigOperators.Finsupp                                       instructions          -29.5%
+ ~Mathlib.Algebra.Category.ModuleCat.ChangeOfRings                           instructions           -6.1%
+ ~Mathlib.Algebra.Lie.Weights.Killing                                        instructions          -12.2%
+ ~Mathlib.Algebra.Module.LinearMap.Basic                                     instructions          -14.4%
+ ~Mathlib.Algebra.Module.LocalizedModule                                     instructions          -12.1%
+ ~Mathlib.Algebra.MonoidAlgebra.Basic                                        instructions          -11.7%
+ ~Mathlib.Algebra.Order.Module.Defs                                          instructions          -17.0%
+ ~Mathlib.Algebra.Order.ToIntervalMod                                        instructions          -18.7%
+ ~Mathlib.Algebra.Polynomial.Module.Basic                                    instructions          -16.4%
+ ~Mathlib.Algebra.Ring.CentroidHom                                           instructions          -14.6%
+ ~Mathlib.Algebra.Star.NonUnitalSubalgebra                                   instructions          -20.7%
+ ~Mathlib.Algebra.TrivSqZeroExt                                              instructions          -26.7%
+ ~Mathlib.AlgebraicGeometry.Morphisms.QuasiSeparated                         instructions           -7.7%
+ ~Mathlib.Analysis.Analytic.Basic                                            instructions          -10.0%
+ ~Mathlib.Analysis.Analytic.CPolynomial                                      instructions          -11.9%
+ ~Mathlib.Analysis.Analytic.Inverse                                          instructions           -9.3%
+ ~Mathlib.Analysis.Calculus.ContDiff.Basic                                   instructions          -10.5%
+ ~Mathlib.Analysis.Calculus.ContDiff.Bounds                                  instructions           -7.2%
+ ~Mathlib.Analysis.Calculus.ContDiff.Defs                                    instructions          -10.4%
+ ~Mathlib.Analysis.Calculus.Deriv.Mul                                        instructions          -13.9%
+ ~Mathlib.Analysis.Calculus.FDeriv.Analytic                                  instructions           -9.6%
+ ~Mathlib.Analysis.Calculus.FDeriv.Basic                                     instructions          -10.6%
+ ~Mathlib.Analysis.Calculus.FDeriv.Measurable                                instructions          -12.9%
+ ~Mathlib.Analysis.Calculus.FDeriv.Mul                                       instructions          -10.7%
+ ~Mathlib.Analysis.Calculus.FDeriv.Prod                                      instructions          -11.7%
+ ~Mathlib.Analysis.Calculus.FDeriv.Symmetric                                 instructions          -24.2%
+ ~Mathlib.Analysis.Calculus.Implicit                                         instructions          -15.2%
+ ~Mathlib.Analysis.Calculus.LineDeriv.Basic                                  instructions          -16.0%
+ ~Mathlib.Analysis.Calculus.Rademacher                                       instructions          -16.8%
+ ~Mathlib.Analysis.Convex.Function                                           instructions          -31.6%
+ ~Mathlib.Analysis.Convex.Gauge                                              instructions          -15.3%
+ ~Mathlib.Analysis.Convex.Jensen                                             instructions          -19.6%
+ ~Mathlib.Analysis.Convex.Mul                                                instructions          -15.0%
+ ~Mathlib.Analysis.Convex.Segment                                            instructions          -19.9%
+ ~Mathlib.Analysis.Convolution                                               instructions          -10.7%
+ ~Mathlib.Analysis.Distribution.SchwartzSpace                                instructions           -7.9%
+ ~Mathlib.Analysis.Fourier.FourierTransformDeriv                             instructions          -10.2%
+ ~Mathlib.Analysis.InnerProductSpace.Basic                                   instructions           -8.3%
+ ~Mathlib.Analysis.InnerProductSpace.OfNorm                                  instructions          -21.6%
+ ~Mathlib.Analysis.InnerProductSpace.Projection                              instructions           -9.9%
+ ~Mathlib.Analysis.NormedSpace.BallAction                                    instructions          -21.9%
+ ~Mathlib.Analysis.NormedSpace.Banach                                        instructions          -15.4%
+ ~Mathlib.Analysis.NormedSpace.BoundedLinearMaps                             instructions          -13.7%
+ ~Mathlib.Analysis.NormedSpace.MStructure                                    instructions          -30.1%
+ ~Mathlib.Analysis.NormedSpace.Multilinear.Basic                             instructions          -13.3%
+ ~Mathlib.Analysis.NormedSpace.Multilinear.Curry                             instructions          -12.6%
+ ~Mathlib.Analysis.NormedSpace.OperatorNorm.Basic                            instructions          -17.8%
+ ~Mathlib.Analysis.NormedSpace.OperatorNorm.Bilinear                         instructions          -18.2%
+ ~Mathlib.Analysis.NormedSpace.OperatorNorm.Prod                             instructions          -14.7%
+ ~Mathlib.Analysis.NormedSpace.PiTensorProduct.InjectiveSeminorm             instructions           -6.5%
+ ~Mathlib.Analysis.NormedSpace.SphereNormEquiv                               instructions          -41.6%
+ ~Mathlib.Analysis.NormedSpace.Star.ContinuousFunctionalCalculus.Instances   instructions           -9.3%
+ ~Mathlib.Analysis.NormedSpace.Star.Multiplier                               instructions          -12.4%
+ ~Mathlib.Analysis.NormedSpace.TrivSqZeroExt                                 instructions          -20.5%
+ ~Mathlib.Analysis.NormedSpace.lpSpace                                       instructions           -7.0%
+ ~Mathlib.Analysis.Seminorm                                                  instructions          -15.3%
+ ~Mathlib.Analysis.SpecialFunctions.Pow.Deriv                                instructions           -9.1%
+ ~Mathlib.CategoryTheory.Bicategory.Adjunction                               instructions          -20.0%
+ ~Mathlib.CategoryTheory.Monoidal.Braided.Basic                              instructions          -20.4%
+ ~Mathlib.CategoryTheory.Monoidal.Rigid.Basic                                instructions          -10.9%
+ ~Mathlib.Data.Finset.Pointwise                                              instructions           -9.0%
+ ~Mathlib.Data.Matrix.Basic                                                  instructions          -14.4%
+ ~Mathlib.Data.Num.Lemmas                                                    instructions          -15.8%
+ ~Mathlib.Data.Set.Pointwise.Interval                                        instructions          -15.3%
+ ~Mathlib.FieldTheory.KummerExtension                                        instructions           -8.9%
+ ~Mathlib.Geometry.Euclidean.Angle.Oriented.Basic                            instructions          -10.3%
...

Yaël Dillies (May 25 2024 at 21:34):

Careful! You're gonna get hired by the FRO :wink:

Jovan Gerbscheid (May 26 2024 at 10:12):

I realize now why the more sophisticated loop detection doesn't give any improvement (I made a mistake in it at first, but fixing it didn't give an improvement either). The reason is that instead of caching intermediate results globally at the end of a synthInstance? call, this is done immediately. So suppose there is a diamond, say with instances AA => AB, AA => BA, AB => BB, BA => BB, and we get the subgoal BB. Then first we find the subgoal AB, and then AA. Then AA fails, and this is globally cached, and so AB fails and this is globally cached. Next we find the subgoal BA and then AA again. But since we first look up AA in the global cache, we see that AA has no solutions, and crucially we do not detect that there is a diamond, because we found AA in the global cache instead of the local cache, keeping noLoops := true.

So now it only (locally) registers that an entry is done when the entry is not able to be cached globally.
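
For readers following along, here is a minimal sketch of the diamond shape being described (hypothetical classes, unrelated to the #mwe earlier in the thread; the instances mirror the arrows AA => AB, AA => BA, AB => BB, BA => BB):

class AA (α : Type)
class AB (α : Type)
class BA (α : Type)
class BB (α : Type)

instance {α : Type} [AA α] : AB α where
instance {α : Type} [AA α] : BA α where
instance {α : Type} [AB α] : BB α where
instance {α : Type} [BA α] : BB α where

-- A search for `BB α` reaches the subgoal `AA α` twice, once via `AB α` and once via `BA α`;
-- whether the second visit is answered from the local or the global cache is exactly what the
-- `noLoops` bookkeeping described above has to get right.
example {α : Type} [AA α] : BB α := inferInstance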

Jovan Gerbscheid (May 26 2024 at 12:11):

By the way, is it possible to install the Lean version built from a specific commit (i.e. not the most recent one) of a lean4 PR?

Kim Morrison (May 26 2024 at 14:08):

Jovan Gerbscheid said:

By the way, is it possible to install the Lean version built from a specific commit (i.e. not the most recent one) of a lean4 PR?

No.

Kim Morrison (May 26 2024 at 14:08):

(I mean, you can use elan override set, but you can't do this just via lean-toolchain.)

Kim Morrison (May 26 2024 at 14:11):

Just a word of warning here: it's exciting that there are performance improvements here, but reviewing changes to core language features like TC search may not be quick, because it requires very careful review from people with a lot on their plate. The easier you can make PRs to review (thorough tests, good comments, clearly explained motivation in PR descriptions, etc.), the better, but even then no promises!

Kim Morrison (May 26 2024 at 14:20):

Also, please review https://github.com/leanprover/lean4/blob/master/CONTRIBUTING.md. Realistically, we would prefer to see individual issues for each of your bullet points in the current PR description.

Jovan Gerbscheid (May 26 2024 at 18:54):

Ok, I have made an issue for each bullet point. lean4#4277

I'll add tests to some of these later, but many don't need tests.

Jovan Gerbscheid (May 26 2024 at 19:14):

And I made a bonus issue :wink: lean4#4288

Michael Stoll (May 26 2024 at 19:19):

lean4#4288 ?

Jovan Gerbscheid (May 26 2024 at 19:20):

Oh, whoops

Michael Rothgang (Jun 02 2024 at 15:36):

Reading and trying to apply leanprover-community/leanprover-community.io#472, two follow-up questions occurred to me. I'd be very happy to learn about positive answers to these! Specifically,

  • is there a way to match the output of set_option profiler true with individual lines? Say, when a proof has six simps, but I get output about three of them being slow, can I easily find out which ones? (I can work around this, e.g. by commenting them out individually - but fewer workarounds mean I'm more productive.)
  • is there a common way to speed things up when the profiler reports linting took <a long time>?

Jovan Gerbscheid (Jun 02 2024 at 16:02):

I'm pretty sure there is no direct way to tell which line/function a set_option profiler true message is coming from. There is instead set_option trace.profiler true, which does allow you to see which simp corresponds to which trace, but I'm not sure if that is more convenient to use.

Michael Rothgang (Jun 02 2024 at 16:11):

When I tried that, I got many lines with times close to zero - but I guess setting trace.profiler.threshold would help with that?

Jovan Gerbscheid (Jun 02 2024 at 16:11):

I think so

Michael Stoll (Jun 02 2024 at 16:22):

The default for profiler.threshold is 100ms, whereas for trace.profiler.threshold, it is 10ms.
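
A small usage sketch combining these options (the proof is just a placeholder and the 50ms value is only illustrative; both thresholds are in milliseconds):

set_option profiler true in                 -- per-declaration summary (threshold option: profiler.threshold)
set_option trace.profiler true in           -- per-step traces, attached to the syntax that triggered them
set_option trace.profiler.threshold 50 in   -- only report trace steps taking at least 50ms
example (a b : Nat) : a + 0 + b = a + b := by
  simp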

Jovan Gerbscheid (Jun 02 2024 at 16:29):

I made a change to the caching in lean4#4272, so that it doesn't use or modify the cache for classes that contain outParam arguments. This is because outParams are treated differently when they are in the main goal vs an intermediate goal: every outParam argument in the main goal is replaced by a fresh metavariable, which is unified with the original argument after the type class search. This means that if someone came along and added a high-priority instance for HAdd Nat Nat Int, then HAdd Nat Nat Nat would fail as a main goal, but succeed as an intermediate goal. I'm not convinced this is a beneficial state of affairs (see also lean4#4225), but I think my extra caching shouldn't change this behaviour.

After this change the total mathlib speedup is 4.6% instead of 5.1%.
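
To make the outParam point concrete, here is a self-contained sketch with a hypothetical class: MyHMul is a stand-in, not the real HAdd/HMul hierarchy, and the #synth line is left as a comment because its outcome is exactly the behaviour under discussion.

class MyHMul (α β : Type) (γ : outParam Type) where
  hmul : α → β → γ

instance : MyHMul Nat Nat Nat := ⟨(· * ·)⟩

-- A hypothetical high-priority instance with a different outParam value:
instance (priority := 2000) : MyHMul Nat Nat Int := ⟨fun a b => ((a * b : Nat) : Int)⟩

-- Per the description above: as a *main* goal, the `Nat` in the outParam position is replaced
-- by a fresh metavariable, so the high-priority `Int` instance is tried first; as an
-- *intermediate* goal, the stated `Nat` is kept, so the same query can behave differently.
-- #synth MyHMul Nat Nat Nat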

Yuyang Zhao (Jun 08 2024 at 03:14):

I'm back to #7873 and got a 10.2% improvement in typeclass inference.

Johan Commelin (Jun 08 2024 at 04:08):

@Yuyang Zhao Thanks so much! Can you please summarize what you still want to do with this PR? (It is labelled WIP.)

Yuyang Zhao (Jun 08 2024 at 04:34):

There are mysterious slowdowns in some files. But maybe we can merge it first if we can't find the reason now.

Johan Commelin (Jun 08 2024 at 04:36):

Well, I was mostly asking whether you wanted reviews already, or were still planning major work on the PR.

Johan Commelin (Jun 08 2024 at 04:36):

I agree that it would be good to get a better understanding of the slowdowns.

Kevin Buzzard (Jun 08 2024 at 07:06):

@Michael Stoll can you run your interpretation of the speedcenter data on #7873? Some files are 100% slower here; it would be good to understand better what that means.

Damiano Testa (Jun 08 2024 at 07:26):

Should Michael's code be added to the PR summary after each !bench? In particular, should it be automated?

Michael Stoll (Jun 08 2024 at 07:53):

I've run my script; the output is some 1500 lines. Do you want me to add it as a comment to the PR? Or what would be a good way of making it available?

The largest changes are:

  lint:                                                +299.52 * 10⁹ (+2.50%)
  build:                                               -4486.2 * 10⁹ (-3.70%)

Files that got slower by at least 15*10⁹ instructions:
  Mathlib.CategoryTheory.Preadditive.EilenbergMoore:   +82.070 * 10⁹ (+34.6%)
  Mathlib.LinearAlgebra.TensorProduct.Graded.External: +57.056 * 10⁹ (+18.5%)
  Mathlib.CategoryTheory.Preadditive.EndoFunctor:      +43.177 * 10⁹ (+31.8%)
  Mathlib.Algebra.Order.Field.Defs:                    +27.138 * 10⁹ (+68.6%)
  Mathlib.CategoryTheory.Preadditive.SingleObj:        +26.084 * 10⁹ (+143.%)
  Mathlib.Algebra.Field.Basic:                         +21.982 * 10⁹ (+53.1%)
  Mathlib.Combinatorics.Schnirelmann:                  +17.875 * 10⁹ (+70.6%)
  Mathlib.Algebra.Polynomial.BigOperators:             +17.769 * 10⁹ (+33.3%)
  Mathlib.Geometry.Euclidean.Angle.Oriented.Basic:     +15.993 * 10⁹ (+13.9%)

Files that got slower by at least 30%:
  Mathlib.CategoryTheory.Preadditive.SingleObj:                       +143.%
  Mathlib.Algebra.EuclideanDomain.Instances:                          +100.%
  Mathlib.Algebra.Field.IsField:                                      +72.4%
  Mathlib.Combinatorics.Schnirelmann:                                 +70.6%
  Mathlib.Algebra.Order.Field.Defs:                                   +68.6%
  Mathlib.Algebra.Order.Sub.Canonical:                                +60.8%
  Mathlib.Algebra.Field.Basic:                                        +53.1%
  Mathlib.Algebra.Order.Sub.Defs:                                     +37.2%
  Mathlib.CategoryTheory.Preadditive.EilenbergMoore:                  +34.6%
  Mathlib.Algebra.Ring.Equiv:                                         +33.8%
  Mathlib.Algebra.Polynomial.BigOperators:                            +33.3%
  Mathlib.Data.ZMod.Defs:                                             +33.2%
  Mathlib.CategoryTheory.Preadditive.EndoFunctor:                     +31.8%
  Mathlib.LinearAlgebra.Matrix.Charpoly.Basic:                        +30.2%

Files that got faster by at least 25*10⁹ instructions:
  Mathlib.FieldTheory.Adjoin:                          -62.854 * 10⁹ (-22.7%)
  Mathlib.RingTheory.DedekindDomain.Different:         -59.765 * 10⁹ (-41.8%)
  Mathlib.RingTheory.IntegralRestrict:                 -55.813 * 10⁹ (-34.8%)
  Mathlib.NumberTheory.RamificationInertia:            -54.219 * 10⁹ (-17.9%)
  Mathlib.Algebra.Quaternion:                          -50.853 * 10⁹ (-23.8%)
  Mathlib.Topology.ContinuousFunction.StoneWeierstrass:
                                                       -46.435 * 10⁹ (-18.6%)
  Mathlib.LinearAlgebra.Dual:                          -46.255 * 10⁹ (-12.5%)
  Mathlib.Geometry.Manifold.ContMDiff.NormedSpace:     -46.023 * 10⁹ (-23.3%)
  Mathlib.LinearAlgebra.TensorProduct.Graded.Internal: -45.843 * 10⁹ (-17.7%)
  Mathlib.FieldTheory.SeparableDegree:                 -44.346 * 10⁹ (-22.0%)
  Mathlib.Analysis.Convolution:                        -44.057 * 10⁹ (-11.7%)
  Mathlib.RingTheory.Trace:                            -41.745 * 10⁹ (-35.0%)
  Mathlib.FieldTheory.PurelyInseparable:               -41.482 * 10⁹ (-18.9%)
  Mathlib.FieldTheory.RatFunc.Basic:                   -41.168 * 10⁹ (-18.5%)
  Mathlib.RingTheory.FractionalIdeal.Operations:       -39.789 * 10⁹ (-36.9%)
  Mathlib.NumberTheory.NumberField.Discriminant:       -37.172 * 10⁹ (-17.2%)
  Mathlib.Analysis.NormedSpace.Star.ContinuousFunctionalCalculus.Instances:
                                                       -35.498 * 10⁹ (-11.5%)
  Mathlib.Algebra.Lie.Weights.RootSystem:              -35.190 * 10⁹ (-18.1%)
  Mathlib.Algebra.Algebra.Subalgebra.Rank:             -34.750 * 10⁹ (-29.1%)
  Mathlib.NumberTheory.Cyclotomic.PrimitiveRoots:      -34.711 * 10⁹ (-37.5%)
  Mathlib.Analysis.Calculus.FDeriv.Mul:                -33.574 * 10⁹ (-7.69%)
  Mathlib.RingTheory.IntegralClosure:                  -32.421 * 10⁹ (-31.1%)
  Mathlib.Algebra.Category.AlgebraCat.Monoidal:        -32.371 * 10⁹ (-11.7%)
  Mathlib.LinearAlgebra.Semisimple:                    -32.216 * 10⁹ (-38.4%)
  Mathlib.AlgebraicGeometry.EllipticCurve.Affine:      -30.444 * 10⁹ (-17.7%)
  Mathlib.Analysis.NormedSpace.PiTensorProduct.InjectiveSeminorm:
                                                       -30.417 * 10⁹ (-10.1%)
  Mathlib.Algebra.Lie.Weights.Killing:                 -29.928 * 10⁹ (-13.0%)
  Mathlib.Analysis.InnerProductSpace.Adjoint:          -28.446 * 10⁹ (-13.3%)
  Mathlib.LinearAlgebra.Matrix.ToLin:                  -27.938 * 10⁹ (-18.3%)
  Mathlib.NumberTheory.NumberField.Embeddings:         -27.892 * 10⁹ (-21.1%)
  Mathlib.AlgebraicGeometry.EllipticCurve.Group:       -27.225 * 10⁹ (-20.1%)
  Mathlib.FieldTheory.Normal:                          -27.088 * 10⁹ (-28.5%)
  Mathlib.Analysis.NormedSpace.lpSpace:                -27.079 * 10⁹ (-10.4%)
  Mathlib.Topology.ContinuousFunction.UniqueCFC:       -26.821 * 10⁹ (-12.1%)
  Mathlib.Algebra.Algebra.NonUnitalSubalgebra:         -25.469 * 10⁹ (-17.2%)

Files that got faster by at least 30%:
  Mathlib.RingTheory.DedekindDomain.Different:                        -41.8%
  Mathlib.RingTheory.Nullstellensatz:                                 -40.8%
  Mathlib.FieldTheory.MvPolynomial:                                   -40.7%
  Mathlib.RingTheory.DedekindDomain.IntegralClosure:                  -40.2%
  Mathlib.LinearAlgebra.Semisimple:                                   -38.4%
  Mathlib.Algebra.Order.Field.Subfield:                               -38.2%
  Mathlib.RingTheory.Localization.NormTrace:                          -38.2%
  Mathlib.NumberTheory.Cyclotomic.PrimitiveRoots:                     -37.5%
  Mathlib.RingTheory.FractionalIdeal.Operations:                      -36.9%
  Mathlib.RingTheory.Trace:                                           -35.0%
  Mathlib.RingTheory.Algebraic:                                       -34.9%
  Mathlib.RingTheory.IntegralRestrict:                                -34.8%
  Mathlib.RingTheory.Localization.Away.AdjoinRoot:                    -34.5%
  Mathlib.RingTheory.LaurentSeries:                                   -34.2%
  Mathlib.NumberTheory.NumberField.FractionalIdeal:                   -33.0%
  Mathlib.Algebra.Ring.Subring.Order:                                 -32.3%
  Mathlib.RingTheory.Unramified.Basic:                                -32.0%
  Mathlib.RingTheory.Jacobson:                                        -32.0%
  Mathlib.RingTheory.IntegralClosure:                                 -31.1%
  Mathlib.RingTheory.Perfection:                                      -31.0%
  Mathlib.LinearAlgebra.FreeModule.Norm:                              -30.8%
  Mathlib.RingTheory.Etale.Basic:                                     -30.7%
  Mathlib.FieldTheory.NormalClosure:                                  -30.4%
  Mathlib.Algebra.Ring.Subsemiring.Order:                             -30.2%
  Mathlib.RingTheory.IsAdjoinRoot:                                    -30.1%
  Mathlib.FieldTheory.PolynomialGaloisGroup:                          -30.0%

Michael Stoll (Jun 08 2024 at 09:44):

It may be necessary to (locally) re-adjust some instance priorities in certain files.

(I don't think that there is one single "best" assignment of priorities; depending on context some instances should have higher priority than others, but which ones these are will not always be the same.)
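
For concreteness, one way to do such a local adjustment is to register an instance with an explicit priority using the standard attribute syntax. Here is a minimal, self-contained sketch with hypothetical names (Marker and its instances are not real Mathlib declarations):

class Marker (α : Type) where
  tag : Nat

instance : Marker Nat := ⟨0⟩

-- An alternative instance that is not registered globally:
def Marker.alt : Marker Nat := ⟨1⟩

section
-- Register it locally with a high priority, so it wins over the global instance in this section:
attribute [local instance 2000] Marker.alt

example : Marker.tag (α := Nat) = 1 := rfl
end

-- Outside the section the global instance is used again:
example : Marker.tag (α := Nat) = 0 := rfl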

Michael Stoll (Jun 08 2024 at 09:56):

I have posted the detailed output of my script in a comment on #7873 (with four expandable subsections).

Yuyang Zhao (Jun 08 2024 at 22:53):

Some slowdowns in isDefEq may be caused by reordering extends. Eliminating defeq abuse of Quiver.Hom avoids slowdowns in preadditive categories.

Yuyang Zhao (Jun 09 2024 at 03:22):

Why is the second one much slower?

import Mathlib

open scoped Classical

count_heartbeats in -- Used 1591 heartbeats
noncomputable def IsField.toSemifield' {R : Type u} [Semiring R] (h : IsField R) : Semifield R where
  __ := ‹Semiring R›
  __ := h
  inv a := if ha : a = 0 then 0 else Classical.choose (h.mul_inv_cancel ha)
  inv_zero := dif_pos rfl
  mul_inv_cancel a ha := by convert Classical.choose_spec (h.mul_inv_cancel ha); exact dif_neg ha
  nnqsmul := _

count_heartbeats in -- Used 6564 heartbeats
noncomputable def IsField.toSemifield'' {R : Type u} [Semiring R] (h : IsField R) : Semifield R where
  toSemiring := ‹Semiring R›
  __ := h
  inv a := if ha : a = 0 then 0 else Classical.choose (h.mul_inv_cancel ha)
  inv_zero := dif_pos rfl
  mul_inv_cancel a ha := by convert Classical.choose_spec (h.mul_inv_cancel ha); exact dif_neg ha
  nnqsmul := _

Yuyang Zhao (Jun 09 2024 at 03:38):

import Mathlib

open scoped Classical

set_option backward.isDefEq.lazyProjDelta false in
count_heartbeats in -- Used 2997 heartbeats
noncomputable def IsField.toSemifield' {R : Type u} [Semiring R] (h : IsField R) : Semifield R where
  __ := ‹Semiring R›
  __ := h
  inv a := if ha : a = 0 then 0 else Classical.choose (h.mul_inv_cancel ha)
  inv_zero := dif_pos rfl
  mul_inv_cancel a ha := by convert Classical.choose_spec (h.mul_inv_cancel ha); exact dif_neg ha
  nnqsmul := _

set_option backward.isDefEq.lazyProjDelta false in
count_heartbeats in -- Used 2884 heartbeats
noncomputable def IsField.toSemifield'' {R : Type u} [Semiring R] (h : IsField R) : Semifield R where
  toSemiring := ‹Semiring R›
  __ := h
  inv a := if ha : a = 0 then 0 else Classical.choose (h.mul_inv_cancel ha)
  inv_zero := dif_pos rfl
  mul_inv_cancel a ha := by convert Classical.choose_spec (h.mul_inv_cancel ha); exact dif_neg ha
  nnqsmul := _

Yuyang Zhao (Jun 09 2024 at 05:50):

I used set_option backward.isDefEq.lazyProjDelta false to speed up some of the slower files.

Kevin Buzzard (Jun 09 2024 at 07:47):

Formally, the reason the second one is slower is because it has a ton of

[] [0.014299]  0 =?= 0

s which turn into

[] [0.014669]  Zero.toOfNat0 =?= Zero.toOfNat0

s, where the LHS is @Zero.toOfNat0 R AddMonoid.toZero : OfNat R 0 and the RHS is the slightly different @Zero.toOfNat0 R MonoidWithZero.toZero : OfNat R 0, and all the 0.015's add up.

(NB I found this by comparing outputs using set_option trace.profiler true in; of course it doesn't answer the underlying question but it is some information about what's going on)

Yuyang Zhao (Jun 09 2024 at 11:31):

After further adjustments I got

+ build        instructions           -5.9%
+ build        typeclass inference   -13.8%
+ build        wall-clock             -6.0%

But there are also more things that need to be fixed.

Matthew Ballard (Jun 12 2024 at 16:26):

I think this is great but am concerned about a couple of things.

What is the explanation for this? It is hard to disentangle the reordering, the changes in priority for projections from preferred and non-preferred parents, and the new backwards compatibility flags being enabled.

My guess is that by moving the parent with the deepest/most data fields to the preferred slot you can avoid defeq checks on the data itself probably because they become syntactic equalities earlier in the process.

How much of the changes to the priorities can be achieved by just asking core to set new values for the projections from preferred and non-preferred parents? Everyone seems to think this is a problem but we haven't had good hard performance-based evidence for a change to bring to core. It would be great and much less ad-hoc if this was it.

Turning on new backward-compatibility flags is probably not going to get merged, though. So we will have to understand the regressions there better and either fix them ourselves or make a compelling case upstream.

Matthew Ballard (Jun 12 2024 at 16:29):

Also, some of the worst regressions can probably be helped with #11521

Mario Carneiro (Jun 12 2024 at 19:05):

One thing I just noticed is that docs#SetLike.instCoeSortType directly uses a lambda and membership:

instance (priority := 100) : CoeSort A (Type _) :=
  ⟨fun p => { x : B // x ∈ p }⟩

Since this is a coercion, it is unfolded and therefore loads of mathlib terms contain these explicit lambdas. (This also needs a dedicated delaborator, written just below it.) Has anyone tried replacing this with a call to a setToType function? The specific reason I have to believe that this will improve performance across the library is that I can see failed instance searches caused by the fact that the indexing is not able to handle the lambdas: if I have definitions foo, bar : Subgroup G then CommGroup ↑foo also attempts to apply an instance for CommGroup ↑bar. Normally this would not happen because the discrimination tree can tell them apart, but after the aforementioned elaboration the result is CommGroup {x // x ∈ foo} vs CommGroup {x // x ∈ bar} and lean gives up indexing at the lambda.
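
A minimal sketch of the setToType idea Mario floats (the name and the exact form here are hypothetical, not existing Mathlib API): wrap the subtype in a named definition, so that once the coercion unfolds, terms are headed by a constant rather than by a set-builder lambda that the discrimination tree cannot index past.

import Mathlib

-- Hypothetical helper: the coercion target of a `SetLike` term as a named definition.
-- An application of `setToType` is indexed by its head constant, whereas
-- `{ x : B // x ∈ p }` is `Subtype` applied to a lambda.
def setToType {A B : Type*} [SetLike A B] (p : A) : Type _ :=
  { x : B // x ∈ p }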

Matthew Ballard (Jun 12 2024 at 19:14):

Let's find out.

Jovan Gerbscheid (Jun 12 2024 at 19:31):

Wouldn't the explicit lambda be applied to an argument? In that case the discrimination tree gets rid of the lambda and there is no problem.

Mario Carneiro (Jun 12 2024 at 20:21):

I'm referring to the lambda in Subtype, i.e. the set abstraction in the code block

Mario Carneiro (Jun 12 2024 at 20:21):

the fun p => lambda is indeed applied

Yuyang Zhao (Jun 13 2024 at 06:02):

What is the status of #11521 now? I think my PRs really need it to eliminate some regressions.

Michael Rothgang (Jun 13 2024 at 08:19):

I looked at a different angle: are there any low-hanging fruit to speed up files? Turns out that yes, there are: sometimes, small changes to a file have a big effect. Three small examples, each a one-line change:

  • #13769 replaces a slow continuity call by fun_prop
  • #13772 avoids a really slow change
  • #13771 uses suppress_compilation in one file (no downstream files are affected)

Michael Rothgang (Jun 13 2024 at 10:18):

Wow, thanks for the speedy reviews and merges. Next in the list is #13770, which expands a few slow measurability calls: this one is more a matter of discussion. In my opinion, the effects make it worth talking about - and if the solution is to improve fun_prop to handle these instead :-)
CC @Rémy Degenne as one of the probability/measure theory maintainers
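
For readers who have not used fun_prop, the kind of one-line swap made in #13769 looks roughly like the following (an illustrative goal, not the actual Mathlib diff):

import Mathlib

-- A continuity side goal of the shape `continuity` can be slow on;
-- `fun_prop` discharges it from the hypotheses `hf` and `hg`.
example {f g : ℝ → ℝ} (hf : Continuous f) (hg : Continuous g) :
    Continuous fun x => f x + g x := by
  fun_prop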

Jovan Gerbscheid (Jun 13 2024 at 11:54):

Mario Carneiro said:

I'm referring to the lambda in Subtype, i.e. the set abstraction in the code block

I see, so what is going on is that the currying of Membership.mem is the wrong way around, because fun x => x ∈ foo would be eta reduced in the discrimination tree if x was the last argument of Membership.mem. Is there a reason for having the arguments in this order? Apart from "that's the order in which the arguments appear in the notation".

We could also define a (not reducible) constant memRev with notation for this purpose, if we want to keep the currying in the order in which we write arguments.

Yuyang Zhao (Jun 13 2024 at 17:07):

I tried to do some other things with InjSurj in #13795.

+ build        compilation      -6.1%
+ build        linting          -5.3%
+ build        type checking    -5.4%

Yuyang Zhao (Jun 18 2024 at 22:20):

#7873 + #13852 + #13690

+ build        instructions          -10.6%
+ build        linting                -6.1%
+ build        tactic execution       -6.6%
+ build        task-clock             -8.8%
+ build        typeclass inference   -25.0%
+ build        wall-clock            -10.2%

Johan Commelin (Jun 21 2024 at 18:39):

Thanks so much for doing this! The speedup is impressive. Yet, looking at the PRs together with Rob Lewis we have two requests:

Matthew Ballard (Jul 01 2024 at 17:12):

Matthew Ballard said:

Let's find out.

I ran into some stage0 issues (I think) with this :cry:

Yuyang Zhao (Jul 03 2024 at 21:17):

I'll come back and write an issue soon when I have more time. #13690 actually contains all the changes and the final effects have been obtained by changing the base branch.

Matthew Ballard (Aug 02 2024 at 14:47):

Matthew Ballard said:

Matthew Ballard said:

Let's find out.

I ran into some stage0 issues (I think) with this :cry:

As is usual, I was being silly.

Compare the keys for

∀ {G : Type*} [Group G] {H : Subgroup G}, SMul (H.op) G

Before

#[SMul, Subtype, MulOpposite, *, , *]

and after

#[SMul, Subtype, MulOpposite, *, Membership.mem, MulOpposite, *, Subgroup, MulOpposite, *, *, *, Subgroup.op, *, *, *, *]

Matthew Ballard (Aug 02 2024 at 16:20):

Compare

[] new goal Fintype ↥⊤ 
  [instances] #[@Set.fintypeInsert', @Unique.fintype, @Subtype.fintype, @Fintype.subtypeEq, @Finset.fintypeCoeSort, @List.Subtype.fintype, @Multiset.Subtype.fintype, @Finset.Subtype.fintype, @FinsetCoe.fintype, @Set.fintypeUniv, @Set.fintypeUnion, @Set.fintypeSep, @Set.fintypeInter, @Set.fintypeInterOfLeft, @Set.fintypeInterOfRight, @Set.fintypeDiff, @Set.fintypeDiffLeft, @Set.fintypeiUnion, @Set.fintypesUnion, @Set.fintypeBiUnion', @Set.fintypeBind', @Set.fintypeEmpty, @Set.fintypeSingleton, @Set.fintypePure, @Set.fintypeInsert, @Set.fintypeImage, @Set.fintypeRange, @Set.fintypeMap, @Set.fintypeImage2, @Set.fintypeSeq, @Set.fintypeSeq', @Set.fintypeMemFinset, @Submonoid.fintypePowers, @AddSubmonoid.fintypeMultiples, @RingHom.fintypeRangeS, inst]

vs

[] new goal Fintype { x // x ∈ ⊤ }
  [instances] #[@Unique.fintype, @Subtype.fintype, inst]

Matthew Ballard (Aug 03 2024 at 10:37):

Benchmark for a manual byAsSorry build of mathlib in #15457.

Kevin Buzzard (Aug 04 2024 at 19:12):

I love that some files got slower!

Edward van de Meent (Aug 04 2024 at 20:09):

how/why does that happen?

Edward van de Meent (Aug 04 2024 at 20:10):

does lean have a hard time inferring types or something like that?

Matthew Ballard (Aug 04 2024 at 20:53):

The changes to the keys primarily affect instance synthesis and the simplifier. It is possible that we are getting “worse” instances synthesized or simp theorems at some points locally.

It is also possible that my pattern of

Instance : DataDataData X := sorry

Caught up at points this way

Yuyang Zhao (Aug 05 2024 at 05:41):

Yuyang Zhao said:

I tried to do some other things with InjSurj in #13795.

#15476 (and its dependencies) is easier to review. After it gets merged #13795 may need some adjustments.

Matthew Ballard (Aug 06 2024 at 18:50):

I made the RFC: lean#4932 to swap the arguments of docs#Membership.mem.

Yuyang Zhao (Sep 06 2024 at 10:34):

In my previous attempts to refactor algebraic order structures, I realized that mixin typeclasses may cause significant slowness that needs to be fixed by modifying the synthesis algorithm. In the current algorithm, IsSimpleOrder.toNontrivial (auto-generated by class docs#IsSimpleOrder) causes the entire order hierarchy to be searched when synthesizing Nontrivial.

I can imagine a possible way to optimize synthesis by changing the synthesis order of subgoals to right-to-left (if possible in some sense) (or in a specified order), postponing subgoals on the left side. But I don't have enough knowledge about the details to modify it.

A specialized version of this is to set all instance parameters of a mixin typeclass as outParam, and then do the synthesis without synthesizing those parameters. The benchmarking of #16486 shows that it probably does work. If there was a way to remove some instances globally, we could easily do a bigger test in Mathlib without modifying core.

I'm not an expert on typeclasses, so I'm wondering if this sounds like a good idea?
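
To see what this looks like in practice, one can trace a Nontrivial search; the trace prints the full list of candidate instances for the goal, which is where instances such as IsSimpleOrder.toNontrivial (and, through them, the order hierarchy) show up. This is just the standard tracing idiom, nothing specific to the PRs above:

import Mathlib

set_option trace.Meta.synthInstance true in
example : Nontrivial ℕ := inferInstance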

Yuyang Zhao (Sep 06 2024 at 10:45):

@Filippo A. E. Nuccio advised me to ask @Anne Baanen, and then I just noticed that @Anne Baanen has already posted something similar in this topic. :open_mouth:

Yuyang Zhao (Sep 06 2024 at 10:52):

I also noticed docs#Lean.Meta.computeSynthOrder but modifying it directly causes some errors in Mathlib and doesn't change the order of IsSimpleOrder.toNontrivial.

Michael Stoll (Sep 06 2024 at 16:11):

@Yaël Dillies :up:

Yuyang Zhao (Sep 07 2024 at 03:11):

Yuyang Zhao said:

A specialized version of this is to set all instance parameters of a mixin typeclass as outParam, and then do the synthesis without synthesizing those parameters. The benchmarking of #16486 shows that it probably does work. If there was a way to remove some instances globally, we could easily do a bigger test in Mathlib without modifying core.

After figuring out how to globally remove instances, the further test showed that it was faster overall, but there was a significant slowdown in category theory and some other files.

Jovan Gerbscheid (Sep 07 2024 at 14:33):

Another way to speed up mathlib would be to reduce the import time. The following piece of code measures the initialization time of each PersistentEnvExtension:

import Mathlib

open Lean

def timePersistentExtensions (env : Environment) (opts : Options) : IO (Array (Nat × Name)) := do
  let mut results := #[]
  for ext in ← persistentEnvExtensionsRef.get do
    let t0 ← IO.monoMsNow
    discard <| ext.addImportedFn (ext.toEnvExtension.getState env).importedEntries { env := env, opts := opts }
    let t1 ← IO.monoMsNow
    if t1-t0 ≥ 10 then
      results := results.push (t1-t0, ext.name)
  return results.qsort (·.1 > ·.1)

run_meta do
  let r ← timePersistentExtensions (← getEnv) (← getOptions)
  logInfo m! "{r.map fun (t, name) => s!"{t}ms, {name}"}"

And these are the results, when importing Mathlib:

[362ms, Lean.Parser.parserExtension,
 217ms, Lean.Meta.simpExtension,
 159ms, Measurable,
 100ms, Mathlib.Meta.FunProp.functionTheoremsExt,
 87ms, Continuous,
 80ms, Lean.Meta.instanceExtension,
 56ms, Mathlib.Tactic.GCongr.gcongrExt,
 54ms, Lean.namespacesExt,
 47ms, Bound,
 46ms, Batteries.Tactic.Alias.aliasExt,
 40ms, Mathlib.Tactic.transExt,
 35ms, Lean.Elab.macroAttribute,
 35ms, Mathlib.Meta.FunProp.transitionTheoremsExt,
 34ms, Mathlib.Meta.FunProp.morTheoremsExt,
 30ms, Std.Tactic.Lint.batteriesLinterExt,
 27ms, Mathlib.Meta.NormNum.normNumExt,
 23ms, Lean.Compiler.specExtension,
 15ms, Mathlib.Meta.FunProp.lambdaTheoremsExt,
 14ms, finsetNonempty,
 13ms, SetLike,
 11ms, Lean.PrettyPrinter.Delaborator.appUnexpanderAttribute,
 11ms, builtin,
 10ms, default]

Hopefully the slowest of these can be made more efficient.

Jovan Gerbscheid (Sep 07 2024 at 15:20):

Would it make sense to make PersistentEnvExtensions lazy? So the PersistentEnvExtension would only be loaded if it was actually used. It seems to me like this could save a lot of importing time.

Yuyang Zhao (Sep 09 2024 at 21:40):

Yuyang Zhao said:

#7873 + #13852 + #13690

I split out some smaller PRs which may be easier to review. #16638 #16607

Yuyang Zhao (Sep 10 2024 at 02:31):

I recalled that I could specify the order of the subgoals and made #16646 as an alternative to #16607 and #16486. It doesn't need any additional fixes, but I don't know if merging it into mathlib is appropriate. (upd: This PR looks much better now than before.)

Johan Commelin (Sep 10 2024 at 06:05):

#16646 gives massive speed boosts! Is there a systematic way in which you found these synth orders?
(Apologies if you already explained this somewhere, I didn't manage to follow the entire discussion.)

Kim Morrison (Sep 10 2024 at 06:26):

(for reference, "massive" is -2.2% on instruction count, which is indeed massive!)

Johan Commelin (Sep 10 2024 at 06:35):

yep, and -6% on TC

Yuyang Zhao (Sep 10 2024 at 06:54):

I have an explanation in another PR. I've copied it to #16646.

Johan Commelin (Sep 10 2024 at 07:03):

Did you use metaprogramming to find the places where you should add lines like

set_synth_order Algebra.toModule #[4, 2, 3]

or did you find these with experimentation by hand?

Yuyang Zhao (Sep 10 2024 at 07:13):

The idea here is to search for typeclasses of the form Foo A B before searching for typeclasses of the form Bar A. Then it would not always search the entire Bar A hierarchy.

Yuyang Zhao (Sep 10 2024 at 07:17):

Use

import Mathlib

variable {A B : Type*}

set_option trace.Meta.synthInstance true in
#synth SMul A B

to check if the settings are correct.

Yuyang Zhao (Sep 10 2024 at 07:39):

I used it to find the instances that need their synthesis order changed. Then I fixed them by hand.

Matthew Ballard (Sep 10 2024 at 15:42):

I think #16646 looks great. :merge:

It should be straightforward to modify computeSynthOrder to avoid these re-synthesizations, but I think a tool to modify the synthesis order by hand is useful in itself, even if that change ever happens

Matthew Ballard (Sep 10 2024 at 16:08):

I also left a minor question on #16638. Otherwise, it looks good also. @Yuyang Zhao can we close the other two or is there still some benefit to extract?

Jireh Loreaux (Sep 10 2024 at 16:25):

Wow, #16646 is amazing, nice work!

Kim Morrison (Sep 10 2024 at 22:37):

Woah, this was not a good idea to merge #16646, and I would like to revert it asap.

This is playing with fire: it's using an internal Lean API to intentionally break the documented behaviour of typeclass search. This hack will break, leaving us in a worse position later. I appreciate that the FRO has said that modifying the TC search behaviour is not on the short term roadmap --- but this is precisely because it is very hard to get right.

Kim Morrison (Sep 10 2024 at 22:39):

(I'm sorry, I should have written something on the PR. It honestly didn't occur to me that anyone would consider merging it. :-)

Kim Morrison (Sep 10 2024 at 22:41):

#16689 proposes reverting this PR.

Kim Morrison (Sep 10 2024 at 22:42):

I'm not saying we should definitely never do this. But it should be a much more involved process, e.g. first thinking about whether we can do this in a supported fashion, thinking about stabilising the API first if we are going to use such tools, adding a CI step that ensures Mathlib would still build if this hack suddenly stopped working, etc.

Kim Morrison (Sep 11 2024 at 03:01):

Even better would be to construct a Mathlib-free #mwe that demonstrates a significant performance change using this technique. If we have this available, the FRO is definitely game to work on achieving this speed up in a robust way.

Kim Morrison (Sep 11 2024 at 03:01):

@Yuyang Zhao and @Matthew Ballard, do you have a sense of what would be required to construct such an MWE?

Johan Commelin (Sep 11 2024 at 03:06):

I have just kicked #16689 on the queue. Explicitly not to stop this whole approach, but because we need more research and more discussion.

Johan Commelin (Sep 11 2024 at 03:06):

I would be happy to help with building an MWE.

Yuyang Zhao (Sep 11 2024 at 03:44):

import Mathlib.Util.SetSynthOrder

class A (n : Nat) where

instance [A n] : A n.succ where

class B [A 20050] where

class C [A 20000] extends B where

instance : A 20050 where

#synth B

set_synth_order C.toB #[1, 0]

#synth B

Kim Morrison (Sep 11 2024 at 04:47):

As that relies on code in the PR, I've made a standalone version.

Kim Morrison (Sep 11 2024 at 04:47):

import Lean.Elab.Command
import Lean.Elab.MatchExpr
import Lean.Meta.Instances

/-!
# Specify the synthesization order for instances
-/

namespace Lean.Meta

/-- Specify the synthesization order for instances. -/
def Instances.setSynthOrder (declName : Name) (synthOrder : Array Nat) : CoreM Unit := do
  let some entry := Meta.instanceExtension.getState (← getEnv) |>.instanceNames.find? declName |
    throwError "'{declName}' does not have [instance] attribute"
  instanceExtension.add { entry with synthOrder } entry.attrKind

end Lean.Meta

namespace Lean.Elab.Command

open Meta

/-- Specify the synthesization order for instances. -/
elab "set_synth_order " name:ident synthOrder:term : command => do
  let q := Term.MatchExpr.toDoubleQuotedName name
  elabCommand (← `(command| run_meta Instances.setSynthOrder $q $synthOrder))

set_synth_order instAddNat #[]

end Lean.Elab.Command

class A (n : Nat) where

instance [A n] : A n.succ where

class B [A 20050] where

class C [A 20000] extends B where

instance : A 20050 where

/--
error: (deterministic) timeout at `typeclass`, maximum number of heartbeats (20000) has been reached
Use `set_option synthInstance.maxHeartbeats <num>` to set the limit.
Additional diagnostic information may be available using the `set_option diagnostics true` command.
-/
#guard_msgs in
#synth B

set_synth_order C.toB #[1, 0]

/--
error: failed to synthesize
  B
Additional diagnostic information may be available using the `set_option diagnostics true` command.
-/
#guard_msgs in
#synth B

Kim Morrison (Sep 11 2024 at 04:48):

What we really want though is a realistic minimisation from Mathlib that shows the underlying performance problem.

Kim Morrison (Sep 11 2024 at 04:48):

i.e. something that is agnostic to the solution.

Kim Morrison (Sep 11 2024 at 04:50):

Also, I'm confused by this example. :-) Shouldn't #synth B succeed?

Yuyang Zhao (Sep 11 2024 at 06:05):

I think changing the order mostly speeds up the failed subgoals. Surely we can turn it into an example of successful synthesization.

import Lean.Elab.Command
import Lean.Elab.MatchExpr
import Lean.Meta.Instances

/-!
# Specify the synthesization order for instances
-/

namespace Lean.Meta

/-- Specify the synthesization order for instances. -/
def Instances.setSynthOrder (declName : Name) (synthOrder : Array Nat) : CoreM Unit := do
  let some entry := Meta.instanceExtension.getState (← getEnv) |>.instanceNames.find? declName |
    throwError "'{declName}' does not have [instance] attribute"
  instanceExtension.add { entry with synthOrder } entry.attrKind

end Lean.Meta

namespace Lean.Elab.Command

open Meta

/-- Specify the synthesization order for instances. -/
elab "set_synth_order " name:ident synthOrder:term : command => do
  let q := Term.MatchExpr.toDoubleQuotedName name
  elabCommand (← `(command| run_meta Instances.setSynthOrder $q $synthOrder))

set_synth_order instAddNat #[]

end Lean.Elab.Command

class A (n : Nat) where

instance [A n] : A n.succ where

class B [A 20050] where

class C [A 20000] extends B where

instance : A 20050 where

class D where

instance : D where
instance [B] : D where

/--
error: (deterministic) timeout at `typeclass`, maximum number of heartbeats (20000) has been reached
Use `set_option synthInstance.maxHeartbeats <num>` to set the limit.
Additional diagnostic information may be available using the `set_option diagnostics true` command.
-/
#guard_msgs in
#synth D

set_synth_order C.toB #[1, 0]

/-- info: instD -/
#guard_msgs in
#synth D

Kim Morrison (Sep 11 2024 at 06:25):

No -- I don't understand in your previous example why it fails! instance : A 20050 where is present, so why doesn't synthesisation succeed?

Daniel Weber (Sep 11 2024 at 07:44):

Why should it? There is no B instance

Yuyang Zhao (Sep 11 2024 at 09:07):

Only C.toB produces B, then it times out if A 20000 is searched first. I don't see the problem here.

Yuyang Zhao (Sep 11 2024 at 09:29):

Also, what happens here is similar to map_mul after #8386.

After this PR, the goals become [FunLike F ?M ?N] [Mul ?M] [Mul ?N] [MulHomClass F ?M ?N]. Now [FunLike F ?M ?N] is synthesized first, supplies values for ?M and ?N and then the Mul M and Mul N instances can be found, before trying MulHomClass F M N which fails. Since the Mul hierarchy is very big, this can be slow to fail, especially when there is no such Mul instance.

A long-term but harder to achieve solution would be to specify the order in which instance goals get solved. For example, we'd like to change the arguments to map_mul to look like [FunLike F M N] [Mul M] [Mul N] [highPriority <| MulHomClass F M N] because MulHomClass fails or succeeds much faster than the others.
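
For reference, the lemma being discussed has the following shape; it is restated here as an example so that the bracketed hypotheses, i.e. exactly the instance goals whose solving order is at issue, are visible (this mirrors Mathlib's map_mul rather than defining anything new):

import Mathlib

example {M N F : Type*} [Mul M] [Mul N] [FunLike F M N] [MulHomClass F M N]
    (f : F) (a b : M) : f (a * b) = f a * f b :=
  map_mul f a b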

Matthew Ballard (Sep 11 2024 at 10:01):

Sorry for being too hasty here. Once I looked at the PR things seemed obvious to me. As synth order sits next to (well almost) priority in docs#InstanceEntry, it seemed natural that the end user would want to manipulate it subject to some basic guardrails relating to (semi-)out-parameters (which admittedly are not present).

@Kim Morrison do you understand why the example given previously fails now?

The explanation above by @Yuyang Zhao is my understanding also. This makes failure faster. Looking at the changes in the PR really informed me.

As I mentioned above, I don't think changing computeSynthOrder in core to avoid this pattern would be very hard. This would also naturally provide evidence and expose problems when run over Mathlib. But, I barely have time to type this message today.

Kim Morrison (Sep 12 2024 at 01:55):

@Yuyang Zhao, do you think you might be able to produce a more realistic minimization from mathlib? I appreciate that it is a lot of work.

Kim Morrison (Sep 13 2024 at 09:11):

@Yuyang Zhao, @Matthew Ballard,

@Johan Commelin has written this up as an issue lean#5333, and we're hoping to make progress on this!

Kim Morrison (Sep 19 2024 at 09:22):

Just a heads up here that we have a fix in the pipeline, that doesn't require any manual adjustments, and also achieves a ~2% speedup across Mathlib (and an overall 8% reduction in typeclass search). See lean#5376.

The Mathlib diff required to adapt to it is at https://github.com/leanprover-community/mathlib4/compare/nightly-testing-2024-09-18...lean-pr-testing-5376.

I've added as few new outParams as possible (I think), but I hope that someone may be interested in doing a thorough review of this later.

Quite likely lean#5376 will land in the next few days, and then these adaptations will arrive in Mathlib's nightly-testing.

If anyone sees good changes to make, please write about them here and push directly to branch#lean-pr-testing-5376.

Johan Commelin (Sep 19 2024 at 18:44):

@Anne Baanen If you are around, I would love to hear what you think of this Lean PR and the Mathlib diff. Because you've battled with synth order, funlike-classes, and outParams in the past. So you might have some valuable insights to share.

Anne Baanen (Sep 19 2024 at 19:18):

First impressions are positive: this is something we used to do in Mathlib 3, I recall specifically in the module classes, and later abandoned, as the changed synth order in the community fork of Lean 3 meant that dependencies would be filled in automatically. Now Lean 4 is back to the original order, so the same kind of change makes sense.

One thing I'm wondering about: how does this interact with unification for Prop-valued classes that take non-Prop arguments? Will unification not simply skip the instances and so leave their arguments unassigned? (Is that why the hom classes needed extra outParams?)

Anne Baanen (Sep 19 2024 at 19:20):

Will check in detail later; probably by noon tomorrow (CEST, or certainly UTC!) I can say more interesting things!

Anne Baanen (Sep 20 2024 at 08:49):

One issue we had with making instance parameters implicit is the following: docs#Module depends on a docs#AddCommMonoid instance, but often we're interested in modules over rings, with a docs#AddCommGroup instance. In that case we can often get away with only considering unification, when the problem is shaped like AddCommGroup.toAddCommMonoid ?inst =?= AddCommGroup.toAddCommMonoid someInstance. But sometimes we get unification problems of the form AddCommGroup.toAddCommMonoid ?inst =?= Int.AddCommMonoid and then we'd need to do an expensive instance search anyway (or use unification hints!).

Since this only affects projections, the instances involved are always going to be literally the same as the ones we're looking for, so that issue won't come up here.

Anne Baanen (Sep 20 2024 at 08:52):

I suspect I have a counterexample showing that the change is useful but doesn't solve all existing issues in Mathlib with the current synth order being slow to fail. Namely, when we have a simp lemma that only applies if a certain slow-to-fail instance is found, and another simp lemma that isn't as slow. Let me see if I can get that to work on lean#5376.

Anne Baanen (Sep 20 2024 at 08:54):

To be clear, lean#5376 is great work and I'm happy it solves so much, I just claim we're not out of the woods yet with suboptimal synth orders.

Matthew Ballard (Sep 20 2024 at 08:55):

What was the rationale for returning to the original Lean 3 synth order after community experimentation and change in Mathlib 3?

Matthew Ballard (Sep 20 2024 at 08:56):

Also: it very much looks like your guess on Prop-valued instances taking non-Prop valued terms is the main pain point

Matthew Ballard (Sep 20 2024 at 08:58):

There are still a few examples of this where one needs outParams in the adaptation PR.

Matthew Ballard (Sep 20 2024 at 08:59):

We have a library note on "implicit instance arguments". Is this obsolete now?

Anne Baanen (Sep 20 2024 at 09:16):

Anne Baanen said:

I suspect I have a counterexample showing that the change is useful but doesn't solve all existing issues in Mathlib with the current synth order being slow to fail. Namely, when we have a simp lemma that only applies if a certain slow-to-fail instance is found, and another simp lemma that isn't as slow. Let me see if I can get that to work on lean#5376.

I think this is an example of what I'm talking about:

section SlowSynthGadget

class A (n : Nat) where

instance [A n] : A n.succ where

class B [A 20050] where
  fancyRefl : ∀ x : Nat, x = x

end SlowSynthGadget

/-! The issue is as follows: suppose we have a term of the form `foo (bar _)` and a `simp` lemma
about `bar` that requires a slow-to-fail instance. Then `simp` will try to simplify `bar` first,
time out and fail to notice there's also a `simp` lemma about `foo (bar _)`.
-/
def foo (_ : Nat) : Nat := 0
def bar (x : Nat) : Nat := x

theorem bar_apply
    {_ : A 20050} -- Just to be sure that the issue is with the classes, not the lemmas!
    [B] :
    bar x = x := B.fancyRefl _

theorem foo_bar (x : Nat) : foo (bar x) = 0 := rfl

theorem foo_bar_baz : foo (bar 0) = 0 := by
  -- Times out when trying to do `bar_apply`, when it should instead fail and go for `foo_bar`.
  simp only [bar_apply, foo_bar]

This happens both on a recent Mathlib4 master and on branch#lean-pr-testing-5376.

Anne Baanen (Sep 20 2024 at 09:29):

Oh right, this is because simp doesn't actually check which flavour of parameter it is passed, it just looks at whether something has a class as type or not:

-- src/Lean/Meta/Tactic/Simp/Rewrite.lean:87
    if (← instantiateMVars x).isMVar then
      -- A hypothesis can be both a type class instance as well as a proposition,
      -- in that case we try both TC synthesis and the discharger
      -- (because we don't know whether the argument was originally explicit or instance-implicit).
      if (← isClass? type).isSome then
        if (← synthesizeInstance x type) then
          continue
      if (← isProp type) then
        unless (← discharge?' thmId x type) do
          return false

Anne Baanen (Sep 20 2024 at 09:30):

That explains why I had so much trouble getting simp to listen to my annotations!

Anne Baanen (Sep 20 2024 at 09:32):

By the way, my counterexample comes from using simp [map_mul] on expressions like f (x * y) + g (a * b), where f is a docs#MulHomClass but g isn't. This would result in slowly searching through the entire Mul hierarchy, and then failing on MulHomClass.
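
Concretely, the failure shape is roughly the following sketch, where a plain Equiv stands in for "a bundled FunLike map without a MulHomClass instance": simp rewrites the f-term, but on the g-term it first has to run (and fail) the MulHomClass search before leaving it alone.

import Mathlib

example {M N : Type*} [CommMonoid M] [CommMonoid N]
    (f : M →* N) (g : M ≃ N) (x y a b : M) :
    f (x * y) * g (a * b) = f x * f y * g (a * b) := by
  simp only [map_mul]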

Anne Baanen (Sep 20 2024 at 09:35):

Matthew Ballard said:

We have a library note on "implicit instance arguments". Is this obsolete now?

It was obsolete (in most cases) by the end of Mathlib 3, but is relevant again. In fact, it is now extra relevant since the thing it talks about happens automatically now in certain cases!

Matthew Ballard (Sep 20 2024 at 09:59):

branch#mrb/5376 has some more outParam annotations and cleanup

Anne Baanen (Sep 20 2024 at 10:05):

Matthew Ballard said:

branch#mrb/5376 has some more outParam annotations and cleanup

Ah, you were faster than me getting the ZeroMemClass outParam fix to work!

Anne Baanen (Sep 20 2024 at 10:06):

I had to change docs#SubsemiringClass.addSubmonoidWithOneClass to also use implicit params:

instance (priority := 100) SubsemiringClass.addSubmonoidWithOneClass (S : Type*)
    (R : Type u) {_ : NonAssocSemiring R} [SetLike S R] [h : SubsemiringClass S R] :
    AddSubmonoidWithOneClass S R :=
  { h with }

Anne Baanen (Sep 20 2024 at 10:10):

see commit#6c3f6b8d029

Kim Morrison (Sep 20 2024 at 11:37):

Note that I've just merged lean-pr-testing-5376 into nightly-testing. (Also, I'll be offline for the next 48 hours.)

Jireh Loreaux (Sep 20 2024 at 11:49):

@Anne Baanen I just recently marked a bunch of these very generic simp lemmas (e.g. map_zero) with low priority for exactly this reason (i.e., failing slowly because they have to search the entire FunLike hierarchy). I'm not saying it's a fix for this issue, but I thought it worthwhile to mention.

Yuyang Zhao (Oct 04 2024 at 17:39):

The part of #16646 that lean#5376 didn't resolve (#17410 or #17411) is ready. Lean complains about synthesization order of docs#DistribMulAction.toDistribSMul. This can be resolved with outParam (#17410) or set_option synthInstance.checkSynthOrder false in (#17411). Considering that #16486 does something similar to lean#5376, but uses outParam and causes much more slowdowns in mathlib, I think these slowdowns may be caused by outParam and #17411 might be better.
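
For readers unfamiliar with the check that #17411 turns off, here is what the complaint looks like on a toy pair of classes (the class names are hypothetical; only the option is Lean's): the commented-out instance is rejected because the Foo subgoal mentions a variable that neither occurs in the goal nor is an outParam, and the set_option line is what makes Lean accept the instance anyway.

class Foo (α β : Type) : Prop where

class Bar (α : Type) : Prop where

-- Rejected: Lean cannot find a synthesization order, since `β` only occurs in `Foo α β`.
-- instance [Foo α β] : Bar α where

-- Accepted once the check is disabled (the pattern #17411 reaches for):
set_option synthInstance.checkSynthOrder false in
instance [Foo α β] : Bar α where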

Johan Commelin (Oct 09 2024 at 06:22):

@Yuyang Zhao your message has been on my todo list for 5 days. I see that you closed #17410 yesterday. Could you please comment why? (Possibly on the PR page, so that it is clear also for future visitors.)

Johan Commelin (Oct 09 2024 at 06:28):

I really appreciate that you are digging in to these issues, and finding ways to speed up mathlib. But I also think it is important that we solve these issues in a principled way, where we understand why we are fixing things in a particular way.

At the moment I don't have that understanding in the case of your proposed PR.
Why should we adapt the priorities, and why by this amount?
Why does Lean complain about the synthesization order? And why is it ok to just switch off the check?
Why do these problems show up in this part of the hierarchy? Or are there other parts of the hierarchy that should also be changed?

Yuyang Zhao (Oct 09 2024 at 07:39):

For this change, #17411 uses the same solution as lean#5376, and using outParam could lead to unexpected problems. So I think such a change should be less controversial.
However, I did fail to understand everything that happened here, and I'm willing to reopen #17410 if maintainers think it's more appropriate.

Yuyang Zhao (Oct 09 2024 at 07:43):

For PRs that change priorities, I'll break them down into more understandable parts when I have time.

Johan Commelin (Oct 09 2024 at 08:03):

@Yuyang Zhao Does it work to make a version that makes all the same changes except that it doesn't touch DistribMulActions? Or does that not compile?

Yuyang Zhao (Oct 09 2024 at 18:23):

#17582

Yuyang Zhao (Oct 12 2024 at 01:10):

While tracing the synthesization in the FunLike hierarchy, I found that lean#5376 did not modify all instances. I think fixing this could further improve performance.

class A

class B where
  a : Nat

class C extends B

class D [A] extends B, C

#check D.toB -- D.toB : {inst : A} → [self : D] → B
#check D.toC -- D.toC [A] [self : D] : C

Yuyang Zhao (Oct 12 2024 at 01:36):

However, there are some instances like docs#MulEquivClass.toMonoidWithZeroHomClass in the FunLike hierarchy. So even if this is fixed, Lean will still search the entire algebraic hierarchy when searching the FunLike hierarchy. :pensive:

Yuyang Zhao (Oct 12 2024 at 01:43):

If this is fixed, and the synthesization order can be specified, then we can avoid searching the algebraic hierarchy in a failed FunLike hierarchy search.

Yury G. Kudryashov (Oct 12 2024 at 03:55):

About #17582: can the new Algebra.toModule cause issues because it gets CommSemiring by unification while the goal Module R A only has a Semiring R argument?

Yury G. Kudryashov (Oct 12 2024 at 03:55):

Similarly for other instances.

Yuyang Zhao (Oct 12 2024 at 04:18):

Algebra R A has a CommSemiring R argument.

Yury G. Kudryashov (Oct 12 2024 at 04:21):

Did you test in the case when the Semiring in the goal isn't syntactically equal to CommSemiring.toSemiring?

Yury G. Kudryashov (Oct 12 2024 at 04:22):

It would be nice to have a test for that in test/.

Yury G. Kudryashov (Oct 12 2024 at 04:22):

So that if Lean's algorithm for TC search in the case of {_ : MyClass _} arguments changes, we catch it.

Yury G. Kudryashov (Oct 12 2024 at 04:24):

Note: I don't know enough about the current algorithm and what parts of it are considered to be "public API" to judge, I just saw something that looks suspicious to me and asked a question.

Yuyang Zhao (Oct 12 2024 at 04:33):

The parameter changes in #17582 are similar to lean#5376, so I think that's fine.

Yury G. Kudryashov (Oct 12 2024 at 04:37):

Thanks for the explanation!

Yuyang Zhao (Oct 12 2024 at 04:43):

The experiment in #16486 marks typeclass parameters (such as the CommSemiring R parameter of Algebra R A) as outParam, which looks safer. But this caused some slowdowns in category theory.

Violeta Hernández (Oct 25 2024 at 17:26):

Not sure if this thread is for this, but one of my PRs (#18217) shows this: [attached screenshot of the benchmark report]

Violeta Hernández (Oct 25 2024 at 17:26):

What does this mean? Is this bad?

Violeta Hernández (Oct 25 2024 at 17:27):

None of the actual .lean files seem to have been slowed down so I'm not sure what's going on here

Michael Rothgang (Oct 25 2024 at 19:04):

It means a slowdown - I don't know how important these metrics are.

Floris van Doorn (Oct 25 2024 at 20:30):

It could be spurious because of the recent second runner (@Sebastian Ullrich )?
I'd be surprised that a PR has such a big impact on these metrics.

Floris van Doorn (Oct 25 2024 at 20:30):

Maybe we should try a second bench!?

Violeta Hernández (Oct 25 2024 at 22:08):

Yep, benching again and that's gone!

Sebastian Ullrich (Oct 26 2024 at 07:35):

Yes, there have been a few diffs like that, but confusingly not only from the new runner, nor consistently so. I don't yet know what to make of that.

Sebastian Ullrich (Oct 26 2024 at 10:03):

Actually both machines now seem to have settled on the higher value for e.g. "build: import"
[attached graph of the benchmark timings]

Sebastian Ullrich (Oct 26 2024 at 10:05):

Then the most likely explanation is not a hardware difference but that the new machine came with a newer Linux version, which apparently affects I/O overhead. Now that I updated both to the same version, the difference seems to have vanished. That would explain why we saw the jump even between runs on the same machine.

Yuyang Zhao (Oct 31 2024 at 01:48):

Yuyang Zhao said:

For PRs that change priorities, I'll break them down into more understandable parts when I have time.

#18468 #18470 #18474 are ready.

Kyle Miller (Oct 31 2024 at 21:15):

Can you explain the theory behind these changes @Yuyang Zhao?

Is this something that the class command can do itself? There are currently some Lean issues about instances created by the class command, like lean4#2325 and lean4#2905. I'd like to understand what is going on here very carefully before we consider merging these PRs.

I am wondering how much of the issue is that we are not very careful about parent orderings. In the next Lean release there will be an option (set_option structure.strictResolutionOrder true) that will make it a warning if parents are not consistently ordered.

Michael Stoll (Oct 31 2024 at 21:21):

I think the idea is similar to what I did in #12778 (with the difference that I make the original instance priorities available in a scope): reduce the priorities of instances that are generated automatically via inheritance from the default 1000 to something small, in cases where we usually don't want these instances to be considered in TC resolution because they (say) connect the order and algebra hierarchies.

Michael Stoll (Oct 31 2024 at 21:22):

And maybe it is really time to consider doing something like this, as it does not look like @Yaël Dillies's original plans for a decoupling of order and algebra will be implemented.

Yaël Dillies (Oct 31 2024 at 21:59):

It's still on my todo list, but it's understandably a huge task

Michael Stoll (Oct 31 2024 at 23:47):

@Yaël Dillies Can you give an estimate when you are going to do (and finish) it?

The question is whether it may make sense to do the more hacky fixes for now to get some of the speed-up. They are easy to revert, so should not really stand in the way of a more principled refactor.

If you say you will do it within the next (say) six months or so, then it makes sense to wait for it. If it takes longer, then maybe not.

Yuyang Zhao (Nov 01 2024 at 04:22):

I know about the plan to make Ordered* mixin typeclasses, and I've made efforts to do so. If anyone wants to move it forward, #17444 is one step.

Yuyang Zhao (Nov 01 2024 at 04:29):

Kyle Miller said:

Can you explain the theory behind these changes Yuyang Zhao?

Is this something that the class command can do itself? There are currently some Lean issues about instances created by the class command, like lean4#2325 and lean4#2905. I'd like to understand what is going on here very carefully before we consider merging these PRs.

I am wondering how much of the issue is that we are not very careful about parent orderings. In the next Lean release there will be an option (set_option structure.strictResolutionOrder true) that will make it a warning if parents are not consistently ordered.

We want Distrib.toMul to have a lower priority than Semigroup.toMul or MulOneClass.toMul. I think changing the class command doesn't make it do this automatically.

Yuyang Zhao (Nov 01 2024 at 04:54):

A goal of #7873 is to search for Zero/One/Add/Mul instances first along Zero/One/Add/Mul -> ... -> Ring -> CommRing -> Field.

Yaël Dillies (Nov 01 2024 at 06:59):

Michael Stoll said:

Yaël Dillies Can you give an estimate when you are going to do (and finish) it?

Honestly no idea. I have no time theoretically dedicated to it for the next five years, that's for sure

Yuyang Zhao (Nov 01 2024 at 08:10):

#18468 does not require users to use any commands. I expect it would be less controversial than #12778.

Eric Wieser (Nov 01 2024 at 08:55):

@[instance 0] feels like a hack to make up for no "remove this instance" existing, which in turn would be a workaround for lean4#2905; is that correct?

Michael Stoll (Nov 01 2024 at 09:40):

Yaël Dillies said:

Michael Stoll said:

Yaël Dillies Can you give an estimate when you are going to do (and finish) it?

Honestly no idea. I have no time theoretically dedicated to it for the next five years, that's for sure

If you have no time explicitly planned for it for the next five years, then I think it is a safe bet to expect it will not happen in the next five years.

Yaël Dillies (Nov 01 2024 at 09:41):

Well, I have holidays :grinning: but yes

Michael Stoll (Nov 01 2024 at 09:46):

The reasoning behind #12778 is the following. The basic assumption is that we keep the situation that classes in the algebra&order hierarchy extend algebraic classes.

  • In many cases (when we are dealing with purely algebraic arguments, but ordered algebraic structures are in the environment) we do not want TC inference to go chasing for ordered algebraic structures when looking for purely algebraic ones. So we want the automatic instances created by extends to have low priority.
  • In other cases (when really dealing with order and algebra jointly), we do want TC inference to deduce the algebraic structure from the (usually given) ordered algebraic structure. So here we want the automatic instances to have high priority.

Under the overall assumption, it is clear (in my opinion) that there is no "one size (= priority) fits all" solution, and that it must be possible to switch between the low and high priority instances according to the context.
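
As an entirely hypothetical miniature of the first bullet point above: instead of relying on the default-priority parent projection that extends would generate, one can declare the order-to-algebra bridge by hand with a low priority, so that purely algebraic searches try the algebraic path first. (Toy classes, not Mathlib's hierarchy.)

universe u

class MyMonoid (M : Type u) where
  one : M

class MyOrderedMonoid (M : Type u) where
  le : M → M → Prop
  one : M

-- The order-to-algebra bridge, given a low priority so that purely algebraic
-- searches do not wander into the ordered hierarchy first.
instance (priority := 10) MyOrderedMonoid.toMyMonoid {M : Type u} [MyOrderedMonoid M] :
    MyMonoid M :=
  ⟨MyOrderedMonoid.one⟩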

Michael Stoll (Nov 01 2024 at 09:48):

Yaël Dillies said:

Well, I have holidays :grinning: but yes

Let me stress that I don't blame you for not doing it: it is a lot of work and not really helpful for your further career. The community would be very thankful, however, I assume :smile: (and I can think of more fun things to do in your holidays...)

Kevin Buzzard (Nov 01 2024 at 13:03):

Note that the current set-up really does cause continual problems in practice, see for example this from last week.

Kevin Buzzard (Nov 01 2024 at 13:04):

Yael if you want to work on it with some London people during your visit then we could give it a go. It's probably a task we can distribute at Xena, once you've started us off.

Yuyang Zhao (Nov 01 2024 at 13:26):

Yuyang Zhao said:

In my previous attempts to refactor algebraic order structures, I realized that mixin typeclasses may cause significant slowness that needs to be fixed by modifying the synthesis algorithm. In the current algorithm, IsSimpleOrder.toNontrivial (auto-generated by class docs#IsSimpleOrder) causes the entire order hierarchy to be searched when synthesizing Nontrivial.

Notice that docs#StrictOrderedSemiring implies docs#Nontrivial. In previous attempts, this caused the algebraic hierarchy and the order hierarchy to be searched in every search for docs#Nontrivial. Hopefully lean#5376 has solved this problem.

Yaël Dillies (Nov 01 2024 at 13:28):

Kevin Buzzard said:

Yael if you want to work on it with some London people during your visit then we could give it a go. It's probably a task we can distribute at Xena, once you've started us off.

Hmm, that's a good shout. We've got many hands over here

Kyle Miller (Nov 01 2024 at 14:49):

Eric Wieser said:

@[instance 0] feels like a hack to make up for no "remove this instance" existing, which on turn would be a workaround for lean4#2905; is that correct?

In the last couple weeks, in core we finally have things in place to be able to think about implementing and testing the impact of 2905. If we add hacks to mathlib, it will make it much more difficult to measure whether improvements to the class command have solved the underlying issues — I'd prefer we wait to see if any of these mitigations work before setting an army of helpers loose on the problem!

Matthew Ballard (Nov 01 2024 at 20:40):

lean#2905 is heavily dependent on the ordering of the parents. Maybe we are getting closer to an optimal(ish) choice for these @Kyle Miller ?

Kyle Miller (Nov 01 2024 at 20:42):

I've been experimenting with this today in lean4#5920 @Matthew Ballard. There are surprisingly few issues in mathlib, but I'm guessing these last ten modules are going to be the worst

Matthew Ballard (Nov 01 2024 at 20:58):

Re: decoupling order from algebra, I did get a benchmark for that a few months back (I can’t find it now). Typeclass synthesis dropped 7% but there was only a global 2% instruction drop. The penalty we paid was that some simp lemmas gained more generic keys so we tried synthesizing ordered classes more for these than before.

  1. From a pure performance perspective, it’s not as good as gating off the projections in namespaces.
  2. Ideally, with proper hygiene, we would only see bundled ordered algebra classes when we actually needed them. But this level of discipline is probably unrealistic.

Junyan Xu (Nov 10 2024 at 20:36):

#18824 is a slight speedup (0.09% build, 0.09% lint), with a larger effect when applied on top of #17930. It came out of my experiments here.

Notification Bot (Nov 10 2024 at 21:06):

2 messages were moved from this topic to #mathlib4 > lean4#2905 non-direct class parents shouldn't be instances by Kyle Miller.

Damiano Testa (Nov 10 2024 at 22:21):

I'm glad to see the bench summary script posting 0.0X%, as it weirdly makes me confident that the previous bug with printing decimals is fixed! :smile:

Yuyang Zhao (Dec 13 2024 at 13:59):

Now lean#2905 has been resolved. I wonder if we could merge #18468 (or something like #12778 that further provides scoped instances).

If we want to decouple order and algebraic hierarchies, I hope someone can take a look at #17444. The diff of this PR has become larger and larger over time.

Eric Wieser (Dec 13 2024 at 20:57):

I'd prefer we come up with a better solution than the [instance 0] stuff, but the [instance 10] stuff could be fine

Yuyang Zhao (Dec 14 2024 at 00:02):

If there is no better solution at the moment, we can consider it as an [instance 10] that will never be used.

Artie Khovanov (Jan 06 2025 at 03:36):

I like the idea of #17444
I might try to do the same thing with Strict* (this will weaken the definition of LinearOrderedSemiring and up, but that's fine because the current definition is not standard)

Johan Commelin (Jan 06 2025 at 07:33):

I think that whatever is done, it should come with documentation about how/why it was done, in such a way that it can be extended to other/future parts of the library. Both for future contributors and for library maintenance and for scalability.

Yuyang Zhao (Jan 12 2025 at 09:10):

Yuyang Zhao said:

I know about the plan to make Ordered* mixin typeclasses, and I've made efforts to do so. If anyone wants to move it forward, #17444 is one step.

The full experiment has been done in #20676.

+ build           instructions           -5.1%
+ build           typeclass inference   -16.8%
+ lint            wall-clock             -8.5%

Sébastien Gouëzel (Jan 12 2025 at 09:39):

Impressive! Can you explain in the PR text which classes have been nuked, and what is the replacement for each of them? Also, did you adjust some instances by hand, or is the above just the result of removing some classes and replacing them by mixins?

Kim Morrison (Jan 12 2025 at 10:29):

+ build                  wall-clock           -3.892 %

is also very impressive.

Kim Morrison (Jan 12 2025 at 10:42):

refactor: make CanonicallyOrdered... mixin #17444 by itself benchmarks very impressively.

Kim Morrison (Jan 12 2025 at 10:43):

I've just read through the diff of that, and I think I would be happy. It is a big change, so I would like to have some consensus we want to proceed with it, but I would really encourage everyone to have a look soon so we can get the ball rolling!

Kim Morrison (Jan 12 2025 at 10:44):

There are a couple of places where instances are being refactored where I'm tempted to suggest we fix names (i.e. make sure instances start with inst), but this is really orthogonal to the PR so I wouldn't insist.

Kim Morrison (Jan 12 2025 at 10:45):

As far as I saw, #17444 is not doing any priority tweaking, so I don't think there are any worries/objections there. It's a straightforward conversion to mixins.

Michael Rothgang (Jan 12 2025 at 10:53):

I would like to request more details in the PR description for #17444: I'm not fully sure what is done (is that just some grammar issue?), but the message doesn't mention any of the why. Cobbling it together from Zulip is not nice for others; can you add it?
That said: these speed-ups are fantastic, I would really like to see them land.

Johan Commelin (Jan 14 2025 at 13:35):

I also read through this diff, and it looks really good.
But atm it doesn't build, and the latest bench results actually have a couple of regressions, I think.

Yuyang Zhao (Jan 16 2025 at 16:50):

#20593 introduces new mixin typeclasses. In the future we may make them assume weaker algebraic typeclasses. What should they be called?

Yuyang Zhao (Jan 16 2025 at 16:51):

/poll Monoid/Semigroup
IsOrdered(Cancel)(Add)Monoid
IsOrdered(Cancel)(Add)Semigroup

Yuyang Zhao (Jan 16 2025 at 16:52):

/poll Ring/Semiring
Is(Strict)OrderedRing
Is(Strict)OrderedSemiring

Artie Khovanov (Jan 16 2025 at 19:05):

Do we not want to split out IsStrict?

Floris van Doorn (Feb 18 2025 at 19:21):

#9467 gives a performance regression (large in some files, but overall small). It is otherwise a nice mathematical change. I'd say we merge it despite the performance regression, but I'd like the quick opinion of the people here optimizing type class inference.

Michael Stoll (Feb 18 2025 at 19:24):

Any idea what causes the slow-down here?

Floris van Doorn (Feb 18 2025 at 19:25):

The only thing this PR is doing is weakening type-class assumptions. So my guess is that type-class inference takes longer because it has to traverse significantly more of the type-class hierarchy.

Floris van Doorn (Feb 18 2025 at 19:26):

Or perhaps in checking defeq of generated instances? I guess the real answer is: "no idea"

Michael Stoll (Feb 18 2025 at 19:28):

Weakening typeclass assumptions is certainly a good thing. Hopefully this will improve again when there is more decoupling between the various hierarchies...

Eric Wieser (Feb 18 2025 at 19:58):

Often slowdowns of this kind are due to a generalization impedance mismatch

Eric Wieser (Feb 18 2025 at 19:59):

That is, the files with a big slowdown will become faster when they themselves are generalized, and then the slowdown will move one step further downstream

Eric Wieser (Feb 18 2025 at 19:59):

It's hard to get a feel for the cost of the global generalization of course.

Patrick Massot (Feb 18 2025 at 20:34):

Is the generalization useful for anything?

Eric Wieser (Feb 18 2025 at 20:52):

Exponentiating NNReal would be the main motivation, which is probably convenient when working with nndist, nnnorm etc


Last updated: May 02 2025 at 03:31 UTC