Zulip Chat Archive

Stream: lean4

Topic: Lake parallel builds


Tobias Grosser (Oct 18 2023 at 08:02):

I just ran lake build on mathlib and realized that my system is 50% idle using only 8 out of 16 threads. I would be surprised if mathlib has only 5-8 parallel build opportunities. Hence, I wonder if there is a way to ask lake to use more threads?

Scott Morrison (Oct 18 2023 at 08:39):

Hmm, that is strange. I peak at much higher than 800%, and average about 1400% over the whole build.

Scott Morrison (Oct 18 2023 at 08:39):

Early and late in the mathlib build it is constrained by available jobs.

Scott Morrison (Oct 18 2023 at 08:40):

(Reducing unnecessary imports presumably has some marginal impact on this.)

Tobias Grosser (Oct 18 2023 at 09:23):

I just ran a full 'lake clean; time lake build' run and got the following output:
lake build 22191.10s user 738.37s system 1660% cpu 23:00.82 total

However, when I look at top, the system is consistently 10-50% idle. I am considering buying a high-thread-count machine and am really not sure whether that would speed up the builds.

Tobias Grosser (Oct 18 2023 at 09:24):

I would like to run lake single-threaded to see how much speedup going from 1 to 16 cores gives. Is there such an option for lake?

Jireh Loreaux (Oct 18 2023 at 15:55):

I don't recall how @Sebastian Ullrich made this chart, but the tool apparently also has a way to simulate a build with maximum parallelism.

Jireh Loreaux (Oct 18 2023 at 16:02):

This answer is here, but he doesn't recommend nixprof for outside use, at least at the time of that message, and it doesn't look like the repo has been updated much since then.

Tobias Grosser (Oct 18 2023 at 16:05):

Thank you, that is interesting.

Eric Wieser (Oct 18 2023 at 16:12):

I've often seen it be IO-limited rather than CPU-limited.

Tobias Grosser (Oct 18 2023 at 16:16):

DRAM bandwidth?

Jireh Loreaux (Oct 18 2023 at 16:24):

Eric, the IO should be small, right? Or do you mean writing oleans? I guess that could potentially be significant.

Eric Wieser (Oct 18 2023 at 16:29):

Some combination of writing oleans and paging between all the memory-mapped oleans

Eric Wieser (Oct 18 2023 at 16:30):

and realized that my system is 50% idle using only 8 out of 16 threads.

Is it possible that there's a setting somewhere like make -j8 that is restricting parallelism by default to avoid surprising the user?

Tobias Grosser (Oct 18 2023 at 16:37):

I wanted to play with the -j option, but I cannot find how to control the thread-level parallelism with lake.

Mac Malone (Oct 18 2023 at 17:29):

@Tobias Grosser There isn't a way to do this from Lake, because Lean does not provide an API-based thread control mechanism. However, there is a program-wide limiting mechanism available through LEAN_NUM_THREADS (as mentioned here)
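
For example, to get the single-threaded baseline you were asking about (a minimal sketch, assuming a POSIX shell; the variable is read by the Lean runtime, so it limits lake itself and the lean processes it spawns program-wide):

export LEAN_NUM_THREADS=1
lake clean; time lake build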

Tobias Grosser (Oct 18 2023 at 17:50):

This is what I was looking for. I'll give it a shot.

Tobias Grosser (Nov 10 2023 at 23:37):

I just redid the experiment on a 56-core, 112-thread machine with today's mathlib:

real 16m23.110s
user 253m2.379s
sys 14m39.441s

So we get about a 16x speedup on a 56-core system (253 minutes of user time against 16.4 minutes of wall-clock). With perfect scaling we should be at roughly 5 minutes (253 / 56 ≈ 4.5). I guess the issue is that lean/lake cannot yet extract sufficient parallelism from mathlib. Interesting.

Scott Morrison (Nov 11 2023 at 00:38):

In-file parallelism (coming!) will help here.

Disentangling the mathlib import tree can help as well (i.e. locate moments when relatively few files are compiling simultaneously, and then try to reduce import dependencies on those files).

It might also be possible, but only with a substantial change to Lake, to prioritize compiling the files with the most downstream dependents. This would mean that during the phases when we are CPU-bound rather than import-bound, we work on the files which will open up more tasks.
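
For illustration, a rough sketch of that prioritization (Python, with a made-up toy import graph; in practice you would feed it Mathlib's real import graph), ranking files by their number of transitive downstream dependents:

from collections import defaultdict

def downstream_counts(imports):
    # Invert the import graph: map each module to its direct importers.
    importers = defaultdict(set)
    for mod, deps in imports.items():
        for dep in deps:
            importers[dep].add(mod)
    # From each module, walk the inverted graph to collect everything
    # that transitively depends on it.
    def dependents(mod):
        seen, stack = set(), [mod]
        while stack:
            for imp in importers[stack.pop()]:
                if imp not in seen:
                    seen.add(imp)
                    stack.append(imp)
        return seen
    return {mod: len(dependents(mod)) for mod in imports}

# Toy graph: module -> modules it imports directly.
imports = {"Basic": [], "Defs": ["Basic"], "Lemmas": ["Defs"], "App": ["Defs", "Lemmas"]}
for mod, n in sorted(downstream_counts(imports).items(), key=lambda kv: -kv[1]):
    print(mod, n)  # Basic 3, Defs 2, Lemmas 1, App 0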

Scott Morrison (Nov 11 2023 at 00:39):

I think an automatic tool that showed where the import graph bottlenecks were would be great.

Scott Morrison (Nov 11 2023 at 00:40):

I'm not really sure what the spec of such a tool would be.

Scott Morrison (Nov 11 2023 at 00:41):

Or whether it should try to run "live", or if we should record the build times of each file and then do an offline analysis assuming infinite parallelism.

Scott Morrison (Nov 11 2023 at 00:42):

Even just producing a flame graph, and looking for the files that "finish at the bottom of a trough" would be a great start.

Scott Morrison (Nov 11 2023 at 00:43):

(i.e. files which, when finished, allow more cores to start working)

Scott Morrison (Nov 11 2023 at 00:43):

Then you would try to move those files earlier / split them / reduce dependencies on them.

Scott Morrison (Nov 11 2023 at 04:53):

One place that anecdata suggests as a bottleneck is Mathlib.Analysis.Calculus.ContDiff. I often see a build get stuck there, and immediately afterwards there are lots of things that can be compiled.

Scott Morrison (Nov 11 2023 at 04:54):

It is almost 3000 lines long.

Scott Morrison (Nov 11 2023 at 04:54):

And the import Mathlib.Analysis.NormedSpace.FiniteDimension is not needed until almost 2000 lines in, so I think there is an obvious split there.

Tobias Grosser (Nov 11 2023 at 23:37):

Scott Morrison said:

In-file parallelism (coming!) will help here.

I guess that's a good piece of the answer. Happy to beta-test.

Henrik Böving (Nov 12 2023 at 00:24):

Scott Morrison said:

Even just producing a flame graph, and looking for the files that "finish at the bottom of a trough" would be a great start.

A flame graph isn't really the right thing here; I believe we want a Gantt chart. Although I don't think we can do this nicely outside of lake? It would probably have to be part of lake itself.

Eric Wieser (Nov 12 2023 at 01:01):

If we just want to do longest-pole analysis, the data we already have on the speed center, combined with a dependency graph, should do the trick.
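
As a sketch of what that analysis looks like (illustrative Python, not existing tooling; the per-file costs and import graph here are made up):

from functools import lru_cache

# Toy inputs: per-file build cost (e.g. instructions from the speed
# center) and the direct-import graph.
cost = {"Basic": 5, "Defs": 3, "Lemmas": 7, "App": 2}
imports = {"Basic": [], "Defs": ["Basic"], "Lemmas": ["Basic"], "App": ["Defs", "Lemmas"]}

@lru_cache(maxsize=None)
def finish(mod):
    # With unlimited cores a file starts as soon as its slowest
    # dependency chain finishes, so its finish time is the cost of the
    # most expensive chain ending at it.
    return max((finish(dep) for dep in imports[mod]), default=0) + cost[mod]

# Walk the critical path ("longest pole") from the most expensive
# target back down to a leaf.
mod = max(imports, key=finish)
while True:
    print(mod, finish(mod))
    if not imports[mod]:
        break
    mod = max(imports[mod], key=finish)
# prints: App 14, Lemmas 12, Basic 5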

Scott Morrison (Nov 12 2023 at 02:11):

I've previously (in mathlib3 days) done longest-pole analysis, but I did it in Mathematica.

Scott Morrison (Nov 12 2023 at 02:16):

Is there a way to get all of the file build times for a commit from speedcentre without scraping pages?

Scott Morrison (Nov 12 2023 at 02:20):

Also, the speed center only reports per-file instruction counts, rather than wall-clock time. But perhaps instruction count is a good enough proxy?

Scott Morrison (Nov 12 2023 at 02:22):

http://speed.lean-fro.org/mathlib4/api/run/137c79fb-447e-4e1f-b773-df3a1d85c5af?all_values=true seems to do it, although you still need to do the commit sha --> velcom id lookup.
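
For scripting, something like the following should work (untested sketch using the requests library; I make no promises about the JSON shape):

import requests

run_id = "137c79fb-447e-4e1f-b773-df3a1d85c5af"  # velcom run id from the URL above
resp = requests.get(f"http://speed.lean-fro.org/mathlib4/api/run/{run_id}",
                    params={"all_values": "true"})
resp.raise_for_status()
data = resp.json()
print(list(data))  # inspect the top-level structure to find the per-file values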

Scott Morrison (Nov 12 2023 at 03:51):

I couldn't resist:

% lake exe pole
Mathlib 8581
Mathlib.AlgebraicGeometry.Morphisms.FiniteType 8567
Mathlib.AlgebraicGeometry.Morphisms.RingHomProperties 8557
Mathlib.AlgebraicGeometry.Morphisms.Basic 6582
Mathlib.AlgebraicGeometry.Pullbacks 6497
Mathlib.AlgebraicGeometry.AffineScheme 6350
Mathlib.AlgebraicGeometry.Restrict 5438
Mathlib.AlgebraicGeometry.OpenImmersion 5182
Mathlib.AlgebraicGeometry.Scheme 5106
Mathlib.AlgebraicGeometry.Spec 5067
Mathlib.AlgebraicGeometry.StructureSheaf 4936
Mathlib.AlgebraicGeometry.PrimeSpectrum.Basic 4689
Mathlib.RingTheory.Ideal.Over 4645
Mathlib.RingTheory.Localization.Integral 4578
Mathlib.RingTheory.Algebraic 4517
Mathlib.RingTheory.IntegralClosure 4445
Mathlib.LinearAlgebra.Matrix.Charpoly.LinearMap 4269
Mathlib.LinearAlgebra.Matrix.Charpoly.Coeff 4228
Mathlib.LinearAlgebra.Matrix.Charpoly.Basic 4133
Mathlib.RingTheory.PolynomialAlgebra 4112
Mathlib.RingTheory.MatrixAlgebra 4041
Mathlib.RingTheory.TensorProduct 3990
Mathlib.LinearAlgebra.FiniteDimensional 3680
Mathlib.LinearAlgebra.FreeModule.Finite.Rank 3535
Mathlib.LinearAlgebra.Finrank 3509
Mathlib.LinearAlgebra.Dimension 3458
Mathlib.LinearAlgebra.InvariantBasisNumber 3320
Mathlib.RingTheory.PrincipalIdealDomain 3287
Mathlib.RingTheory.UniqueFactorizationDomain 3258
Mathlib.RingTheory.Noetherian 3033
Mathlib.RingTheory.Nilpotent 2902
Mathlib.LinearAlgebra.Matrix.ToLin 2868
Mathlib.Algebra.Algebra.Subalgebra.Tower 2533
Mathlib.Algebra.Algebra.Subalgebra.Basic 2512
Mathlib.RingTheory.Ideal.Operations 2331
Mathlib.LinearAlgebra.Basis.Bilinear 2144
Mathlib.LinearAlgebra.Basis 2128
Mathlib.LinearAlgebra.LinearIndependent 1980
Mathlib.LinearAlgebra.Prod 1853
Mathlib.Algebra.Algebra.Prod 1762
Mathlib.Algebra.Algebra.Hom 1751
Mathlib.Algebra.Algebra.Basic 1685
Mathlib.LinearAlgebra.Basic 1612
Mathlib.Data.Finsupp.Basic 1425
Mathlib.Algebra.BigOperators.Finsupp 1347
Mathlib.Algebra.BigOperators.Fin 1277
Mathlib.Data.Fintype.Fin 1243
Mathlib.Data.Fin.Interval 1238
Mathlib.Data.Nat.Interval 1228
Mathlib.Data.Finset.LocallyFinite 1218
Mathlib.Order.LocallyFinite 1182
Mathlib.Data.Finset.Preimage 1148
Mathlib.Algebra.BigOperators.Basic 1142
Mathlib.Data.Finset.Powerset 1039
Mathlib.Data.Finset.Lattice 1028
Mathlib.Data.Finset.Prod 963
Mathlib.Data.Finset.Card 944
Mathlib.Data.Finset.Image 928
Mathlib.Data.Finset.Basic 904
Mathlib.Data.Multiset.FinsetOps 824
Mathlib.Data.Multiset.Dedup 817
Mathlib.Data.Multiset.Nodup 812
Mathlib.Data.Multiset.Bind 806
Mathlib.Algebra.BigOperators.Multiset.Basic 792
Mathlib.Data.Multiset.Basic 769
Mathlib.Data.List.Perm 699
Mathlib.Data.List.Dedup 660
Mathlib.Data.List.Nodup 654
Mathlib.Data.List.Lattice 642
Mathlib.Data.List.Count 635
Mathlib.Data.List.BigOperators.Basic 630
Mathlib.Data.List.Forall2 602
Mathlib.Data.List.Infix 587
Mathlib.Data.List.Basic 578
Mathlib.Data.Nat.Order.Basic 506
Mathlib.Algebra.Order.Ring.Canonical 491
Mathlib.Algebra.Order.Ring.Defs 484
Mathlib.Algebra.Order.Group.Defs 441
Mathlib.Order.Hom.Basic 408
Mathlib.Order.WithBot 384
Mathlib.Order.BoundedOrder 362
Mathlib.Data.Option.Basic 352

Scott Morrison (Nov 12 2023 at 03:55):

This displays the cumulative instructions (in billions) assuming infinite parallelism.

Scott Morrison (Nov 12 2023 at 03:56):

You can also use --to, for the longest pole to your favourite target:

Scott Morrison (Nov 12 2023 at 03:56):

% lake exe pole --to Mathlib.Analysis.Calculus.ContDiff.Basic
Mathlib.Analysis.Calculus.ContDiff.Basic 5850
Mathlib.Analysis.Calculus.ContDiff.Defs 5418
Mathlib.Analysis.Calculus.FDeriv.Mul 5029
Mathlib.Analysis.Calculus.FDeriv.Bilinear 4715
Mathlib.Analysis.Calculus.FDeriv.Prod 4678
Mathlib.Analysis.Calculus.FDeriv.Linear 4594
Mathlib.Analysis.NormedSpace.BoundedLinearMaps 4579
Mathlib.Analysis.NormedSpace.Multilinear 4491
Mathlib.Analysis.NormedSpace.OperatorNorm 3793
Mathlib.Analysis.LocallyConvex.WithSeminorms 3105
Mathlib.Topology.Algebra.Equicontinuity 2953
Mathlib.Topology.Algebra.UniformConvergence 2941
Mathlib.Analysis.LocallyConvex.Bounded 2909
Mathlib.Analysis.Seminorm 2877
Mathlib.Analysis.LocallyConvex.Basic 2682
Mathlib.Analysis.NormedSpace.Basic 2646
Mathlib.Analysis.Normed.MulAction 2600
Mathlib.Analysis.Normed.Field.Basic 2584
Mathlib.Algebra.Algebra.Subalgebra.Basic 2512
Mathlib.RingTheory.Ideal.Operations 2331
Mathlib.LinearAlgebra.Basis.Bilinear 2144
Mathlib.LinearAlgebra.Basis 2128
Mathlib.LinearAlgebra.LinearIndependent 1980
Mathlib.LinearAlgebra.Prod 1853
Mathlib.Algebra.Algebra.Prod 1762
Mathlib.Algebra.Algebra.Hom 1751
Mathlib.Algebra.Algebra.Basic 1685
Mathlib.LinearAlgebra.Basic 1612
Mathlib.Data.Finsupp.Basic 1425
Mathlib.Algebra.BigOperators.Finsupp 1347
Mathlib.Algebra.BigOperators.Fin 1277
Mathlib.Data.Fintype.Fin 1243
Mathlib.Data.Fin.Interval 1238
Mathlib.Data.Nat.Interval 1228
Mathlib.Data.Finset.LocallyFinite 1218
Mathlib.Order.LocallyFinite 1182
Mathlib.Data.Finset.Preimage 1148
Mathlib.Algebra.BigOperators.Basic 1142
Mathlib.Data.Finset.Powerset 1039
Mathlib.Data.Finset.Lattice 1028
Mathlib.Data.Finset.Prod 963
Mathlib.Data.Finset.Card 944
Mathlib.Data.Finset.Image 928
Mathlib.Data.Finset.Basic 904
Mathlib.Data.Multiset.FinsetOps 824
Mathlib.Data.Multiset.Dedup 817
Mathlib.Data.Multiset.Nodup 812
Mathlib.Data.Multiset.Bind 806
Mathlib.Algebra.BigOperators.Multiset.Basic 792
Mathlib.Data.Multiset.Basic 769
Mathlib.Data.List.Perm 699
Mathlib.Data.List.Dedup 660
Mathlib.Data.List.Nodup 654
Mathlib.Data.List.Lattice 642
Mathlib.Data.List.Count 635
Mathlib.Data.List.BigOperators.Basic 630
Mathlib.Data.List.Forall2 602
Mathlib.Data.List.Infix 587
Mathlib.Data.List.Basic 578
Mathlib.Data.Nat.Order.Basic 506
Mathlib.Algebra.Order.Ring.Canonical 491
Mathlib.Algebra.Order.Ring.Defs 484
Mathlib.Algebra.Order.Group.Defs 441
Mathlib.Order.Hom.Basic 408
Mathlib.Order.WithBot 384
Mathlib.Order.BoundedOrder 362
Mathlib.Data.Option.Basic 352

Scott Morrison (Nov 12 2023 at 04:20):

#8361

Mauricio Collares (Nov 12 2023 at 09:16):

The second file is Mathlib.AlgebraicGeometry.Morphisms.RingHomProperties, which is my cue to ask for a review (or feedback) on #8299

Tobias Grosser (Nov 12 2023 at 11:50):

Lovely. I wonder if we could also print the 'sequential' performance, just adding up all the numbers, and, for convenience, the maximal potential speedup.

Scott Morrison (Nov 13 2023 at 00:27):

@Tobias Grosser, I've updated the output format. Rather than printing the sequential performance separately, I thought it was more useful to print the speedup factor. The current output looks like:

file                                                    | instructions | parallelism
------------------------------------------------------- | ------------ | -----------
Mathlib                                                 | 7377447      | x18.21
Mathlib.NumberTheory.NumberField.Units                  | 7363973      | x7.83
Mathlib.NumberTheory.NumberField.CanonicalEmbedding     | 7020425      | x7.40
Mathlib.NumberTheory.NumberField.Embeddings             | 6840733      | x7.50
Mathlib.Analysis.Complex.Polynomial                     | 6772681      | x7.11
Mathlib.Analysis.Complex.Liouville                      | 6763172      | x6.68
Mathlib.Analysis.Complex.CauchyIntegral                 | 6724325      | x6.71
Mathlib.MeasureTheory.Measure.Lebesgue.Complex          | 6673273      | x6.60
Mathlib.MeasureTheory.Measure.Haar.InnerProductSpace    | 6648245      | x6.62
Mathlib.Analysis.SpecialFunctions.Integrals             | 6620833      | x6.57
Mathlib.MeasureTheory.Integral.FundThmCalculus          | 6460684      | x6.36

Tobias Grosser (Nov 13 2023 at 04:45):

That's super interesting. So no need to run on more than 16 cores.

Tobias Grosser (Nov 13 2023 at 05:02):

I left a couple of comments.

Tobias Grosser (Nov 13 2023 at 05:03):

I am also a little surprised about the following output:

~/Projects/mathlib4$ lake exe pole | wc
    121     605    9572
~/Projects/mathlib4$ find Mathlib | wc
   4166    4166  165631

Tobias Grosser (Nov 13 2023 at 05:04):

Why do I get so many fewer modules than there are files in Mathlib? Are there really just 121 modules (after filtering leaf modules)?

Tobias Grosser (Nov 13 2023 at 05:20):

Very interesting, overall.

Tobias Grosser (Nov 13 2023 at 05:23):

I guess the next question would be if mathlib could be structured in a way that exposes more file-level parallelism or if it is inherently coupled such that higher levels of file-level parallelism are infeasible.

Scott Morrison (Nov 13 2023 at 05:57):

Tobias Grosser said:

Why do I get so many fewer modules than there are files in Mathlib? Are there really just 121 modules (after filtering leaf modules)?

lake exe pole is only reporting the files that occur on the longest pole in Mathlib.

Scott Morrison (Nov 13 2023 at 05:58):

Tobias Grosser said:

That's super interesting. So no need to run on more than 16 cores.

Also, I think this is false. The 18x parallelism reported for Mathlib is the ratio of the total compile time for all files (i.e. the 1-core compile time) to the infinite-parallelism compile time. That is: with infinitely many cores, during a build you will on average be using 18 of them.

Scott Morrison (Nov 13 2023 at 05:59):

We know that at the beginning and end of the build you use far fewer: hence during the middle you will be able to use more than 18 cores!

Scott Morrison (Nov 13 2023 at 05:59):

To actually find the maximum number of useful cores we would need to construct the full Gantt diagram. That is definitely doable using the code in pole. In fact it's probably only another 10-20 lines of code to do this! (hint, hint :-)
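
Roughly this (a Python sketch of the idea, not pole code; same toy inputs as the longest-pole sketch above):

cost = {"Basic": 5, "Defs": 3, "Lemmas": 7, "App": 2}
imports = {"Basic": [], "Defs": ["Basic"], "Lemmas": ["Basic"], "App": ["Defs", "Lemmas"]}

# Earliest-start schedule under infinite parallelism.
finish = {}
def sched(mod):
    if mod not in finish:
        finish[mod] = max((sched(dep) for dep in imports[mod]), default=0) + cost[mod]
    return finish[mod]
for mod in imports:
    sched(mod)

# Sweep the timeline: at each start/finish boundary, count how many
# files are in flight. The maximum is the largest useful core count.
events = sorted({finish[m] - cost[m] for m in finish} | set(finish.values()))
profile = [sum(1 for m in finish if finish[m] - cost[m] <= t < finish[m]) for t in events[:-1]]
print(max(profile))  # 2 for the toy graph: Defs and Lemmas overlap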

Scott Morrison (Nov 13 2023 at 06:01):

Tobias Grosser said:

I guess the next question would be if mathlib could be structured in a way that exposes more file-level parallelism or if it is inherently coupled such that higher levels of file-level parallelism are infeasible.

I did look at the new longest pole after Andrew Yang's recent fix to algebraic geometry. It actually looks very plausible to my mathematical eye: you really do need (or at least want) to use those results to prove those things! Once you get down into the weeds of finiteness there are probably some more efficiencies to eke out, but not a huge amount.

Tobias Grosser (Nov 13 2023 at 06:23):

Scott Morrison said:

Tobias Grosser said:

That's super interesting. So no need to run on more than 16 cores.

Also, I think this is false. The 18x parallelism reported for Mathlib is the ratio of the total compile time for all files (i.e. the 1-core compile time) to the infinite-parallelism compile time. That is: with infinitely many cores, during a build you will on average be using 18 of them.

I was approximating a little there, but given Amdahl's law this means that the low-parallelism parts dominate, so even if we can sometimes be faster, it won't really change the overall trend too much.
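
As a back-of-the-envelope illustration (treating the 18.2x infinite-parallelism figure from pole as the Amdahl asymptote, which oversimplifies the real dependency structure):

p = 1 - 1 / 18.2            # parallelizable fraction implied by the 18.2x asymptote
for n in (16, 56, 112):
    print(n, round(1 / ((1 - p) + p / n), 1))
# roughly: 16 cores -> 8.8x, 56 -> 13.9x, 112 -> 15.8x

So under this crude model going past 16 cores still helps, just with rapidly diminishing returns.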

Tobias Grosser (Nov 13 2023 at 06:24):

I am still a little flabbergasted that there is so little parallelism at the file level. Unfortunately, I have paper deadlines coming up, so I cannot really dig deep, but I feel there is enough on the table that I can explore this after the deadlines.

Tobias Grosser (Nov 13 2023 at 06:24):

I think 'lake exe pole' is worth merging.

Tobias Grosser (Nov 13 2023 at 06:25):

Scott Morrison said:

In-file parallelism (coming!) will help here.

Is there an ETA for this?

Tobias Grosser (Nov 13 2023 at 06:30):

And are there already people committed to it, or is this up for grabs?

Scott Morrison (Nov 13 2023 at 06:58):

This is something that Sebastian is working on, and it doesn't really make sense for anyone but him to be doing it. (There will be lots of details about this in a coming-soon roadmap.)

Tobias Grosser (Nov 13 2023 at 06:59):

Perfect. Then waiting is likely the right option.

Tobias Grosser (Nov 13 2023 at 06:59):

@Sebastian Ullrich, happy to help measure things. Depending on how parallel things are, we can also try running them multi-node.


Last updated: Dec 20 2023 at 11:08 UTC