Zulip Chat Archive
Stream: mathlib4
Topic: Quadratic slowdown in building mathlib
Jeremy Tan (Jan 13 2024 at 17:58):
The latest staging commit https://github.com/leanprover-community/mathlib4/actions/runs/7513702216/job/20455754088 took 48 minutes to build mathlib. At the rate we're adding new declarations to mathlib we could be exceeding 1 hour for build time alone before June this year.
Can we use leaff to cut this time down or is it time to split topic-specific packages from mathlib?
Anatole Dedecker (Jan 13 2024 at 18:15):
This is not a serious comment, but 48 minutes is awesome compared to what we had in Lean 3 :big_smile:
Jeremy Tan (Jan 13 2024 at 18:17):
Only two months ago we were building in less than 30 minutes on the Hoskinson machines. I'm referring to the sheer number of declarations in mathlib right now, not any individual file performance
Kevin Buzzard (Jan 13 2024 at 18:49):
Right but Anatole is pointing out that mathlib3 would take a couple of hours to build and the world didn't end.
Jireh Loreaux (Jan 13 2024 at 18:50):
Well, sort of. A couple hours of build time was pretty unsustainable.
Yaël Dillies (Jan 13 2024 at 18:51):
It was six hours towards the end, FYI
Jireh Loreaux (Jan 13 2024 at 18:51):
It made it very hard to do refactors low in the hierarchy that would cause lots of breakage.
Jireh Loreaux (Jan 13 2024 at 18:53):
But, as I think others are saying, we're not yet in the scenario that we need to consider splitting Mathlib.
Jireh Loreaux (Jan 13 2024 at 18:54):
There may yet be other ways of dealing with these issues. And also, note that wall clock time is not really a reliable measure.
Jireh Loreaux (Jan 13 2024 at 18:55):
@Jeremy Tan did you see Sebastian Ullrich's talk?
Jeremy Tan (Jan 13 2024 at 19:07):
I haven't (and it's 3am right now)
Mauricio Collares (Jan 13 2024 at 19:57):
build wall-clock is currently 17.5 minutes on speedcenter, so a ~3x speedup is possible with hardware upgrades (if it becomes necessary in the future)
Mauricio Collares (Jan 13 2024 at 20:01):
Honestly, it's incredible how well the 7950X3D performs. I was expecting a threadripper.
Yury G. Kudryashov (Jan 13 2024 at 21:39):
Note that you need to rebuild the whole mathlib only in the following cases:
- you upgrade a dependency;
- you change a file deep inside the import tree.
So, PRs that add new files can be compiled very quickly. I don't see how leaff
can help here.
We may want to restructure import
s to allow Lean to compile more files in parallel.
Matthew Ballard (Jan 13 2024 at 23:53):
There is also a 20% speed up waiting #mathlib4 > example (p : P) : Q := p
takes 0.25s to fail!
Kim Morrison (Jan 14 2024 at 00:16):
Regarding the topic line for this thread, is there actually evidence that quadratic terms in the build time (which certainly exist, e.g. time spent in simp
will is presumably at least quadratic in the "depth" of Mathlib) are actually starting to dominate?
One of my key takeaways from Sebastian's talk was that it mostly looks like we are still in the linear domain.
Mario Carneiro (Jan 14 2024 at 00:18):
My initial read was that the other linear factor was the fact that we run builds regularly
Mario Carneiro (Jan 14 2024 at 00:20):
Scott Morrison said:
e.g. time spent in
simp
will is presumably at least quadratic in the "depth" of Mathlib
this is one of the major differences between lean 3 and lean 4 and why it doesn't take 3-6 hours anymore: simp over the library is now O(n log n) not O(n^2)
Mario Carneiro (Jan 14 2024 at 00:21):
building environments is O(n^2) though (or rather O(n * files)), but it is very much not a dominant factor in build times
Kim Morrison (Jan 14 2024 at 00:52):
This said, there are huge opportunities to improve the import structure of Mathlib, with benefits:
- faster build times for individual targets (e.g. you want to change X to see how it effects Y, but can leave everything else to CI)
- improving parallelism of the build, so CI and benchmarking machines can continue taking advantage of more cores
- better pedagogy: currently there are way too many instances of "surely you don't need to know about A to understand B!" in Mathlib
- easier de-Mathlibify of bugs
- allowing users who can't cope with all of Mathlib (we know these exist!) to more easily extract parts they need, either by hand or future automatic tools
- making possible future decisions to split the repository (either "from the top" or "from the bottom")
- it's also just kind of fun, like a form of gardening...?
As a concrete example, @Michael Rothgang's PR #7920 two months ago split MetricSpace.Basic
. Up until that point the mere definition of a MetricSpace
happened on line number 2895(!!) of that file. Now it's the first declaration.
But there's lots more that could be done here. For example, in PseudoMetric
we have imports
import Mathlib.Topology.Algebra.Order.Compact
import Mathlib.Topology.Bornology.Constructions
neither of which are needed until after line 1000 of that file.
Lower down in the Mathlib hierarchy we often have a split between X.Defs
and X.Basic
files, where the .Defs
files attempt to do the bare minimum of theory required to set up the definitions, postponing everything else to .Basic
. Here, for example, is the import hierarchy up to Algebra.Field.Defs
:
Std does this much more systematically, although there the split is called .Basic
and .Lemmas
. As this gets done more consistently, the .Defs
files only need to depend on other .Defs
files, and we considerably reduce the import path to the actual definition --- and I think this is a really good thing!
In PseudoMetric
we could certainly do exactly this: those imports I mentioned don't need to be in the .Defs
file. But this happens essentially everywhere in Mathlib still.
Yury G. Kudryashov (Jan 14 2024 at 01:03):
I plan to split Topology.Basic
so that the definitions of TopologicalSpace
, IsClosed
, and Continuous
don't depend on, e.g., Set.Countable
(and Filter
s).
Yury G. Kudryashov (Jan 14 2024 at 01:04):
BTW, how can I generate the list of all dependencies of a given file (recursive)?
Yury G. Kudryashov (Jan 14 2024 at 01:05):
#xy: I want to see how many dependencies I add to the closure of a file if I add one import
line.
Yury G. Kudryashov (Jan 14 2024 at 01:05):
For this, I need to see the diff
of 2 closures.
Mario Carneiro (Jan 14 2024 at 01:05):
lake exe shake
does a lot of computations like that
Mario Carneiro (Jan 14 2024 at 01:06):
Yury G. Kudryashov (Jan 14 2024 at 01:07):
Even more precisely, I want to know how much I'll add to the closure of Mathlib.Topology.Algebra.MulAction
if I make it import Mathlib.GroupTheory.GroupAction.SubMulAction
.
Yury G. Kudryashov (Jan 14 2024 at 01:08):
Mario Carneiro said:
How is it related? "implement group operations on nonsingular rational points for Jacobian coordinates"
Mario Carneiro (Jan 14 2024 at 01:08):
fixed
Yury G. Kudryashov (Jan 14 2024 at 01:09):
I understand that both lake exe graph
and lake exe shake
have access to this information. How exactly do I extract it?
Mario Carneiro (Jan 14 2024 at 01:10):
you edit the code?
Yury G. Kudryashov (Jan 14 2024 at 01:11):
Do you mean the code of shake
? If yes, then could you please point me to a more specific location? Lean meta programming is not my strong side at the moment.
Mario Carneiro (Jan 14 2024 at 01:14):
well the possible space of things you might want to do is pretty open, so I don't see anything except a metaprogramming API being helpful to you
Yury G. Kudryashov (Jan 14 2024 at 01:16):
I made lake exe graph
output graphs in dot
format, then used sed
to list vertices and compared txt files with vimdiff
...
Yury G. Kudryashov (Jan 14 2024 at 01:20):
I know that this is not the lean way.
Mario Carneiro (Jan 14 2024 at 01:22):
here's a sliced and diced version of lake exe shake
which works interactively:
import Lean
open Lean
/-- We use `Nat` as a bitset for doing efficient set operations.
The bit indexes will usually be a module index. -/
abbrev Bitset := Nat
/-- The main state of the checker, containing information on all loaded modules. -/
structure State where
/-- Maps a module name to its index in the module list. -/
toIdx : HashMap Name USize := {}
/-- Maps a module index to the module name. -/
modNames : Array Name := #[]
/-- Maps a module index to the module data. -/
mods : Array ModuleData := #[]
/-- `j ∈ deps[i]` if module `j` is a direct dependency of module `i` -/
deps : Array Bitset := #[]
/-- `j ∈ transDeps[i]` is the reflexive transitive closure of `deps` -/
transDeps : Array Bitset := #[]
partial def loadModules (imports : Array Import) : StateT State IO (Bitset × Bitset) := do
let mut imps := 0
let mut transImps := 0
for imp in imports do
let s ← get
if let some i := s.toIdx.find? imp.module then
imps := imps ||| (1 <<< i.toNat)
transImps := transImps ||| s.transDeps[i]!
else
let mFile ← findOLean imp.module
unless (← mFile.pathExists) do
throw <| IO.userError s!"object file '{mFile}' of module {imp.module} does not exist"
let (mod, _) ← readModuleData mFile
let (deps, transDeps) ← loadModules mod.imports
let s ← get
let n := s.mods.size.toUSize
let transDeps := transDeps ||| (1 <<< n.toNat)
imps := imps ||| (1 <<< n.toNat)
transImps := transImps ||| transDeps
set (σ := State) {
toIdx := s.toIdx.insert imp.module n
modNames := s.modNames.push imp.module
mods := s.mods.push mod
deps := s.deps.push deps
transDeps := s.transDeps.push transDeps
}
return (imps, transImps)
#eval show IO Unit from do
initSearchPath (← findSysroot)
let root := `Mathlib
let mut (_, s) ← (loadModules #[{module := root}]).run {}
let m1 := s.toIdx.find! `Mathlib.Topology.Algebra.MulAction
let m2 := s.toIdx.find! `Mathlib.GroupTheory.GroupAction.SubMulAction
let newDeps := (s.transDeps[m1]! ||| s.transDeps[m2]!) ^^^ s.transDeps[m1]!
-- count bits in newDeps
let mut count := 0
for i in [0:s.mods.size] do
if newDeps &&& (1 <<< i) != 0 then
count := count + 1
println! count -- 5
Mario Carneiro (Jan 14 2024 at 01:23):
so the answer to your question is that there are 5 new files
Yury G. Kudryashov (Jan 14 2024 at 01:28):
Thank you!
Mario Carneiro (Jan 14 2024 at 01:54):
Here's a better version for interactive use, which reuses the module datas from the environment instead of actually rereading mathlib every time the #eval
is reevaluated:
import Mathlib -- this is important, you have to import all modules you want to ask questions about
open Lean
#eval show MetaM Unit from do
let env ← getEnv
let mut toIdx : HashMap Name USize := {}
for i in [0:env.header.moduleData.size] do
toIdx := toIdx.insert env.header.moduleNames[i]! i.toUSize
/- `j ∈ deps[i]` if module `j` is a direct dependency of module `i` -/
let mut deps : Array Nat := #[]
/- `j ∈ transDeps[i]` is the reflexive transitive closure of `deps` -/
let mut transDeps : Array Nat := #[]
for i in [0:env.header.moduleData.size] do
let mod := env.header.moduleData[i]!
let mut imps := 0
let mut transImps := 1 <<< i
for imp in mod.imports do
let i := toIdx.find! imp.module
imps := imps ||| (1 <<< i.toNat)
transImps := transImps ||| transDeps[i]!
deps := deps.push imps
transDeps := transDeps.push transImps
let m1 := toIdx.find! `Mathlib.Topology.Algebra.MulAction
let m2 := toIdx.find! `Mathlib.GroupTheory.GroupAction.SubMulAction
let newDeps := (transDeps[m1]! ||| transDeps[m2]!) ^^^ transDeps[m1]!
-- count bits in newDeps
for i in [0:env.header.moduleNames.size] do
if newDeps &&& (1 <<< i) != 0 then
println! env.header.moduleNames[i]!
-- Mathlib.Algebra.Field.IsField
-- Mathlib.Algebra.Ring.Equiv
-- Mathlib.Algebra.GroupRingAction.Basic
-- Mathlib.GroupTheory.GroupAction.Hom
-- Mathlib.GroupTheory.GroupAction.SubMulAction
Jeremy Tan (Jan 14 2024 at 04:42):
Mario Carneiro said:
My initial read was that the other linear factor was the fact that we run builds regularly
That's what I was referring to. The phrase "quadratic slowdown" was taken from a phenomenon in universal constructor cellular automata, where you've got a static tape that's read by a moving arm
Mario Carneiro (Jan 14 2024 at 04:43):
this of course assumes that we increase the size of mathlib linearly as a function of time
Mario Carneiro (Jan 14 2024 at 04:44):
which seems unlikely in the long term
Mario Carneiro (Jan 14 2024 at 04:44):
and also that we keep the rate of builds constant
Mario Carneiro (Jan 14 2024 at 04:45):
It is a natural consequence of rising build times that the build rate slows down
Yury G. Kudryashov (Jan 14 2024 at 07:03):
I expect that the growth rate will be more than linear (at least while we have lots of math out there to formalize) but we may want to semi-freeze low-level files some day (or not).
Michael Rothgang (Jan 17 2024 at 14:56):
Scott Morrison said:
This said, there are huge opportunities to improve the import structure of Mathlib, with benefits:
- faster build times for individual targets (e.g. you want to change X to see how it effects Y, but can leave everything else to CI)
- improving parallelism of the build, so CI and benchmarking machines can continue taking advantage of more cores
- better pedagogy: currently there are way too many instances of "surely you don't need to know about A to understand B!" in Mathlib
- easier de-Mathlibify of bugs
- allowing users who can't cope with all of Mathlib (we know these exist!) to more easily extract parts they need, either by hand or future automatic tools
- making possible future decisions to split the repository (either "from the top" or "from the bottom")
- it's also just kind of fun, like a form of gardening...?
As a concrete example, Michael Rothgang's PR #7920 two months ago split
MetricSpace.Basic
. Up until that point the mere definition of aMetricSpace
happened on line number 2895(!!) of that file. Now it's the first declaration.But there's lots more that could be done here. For example, in
PseudoMetric
we have importsimport Mathlib.Topology.Algebra.Order.Compact import Mathlib.Topology.Bornology.Constructions
neither of which are needed until after line 1000 of that file.
Lower down in the Mathlib hierarchy we often have a split between
X.Defs
andX.Basic
files, where the.Defs
files attempt to do the bare minimum of theory required to set up the definitions, postponing everything else to.Basic
. Here, for example, is the import hierarchy up toAlgebra.Field.Defs
:Std does this much more systematically, although there the split is called
.Basic
and.Lemmas
. As this gets done more consistently, the.Defs
files only need to depend on other.Defs
files, and we considerably reduce the import path to the actual definition --- and I think this is a really good thing!In
PseudoMetric
we could certainly do exactly this: those imports I mentioned don't need to be in the.Defs
file. But this happens essentially everywhere in Mathlib still.
FWIW, I just prototyped splitting PseudoMetric
as described, it's at branch#MR-pseudometric-split-compactness. Eyes on the split are very welcome.
Michael Rothgang (Jan 17 2024 at 15:00):
Filed as #9815
Geoffrey Irving (Jan 17 2024 at 16:40):
Is it possible to instrument building mathlib4 in such a way that it could spit out speedups if each file is split up? Roughly,
- Record how long each theorem takes, and what it depends on.
- Build the full dependency graph of individual lemmas (large, I suppose).
- Simulate building all of mathlib in parallel on a given number of threads, assuming one particular file is split entirely into one file per theorem (this is a rough upper bound on the speedup available if a given file is split).
Geoffrey Irving (Jan 17 2024 at 16:40):
Presumably (3) is fast enough (once the timing graph is cached) that you can do it for every file in mathlib.
Mario Carneiro (Jan 18 2024 at 02:37):
shake
could do that if you relax the requirement for timing data and just settle for graph properties (e.g. number of files or constants on the critical path, graph diameter, etc)
Yury G. Kudryashov (Jan 19 2024 at 06:49):
I'm splitting file#Topology/Separation into files, 1 per separation axiom.
Last updated: May 02 2025 at 03:31 UTC