Zulip Chat Archive

Stream: mathlib4

Topic: Quadratic slowdown in building mathlib


Jeremy Tan (Jan 13 2024 at 17:58):

The latest staging commit https://github.com/leanprover-community/mathlib4/actions/runs/7513702216/job/20455754088 took 48 minutes to build mathlib. At the rate we're adding new declarations to mathlib we could be exceeding 1 hour for build time alone before June this year.

Can we use leaff to cut this time down or is it time to split topic-specific packages from mathlib?

Anatole Dedecker (Jan 13 2024 at 18:15):

This is not a serious comment, but 48 minutes is awesome compared to what we had in Lean 3 :big_smile:

Jeremy Tan (Jan 13 2024 at 18:17):

Only two months ago we were building in less than 30 minutes on the Hoskinson machines. I'm referring to the sheer number of declarations in mathlib right now, not any individual file performance

Kevin Buzzard (Jan 13 2024 at 18:49):

Right but Anatole is pointing out that mathlib3 would take a couple of hours to build and the world didn't end.

Jireh Loreaux (Jan 13 2024 at 18:50):

Well, sort of. A couple hours of build time was pretty unsustainable.

Yaël Dillies (Jan 13 2024 at 18:51):

It was six hours towards the end, FYI

Jireh Loreaux (Jan 13 2024 at 18:51):

It made it very hard to do refactors low in the hierarchy that would cause lots of breakage.

Jireh Loreaux (Jan 13 2024 at 18:53):

But, as I think others are saying, we're not yet in the scenario that we need to consider splitting Mathlib.

Jireh Loreaux (Jan 13 2024 at 18:54):

There may yet be other ways of dealing with these issues. And also, note that wall clock time is not really a reliable measure.

Jireh Loreaux (Jan 13 2024 at 18:55):

@Jeremy Tan did you see Sebastian Ullrich's talk?

Jeremy Tan (Jan 13 2024 at 19:07):

I haven't (and it's 3am right now)

Mauricio Collares (Jan 13 2024 at 19:57):

build wall-clock is currently 17.5 minutes on speedcenter, so a ~3x speedup is possible with hardware upgrades (if it becomes necessary in the future)

Mauricio Collares (Jan 13 2024 at 20:01):

Honestly, it's incredible how well the 7950X3D performs. I was expecting a threadripper.

Yury G. Kudryashov (Jan 13 2024 at 21:39):

Note that you need to rebuild the whole mathlib only in the following cases:

  • you upgrade a dependency;
  • you change a file deep inside the import tree.

So, PRs that add new files can be compiled very quickly. I don't see how leaff can help here.
We may want to restructure imports to allow Lean to compile more files in parallel.

Matthew Ballard (Jan 13 2024 at 23:53):

There is also a 20% speed up waiting #mathlib4 > example (p : P) : Q := p takes 0.25s to fail!

Kim Morrison (Jan 14 2024 at 00:16):

Regarding the topic line for this thread, is there actually evidence that quadratic terms in the build time (which certainly exist, e.g. time spent in simp will is presumably at least quadratic in the "depth" of Mathlib) are actually starting to dominate?

One of my key takeaways from Sebastian's talk was that it mostly looks like we are still in the linear domain.

Mario Carneiro (Jan 14 2024 at 00:18):

My initial read was that the other linear factor was the fact that we run builds regularly

Mario Carneiro (Jan 14 2024 at 00:20):

Scott Morrison said:

e.g. time spent in simp will is presumably at least quadratic in the "depth" of Mathlib

this is one of the major differences between lean 3 and lean 4 and why it doesn't take 3-6 hours anymore: simp over the library is now O(n log n) not O(n^2)

Mario Carneiro (Jan 14 2024 at 00:21):

building environments is O(n^2) though (or rather O(n * files)), but it is very much not a dominant factor in build times

Kim Morrison (Jan 14 2024 at 00:52):

This said, there are huge opportunities to improve the import structure of Mathlib, with benefits:

  • faster build times for individual targets (e.g. you want to change X to see how it effects Y, but can leave everything else to CI)
  • improving parallelism of the build, so CI and benchmarking machines can continue taking advantage of more cores
  • better pedagogy: currently there are way too many instances of "surely you don't need to know about A to understand B!" in Mathlib
  • easier de-Mathlibify of bugs
  • allowing users who can't cope with all of Mathlib (we know these exist!) to more easily extract parts they need, either by hand or future automatic tools
  • making possible future decisions to split the repository (either "from the top" or "from the bottom")
  • it's also just kind of fun, like a form of gardening...?

As a concrete example, @Michael Rothgang's PR #7920 two months ago split MetricSpace.Basic. Up until that point the mere definition of a MetricSpace happened on line number 2895(!!) of that file. Now it's the first declaration.

But there's lots more that could be done here. For example, in PseudoMetric we have imports

import Mathlib.Topology.Algebra.Order.Compact
import Mathlib.Topology.Bornology.Constructions

neither of which are needed until after line 1000 of that file.

Lower down in the Mathlib hierarchy we often have a split between X.Defs and X.Basic files, where the .Defs files attempt to do the bare minimum of theory required to set up the definitions, postponing everything else to .Basic. Here, for example, is the import hierarchy up to Algebra.Field.Defs:

o.png

Std does this much more systematically, although there the split is called .Basic and .Lemmas. As this gets done more consistently, the .Defs files only need to depend on other .Defs files, and we considerably reduce the import path to the actual definition --- and I think this is a really good thing!

In PseudoMetric we could certainly do exactly this: those imports I mentioned don't need to be in the .Defs file. But this happens essentially everywhere in Mathlib still.

Yury G. Kudryashov (Jan 14 2024 at 01:03):

I plan to split Topology.Basic so that the definitions of TopologicalSpace, IsClosed, and Continuous don't depend on, e.g., Set.Countable (and Filters).

Yury G. Kudryashov (Jan 14 2024 at 01:04):

BTW, how can I generate the list of all dependencies of a given file (recursive)?

Yury G. Kudryashov (Jan 14 2024 at 01:05):

#xy: I want to see how many dependencies I add to the closure of a file if I add one import line.

Yury G. Kudryashov (Jan 14 2024 at 01:05):

For this, I need to see the diff of 2 closures.

Mario Carneiro (Jan 14 2024 at 01:05):

lake exe shake does a lot of computations like that

Mario Carneiro (Jan 14 2024 at 01:06):

#9346

Yury G. Kudryashov (Jan 14 2024 at 01:07):

Even more precisely, I want to know how much I'll add to the closure of Mathlib.Topology.Algebra.MulAction if I make it import Mathlib.GroupTheory.GroupAction.SubMulAction.

Yury G. Kudryashov (Jan 14 2024 at 01:08):

Mario Carneiro said:

#9436

How is it related? "implement group operations on nonsingular rational points for Jacobian coordinates"

Mario Carneiro (Jan 14 2024 at 01:08):

fixed

Yury G. Kudryashov (Jan 14 2024 at 01:09):

I understand that both lake exe graph and lake exe shake have access to this information. How exactly do I extract it?

Mario Carneiro (Jan 14 2024 at 01:10):

you edit the code?

Yury G. Kudryashov (Jan 14 2024 at 01:11):

Do you mean the code of shake? If yes, then could you please point me to a more specific location? Lean meta programming is not my strong side at the moment.

Mario Carneiro (Jan 14 2024 at 01:14):

well the possible space of things you might want to do is pretty open, so I don't see anything except a metaprogramming API being helpful to you

Yury G. Kudryashov (Jan 14 2024 at 01:16):

I made lake exe graph output graphs in dot format, then used sed to list vertices and compared txt files with vimdiff...

Yury G. Kudryashov (Jan 14 2024 at 01:20):

I know that this is not the lean way.

Mario Carneiro (Jan 14 2024 at 01:22):

here's a sliced and diced version of lake exe shake which works interactively:

import Lean
open Lean

/-- We use `Nat` as a bitset for doing efficient set operations.
The bit indexes will usually be a module index. -/
abbrev Bitset := Nat

/-- The main state of the checker, containing information on all loaded modules. -/
structure State where
  /-- Maps a module name to its index in the module list. -/
  toIdx : HashMap Name USize := {}
  /-- Maps a module index to the module name. -/
  modNames : Array Name := #[]
  /-- Maps a module index to the module data. -/
  mods : Array ModuleData := #[]
  /-- `j ∈ deps[i]` if module `j` is a direct dependency of module `i` -/
  deps : Array Bitset := #[]
  /-- `j ∈ transDeps[i]` is the reflexive transitive closure of `deps` -/
  transDeps : Array Bitset := #[]

partial def loadModules (imports : Array Import) : StateT State IO (Bitset × Bitset) := do
  let mut imps := 0
  let mut transImps := 0
  for imp in imports do
    let s  get
    if let some i := s.toIdx.find? imp.module then
      imps := imps ||| (1 <<< i.toNat)
      transImps := transImps ||| s.transDeps[i]!
    else
      let mFile  findOLean imp.module
      unless ( mFile.pathExists) do
        throw <| IO.userError s!"object file '{mFile}' of module {imp.module} does not exist"
      let (mod, _)  readModuleData mFile
      let (deps, transDeps)  loadModules mod.imports
      let s  get
      let n := s.mods.size.toUSize
      let transDeps := transDeps ||| (1 <<< n.toNat)
      imps := imps ||| (1 <<< n.toNat)
      transImps := transImps ||| transDeps
      set (σ := State) {
        toIdx := s.toIdx.insert imp.module n
        modNames := s.modNames.push imp.module
        mods := s.mods.push mod
        deps := s.deps.push deps
        transDeps := s.transDeps.push transDeps
      }
  return (imps, transImps)

#eval show IO Unit from do
  initSearchPath ( findSysroot)
  let root := `Mathlib
  let mut (_, s)  (loadModules #[{module := root}]).run {}

  let m1 := s.toIdx.find! `Mathlib.Topology.Algebra.MulAction
  let m2 := s.toIdx.find! `Mathlib.GroupTheory.GroupAction.SubMulAction
  let newDeps := (s.transDeps[m1]! ||| s.transDeps[m2]!) ^^^ s.transDeps[m1]!
  -- count bits in newDeps
  let mut count := 0
  for i in [0:s.mods.size] do
    if newDeps &&& (1 <<< i) != 0 then
      count := count + 1
  println! count -- 5

Mario Carneiro (Jan 14 2024 at 01:23):

so the answer to your question is that there are 5 new files

Yury G. Kudryashov (Jan 14 2024 at 01:28):

Thank you!

Mario Carneiro (Jan 14 2024 at 01:54):

Here's a better version for interactive use, which reuses the module datas from the environment instead of actually rereading mathlib every time the #eval is reevaluated:

import Mathlib -- this is important, you have to import all modules you want to ask questions about
open Lean

#eval show MetaM Unit from do
  let env  getEnv
  let mut toIdx : HashMap Name USize := {}
  for i in [0:env.header.moduleData.size] do
    toIdx := toIdx.insert env.header.moduleNames[i]! i.toUSize
  /- `j ∈ deps[i]` if module `j` is a direct dependency of module `i` -/
  let mut deps : Array Nat := #[]
  /- `j ∈ transDeps[i]` is the reflexive transitive closure of `deps` -/
  let mut transDeps : Array Nat := #[]
  for i in [0:env.header.moduleData.size] do
    let mod := env.header.moduleData[i]!
    let mut imps := 0
    let mut transImps := 1 <<< i
    for imp in mod.imports do
      let i := toIdx.find! imp.module
      imps := imps ||| (1 <<< i.toNat)
      transImps := transImps ||| transDeps[i]!
    deps := deps.push imps
    transDeps := transDeps.push transImps

  let m1 := toIdx.find! `Mathlib.Topology.Algebra.MulAction
  let m2 := toIdx.find! `Mathlib.GroupTheory.GroupAction.SubMulAction
  let newDeps := (transDeps[m1]! ||| transDeps[m2]!) ^^^ transDeps[m1]!
  -- count bits in newDeps
  for i in [0:env.header.moduleNames.size] do
    if newDeps &&& (1 <<< i) != 0 then
      println! env.header.moduleNames[i]!

-- Mathlib.Algebra.Field.IsField
-- Mathlib.Algebra.Ring.Equiv
-- Mathlib.Algebra.GroupRingAction.Basic
-- Mathlib.GroupTheory.GroupAction.Hom
-- Mathlib.GroupTheory.GroupAction.SubMulAction

Jeremy Tan (Jan 14 2024 at 04:42):

Mario Carneiro said:

My initial read was that the other linear factor was the fact that we run builds regularly

That's what I was referring to. The phrase "quadratic slowdown" was taken from a phenomenon in universal constructor cellular automata, where you've got a static tape that's read by a moving arm

Mario Carneiro (Jan 14 2024 at 04:43):

this of course assumes that we increase the size of mathlib linearly as a function of time

Mario Carneiro (Jan 14 2024 at 04:44):

which seems unlikely in the long term

Mario Carneiro (Jan 14 2024 at 04:44):

and also that we keep the rate of builds constant

Mario Carneiro (Jan 14 2024 at 04:45):

It is a natural consequence of rising build times that the build rate slows down

Yury G. Kudryashov (Jan 14 2024 at 07:03):

I expect that the growth rate will be more than linear (at least while we have lots of math out there to formalize) but we may want to semi-freeze low-level files some day (or not).

Michael Rothgang (Jan 17 2024 at 14:56):

Scott Morrison said:

This said, there are huge opportunities to improve the import structure of Mathlib, with benefits:

  • faster build times for individual targets (e.g. you want to change X to see how it effects Y, but can leave everything else to CI)
  • improving parallelism of the build, so CI and benchmarking machines can continue taking advantage of more cores
  • better pedagogy: currently there are way too many instances of "surely you don't need to know about A to understand B!" in Mathlib
  • easier de-Mathlibify of bugs
  • allowing users who can't cope with all of Mathlib (we know these exist!) to more easily extract parts they need, either by hand or future automatic tools
  • making possible future decisions to split the repository (either "from the top" or "from the bottom")
  • it's also just kind of fun, like a form of gardening...?

As a concrete example, Michael Rothgang's PR #7920 two months ago split MetricSpace.Basic. Up until that point the mere definition of a MetricSpace happened on line number 2895(!!) of that file. Now it's the first declaration.

But there's lots more that could be done here. For example, in PseudoMetric we have imports

import Mathlib.Topology.Algebra.Order.Compact
import Mathlib.Topology.Bornology.Constructions

neither of which are needed until after line 1000 of that file.

Lower down in the Mathlib hierarchy we often have a split between X.Defs and X.Basic files, where the .Defs files attempt to do the bare minimum of theory required to set up the definitions, postponing everything else to .Basic. Here, for example, is the import hierarchy up to Algebra.Field.Defs:

o.png

Std does this much more systematically, although there the split is called .Basic and .Lemmas. As this gets done more consistently, the .Defs files only need to depend on other .Defs files, and we considerably reduce the import path to the actual definition --- and I think this is a really good thing!

In PseudoMetric we could certainly do exactly this: those imports I mentioned don't need to be in the .Defs file. But this happens essentially everywhere in Mathlib still.

FWIW, I just prototyped splitting PseudoMetric as described, it's at branch#MR-pseudometric-split-compactness. Eyes on the split are very welcome.

Michael Rothgang (Jan 17 2024 at 15:00):

Filed as #9815

Geoffrey Irving (Jan 17 2024 at 16:40):

Is it possible to instrument building mathlib4 in such a way that it could spit out speedups if each file is split up? Roughly,

  1. Record how long each theorem takes, and what it depends on.
  2. Build the full dependency graph of individual lemmas (large, I suppose).
  3. Simulate building all of mathlib in parallel on a given number of threads, assuming one particular file is split entirely into one file per theorem (this is a rough upper bound on the speedup available if a given file is split).

Geoffrey Irving (Jan 17 2024 at 16:40):

Presumably (3) is fast enough (once the timing graph is cached) that you can do it for every file in mathlib.

Mario Carneiro (Jan 18 2024 at 02:37):

shake could do that if you relax the requirement for timing data and just settle for graph properties (e.g. number of files or constants on the critical path, graph diameter, etc)

Yury G. Kudryashov (Jan 19 2024 at 06:49):

I'm splitting file#Topology/Separation into files, 1 per separation axiom.


Last updated: May 02 2025 at 03:31 UTC