Zulip Chat Archive

Stream: mathlib4

Topic: Counting prerequisites of a theorem


David Loeffler (Mar 06 2024 at 09:04):

Is there some way of finding out "how much of mathlib is needed" to prove a given theorem?

This could mean either

  • (a) number of files in a minimal set of mathlib imports required to get the proof to work;
  • (b) total lines of code in (a);
  • (c) (probably much harder) total number of definitions / theorems needed, divided by the number in the stats page.

(Number theorists love to boast that their area uses just about every other area of pure mathematics as input at some stage, and I'm wondering if the statistics really bear this out.)

Johan Commelin (Mar 06 2024 at 09:09):

I think (c) is very doable these days. Tools like leaff and shake do much more complicated calculations.

I remember that we computed (c) for R\R and for perfectoid_space, and we discovered that perfectoid spaces are only 4x more complicated than real numbers.

Kevin Buzzard (Mar 06 2024 at 09:09):

I did notice that my PhD thesis used results from 11 of the 12 courses I went to as a final year UG (logic was the only missing one) and one reason I was pushing in 2017 for mathlib to become what it's becoming was because I knew that FLT would need "everything". But I wonder whether we're high enough up to see this phenomenon yet. Given all this import pruning it's probably not too hard to figure out if all this L-function stuff we have now depends on topology by "eyeballing", but no doubt it's depending on a bunch of analysis and some algebra already.

Ruben Van de Velde (Mar 06 2024 at 09:13):

I know there's logic and logic, but I still was taken aback a bit by your claim that you didn't need logic for your PhD thesis :)

Kim Morrison (Mar 06 2024 at 10:48):

A creative work of pure aesthetic genius.

Damiano Testa (Mar 07 2024 at 02:06):

I had some code that did exactly this, since I had been curious about this question already. Unfortunately, the code is a little too slow to be really usable for "complicated" declarations, but it does provide some data:

import Mathlib

open Lean

/-- `getExprsName e` takes as input an `Expr`ession `e`.
It returns a pair consisting of
* the list of all the maximal proper sub-expressions of `e`,
* the "head" `Name` associated to `e`, if it exists.
-/
def getExprsName : Expr  List Expr × Option Name
  | .const declName _            => ([], some declName)
  | .app fn arg                  => ([fn, arg], none)
  | .lam _ binderType body _     => ([binderType, body], none)
  | .forallE _ binderType body _ => ([binderType, body], none)
  | .letE _ type value body _    => ([type, value, body], none)
  | .mdata _ expr                => ([expr], none)
  | .proj typeName _ struct      => ([struct], some typeName)
  | _ => ([], none)

/-- `getForallEs e` takes as input an expression `e`.
If `e` is of the form `∀ a₁, a₂,..., aₙ, <something>`, then it returns the names
of all the variables `a₁, a₂,..., aₙ`.
This is useful to extract (some of) the constructors of an inductive type. -/
def getForallEs : Expr  List Name
  | .forallE name _ body _ => name :: getForallEs body
  | _ => []

/-- `getConstructorsAux nm` takes as input a `Name` `nm`.
It returns the pair of lists of
* `Name`s of the constructors of `nm` obtained via `Lean.InductiveVal.ctors`,
  if `nm` is the `Name` of an inductive type -- typically `nm.mk`;
* `Name`s of the constructors that appear in the `nm.mk` function for `nm`, extracted
  via `getExprsName`.

The two projections are called `getConstructors'` and `getConstructors`. -/
def getConstructorsAux {m : Type  Type} [Monad m] [MonadResolveName m] [MonadEnv m] (nm : Name) :
    m (List Name × List Name) := do
  let env  getEnv
  if ! env.contains nm then return ([], []) else
    let decl := (env.find? nm).get!
    if ! decl.isInductive then return ([], []) else
      let ctrs := decl.inductiveVal!.ctors
      let decls := (ctrs.map (env.find? · )).reduceOption
      let types := decls.map ConstantInfo.type
      let ret := (types.map getForallEs).join
      let withNamespace  ret.mapM (resolveGlobalName <| nm.append ·)
      return (ctrs, withNamespace.map fun x => (x.getD 0 default).1)

run_cmd do
  let n := `Mul
  if ( getConstructorsAux n) != ([`Mul.mk], [`Mul, `Mul.mul]) then
    throwError "oh no! {← getConstructorsAux n}"

/-- `getConstructors' nm` returns the list of constructors of `nm`, assuming that
it is the `Name` of an inductive type.
Often, this is `nm.mk`. -/
def getConstructors' {m : Type  Type} [Monad m] [MonadResolveName m] [MonadEnv m] (nm : Name) :
    m (List Name) := do
  let (ctrs, _)  getConstructorsAux nm
  return ctrs

/-- `getConstructors nm`  returns the list of `Name`s of the constructors of `nm`,
that appear in the `nm.mk` function for `nm`. -/
def getConstructors {m : Type  Type} [Monad m] [MonadResolveName m] [MonadEnv m] (nm : Name) :
    m (List Name) := do
  let (ctrs, ns)  getConstructorsAux nm
  return ctrs ++ ns

/-- `getConstsIn e` takes as input an `Expr`ession `e` and it returns the `Name`s of all the
constants recursively appearing in `e`.
The optional argument `hs` allows to increment `hs` by such constants, instead of starting
from scratch every time. -/
partial
def getConstsIn (e : Expr) (hs : HashSet Name := .empty) : HashSet Name :=
  Id.run do
  let mut hs := hs
  let (se, na) := getExprsName e
  if let some na := na then hs := hs.insert na
  for s in se do
    let nas := getConstsIn s
    for n in nas do
      if ! hs.contains n then hs := hs.insert n
  return hs

/-- `getAll nm` takes as input a name `nm` and returns the `HashSet Name` of all the
non-blacklisted `Name`s that appear in `nm` and in all the constants that are implied
by `nm`. -/
def getAll {m : Type  Type} [Monad m] [MonadResolveName m] [MonadEnv m] (nm : Name) :
    m (HashSet Name) := do
  let env  getEnv
  let mut boundary := HashSet.empty.insert nm
  let mut tot := HashSet.empty.insert nm
  let mut seenDecls : HashSet Name := .empty
  while ! boundary.isEmpty do
    for new in boundary do
      boundary := boundary.erase new
      let decl := (env.find? new).getD default
      if ! seenDecls.contains decl.name then
        seenDecls := seenDecls.insert decl.name
        let first := getConstsIn decl.type
        for d in first do
          if (! ( d.isBlackListed)) && (! tot.contains d) then
            tot := tot.insert d
            boundary := boundary.insert d
        if decl.value?.isSome then
          for d in getConstsIn decl.value! do
            if (! ( d.isBlackListed)) && (! tot.contains d) then
              tot := tot.insert d
              boundary := boundary.insert d
        for d in ( getConstructors new) ++ ( getConstructors' new) do
          if (! ( d.isBlackListed)) && (! tot.contains d) then
            tot := tot.insert d
            boundary := boundary.insert d
  return tot

/-- `moduleIdxPrefix env str` takes as input an environment `env` and a string `str`.
It returns the array of pairs `(name, moduleIdx)` consisting of a `Name` of a module
that begins with `str` and the `ModuleIdx` corresponding `name`.
This is essentially a dictionary between module names beginning with `str` and
their module indices. -/
def moduleIdxPrefix (env : Environment) (str : String) : Array (Name × ModuleIdx) :=
  let filt := env.allImportedModuleNames.filter (str.isPrefixOf ·.toString)
  filt.map fun d => (d, (env.getModuleIdx? d).getD default)

/-- `countDeclCmd id pathPrefix` is the main "command": it takes as input an identifier `id` and
a path string `pathPrefix` and returns the `HashSet Name` of all names of constants recursively
implied by the declaration determined by `id` and whose  -/
def countDeclCmd {m : Type  Type} [Monad m] [MonadEnv m][MonadResolveName m] [MonadError m]
    (id : TSyntax `ident) (pathPrefix : String) :
    m (HashSet Name) := do
  let nm := id.getId
  let env  getEnv
  let idxPrefix := moduleIdxPrefix env pathPrefix
  let (_, idx) := idxPrefix.unzip
  if (env.find? nm).isNone then throwError "unknown constant '{nm}'"
  let mut rest  getAll nm
  for d in rest do
    let c := env.getModuleIdxFor? d
    if ! (c.getD default)  idx then rest := rest.erase d
  return rest

/-- `#tally id` takes as input an identifier `id`.  It returns a tally of the number of
declarations recursively implied by `id`, where the tally is split by the first two folders
in the path to the file containing each declaration.

For example,
```lean
#tally Nat
/-
3  Init.Prelude

3  total
-/
#tally Int
/-
3  Init.Data
3  Init.Prelude
1  [anonymous]

7  total
-/
```
 -/
elab "#tally " tk:(noWs "!")? id:ident : command => do
  let env  getEnv
  let filt := env.allImportedModuleNames
  let dict := filt.map fun m => ((env.getModuleIdx? m).getD default, m)
  let dict := dict.map fun (idx, nm) =>
    (idx, (nm.components.take 2).foldl (· ++ ·) default)
  let dict := dict.toList
  let tot  countDeclCmd id ""
  let mut tal : HashMap Name Nat := .empty
  for d in tot do
    let didx := env.getModuleIdxFor? d
    let pair := (dict.lookup didx).getD default
    match tal.find? pair with
      | none => tal := tal.insert pair 1
      | some x => tal := tal.insert pair (x + 1)
  let ct := tal.fold (fun n _ t => (n + t)) 0
  let lth := (Nat.toDigits 10 ct).length
  for (m, t) in tal.toArray.qsort (fun n m => n.1.toString < m.1.toString) do
    let pad := List.replicate (lth - (Nat.toDigits 10 t).length) ' '
    dbg_trace (⟨pad ++ s!"{t}  {m}")
  dbg_trace "\n{ct}  total"
  if tk.isSome then
    dbg_trace "\nDeclarations in '{id.getId}':\n"
    for t in tot.toArray.qsort (fun n m => n.toString < m.toString) do dbg_trace t

Damiano Testa (Mar 07 2024 at 02:06):

For instance, you get this, if you wait for a few seconds:

#tally Semiring
/-
  1  Init.Core
  4  Init.Data
  1  Init.Meta
104  Init.Prelude
 41  Mathlib.Algebra
  8  Mathlib.Init
  1  Std.Data
  1  [anonymous]

161  total
-/

#tally Real.sin
/-
   9  Init.Classical
  25  Init.Control
 132  Init.Core
 489  Init.Data
   1  Init.Meta
 317  Init.Prelude
  16  Init.PropLemmas
  51  Init.SimpLemmas
  13  Init.WF
1689  Mathlib.Algebra
 471  Mathlib.Data
  28  Mathlib.GroupTheory
 147  Mathlib.Init
  52  Mathlib.Logic
 374  Mathlib.Order
   9  Mathlib.Tactic
   1  Mathlib.Topology
   5  Std.Classes
 179  Std.Data
   2  Std.Logic
   1  [anonymous]

4011  total
-/

David Loeffler (Mar 07 2024 at 06:37):

Hi Damiano! That looks cool, but given the time it takes for Real.sin, I suspect #tally hasSum_zeta_two would probably be totally infeasible.

David Loeffler (Mar 07 2024 at 06:39):

So maybe I was right to be pessimistic about (c), at least for the time being.

Johan Commelin (Mar 07 2024 at 06:42):

I think it is worth a shot. I would be surprised if it takes more than a few minutes

David Loeffler (Mar 07 2024 at 07:04):

It got there in the end (10 minutes or so). The final tally for hasSum_zeta_two is 19410. That's actually only about 10% of mathlib, I'm disappointed :smile:

I'd be curious to compare this with other heavyweight theorems in the library – can anyone find a single declaration with more than this many prerequisites?

Kevin Buzzard (Mar 07 2024 at 07:11):

I remember that we were initially quite disappointed to discover that perfectoid spaces were only four times as complex as real numbers! I would be tempted to try a theorem about group cohomology for this game. It wouldn't surprise me if it's pulling in a bunch of the algebra hierarchy and a bunch of category theory but right now I have no feeling for numbers.

Actually perhaps the thing to do, now Mario has shaken mathlib, is to redo Scott's calculation of the longest pole and look in the leaf file. I found it very interesting that during the port the longest pole was changing between group cohomology and measure theory, it seemed like evidence that algebra and analysis were growing in complexity at similar rates.

Damiano Testa (Mar 07 2024 at 07:24):

With respect to speed: it is very likely that small tweaks could improve dramatically the running time. I don't think that the code that I wrote is particularly efficient.

David Loeffler (Mar 07 2024 at 07:27):

I'll try running it on groupCohomology.H1ofAutOnUnitsUnique (Hilbert 90) and see how that compares.

David Loeffler (Mar 07 2024 at 08:01):

David Loeffler said:

I'll try running it on groupCohomology.H1ofAutOnUnitsUnique (Hilbert 90) and see how that compares.

Only 1611, so Hilbert 90 is "easier" than Real.sin. Just goes to show how little correlation there is between the order things are built from the axioms, and the order we teach them to students!

Damiano Testa (Mar 07 2024 at 08:10):

Ok, besides speed improvements, this answer on Hilbert 90 makes me wonder whether my code is missing anything.

Damiano Testa (Mar 07 2024 at 08:16):

These are the declarations that it finds for H90.

H90 -- first 400 lines

Damiano Testa (Mar 07 2024 at 08:18):

second batch

Damiano Testa (Mar 07 2024 at 08:19):

third batch

Damiano Testa (Mar 07 2024 at 08:20):

final batch

David Loeffler (Mar 07 2024 at 08:56):

Kevin: you know this part of the code, and I don't – is there anything here that's missing? It definitely seems strange that Hilbert 90 only seems to touch five other declarations under groupCohomology.

Damiano Testa (Mar 07 2024 at 09:05):

If it helps with the debugging, I added a ! flag to #tally (I edited the code above):

#tally! Int
/-
3  Init.Data
3  Init.Prelude
1  [anonymous]

7  total

Declarations in 'Int':

Int
Int.negSucc
Int.ofNat
Nat
Nat.succ
Nat.zero
[anonymous]
-/

This prints the (sorted) list of declarations that it found.

Xavier Roblot (Mar 07 2024 at 09:25):

I tried it on docs#RingOfIntegers.isPrincipalIdealRing_of_abs_discr_lt that uses most of the algebraic number theory in Mathlib (except units) and got a score of 29220.

    1  Aesop.BuiltinRules
    7  Init.ByCases
   18  Init.Classical
    3  Init.Coe
   31  Init.Control
  244  Init.Core
  906  Init.Data
    2  Init.Ext
    1  Init.Meta
    4  Init.MetaTypes
  121  Init.Omega
  370  Init.Prelude
   65  Init.PropLemmas
   71  Init.SimpLemmas
    4  Init.SizeOf
   30  Init.WF
 4665  Mathlib.Algebra
    6  Mathlib.AlgebraicGeometry
 2467  Mathlib.Analysis
 5171  Mathlib.Data
   33  Mathlib.Dynamics
  251  Mathlib.FieldTheory
  838  Mathlib.GroupTheory
  281  Mathlib.Init
 1443  Mathlib.LinearAlgebra
  644  Mathlib.Logic
 3105  Mathlib.MeasureTheory
  153  Mathlib.NumberTheory
 2727  Mathlib.Order
 1182  Mathlib.RingTheory
  377  Mathlib.SetTheory
  180  Mathlib.Tactic
 3355  Mathlib.Topology
    5  Std.Classes
  447  Std.Data
   11  Std.Logic
    1  [anonymous]

29220  total

And docs#IsCyclotomicExtension.Rat.three_pid should even get more but I don't have time to test it now...

Damiano Testa (Mar 07 2024 at 09:26):

By manual inspection, I think that there are definitely more declarations in the groupCohomology namespace that it should have found. I'm investigating why it is not recursing properly.

Damiano Testa (Mar 07 2024 at 09:29):

To go a bit into the internals, the code omits declarations that are blackListed: these should essentially be the declarations that lean uses internally to fill in some gaps, but I do not think that pruning them they should be responsible for the inconsistencies that seem to be present.

David Loeffler (Mar 07 2024 at 09:55):

Xavier Roblot said:

I tried it on docs#RingOfIntegers.isPrincipalIdealRing_of_abs_discr_lt that uses most of the algebraic number theory in Mathlib (except units) and got a score of 29220.

Ah, the Minkowski bound! I wondered why that needed quite so much analysis / topology, but it looks like the dependence goes via MeasureTheory.Measure.Lebesgue.VolumeOfBalls. This may well be the "commanding height" of mathlib – I wonder if anyone can beat that score?

Xavier Roblot (Mar 07 2024 at 10:01):

docs#IsCyclotomicExtension.Rat.three_pid should beat it since it actually uses that function. Note that it tooks more than 30mn to compute this tally on my M1 laptop. Anyway, I am not in my office right now so somebody else will have to try :smile:

Riccardo Brasca (Mar 07 2024 at 10:23):

I guess this is because the definition of the cyclotomic polynomials uses complex roots of unity (this is my fault, I am to lazy to do an algebraic refactor)

Xavier Roblot (Mar 07 2024 at 10:34):

There is also a bunch of topology / analysis involved in the proof that two complex embeddings define the same infinite place iff they are complex conjugate.

Kevin Buzzard (Mar 07 2024 at 10:58):

David Loeffler said:

Kevin: you know this part of the code, and I don't – is there anything here that's missing? It definitely seems strange that Hilbert 90 only seems to touch five other declarations under groupCohomology.

I was expecting a lot more category theory and topology -- but looking at the code I see I'd misguessed what was going on. H1 is defined concretely as 1-cocycles over 1-coboundaries, rather than H n for n = 1, so all the general group cohomology machinery (which uses a bunch of resolutions, which uses a bunch of topology...) is not used. In short docs#groupCohomology.isoH1 is not rfl, so it was my guess which was lousy, not Damiano's code.

Damiano Testa (Mar 07 2024 at 11:02):

Kevin Buzzard said:

In short docs#groupCohomology.isoH1 is not rfl, so it was my guess which was lousy, not Damiano's code.

Oh, I assure you, Damiano's code has a very high chance of being lousy!

Xavier Roblot (Mar 07 2024 at 17:59):

I finally had the time to test docs#IsCyclotomicExtension.Rat.three_pid, the tally is 31083

    1  Aesop.BuiltinRules
    8  Init.ByCases
   18  Init.Classical
    3  Init.Coe
   31  Init.Control
  247  Init.Core
  953  Init.Data
    2  Init.Ext
    1  Init.Meta
    4  Init.MetaTypes
  147  Init.Omega
  371  Init.Prelude
   69  Init.PropLemmas
   72  Init.SimpLemmas
    4  Init.SizeOf
   30  Init.WF
 5094  Mathlib.Algebra
    6  Mathlib.AlgebraicGeometry
 2483  Mathlib.Analysis
 5774  Mathlib.Data
   39  Mathlib.Dynamics
  368  Mathlib.FieldTheory
  916  Mathlib.GroupTheory
  283  Mathlib.Init
 1483  Mathlib.LinearAlgebra
  649  Mathlib.Logic
 3105  Mathlib.MeasureTheory
  239  Mathlib.NumberTheory
 2735  Mathlib.Order
 1517  Mathlib.RingTheory
  384  Mathlib.SetTheory
  193  Mathlib.Tactic
 3355  Mathlib.Topology
    5  Std.Classes
  482  Std.Data
   11  Std.Logic
    1  [anonymous]

31083  total

Mario Carneiro (Mar 07 2024 at 18:36):

@Damiano Testa As you suspected, the code can be written more simply and a lot faster by reusing the code of #print axioms, which is already doing basically all of this work. (You can also copy and paste it, it's not that complicated, but it so happens that we can actually use the core function unmodified in this case.)

import Mathlib

open Lean

elab "#tally " tk:(noWs "!")? id:ident : command => do
  let env  getEnv
  let decl  Elab.resolveGlobalConstNoOverloadWithInfo id
  let (_, { visited, .. }) := Elab.Command.CollectAxioms.collect decl |>.run ( getEnv) |>.run {}
  let mut truncate := if tk.isSome then
    env.header.moduleNames
  else
    env.header.moduleNames.map fun n => Name.fromComponents <| n.components.take 2
  let mut out := mkRBMap Name Nat Name.cmp
  for d in visited do
    if let some idx := env.getModuleIdxFor? d then
      let n := truncate[idx.toNat]!
      out := out.insert n (out.findD n 0 + 1)
  let mut msg := m!""
  let mut total := 0
  for (name, n) in out do
    total := total + n
    msg := msg ++ m!"{(toString n).leftpad 8 ' '}  {name}\n"
  logInfo m!"{msg}\n{(toString total).leftpad 8 ' '}  total"

Mario Carneiro (Mar 07 2024 at 18:36):

#tally IsCyclotomicExtension.Rat.three_pid
       1  Aesop.BuiltinRules
      16  Init.ByCases
      32  Init.Classical
       3  Init.Coe
      20  Init.Control
     331  Init.Core
    1511  Init.Data
       3  Init.Ext
       1  Init.Meta
     172  Init.Omega
     497  Init.Prelude
     156  Init.PropLemmas
      99  Init.SimpLemmas
       9  Init.SizeOf
      45  Init.WF
    8646  Mathlib.Algebra
       9  Mathlib.AlgebraicGeometry
    4591  Mathlib.Analysis
       5  Mathlib.Combinatorics
    9588  Mathlib.Data
      64  Mathlib.Dynamics
     637  Mathlib.FieldTheory
    2084  Mathlib.GroupTheory
     408  Mathlib.Init
    3142  Mathlib.LinearAlgebra
    1209  Mathlib.Logic
    4749  Mathlib.MeasureTheory
     444  Mathlib.NumberTheory
    4425  Mathlib.Order
    2980  Mathlib.RingTheory
     666  Mathlib.SetTheory
     254  Mathlib.Tactic
    5234  Mathlib.Topology
       5  Std.Classes
     925  Std.Data
      17  Std.Logic

   52978  total

David Loeffler (Mar 07 2024 at 18:58):

As you suspected, the code can be written more simply and a lot faster by reusing the code of #print axioms, which is already doing basically all of this work

It gets – simply and a lot faster – to a very different answer! For Rat.three_PID, Mario's code gives an answer of 52978 while Damiano's gives 31083. I wonder how the difference arises?

Damiano Testa (Mar 07 2024 at 19:04):

Mario, that is fantastic! I did not really know what print axioms did, so it hadn't even occurred to me to recycle it!

Damiano Testa (Mar 08 2024 at 09:50):

I've done a little bit of investigating and I think that the discrepancy between the numbers may be due to the fact that my code was removing blackListed declarations (which I was doing in the hope of speeding up the computation).

Besides that, in terms of programming abilities, there are probably very few wider chasms than the one separating Mario and myself, so that should explain why his code works, while mine is, at best, shaky.

Finally, I see that Mario reinterpreted the value of the ! flag to mean "tally by whole path instead of 2-step truncated paths". What I had intended the ! flag to be was to actually print all the declarations found. If you want that behaviour, below I adapt minimally Mario's code from above.

So, now you can do

#tally CovariantClass
/-
       3  Mathlib.Algebra

       3  total
-/

#tally! CovariantClass
/-
       3  Mathlib.Algebra.CovariantAndContravariant

       3  total
-/

#tally CovariantClass all
/-
       3  Mathlib.Algebra

       3  total

[Covariant, CovariantClass, CovariantClass.mk]
-/

#tally! CovariantClass all
/-
       3  Mathlib.Algebra.CovariantAndContravariant

       3  total

[Covariant, CovariantClass, CovariantClass.mk]
-/
import Mathlib

open Lean

elab "#tally " tk:(noWs "!")? id:ident all?:(" all")? : command => do
  let env  getEnv
  let decl  Elab.resolveGlobalConstNoOverloadWithInfo id
  let (_, { visited, .. }) := Elab.Command.CollectAxioms.collect decl |>.run ( getEnv) |>.run {}
  let mut truncate := if tk.isSome then
    env.header.moduleNames
  else
    env.header.moduleNames.map fun n => Name.fromComponents <| n.components.take 2
  let mut out := mkRBMap Name Nat Name.cmp
  for d in visited do
    if let some idx := env.getModuleIdxFor? d then
      let n := truncate[idx.toNat]!
      out := out.insert n (out.findD n 0 + 1)
  let mut msg := m!""
  let mut total := 0
  for (name, n) in out do
    total := total + n
    msg := msg ++ m!"{(toString n).leftpad 8 ' '}  {name}\n"
  let pmsg := m!"{msg}\n{(toString total).leftpad 8 ' '}  total"
  logInfo <| pmsg ++ match all? with
      | none => ""
      | some _ => m!"\n\n{visited.toArray.qsort (·.toString < ·.toString)}"

Riccardo Brasca (May 03 2024 at 19:19):

Can part of this code be used to find unused declarations to get to a specific theorem? For example, in flt-regular we have a clear target (FLT for regular primes!), but I am sure we have a lot of useless code. It would be nice to get a list of useless things, to check it they just deserve to be removed.

Mario Carneiro (May 03 2024 at 19:26):

Std/Batteries has #show_unused, although it only works within a single file

Mario Carneiro (May 03 2024 at 19:26):

it could work across files but it would need a different interface

David Loeffler (May 04 2024 at 06:04):

I recall a discussion not long back about a script intended to automatically produce MWE's for some failing declaration. That's the same problem as here, in some sense. I can't remember who was working on the auto-MWE script though – @Johan Commelin was it you?

Johan Commelin (May 04 2024 at 06:53):

No, it wasn't me.

Yaël Dillies (May 04 2024 at 06:55):

It was @Anne Baanen

Yaël Dillies (May 04 2024 at 06:56):

See https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/Bug.20causing.20stack.20overflow

Kim Morrison (May 04 2024 at 07:15):

I think the bottom line was that the automation wasn't yet good enough. I would love progress on this front.

There are a lot of tricks to effectively minimizing from deep inside Mathlib.

To begin, you need a good set of #guard_msgs statements (or sometimes hand-rolled checks of the errors messages) that ensure that you have red iff you've over-minimized.

After that, you iteratively:

  • move the contents of an import into the current file (I find it helpful to wrap it in a section named after the file)
  • delete all unneeded declarations (#show_unused in X is super helpful)
  • further minimizing the remaining declarations (replace all proofs with sorry, remove structure fields -- hard to automate as you often need downstream fixes, try changing definitions to sorry (sometimes this doesn't work but changing the definition to be something "simpler" does), weakening extends statements or typeclass arguments...)
  • minimize imports (try removing imports, and other replacing each by its own imports: shake can help here)

If you go through the @[to_additive] parts of the library removing that dependency can be painfully manual.

Anne Baanen (May 04 2024 at 08:58):

Here's the main thread for the script: https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Minimizer.20script

Anne Baanen (May 04 2024 at 09:00):

It essentially tries doing all the steps Kim suggests but by brute force, simply doing a search cutting out whole lines and paragraphs without any guidance from Lean. So it's not particularly effective at actually finding what can be dropped, nor does it do any downstream fixing.

Anne Baanen (May 04 2024 at 09:00):

In practice it means it is way too slow and way too ineffective for any use on Mathlib. :(

Damiano Testa (Jun 04 2024 at 20:13):

Out of curiosity, I tallied all lemmas: these are the top 10 using the most declarations

Declaration Depends on
NumberField.Units.unitLattice_rank 51024
NumberField.hermiteTheorem.finite_of_discr_bdd_of_isReal 51088
NumberField.hermiteTheorem.finite_of_discr_bdd_of_isComplex 51103
NumberField.finite_of_discr_bdd 51121
NumberField.Units.rank_modTorsion 51201
NumberField.Units.basisModTorsion 51211
NumberField.Units.fun_eq_repr 51214
NumberField.Units.exist_unique_eq_mul_prod 51219
IsCyclotomicExtension.Rat.three_pid 53083
IsCyclotomicExtension.Rat.five_pid 53084

Kim Morrison (Jun 05 2024 at 01:30):

I feel like you should only list declarations that are not themselves used in other declarations!

Damiano Testa (Jun 05 2024 at 05:13):

I agree, but I need to think about how to do that with finite resources! :smile:

Damiano Testa (Sep 10 2024 at 15:34):

Warning: take what is below more as plausibility test, rather than a bug-free report!

I went back to the question of finding which declarations are maximal, in the sense that they are not used by any other declaration in mathlib. For those declarations, I computed the number of transitively involved constants, as a measure of "complexity". Here is a summary of the findings.

There are 72619 maximal declarations in Mathlib. The 10 longest ones are

I find it a little ironic that docs#WeierstrassCurve.toProjective involves 4 constants and is at position 42.

Source: the Find tips tab.

David Ang (Sep 10 2024 at 15:56):

To be fair docs#WeierstrassCurve.toProjective doesn't actually do anything...

Damiano Testa (Sep 10 2024 at 15:59):

I know: as soon as I saw it, I went and checked, since I thought "there is no way that this can be right", but it is!

In case you are curious, these are the 4 involved constants:

#[WeierstrassCurve,
 WeierstrassCurve.Projective,
 WeierstrassCurve.toProjective,
 WeierstrassCurve.mk]

Michael Stoll (Sep 10 2024 at 16:27):

Damiano Testa said:

Warning: take what is below more as plausibility test, rather than a bug-free report!

I went back to the question of finding which declarations are maximal, in the sense that they are not used by any other declaration in mathlib. For those declarations, I computed the number of transitively involved constants, as a measure of "complexity". Here is a summary of the findings.

There are 72619 maximal declarations in Mathlib. The 10 longest ones are

I find it a little ironic that docs#WeierstrassCurve.toProjective involves 4 constants and is at position 42.

Source: the Find tips tab.

Number Theory rules!

David Loeffler (Sep 10 2024 at 17:45):

Michael Stoll said:

Number Theory rules!

In mountaineering, people talk about the "prominence" of a mountain peak, defined as the height of that peak above the saddle connecting it to the next highest peak. By analogy with this, I wonder what the "most prominent" mathlib theorems are? I.e. what are the first few elements of the list characterised by

L[N + 1] := the theorem "x" which maximizes # { prerequisites of x which are not prerequisites of any of L[1], ..., L[N] }.

Presumably L[2] would be something outside of number theory – I'm guessing it would be one of the results about C-star algebras that are the highest-ranked non-number-theory items on the current list.

Damiano Testa (Sep 10 2024 at 21:10):

I agree that this information is still not ideal, but at least these results are not "in a chain", like the previous attempt that I made!

Damiano Testa (Sep 10 2024 at 21:11):

However, this code took almost 4 hours to run, so any more sophisticated analysis will require a different idea for how to implement it.

Daniel Weber (Sep 11 2024 at 06:50):

Would it be possible to do this count ignoring proofs, i.e. the most complex definitions?

Damiano Testa (Sep 11 2024 at 06:53):

Probably in byAsSorry with roughly the same code? I never used byAsSorry, though.

Damiano Testa (Sep 11 2024 at 07:00):

If that works, then also I expect that the whole process would be much faster and a lot of extra analysis could be done without rethinking about optimizing the current script.

Kim Morrison (Sep 11 2024 at 07:03):

No, byAsSorry is irrelevant here. Just pull the constants from the type, not from the value.

Daniel Weber (Sep 11 2024 at 07:21):

How many constants are there? I don't know much about meta-programming, but if you're able to export a graph I could do some analysis

Damiano Testa (Sep 11 2024 at 07:22):

230827 is the number of declarations.

Damiano Testa (Sep 11 2024 at 07:23):

(Possibly after filtering out the ones that are "implementation details" for some meaning of "implementation details".)

Damiano Testa (Sep 11 2024 at 07:23):

In any case, those are the declarations that the linter uses, although it counts the implementation details as dependencies as well. I am not sure how much of a difference that would make.

Daniel Weber (Sep 11 2024 at 07:23):

How many edges are there?

Damiano Testa (Sep 11 2024 at 07:24):

I do not know, but a lot vertices have valence way over 10k.

Damiano Testa (Sep 11 2024 at 07:25):

The maximum valence is 55k (as per above).

Daniel Weber (Sep 11 2024 at 07:27):

I mean direct edges, not in the transitive closure

Damiano Testa (Sep 11 2024 at 07:27):

That information I have not looked at...

Damiano Testa (Sep 11 2024 at 07:27):

However, maybe the length of the proof is a good indication?

Damiano Testa (Sep 11 2024 at 07:45):

In any case, even though I have not looked explicitly at the number of direct dependencies, I do think that it is much smaller than the number of transitive dependencies and it might even be quite small (maybe even ~100 on average? Pure speculation).

Damiano Testa (Sep 11 2024 at 08:39):

This build tallies what I think are the direct imports of each declaration. The maximum appears to be 579.

Damiano Testa (Sep 11 2024 at 08:43):

Notice how this step took 3mins, as opposed to almost 4 hours for the "full" check: definitely there is room for improvement!

Daniel Weber (Sep 11 2024 at 11:01):

How would I get it to output all of tips to a file?

Damiano Testa (Sep 11 2024 at 11:02):

You would like the "lighter" one, right? Just the direct dependencies and then you take care of taking the transitive closure?

Damiano Testa (Sep 11 2024 at 11:04):

(I am saying this, since tips is what I have been using for the expr-maximal declarations, whereas I imagine that you want a list of pairs (declaration, array of declarations that are directly used in the first entry).)

Damiano Testa (Sep 11 2024 at 11:04):

(And this is not directly available from the script, but could probably be generated.)

Daniel Weber (Sep 11 2024 at 11:06):

Damiano Testa said:

You would like the "lighter" one, right? Just the direct dependencies and then you take care of taking the transitive closure?

Yes

Damiano Testa (Sep 11 2024 at 11:11):

This list would contain, roughly, 20k entries, each with a corresponding array of roughly 300 entries. So, roughly, 10^8 characters (I'm trying to overestimate, using that each declaration name is 100 characters, since there no longer are insane instance names :smile: ).

Damiano Testa (Sep 11 2024 at 11:12):

It does not sound insane, but I am not sure that I should make GitHub produce that in CI...

Daniel Weber (Sep 11 2024 at 11:13):

Would it be possible to number them? i.e. Say that 2 uses 4, 6, and 9, and have the map to the constant name separate

Damiano Testa (Sep 11 2024 at 11:14):

It is certainly possible and since everything goes via stuff that is called HashSomething I imagine that Lean is already doing that internally... But maybe someone who is more knowledgeable of actual coding should help here...

Daniel Weber (Sep 11 2024 at 12:06):

I managed to write

elab "#direct_deps" : command => do
  let env  getEnv
  let mut name_ids : Std.HashMap Name Nat :=  liftCoreM do
    env.constants.map₁.foldM (init := {}) fun map n ci => do
      if ( n.isBlackListed) || n.getRoot == `Lean then return map
      else return map.insert n map.size
  let mut tips : Std.HashMap Name NameSet :=  liftCoreM do
    env.constants.map₁.foldM (init := {}) fun map n ci => do
      if ( n.isBlackListed) || n.getRoot == `Lean then return map
      else return map.insert n (ci.getUsedConstantsAsSet)
  IO.FS.withFile (( IO.currentDir) / "test.txt") (.write) fun hfile  do
    hfile.putStr s!"{name_ids.size}\n"
    for (name, id) in name_ids do
      hfile.putStr s!"{name} {id}\n"
    for (name, maps_to) in tips do
      hfile.putStr (s!"{name_ids.get! name} {maps_to.size} " ++
        " ".intercalate (maps_to.toList.map (toString  name_ids.get!)) ++ "\n")

it seems to be working. Do you know if there's a way to get a nice progress bar, like Python's tqdm?

Damiano Testa (Sep 11 2024 at 12:15):

I don't think that there is a way to get a progress bar. I also don't think that it is possible to print something "mid-computation". I think it is all-or-nothing, unfortunately.

Damiano Testa (Sep 11 2024 at 12:16):

What I thought might have been an option was to try to split the computation in two parts and get a "halfway report", but did not think hard about it.

Daniel Weber (Sep 11 2024 at 12:28):

Does VS code compile and run this, or is this getting interpreted? It still hadn't finished, and it's only with what Tips.lean imports

Damiano Testa (Sep 11 2024 at 12:30):

I do not really know the difference between interpret and compile, but I think that it is the slower of the two.

Damiano Testa (Sep 11 2024 at 12:33):

Also, in CI, the commented import Mathlib gets uncommented so that the code runs on all of mathlib, but I was using the lighter imports for testing.

Daniel Weber (Sep 11 2024 at 13:25):

I did this, I'm not sure how well it's working

/-
Copyright (c) 2024 Damiano Testa. All rights reserved.
Released under Apache 2.0 license as described in the file LICENSE.
Authors: Damiano Testa
-/
import ImportGraph
import Cli
--import Mathlib

/-!
# Find declarations that are not used by other declarations
-/

open Lean Meta Elab Command

def isMatcher' (env : Environment) (declName : Name) : Bool :=
  isMatcherCore env declName

def isRec' (env : Environment) (declName : Name) : Bool :=
  isRecCore env declName

def Lean.Name.isBlackListed' (env : Environment) (declName : Name) : Id Bool := do
  if declName == ``sorryAx then return true
  if declName matches .str _ "inj" then return true
  if declName matches .str _ "noConfusionType" then return true
  pure <| declName.isInternalDetail
   || isAuxRecursor env declName
   || isNoConfusion env declName
   || isRec' env declName || isMatcher' env declName

/--
`lake exe deps`
-/
def mainCli (args : Cli.Parsed) : IO UInt32 := do
  let to  match args.flag? "to" with
  | some to => pure <| to.as! Cli.ModuleName
  | none => ImportGraph.getCurrentModule -- autodetect the main module from the `lakefile.lean`
  IO.eprintln "test"
  searchPathRef.set compile_time_search_path%
  IO.FS.withFile (( IO.currentDir) / "test.txt") .write fun hfile  do
  unsafe withImportModules #[{module := to}] {} (trustLevel := 1024) fun env => do
  let mut name_ids : NameMap Int := {}
  for (n, _) in env.constants do
    if ( n.isBlackListed' env) || n.getRoot == `Lean then pure () else
    if name_ids.size % 10 == 0 then IO.eprintln s!"finished {name_ids.size}"
    name_ids := name_ids.insert n name_ids.size
  hfile.putStrLn s!"{name_ids.size}"
  for (n, ci) in env.constants do
    if ( n.isBlackListed' env) || n.getRoot == `Lean then pure () else
    let names := (ci.getUsedConstantsAsSet.toList.filterMap name_ids.find?).map toString
    hfile.putStrLn (s!"{n} {name_ids.find! n} {names.length} " ++ " ".intercalate names)
  return 0

/-- Setting up command line options and help text for `lake exe pole`. -/
def deps : Cli.Cmd := `[Cli|
  deps VIA mainCli; ["0.0.1"]
  "Calculat a thing."

  FLAGS:
    to : Cli.ModuleName;      "Calculate the longest thing with the thing."
]

/-- `lake exe deps` -/
def main (args : List String) : IO UInt32 :=
  deps.validate args

Daniel Weber (Sep 11 2024 at 15:02):

It finished running and the file is about 50MB. I'll do some analysis, is there anything in particular which might be interesting to look at?

Daniel Weber (Sep 11 2024 at 15:14):

I wanted to check that it's a DAG as a sanity check, but surprisingly it's not - there are SCCs of size 28, for instance:

  • CommRingCat.Colimits.Relation
  • CommRingCat.Colimits.Relation.mul
  • CommRingCat.Colimits.Relation.neg_add_cancel
  • CommRingCat.Colimits.Relation.zero_mul
  • CommRingCat.Colimits.Relation.one_mul
  • CommRingCat.Colimits.Relation.zero
  • CommRingCat.Colimits.Relation.neg
  • CommRingCat.Colimits.Relation.left_distrib
  • CommRingCat.Colimits.Relation.mul_comm
  • CommRingCat.Colimits.Relation.add_1
  • CommRingCat.Colimits.Relation.add_zero
  • CommRingCat.Colimits.Relation.map
  • CommRingCat.Colimits.Relation.refl
  • CommRingCat.Colimits.Relation.add_2
  • CommRingCat.Colimits.Relation.mul_zero
  • CommRingCat.Colimits.Relation.zero_add
  • CommRingCat.Colimits.Relation.one
  • CommRingCat.Colimits.Relation.mul_1
  • CommRingCat.Colimits.Relation.mul_assoc
  • CommRingCat.Colimits.Relation.add_comm
  • CommRingCat.Colimits.Relation.symm
  • CommRingCat.Colimits.Relation.right_distrib
  • CommRingCat.Colimits.Relation.trans
  • CommRingCat.Colimits.Relation.add_assoc
  • CommRingCat.Colimits.Relation.add
  • CommRingCat.Colimits.Relation.mul_2
  • CommRingCat.Colimits.Relation.mul_one
  • CommRingCat.Colimits.Relation.neg_1

Daniel Weber (Sep 11 2024 at 15:15):

What is CommRingCat.Colimits.Relation.below? I don't see it on the docs - I think I might not be filtering implementation details correctly

Markus Himmel (Sep 11 2024 at 15:17):

below is explained in Theorem Proving in Lean: https://lean-lang.org/theorem_proving_in_lean4/induction_and_recursion.html?highlight=below#structural-recursion-and-induction

Daniel Weber (Sep 11 2024 at 15:31):

The height (longest chain) of Mathlib is 261 (note - edited):

Daniel Weber (Sep 11 2024 at 15:37):

And in Init it's 43:

Daniel Weber (Sep 12 2024 at 02:03):

Here are the most commonly used constants (directly):

Daniel Weber (Sep 12 2024 at 03:28):

And transitively:

Damiano Testa (Sep 12 2024 at 09:04):

If you are taking requests, I think that this might be in spirit similar to David's prominence.

The top "tips" probably all share a large number of common ancestors and therefore should probably all be considered "about the same". What is the next "group" of declarations? Maybe considering two "tips" close if they share more than 99% of ancestors, can you find a tip that is not close to the top one? Maybe varying 99% to some other more appropriate value.

Daniel Weber (Sep 12 2024 at 09:35):

I currently have a bug in my data, because of you only look at direct dependencies you can't just ignore private decelerations

Daniel Weber (Sep 12 2024 at 11:33):

David Loeffler said:

Michael Stoll said:

Number Theory rules!

In mountaineering, people talk about the "prominence" of a mountain peak, defined as the height of that peak above the saddle connecting it to the next highest peak. By analogy with this, I wonder what the "most prominent" mathlib theorems are? I.e. what are the first few elements of the list characterised by

L[N + 1] := the theorem "x" which maximizes # { prerequisites of x which are not prerequisites of any of L[1], ..., L[N] }.

Presumably L[2] would be something outside of number theory – I'm guessing it would be one of the results about C-star algebras that are the highest-ranked non-number-theory items on the current list.

Here are the first 30 values of L, and the number of new prerequisites they add:

Johan Commelin (Sep 12 2024 at 11:55):

Pretty cool! I'm proud that WittVector.equiv is in the list :smiley:

Violeta Hernández (Sep 16 2024 at 13:12):

No idea what Surreal.dyadicMap_apply_pow is doing there... Is it just that it's unused and uses some heavy number theory stuff?

Violeta Hernández (Sep 16 2024 at 13:13):

I guess it does help how isolated game theory is from everything else

Felix Weilacher (Nov 04 2024 at 23:29):

Question from a noob: How does the #tally! from this thread differ from Lean.Name.transitivelyUsedConstants?

Damiano Testa (Nov 04 2024 at 23:35):

Probably just that it was written before the function.


Last updated: May 02 2025 at 03:31 UTC