Zulip Chat Archive

Stream: metaprogramming / tactics

Topic: Counting declarations


Jon Eugster (Sep 01 2024 at 12:09):

What is the best way to get the number of declarations defined in a particular file? Probably from CoreM...

I found docs#Std.Tactic.Lint.getDeclsInPackage which seems to do something like that, but for example for `Lean it returns all declarations in any file Lean/XY/_.lean and I thought having it return 0 would be better in my use case, as Lean.lean is an empty file (with some imports).

Damiano Testa (Sep 01 2024 at 12:44):

The Environment contains const2ModIdx, though you then have to find the correct modIdx. I think that there is a function that tells you directly the declarations in the current module, but maybe that's not what you want (and I'm on mobile and cannot find it now).

Damiano Testa (Sep 01 2024 at 13:09):

E.g., something like this?

import Mathlib

open Lean Elab Command

elab "#decls_in " id:ident : command => do
  let env  getEnv
  match env.moduleIdxForModule? id.getId with
    | none => logWarning m!"Sorry, module '{id}' does not exist"
    | some modIdx =>
      let csts := env.const2ModIdx
      let cInM  csts.foldM (fun a b c => do
        if c == modIdx && (! ( b.isBlackListed)) then return a.insert b else return a) ({} : NameSet)
      logInfo m!"Found {cInM.size} declarations in '{id}':\n\n{cInM.toArray.qsort (·.toString < ·.toString)}"

#decls_in Mathlib.Data.Nat.Defs

Jon Eugster (Sep 01 2024 at 13:09):

That's docs#Std.Tactic.Lint.getDeclsInCurrModule.

I see, that's roughly what I'm trying now with const2ModIdx, thanks!

Dumb sanity question, if I have a file with two defs in there and nothing else, like this here: Batteries/Lean/IO/Process.lean, should the number of decls be 2?

Damiano Testa (Sep 01 2024 at 13:09):

I removed the "blacklisted" decls, otherwise you get a lot of _aux stuff.

Damiano Testa (Sep 01 2024 at 13:10):

Try filtering out more or less what you want, to see what suits you.

Damiano Testa (Sep 01 2024 at 13:14):

In general, I find it always a little tricky to find out exactly what declarations I care about: normally, everything that is "visible" is something that I would want, to_additive stuff tends also to be what I want, but anything after that... it depends!

Jon Eugster (Sep 01 2024 at 13:15):

Ah thanks, that's awesome! (I see, I didn't use env.moduleIdxForModule? so probably got a random index :sweat_smile:

I'm not in Mathlib so I have to see what to replace isBlacklisted with (I vaguely remember that Lean had some isInternal or so.

Damiano Testa (Sep 01 2024 at 13:16):

For the particular file that you mentioned, with isBlacklisted you do get only two declarations, but there 16 more that are blacklisted.

Damiano Testa (Sep 01 2024 at 13:16):

def isBlackListed {m} [Monad m] [MonadEnv m] (declName : Name) : m Bool := do
  if declName == ``sorryAx then return true
  if declName matches .str _ "inj" then return true
  if declName matches .str _ "noConfusionType" then return true
  let env  getEnv
  pure <| declName.isInternalDetail
   || isAuxRecursor env declName
   || isNoConfusion env declName
  <||> isRec declName <||> isMatcher declName

Damiano Testa (Sep 01 2024 at 13:21):

This may be a "minimized" version:

import Lean.Elab.Command

open Lean Meta

/- from `Mathlib.Lean.Expr.Basic` -/
def Lean.Name.isBlackListed {m} [Monad m] [MonadEnv m] (declName : Name) : m Bool := do
  if declName == ``sorryAx then return true
  if declName matches .str _ "inj" then return true
  if declName matches .str _ "noConfusionType" then return true
  let env  getEnv
  pure <| declName.isInternalDetail
   || isAuxRecursor env declName
   || isNoConfusion env declName
  <||> isRec declName <||> isMatcher declName

/- from `Batteries.Tactic.OpenPrivate` -/
def Lean.Environment.moduleIdxForModule? (env : Environment) (mod : Name) : Option ModuleIdx :=
  (env.allImportedModuleNames.indexOf? mod).map fun idx => idx.val


elab "#decls_in " id:ident : command => do
  let env  getEnv
  match env.moduleIdxForModule? id.getId with
    | none => logWarning m!"Sorry, module '{id}' does not exist"
    | some modIdx =>
      let csts := env.const2ModIdx
      let cInM  csts.foldM (fun a b c => do
        if c == modIdx && (! ( b.isBlackListed)) then
          return a.insert b
        else
          return a) ({} : NameSet)
      logInfo m!"Found {cInM.size} declarations in '{id}':\n\n\
                {cInM.toArray.qsort (·.toString < ·.toString)}"

Jon Eugster (Sep 02 2024 at 10:09):

so, yesterday I ended up with this

/-- Get all declarations in the specified file. -/
def getDeclsInFile (module : Name) : CoreM NameSet := do
  let env  getEnv
  match env.moduleIdxForModule? module with
    | none => return {}
    | some modIdx =>
      let decls := env.const2ModIdx
      let declsIn  decls.foldM (fun acc n idx => do
        if idx == modIdx && (! ( isBlackListed n)) then
          return acc.insert n
        else
          return acc) ({} : NameSet)
      return declsIn

(source)

However, I have a hunch that this is quite slow and all I need later is (← getDeclsInFile n).size to create some node-sizes in this mathlib graph (which some might remember from Lean3):

Screenshot_20240901_231516.png

Do you have a suggestion for a faster method to get some meaningful node sizes? Maybe just retrieving the file size?

Jon Eugster (Sep 02 2024 at 10:21):

Hmm apparently de-monadifying everything and only calling let env ← getEnv once outside is already an improvement. Is ← getEnvsomething that's usually slow?

Damiano Testa (Sep 02 2024 at 10:26):

I am not sure about getEnv being slow: could it be that re-getting the environment means that a lot of cached information is thrown away?

Damiano Testa (Sep 02 2024 at 10:28):

If you just want to count the definitions that are "visible", you might also look at the .ileans that contain information about what is defined in a file, though only the declarations that have some syntax, I believe.

Damiano Testa (Sep 02 2024 at 10:28):

So, to_additive declarations would not be there, probably.

Damiano Testa (Sep 02 2024 at 10:29):

As a general rule, I do not have a very good feeling for what is slow and what is fast, so it is very likely that the code above could be made immensely more efficient by small changes that I ignore...

Jon Eugster (Sep 02 2024 at 10:48):

I don't understand this too well, so maybe your guess about the cache is something.

In any case, only getting the env once and passing it into the functions reduced the time to 6 minutes for mathlib, which I guess is acceptable. (but it takes 5 seconds in total if I just set the size to 1)

I'll try the .ilean. Do you know if there's an intended way to read them? I guess finding them in .lake and parsing as JSON is somewhat hacky...

Damiano Testa (Sep 02 2024 at 11:04):

If you want the count for all modules, maybe you can setup a HashMap Name Nat that loops through the declarations once and adds them where they should go?

Damiano Testa (Sep 02 2024 at 11:30):

For instance, this takes a few seconds on the server, although I have not tested it very much:

import Mathlib

open Lean Meta

elab "#tally_decls" : command => do
  let mut invert : HashMap Nat Name := {}
  let env  getEnv
  for e in env.allImportedModuleNames do
    invert := invert.insert ((env.getModuleIdx? e).getD default) e
  let mut tots : HashMap Nat Nat := {}
  for (c, mod) in env.const2ModIdx.toArray do
    if ! ( c.isBlackListed) then
      let curr := (tots.find? mod).getD 0
      tots := tots.insert mod (curr + 1)
  logInfo m!"{tots.toArray.map fun (idx, tot) => ((invert.find? idx).getD default, tot)}"

#tally_decls
/-
[(Init.Prelude, 864),
 (Init.Coe, 63),
 (Init.Notation, 126),
 (Init.Tactics, 167),
 (Init.SizeOf, 75),
 (Init.Core, 494),
 (Init.SimpLemmas, 136),
 (Init.Data.Nat.Basic, 332),
 (Init.WF, 67),
 (Init.MetaTypes, 97),
 (Init.WFTactics, 5),
 (Init.Data.Nat.Div, 59),
 (Init.Data.List.Notation, 2),
 (Init.Data.List.Basic, 290),
 (Init.Data.Nat.Bitwise.Basic, 18),
 (Init.Data.Fin.Basic, 51),
 (Init.Data.UInt.Basic, 197),
 (Init.Data.Char.Basic, 23),
...
(Mathlib.Topology.Sheaves.SheafCondition.EqualizerProducts, 51),
 (Mathlib.Topology.Sheaves.Sheafify, 8),
 (Mathlib.Topology.Sheaves.Skyscraper, 39),
 (Mathlib.Topology.UniformSpace.AbsoluteValue, 2),
 (Mathlib.Topology.UniformSpace.CompareReals, 13),
 (Mathlib.Util.AssertNoSorry, 1),
 (Mathlib.Util.Export, 54),
 (Mathlib.Util.FormatTable, 12),
 (Mathlib.Util.GetAllModules, 2),
 (Mathlib.Util.IncludeStr, 1),
 (Mathlib.Util.LongNames, 3),
 (Mathlib.Util.SleepHeartbeats, 2),
 (Mathlib.Util.TermBeta, 2),
 (Mathlib.Util.Time, 2)]
-/

Jon Eugster (Sep 02 2024 at 11:32):

Damiano Testa said:

If you want the count for all modules, maybe you can setup a HashMap Name Nat that loops through the declarations once and adds them where they should go?

actually true, I'm looping way to often! Thanks!

Damiano Testa (Sep 02 2024 at 11:35):

Btw, I find the dance around ModuleIdx and Name very annoying: I never remember which way the functions go and do not really know why it is so hard to get a human-readable name out of a ModuleIdx!

Jon Eugster (Sep 02 2024 at 11:56):

Ah wow, down to 7 seconds :star_struck:

I did this now:

def getNumberOfDeclsPerFile (env: Environment) : NameMap Nat :=
  env.const2ModIdx.fold (fun acc n idx =>
    let mod := env.allImportedModuleNames.get! idx
    if isBlackListed env n then acc else acc.insert mod ((acc.findD mod 0) + 1)
    ) {}

from your code I assume env.allImportedModuleNames.get! idx should be how one could retrieve the module name, is that right?

Jon Eugster (Sep 02 2024 at 11:58):

Because this is how it's defined under the hood:

def getModuleIdx? (env : Environment) (moduleName : Name) : Option ModuleIdx :=
  env.header.moduleNames.findIdx? (· == moduleName)

Damiano Testa (Sep 02 2024 at 11:58):

I do not know whether the order of the ModuleIdx is the same as the one in which they are stored in allImportedModuleNames. It would make sense, but I do not know.

Jon Eugster (Sep 02 2024 at 11:59):

ah yes, let me double-check these definitions

Jon Eugster (Sep 02 2024 at 11:59):

def allImportedModuleNames (env : Environment) : Array Name :=
  env.header.moduleNames

Damiano Testa (Sep 02 2024 at 12:00):

I typed my comment, while you sent your message with the definition of getModuleIdx?, so do not take my comment as a reply to your message just above it, but to the one before that!

Damiano Testa (Sep 02 2024 at 12:00):

I.e., I still do not know what is the order, since I have not yet fully parsed what the functions do! :smile:

Damiano Testa (Sep 02 2024 at 12:04):

Ok, I have not tested it, but you are probably right: it seems that moduleNames is an array and the ModuleIdx is the index within that array of the module name.

Damiano Testa (Sep 02 2024 at 12:04):

So, then also my command above can be shortened, since invert was just a way of doing what you suggested with get!ing the Array entry.

Jon Eugster (Sep 02 2024 at 12:05):

Yep, the messages are a bit out of order. So to restate this, it looks like env.header.moduleNames is just an Array Name, env.allImportedModuleNames is just a wrapper for the same array and env.getModuleIdx? `MyModule seems to just use Array.findIdx? to find the index in this array, which is then the ModuleIdx (as def ModuleIdx := Nat)

Damiano Testa (Sep 02 2024 at 12:06):

I agree! Maybe test it on a couple of modules, just to make sure, but I am convinced that this is what it is.

Damiano Testa (Sep 02 2024 at 12:06):

I was just worried that these arrays were obtained after storing their entries in a HashSet or something, forgetting any ordering, but it seems like this is not the case.

Jon Eugster (Sep 02 2024 at 12:15):

Btw, here is the result: https://github.com/leanprover-community/import-graph/pull/30

Damiano Testa (Sep 02 2024 at 12:18):

Very cool!

Damiano Testa (Sep 02 2024 at 12:18):

I wonder whether we should add the graph to the summary of every PR. :lol:

Jon Eugster (Sep 02 2024 at 12:57):

... and at a visual-based linter to test the yarn ball isn't messier than previously :wink:


Last updated: May 02 2025 at 03:31 UTC