Zulip Chat Archive

Stream: lean4

Topic: lake show dependencies


Eric Wieser (May 01 2023 at 10:36):

Is there a way to ask lake to show the dependencies of a file without building that file and everything below it from scratch?

Alex J. Best (May 01 2023 at 11:36):

There is the builtin

lake env lean --deps Mathlib/Data/Nat/Basic.lean

but this only shows immediate dependencies (I'm also not sure if it builds as everything is already built for me)

Eric Wieser (May 01 2023 at 12:18):

Starting from everything already being built is fine for me

Eric Wieser (May 01 2023 at 12:19):

I only mentioned a clean build because from that you get all the filenames from the progress messages

Eric Wieser (May 01 2023 at 15:35):

I tried calling lean --deps recursively, but it's prohibitively slow:

Henrik Böving (May 01 2023 at 15:37):

As I understand it lake as a build system is inherently monadic. That is the dependency graph is dynamic to the point where you cannot know it without executing it. So I would say generally speaking it is impossible to figure this out without executing some part of the build system. But it might be possible in the case where the full power of the monadic part is not used?

Mario Carneiro (May 01 2023 at 15:38):

I'm pretty sure it stages that at least somewhat, i.e. you can get the deps without actually building everything

Henrik Böving (May 01 2023 at 15:40):

Hmmm, I mean I can for example define a facet that branches on the execution of some IO operation and based on that decides to add something as a dependency or not. How are you going to figure that out?

Mario Carneiro (May 01 2023 at 15:40):

You run the IO operation

Eric Wieser (May 01 2023 at 15:41):

I'm very happy to extract the dependency graph from build artifacts

Mario Carneiro (May 01 2023 at 15:41):

I'm not saying that it's a pure function (I mean obviously it's not pure since you have to at the very least read some .lean files), but it doesn't require calling lean on all those files

Eric Wieser (May 01 2023 at 15:42):

I just don't want to have to build all of mathlib from scratch. If lake exe cache get downloads what I need to get the data through Lake, then that's fine for me

Henrik Böving (May 01 2023 at 15:42):

Eric Wieser said:

I'm very happy to extract the dependency graph from build artifacts

If you figure out how to parse the .trace files that sounds doable

Eric Wieser (May 01 2023 at 15:43):

Surely Lake already has all the information I need internally?

Henrik Böving (May 01 2023 at 15:43):

Yes the deps of a file are stored in the .trace file accompanying it as I understand

Henrik Böving (May 01 2023 at 15:44):

I just dont know what the trace file means...but in e.g. the doc gen facet there is a lot of "mixing of traces" to basically delacre a dependency

Henrik Böving (May 01 2023 at 15:45):

A trace is a piece of data (generally a hash) which is used to verify whether a given target is up-to-date. If the trace stored with a built target matches the trace computed during build, then a target is considered up-to-date. A target's trace is derived from its various inputs (e.g., source file, Lean toolchain, imports, etc.).

ah well, almost /o\

Eric Wieser (May 01 2023 at 15:48):

It sounds like you're telling me that lake never actually constructs the transitive build graph

Mario Carneiro (May 01 2023 at 15:49):

I wouldn't be surprised if it didn't, IIRC it is somewhat expensive

Mario Carneiro (May 01 2023 at 15:49):

Although honestly it feels like I could do pretty well on this problem with a regex parser and no lean calls at all

Eric Wieser (May 01 2023 at 15:50):

I think that's what leanproject does, but it's pretty silly that lean --deps is so slow that it's easier to use regex

Mario Carneiro (May 01 2023 at 15:50):

it doesn't feel like it should be a computationally expensive task

Eric Wieser (May 01 2023 at 15:50):

I imagine lake env adds some overhead?

Mario Carneiro (May 01 2023 at 15:50):

I don't think it is that much, and it is a one time cost

Eric Wieser (May 01 2023 at 15:51):

Not in my script above

Eric Wieser (May 01 2023 at 15:51):

I guess I should run my entire script through lake env?

Eric Wieser (May 01 2023 at 15:51):

Either way, it only takes 4 minutes on the file I care about, and I only need to run it twice

Eric Wieser (May 01 2023 at 15:59):

Update: the overhead is almost entirely lake env

Eric Wieser (May 01 2023 at 15:59):

The script goes from 4 minutes to 20s.

Mario Carneiro (May 01 2023 at 16:03):

this is why we can't have nice things

Mario Carneiro (May 01 2023 at 16:07):

Looking at the code, it is moderately surprising to me (until I give it a bit of thought) that the lean entry point (and what parses the --deps argument) is part of the C++ part. Has anyone tried to write a lean frontend in lean itself?

Mario Carneiro (May 01 2023 at 16:08):

it's doing some stuff like loading shared libraries that make sense for leanc but I don't think would be necessary if you just want to run the initial part of the lean module parser

Mario Carneiro (May 01 2023 at 16:10):

oh fun, there is a thing that parses #lang lean4 at the start of files in here. I don't think that works...

Henrik Böving (May 01 2023 at 16:11):

Works anymore! iirc it was used for the switch between the old and the new language frontend back in the bootstrapping days

Mario Carneiro (May 01 2023 at 16:24):

Here's a demo: Make a new lake project deps with this:

import Lean

open Lean Parser
def main (args : List String) : IO Unit := do
  let some (path : String) := args[0]? | throw (.userError "usage: deps <FILE>")
  let input  IO.FS.readFile path
  let inputCtx := mkInputContext input path
  let (stx, _, _)  parseHeader inputCtx
  for x in stx[1].getArgs do
    IO.println s!"{x.getIdAt 2}"

result:

$ time build/bin/deps ../mathlib4/Mathlib/AlgebraicTopology/MooreComplex.lean
Mathlib.Algebra.Homology.HomologicalComplex
Mathlib.AlgebraicTopology.SimplicialObject
Mathlib.CategoryTheory.Abelian.Basic

________________________________________________________
Executed in   29.27 millis    fish           external
   usr time   25.79 millis  271.00 micros   25.52 millis
   sys time    0.04 millis   42.00 micros    0.00 millis

Mario Carneiro (May 01 2023 at 16:26):

compare:

$ time lake env lean --deps Mathlib/AlgebraicTopology/MooreComplex.lean
/home/mario/.elan/toolchains/leanprover--lean4---nightly-2023-04-11/lib/lean/Init.olean
./build/lib/Mathlib/Algebra/Homology/HomologicalComplex.olean
./build/lib/Mathlib/AlgebraicTopology/SimplicialObject.olean
./build/lib/Mathlib/CategoryTheory/Abelian/Basic.olean

________________________________________________________
Executed in  349.30 millis    fish           external
   usr time  275.22 millis  320.00 micros  274.90 millis
   sys time   76.69 millis   44.00 micros   76.65 millis

Mario Carneiro (May 01 2023 at 16:27):

This of course isn't doing anything about transitive dependencies, but it shouldn't be too hard to bolt that on similar to the python implementation

Mario Carneiro (May 01 2023 at 16:46):

Okay here's a version that does transitive dependencies (within mathlib):

import Lean

open Lean Parser

def main (args : List String) : IO Unit := do
  let root :: mod :: _ := args | throw (.userError "usage: deps <ROOT> <MODULE>")
  let mut visited : NameSet := {}
  let mut to_visit : Array Name := #[mod.toName]
  repeat do
    let mut next_to_visit : Array Name := #[]
    for p in to_visit do
      if p.getRoot == `Mathlib then
        let path := modToFilePath root p "lean"
        let input  IO.FS.readFile path
        let inputCtx := mkInputContext input path.toString
        let (stx, _, _)  parseHeader inputCtx
        for c in stx[1].getArgs do
          let c := c.getIdAt 2
          if !visited.contains c then
            next_to_visit := next_to_visit.push c
        visited := visited.insert p
    to_visit := next_to_visit
    if to_visit.isEmpty then
      break
  for x in visited do
    IO.println s!"{x}"

Mario Carneiro (May 01 2023 at 16:48):

$ time build/bin/deps ../mathlib4 Mathlib.AlgebraicTopology.MooreComplex

Mathlib.Algebra.GroupPower.Ring
Mathlib.Data.Sigma.Basic
Mathlib.CategoryTheory.Limits.Preserves.Limits
...
Mathlib.Data.Rat.Defs
Mathlib.Data.Fintype.Powerset
Mathlib.CategoryTheory.Preadditive.Basic
Mathlib.Init.Propext
Mathlib.CategoryTheory.Limits.Shapes.RegularMono
Mathlib.Data.List.NodupEquivFin
Mathlib.Tactic.SimpRw
Mathlib.Data.Set.NAry
Mathlib.Init.Algebra.Classes
Mathlib.Tactic.SplitIfs
________________________________________________________
Executed in  423.00 millis    fish           external
   usr time  391.22 millis  640.00 micros  390.58 millis
   sys time   31.88 millis    0.00 micros   31.88 millis

Mario Carneiro (May 01 2023 at 16:49):

it's 1.48s to do the transitive closure of Mathlib

Eric Wieser (May 01 2023 at 17:12):

Does this print them in the order they populate the environment?

Mario Carneiro (May 01 2023 at 17:13):

no, they are I think sorted by Name.quickLt

Mario Carneiro (May 01 2023 at 17:13):

which is basically random

Mario Carneiro (May 01 2023 at 17:14):

if you use RBTree Name Name.cmp instead of NameSet for the type of visited they come out in alphabetical order

Mario Carneiro (May 01 2023 at 17:15):

uh... not alphabetical order, looks like length then alphabetical order by component

Mario Carneiro (May 01 2023 at 17:15):

meh, just pipe it through sort

Mario Carneiro (May 01 2023 at 17:31):

Hm, it looks like there were some bugs in the DFS code I borrowed from the python script which were causing it to revisit files multiple times. This version prints the files out in preorder (postorder is a bit more difficult with this work list implementation), if you care about maintaining a topological sort, and it takes only half the time, 574ms for Mathlib:

def main (args : List String) : IO Unit := do
  let root :: mod :: _ := args | throw (.userError "usage: deps <ROOT> <MODULE>")
  let mut out : String := ""
  let mut visited : NameSet := {}
  let mut toVisit : Array Name := #[mod.toName]
  while !toVisit.isEmpty do
    let mut nextToVisit : Array Name := #[]
    for p in toVisit do
      if !visited.contains p then
        let path := modToFilePath root p "lean"
        let input  IO.FS.readFile path
        let inputCtx := mkInputContext input path.toString
        let (stx, _, _)  parseHeader inputCtx
        for c in stx[1].getArgs do
          let c := c.getIdAt 2
          if c.getRoot == `Mathlib then
            nextToVisit := nextToVisit.push c
        visited := visited.insert p
        out := out ++ s!"{p}\n"
    toVisit := nextToVisit
  IO.print out

Eric Wieser (May 01 2023 at 17:38):

Does this belong in lake itself as a helper? Or would it go in Std like cache did?

Mario Carneiro (May 01 2023 at 17:42):

I think it could just go in whatever script is using it. A lot of the speed boosts it is getting come from it hardcoding various properties of the use case and not being as robust as lake

Mario Carneiro (May 01 2023 at 17:42):

I mean, it's basically the same length as your original python script

Mario Carneiro (May 01 2023 at 17:44):

We could add some parts of this to std and/or lake to make the script even shorter, but it would have to get quite a bit more complex and probably slower if it was upstreamed wholesale

Mario Carneiro (May 01 2023 at 17:47):

There is also another version of this program that could be written, which reads the .olean files instead of the .lean files. This is likely even faster, since the oleans actually contain transitive dependencies (they don't, but they do contain the parsed imports)

Mario Carneiro (May 01 2023 at 17:48):

this is a bit closer to what lean itself does

Mario Carneiro (May 01 2023 at 17:49):

Actually I'm not sure if it would be faster, 500ms is hard to beat and I think even just loading the oleans for mathlib already takes that long (maybe it is also paying for initialization though)

Mario Carneiro (May 01 2023 at 17:57):

wowza, 88ms for Mathlib:

import Lean

open Lean Parser

def main (args : List String) : IO Unit := do
  let root :: mod :: _ := args | throw (.userError "usage: deps <ROOT> <MODULE>")
  let mut out : Array String := #[]
  let mut visited : NameSet := {}
  let mut toVisit : Array Name := #[mod.toName]
  while !toVisit.isEmpty do
    let mut nextToVisit : Array Name := #[]
    for p in toVisit do
      if !visited.contains p then
        let path := modToFilePath root p "olean"
        let (mod, _)  readModuleData path
        nextToVisit := mod.imports.foldr (init := nextToVisit) fun i nextToVisit =>
          let c := i.module
          if c.getRoot == `Mathlib then nextToVisit.push c
          else nextToVisit
        visited := visited.insert p
        out := out.push s!"{p}"
    toVisit := nextToVisit
  out.forRevM IO.println
$ time build/bin/deps ../mathlib4/build/lib/ Mathlib
...
Mathlib.Util.AssertNoSorry
Mathlib.Util.AtomM
Mathlib.Util.Export
Mathlib.Util.IncludeStr
Mathlib.Util.LongNames
Mathlib.Util.MemoFix
Mathlib.Util.Syntax
Mathlib.Util.SynthesizeUsing
Mathlib.Util.Tactic
Mathlib.Util.Time
Mathlib.Util.WhatsNew
Mathlib.Util.WithWeakNamespace
Mathlib

________________________________________________________
Executed in   88.77 millis    fish           external
   usr time   40.51 millis  781.00 micros   39.73 millis
   sys time   47.85 millis  175.00 micros   47.67 millis

Mac Malone (May 07 2023 at 16:30):

Mario Carneiro said:

I wouldn't be surprised if it didn't, IIRC it is somewhat expensive

Lake does compute dependencies and it does it in a few different ways. The direct dependencies are available in the deps facet and the transitive dependencies in transDeps. However, using them requires a script; they are not available from the CLI (yet).


Last updated: Dec 20 2023 at 11:08 UTC