Zulip Chat Archive

Stream: lean4

Topic: list of files that build


Johan Commelin (Mar 15 2024 at 21:53):

How can I ask lean (or lake) for a list of files that build without errors?
I'm refactoring mathlib, and would like to benchmark the part that builds against the corresponding part on master

Kim Morrison (Mar 15 2024 at 21:53):

@Mac Malone? As far as I'm aware this isn't currently possible.

Mac Malone (Mar 15 2024 at 21:57):

Unfortunately, this is not information available from the Lake CLI. :sad: Lake can get this information, so if you are comfortable wriritng a Lean program that uses the API you could get it. But that may be more hassle than it is worth.

Alex J. Best (Mar 16 2024 at 00:56):

Could you just do some shell scripting to run lake build on each file in turn and print the list of those with 0 exit code (maybe chatgpt could write this in an instant?)

Julian Berman (Mar 16 2024 at 01:19):

That will re-build every broken file with n dependents n times right?

Mac Malone (Mar 16 2024 at 02:00):

Also, you could potentially deduce which files failed from the output of lake build and then subtract those from a list of all files in mathlib to get the set of files which build.

Johan Commelin (Mar 16 2024 at 06:35):

Together with copilot I came up with the following script

import os
import sys
import subprocess
import networkx as nx
from collections import defaultdict
import time

# Traverse the repository to find all .lean files and parse their import statements
dependencies = defaultdict(list)
for root, dirs, files in os.walk('./Mathlib'):
    for file in files:
        if file.endswith('.lean'):
            filepath = os.path.join(root, file)
            index = filepath[2:].replace('/', '.').replace('.lean', '')
            with open(filepath, 'r') as f:
                for line in f:
                    if line.startswith('import '):
                        dependencies[index].append(line.split(' ')[1].strip())
                    elif line.startswith('/-!'):
                        break

# Create a graph from the dependencies
G = nx.DiGraph(dependencies)

# Perform a topological sort on the graph
try:
    sorted_files = [file for file in nx.topological_sort(G) if file.startswith('Mathlib')]
    # sorted_files.reverse()
except nx.NetworkXUnfeasible:
    print("Graph contains a cycle:", file=sys.stderr)
    cycle = nx.simple_cycles(G)
    print(next(cycle))
    exit()

# Attempt to build each file in the sorted order
successful_files = []
for file in sorted_files:
  # Skip building the file if any of its dependencies weren't successfully built
  if any(dependency not in successful_files for dependency in G.predecessors(file)):
    continue
  try:
    print("Attempting to build", file, file=sys.stderr)
    subprocess.check_call(['lake', 'build', file], stdout=sys.stderr, stderr=sys.stderr)
    successful_files.append(file)
    print(file)
  except subprocess.CalledProcessError:
    pass

Johan Commelin (Mar 16 2024 at 06:36):

It doesn't seem to work. It still starts building all of Mathlib, after iterating through the first ~10 items of sorted_files.

Johan Commelin (Mar 16 2024 at 06:36):

It's almost as if that topological sort didn't really work

Damiano Testa (Mar 16 2024 at 06:37):

Is the idea that you want to build Mathlib and then find this out, or you would like to do this with the data of the oleans?

Damiano Testa (Mar 16 2024 at 06:38):

You could see what lake exe cache get downloads and the missing oleans (or the leaves that you download that are not actual leaves) would be the broken files, right?

Johan Commelin (Mar 16 2024 at 06:46):

@Damiano Testa I don't really care about the method. I have a copy of mathlib in which ~500 files build without error. I would like to get the explicit list of those filenames.

Johan Commelin (Mar 16 2024 at 06:46):

If this can be done in < 1 minute, that would be really cool.

Mario Carneiro (Mar 16 2024 at 06:47):

if you are willing to rebuild from scratch you can just clear the build files and then see what files exist when lake is done

Johan Commelin (Mar 16 2024 at 06:47):

But I can of course resort to "throw away all .oleans, attempt a build, check which oleans have been succesfully written"

Johan Commelin (Mar 16 2024 at 06:47):

Yes, I'm starting to consider that. But it feels wasteful. And it means I can't iterate often.

Mario Carneiro (Mar 16 2024 at 06:47):

well as long as it only grows monotonically you can reuse the results

Mario Carneiro (Mar 16 2024 at 06:51):

what order do you get when running the above script?

Johan Commelin (Mar 16 2024 at 06:59):

This is the first 20 files:

['Mathlib.Util.Superscript', 'Mathlib.Util.Time', 'Mathlib.Util.AssertNoSorry', 'Mathlib.Util.SleepHeartbeats', 'Mathlib.Util.LongNames', 'Mathlib.Util.AddRelatedDecl', 'Mathlib.Util.IncludeStr', 'Mathlib.Util.Export', 'Mathlib.Util.TermBeta', 'Mathlib.Data.List.Func', 'Mathlib.Data.List.ProdSigma', 'Mathlib.Data.List.ToFinsupp', 'Mathlib.Data.List.Intervals', 'Mathlib.Data.List.AList', 'Mathlib.Data.List.Sections', 'Mathlib.Data.List.Lemmas', 'Mathlib.Data.List.DropRight', 'Mathlib.Data.List.Card', 'Mathlib.Data.List.Sym', 'Mathlib.Data.List.Destutter']

Johan Commelin (Mar 16 2024 at 06:59):

Mario Carneiro said:

well as long as it only grows monotonically you can reuse the results

Right, but as is often the case with refactors, I regularly jump back to fix things lower in the hierarchy.

Mario Carneiro (Mar 16 2024 at 07:12):

it was looking good until Mathlib.Data.List.Func

Mario Carneiro (Mar 16 2024 at 07:12):

I suspect that all of the ones that came before it have no mathlib imports at all

Mario Carneiro (Mar 16 2024 at 07:12):

so it could also be an issue in the import parsing

Johan Commelin (Mar 16 2024 at 07:26):

Yeah, but I don't see the bug

Johan Commelin (Mar 16 2024 at 07:27):

Also:

>>> print(dependencies['Mathlib.Data.List.Func'])
['Mathlib.Data.Nat.Order.Basic']

Mario Carneiro (Mar 16 2024 at 07:35):

and print(dependencies['Mathlib.Data.Nat.Order.Basic']) is also something?

Johan Commelin (Mar 16 2024 at 07:41):

Yes

['Mathlib.Algebra.Order.Ring.Canonical', 'Mathlib.Data.Nat.Basic', 'Mathlib.Init.Data.Nat.Bitwise']

Mario Carneiro (Mar 16 2024 at 07:50):

Oh I think I see the issue

if any(dependency not in successful_files for dependency in G.predecessors(file)):

this will return true for any file which has a non-mathlib dependency (recall that non-mathlib files are in the graph G)

Johan Commelin (Mar 16 2024 at 07:51):

Ooh, good catch!

Johan Commelin (Mar 16 2024 at 08:34):

This is the version of the script that I'm currently using:

import os
import sys
import subprocess
import networkx as nx
from collections import defaultdict
import time

# Traverse the repository to find all .lean files and parse their import statements
dependencies = defaultdict(list)
for root, dirs, files in os.walk('./Mathlib'):
    for file in files:
        if file.endswith('.lean'):
            filepath = os.path.join(root, file)
            index = filepath[2:].replace('/', '.').replace('.lean', '')
            with open(filepath, 'r') as f:
                for line in f:
                    if line.startswith('import '):
                        dep = line.split(' ')[1].strip()
                        if dep.startswith('Mathlib'):
                            dependencies[index].append(dep)
                    elif line.startswith('/-!'):
                        break

# Create a graph from the dependencies
G = nx.DiGraph(dependencies)

# Perform a topological sort on the graph
try:
    sorted_files = [file for file in nx.topological_sort(G) if file.startswith('Mathlib')]
    sorted_files.reverse()
except nx.NetworkXUnfeasible:
    print("Graph contains a cycle:", file=sys.stderr)
    cycle = nx.simple_cycles(G)
    print(next(cycle))
    exit()

# Attempt to build each file in the sorted order
successful_files = []
for i, file in enumerate(sorted_files):
  # Skip building the file if any of its dependencies weren't successfully built
  if any(dependency not in successful_files for dependency in dependencies[file]):
    continue
  try:
    # print("Attempting to build", file, file=sys.stderr)
    subprocess.check_call(['lake', 'build', file], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    successful_files.append(file)
    print(file)
  except subprocess.CalledProcessError:
    pass

Damiano Testa (Mar 16 2024 at 09:22):

More generally, I have often wanted some tooling for finding out future dependencies of current declarations.

Even just something as simple as scanning ileans, looking for usages of a specific declaration would be beneficial for refactors and also for giving better automation for breaking changes.

Marc Huisinga (Mar 18 2024 at 08:28):

Damiano Testa said:

More generally, I have often wanted some tooling for finding out future dependencies of current declarations.

Even just something as simple as scanning ileans, looking for usages of a specific declaration would be beneficial for refactors and also for giving better automation for breaking changes.

Does the call hierarchy do some of what you want? (Right click on an identifier and select "Show call hierarchy")

Mario Carneiro (Mar 18 2024 at 16:18):

you can also use the "go to definition" action on a definition, where it means "find references" instead

Damiano Testa (Mar 19 2024 at 08:21):

@Marc Huisinga and @Mario Carneiro I knew of neither of these functionalities and they are both excellent, thanks!

Marc Huisinga (Mar 19 2024 at 08:40):

That's what the .ilean files were added for originally :-)

Damiano Testa (Mar 19 2024 at 08:44):

Thanks for the further information! Btw, is there a description of what the ileans encode now? A few weeks ago, they simply had info like (roughly) "c:decl_name:usages{[coordinates of begin end usages],definition[coordinates of definition, if available]]}".

What do they contain now?

Marc Huisinga (Mar 19 2024 at 08:49):

The corresponding Lean type for .ilean files is at Lean.Server.Ilean in Lean/Server/References.lean.
It's still a map from an identifier (on master this is a combination of a module name and an identifier name to disambiguate conflicting identifier names from separate modules) to its definition / usages, where both definition and usages now also store information about the parent declaration of each reference.

The JSON format has changed a bit, though.

Damiano Testa (Mar 19 2024 at 08:51):

Ok, when I'll try to use it, I will try to avoid looking at the ilean explicitly!

Btw, I often would like to have information about where each declaration begins/ends. For instance, something that would also take into account doc-strings and attributes for lemmas. Is this available as well?

Damiano Testa (Mar 19 2024 at 08:59):

E.g. for

/-- I am a docstring.
Spread over several lines -/
@[simp]
lemma name {n : Nat} : n + 0 = n := by
  have := 0
  simp

it would begin at the /-- and end at simp.


Last updated: May 02 2025 at 03:31 UTC