Zulip Chat Archive
Stream: new members
Topic: Exploring Mathlib
Michael Gehlke (Oct 09 2025 at 07:49):
Hi -- I'm new here and exploring how Mathlib is structured, out of personal interest in how mathematics (or at least, this definition of it) is composed/layered. To get started, I exported the import graph from the Mathlib/ directory so I could study it in Python -- in this case, just looking at the depth of imports within that directory. Quite a lot! And perhaps unsurprisingly, clustered around the middle-depths.
I'm curious if anyone has a dependency analysis they'd like to see, any thoughts of what I could explore besides the dependency graph (eg, maybe typeclasses?), or suggestions on how to improve my (highly suspect) Lean code.
I don't plan to open a PR, since I doubt anyone else would find this useful -- but code viewable here:
https://github.com/zmgehlke/mathlib4/tree/master/scripts/visualizer
Kevin Buzzard (Oct 09 2025 at 08:03):
Whilst I can't give you any concrete links, I know that there are ways out there to visualise the import graph graphically, find the longest chain of imports etc etc; the community has already written such software (but I don't know how to use it or where to find it). Hopefully someone else will chime in.
Julian Berman (Oct 09 2025 at 08:44):
https://github.com/leanprover-community/mathlib4/actions/runs/18369898030 every build of the Mathlib repo builds an "import-graph" artifact which is a .dot file, so you can essentially download that and visualize it
Julian Berman (Oct 09 2025 at 08:44):
https://github.com/leanprover-community/import-graph is the underlying Lean project
Michael Gehlke (Oct 09 2025 at 09:08):
Neat! -- curiously, import-graph seems to require a build but my version didn't. Presumably because it's looking at something like the .olean files rather than .lean imports statements.
Michael Gehlke (Oct 09 2025 at 11:49):
I suspect I'm doing something wrong, because I'm getting order of magnitude differences between parsing files to export and using import-graph. Is there a way to compile import-graph or something I'm missing?
After having run lake build for mathlib4 (untimed), then using import-graph:
lake exe graph 9.92s user 5.38s system 29% cpu 51.273 total
Running a Python script that only exports with a compiled Lean helper and reimports to Python:
python3 scripts/visualizer/demo_mathlib_graph.py 2.28s user 0.29s system 62% cpu 4.143 total
Joseph Tooby-Smith (Oct 09 2025 at 12:04):
For PhysLean I (well, really, github copilot) recently made a java-script version of part of the implementation of import-graph. In particular it takes the '.dot' file output from import-graph which only has to be run once, and then lets you specify the source nodes and target nodes online, to make different graphs: e.g.
Michael Gehlke (Oct 10 2025 at 00:20):
That's cool!
My motivation is that I'm looking for ways to address the graph complexity by analyzing further structures and extract something meaningful -- eg, the everything graph for PhysLean highlights the problem:
https://physlean.com/Dependencies.html?sources=&targets=PhysLean
One way is by looking at patterns (such as the above histogram) which tell us how the graph clusters (eg, 55% of Mathlib modules cluster in the rank 40-80 range); another is using deeper graph analysis to look at topic clustering or highly utilized modules.
Michael Gehlke (Oct 10 2025 at 06:28):
Today I added a radial graph drawing where the nodes are colored by rank. Please excuse the faintness of the edges -- they completely wash out in the center if you turn it up higher.
What is interesting to me is two things:
- the spike module that has a high rank, but is deeply connected to the center; and,
- the banding where you have rings that are connected by towers.
My next questions are about what causes that in the repo (eg, identifying the spike module and looking at examples of bands vs towers).
Ruben Van de Velde (Oct 10 2025 at 06:40):
Is it Mathlib.lean?
Michael Gehlke (Oct 10 2025 at 06:48):
I believe that Mathlib.lean is excluded because I start parsing in the Mathlib/directory (rather than project root). I'm also missing stuff like Archive/, Cache/, Counterexamples/, etc.
So it's something that's within that directory (and subdirectories) that depends on at least 1 high ranked file (there's a faint edge going up and to the right to another yellow node), yet has strong links to lower rank content.
Michael Gehlke (Oct 12 2025 at 13:52):
After investigating, it seems like the spike is Mathlib/Tactics.lean; which makes sense for both why it has such a high number of imports and a link to many low rank modules.
Next up, figuring out why the density rings form!
Daniel Morrison (Oct 14 2025 at 02:10):
I was also looking at something similar for PhysLean, you can see the code I wrote here, and some discussion of tests I ran here.
It reads the import graph in .xdot_json format from import-graphinto a dataframe and finds some basic information:
- what I called "subject" is the folder the file is in which is directly under the main folder
- List of direct imports and dependents
- List of transitive imports and dependents (files that are imported/dependents indirectly by some intermediate files)
My reason for doing this is I was/am interested in seeing if there's a way to see which files are a good place to start learning PhysLean/Mathlib/other projects. The issue being if you're just starting you maybe don't want to all the way back to the beginning but you also can't skip all the background files. My crude measure of what I called "importance" which is the fraction of files transitively imported multiplied by the fraction of files which are transitively dependents. The files in PhysLean with the largest "importance" were almost all .Basic files of the more filled out folders in PhysLean so it seemed to be doing something right.
I had hoped to run this on Mathlib but was having trouble getting the import-graph to generate on my computer and then never got around to it. I would be interested in seeing how it does on a significantly larger project like mathlib.
Michael Gehlke (Oct 16 2025 at 00:18):
I'll have to give your example a look, sounds like we're working on similar ideas.
Daniel Morrison said:
The issue being if you're just starting you maybe don't want to all the way back to the beginning but you also can't skip all the background files. My crude measure of what I called "importance" which is the fraction of files transitively imported multiplied by the fraction of files which are transitively dependents. The files in PhysLean with the largest "importance" were almost all .Basic files of the more filled out folders in PhysLean so it seemed to be doing something right.
Huh -- I expect something similar to happen in Mathlib. I think the biggest outcome of my preliminary work is that it really looks like an onion: there's some really core things (basic types, tactics, etc); there's a heavy shell (rank 40 to 80) with over half the modules; and there's a sprawl (including an outer ring of activity).
Naively, this confirms your framing: we don't really need to chase all the way down (or up) for most mathematics -- and accordingly, people would probably be happier ignoring those files at first. Next up I'll be exploring more structure than just the import graph, but one last image highlighted by histogram.
rank_histogram_by_count_v2.png
Last updated: Dec 20 2025 at 21:32 UTC