Zulip Chat Archive
Stream: general
Topic: Why is doc-gen4 so slow?
Hanson Char (Jun 06 2025 at 20:46):
Every time I try to regenerate doc (using doc-gen4) for my project via e.g. lake build Myproject:docs it takes a long time apparently related to the dependency on mathlib4. Is there a way to make this faster?
Kevin Buzzard (Jun 06 2025 at 21:56):
I just try and import as little of mathlib as possible (indeed I've even recently started linting against import Mathlib and I've got shake working on my project).
Eric Wieser (Jun 06 2025 at 21:57):
Once you've built it once you can just reuse the same cache
Kevin Buzzard (Jun 06 2025 at 21:57):
...except when you bump mathlib (which every good project should be doing at least once a week, right?)
Eric Wieser (Jun 06 2025 at 21:58):
Well, as long as you are also sleeping, you can bump, kick off a doc build, go to bed, and then wake up to a hot build cache
Eric Wieser (Jun 06 2025 at 21:59):
If you're bumping multiple times a day then probably you have more important things going on than local doc builds
Eric Wieser (Jun 06 2025 at 22:00):
If you really want live docs for your project, then a 2hr delay in CI every time you bump is probably acceptable; that's what we have in https://github.com/google-deepmind/formal-conjectures
Eric Wieser (Jun 06 2025 at 22:01):
(our bumps were doc-gen versions because of bugs around weird filenames!)
Hanson Char (Jun 06 2025 at 22:16):
I've got the project built and doc generated, and and I also have lake exe cache get. However, even without make any changes, if I re-run lake build Myproject:docs, I still need to wait like a minute with lots of warning messages about mathlib4. Is this expected behavior? For example,
$ lake build Nng4:docs
ℹ [125/542] Replayed «doc-gen4»/bibPrepass
info: stdout:
INFO: reference page disabled
ℹ [2945/2949] Replayed Mathlib.Tactic.Linter.FlexibleLinter:docs
info: stdout:
WARNING: Failed to calculate equational lemmata for Mathlib.Linter.Flexible.usesGoal?: (deterministic) timeout at `whnf`, maximum number of heartbeats (200000) has been reached
Use `set_option maxHeartbeats <num>` to set the limit.(invalid MessageData.lazy, missing context)
...
I'd think if no change is made to the project, the build process should be able to quickly detect that nothing needs to be done and complete in seconds, if not sub-seconds. But clearly not the case.
Eric Wieser (Jun 06 2025 at 22:26):
I wonder if it's faster if you tell it to be quiet; by default it has to retrieve and print all the old logs.
Eric Wieser (Jun 06 2025 at 22:27):
@Mac Malone may be able to advise.
Hanson Char (Jun 06 2025 at 22:58):
Interesting idea. I just tried
time lake -q build Nng4:docs
No warnings shown, but it takes about as long.
Notification Bot (Jun 06 2025 at 23:51):
7 messages were moved from this topic to #general > Syntax highlighting in doc-gen by Eric Wieser.
Mac Malone (Jun 07 2025 at 07:48):
Is anything being built? You can test this with lake build Nng4:docs --no-build (an error indicates something does want build). If nothing is being built, it is possible that simply verifying everything is update-to-date is sufficiently time-consuming. Mathlib and doc-gen4 touch a lot of files and Lake's analysis scales linearly in the number of files. If file I/O is not particular fast on system (e.g., Windows vs Linux), this can hurt Lake's performance.
Hanson Char (Jun 07 2025 at 14:32):
Mac Malone said:
Is anything being built? You can test this with
lake build Nng4:docs --no-build(an error indicates something does want build). If nothing is being built, it is possible that simply verifying everything is update-to-date is sufficiently time-consuming. Mathlib and doc-gen4 touch a lot of files and Lake's analysis scales linearly in the number of files. If file I/O is not particular fast on system (e.g., Windows vs Linux), this can hurt Lake's performance.
Whether --no-build is specified or not doesn't seem to make any difference — both take about a minute. It seems to me this is a problem specific to doc-gen4. lake build returns in like one second. I wonder why doc-gen4 touches files unnecessarily. This indicates a problematic design choice (or at least a performance bottleneck) in doc-gen4. It defeats the purpose of incremental builds, and is not sustainable as mathlib and its ecosystem continue to grow. We should probably open a Github issue for doc-gen4.
Hanson Char (Jun 07 2025 at 15:51):
Found out the reason why the doc gen was so slow. This fixes it:
export DOCGEN_SRC="file"
time lake build Nng4:docs
...
Build completed successfully.
real 0m1.099s
user 0m1.367s
sys 0m1.052s
Hanson Char (Jun 07 2025 at 15:54):
By default doc-gen4 assumes the URL's are in github, so I suppose it remote accesses the links and time outs multiple times. If so, then of course doc gen is slow. By changing the default to local file reference, it stopped the bleed. (I was wondering what those "heart beats" were about — apparently it's doing remote access!)
See here for more info.
In any case, interesting that doc-gen4 appears to make many default assumptions, like the version control system is git, the repository is github, etc. Make sense in a way, but surprising also to beginners like me :)
Hanson Char (Jun 07 2025 at 16:01):
Thinking out loud, perhaps it would be smarter for doc-gen4 to somehow detect that it's running in github workflow, or not. If it's running in github workflow, then the default should be remote URL to github pages. If not, then it should default to local file reference.
This would make it fast by default everywhere.
For instance,
if [ "$GITHUB_ACTIONS" == "true" ]; then
echo "Running inside GitHub Actions"
# default to remote URL's
else
echo "Running outside GitHub Actions"
# default to local file references
fi
Hanson Char (Jun 07 2025 at 16:06):
For now, I'll just create my own Makefile :)
Notification Bot (Jun 07 2025 at 16:13):
some.knowit has marked this topic as resolved.
Eric Wieser (Jun 07 2025 at 16:16):
so I suppose it remote accesses the links and time outs multiple times.
Do you have anything belong speculation to back this up?
Eric Wieser (Jun 07 2025 at 16:16):
I'm pretty sure this is just used to inform what style the urls output by doc-gen use
Eric Wieser (Jun 07 2025 at 16:17):
I think more likely is that it is invoking git remote -v or similar a large number of times (which still makes no network access)
Hanson Char (Jun 07 2025 at 16:36):
Good question. Indeed, I cannot detect any network connections by running the doc gen using the default settings. Yet it's so much faster once I changed it to local file reference. ~60x performance improvement.
Hanson Char (Jun 07 2025 at 16:56):
Indeed, the getGitRemoteUrl appears to fork a new subprocess per call which is expensive, thus slowing things down.
Hanson Char (Jun 07 2025 at 17:04):
This begs the question: why doesn't doc-gen4 query git remote once, and then reuse the result, instead of repeatedly querying the same thing over and over?
Eric Wieser (Jun 07 2025 at 17:15):
That, is the right question :)
Eric Wieser (Jun 07 2025 at 17:16):
The answer is that doc-gen was decomposed into tiny pieces so that lake can manage the build, and so has no way to "do things once" without lake managing this
Eric Wieser (Jun 07 2025 at 17:16):
So the fix will involve adding an extra step to the docgen4 lakefile that computes this and stashes it in a json / text file somewhere that the next step reads.
Hanson Char (Jun 07 2025 at 17:17):
Aha. Sounds like a good cause to open an issue for doc-gen4.
Hanson Char (Jun 07 2025 at 17:18):
Will benefit so many users instantly!
Eric Wieser (Jun 07 2025 at 17:19):
https://github.com/leanprover/doc-gen4/pull/296 is an easy partial fix
Eric Wieser (Jun 07 2025 at 17:19):
This still does it for every file, but does so as part of the cached step.
Hanson Char (Jun 07 2025 at 20:52):
Eric Wieser said:
This still does it for every file, but does so as part of the cached step.
From a minute to less than 3 seconds — good enough for me!
Eric Wieser (Jun 07 2025 at 21:11):
Well, the part that is still slow is when you're spending the ~1hr doing a fresh build from scratch; there my change makes no difference at all
Hanson Char (Jun 07 2025 at 21:13):
True. It took 23 minutes in my case. Still a torture. Would using the json intermediate file approach help with fresh builds? It doesn't seem it would iiuc.
Eric Wieser (Jun 07 2025 at 21:15):
If you want to find out, I recommend timing the line I modified vs the proc call below it, to see how much of that 23 minutes is from git and how much is doc-gen itself
Eric Wieser (Jun 07 2025 at 21:15):
It's probably also worth checking if doc-gen makes any git calls outside of the lakefile
Hanson Char (Jun 07 2025 at 21:32):
I see there are two places in doc-gen4 that fork child process via IO.Process.output, both in lakefile.lean. One to get the git remote url, and the other to get the git commit hash of the project. Can anything similar be done to the existing getProjectCommit to minimize the number of creating child processes?
In principle, I see git has a C API called Libgit2, and I presume Lean4 can call C via FFI; so it seems all possible to avoid the cost of process forking all together in doc-gen4.
Eric Wieser (Jun 07 2025 at 22:31):
I think this is the wrong direction entirely, let me try my fix...
Hanson Char (Jun 07 2025 at 22:58):
Meanwhile, I am also trying to measure the total time spent on the git commands during doc gen.
Eric Wieser (Jun 07 2025 at 23:03):
https://github.com/leanprover/doc-gen4/pull/298/ is my attempt, but I don't think it currently helps
Hanson Char (Jun 08 2025 at 01:35):
ok, with your PR-296 locally applied to v4.20.0 of doc-gen4, I instrumented doc-gen4 to measure the elapsed time of running getGitRemoteUrl and getProjectCommit. I ran a fresh build lake build Nng4:docs which took ~18min. A total of 234.199 seconds (or roughly 3.9min) were spent on the git commands, each taking an average of ~42ms. (A total of 5,616 such git commands.) So roughly about 21% of a fresh build doc is spent on these git commands forking child processes. (A sample size of only one. Quite time consuming to run.)
Eric Wieser (Jun 08 2025 at 01:44):
by ms did you mean minutes or milliseconds?
Hanson Char (Jun 08 2025 at 01:44):
min=minutes, ms=milliseconds
Hanson Char (Jun 08 2025 at 01:47):
sorry, many typos :). Should be fixed, most if not all.
Hanson Char (Jun 08 2025 at 02:52):
I also tried measuring the time spent on git commands on doc-gen4 v4.20.0 without PR-296, the result is similar. The whole build finished in 16.67min with 2.88min spent on git commands so about 17.28%. Same number of 5,616 git commands executed.
Another interesting observation is that without PR-296, the rebuild takes 58.115s with the same number of 5,616 git commands executed, but each taking much less time to complete. A total of 55.714s spent on these git commands. I wonder why the git commands execute so much faster upon a rebuild than a fresh build.
Hanson Char (Jun 25 2025 at 21:42):
I further tested the case of caching all the Git URIs for all the packages so the construction of each github URL to each source file is done via reading from the file system instead of forking a git command. The performance doesn't seem to have any major improvement. For instance, currently (without disk caching) it took 22m48s to do a cold build of docgen on the analysis project (that entails the following 9 packages), whereas with disk caching it took 21m52s. The profit of (~10%) seems perhaps too small to be worth the effort. The bottleneck seems somewhere else. I guess I could even hardcode the 9 URI's as a hack to see the performance (when there is absolutely no disk access for the URL's). That would absolutely confirm the bottleneck hypothesis.
- aesop
- Analysis
- batteries
- importGraph
- LeanSearchClient
- mathlib
- plausible
- proofwidgets
Henrik Böving (Jun 25 2025 at 22:10):
Have you tried to just run the warm build with a profiler and see what it's doing?
Hanson Char (Jun 25 2025 at 22:12):
No, I haven't since the warm build seems fast enough. But this sounds interesting. Will try it.
Henrik Böving (Jun 25 2025 at 22:14):
Oh I was under the impression that by "with disk caching" you mean a warm build?
Hanson Char (Jun 25 2025 at 22:15):
Warm build takes like 1min (or sometimes just 10s or so), so not paying much attention to it.
Hanson Char (Jun 25 2025 at 22:19):
Henrik Böving said:
Oh I was under the impression that by "with disk caching" you mean a warm build?
Just to further clarify I hacked the doc-gen4 lakefile.lean to cache each unique module github URI to the disk (via running a git command once), and then read back from the files afterwards to see if it makes a difference to the cold build performance.
Henrik Böving (Jun 25 2025 at 22:21):
I see, a cold build performance of approximately 17 minutes on a beefy server machine is totally expected right now, last time I looked it was just that doing all of the pretty printing necessary to render equations takes some time really, that should be easily visible in a profiler.
Hanson Char (Jun 26 2025 at 01:29):
Hanson Char said:
I guess I could even hardcode the 9 URIs as a hack to see the performance (when there is absolutely no disk access for the URL's). That would absolutely confirm the bottleneck hypothesis.
The result of that experiment turned out to be a little surprising. A cold build dropped from 22min48s to 15min47s — about 30% reduction. In other words, the git commands accounted for roughly 30% of the total overhead during a cold build. That seems like low-hanging fruit and a strong incentive to optimize.
Eric Wieser (Jun 26 2025 at 09:09):
Is that with or without my previous patch?
Hanson Char (Jun 26 2025 at 16:24):
with. Specifically, it's based off the latest main as you can see here.
However, I don't think your previous patch makes much perf difference on cold build.
Eric Wieser (Jun 26 2025 at 18:43):
https://github.com/leanprover/doc-gen4/pull/298/ is not yet in main
Hanson Char (Jun 26 2025 at 19:08):
ok, then without :) However, your patch doesn't seem to have much impact on cold build so I believe the optimization opportunity stands.
Pietro Monticone (Jul 02 2025 at 14:11):
I haven’t had time to investigate thoroughly, but it looks like doc generation has become even slower after the bump to the latest toolchain version. See #Brownian motion > CI is very slow.
Kim Morrison (Jul 02 2025 at 16:15):
Ouch, yeah, I just had a 10 minute doc-gen build on a no-op commit...
Eric Wieser (Jul 03 2025 at 00:06):
I'll try to rebase my PR in the next week, which should probably fix this at the same time
Jz Pan (Jul 03 2025 at 06:23):
Same problem here, seems that it from 40 minutes to 5 hours...
Kevin Buzzard (Jul 03 2025 at 07:37):
CI is failing in FLT because I bumped mathlib in main triggering a build of doc-gen which timed out after 5h50.
Henrik Böving (Jul 03 2025 at 08:25):
Disabling equations seems to fix some of this slowdown but something is still very slow in mathlib files deep in the dependency chain
Henrik Böving (Jul 03 2025 at 08:28):
well this can't be good
Henrik Böving (Jul 03 2025 at 08:30):
Ok I know why this happens
Henrik Böving (Jul 03 2025 at 08:34):
CC @Cameron Zwarich, I don't know if this is a regressio nfrom the new compiler or that some definition was changed but:
for (name, cinfo) in env.constants do
let some modidx := env.getModuleIdxFor? name | unreachable!
let moduleName := env.allImportedModuleNames[modidx]!
if !relevantModules.contains moduleName then
continue
res ← tryCatchRuntimeEx
(do
let config := {
maxHeartbeats := 5000000,
options := ← getOptions,
fileName := ← getFileName,
fileMap := ← getFileMap,
}
let analysis ← Prod.fst <$> ((DocInfo.ofConstant (name, cinfo)).run options).toIO config { env := env } {} {}
if let some dinfo := analysis then
let moduleName := env.allImportedModuleNames[modidx]!
let module := res[moduleName]!
return res.insert moduleName {module with members := module.members.push (ModuleMember.docInfo dinfo)}
else
return res
)
(fun e => do
if let some pos := e.getRef.getPos? then
let pos := (← getFileMap).toPosition pos
IO.println s!"WARNING: Failed to obtain information in file: {pos}, for: {name}, {← e.toMessageData.toString}"
else
IO.println s!"WARNING: Failed to obtain information for: {name}: {← e.toMessageData.toString}"
return res
)
In this loop in doc-gen defined here https://github.com/leanprover/doc-gen4/blob/38a4f87cadd82f16d4c3e89dcc7140b8512b0bf2/DocGen4/Process/Analyze.lean#L130-L159 we have env.allImportedModuleNames which is not O(1) but computed by mapping the module name from the import array so for environments with a lot of constants we keep repeating this computation over and over again. Retrospectively it is clearly my mistake for not pulling this env.allImportedModuleNames out of the loop into a let but the compiler should be able to do that maybe?
Henrik Böving (Jul 03 2025 at 08:38):
I updated the tag v4.22.0-rc2 with the performance fix, you should get the top level commit 928dded in your lock file which fixes this issue CC @Pietro Monticone @Kim Morrison @Jz Pan
Henrik Böving (Jul 03 2025 at 08:39):
And @Kevin Buzzard of course
Kevin Buzzard (Jul 03 2025 at 08:48):
Thanks a lot for the prompt fix! I'll give it a go on FLT now.
Kevin Buzzard (Jul 03 2025 at 08:52):
Actually I realise now that doc-gen is not listed in my lake-manifest so I don't really know where to start. It's not as simple as lake update? That's the only trick I know :-(
Henrik Böving (Jul 03 2025 at 08:54):
There is a recommended setup described in the doc-gen4 repo README but FLT doesn't seem to follow that so I don't know I'm afraid.
Pietro Monticone (Jul 03 2025 at 08:57):
We use the new docgen-action so, since @Henrik Böving tagged a new commit, @Kevin Buzzard should just rerun the failed job and everything should work fine.
Pietro Monticone (Jul 03 2025 at 09:00):
I have just done it. Let’s see if it works as expected https://github.com/ImperialCollegeLondon/FLT/actions/runs/16038445978/job/45277793540
Kevin Buzzard (Jul 03 2025 at 09:26):
This is all way out of my comfort zone, but my understanding was that we were originally using doc-gen in a way which Kim didn't like (I had to type lake -R -Kenv=dev update which made them unhappy) and then Yael changed it to a new way (doc-gen as a new private project?), and now I find out that we're still not using the recommended set-up :-) I would be happy to use any set-up which makes everyone happy (does such a set-up exist?) but would be completely unable to set that up myself. Having said that, it seems that if the current fix at the FLT end is "do nothing", this sounds perfect!
Michael Rothgang (Jul 03 2025 at 09:33):
My understanding was that the doc-gen set-up changes was made specifically to align to the recommended set-up. Are you sure FLT does not follow it?
Pietro Monticone (Jul 03 2025 at 09:33):
We are basically using the recommended setup. It’s just been incorporated in the new docgen-action. See here https://github.com/leanprover-community/docgen-action/blob/main/scripts/build_docs.sh
Michael Rothgang (Jul 03 2025 at 09:34):
There is a small wrinkle, that the CI step generates a project with the lakefile and runs that - so the lakefile is never committed to the repository. But that's inessential.
Pietro Monticone (Jul 03 2025 at 09:48):
Pietro Monticone said:
I have just done it. Let’s see if it works as expected https://github.com/ImperialCollegeLondon/FLT/actions/runs/16038445978/job/45277793540
Yes, it worked as expected (37 min).
Pietro Monticone (Jul 03 2025 at 09:49):
Same good news from
Cameron Zwarich (Jul 03 2025 at 15:38):
Henrik Böving said:
Retrospectively it is clearly my mistake for not pulling this
env.allImportedModuleNamesout of the loop into aletbut the compiler should be able to do that maybe?
Sadly, until we get recursive join points this won't happen. Perhaps specialization worked differently in the old compiler in a way that would enable this, but I don't see how it could have (due to it also using a cache of specializations that does not include extra decls to be inserted in the caller of the specialized loop).
Sebastian Ullrich (Jul 03 2025 at 16:41):
It's more likely that it's from me changing the definition :)
Jz Pan (Jul 03 2025 at 19:15):
Henrik Böving said:
I updated the tag
v4.22.0-rc2with the performance fix, you should get the top level commit928ddedin your lock file which fixes this issue
Thank you, it goes back to 35 minutes.
Kevin Buzzard (Jul 03 2025 at 21:33):
FLT CI is green again! Many thanks all!
Eric Wieser (Jul 04 2025 at 09:35):
Eric Wieser said:
I'll try to rebase my PR in the next week
This is now merged!
Kim Morrison (Aug 25 2025 at 02:18):
What's the status here? As far as I can see the docgen-action is attempting to cache the Mathlib parts of the docs. However looking at recent Carleson logs it seems like Mathlib docs are still being rebuilt.
Kim Morrison (Aug 25 2025 at 02:25):
Separately, I've just made 3 PRs to https://github.com/leanprover-community/docgen-action/pulls.
Perhaps @Anne Baanen and/or @Pietro Monticone could sanity check these for me?
Eric Wieser (Aug 25 2025 at 06:55):
The next step is to have the caching action use lake query to list the files to cache
Eric Wieser (Aug 25 2025 at 06:56):
(the PR above was about making that list accurate)
Eric Wieser (Aug 25 2025 at 06:58):
I think there is still some missing work on the doc-gen side to prevent it being affected by stale files from a previous run (eg not listing files and instead taking arguments)
Floris van Doorn (Aug 26 2025 at 10:34):
Kim Morrison said:
What's the status here? As far as I can see the
docgen-actionis attempting to cache the Mathlib parts of the docs. However looking at recent Carleson logs it seems like Mathlib docs are still being rebuilt.
I also noticed this, and opened docgen-action#3 last month
Alfredo Moreira-Rosa (Aug 31 2025 at 20:54):
It's so slow on my builds i'm considering puting them on a side ci that only trigger when a commit is tagged. the default setup we get that runs it on every build is kind of too much
Last updated: Dec 20 2025 at 21:32 UTC