Zulip Chat Archive

Stream: mathlib4

Topic: slow linting step CI?


David Loeffler (Dec 09 2025 at 18:14):

For my recent PR #32646, the bors staging CI run took a really remarkably long time to finish, getting hung up on the "lint style" step. Since the PR adds a new linter, I was worried I'd done something wrong, but it looks like the slowness is something else: here's the relevant part of the CI log:

Get:29 http://azure.archive.ubuntu.com/ubuntu noble/universe amd64 lmodern all 2.005-1 [9542 kB]
Get:30 http://azure.archive.ubuntu.com/ubuntu noble/universe amd64 mupdf-tools amd64 1.23.10+ds1-1build3 [49.3 MB]
Get:31 http://azure.archive.ubuntu.com/ubuntu noble/main amd64 t1utils amd64 1.41-4build3 [61.3 kB]
Get:32 http://azure.archive.ubuntu.com/ubuntu noble/universe amd64 texlive-binaries amd64 2023.20230311.66589-9build3 [8529 kB]
Get:33 http://azure.archive.ubuntu.com/ubuntu noble/universe amd64 texlive-base all 2023.20240207-1 [21.7 MB]
Preconfiguring packages ...
Fetched 118 MB in 46min 34s (42.1 kB/s)

This is some Ubuntu auto-update step, downloading new texlive packages, which presumably shouldn't be happening afresh on every CI run! I hope this was just a one-off glitch, but I'm reporting it here in case some action is needed.

Bryan Gin-ge Chen (Dec 09 2025 at 18:21):

Very strange. My best guess is some kind of network hiccup. It doesn't look like something we have control over, unfortunately.

Michael Rothgang (Dec 10 2025 at 14:20):

Another instance: https://github.com/leanprover-community/mathlib4/actions/runs/20100617503/job/57670221999?pr=32690

Snir Broshi (Dec 20 2025 at 03:20):

I have one that beats both of these combined:
https://github.com/leanprover-community/mathlib4/actions/runs/20384108818/job/58581312606?pr=33111
4 hours and counting

Snir Broshi (Dec 20 2025 at 04:38):

Now it says "cancelled after 6h 5m"

Snir Broshi (Dec 20 2025 at 04:41):

Oh there's a time limit
image.png

Bryan Gin-ge Chen (Dec 21 2025 at 02:50):

Looks like a similar issue caused #32559 to fail as well: https://github.com/leanprover-community/mathlib4/actions/runs/20396174623/job/58611998166

Violeta Hernández (Dec 21 2025 at 03:26):

#33140 is going through this as well.

Snir Broshi (Dec 21 2025 at 03:34):

Could this be GitHub's safe sleep debacle?

Bryan Gin-ge Chen (Dec 21 2025 at 03:37):

It seems like the issue reported there was fixed recently. Are you suggesting that something is wrong with their fix?

I just remembered that we have been seeing similar issues with linters never terminating in downstream projects for a while, cf. #general > runLinter in github workflow not terminating and #general > CI stuck at Lint project step.

Snir Broshi (Dec 21 2025 at 05:46):

Bryan Gin-ge Chen said:

It seems like the issue reported there was fixed recently. Are you suggesting that something is wrong with their fix?

Are the runners v2.329.0 or newer? (idk if these are Mathlib's runners or if we're using GitHub's own runners which are probably always up to date)

I don't think the fix for the != vs <= problem is wrong, but IIUC that script historically caused multiple problems related to sleeping.

Bryan Gin-ge Chen (Dec 21 2025 at 06:31):

Mathlib's CI runs on self-hosted runners, but the runners keep themselves automatically up to date (see the docs). You can also see the version of the runner in the "Set up job" step, it looks like it's 2.330.0 for all the recent failures (since Dec. 19).

Bryan Gin-ge Chen (Dec 21 2025 at 15:39):

This seems to have caused #32701 to fail as well https://github.com/leanprover-community/mathlib4/actions/runs/20407376484/job/58639104912

David Loeffler (Dec 22 2025 at 07:13):

Several more CI runs seem to have failed overnight with this error, of the linting getting stuck forever and being automatically killed after 6h. Here's an example on #33187, https://github.com/leanprover-community/mathlib4/actions/runs/20418281106, and here on #33155, https://github.com/leanprover-community/mathlib4/actions/runs/20414682705

David Loeffler (Dec 22 2025 at 07:14):

Here also on #30089 https://github.com/leanprover-community/mathlib4/actions/runs/20413164116/attempts/1?pr=30089

Bryan Gin-ge Chen (Dec 22 2025 at 13:27):

I checked the logs above and it seems like the issue isn't just restricted to one or two of the machines.

I wonder if we should add more detailed logs to our linters temporarily to try and identify the culprit? Not entirely sure how to do this -- maybe one of our linting experts can weigh in?

Damiano Testa (Dec 22 2025 at 14:49):

This is somewhat mysterious: on my computer, the step that times out in CI takes under 3 minutes to complete.

Bryan Gin-ge Chen (Dec 22 2025 at 14:50):

It seems to happen randomly with a low probability even in CI -- so far re-running the job has fixed it.

Damiano Testa (Dec 22 2025 at 15:01):

Could it be that there is some lock in place and, if there are two pushes within a small period of time, then the linting step on the second one gets caught in a lock?

Damiano Testa (Dec 22 2025 at 15:02):

In at least two of the time outs above, the push before the one that times out happened less than 10 minutes earlier.

Snir Broshi (Dec 23 2025 at 06:07):

Another one https://github.com/leanprover-community/mathlib4/actions/runs/20450551667/job/58762462708?pr=32870

Bryan Gin-ge Chen (Dec 23 2025 at 06:43):

Damiano Testa said:

In at least two of the time outs above, the push before the one that times out happened less than 10 minutes earlier.

This is interesting, but I'm still not sure how this could lead to what we're seeing. Were the CI jobs for the earlier pushes assigned to the same runners as the CI for the later pushes? (I'd be surprised if they were because as far as I know, jobs are assigned fairly randomly to available runners.)

Damiano Testa (Dec 23 2025 at 09:48):

Bryan, I do not have any evidence. In particular, I do not even know how to check which runner was used!

Damiano Testa (Dec 23 2025 at 09:49):

Is there a way to list all actions that were re-ran?

Ruben Van de Velde (Dec 23 2025 at 10:07):

Under "set up job", there's a line like "Runner name: 'hoskinson6'"

Damiano Testa (Dec 23 2025 at 10:15):

Ah, thanks!

Damiano Testa (Dec 23 2025 at 10:18):

The first example that I looked of consecutive commits, they were both on hoskinson9. In fact, all 3:

  1. the first push (https://github.com/leanprover-community/mathlib4/actions/runs/20418257319/job/58665173117),
  2. the one that cancelled the previous one and timed out (https://github.com/leanprover-community/mathlib4/actions/runs/20418281106/attempts/1), and
  3. the re-run (https://github.com/leanprover-community/mathlib4/actions/runs/20418281106).

Damiano Testa (Dec 23 2025 at 10:20):

I am also not convinced that this "lock" is the reason, but if it were, then it would explain why it is not consistent behaviour: the second run needs to happen on the same runner as the previous one, introducing a possible randomness.

Bryan Gin-ge Chen (Dec 23 2025 at 14:05):

Unfortunately whatever's going on it doesn't seem that this is the only criterion. Looking at the example from #33155, it looks like:

  1. CI was triggered on the previous push at 18:41 UTC and ran on hoskinson2. This failed due to a long line: https://github.com/leanprover-community/mathlib4/actions/runs/20414169316
  2. CI was triggered on a push at 19:24 UTC and ran on hoskinson5. This ended up timing out in the linting step: https://github.com/leanprover-community/mathlib4/actions/runs/20414169316

It's a good idea to see if there are some patterns like this though.

Bryan Gin-ge Chen (Dec 24 2025 at 01:07):

Another example: https://github.com/leanprover-community/mathlib4/actions/runs/20470049395/job/58823025523

Bryan Gin-ge Chen (Dec 26 2025 at 22:10):

Still happening: https://github.com/leanprover-community/mathlib4/actions/runs/20525480899/job/58967951672?pr=33267

Thomas Murrills (Dec 27 2025 at 08:05):

Things that jump out at me (but with no guarantee of relevance!):

  • There are a couple spawned tasks in runLinterOnModule to build the necessary targets if they don't exist. I don't think this is where things go awry, but it would be nice to have a --no-build flag to runLinter to throw instead of doing this just in case.
  • There are some warnings in the docstrings of enableInitializerExecutions and withImporting about not accessing global references on multiple execution threads and ensuring that everything matches exactly if loaded twice. It makes me wonder if we need to be careful in a way we're not.
  • We use BaseIO.asTask to run linters in lintCore, together with a blocking Task.get. This seems suspicious? Note also that the docstring for BaseIO.asTask mentions that it continues to run even if the last reference to it is dropped. Not sure if it's possible for that to be an issue here.

It would be nice to add tracing to runLinter in choice spots to see how far we get—would that come through in the logs, or does the timeout somehow obliterate things? I'd be happy to PR that to batteries if it would help. :)

Bryan Gin-ge Chen (Dec 27 2025 at 14:00):

Adding more tracing certainly sounds worth trying! I think chances are good we'll learn something new about what's going on: the runners don't seem to be losing their connection with GitHub during the timeout (otherwise we'd get different error messages), so we should be able to read them in the logs. (Worst case, we can also ssh into the runners and check the logs there.)

Andrew Yang (Dec 27 2025 at 19:23):

This seems to be another example:
https://github.com/leanprover-community/mathlib4/actions/runs/20542058792/job/59007727248

Andrew Yang (Dec 28 2025 at 17:11):

It is also happening to bors batches now
https://github.com/leanprover-community/mathlib4/actions/runs/20553310113/job/59033959466

Maybe bors batches should have tighter timeout limits?

Bryan Gin-ge Chen (Dec 28 2025 at 18:23):

There is a setting for it: https://github.com/leanprover-community/mathlib4/blob/88c480406bd423b91b691ee43655057aa1f8db63/bors.toml#L3

I think we've always just set it to the max since back in the mathlib3 days our builds really were taking that long. Looking at the first few pages of builds here, the longest build (without the linting issue) took about 1h18m, so 2 hours might be a reasonable limit now (leaving some room for mathlib to grow a bit).

Thomas Murrills (Dec 29 2025 at 00:21):

Some basic tracing: batteries#1583

Thomas Murrills (Dec 29 2025 at 00:22):

(This means we should call runLinter --trace in mathlib CI once that lands, and possibly runLinter --no-build --trace!)

Thomas Murrills (Dec 29 2025 at 00:23):

I also have decl-level tracing in the wings, but figured the PR was already large enough! :) We may get enough information from this very simple tracing alone, after all.

Snir Broshi (Dec 30 2025 at 08:56):

in bors https://github.com/leanprover-community/mathlib4/actions/runs/20586447955/job/59123654008

Bryan Gin-ge Chen (Dec 31 2025 at 09:17):

Thomas Murrills said:

Some basic tracing: batteries#1583

Looks like this has been merged (thanks!) and will make it into mathlib soon (after #33426 is merged). Do we just need to add --trace to our lake exe runLinter steps?

Bryan Gin-ge Chen (Dec 31 2025 at 09:19):

Bryan Gin-ge Chen said:

There is a setting for it: https://github.com/leanprover-community/mathlib4/blob/88c480406bd423b91b691ee43655057aa1f8db63/bors.toml#L3

I think we've always just set it to the max since back in the mathlib3 days our builds really were taking that long. Looking at the first few pages of builds here, the longest build (without the linting issue) took about 1h18m, so 2 hours might be a reasonable limit now (leaving some room for mathlib to grow a bit).

I've opened #33427 for this.

Bryan Gin-ge Chen (Jan 02 2026 at 00:26):

#33474 adds tracing to the linting steps and makes an adjustment to the build wrapper script to make output get printed without buffering.

Snir Broshi (Jan 05 2026 at 00:24):

I think --trace broke PRs with a cryptic error message (e.g. https://github.com/leanprover-community/mathlib4/actions/runs/20701110864/job/59423426873), which will probably be fixed by them merging master. IIUC the problem is the CI scripts are always fetched from master, which pass --trace, but the PR might not have an updated Batteries so the linter doesn't know what --trace is. Is there a way of synchronizing these? (--trace iff updated Batteries)

Bryan Gin-ge Chen (Jan 05 2026 at 00:28):

Ah, unfortunately not... this would have been another good place to apply the hypothetical please-merge-master label (#mathlib4 > CI failure @ 💬) to all of our PRs.

Kim Morrison (Jan 05 2026 at 07:37):

Here's another one that looks like it is timeing out:

https://github.com/leanprover-community/mathlib4/actions/runs/20703941509/job/59430805978?pr=33577

Kim Morrison (Jan 05 2026 at 07:37):

Note that it says -- Linting passed for Mathlib., but the step doesn't terminate.

Bryan Gin-ge Chen (Jan 05 2026 at 12:37):

cc: @Thomas Murrills maybe this rings a bell?

Thomas Murrills (Jan 05 2026 at 18:07):

That is really weird. Is there any way of logging all running Tasks? :thinking: I'll think about this...

Thomas Murrills (Jan 05 2026 at 18:08):

(Aside: I've also made miscellaneous improvements to the tracing at batteries#1599. However, these will not help us diagnose the issue.)

Thomas Murrills (Jan 05 2026 at 18:22):

Oh, and to clarify: after checking manually, it does indeed look like all listed linters finished via tracing. I was hoping one of them might be a task that hadn't resolved, but...

Notification Bot (Jan 05 2026 at 19:37):

4 messages were moved from this topic to #mathlib4 > runLinter --update by Bryan Gin-ge Chen.

Notification Bot (Jan 05 2026 at 19:37):

A message was moved from this topic to #mathlib4 > runLinter --update by Bryan Gin-ge Chen.

Bryan Gin-ge Chen (Jan 05 2026 at 23:43):

One thing comes to mind is that there could be a strange interaction between the python wrapper script that collapses lake output and the linter process somehow. However, there are downstream projects that also ran into this issue (linked at #mathlib4 > slow linting step CI? @ 💬) which don't use this script.

In any case, we could also try removing the wrapper script from the linting step in CI and see if that helps.

Bryan Gin-ge Chen (Jan 06 2026 at 02:03):

I happened to see a few of these in progress and decided to log into the runners to see if anything was going on.

Here's the first log. I logged into the runner and got the following in the motd (the output of landscape-sysinfo, apparently):

 System information as of Tue Jan  6 01:39:49 UTC 2026

  System load:  0.0                Processes:               309
  Usage of /:   84.7% of 97.26GB
  Memory usage: 91%
  Swap usage:   100%

Here's the same for another one stuck in the linting step (log at):

  System load:  0.07               Processes:               282
  Usage of /:   79.1% of 97.26GB   Users logged in:         1
  Memory usage: 89%
  Swap usage:   100%

For comparison, here's what we see on another runner that's currently idle:

 System information as of Tue Jan  6 01:52:18 UTC 2026

  System load:  2.01               Processes:               272
  Usage of /:   51.7% of 97.26GB
  Memory usage: 15%
  Swap usage:   100%

And one which is currently in the middle of the build step (log):

  System load:  1.3                Processes:               283
  Usage of /:   88.2% of 97.87GB
  Memory usage: 67%
  Swap usage:   100%

The only thing I can see is that the system load is low and the memory usage is around 90% for the ones that are stuck?

Bryan Gin-ge Chen (Jan 06 2026 at 02:17):

Bryan Gin-ge Chen said:

In any case, we could also try removing the wrapper script from the linting step in CI and see if that helps.

This is done in #33646. Worth trying IMO!

Bryan Gin-ge Chen (Jan 06 2026 at 14:36):

Unfortunately removing the wrapper doesn't seem to have helped: https://github.com/leanprover-community/mathlib4/actions/runs/20749857182/job/59576122164

Now the log output seems to stop after the build step.

Nailin Guan (Jan 06 2026 at 14:41):

Am I meeting same problem here? https://github.com/leanprover-community/mathlib4/actions/runs/20742219875/job/59551216119?pr=26212

Bryan Gin-ge Chen (Jan 06 2026 at 17:02):

Yeah, looks like it. I maybe should have said this earlier, but when you see the linting step run for longer than say 30 minutes, then it's likely you're hitting this and it should be fine to cancel the run and then restart it.

Artie Khovanov (Jan 08 2026 at 23:18):

Is this a certified GitHub moment, or is it something wrong with the Mathlib tooling?

Bryan Gin-ge Chen (Jan 08 2026 at 23:25):

I think there's something wrong with the linters, but we haven't been able to figure out what yet.

Bryan Gin-ge Chen (Jan 09 2026 at 00:41):

Bryan Gin-ge Chen said:

Unfortunately removing the wrapper doesn't seem to have helped: https://github.com/leanprover-community/mathlib4/actions/runs/20749857182/job/59576122164

Now the log output seems to stop after the build step.

Following a suggestion from @Thomas Murrills in DMs, #33783 uses stdbuf to change the buffering to line buffering so that we can hopefully see the linting log output in real-time.

Michael Rothgang (Jan 11 2026 at 08:50):

https://github.com/leanprover-community/mathlib4/actions/runs/20885756973/job/60008669811?pr=30658 seems to be another instance (in #30658): the linting itself passed (according to the logs), but the CI step never finished and apparently got cancelled by the time-out.

Junyan Xu (Jan 11 2026 at 17:17):

Basically the same situation: https://github.com/leanprover-community/mathlib4/actions/runs/20895707567/job/60033883160?pr=32027 It's been 3h 52m and I'm gonna cancel/restart it.

Junyan Xu (Jan 13 2026 at 00:07):

Here's Copilot's diagnosis just in case it offers some debugging inspirations.

David Loeffler (Jan 13 2026 at 16:25):

Michael Rothgang said:

https://github.com/leanprover-community/mathlib4/actions/runs/20885756973/job/60008669811?pr=30658 seems to be another instance (in #30658): the linting itself passed (according to the logs), but the CI step never finished and apparently got cancelled by the time-out.

This seems to be happening increasingly often for the "staging" runs too (that CI hangs indefinitely at the end of the linting stage). I just cancelled two such CI runs (and someone else did the same yesterday).

Bryan Gin-ge Chen (Jan 13 2026 at 16:41):

Apologies for the inconvenience, everyone.

@Thomas Murrills has made some progress diagnosing what's going on but we still haven't pinned it down. I'm working on a workflow to automatically cancel jobs that appear stuck, which should hopefully reduce the runner congestion.

Marcelo Lynch (Jan 13 2026 at 16:47):

Bryan Gin-ge Chen said:

I'm working on a workflow to automatically cancel jobs that appear stuck, which should hopefully reduce the runner congestion.

adding a reasonable timeout-minutes to the step should achieve this, no?

Bryan Gin-ge Chen (Jan 13 2026 at 16:52):

Oh, good call! Not sure how I missed that.

Bryan Gin-ge Chen (Jan 13 2026 at 16:57):

I've created #33922 for this.

Wang Jingting (Jan 14 2026 at 13:38):

Can anyone take a look at https://github.com/leanprover-community/mathlib4/actions/runs/20949859110/job/60200416389 ? The runner reported "The job has exceeded the maximum execution time of 6h0m0s" (which is on linting stage), so I suspect it might be relevant to this thread.
By the way, is there a way to manually restart CI without pushing another commit? It seems that I can't find the "retry task" button now.

Michael Rothgang (Jan 14 2026 at 13:42):

I'd be happy to restart CI for you. In general, just making noise on zulip is fine; if you tell us your PR number, it's even faster. That said, in this case it probably makes more sense if you merge master --- recent changes to CI will cancel the job much faster (after 30 minutes for the linting step, instead of 6 hours) next time.

Snir Broshi (Jan 14 2026 at 13:42):

Since the job didn't have the new 30min timeout, surely you have an "update branch" button in your PR (at the bottom)

Bryan Gin-ge Chen (Jan 14 2026 at 13:45):

I think even just restarting the job will re-run it with the 30 min timeout (the version of the workflow that runs is the one on master, and from the log link it looks like that job was started before #33922 was merged).

Bryan Gin-ge Chen (Jan 14 2026 at 13:46):

I've gone ahead and restarted the job.

Wang Jingting (Jan 14 2026 at 13:56):

Thanks! It's PR #31613, and I've merged master now.

Yaël Dillies (Jan 15 2026 at 07:43):

Is this thread at all related to what's happening in #33873? Linting succeeds but times out: https://github.com/leanprover-community/mathlib4/actions/runs/21011593498/job/60407581598

Johan Commelin (Jan 15 2026 at 09:08):

FYI: behind the scenes @Bryan Gin-ge Chen, @Thomas Murrills and @Marcelo Lynch are working hard to fix this bug. But it's a very elusive one.

Michael Stoll (Jan 15 2026 at 16:48):

The current CI run on #33893 seems to be stuck in the "lint Mathlib" step at

[Mathlib] - simpComm: (1/2) Getting...
[Mathlib] - simpComm: (2/2) Passed!
[Mathlib] - simpNF: (1/2) Getting...

Marcelo Lynch (Jan 15 2026 at 16:51):

I'll try to capture a memory dump of this before it times out, please don't cancel the job

Bryan Gin-ge Chen (Jan 15 2026 at 17:06):

If it turns out 30 mins is too short of a timeout to catch these “live”, we can increase the timeout to, say, an hour per #33922

Marcelo Lynch (Jan 15 2026 at 17:09):

Yeah, I was thinking about that too. Interestingly enough it seems like there was a bunch of stdout when the timeout hits finishing with "Linting passed for Mathlib". It'd be interesting to see if this holds in other instances, it might narrow down where we suspect the hang is ocurring.

Thomas Murrills (Jan 15 2026 at 17:11):

When I tried to view that PR initially, I got a github webpage error saying the page was taking too long to load. Maybe the hang was our fault, but stdout not coming through immediately was github's fault?

Marcelo Lynch (Jan 15 2026 at 17:12):

Github is having issues today https://www.githubstatus.com/

Marcelo Lynch (Jan 15 2026 at 17:14):

Thomas Murrills said:

stdout not coming through immediately was github's fault?

maybe... although I would still make the same conclusion if that were the case

Michael Stoll (Jan 15 2026 at 21:33):

This run looks similar (7 minutes into linting right now).

Michael Stoll (Jan 15 2026 at 22:02):

The full text output of the linting step is

Run cd pr-branch
+ cd pr-branch
+ env LEAN_ABORT_ON_PANIC=1 stdbuf -oL lake exe runLinter --trace Mathlib
+ tee .lake/lint_output.txt
✔ [19/22] Built Batteries.Data.Array.Basic:c.o (264ms)
✔ [20/22] Built runLinter (1.4s)
✔ [21/22] Built runLinter:c.o (686ms)
✔ [22/22] Built runLinter:exe (480ms)
Running linter on specified module: Mathlib
[Mathlib] Starting lint...
[Mathlib] Running linters:
  checkType
  checkUnivs
  defLemma
  deprecatedNoSince
  docBlame
  dupNamespace
  impossibleInstance
  nonClassInstance
  simpComm
  simpNF
  simpVarHead
  structureInType
  synTaut
  tacticDocs
  unusedArguments
  unusedHavesSuffices
[Mathlib] - checkType: (0/2) Starting...
[Mathlib] - checkUnivs: (0/2) Starting...
[Mathlib] - defLemma: (0/2) Starting...
[Mathlib] - deprecatedNoSince: (0/2) Starting...
[Mathlib] - docBlame: (0/2) Starting...
[Mathlib] - dupNamespace: (0/2) Starting...
[Mathlib] - impossibleInstance: (0/2) Starting...
[Mathlib] - nonClassInstance: (0/2) Starting...
[Mathlib] - simpComm: (0/2) Starting...
[Mathlib] - simpNF: (0/2) Starting...
[Mathlib] - simpVarHead: (0/2) Starting...
[Mathlib] - structureInType: (0/2) Starting...
[Mathlib] - synTaut: (0/2) Starting...
[Mathlib] - tacticDocs: (0/2) Starting...
[Mathlib] - unusedArguments: (0/2) Starting...
[Mathlib] - unusedHavesSuffices: (0/2) Starting...
[Mathlib] - checkType: (1/2) Getting...
[Mathlib] - checkType: (2/2) Passed!
[Mathlib] - checkUnivs: (1/2) Getting...
[Mathlib] - checkUnivs: (2/2) Passed!
[Mathlib] - defLemma: (1/2) Getting...
[Mathlib] - defLemma: (2/2) Passed!
[Mathlib] - deprecatedNoSince: (1/2) Getting...
[Mathlib] - deprecatedNoSince: (2/2) Passed!
[Mathlib] - docBlame: (1/2) Getting...
[Mathlib] - docBlame: (2/2) Failed with 256 messages, but these may include declarations in `nolints.json`.
[Mathlib] - dupNamespace: (1/2) Getting...
[Mathlib] - dupNamespace: (2/2) Passed!
[Mathlib] - impossibleInstance: (1/2) Getting...
[Mathlib] - impossibleInstance: (2/2) Passed!
[Mathlib] - nonClassInstance: (1/2) Getting...
[Mathlib] - nonClassInstance: (2/2) Passed!
[Mathlib] - simpComm: (1/2) Getting...
[Mathlib] - simpComm: (2/2) Passed!
[Mathlib] - simpNF: (1/2) Getting...

where this was also the output visible after a few minutes.

Michael Stoll (Jan 15 2026 at 22:18):

The next run (after re-starting the workflow) only shows

Run cd pr-branch
+ cd pr-branch
+ env LEAN_ABORT_ON_PANIC=1 stdbuf -oL lake exe runLinter --trace Mathlib
+ tee .lake/lint_output.txt
✔ [19/22] Built Batteries.Data.Array.Basic:c.o (641ms)
✔ [20/22] Built runLinter (1.6s)
✔ [21/22] Built runLinter:c.o (714ms)
✔ [22/22] Built runLinter:exe (1.7s)
Running linter on specified module: Mathlib
[Mathlib] Starting lint...
[Mathlib] Running linters:
  checkType
  checkUnivs
  defLemma
  deprecatedNoSince
  docBlame
  dupNamespace
  impossibleInstance
  nonClassInstance
  simpComm
  simpNF
  simpVarHead
  structureInType
  synTaut
  tacticDocs
  unusedArguments
  unusedHavesSuffices
[Mathlib] - checkType: (0/2) Starting...
[Mathlib] - checkUnivs: (0/2) Starting...
[Mathlib] - defLemma: (0/2) Starting...
[Mathlib] - deprecatedNoSince: (0/2) Starting...
[Mathlib] - docBlame: (0/2) Starting...
[Mathlib] - dupNamespace: (0/2) Starting...
[Mathlib] - impossibleInstance: (0/2) Starting...
[Mathlib] - nonClassInstance: (0/2) Starting...
[Mathlib] - simpComm: (0/2) Starting...
[Mathlib] - simpNF: (0/2) Starting...
[Mathlib] - simpVarHead: (0/2) Starting...
[Mathlib] - structureInType: (0/2) Starting...

but is successful.

Thomas Murrills (Jan 15 2026 at 22:21):

Now they all end with -- Linting passed for Mathlib., so I suspect this is a github-based delay in forwarding stdout :grinning_face_with_smiling_eyes:

Michael Stoll (Jan 15 2026 at 22:30):

...which likely makes it hard to figure out from the output where it hangs.

Thomas Murrills (Jan 15 2026 at 22:32):

It's especially mysterious because these tracing steps happen strictly after all of the "obvious" top-level runLinter Task.get"s have been called, meaning that those runLinter-started tasks are actually finished! But we do have a hypothesis, which we're trying to catch...

Marcelo Lynch (Jan 16 2026 at 16:43):

Marcelo Lynch said:

Yeah, I was thinking about that too. Interestingly enough it seems like there was a bunch of stdout when the timeout hits finishing with "Linting passed for Mathlib". It'd be interesting to see if this holds in other instances, it might narrow down where we suspect the hang is ocurring.

Given this, plus a backtrace that Thomas shared showing that a worker was stuck in lean_finalize_task_manager, maybe a quick workaround to this could be to force an exit in runLinter? I wonder if just sticking IO.Process.exit 0 at the end of main would avoid this teardown or not...

Not sure if this is the best idea as these hacks tends to stick and the concurrency bug would still be lurking :) But it would narrow it down

Bryan Gin-ge Chen (Jan 16 2026 at 16:58):

I haven’t contributed to that part of batteries before but at this point it’s worth a shot I think!

Junyan Xu (Jan 17 2026 at 09:53):

backtrace that Thomas shared

can't access this ChatGPT link ...

I was just impacted by a bors run with additional outputs after -- Linting passed for Mathlib.:

+ status=124
+ grep -q 'cannot parse arguments' .lake/lint_output_attempt_2.txt
+ '[' 124 -eq 124 ']'
runLinter timed out (attempt 2).
+ echo 'runLinter timed out (attempt 2).'
+ '[' 2 -lt 2 ']'
+ exit 124
Error: Process completed with exit code 124.

Bryan Gin-ge Chen (Jan 17 2026 at 13:06):

That output is expected after @Marcelo Lynch's #34059 which retries the linting step twice now. Unfortunately it seems that some builds are still getting stuck twice...

Marcelo Lynch (Jan 19 2026 at 15:42):

I queried the CI runs from the last couple of days, and it looks like 80% of the instances of the timeout (48 out of 60 (!)) have benefited from the retry, but the other 20% hit a new timeout even upon retry (like Junyan's example above). It's good news, even though we're still not quite healthy.

Marcelo Lynch (Jan 20 2026 at 16:29):

I have opened this PR in batteries which will presumably bypass the deadlock if I understood Sebastian correctly - as in IO.Process.exit will just exit(0) without any cleanup
https://github.com/leanprover-community/batteries/pull/1621

Marcelo Lynch (Jan 20 2026 at 23:10):

The PR went in, though I believe people need to rebase mathlib to get the latest batteries in the CI.

In any case, data from the last couple of days show that retries have been mostly helpful except for some unlucky runs.
image.png

Michael Stoll (Jan 21 2026 at 16:42):

It seems that there are only a few different times for a successful linting step. Is there an obvious explanation for this?

Marcelo Lynch (Jan 21 2026 at 16:44):

Some updates on this issue:

  • looking at the bors CIs, which should be using the latest version of mathlib/master (hence the updated batteries), there have been no timeouts since the change went in. If this holds for the next couple of days I would say we're good in that respect, and as forks start getting this we should be in a good state
  • I was able to reproduce the concurrency issue that lean4#12052 is trying to fix by running runLinter in a loop. It took quite a number of iterations (the issue seems a lot more likely in our CIs for whatever reason), but still I think it's a good explanation for the backtraces we were seeing in gdb, so it might just be it

Marcelo Lynch (Jan 21 2026 at 16:45):

Michael Stoll said:

It seems that there are only a few different times for a successful linting step. Is there an obvious explanation for this?

I wonder what makes it "fast" (namely the ones at the bottom). The ~5min runtime is the same as the ~20min one because 20 = 15 (timeout of the first attempt) + 5

Bryan Gin-ge Chen (Jan 21 2026 at 16:46):

The fast ones might be where the mathlib4 build already failed, so the linting step just prints those build errors and exits.

Marcelo Lynch (Jan 21 2026 at 16:48):

Ah, yes, that makes total sense


Last updated: Feb 28 2026 at 14:05 UTC