Zulip Chat Archive
Stream: general
Topic: CI stuck at Lint project step
Rémy Degenne (Nov 06 2025 at 16:58):
In the Brownian motion project, half of our CI runs get stuck on the Lint project step: we get these 3 lines printed, and then nothing for a very long time (I don't know exactly how long; I stopped one after 50 minutes):
Run env LEAN_ABORT_ON_PANIC=1 ~/.elan/bin/lake exe runLinter BrownianMotion
env LEAN_ABORT_ON_PANIC=1 ~/.elan/bin/lake exe runLinter BrownianMotion
shell: /usr/bin/bash -e {0}
Example: https://github.com/RemyDegenne/brownian-motion/actions/runs/19140278725/job/54703144152?pr=199
When I cancel and then restart the job, I usually get a successful Lint project step that takes less than a minute. But sometimes it gets stuck again.
There was no problem with CI yesterday, but today I had to restart many runs. Does anyone have an idea of what the cause could be?
Bryan Gin-ge Chen (Nov 06 2025 at 17:33):
As I mentioned in the original thread, I looked into it a little but didn't get very far. Here are some more details that might save some clicking:
- the logs for the stuck jobs are here; some of the ones that are green only became green after being re-run, so the runs with the issues don't all show up when you filter for canceled jobs.
- the first run that had to be canceled seems to be this one, corresponding to this commit in RemyDegenne/brownian-motion#194. Nothing looked off to me, but maybe I just don't know the linters well enough.
- there's currently a run that's been stuck for over an hour and a half. It might be interesting to just let it keep running to see whether it eventually finishes or (as I suspect) GitHub terminates it after hitting the 6-hour limit.
- Does anyone know what linters might be nondeterministic? That might be a good place to start digging.
edit: looks like the job I mentioned above was indeed automatically canceled after 6 hours: https://github.com/RemyDegenne/brownian-motion/actions/runs/19141550379/job/54707721006
Rémy Degenne (Nov 07 2025 at 06:55):
Several stuck jobs have now hit the 6-hour point, so I guess stuck means stuck forever.
It might just be a feeling, but it looks like well over 50% of runs now have problems.
Damiano Testa (Nov 07 2025 at 09:41):
I do not know what this could be, but I am reminded of #general > runLinter in github workflow not terminating, which also involved issues with linters in projects depending on Mathlib.
Joseph Tooby-Smith (Nov 10 2025 at 13:32):
What I ended up doing to fix this on PhysLean was the following (see the link Damiano posted):
- Putting a time limit on the lint step in the workflow so it doesn't run forever (see the sketch below).
- Rewriting runLinter, keeping only what is actually needed (hard to say whether this actually did anything).
- Just waiting a couple of days.
For PhysLean, I think it may have been down to many simultaneous, or closely spaced, runs of the workflow. Would something like that make sense here?
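Concretely, the time limit can be set with GitHub Actions' `timeout-minutes` on the step. A minimal sketch, not the exact PhysLean workflow: the step name and command are taken from Rémy's log above, and the 20-minute value is an arbitrary illustrative choice.

```yaml
# Hypothetical lint step with a timeout: a hung runLinter run fails after
# 20 minutes instead of idling until GitHub's 6-hour job limit.
- name: Lint project
  timeout-minutes: 20
  run: env LEAN_ABORT_ON_PANIC=1 ~/.elan/bin/lake exe runLinter BrownianMotion
```

This doesn't fix the hang itself, but it frees the runner and surfaces the failure quickly instead of leaving the job stuck.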
Bryan Gin-ge Chen (Nov 10 2025 at 13:37):
I didn't notice any simultaneous runs of the workflow when I checked the logs; I'd also be very surprised if something like that could affect what happened on GitHub-hosted runners.
Joseph Tooby-Smith (Nov 10 2025 at 13:40):
Yeah, I agree. The only evidence I have for this is that, when I was having issues with it, I noticed that usually the "first run" of the day would work, and then it would stop working, or work only intermittently, after that.
Let me also add that PhysLean is no longer having these issues. I have not bumped Lean or Mathlib since then.
Joseph Tooby-Smith (Nov 10 2025 at 13:43):
The only real change I made was replacing runLinter with a stripped-down version, which can be found in PhysLean#786. I think I did have the issue again after making this change, but only once or twice, so I can't say anything conclusive about the effect of that change.
Last updated: Dec 20 2025 at 21:32 UTC