Zulip Chat Archive
Stream: general
Topic: ddos attack on the PR queue
Johan Commelin (May 04 2021 at 11:47):
Help! Someone seems to be generating a new PR about every 3 minutes... what's going on!
Bryan Gin-ge Chen (May 04 2021 at 13:04):
Bryan Gin-ge Chen (May 04 2021 at 13:04):
By the way, due to the massive GitHub Actions queue (distinct from the bors queue), the staging
job for #7268 timed out, so I put it back on the queue.
I suspect the same thing may happen a few more times, so if anyone wants to be a good citizen, please cancel any "redundant" workflows and possibly consider waiting to push your commits until the number of queued builds is 0.
Bryan Gin-ge Chen (May 04 2021 at 13:08):
Keep an eye on this staging
job; if it remains queued for longer than about 1.5 hours there's a high chance that it will time out, even if it would otherwise build successfully.
Bryan Gin-ge Chen (May 04 2021 at 13:10):
(I've just gone through and canceled a bunch of workflows on outdated commits.)
Johan Commelin (May 04 2021 at 13:11):
Can we switch to a more aggressive canceling by default?
Kevin Buzzard (May 04 2021 at 13:11):
I don't know what "cancelling redundant workflows" even means. Does this just mean "don't push to any mathlib branches for a while if you can help it"?
Bryan Gin-ge Chen (May 04 2021 at 13:12):
Johan Commelin said:
Can we switch to a more aggressive canceling by default?
There might be some GitHub actions which do it; I don't know of any built-in options that let us control this.
Johan Commelin (May 04 2021 at 13:13):
Something like: if CI has run < 30 minutes, and a new commit is pushed, just cancel it.
Johan Commelin (May 04 2021 at 13:13):
In particular, if someone pushed 5 times in a row, the first 4 pushes will not really use up CI.
Bryan Gin-ge Chen (May 04 2021 at 13:14):
Kevin Buzzard said:
I don't know what "cancelling redundant workflows" even means. Does this just mean "don't push to any mathlib branches for a while if you can help it"?
Basically I went through the list of queued and running CI jobs and if I saw multiple "continuous integration" jobs from the same branch, I canceled all the older ones.
Not pushing to mathlib branches for a while will help; like I said, keep an eye on this page.
Bryan Gin-ge Chen (May 04 2021 at 13:16):
@Johan Commelin I think @Scott Morrison experimented with something like this before; in particular, there are a few jobs under "cancel CI for obsolete pushes". I don't remember the details though.
Bryan Gin-ge Chen (May 04 2021 at 13:16):
Looks like the PR was #2235. Maybe time to revive it?
Scott Morrison (May 04 2021 at 13:18):
I thought it was a good idea at the time: I'm definitely a culprit when it comes to making pushes on a single branch in quick succession.
Johan Commelin (May 04 2021 at 13:19):
Right, but doing multiple pushes in a row should be supported.
Johan Commelin (May 04 2021 at 13:20):
At the moment we are relying on the elves to silently and smoothly keep the queue clean and CI running. (Thanks soooo much Bryan! You're doing an amazing invisible job!) If we can make this job a bit easier using more automation that would be great.
Anne Baanen (May 04 2021 at 13:27):
I have been creating a lot of PRs touching files low in the import hierarchy, sorry for any delays caused by the long recompilation times! I'll wait for a bit before continuing.
Johan Commelin (May 04 2021 at 13:27):
Kan academy hosts an action that runs in javascript and hence should be very lightweight: https://github.com/marketplace/actions/pull-request-workflow-cancel
Johan Commelin (May 04 2021 at 13:27):
I don't know if it is flexible enough for what we want.
Johan Commelin (May 04 2021 at 13:35):
This one is very popular: https://github.com/marketplace/actions/cancel-workflow-action
Bryan Gin-ge Chen (May 04 2021 at 13:40):
Increasing the bors timeout somehow (or getting it to count only actual running time) would also keep a long queue from disrupting bors. I'm not sure if I can make a good case for this in the bors repo issues though, as the waiting time needed could be very, very long (especially if the queue keeps growing), and allowing for super-long build times is probably not a good idea for the public bors instance that we're using.
Johan Commelin (May 04 2021 at 13:42):
@Bryan Gin-ge Chen do you think that the new workflows that add delegated
and such are eating up precious time?
Johan Commelin (May 04 2021 at 13:42):
Or are those not a problem at all?
Bryan Gin-ge Chen (May 04 2021 at 13:46):
Johan Commelin said:
Bryan Gin-ge Chen do you think that the new workflows that add
delegated
and such are eating up precious time?
No, those only run for a few seconds, so I don't think they're a big deal. Our limit is 20 concurrent jobs (though it sometimes looks like we're allowed more than that?), so the queue starts to grow once we have more than about 20 long-running jobs (e.g. build, lint, or test).
Johan Commelin (May 04 2021 at 13:47):
And bors is using the same budget of 20 jobs, right?
Bryan Gin-ge Chen (May 04 2021 at 13:49):
Well, the staging
job does. (The processes that actually control bors and pushes commits, posts comments, closes PRs, etc. runs outside of our budget)
Johan Commelin (May 04 2021 at 13:50):
If we ran staging
on a self-hosted runner (or in a different repo, but that seems awkward) that would let some steam off, right?
Bryan Gin-ge Chen (May 04 2021 at 13:51):
Yes, that would also solve the problem of bors timing out.
Johan Commelin (May 04 2021 at 13:52):
But is it easy to isolate staging
from the rest of CI?
Johan Commelin (May 04 2021 at 13:54):
Alternatively, if github would allow us to schedule certain workflow / branches to have higher prio, that would also be helpful
Bryan Gin-ge Chen (May 04 2021 at 14:01):
Johan Commelin said:
But is it easy to isolate
staging
from the rest of CI?
From a casual glance at the docs it looks like we might have to have 2 copies of the workflow file, one that only runs on staging
and which calls the self-hosted runner and one for everything else.
Johan Commelin (May 04 2021 at 14:09):
@Bryan Gin-ge Chen can we save some time by somehow caching the elan and python installation steps? Especially because this python install is done several jobs per workflow
Johan Commelin (May 04 2021 at 14:11):
I have no idea how much time this "overhead" is taking
Bryan Gin-ge Chen (May 04 2021 at 14:11):
Those don't take much time at all (probably less than 10 seconds total per workflow).
Johan Commelin (May 04 2021 at 14:13):
Hmm yes, I see. That isn't worth looking into.
Johan Commelin (May 04 2021 at 14:25):
It's really annoying that we can't give some hints to the Github Actions scheduler.
Jannis Limperg (May 04 2021 at 14:34):
I have a branch for which I don't actually need or want CI to run. Can I disable CI on that branch?
Eric Wieser (May 04 2021 at 14:35):
If you don't need CI on the branch, you may as well develop it in a fork rather than in mathlib itself
Eric Wieser (May 04 2021 at 14:35):
The only reason that everyone does their development in mathlib rather than forks (which is quite unusual for github) is because of how the CI caching is set up
Jannis Limperg (May 04 2021 at 14:36):
Yeah I thought so too. Let me do that.
Eric Wieser (May 05 2021 at 10:38):
bors seems stuck in a timeout loop. It's able to build and lint just fine, but then leanchecker
pushes it over the edge
Johan Commelin (May 05 2021 at 10:51):
What happened with out plans to switch to another checker?
Bryan Gin-ge Chen (May 07 2021 at 15:04):
Update: as of #7532, master
now uses trepplein
instead of leanchecker
. So if you have any older PRs, merging master
into them will make the test
job run about 40 minutes faster!
Last updated: Dec 20 2023 at 11:08 UTC