Zulip Chat Archive

Stream: general

Topic: CI takes several hours and then cancels


Damiano Testa (Aug 23 2022 at 03:00):

Dear All,

I have been experiencing the following issue with #16192.

CI begins building mathlib, and after 6+ hours it cancels. I tried re-running it a couple of times and also tried merging master, but nothing seems to make progress. I do not know if it is an issue with the PR or with something else. Is there something that I am missing?

Thanks!

Damiano Testa (Aug 23 2022 at 03:03):

In case it helps, CI was successful sometimes in #15984 with those same lemmas (plus more in other files). Thus, I do not think that the issue is with the lemmas themselves.

Damiano Testa (Aug 23 2022 at 05:28):

Never mind: CI was successful today! :octopus:

Stuart Presnell (Aug 23 2022 at 13:37):

I had the same issues with a couple of PRs at the weekend. I also had a situation where a PR was waiting a couple hours in the queue to begin the lint mathlib step, which seemed unusual.

Jireh Loreaux (Aug 23 2022 at 15:43):

Yeah, this has been happening

Oliver Nash (Aug 25 2022 at 15:35):

I have a PR #16158 that I cannot merge because of apparent CI issues.

Oliver Nash (Aug 25 2022 at 15:37):

It was sent to bors yesterday but it was cancelled due to a merge conflict. I fixed this conflict (renaming round to round_eq) and merged master, but after building for nearly six hours overnight, it was cancelled. I merged master again this morning and it has just been cancelled at the six-hour mark while still building.

Oliver Nash (Aug 25 2022 at 15:38):

Am I stuck? How can I get this merged?

Yaël Dillies (Aug 25 2022 at 15:39):

Try again. I've sometimes had to rerun CI thrice before it worked.

Oliver Nash (Aug 25 2022 at 15:40):

But is this CI being weird or is the build genuinely a six-hour run now?

Yaël Dillies (Aug 25 2022 at 15:40):

CI being weird.

Oliver Nash (Aug 25 2022 at 15:40):

OK I'll hit rerun and cross my fingers.

Yaël Dillies (Aug 25 2022 at 15:40):

This also happens to linting, which usually takes 40min.

Oliver Nash (Aug 25 2022 at 15:41):

I have seen the reports about the linting taking far too long.

Yaël Dillies (Aug 25 2022 at 15:41):

Oh, do you think you're not hitting the same problem here?

Oliver Nash (Aug 25 2022 at 15:42):

I have no idea.

Oliver Nash (Aug 25 2022 at 15:42):

My guess is that at 22:41 this evening I'll be back here.

Jireh Loreaux (Aug 25 2022 at 15:45):

Oliver, I've had this 6 hour issue with both linting and build steps now on various PRs. It's CI being weird, but it's getting really annoying.

Oliver Nash (Aug 25 2022 at 15:45):

Ah OK thanks, that's good data.

Mauricio Collares (Aug 25 2022 at 16:07):

Is it swapping/thrashing? How much memory do the builders have?

Ruben Van de Velde (Aug 25 2022 at 16:09):

I don't think anyone has diagnosed it

Sebastien Gouezel (Aug 25 2022 at 16:15):

Is it only on lena/nael and friends, or also on the hoskinson ones?

Oliver Nash (Aug 25 2022 at 16:59):

My two ~6 hour runs took place on nael and lane respectively.

Oliver Nash (Aug 25 2022 at 16:59):

The new run is taking place on hoskinson3 :fingers_crossed:

Yaël Dillies (Aug 25 2022 at 17:02):

Can someone check the percentage of failures on nael and others? Maybe it's so low that CI would be faster without them?

Damiano Testa (Aug 25 2022 at 18:26):

If you want another data point, #16127 has been linting for 5h30 and is not done yet!

Oliver Nash (Aug 25 2022 at 19:05):

Damiano Testa said:

If you want another data point, #16127 has been linting for 5h30 and is not done yet!

I note that this timed-out run was also on nael.

Damiano Testa (Aug 25 2022 at 19:30):

Yes, this has been a pretty common pattern recently. I'll rerun the failed jobs and eventually it will build.

Damiano Testa (Aug 25 2022 at 20:29):

The reattempted linting finished in 40min on hoskinson2.

Stuart Presnell (Aug 26 2022 at 00:35):

I've just restarted linting on #16231 after it timed out on nela after 6 hours.

Jireh Loreaux (Aug 26 2022 at 04:16):

I would be tempted to cancel any linting jobs that take over 80 minutes. That's almost certainly more than is required for linting.

Johan Commelin (Aug 26 2022 at 08:32):

Maybe the Freiburg runners should be end-of-lifed. Or at least reserved for much more trivial tasks.

Kalle Kytölä (Aug 27 2022 at 10:11):

What is the recommended way to restart CI in these cases? Should one just push some tiny change? #15321 seems to have the same issue.

Yaël Dillies (Aug 27 2022 at 10:12):

Go here and click Re-run jobs, Re-run failed jobs.

Moritz Doll (Aug 28 2022 at 19:07):

Another data point: #16269 has now failed 3 times in the linting stage. This PR only adds comments...

Junyan Xu (Aug 28 2022 at 19:44):

Data point: mathlib build succeeded after 5 attempts
In reverse chronological order:
nela (lost communication)
nela (exceeded 360 min)
lena (exceeded 360 min)
nale (exceeded 360 min)

Stuart Presnell (Aug 28 2022 at 20:00):

#16231: build mathlib cancelled on nael after 364m

Stuart Presnell (Aug 28 2022 at 20:02):

(Is it at all useful to be reporting these datapoints here, or is this all monitored in a more systematic way somewhere?)

Damiano Testa (Aug 28 2022 at 20:05):

I would say that it is clear: if a job is not handled by a hoskinson runner, it will almost certainly time out.

Yaël Dillies (Aug 28 2022 at 20:06):

There was some talk about how runners had their disks full. Can we somehow clear their hard drives of all the past logs?

Bryan Gin-ge Chen (Aug 28 2022 at 22:37):

Stuart Presnell said:

(Is it at all useful to be reporting these datapoints here, or is this all monitored in a more systematic way somewhere?)

Yes, thank you for reporting these! Unfortunately we're still investigating what's going on with the Freiburg runners. Would anyone be opposed if we just disabled them temporarily? This could mean that everyone's builds will have to wait in the queue longer (note that the staging build which gets merged into master has a dedicated runner so it won't be affected).

Yaël Dillies (Aug 28 2022 at 22:39):

Given that most builds take over a day of restarting the jobs, I'm very fine with this!

Bryan Gin-ge Chen (Aug 29 2022 at 02:27):

OK, I've disabled the Freiburg runners (lane, lena, nale, nael, nela). After they finish (or fail) their currently running jobs, no more jobs should be sent to them.

Eric Rodriguez (Sep 01 2022 at 20:19):

The queue seems to be coping alright! This seems like a completely positive change all around :)

Damiano Testa (Sep 01 2022 at 20:21):

I agree! I rarely had to wait for a job to start and they were all successful, when they were supposed to be! :tada:


Last updated: Dec 20 2023 at 11:08 UTC