Zulip Chat Archive
Stream: general
Topic: No space left on device
Xavier Roblot (Sep 05 2023 at 12:02):
This is from CI for #6832:
Current runner version: '2.308.0'
Runner name: 'hoskinson1'
Runner group name: 'Default'
Machine name: 'b715c4209e97'
GITHUB_TOKEN Permissions
Secret source: Actions
Prepare workflow directory
Prepare all required actions
Getting action download info
Error: No space left on device : '/home/lean/actions-runner/_work/_actions/dcarbone/install-jq-action/v1.0.1'
Is this something to be worried about?
Johan Commelin (Sep 05 2023 at 12:06):
cc @Wojciech Nawrocki
Johan Commelin (Sep 05 2023 at 12:07):
It probably means we need to delete the cache directory
Anatole Dedecker (Sep 05 2023 at 12:37):
Can reproduce in all of my PRs :scream:
Arthur Paulino (Sep 05 2023 at 12:49):
I thought the CI script included lake exe cache clean
Jireh Loreaux (Sep 05 2023 at 12:51):
I think it's failing on a different step.
Scott Morrison (Sep 05 2023 at 12:52):
The install-jq-action
is new (added by me ....), and I'm worried that it is installing multiple copies and filling up the drive.
Scott Morrison (Sep 05 2023 at 12:52):
I'm not sure if we can diagnose that until someone with shell access to the machines is available.
Anatole Dedecker (Sep 05 2023 at 13:59):
The faulty runner has been temporarily disabled, re-running jobs should now work :fingers_crossed:
Sebastien Gouezel (Sep 05 2023 at 14:01):
I've resent accepted PRs to bors.
Wojciech Nawrocki (Sep 05 2023 at 15:36):
It's actually .elan/toolchains
:sweat_smile:
Wojciech Nawrocki (Sep 05 2023 at 15:43):
I cleaned up the toolchains and reenabled hoskinson1
. Toolchain size will certainly come up again, but hopefully less so assuming that rc
and stable release bumps are less frequent than nightly bumps.
Jireh Loreaux (Sep 05 2023 at 16:04):
Doesn't this just mean we need garbage collection for toolchains within elan? E.g., a date- or storage-based system where elan automatically removes toolchains after the specified limit? Even if the default is "never delete", we probably want this option.
Scott Morrison (Sep 06 2023 at 00:16):
Note we are still using every toolchain (possibly even more so than previously), because the nightly-testing
branch now has automation to try out every nightly as soon as it arrives.
Scott Morrison (Sep 06 2023 at 00:36):
Fixing this in elan requires more care, because we wouldn't want to delete a toolchain that was still in use.
In CI it is easy, because there is no danger to occasionally have to redownload.
Yuyang Zhao (Sep 14 2023 at 19:09):
It's back. : (
Patrick Massot (Sep 14 2023 at 19:12):
@Wojciech Nawrocki
Anatole Dedecker (Sep 14 2023 at 19:35):
I've disabled the runner hoskinson3
while we wait for a fix so that it doesn't cause more issues.
Wojciech Nawrocki (Sep 14 2023 at 20:36):
Fixed. It was the same issue: hoskinson3
had 15GiB of Lean toolchains stored on it. Even though #6969 was merged, this does not appear to suffice. My conjecture is that there are jobs that execute on all these runners that pull in some old toolchains. For example, on hoskinson2
2023-08-05
was pulled in today. So checking for toolchain folders older than two weeks does not correspond to checking for old Lean versions. The bruteforce solution would be to shorten the period to two days or even just one day.
Patrick Massot (Sep 14 2023 at 20:39):
Thanks!
Ruben Van de Velde (Sep 14 2023 at 21:36):
Wojciech Nawrocki said:
Fixed. It was the same issue:
hoskinson3
had 15GiB of Lean toolchains stored on it. Even though #6969 was merged, this does not appear to suffice. My conjecture is that there are jobs that execute on all these runners that pull in some old toolchains. For example, onhoskinson2
2023-08-05
was pulled in today. So checking for toolchain folders older than two weeks does not correspond to checking for old Lean versions. The bruteforce solution would be to shorten the period to two days or even just one day.
Looks like hoskinson3 is now failing at
Run lake exe cache clean
error: toolchain 'leanprover/lean4:v4.0.0' is not installed
Kevin Buzzard (Sep 14 2023 at 21:42):
Maybe it just pulls a random lean and adds it to the pile, then hopes it got the right one at some point in the past?
Patrick Massot (Sep 14 2023 at 21:43):
Scott Morrison (Sep 14 2023 at 22:37):
I suspect the problem is that @Wojciech Nawrocki deleted the toolchains in some way that left elan in a broken state, knowing the toolchain wasn't there but unwilling to install it.
Scott Morrison (Sep 14 2023 at 22:39):
Indeed, I can reproduce locally.
Deleting ~/.elan/toolchains
but not ~/.elan/update-hashes
leaves elan in exactly this broken state.
Scott Morrison (Sep 14 2023 at 22:40):
So --- to fix, we need:
- Someone (@Wojciech Nawrocki?) with Hoskinson access to delete also the
update-hashes
directories - @Sebastian Ullrich to patch elan so it can cope with this sort of abuse from users :-)
Scott Morrison (Sep 14 2023 at 22:41):
@Mario Carneiro, do you still have shell access to these machines?
Mario Carneiro (Sep 14 2023 at 22:41):
in theory
Scott Morrison (Sep 14 2023 at 22:42):
It looks like it may just be hoskinson3 affected, so we can just disable it again.
Scott Morrison (Sep 14 2023 at 22:43):
I've disabled it, so hopefully we're good to go now.
Scott Morrison (Sep 14 2023 at 22:47):
Ugh, hoskinson9 is having the same problems, I'm not sure how many this affected.
Scott Morrison (Sep 14 2023 at 22:49):
hoskinson
too, looks like most of them have been broken. :-)
Mario Carneiro (Sep 14 2023 at 22:50):
maybe we need a script before downloading a toolchain which just ensures that there are fewer than N toolchains installed by removing old ones
Anatole Dedecker (Sep 14 2023 at 22:51):
Which problem, no space available or toolchain not installed ?
Mario Carneiro (Sep 14 2023 at 22:51):
no space available
Mario Carneiro (Sep 14 2023 at 22:51):
the toolchain not installed problem is the side effect of not having the space to install it IIUC
Mario Carneiro (Sep 14 2023 at 22:52):
if we just have a hard limit on the number of toolchains installed instead of a time-based limit then we can avoid this
Scott Morrison (Sep 14 2023 at 22:52):
Also, the current scripts is only deleting from the .elan/toolchains/
directory, leaving .lean/update-hashes/
alone, so is borked anyway.
Anatole Dedecker (Sep 14 2023 at 22:52):
Scott Morrison said:
I suspect the problem is that Wojciech Nawrocki deleted the toolchains in some way that left elan in a broken state, knowing the toolchain wasn't there but unwilling to install it.
So this isn’t the right diagnosis?
Scott Morrison (Sep 14 2023 at 22:53):
I think it is?
Anatole Dedecker (Sep 14 2023 at 22:53):
I thought Mario said it isn’t
Mario Carneiro (Sep 14 2023 at 22:53):
no
Mario Carneiro (Sep 14 2023 at 22:54):
fixing the messed up installation is good, but the root issue is that we are running out of space for toolchains
Scott Morrison (Sep 14 2023 at 22:54):
Here's the chain of events:
- we don't delete enough old toolchains
- so the runners still run out of space
- yesterday Wojciech cleaned up by deleting from the
.elan/toolchains/
directory - but because
.elan/update-hashes
was still there, elan refuses to redownload the toolchains
Anatole Dedecker (Sep 14 2023 at 22:54):
Right, sorry of the misunderstanding
Adam Topaz (Sep 14 2023 at 22:54):
Why do we need a script at all? Shouldn’t the runs running on hoskinson
et al all be running in a container with an ephemeral volume?
Mario Carneiro (Sep 14 2023 at 22:54):
we want to cache toolchains
Mario Carneiro (Sep 14 2023 at 22:55):
just not forever
Adam Topaz (Sep 14 2023 at 22:55):
Oh right ok
Scott Morrison (Sep 14 2023 at 22:55):
Life would be simpler if we didn't!
Scott Morrison (Sep 14 2023 at 22:56):
It seems the savings from caching the toolchains are:
- saving on network traffic (are we actually paying incrementally?)
- saving a little bit of time on each CI run
Wojciech Nawrocki (Sep 14 2023 at 22:56):
Mario Carneiro said:
if we just have a hard limit on the number of toolchains installed instead of a time-based limit then we can avoid this
This is a good idea.
Wojciech Nawrocki (Sep 14 2023 at 22:57):
Scott Morrison said:
hoskinson
too, looks like most of them have been broken. :-)
Sorry about this oversight. Fixing them all now.
Scott Morrison (Sep 14 2023 at 22:57):
No problem, I've done exactly the same thing locally multiple times.
Scott Morrison (Sep 14 2023 at 22:57):
Plus written a CI script that does more of it. :-)
Kevin Buzzard (Sep 14 2023 at 22:58):
You could have a hard limit of 1 :-)
Wojciech Nawrocki (Sep 14 2023 at 23:03):
Okay, should be good again. It was hoskinson[39]?
.
Scott Morrison (Sep 14 2023 at 23:03):
Those were the ones I found so far!
Scott Morrison (Sep 14 2023 at 23:07):
Does
cd ~/.elan/toolchains && ls -1td * | tail -n +11 | xargs -I {} echo "rm -rf {} && rm -rf ../update-hashes/{}"
look okay?
Scott Morrison (Sep 14 2023 at 23:07):
(Notice the echo
, this is a dry build that you can safely run on your own machine. :-)
Scott Morrison (Sep 14 2023 at 23:08):
hopefully deletes all but the 10 most recent subdirectories in ~/.elan/toolchains
, and the corresponding files in ~/.elan/update-hashes
Mario Carneiro (Sep 14 2023 at 23:09):
it's a bit wrong in the presence of elan toolchain link
but I guess this isn't a problem for the CI machines
Mario Carneiro (Sep 14 2023 at 23:09):
it also looks like it will delete the default toolchain, not sure if this has any impact
Mario Carneiro (Sep 14 2023 at 23:10):
you don't need -rf
for the second part, it's a file not a directory
Mario Carneiro (Sep 14 2023 at 23:11):
oh and the toolchain link
s don't even appear in update-hashes
so that rm would fail
Scott Morrison (Sep 14 2023 at 23:14):
cd ~/.elan/toolchains && find . -maxdepth 1 -type d -print0 | xargs -0 ls -1td | grep -v "nightly$" | grep -v "stable$" | tail -n +11 | xargs -I {} echo "rm -rf {} && rm ../update-hashes/{}"
Scott Morrison (Sep 14 2023 at 23:15):
Ignores symbolic links, skips nightly and stable.
Mario Carneiro (Sep 14 2023 at 23:15):
I think the $
needs escaping or single quotes
Mario Carneiro (Sep 14 2023 at 23:16):
it's interpreted as a malformed shell variable in fish
Mario Carneiro (Sep 14 2023 at 23:17):
the first line I get is rm -rf . && rm ../update-hashes/.
which looks bad
Scott Morrison (Sep 14 2023 at 23:21):
hehe :-)
Scott Morrison (Sep 14 2023 at 23:22):
cd ~/.elan/toolchains && find . -maxdepth 1 -type d ! -name . -print0 | xargs -0 ls -1td | grep -v 'nightly$' | grep -v 'stable$' | tail -n +2 | xargs -I {} sh -c 'echo {} && rm -rf "{}" && rm "../update-hashes/{}"'
Scott Morrison (Sep 14 2023 at 23:22):
That one has a live rm
in it.
Scott Morrison (Sep 14 2023 at 23:23):
cd ~/.elan/toolchains && find . -maxdepth 1 -type d ! -name . -print0 | xargs -0 ls -1td | grep -v 'nightly$' | grep -v 'stable$' | tail -n +2 | xargs -I {} sh -c 'echo {}'
Scott Morrison (Sep 14 2023 at 23:23):
without the rm
Scott Morrison (Sep 14 2023 at 23:34):
PR'd as #7166
Scott Morrison (Sep 14 2023 at 23:42):
Okay, between that and https://github.com/leanprover/elan/issues/107, hopefully we should be set in future.
Scott Morrison (Sep 14 2023 at 23:43):
We should make sure someone besides Wojciech has shell access to the runners, in practice as well as theory. :-)
Jireh Loreaux (Sep 15 2023 at 03:02):
Goodness, I missed all the fun in 6 hours. :joy:
Johan Commelin (Sep 15 2023 at 03:17):
When mathlib starts tracking rc's and the toolchain only bumps ~1/week then this problem should also fade into a more relaxed state
Johan Commelin (Sep 15 2023 at 03:18):
Or maybe it is already tracking rc's? I lost track....
Mario Carneiro (Sep 15 2023 at 03:20):
it's already tracking rc's
Mario Carneiro (Sep 15 2023 at 03:20):
but the PRs are still on a variety of old nightlies
Mario Carneiro (Sep 15 2023 at 03:22):
we have been on rc's for 3 weeks
Johan Commelin (Sep 15 2023 at 03:23):
Hopefully that means that there's 3 weeks worth of PRs that are now using <= 3 toolchains
Scott Morrison (Sep 15 2023 at 03:57):
Yeah, but CI is still using every nightly, because of the semi-automated nightly-testing
branch. :-)
Scott Morrison (Sep 15 2023 at 03:57):
Also a new toolchain for each Lean 4 PR. :-)
Scott Morrison (Sep 15 2023 at 03:57):
Don't get your hopes up!
Mario Carneiro (Sep 15 2023 at 04:06):
Scott Morrison said:
Yeah, but CI is still using every nightly, because of the semi-automated
nightly-testing
branch. :-)
True, but nightly-testing
is a good citizen, it only tests today's nightly, once a day, and doesn't keep coming back to the PR after a month without a merge
Yaël Dillies (Oct 06 2023 at 12:23):
We're hitting this again: https://github.com/leanprover-community/mathlib4/actions/runs/6429072971/job/17457436332
Anatole Dedecker (Oct 06 2023 at 12:29):
Runner disabled for now
Anatole Dedecker (Oct 06 2023 at 12:40):
I think I've bors retry
-ed all affected ready-to-merge PRs
Patrick Massot (Oct 06 2023 at 12:42):
Thanks!
Utensil Song (Oct 08 2023 at 03:10):
Would https://github.com/jlumbroso/free-disk-space be helpful for this too? It can clear up to 31 GB of disk space in about 3 minute, I have used it and it works as claimed.
Scott Morrison (Oct 09 2023 at 00:59):
While I have shell access and can perhaps implement a fix if needed, I neither know why we are running out of space, nor even how to enter the environment the runners run in to have a look around! Someone else will need to fix one of these two problems. :-)
Wojciech Nawrocki (Oct 09 2023 at 18:00):
It was just 10GiB of Lean toolchains again, combined with some other old files that had accumulated.
Scott Morrison (Oct 10 2023 at 00:27):
@Wojciech Nawrocki, sorry to be dumb, but could you tell me where I should have gone after ssh'ing into this machine in order to find those toolchains?
Wojciech Nawrocki (Oct 10 2023 at 00:28):
Np, the easy way is to run sudo ncdu /
and look at what takes up the most space on the drive. This will usually be some folder inside the docker image.
Scott Morrison (Oct 10 2023 at 00:29):
$ ncdu
-sh: 9: ncdu: not found
Wojciech Nawrocki (Oct 10 2023 at 01:31):
Which runner is this on? In general, for small utilities like these please feel free to apt install
.
Scott Morrison (Oct 10 2023 at 01:43):
That was on the "unnumbered" hoskinson.
Scott Morrison (Oct 10 2023 at 01:46):
(I've installed ncdu everywhere.)
Yaël Dillies (Nov 04 2023 at 12:31):
It's happening again: https://github.com/leanprover-community/mathlib4/actions/runs/6754826458/job/18362702531
Yaël Dillies (Nov 04 2023 at 12:31):
@maintainers
Anatole Dedecker (Nov 04 2023 at 12:35):
Runner disabled
Wojciech Nawrocki (Nov 04 2023 at 19:41):
Restarted
Yaël Dillies (Nov 10 2023 at 13:09):
Happening again: https://github.com/leanprover-community/mathlib4/actions/runs/6825189236/job/18562531775?pr=8202
Yaël Dillies (Nov 10 2023 at 13:09):
@maintainers
Sebastian Ullrich (Nov 10 2023 at 13:15):
This is how the speedcenter solves this issue fwiw: https://github.com/Kha/mathlib4-bench/blob/c6be8420f6dcbf70716586abee3c8e05e6a26ef1/bench#L5
Bryan Gin-ge Chen (Nov 10 2023 at 14:06):
Runner disabled
Wojciech Nawrocki (Nov 10 2023 at 21:22):
Sebastian Ullrich said:
This is how the speedcenter solves this issue fwiw: https://github.com/Kha/mathlib4-bench/blob/c6be8420f6dcbf70716586abee3c8e05e6a26ef1/bench#L5
This is unfortunately not enough. In this case the culprit was 15GiB of mathport data.
Mario Carneiro (Nov 10 2023 at 21:38):
@Wojciech Nawrocki could you elaborate? AFAIK mathport isn't doing anything that could cause a buildup of data over multiple runs, except maybe the Outputs
directory?
Wojciech Nawrocki (Nov 10 2023 at 22:03):
Ah, sorry, I already deleted the folder! That was the size of the mathport
directory. Your guess sounds very plausible: accumulation of many Outputs/
. The GitHub runner does not clean directories up across runs by default.
Henrik Böving (Nov 13 2023 at 22:20):
https://github.com/leanprover-community/mathlib4_docs/actions/runs/6852564004
We seem to have a space issue @ mathlib4_docs as well
Wojciech Nawrocki (Nov 14 2023 at 00:10):
It's been fixed!
Henrik Böving (Nov 14 2023 at 00:23):
@Wojciech Nawrocki did you delete some additional stuff? https://github.com/leanprover-community/mathlib4_docs/actions/runs/6857390839/job/18646491207 it's failing because there is no git config set up now
Henrik Böving (Nov 14 2023 at 00:26):
Alternatively I can also teach the action to set up a git config? But idk if we want that?
Wojciech Nawrocki (Nov 14 2023 at 00:34):
That's a bit strange, I reset the entire container so it should be the same setup it has always been.
Wojciech Nawrocki (Nov 14 2023 at 00:35):
If there was any transient state on it that the job relied on, that would have been erased as well.
Wojciech Nawrocki (Nov 14 2023 at 00:38):
It's also possible that one of the upstream GH actions changed in some way so that it no longer sets up a git config?
Henrik Böving (Nov 14 2023 at 07:18):
Its getting evsn more interesting: https://github.com/leanprover-community/mathlib4_docs/actions/runs/6857431293 it works again
Mario Carneiro (Nov 14 2023 at 07:19):
hoskinson6 seems to be full again, I got another mathport failure on it. That last run you linked is on hoskinson5
Wojciech Nawrocki (Nov 14 2023 at 16:22):
@Mario Carneiro could you link the failure? Looking at it now, there is space.
Mario Carneiro (Nov 14 2023 at 16:23):
https://github.com/leanprover-community/mathport/actions/runs/6859657810/job/18652262231
Mario Carneiro (Nov 14 2023 at 16:24):
It could be that something has changed upstream which is causing the mathport build to take a lot more space now. I am suspicious that we are seeing many space issues in a short time period now, when nothing in particular has changed recently in mathport
Mario Carneiro (Nov 14 2023 at 16:25):
Oh, it could be the post-update hook
Mario Carneiro (Nov 14 2023 at 16:27):
yeah this is probably making a mess because lean3port and mathlib3port have nonstandard setups, we are getting 3 mathlibs instead of just one
Junyan Xu (Nov 14 2023 at 22:26):
The mathlib4 speedcenter currently says "no worker registered". Is it maybe due to some disabled runners?
Sebastian Ullrich (Nov 15 2023 at 08:54):
Thanks, it's up again now
Mauricio Collares (Nov 20 2023 at 08:53):
hoskinson9
is out of disk space: https://github.com/leanprover-community/mathlib4/actions/runs/6927605611/job/18841912762
Scott Morrison (Nov 20 2023 at 09:38):
Ugh, I'm not sure what to remove. I'd hoped #8525 would help. Maybe it hasn't had enough time. I've been running ncdu on these machines, and there's nothing conspicuously out of place.
Mauricio Collares (Nov 20 2023 at 09:40):
It probably would help if it got a chance to run, but the build fails earlier than that
Mauricio Collares (Nov 20 2023 at 09:41):
I guess as more branches merge master the odds of accumulating 10 toolchains become vanishingly small
Mauricio Collares (Nov 20 2023 at 09:42):
That being said, did you check if the mathlib cache directory is a problem?
Mauricio Collares (Nov 20 2023 at 09:45):
I now see there is a lake exe cache clean
step before lake exe cache get
, so it shouldn't be a problem.
Henrik Böving (Dec 05 2023 at 14:34):
https://github.com/leanprover-community/mathlib4_docs/actions/runs/7097849825 we are closing in again /o\
Wojciech Nawrocki (Dec 12 2023 at 19:27):
This should be less of an issue moving forward as the CI runners now have 100GiB storage each thanks to @Jeremy Avigad .
Last updated: Dec 20 2023 at 11:08 UTC