Zulip Chat Archive

Stream: general

Topic: No space left on device


Xavier Roblot (Sep 05 2023 at 12:02):

This is from CI for #6832:

Current runner version: '2.308.0'
Runner name: 'hoskinson1'
Runner group name: 'Default'
Machine name: 'b715c4209e97'
GITHUB_TOKEN Permissions
Secret source: Actions
Prepare workflow directory
Prepare all required actions
Getting action download info
Error: No space left on device : '/home/lean/actions-runner/_work/_actions/dcarbone/install-jq-action/v1.0.1'

Is this something to be worried about?

Johan Commelin (Sep 05 2023 at 12:06):

cc @Wojciech Nawrocki

Johan Commelin (Sep 05 2023 at 12:07):

It probably means we need to delete the cache directory

Anatole Dedecker (Sep 05 2023 at 12:37):

Can reproduce in all of my PRs :scream:

Arthur Paulino (Sep 05 2023 at 12:49):

I thought the CI script included lake exe cache clean

Jireh Loreaux (Sep 05 2023 at 12:51):

I think it's failing on a different step.

Scott Morrison (Sep 05 2023 at 12:52):

The install-jq-action is new (added by me ....), and I'm worried that it is installing multiple copies and filling up the drive.

Scott Morrison (Sep 05 2023 at 12:52):

I'm not sure if we can diagnose that until someone with shell access to the machines is available.

Anatole Dedecker (Sep 05 2023 at 13:59):

The faulty runner has been temporarily disabled, re-running jobs should now work :fingers_crossed:

Sebastien Gouezel (Sep 05 2023 at 14:01):

I've resent accepted PRs to bors.

Wojciech Nawrocki (Sep 05 2023 at 15:36):

It's actually .elan/toolchains :sweat_smile:

Wojciech Nawrocki (Sep 05 2023 at 15:43):

I cleaned up the toolchains and reenabled hoskinson1. Toolchain size will certainly come up again, but hopefully less so assuming that rc and stable release bumps are less frequent than nightly bumps.

Jireh Loreaux (Sep 05 2023 at 16:04):

Doesn't this just mean we need garbage collection for toolchains within elan? E.g., a date- or storage-based system where elan automatically removes toolchains after the specified limit? Even if the default is "never delete", we probably want this option.

Scott Morrison (Sep 06 2023 at 00:16):

Note we are still using every toolchain (possibly even more so than previously), because the nightly-testing branch now has automation to try out every nightly as soon as it arrives.

Scott Morrison (Sep 06 2023 at 00:36):

Fixing this in elan requires more care, because we wouldn't want to delete a toolchain that was still in use.

In CI it is easy, because there is no danger to occasionally have to redownload.

#6969

Yuyang Zhao (Sep 14 2023 at 19:09):

It's back. : (

Patrick Massot (Sep 14 2023 at 19:12):

@Wojciech Nawrocki

Anatole Dedecker (Sep 14 2023 at 19:35):

I've disabled the runner hoskinson3 while we wait for a fix so that it doesn't cause more issues.

Wojciech Nawrocki (Sep 14 2023 at 20:36):

Fixed. It was the same issue: hoskinson3 had 15GiB of Lean toolchains stored on it. Even though #6969 was merged, this does not appear to suffice. My conjecture is that there are jobs that execute on all these runners that pull in some old toolchains. For example, on hoskinson2 2023-08-05 was pulled in today. So checking for toolchain folders older than two weeks does not correspond to checking for old Lean versions. The bruteforce solution would be to shorten the period to two days or even just one day.

Patrick Massot (Sep 14 2023 at 20:39):

Thanks!

Ruben Van de Velde (Sep 14 2023 at 21:36):

Wojciech Nawrocki said:

Fixed. It was the same issue: hoskinson3 had 15GiB of Lean toolchains stored on it. Even though #6969 was merged, this does not appear to suffice. My conjecture is that there are jobs that execute on all these runners that pull in some old toolchains. For example, on hoskinson2 2023-08-05 was pulled in today. So checking for toolchain folders older than two weeks does not correspond to checking for old Lean versions. The bruteforce solution would be to shorten the period to two days or even just one day.

Looks like hoskinson3 is now failing at

Run lake exe cache clean
error: toolchain 'leanprover/lean4:v4.0.0' is not installed

Kevin Buzzard (Sep 14 2023 at 21:42):

Maybe it just pulls a random lean and adds it to the pile, then hopes it got the right one at some point in the past?

Patrick Massot (Sep 14 2023 at 21:43):

See https://leanprover.zulipchat.com/#narrow/stream/345428-mathlib-reviewers/topic/v4.2E0.2E0.20not.20found/near/391020404

Scott Morrison (Sep 14 2023 at 22:37):

I suspect the problem is that @Wojciech Nawrocki deleted the toolchains in some way that left elan in a broken state, knowing the toolchain wasn't there but unwilling to install it.

Scott Morrison (Sep 14 2023 at 22:39):

Indeed, I can reproduce locally.

Deleting ~/.elan/toolchains but not ~/.elan/update-hashes leaves elan in exactly this broken state.

Scott Morrison (Sep 14 2023 at 22:40):

So --- to fix, we need:

  • Someone (@Wojciech Nawrocki?) with Hoskinson access to delete also the update-hashes directories
  • @Sebastian Ullrich to patch elan so it can cope with this sort of abuse from users :-)

Scott Morrison (Sep 14 2023 at 22:41):

@Mario Carneiro, do you still have shell access to these machines?

Mario Carneiro (Sep 14 2023 at 22:41):

in theory

Scott Morrison (Sep 14 2023 at 22:42):

It looks like it may just be hoskinson3 affected, so we can just disable it again.

Scott Morrison (Sep 14 2023 at 22:43):

I've disabled it, so hopefully we're good to go now.

Scott Morrison (Sep 14 2023 at 22:47):

Ugh, hoskinson9 is having the same problems, I'm not sure how many this affected.

Scott Morrison (Sep 14 2023 at 22:49):

hoskinson too, looks like most of them have been broken. :-)

Mario Carneiro (Sep 14 2023 at 22:50):

maybe we need a script before downloading a toolchain which just ensures that there are fewer than N toolchains installed by removing old ones

Anatole Dedecker (Sep 14 2023 at 22:51):

Which problem, no space available or toolchain not installed ?

Mario Carneiro (Sep 14 2023 at 22:51):

no space available

Mario Carneiro (Sep 14 2023 at 22:51):

the toolchain not installed problem is the side effect of not having the space to install it IIUC

Mario Carneiro (Sep 14 2023 at 22:52):

if we just have a hard limit on the number of toolchains installed instead of a time-based limit then we can avoid this

Scott Morrison (Sep 14 2023 at 22:52):

Also, the current scripts is only deleting from the .elan/toolchains/ directory, leaving .lean/update-hashes/ alone, so is borked anyway.

Anatole Dedecker (Sep 14 2023 at 22:52):

Scott Morrison said:

I suspect the problem is that Wojciech Nawrocki deleted the toolchains in some way that left elan in a broken state, knowing the toolchain wasn't there but unwilling to install it.

So this isn’t the right diagnosis?

Scott Morrison (Sep 14 2023 at 22:53):

I think it is?

Anatole Dedecker (Sep 14 2023 at 22:53):

I thought Mario said it isn’t

Mario Carneiro (Sep 14 2023 at 22:53):

no

Mario Carneiro (Sep 14 2023 at 22:54):

fixing the messed up installation is good, but the root issue is that we are running out of space for toolchains

Scott Morrison (Sep 14 2023 at 22:54):

Here's the chain of events:

  • we don't delete enough old toolchains
  • so the runners still run out of space
  • yesterday Wojciech cleaned up by deleting from the .elan/toolchains/ directory
  • but because .elan/update-hashes was still there, elan refuses to redownload the toolchains

Anatole Dedecker (Sep 14 2023 at 22:54):

Right, sorry of the misunderstanding

Adam Topaz (Sep 14 2023 at 22:54):

Why do we need a script at all? Shouldn’t the runs running on hoskinson et al all be running in a container with an ephemeral volume?

Mario Carneiro (Sep 14 2023 at 22:54):

we want to cache toolchains

Mario Carneiro (Sep 14 2023 at 22:55):

just not forever

Adam Topaz (Sep 14 2023 at 22:55):

Oh right ok

Scott Morrison (Sep 14 2023 at 22:55):

Life would be simpler if we didn't!

Scott Morrison (Sep 14 2023 at 22:56):

It seems the savings from caching the toolchains are:

  • saving on network traffic (are we actually paying incrementally?)
  • saving a little bit of time on each CI run

Wojciech Nawrocki (Sep 14 2023 at 22:56):

Mario Carneiro said:

if we just have a hard limit on the number of toolchains installed instead of a time-based limit then we can avoid this

This is a good idea.

Wojciech Nawrocki (Sep 14 2023 at 22:57):

Scott Morrison said:

hoskinson too, looks like most of them have been broken. :-)

Sorry about this oversight. Fixing them all now.

Scott Morrison (Sep 14 2023 at 22:57):

No problem, I've done exactly the same thing locally multiple times.

Scott Morrison (Sep 14 2023 at 22:57):

Plus written a CI script that does more of it. :-)

Kevin Buzzard (Sep 14 2023 at 22:58):

You could have a hard limit of 1 :-)

Wojciech Nawrocki (Sep 14 2023 at 23:03):

Okay, should be good again. It was hoskinson[39]?.

Scott Morrison (Sep 14 2023 at 23:03):

Those were the ones I found so far!

Scott Morrison (Sep 14 2023 at 23:07):

Does

cd ~/.elan/toolchains && ls -1td * | tail -n +11 | xargs -I {} echo "rm -rf {} && rm -rf ../update-hashes/{}"

look okay?

Scott Morrison (Sep 14 2023 at 23:07):

(Notice the echo, this is a dry build that you can safely run on your own machine. :-)

Scott Morrison (Sep 14 2023 at 23:08):

hopefully deletes all but the 10 most recent subdirectories in ~/.elan/toolchains, and the corresponding files in ~/.elan/update-hashes

Mario Carneiro (Sep 14 2023 at 23:09):

it's a bit wrong in the presence of elan toolchain link but I guess this isn't a problem for the CI machines

Mario Carneiro (Sep 14 2023 at 23:09):

it also looks like it will delete the default toolchain, not sure if this has any impact

Mario Carneiro (Sep 14 2023 at 23:10):

you don't need -rf for the second part, it's a file not a directory

Mario Carneiro (Sep 14 2023 at 23:11):

oh and the toolchain links don't even appear in update-hashes so that rm would fail

Scott Morrison (Sep 14 2023 at 23:14):

cd ~/.elan/toolchains && find . -maxdepth 1 -type d -print0 | xargs -0 ls -1td | grep -v "nightly$" | grep -v "stable$" | tail -n +11 | xargs -I {} echo "rm -rf {} && rm ../update-hashes/{}"

Scott Morrison (Sep 14 2023 at 23:15):

Ignores symbolic links, skips nightly and stable.

Mario Carneiro (Sep 14 2023 at 23:15):

I think the $ needs escaping or single quotes

Mario Carneiro (Sep 14 2023 at 23:16):

it's interpreted as a malformed shell variable in fish

Mario Carneiro (Sep 14 2023 at 23:17):

the first line I get is rm -rf . && rm ../update-hashes/. which looks bad

Scott Morrison (Sep 14 2023 at 23:21):

hehe :-)

Scott Morrison (Sep 14 2023 at 23:22):

cd ~/.elan/toolchains && find . -maxdepth 1 -type d ! -name . -print0 | xargs -0 ls -1td | grep -v 'nightly$' | grep -v 'stable$' | tail -n +2 | xargs -I {} sh -c 'echo {} && rm -rf "{}" && rm "../update-hashes/{}"'

Scott Morrison (Sep 14 2023 at 23:22):

That one has a live rm in it.

Scott Morrison (Sep 14 2023 at 23:23):

cd ~/.elan/toolchains && find . -maxdepth 1 -type d ! -name . -print0 | xargs -0 ls -1td | grep -v 'nightly$' | grep -v 'stable$' | tail -n +2 | xargs -I {} sh -c 'echo {}'

Scott Morrison (Sep 14 2023 at 23:23):

without the rm

Scott Morrison (Sep 14 2023 at 23:34):

PR'd as #7166

Scott Morrison (Sep 14 2023 at 23:42):

Okay, between that and https://github.com/leanprover/elan/issues/107, hopefully we should be set in future.

Scott Morrison (Sep 14 2023 at 23:43):

We should make sure someone besides Wojciech has shell access to the runners, in practice as well as theory. :-)

Jireh Loreaux (Sep 15 2023 at 03:02):

Goodness, I missed all the fun in 6 hours. :joy:

Johan Commelin (Sep 15 2023 at 03:17):

When mathlib starts tracking rc's and the toolchain only bumps ~1/week then this problem should also fade into a more relaxed state

Johan Commelin (Sep 15 2023 at 03:18):

Or maybe it is already tracking rc's? I lost track....

Mario Carneiro (Sep 15 2023 at 03:20):

it's already tracking rc's

Mario Carneiro (Sep 15 2023 at 03:20):

but the PRs are still on a variety of old nightlies

Mario Carneiro (Sep 15 2023 at 03:22):

we have been on rc's for 3 weeks

Johan Commelin (Sep 15 2023 at 03:23):

Hopefully that means that there's 3 weeks worth of PRs that are now using <= 3 toolchains

Scott Morrison (Sep 15 2023 at 03:57):

Yeah, but CI is still using every nightly, because of the semi-automated nightly-testing branch. :-)

Scott Morrison (Sep 15 2023 at 03:57):

Also a new toolchain for each Lean 4 PR. :-)

Scott Morrison (Sep 15 2023 at 03:57):

Don't get your hopes up!

Mario Carneiro (Sep 15 2023 at 04:06):

Scott Morrison said:

Yeah, but CI is still using every nightly, because of the semi-automated nightly-testing branch. :-)

True, but nightly-testing is a good citizen, it only tests today's nightly, once a day, and doesn't keep coming back to the PR after a month without a merge

Yaël Dillies (Oct 06 2023 at 12:23):

We're hitting this again: https://github.com/leanprover-community/mathlib4/actions/runs/6429072971/job/17457436332

Anatole Dedecker (Oct 06 2023 at 12:29):

Runner disabled for now

Anatole Dedecker (Oct 06 2023 at 12:40):

I think I've bors retry-ed all affected ready-to-merge PRs

Patrick Massot (Oct 06 2023 at 12:42):

Thanks!

Utensil Song (Oct 08 2023 at 03:10):

Would https://github.com/jlumbroso/free-disk-space be helpful for this too? It can clear up to 31 GB of disk space in about 3 minute, I have used it and it works as claimed.

Scott Morrison (Oct 09 2023 at 00:59):

While I have shell access and can perhaps implement a fix if needed, I neither know why we are running out of space, nor even how to enter the environment the runners run in to have a look around! Someone else will need to fix one of these two problems. :-)

Wojciech Nawrocki (Oct 09 2023 at 18:00):

It was just 10GiB of Lean toolchains again, combined with some other old files that had accumulated.

Scott Morrison (Oct 10 2023 at 00:27):

@Wojciech Nawrocki, sorry to be dumb, but could you tell me where I should have gone after ssh'ing into this machine in order to find those toolchains?

Wojciech Nawrocki (Oct 10 2023 at 00:28):

Np, the easy way is to run sudo ncdu / and look at what takes up the most space on the drive. This will usually be some folder inside the docker image.

Scott Morrison (Oct 10 2023 at 00:29):

$ ncdu
-sh: 9: ncdu: not found

Wojciech Nawrocki (Oct 10 2023 at 01:31):

Which runner is this on? In general, for small utilities like these please feel free to apt install.

Scott Morrison (Oct 10 2023 at 01:43):

That was on the "unnumbered" hoskinson.

Scott Morrison (Oct 10 2023 at 01:46):

(I've installed ncdu everywhere.)

Yaël Dillies (Nov 04 2023 at 12:31):

It's happening again: https://github.com/leanprover-community/mathlib4/actions/runs/6754826458/job/18362702531

Yaël Dillies (Nov 04 2023 at 12:31):

@maintainers

Anatole Dedecker (Nov 04 2023 at 12:35):

Runner disabled

Wojciech Nawrocki (Nov 04 2023 at 19:41):

Restarted

Yaël Dillies (Nov 10 2023 at 13:09):

Happening again: https://github.com/leanprover-community/mathlib4/actions/runs/6825189236/job/18562531775?pr=8202

Yaël Dillies (Nov 10 2023 at 13:09):

@maintainers

Sebastian Ullrich (Nov 10 2023 at 13:15):

This is how the speedcenter solves this issue fwiw: https://github.com/Kha/mathlib4-bench/blob/c6be8420f6dcbf70716586abee3c8e05e6a26ef1/bench#L5

Bryan Gin-ge Chen (Nov 10 2023 at 14:06):

Runner disabled

Wojciech Nawrocki (Nov 10 2023 at 21:22):

Sebastian Ullrich said:

This is how the speedcenter solves this issue fwiw: https://github.com/Kha/mathlib4-bench/blob/c6be8420f6dcbf70716586abee3c8e05e6a26ef1/bench#L5

This is unfortunately not enough. In this case the culprit was 15GiB of mathport data.

Mario Carneiro (Nov 10 2023 at 21:38):

@Wojciech Nawrocki could you elaborate? AFAIK mathport isn't doing anything that could cause a buildup of data over multiple runs, except maybe the Outputs directory?

Wojciech Nawrocki (Nov 10 2023 at 22:03):

Ah, sorry, I already deleted the folder! That was the size of the mathport directory. Your guess sounds very plausible: accumulation of many Outputs/. The GitHub runner does not clean directories up across runs by default.

Henrik Böving (Nov 13 2023 at 22:20):

https://github.com/leanprover-community/mathlib4_docs/actions/runs/6852564004

We seem to have a space issue @ mathlib4_docs as well

Wojciech Nawrocki (Nov 14 2023 at 00:10):

It's been fixed!

Henrik Böving (Nov 14 2023 at 00:23):

@Wojciech Nawrocki did you delete some additional stuff? https://github.com/leanprover-community/mathlib4_docs/actions/runs/6857390839/job/18646491207 it's failing because there is no git config set up now

Henrik Böving (Nov 14 2023 at 00:26):

Alternatively I can also teach the action to set up a git config? But idk if we want that?

Wojciech Nawrocki (Nov 14 2023 at 00:34):

That's a bit strange, I reset the entire container so it should be the same setup it has always been.

Wojciech Nawrocki (Nov 14 2023 at 00:35):

If there was any transient state on it that the job relied on, that would have been erased as well.

Wojciech Nawrocki (Nov 14 2023 at 00:38):

It's also possible that one of the upstream GH actions changed in some way so that it no longer sets up a git config?

Henrik Böving (Nov 14 2023 at 07:18):

Its getting evsn more interesting: https://github.com/leanprover-community/mathlib4_docs/actions/runs/6857431293 it works again

Mario Carneiro (Nov 14 2023 at 07:19):

hoskinson6 seems to be full again, I got another mathport failure on it. That last run you linked is on hoskinson5

Wojciech Nawrocki (Nov 14 2023 at 16:22):

@Mario Carneiro could you link the failure? Looking at it now, there is space.

Mario Carneiro (Nov 14 2023 at 16:23):

https://github.com/leanprover-community/mathport/actions/runs/6859657810/job/18652262231

Mario Carneiro (Nov 14 2023 at 16:24):

It could be that something has changed upstream which is causing the mathport build to take a lot more space now. I am suspicious that we are seeing many space issues in a short time period now, when nothing in particular has changed recently in mathport

Mario Carneiro (Nov 14 2023 at 16:25):

Oh, it could be the post-update hook

Mario Carneiro (Nov 14 2023 at 16:27):

yeah this is probably making a mess because lean3port and mathlib3port have nonstandard setups, we are getting 3 mathlibs instead of just one

Junyan Xu (Nov 14 2023 at 22:26):

The mathlib4 speedcenter currently says "no worker registered". Is it maybe due to some disabled runners?

Sebastian Ullrich (Nov 15 2023 at 08:54):

Thanks, it's up again now

Mauricio Collares (Nov 20 2023 at 08:53):

hoskinson9 is out of disk space: https://github.com/leanprover-community/mathlib4/actions/runs/6927605611/job/18841912762

Scott Morrison (Nov 20 2023 at 09:38):

Ugh, I'm not sure what to remove. I'd hoped #8525 would help. Maybe it hasn't had enough time. I've been running ncdu on these machines, and there's nothing conspicuously out of place.

Mauricio Collares (Nov 20 2023 at 09:40):

It probably would help if it got a chance to run, but the build fails earlier than that

Mauricio Collares (Nov 20 2023 at 09:41):

I guess as more branches merge master the odds of accumulating 10 toolchains become vanishingly small

Mauricio Collares (Nov 20 2023 at 09:42):

That being said, did you check if the mathlib cache directory is a problem?

Mauricio Collares (Nov 20 2023 at 09:45):

I now see there is a lake exe cache clean step before lake exe cache get, so it shouldn't be a problem.

Henrik Böving (Dec 05 2023 at 14:34):

https://github.com/leanprover-community/mathlib4_docs/actions/runs/7097849825 we are closing in again /o\

Wojciech Nawrocki (Dec 12 2023 at 19:27):

This should be less of an issue moving forward as the CI runners now have 100GiB storage each thanks to @Jeremy Avigad .


Last updated: Dec 20 2023 at 11:08 UTC