Zulip Chat Archive

Stream: mathlib4

Topic: Error: No space left on device

Jack McKoen (Aug 19 2024 at 22:48):

I pushed some changes to my PR #13710 and the build immediately failed with
Error: No space left on device : '/home/lean/actions-runner/_work/_actions/dcarbone/install-jq-action/v1.0.1'
What does this error mean?

Matthew Ballard (Aug 19 2024 at 22:52):

The computer running the CI ran out of fish space. Nothing you did

Jack McKoen (Aug 19 2024 at 22:54):

Thanks, that makes sense :sweat_smile:

Bryan Gin-ge Chen (Aug 20 2024 at 00:15):

I've taken the faulty machine out of the pool of runners until it can be fixed; hopefully you won't run into this again if you re-run the job.

Damiano Testa (Aug 20 2024 at 11:43):

I also hit this here.

Bryan Gin-ge Chen (Aug 20 2024 at 11:51):

OK, I've removed that runner temporarily.

Bryan Gin-ge Chen (Aug 20 2024 at 18:40):

According to @Wojciech Nawrocki who is working on fixing this, there might be a branch or PR which is generating abnormally large oleans. Does that ring a bell for anyone?

Matthew Ballard (Aug 20 2024 at 18:46):

I saw abnormally large oleans earlier today

Matthew Ballard (Aug 20 2024 at 18:46):

@Damiano Testa

Matthew Ballard (Aug 20 2024 at 18:47):

But I saw failures yesterday too

Damiano Testa (Aug 20 2024 at 19:11):

I was experimenting with the abnormally large oleans yesterday indeed...

Damiano Testa (Aug 20 2024 at 19:12):

I hope that the new PRs do not suffer from the same problem.

Matthew Ballard (Aug 20 2024 at 19:13):

Was it all on branch#adomani/linter_assertNotExists_or_imported?

Damiano Testa (Aug 20 2024 at 19:14):

I think that this one might be the good one. If a branch starts with test or ends in dev it can safely be removed.

Damiano Testa (Aug 20 2024 at 19:15):

Unless of course that branch (https://github.com/leanprover-community/mathlib4/tree/adomani%2Flinter_assertNotExists_or_imported) has large oleans, in which case I need to work more on the PR.

Damiano Testa (Aug 20 2024 at 19:15):

(I'm not at a computer, so I'm not able to remove the branches that I think are large.)

Bryan Gin-ge Chen (Aug 20 2024 at 19:19):

I think it's OK if the branches exist in the repo; we just don't want CI to run on them.

Though maybe we just need our CI workflow to clean up the runners more thoroughly.

Damiano Testa (Aug 20 2024 at 19:21):

Anyway, I'll remove the branches, probably tomorrow. If you want to remove the oleans as well, feel free to do so for all my PRs.

Matthew Ballard (Aug 20 2024 at 19:21):

Is this just preventing future bloat or can things be cleaned out more surgically with precise knowledge of the problem?

Damiano Testa (Aug 20 2024 at 19:22):

Should there be a CI step that reports the size of the oleans?

Bryan Gin-ge Chen (Aug 20 2024 at 19:24):

More info never hurts... probably.

Damiano Testa (Aug 20 2024 at 19:27):

So, a du invocation in check the cache, maybe?

Damiano Testa (Aug 20 2024 at 19:28):

I had never considered this issue before, but calling du -h, even just on .lake should be cheap and informative.

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Aug 20 2024 at 19:29):

Damiano Testa said:

Should there be a CI step that reports the size of the oleans?

This is a good idea

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Aug 20 2024 at 20:01):

Damiano Testa said:

Unless of course that branch (https://github.com/leanprover-community/mathlib4/tree/adomani%2Flinter_assertNotExists_or_imported) has large oleans, in which case I need to work more on the PR.

I don't think it's that branch, at least not on the latest commit. It produces about the expected size

Damiano Testa (Aug 21 2024 at 07:12):

I am happy to hear that the oleans of that branch are normal-sized, since that is the branch that I contains the "good" version of the PR!

I also think that branch#adomani/linter_assertNotExists_or_imported_dev is the one with the big oleans and I deleted it. The latest commit might have reasonable sized ones, but the earlier commits probably were the ones filling up memory: sorry!

Damiano Testa (Aug 21 2024 at 07:20):

Also, #16020 adds a report of the size of .lake. Most of the space in .lake is the oleans, so I only added the size of the whole dir, rather than the sizes of all the subdirs. Let me know if you think a more granular approach is more informative.

Damiano Testa (Aug 21 2024 at 07:27):

If you expand check the cache in this CI run, the last line tells you the size of .lake.

Rémy Degenne (Feb 25 2025 at 09:13):

I just got a "no space left on device" in CI for PR #22271 (running on hoskinson2).

Daniel Weber (Feb 25 2025 at 10:22):

I alao got this in #22180

Jakob von Raumer (Feb 25 2025 at 10:24):

Me too, guess someone either needs to clear some caches or pay someone else more money :grinning_face_with_smiling_eyes:

Kevin Buzzard (Feb 25 2025 at 11:09):

#22280 also hit this (does anyone dare merge it anyway? ;-) )

Kevin Buzzard (Feb 25 2025 at 11:09):

(deleted)

Bryan Gin-ge Chen (Feb 25 2025 at 11:41):

@Wojciech Nawrocki Could you take a look when you have a chance?

Bryan Gin-ge Chen (Feb 25 2025 at 11:45):

For now, I've removed the bors and pr tags from hoskinson2 so no more jobs should go its way. I've also retriggered CI / bors on the abovementioned PRs.

Jakob von Raumer (Feb 25 2025 at 12:20):

Can you restart #22248 as well?

Floris van Doorn (Feb 25 2025 at 12:23):

done

Bryan Gin-ge Chen (Feb 25 2025 at 12:30):

(Note that if you have write access to mathlib, you should be able to restart failed jobs yourself by clicking the "Re-run jobs" button in the upper right corner on the CI job page: https://github.com/leanprover-community/mathlib4/actions/runs/13518357916/job/37780479089)

Kevin Buzzard (Feb 25 2025 at 12:56):

(this is true but it wasn't clear to me whether spamming the runners was a good idea when there was a problem with the runners)

Bryan Gin-ge Chen (Feb 25 2025 at 13:06):

Right, I maybe also should have added: in general don't try re-running a job unless you have a reasonable expectation that the same thing won't happen again. Also, if you have to resort to re-running, it's probably a good idea to report it on Zulip as well. In this case it doesn't look like any other runners are broken and there's already a thread here so it's pretty safe.

Kevin Buzzard (Feb 25 2025 at 13:37):

My ignorance of the system is sufficiently deep that I didn't know whether (a) hitting retry would send it to the same machine again and (b) whether one machine having the problem meant that all machines had the problem. This was why I didn't retry. I was happy to wait :-)

Bryan Gin-ge Chen (Feb 25 2025 at 13:44):

With regards to (a), GitHub actions searches for a new runner every time a job is started. To be quite honest I was gambling a bit on (b) not being the case - I figured that the worst that could happen is that we'd find out a few more runners were broken, and better to find out sooner rather than later, so that Wojciech can reset them all in one go.

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Feb 25 2025 at 18:59):

The main issue seems to be ~60GiB of .cache. There is already code to cleanup the toolchains, should there also be something to clean up .cache?

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Feb 25 2025 at 19:03):

hoskinson2 should be back up.

Pim Otte (Feb 26 2025 at 06:27):

@Wojciech Nawrocki Fyi, this build makes it seem like hoskinson7 has the same issue:)

Sébastien Gouëzel (Feb 26 2025 at 07:20):

@Bryan Gin-ge Chen , indeed it looks like hoskinson7 is also full.

Jakob von Raumer (Feb 26 2025 at 09:41):

Yes, just hit this on hoskinson7 for #22248 as well

Floris van Doorn (Feb 26 2025 at 10:24):

Maybe the runners can do a weekly/monthly lake exe cache clean?

Bryan Gin-ge Chen (Feb 26 2025 at 13:31):

I've removed the pr and bors labels from hoskinson7. BTW, any maintainer should be able to do this via https://github.com/organizations/leanprover-community/settings/actions/runners (click on the runner name, then on the gear icon next to the list of tags to edit them)

Floris van Doorn (Feb 26 2025 at 13:37):

I tried to do this just now, but didn't find how to do this (I tried to press the dots to the right of the runner, which only gives the option to delete the runner).

Bryan Gin-ge Chen (Feb 26 2025 at 13:38):

Oh, can't you click the runner name itself? Doing so takes me to a page like this: https://github.com/organizations/leanprover-community/settings/actions/runners/106
edit: ah, I guess maybe you meant before you saw my message

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Feb 26 2025 at 17:35):

Cleaned hoskinson7. Actually, cleaned all of them as they all seem to be running out of .cache space. This may result in some one-off build failures.

Floris van Doorn (Feb 27 2025 at 00:36):

Bryan Gin-ge Chen said:

Oh, can't you click the runner name itself? Doing so takes me to a page like this: https://github.com/organizations/leanprover-community/settings/actions/runners/106
edit: ah, I guess maybe you meant before you saw my message

Sorry, I indeed wasn't clear in my previous message. I indeed meant before reading what you wrote. Thanks for the clarification!

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Mar 11 2025 at 21:30):

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 said:

The main issue seems to be ~60GiB of .cache. There is already code to cleanup the toolchains, should there also be something to clean up .cache?

FYI, we are seeing more build failures because of this. I would recommend adding a .cache cleaning step.

Bryan Gin-ge Chen (Mar 11 2025 at 21:41):

If someone can confirm that all we need is a weekly action to run lake exe cache clean on the self-hosted runners than I can put that together later today.

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Mar 11 2025 at 21:50):

I cannot confirm that it won't run out of space for other reasons, but all of the runner disk space issues happening recently have been due to .cache folders growing large; it would certainly help prevent that.

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Mar 11 2025 at 21:51):

If I understand the CI scripts, cleaning toolchains happens on every build. Are you suggesting to have a dedicated CI action just for cleaning .cache that is somehow scheduled to run weekly?

Bryan Gin-ge Chen (Mar 11 2025 at 23:12):

That was my understanding of @Floris van Doorn’s message above; sorry, I should have linked it earlier:

Floris van Doorn said:

Maybe the runners can do a weekly/monthly lake exe cache clean?

Bryan Gin-ge Chen (Mar 12 2025 at 01:49):

Ah, thinking about this more, the CI action approach doesn't really make sense after all, since it's not guaranteed that it will run on every machine. Not sure what I was thinking... would it be possible to set up a weekly cron job on the machines that runs lake exe cache clean then? Maybe we could even have them each send a weekly message to a Zulip channel with basic diagnostic info.

Kim Morrison (Mar 12 2025 at 02:01):

Why not just include this in the standard CI? On every run, check the size of the cache directory, and if it is above some limit, clear it out.

Kim Morrison (Mar 12 2025 at 02:01):

One could even be clever and delete the oldest files, hopefully avoiding any cache misses.

Bryan Gin-ge Chen (Mar 12 2025 at 03:49):

OK! I put together something (with Claude's help) in #22863.

Kevin Buzzard (Mar 12 2025 at 08:22):

I guess a potential cron job might remove a bunch of oleans in the middle of a CI run, which would be an exciting new way of things going wrong...

Damiano Testa (Mar 12 2025 at 08:31):

The I'm feeling naughty cron job.

Bryan Gin-ge Chen (Mar 12 2025 at 17:35):

Bryan Gin-ge Chen said:

OK! I put together something (with Claude's help) in #22863.

This is now merged so hopefully this thread will be quiet from now on...

Arthur Paulino (Mar 12 2025 at 18:42):

I'd suggest at least some way of getting a feel for how big the current mathlib cache is. Losing track of this number will some day put maintainers in a hurry, with a tight response time.

One loose method is following up the cache clean with a message to be posted here on Zulip. Then if the cleanup is being performed every two weeks or so, things are about to get really ugly.

Bryan Gin-ge Chen (Mar 12 2025 at 19:04):

For the time being, the workflow step is logging the total size of the cache folder on that specific machine: looks like it's around 5 GB on hoskinson9 at the moment.

Arthur Paulino (Mar 12 2025 at 19:31):

Minor: with du -sh instead of du -sb it would print the size in "human readable format". (e.g. 1K, 234M, 2G etc). So you'd be able to remove the "(in bytes)" part from the printed string. Though I see you're using that same number for the check of size

Junyan Xu (Mar 12 2025 at 19:47):

I'm getting

/home/lean/actions-runner/_work/_temp/2ee3dddc-a39c-4a1f-94d4-804fedfb62b4.sh: line 17: lake: command not found
Cache size (in bytes): 47178990759
Cache size exceeds threshold, running lake exe cache clean
Error: Process completed with exit code 127.

on #22885

Bryan Gin-ge Chen (Mar 12 2025 at 19:54):

Ah, that was silly. lake isn't installed yet in that step. :man_facepalming:
Let me put together a quick fix.

Bryan Gin-ge Chen (Mar 12 2025 at 20:03):

@Damiano Testa if you're around, would mind putting #22887 on the queue real quick before this breaks more builds? Looks like we have a red checkmark on master due to this issue as well... for now I'll remove the tags from hoskinson5 temporarily.

Damiano Testa (Mar 12 2025 at 20:07):

Oops, I should have also noticed that lake was not available!

Junyan Xu (Mar 12 2025 at 20:09):

Seems I bumped into another machine with almost full cache:

/home/lean/actions-runner/_work/_temp/92b4c16c-37b2-4537-88b9-6011d77103cc.sh: line 17: lake: command not found
Cache size (in bytes): 43540418002
Cache size exceeds threshold, running lake exe cache clean
Error: Process completed with exit code 127.

Bryan Gin-ge Chen (Mar 12 2025 at 20:10):

Looks like that was hoskinson1 this time. Sorry for all the trouble!

Junyan Xu (Mar 12 2025 at 20:12):

No worries, thanks for the quick fix!

Damiano Testa (Mar 12 2025 at 20:12):

The PR with the fix seems to be going well through CI!

Damiano Testa (Mar 12 2025 at 20:13):

And the "clean up" step was executed and gave no problem.

Bryan Gin-ge Chen (Mar 12 2025 at 20:20):

Great! I've re-enabled hoskinson1 and hoskinson5.

Last updated: May 02 2025 at 03:31 UTC