Zulip Chat Archive

Stream: mathlib4

Topic: no space left on device


Jeremy Tan (Jun 06 2023 at 06:38):

https://github.com/leanprover-community/mathlib4/actions/runs/5185396528/jobs/9345272031 is this a result of us switching to the Hoskinson machines?

Johan Commelin (Jun 06 2023 at 06:39):

they should have plenty space, I think

Johan Commelin (Jun 06 2023 at 06:39):

cc @Wojciech Nawrocki

Jeremy Tan (Jun 06 2023 at 06:40):

yeah, I triggered re-run twice and got the same error back

Jeremy Tan (Jun 06 2023 at 06:44):

It's failing for every build now

Jeremy Tan (Jun 06 2023 at 06:47):

It really has to be an OOM on the Hoskinson machines

Johan Commelin (Jun 06 2023 at 06:48):

cc @Mario Carneiro @Scott Morrison

Scott Morrison (Jun 06 2023 at 07:02):

We will need to add rm -rf ~/.cache/mathlib in the appropriate place in CI. We already do this for mathport CI. We only recently moved to running mathlib4 PRs on the Hoskinson machines.

Mario Carneiro (Jun 06 2023 at 07:05):

I wonder if we couldn't have faster builds if we use a smarter eviction policy like deleting the oldest things until what remains is under X MB

Scott Morrison (Jun 06 2023 at 07:05):

I'm probably out for a little while, but if someone wants to add this line to .github/workflows/build.yml.in and run mk_build_yml.sh there, I can merge.

Mario Carneiro (Jun 06 2023 at 07:06):

this could also be a command line arg for cache get

Jeremy Tan (Jun 06 2023 at 08:27):

where should rm -rf ~/.cache/mathlib be added in build.yml.in?

Jeremy Tan (Jun 06 2023 at 08:36):

here or not?

build:
    if: github.repository MAIN_OR_FORK 'leanprover-community/mathlib4'
    name: BuildJOB_NAME
    runs-on: RUNS_ON
    steps:
      - name: cleanup
        run: |
          find . -name . -o -prune -exec rm -rf -- {} +
          rm -rf ~/.cache/mathlib

Jeremy Tan (Jun 06 2023 at 09:41):

I pushed that to !4#4721, still didn't work

Scott Morrison (Jun 06 2023 at 09:51):

I am trying to summon the relevant people with shell access.

Scott Morrison (Jun 06 2023 at 09:53):

@Jeremy Tan, in the meantime could you revert 18b622df767cc9ef272454f416192a53e52a41e4 on your PR and see if that succeeds? If should move CI back to the github runners in the meantime.

Scott Morrison (Jun 06 2023 at 09:57):

Looks good.

Jeremy Tan (Jun 06 2023 at 09:57):

build job is now running at https://github.com/leanprover-community/mathlib4/actions/runs/5187158004/jobs/9349183280?pr=4721

Scott Morrison (Jun 06 2023 at 09:58):

Okay, I will make a PR reverting 18b622df767cc9ef272454f416192a53e52a41e4 for now, I guess.

Scott Morrison (Jun 06 2023 at 10:01):

Can we insta-merge !4#4726?

Jeremy Tan (Jun 06 2023 at 10:02):

Or you can just merge !4#4721 instead, which incorporates the same revert

Jeremy Tan (Jun 06 2023 at 10:04):

I've edited the description on my PR to indicate this

Scott Morrison (Jun 06 2023 at 10:06):

@Jeremy Tan, no, it's better we leave a clean paper trail here. :-)

Scott Morrison (Jun 06 2023 at 10:08):

In any case, I broke protocol and merged !4#4726 myself. Hopefully I'm not making anything worse. :-)

If we see more build failures because of out of space issues, you can either merge !4#4726 directly into it, or wait until it lands on master and merge that.

Scott Morrison (Jun 06 2023 at 10:10):

Ugh, the bors step is still partially running on hoskinson machines...

Scott Morrison (Jun 06 2023 at 10:13):

@Chris Hughes properly insta-merged it without even using bors. Thanks! :-)

Scott Morrison (Jun 06 2023 at 10:14):

I think however no bors jobs are going to succeed with the current setup... I think I should make another PR to remove bors from hoskinson too.

Jeremy Tan (Jun 06 2023 at 10:15):

I merged master into !4#4721 and it's working

Johan Commelin (Jun 06 2023 at 10:48):

@Scott Morrison @Jeremy Tan what is the status now? Can I bors merge some PRs again? Or should I wait a bit?

Scott Morrison (Jun 06 2023 at 10:51):

I think we're all working again now.

Johan Commelin (Jun 06 2023 at 10:51):

great, let's see what bors does!

Scott Morrison (Jun 06 2023 at 10:52):

(But temporarily not on Hoskinson until we get those cleaned up, and then merge !4#4728.)

Jireh Loreaux (Jun 06 2023 at 10:55):

Does the lack of Hoskinsin runners mean mathlib3 is stalled for now?

Scott Morrison (Jun 06 2023 at 10:56):

Probably. :-(

Ruben Van de Velde (Jun 06 2023 at 11:15):

Did you mean "more stalled"? :)

Eric Rodriguez (Jun 06 2023 at 11:58):

this is affecting mathlib3 too, yes

Yury G. Kudryashov (Jun 06 2023 at 15:07):

I get "no space left on device" in !4#4695

Jeremy Tan (Jun 06 2023 at 15:23):

It's fixed now :ok:

Scott Morrison (Jun 06 2023 at 22:34):

Okay, the Hoskinson machines have been manually cleaned up, and !4#4728 now passes CI, running again on the Hoskinson machines, and using lake exe cache clean! on every run to avoid the cache growing unboundedly.

If someone could take a look at that and then merge, that would be great.

Ideally we would have a lake exe cache clean-old, that just removes older files from the cache, so we could avoid redownloading everything each cycle, but keep the cache bounded.

Arthur Paulino (Jun 06 2023 at 22:51):

@Scott Morrison how do you qualify "old"? There's already lake exe cache clean

Scott Morrison (Jun 06 2023 at 22:52):

Hmm, okay, does that just delete everything not relevant for the currently open branch?

Scott Morrison (Jun 06 2023 at 22:52):

I guess I saw that and didn't immediately understand what "non-linked" meant, to dismissed it too quickly.

Scott Morrison (Jun 06 2023 at 22:53):

I guess the ideal would be "keep anything relevant to the current branch and/or current master".

Mario Carneiro (Jun 06 2023 at 22:59):

linked means that get would have tried to download it

Mario Carneiro (Jun 06 2023 at 23:00):

the variant I was thinking of implementing was clean-old --size=1G would delete non-linked files from oldest to newest until the size of the cache drops below 1G

Scott Morrison (Jun 06 2023 at 23:03):

Okay !4#4728 now uses lake exe cache clean rather than lake exe cache clean!.

Mario Carneiro (Jun 06 2023 at 23:04):

alternatively, it can be an option to get as in get --purge-old=1G which would determine the linked files, clean up everything else per the above cache policy, then download missing files. The main advantage being that it wouldn't require headroom for an additional copy of the cache compared to downloading and then purging

Eric Wieser (Jun 06 2023 at 23:10):

That leaves you in a mess if your network cuts out

Eric Wieser (Jun 06 2023 at 23:11):

Which doesn't matter for CI, but sounds annoying for humans

Eric Wieser (Jun 06 2023 at 23:11):

Oh nevermind, "determine your linked files" means "salvage the parts that are useful" I guess

Mario Carneiro (Jun 06 2023 at 23:13):

actually another difference with get --purge would be that the cache size hovers a bit above the specified value since the missing files are not accounted for in the 1G limit (we don't know how big they are until we download them)

Mario Carneiro (Jun 06 2023 at 23:15):

actually I guess that is equivalent to running clean-old prior to get


Last updated: Dec 20 2023 at 11:08 UTC