Zulip Chat Archive

Stream: mathlib4

Topic: build failing


Michael Stoll (Jan 26 2024 at 14:50):

On #10029, commit acf36f6, "build mathlib" fails at the "get cache" stage with:

Attempting to download 285 file(s)
Downloaded: 0 file(s) [attempted 1/285 = 0%]
Downloaded: 0 file(s) [attempted 48/285 = 16%]
Downloaded: 0 file(s) [attempted 100/285 = 35%]
Downloaded: 0 file(s) [attempted 147/285 = 51%]
Downloaded: 0 file(s) [attempted 151/285 = 52%]
Downloaded: 0 file(s) [attempted 199/285 = 69%]
Downloaded: 0 file(s) [attempted 202/285 = 70%]
Downloaded: 0 file(s) [attempted 246/285 = 86%]
Downloaded: 0 file(s) [attempted 280/285 = 98%]
Downloaded: 0 file(s) [attempted 285/285 = 100%] (0% success)
Warning: some files were not found in the cache.
This usually means that your local checkout of mathlib4 has diverged from upstream.
If you push your commits to a branch of the mathlib4 repository, CI will build the oleans and they will be available later.
/home/lean/.cache/mathlib/f97fb2ce8d3ecb1e.ltar: failed to fill whole buffer
uncaught exception: leantar failed with error code 1
Decompressing 3875 file(s)
Error: Process completed with exit code 1.

Restarting the build had the same result. Can I do anything about that?

Matthew Ballard (Jan 26 2024 at 14:53):

Add white space to lake-manifest.json for a workaround

Michael Stoll (Jan 26 2024 at 14:53):

Anywhere?

Matthew Ballard (Jan 26 2024 at 14:56):

See also #mathlib4 > Why does CI fail on my branch?

Michael Stoll (Jan 26 2024 at 14:56):

I added an empty line somewhere. Now it tries to download 4160 files, but unsuccessfully...

Matthew Ballard (Jan 26 2024 at 14:56):

Oh maybe not an empty line but just some space between existing characters

Michael Stoll (Jan 26 2024 at 15:01):

Can you be more precise? This time I added a blank between "packages" and "Dir", and the build fails immediately :

error: ./lake-manifest.json: improperly formatted manifest: manifest missing required property 'packagesDir'

(This is on a Mathlib PR; the one before (#10028), which CI was working on shortly before, did not have this problem.)

Mauricio Collares (Jan 26 2024 at 15:03):

I think Matthew means you should commit this whitespace change to lake-manifest.json and wait for CI to rebuild everything

Michael Stoll (Jan 26 2024 at 15:04):

That's what I did (commit and push, then wait to see what CI is doing).

Mauricio Collares (Jan 26 2024 at 15:05):

Ah, sorrt: the whitespace change shouldn't be inside one of the strings in the json file

Mauricio Collares (Jan 26 2024 at 15:05):

Add a whitespace at the end of a line, for example

Mauricio Collares (Jan 26 2024 at 15:05):

(deleted)

Matthew Ballard (Jan 26 2024 at 15:07):

I did it after a colon for example

Mauricio Collares (Jan 26 2024 at 15:07):

It would be good if someone with Azure access could lake exe cache put! to fix the cache

Michael Stoll (Jan 26 2024 at 15:08):

OK; merging master seems to have worked.
Do we know what the root cause for this (apparently recurring) problem is?

Eric Rodriguez (Jan 26 2024 at 15:11):

https://leanprover.zulipchat.com/#narrow/stream/144837-PR-reviews/topic/.239933.20actually.20check.20cache.20works it'd be nice if we actually fixed the "check cache" step so that we don't have random cache failures

Mauricio Collares (Jan 26 2024 at 15:34):

Also, a corrupted file should just be skipped, no need to interrupt the whole decompression process

Mario Carneiro (Jan 28 2024 at 07:41):

Actually it doesn't interrupt the decompression process

Mario Carneiro (Jan 28 2024 at 07:42):

but it does return a nonzero error code, which causes lake exe cache get to fail

Mario Carneiro (Jan 28 2024 at 07:42):

although it is on the last step, so nothing is skipped

Mauricio Collares (Jan 28 2024 at 07:50):

Ah, good to know! Sorry for not checking.

Mario Carneiro (Jan 28 2024 at 07:51):

This is now fixed in #10063, leantar will now remove corrupted files on its own (and you can either run lake build to rebuild the missing files like you would for a partial cache get, or run lake exe cache get again if it was a network issue)

Sebastian Ullrich (Jan 28 2024 at 08:09):

How do they even get corrupted in the first place? Are we insufficiently handling error output from curl?

Mauricio Collares (Jan 28 2024 at 08:11):

Not sure if this still happens, but hitting Ctrl-C during the download (after a network stall, for example) used to leave corrupted files

Sebastian Ullrich (Jan 28 2024 at 08:15):

Right, but we have IO.FS.rename now. This is solvable.

Mario Carneiro (Jan 28 2024 at 08:35):

unknown. We are using IO.FS.rename in the download process already

Kevin Buzzard (Jan 28 2024 at 09:09):

What happens if the 4.5 gig download size catches an unsuspecting student out, the original attempt fails, the student deletes 10 gigs of mp3s and tries again?

Mario Carneiro (Jan 28 2024 at 09:25):

I'm not sure I understand the question

Mario Carneiro (Jan 28 2024 at 09:26):

The current architecture is already supposed to be immune to torn files, but #10063 adds yet another defense layer against bad files

Mario Carneiro (Jan 28 2024 at 09:28):

If you interrupt lake exe cache get halfway through the process, you will get half the files, and all the files you get will be valid. The second attempt will not try to redownload those files, but if somehow you still managed to corrupt some files those files will be deleted and you can rerun lake exe cache get to download better ones

Mario Carneiro (Jan 28 2024 at 09:32):

There is one scenario where I think corrupt files can get through, which is where lake exe cache put is interrupted or is partially successful and uploads torn files to the cloud. In that case every lake exe cache get will download those files, and prior to #10063 you will get a message like failed to fill whole buffer when unpacking it (but this message is otherwise harmless, except that you didn't get the file and something may need to be recompiled). After #10063 you will instead get the message removing corrupted file and you will still have to rebuild it; if you use lake exe cache get it will download the bad file again, notice it is corrupted and delete it again

Mario Carneiro (Jan 28 2024 at 09:33):

The "add whitespace to lake-manifest.json" trick is a way to work around bad files in the cloud

Kevin Buzzard (Jan 28 2024 at 09:39):

Mario Carneiro said:

I'm not sure I understand the question

I think you answered it anyway. Maybe "I got interrupted because my drive filled up" is no different to "I pressed Ctrl-C"

Mario Carneiro (Jan 28 2024 at 09:47):

It is slightly different since different parts of the code are likely to fail in that case. The failed to fill whole buffer error is mainly associated with .ltar files which are truncated due to network or curl issues, while if you run out of space during unpacking it is the .olean writing that will fail. #10063 adds an additional protection against issues in the .olean writing stage, namely that it rolls back (deletes) any files it produced in the course of an unpacking operation that is aborted halfway through due to an IO error (such as running out of disk space).

So this avoids the situation where lake exe cache get runs, leantar gives an error about the disk being full (and the user ignores this for some reason), and then running vscode results in lean segfaulting because the olean files are corrupt. (Although I think this wouldn't happen anyway because lake would notice the hash doesn't match and rebuild it.) Now leantar will avoid creating half a file and lake will try to rebuild anything that wasn't unpacked.


Last updated: May 02 2025 at 03:31 UTC