Zulip Chat Archive
Stream: mathlib4
Topic: build failing
Michael Stoll (Jan 26 2024 at 14:50):
On #10029, commit acf36f6, "build mathlib" fails at the "get cache" stage with:
Attempting to download 285 file(s)
Downloaded: 0 file(s) [attempted 1/285 = 0%]
Downloaded: 0 file(s) [attempted 48/285 = 16%]
Downloaded: 0 file(s) [attempted 100/285 = 35%]
Downloaded: 0 file(s) [attempted 147/285 = 51%]
Downloaded: 0 file(s) [attempted 151/285 = 52%]
Downloaded: 0 file(s) [attempted 199/285 = 69%]
Downloaded: 0 file(s) [attempted 202/285 = 70%]
Downloaded: 0 file(s) [attempted 246/285 = 86%]
Downloaded: 0 file(s) [attempted 280/285 = 98%]
Downloaded: 0 file(s) [attempted 285/285 = 100%] (0% success)
Warning: some files were not found in the cache.
This usually means that your local checkout of mathlib4 has diverged from upstream.
If you push your commits to a branch of the mathlib4 repository, CI will build the oleans and they will be available later.
/home/lean/.cache/mathlib/f97fb2ce8d3ecb1e.ltar: failed to fill whole buffer
uncaught exception: leantar failed with error code 1
Decompressing 3875 file(s)
Error: Process completed with exit code 1.
Restarting the build had the same result. Can I do anything about that?
Matthew Ballard (Jan 26 2024 at 14:53):
Add white space to lake-manifest.json
for a workaround
Michael Stoll (Jan 26 2024 at 14:53):
Anywhere?
Matthew Ballard (Jan 26 2024 at 14:56):
See also #mathlib4 > Why does CI fail on my branch?
Michael Stoll (Jan 26 2024 at 14:56):
I added an empty line somewhere. Now it tries to download 4160 files, but unsuccessfully...
Matthew Ballard (Jan 26 2024 at 14:56):
Oh maybe not an empty line but just some space between existing characters
Michael Stoll (Jan 26 2024 at 15:01):
Can you be more precise? This time I added a blank between "packages" and "Dir", and the build fails immediately :
error: ./lake-manifest.json: improperly formatted manifest: manifest missing required property 'packagesDir'
(This is on a Mathlib PR; the one before (#10028), which CI was working on shortly before, did not have this problem.)
Mauricio Collares (Jan 26 2024 at 15:03):
I think Matthew means you should commit this whitespace change to lake-manifest.json and wait for CI to rebuild everything
Michael Stoll (Jan 26 2024 at 15:04):
That's what I did (commit and push, then wait to see what CI is doing).
Mauricio Collares (Jan 26 2024 at 15:05):
Ah, sorrt: the whitespace change shouldn't be inside one of the strings in the json file
Mauricio Collares (Jan 26 2024 at 15:05):
Add a whitespace at the end of a line, for example
Mauricio Collares (Jan 26 2024 at 15:05):
(deleted)
Matthew Ballard (Jan 26 2024 at 15:07):
I did it after a colon for example
Mauricio Collares (Jan 26 2024 at 15:07):
It would be good if someone with Azure access could lake exe cache put!
to fix the cache
Michael Stoll (Jan 26 2024 at 15:08):
OK; merging master seems to have worked.
Do we know what the root cause for this (apparently recurring) problem is?
Eric Rodriguez (Jan 26 2024 at 15:11):
https://leanprover.zulipchat.com/#narrow/stream/144837-PR-reviews/topic/.239933.20actually.20check.20cache.20works it'd be nice if we actually fixed the "check cache" step so that we don't have random cache failures
Mauricio Collares (Jan 26 2024 at 15:34):
Also, a corrupted file should just be skipped, no need to interrupt the whole decompression process
Mario Carneiro (Jan 28 2024 at 07:41):
Actually it doesn't interrupt the decompression process
Mario Carneiro (Jan 28 2024 at 07:42):
but it does return a nonzero error code, which causes lake exe cache get
to fail
Mario Carneiro (Jan 28 2024 at 07:42):
although it is on the last step, so nothing is skipped
Mauricio Collares (Jan 28 2024 at 07:50):
Ah, good to know! Sorry for not checking.
Mario Carneiro (Jan 28 2024 at 07:51):
This is now fixed in #10063, leantar will now remove corrupted files on its own (and you can either run lake build
to rebuild the missing files like you would for a partial cache get
, or run lake exe cache get
again if it was a network issue)
Sebastian Ullrich (Jan 28 2024 at 08:09):
How do they even get corrupted in the first place? Are we insufficiently handling error output from curl?
Mauricio Collares (Jan 28 2024 at 08:11):
Not sure if this still happens, but hitting Ctrl-C during the download (after a network stall, for example) used to leave corrupted files
Sebastian Ullrich (Jan 28 2024 at 08:15):
Right, but we have IO.FS.rename now. This is solvable.
Mario Carneiro (Jan 28 2024 at 08:35):
unknown. We are using IO.FS.rename
in the download process already
Kevin Buzzard (Jan 28 2024 at 09:09):
What happens if the 4.5 gig download size catches an unsuspecting student out, the original attempt fails, the student deletes 10 gigs of mp3s and tries again?
Mario Carneiro (Jan 28 2024 at 09:25):
I'm not sure I understand the question
Mario Carneiro (Jan 28 2024 at 09:26):
The current architecture is already supposed to be immune to torn files, but #10063 adds yet another defense layer against bad files
Mario Carneiro (Jan 28 2024 at 09:28):
If you interrupt lake exe cache get
halfway through the process, you will get half the files, and all the files you get will be valid. The second attempt will not try to redownload those files, but if somehow you still managed to corrupt some files those files will be deleted and you can rerun lake exe cache get
to download better ones
Mario Carneiro (Jan 28 2024 at 09:32):
There is one scenario where I think corrupt files can get through, which is where lake exe cache put
is interrupted or is partially successful and uploads torn files to the cloud. In that case every lake exe cache get
will download those files, and prior to #10063 you will get a message like failed to fill whole buffer
when unpacking it (but this message is otherwise harmless, except that you didn't get the file and something may need to be recompiled). After #10063 you will instead get the message removing corrupted file
and you will still have to rebuild it; if you use lake exe cache get
it will download the bad file again, notice it is corrupted and delete it again
Mario Carneiro (Jan 28 2024 at 09:33):
The "add whitespace to lake-manifest.json
" trick is a way to work around bad files in the cloud
Kevin Buzzard (Jan 28 2024 at 09:39):
Mario Carneiro said:
I'm not sure I understand the question
I think you answered it anyway. Maybe "I got interrupted because my drive filled up" is no different to "I pressed Ctrl-C"
Mario Carneiro (Jan 28 2024 at 09:47):
It is slightly different since different parts of the code are likely to fail in that case. The failed to fill whole buffer
error is mainly associated with .ltar
files which are truncated due to network or curl issues, while if you run out of space during unpacking it is the .olean
writing that will fail. #10063 adds an additional protection against issues in the .olean
writing stage, namely that it rolls back (deletes) any files it produced in the course of an unpacking operation that is aborted halfway through due to an IO error (such as running out of disk space).
So this avoids the situation where lake exe cache get
runs, leantar
gives an error about the disk being full (and the user ignores this for some reason), and then running vscode results in lean segfaulting because the olean files are corrupt. (Although I think this wouldn't happen anyway because lake would notice the hash doesn't match and rebuild it.) Now leantar
will avoid creating half a file and lake will try to rebuild anything that wasn't unpacked.
Last updated: May 02 2025 at 03:31 UTC