Zulip Chat Archive
Stream: lean4
Topic: Lake hashes are OS dependent
Mario Carneiro (Aug 30 2023 at 11:39):
While investigating an issue replicating lake hashes between lake exe cache
downloads and lake itself, I noticed that the release URL is mixed into the job hash, and this includes OS dependent things, e.g. https://github.com/EdAyers/ProofWidgets4/releases/download/v0.0.13/linux-64.tar.gz
is an input to the hash calculation of mathlib files. This is a problem, because cached files are not supposed to be OS dependent, and it will cause cache misses for everyone who is on a different OS than the CI machines
Mario Carneiro (Aug 30 2023 at 11:42):
Turning off preferReleaseBuild
would solve the problem, but only the package itself gets to set this, there is no workspace override
Mario Carneiro (Aug 30 2023 at 11:42):
@Mac Malone
Scott Morrison (Aug 30 2023 at 11:43):
I'm confused: why aren't we seeing everyone complaining about lake exe cache get
not working?
Martin Dvořák (Aug 30 2023 at 11:44):
And aren't we?
Mario Carneiro (Aug 30 2023 at 11:45):
I am currently debugging a problem of exactly that form
Mario Carneiro (Aug 30 2023 at 11:49):
For example, on commit cc016b64 of mathlib, running lake exe cache get
unpacks a build/lib/Mathlib/Tactic/IrreducibleDef.trace
containing 13360560255192827605
, while lake build Mathlib.Tactic.IrreducibleDef
sets the trace to 18234995878863652882
. Running either one will reliably set the trace file to those respective numbers
Mario Carneiro (Aug 30 2023 at 11:50):
I think I misdiagnosed though, the hash of the release URL is not used in the hash for the target. I will keep investigating
Mario Carneiro (Aug 30 2023 at 13:55):
I've managed to replicate the lake hash algorithm to the extent that I can reproduce the answer 18234995878863652882
, and it's not OS dependent except insofar as the release file linux-64.tar.gz
has OS-dependent contents. (That is, all of these *.tar.gz
files better be byte-identical or else mathlib will be sad.) But all of the uploaded files have wrong hashes, and I'm not sure why
Mauricio Collares (Aug 30 2023 at 14:03):
Any chance the .tar.gz file contains \r\n
and exactly one element of the set {caching script, Lake}
converts it into\n
by mistake? I'm not sure if the .tar.gz
file itself is being hashed.
Mario Carneiro (Aug 30 2023 at 14:15):
the tar.gz
file is being hashed
Mario Carneiro (Aug 30 2023 at 14:15):
and I checked, all three releases are indeed byte-identical
Mauricio Collares (Aug 30 2023 at 14:30):
And do you get the other hash if you do perl -pne 's/\x0d\x0a/\x0a/g' < linux-64.tar.gz > linux-64-borked.tar.gz
and replace the tarball with the borked tar.gz? Lake only does Lake.crlf2lf for source files (https://github.com/leanprover/lean4/blob/a7efe5b60e86b26fefd4795b46d66af369b97329/src/lake/Lake/Build/Module.lean#L23 uses TextFilePath
for source files, which does line conversion, and FilePath
in general, which doesn't) while the caching script seems to do it for every file (https://github.com/leanprover-community/mathlib4/blob/master/Cache/Hashing.lean#L68).
Mario Carneiro (Aug 30 2023 at 15:21):
The issue seems to have gone away after bumping mathlib, it was a combination of some out of date .trace
files and a bad hash verification
Mario Carneiro (Aug 30 2023 at 15:22):
in any case, lean-cache
will now tell you when the unpack results don't match what lake build
does
Last updated: Dec 20 2023 at 11:08 UTC