Zulip Chat Archive

Stream: lean4

Topic: Lake hashes are OS dependent


Mario Carneiro (Aug 30 2023 at 11:39):

While investigating an issue replicating lake hashes between lake exe cache downloads and lake itself, I noticed that the release URL is mixed into the job hash, and this includes OS dependent things, e.g. https://github.com/EdAyers/ProofWidgets4/releases/download/v0.0.13/linux-64.tar.gz is an input to the hash calculation of mathlib files. This is a problem, because cached files are not supposed to be OS dependent, and it will cause cache misses for everyone who is on a different OS than the CI machines

Mario Carneiro (Aug 30 2023 at 11:42):

Turning off preferReleaseBuild would solve the problem, but only the package itself gets to set this, there is no workspace override

Mario Carneiro (Aug 30 2023 at 11:42):

@Mac Malone

Scott Morrison (Aug 30 2023 at 11:43):

I'm confused: why aren't we seeing everyone complaining about lake exe cache get not working?

Martin Dvořák (Aug 30 2023 at 11:44):

And aren't we?

Mario Carneiro (Aug 30 2023 at 11:45):

I am currently debugging a problem of exactly that form

Mario Carneiro (Aug 30 2023 at 11:49):

For example, on commit cc016b64 of mathlib, running lake exe cache get unpacks a build/lib/Mathlib/Tactic/IrreducibleDef.trace containing 13360560255192827605, while lake build Mathlib.Tactic.IrreducibleDef sets the trace to 18234995878863652882. Running either one will reliably set the trace file to those respective numbers

Mario Carneiro (Aug 30 2023 at 11:50):

I think I misdiagnosed though, the hash of the release URL is not used in the hash for the target. I will keep investigating

Mario Carneiro (Aug 30 2023 at 13:55):

I've managed to replicate the lake hash algorithm to the extent that I can reproduce the answer 18234995878863652882, and it's not OS dependent except insofar as the release file linux-64.tar.gz has OS-dependent contents. (That is, all of these *.tar.gz files better be byte-identical or else mathlib will be sad.) But all of the uploaded files have wrong hashes, and I'm not sure why

Mauricio Collares (Aug 30 2023 at 14:03):

Any chance the .tar.gz file contains \r\n and exactly one element of the set {caching script, Lake} converts it into\n by mistake? I'm not sure if the .tar.gz file itself is being hashed.

Mario Carneiro (Aug 30 2023 at 14:15):

the tar.gz file is being hashed

Mario Carneiro (Aug 30 2023 at 14:15):

and I checked, all three releases are indeed byte-identical

Mauricio Collares (Aug 30 2023 at 14:30):

And do you get the other hash if you do perl -pne 's/\x0d\x0a/\x0a/g' < linux-64.tar.gz > linux-64-borked.tar.gz and replace the tarball with the borked tar.gz? Lake only does Lake.crlf2lf for source files (https://github.com/leanprover/lean4/blob/a7efe5b60e86b26fefd4795b46d66af369b97329/src/lake/Lake/Build/Module.lean#L23 uses TextFilePath for source files, which does line conversion, and FilePath in general, which doesn't) while the caching script seems to do it for every file (https://github.com/leanprover-community/mathlib4/blob/master/Cache/Hashing.lean#L68).

Mario Carneiro (Aug 30 2023 at 15:21):

The issue seems to have gone away after bumping mathlib, it was a combination of some out of date .trace files and a bad hash verification

Mario Carneiro (Aug 30 2023 at 15:22):

in any case, lean-cache will now tell you when the unpack results don't match what lake build does


Last updated: Dec 20 2023 at 11:08 UTC