Zulip Chat Archive

Stream: general

Topic: cache is incomplete


Eric Wieser (May 06 2022 at 14:30):

I'm sure there was a thread about this before, but I wasn't able to find it.
I often find that the cache that gets uploaded to azure doesn't actually work. See for instance any of:

All of these are empty commits with a single parent commit, and CI produced a cache for that parent commit; yet all of them required a partial rebuild anyway.

Eric Wieser (May 06 2022 at 14:34):

I vaguelly remember at one point we just had CI run lean make; lean make; lean make to workaround this? (edit: we still do, we only do it twice)

Floris van Doorn (May 06 2022 at 14:37):

To what extend is the order in which Lean compiles files deterministic and independent of the cache? If a build fails, the compilation stops, so you don't get a cache consisting of all files that compile (at, least, I think so). So maybe on the rebuild Lean starts compiling other files first?
Not sure if this is the reason though.

Eric Wieser (May 06 2022 at 14:41):

In all these cases the cache doesn't even contain the files needed to reproduce the failure

Eric Wieser (May 06 2022 at 14:43):

Ah, I guess our lean make; lean make hack doesn't work if the build fails the first time

Eric Wieser (May 06 2022 at 14:43):

Since it will bail before the second command begins

Floris van Doorn (May 06 2022 at 15:52):

Presumably the cache contains all dependencies of the broken file? If that isn't the case, then there is clearly something wrong...

Eric Wieser (May 06 2022 at 16:05):

No, it doesn't contain the dependencies either.

Eric Wieser (May 06 2022 at 16:05):

Or if it does, they're considered invalid and rebuilt

Floris van Doorn (May 06 2022 at 16:47):

oh, ok. Then that is a problem.

Floris van Doorn (May 06 2022 at 16:59):

I can reproduce that the cache of 502be36986 behaves the way Eric says: the error is reported in analysis/normed_space/basic, but if I recompile locally it starts recompiling from tactic/cancel_denoms

Mauricio Collares (May 06 2022 at 17:10):

Weird, the oleans from the GitHub artifact (https://github.com/leanprover-community/mathlib/suites/6402242134/artifacts/233212189) seem to work OK, but not the Azure version

Eric Wieser (May 06 2022 at 17:13):

So they're different as binaries?

Mauricio Collares (May 06 2022 at 17:14):

Yes

Mauricio Collares (May 06 2022 at 17:15):

The Azure version of src/tactic/cancel_denoms.olean has MD5 208fbe78b66c5bc6d924a25b7fc3e03d, the GH artifact version has MD5 2043d4862a2e327db6cc547f9fbd9b90.

Eric Wieser (May 06 2022 at 17:24):

Is either of those hashes the same a the parent commit?

Mauricio Collares (May 06 2022 at 17:29):

The Azure one has a timestamp of 05 May 2022, 01:13 in the archive, the GH one has a timestamp of 06 May 2022, 14:42 (there are files with older timestamps in the GH archive, just not this one).

Mauricio Collares (May 06 2022 at 17:30):

But there are only two olean files in the tactic folder with today's date in the Azure archive.

Mauricio Collares (May 06 2022 at 17:31):

I don't know how reliable the timestamps are, let me search for parent commit hashes.

Mauricio Collares (May 06 2022 at 17:44):

A possible theory is that Azure caches aren't overwritten by default: The first run for 502be36986 uploaded some stale oleans onto 502be36986a2a14757cb43a61f4cfec4edb7361c.tar.xz because it was cancelled (or suffered from an infra failure). The second run produced valid oleans but didn't upload them properly due to the pre-existing archive.

Mauricio Collares (May 06 2022 at 17:44):

Do you see the same problem with 2c79857a01? Here lean --make fails quite quickly (analysis/normed_space/basic.lean:475:24: error: failed to synthesize type class instance for...) with the Azure oleans for this commit.

Rob Lewis (May 06 2022 at 17:53):

This is plausible, Azure caches are not overwritten if they already exist. https://github.com/leanprover-community/mathlib/blob/master/.github/workflows/build.yml.in#L77

Rob Lewis (May 06 2022 at 17:53):

(I assume that's what --overwrite false means)

Mauricio Collares (May 06 2022 at 17:54):

And, according to the same file, something is pushed to Azure on the first run as long as the build started. So restarting a run doesn't help with fixing oleans if the worker OOMed the first time, for example. But a separate commit should have worked (and it does here, as far as I can tell).

Rob Lewis (May 06 2022 at 18:08):

#2039 is the relevant PR. I'm not sure what I meant at the time, and it seems like GitHub has deleted the logs from the demo runs. Why would the second build fail at the push stage? azcopy defaults to --overwrite true

Mauricio Collares (May 06 2022 at 18:11):

Conditioning on the fact that the build failed there, I'm guessing azcopy defaults to interactively asking the user if the file should be overwritten.

Mauricio Collares (May 06 2022 at 18:14):

Maybe it was prompting due to https://github.com/Azure/azure-storage-azcopy/pull/1538 before?

Rob Lewis (May 06 2022 at 18:14):

The docs could be wrong or the behavior could have changed in the past two years, but they claim it defaults to --overwrite true

Rob Lewis (May 06 2022 at 18:14):

Mauricio Collares said:

Maybe it was prompting due to https://github.com/Azure/azure-storage-azcopy/pull/1538 before?

Huh, I'm not sure what that means, but maybe!

Rob Lewis (May 06 2022 at 18:16):

In any case, we can test that --overwrite true does what we want

Mauricio Collares (Jul 23 2022 at 18:18):

The other thread (https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/.22saving.20olean.22.3F) reminded me of this. Is adding --overwrite true still worth a try?


Last updated: Dec 20 2023 at 11:08 UTC