Zulip Chat Archive
Stream: mathlib4
Topic: cache down?
Kim Morrison (Jan 23 2024 at 04:43):
I'm seeing no cache available on mathlib master at the moment. Anyone else?
% git describe
v2024-405-ge808ef0e67
% lake exe cache get
Attempting to download 4149 file(s)
Downloaded: 0 file(s) [attempted 4149/4149 = 100%] (0% success)
Yury G. Kudryashov (Jan 23 2024 at 06:27):
Same here
Yury G. Kudryashov (Jan 23 2024 at 06:29):
It looks like uploads fail, not downloads: cache works for 69a8f62cf1
Yury G. Kudryashov (Jan 23 2024 at 06:33):
99315bf71cec8eb94c2cc14e8198a032a7d9b12c #9725 is the first commit w/o a cache
Yury G. Kudryashov (Jan 23 2024 at 06:34):
Nothing suspicious in the log but I'm not sure that "Attempting to upload N file(s)" reports an error if it happens.
Jeremy Tan (Jan 23 2024 at 06:40):
I note that GitHub is rolling out a new behaviour for CI – it now always shows you the last 1000 lines of log. And the build times for our repository are now even slower than before – building counterexamples used to take less than a minute, now it takes 3+ minutes
Eric Rodriguez (Jan 23 2024 at 11:57):
The reason we don't seem to notice this is that currently the "check the cache" steps are toothless - the files are already in place so they're not redownloaded in CI to make sure stuff is working properly. I will push a fix for this.
Eric Rodriguez (Jan 23 2024 at 12:03):
#9933, I expect the run to fail
David Renshaw (Jan 23 2024 at 12:52):
It looks to me like maybe the upload curl command does not check the HTTP status code of the response. https://github.com/leanprover-community/mathlib4/blob/2475f305cecbfdf1fb57959e0634db06563c985a/Cache/Requests.lean#L206-L208
David Renshaw (Jan 23 2024 at 12:53):
The --fail-with-body
flag might help: https://curl.se/docs/manpage.html#--fail-with-body
David Renshaw (Jan 23 2024 at 12:55):
(Lean's static curl is new enough to support that flag: https://github.com/leanprover-community/static-curl/releases/tag/v7.88.1 )
David Renshaw (Jan 23 2024 at 13:04):
My hypothesis is that the upload is returning a 4XX or 5XX HTTP response code (maybe Bad Request because it's too big somehow, or maybe something auth-related), and then we don't get to see what the error is because we ignore stderr (we set stderrAsErr := false
: https://github.com/leanprover-community/mathlib4/blob/2475f305cecbfdf1fb57959e0634db06563c985a/Cache/IO.lean#L159-L165
Eric Rodriguez (Jan 23 2024 at 13:08):
David, could you make a patch to the CI that runs this and outputs the error code in order to see what the issue is? One thing I do worry about is that it may accidentaly leak the secrets if this is done
David Renshaw (Jan 23 2024 at 13:10):
I submitted #9934 to try --fail-with-body
, but looks like that CI run isn't going to be so informative, because the machine seems to already have the cache locally.
David Renshaw (Jan 23 2024 at 13:11):
Maybe I should make a trivial change inside of mathlib, so that it needs to upload something
David Renshaw (Jan 23 2024 at 13:11):
One thing I do worry about is that it may accidentaly leak the secrets if this is done
Makes me want a Lean binding to libcurl, so that we would have tighter control.
Eric Rodriguez (Jan 23 2024 at 13:13):
For fast CI running, you could in theory just clear most of Mathlib I think?
David Renshaw (Jan 23 2024 at 13:28):
Aha, we got ~4000 lines of
curl: (22) The requested URL returned error: 403
David Renshaw (Jan 23 2024 at 13:29):
https://github.com/leanprover-community/mathlib4/actions/runs/7626176854/job/20772003823
David Renshaw (Jan 23 2024 at 13:34):
403 means authorization failure ("forbidden").
David Renshaw (Jan 23 2024 at 13:34):
I wonder if we need to refresh secrets.MATHLIB_CACHE_SAS
. I don't know how that stuff is set up.
Yury G. Kudryashov (Jan 23 2024 at 14:40):
In the meantime, 5a0540a5b1 is the last revision with cache.
Matthew Ballard (Jan 23 2024 at 17:43):
Does the FRO cache also have a secrets problem?
David Renshaw (Jan 23 2024 at 17:48):
It looks to me like the FRO/R2 code puts the secret token in a command-line argument, which potentially gets printed on error.
David Renshaw (Jan 23 2024 at 17:49):
The Azure codepath (currently active) puts the token in a config file, which does not get printed.
Matthew Ballard (Jan 23 2024 at 17:50):
Does the FRO code path error?
David Renshaw (Jan 23 2024 at 17:52):
If the curl
subprocess exits with a non-zero code (and we should assume that can happen), then it prints out the args: https://github.com/leanprover-community/mathlib4/blob/e53e613abdea8d5b8863cbfb93014b9493e719d1/Cache/IO.lean#L161-L162
David Renshaw (Jan 23 2024 at 17:53):
It's possible that Github actions has some smart logic that will redact the token from logs, but I wouldn't count on it.
David Renshaw (Jan 23 2024 at 17:56):
I'm also a bit confused about why the Cloudflare R2 code path is being referred to as the "FRO" path (e.g. the useFROCache
variable name). My understanding is that the FRO is footing the bill under both configurations?
Mario Carneiro (Jan 23 2024 at 18:17):
I've added a new MATHLIB_CACHE_SAS
. If someone who is using CI could report whether uploads work now that would be helpful
David Renshaw (Jan 23 2024 at 18:21):
I rebased #9934, triggering a job.
Matthew Ballard (Jan 23 2024 at 18:24):
This seems to work
David Renshaw (Jan 23 2024 at 18:33):
on #9939 (based on an older commit) we get
curl: (22) The requested URL returned error: 409
David Renshaw (Jan 23 2024 at 18:34):
maybe that's the expected error for cache files that have already been uploaded?
Mario Carneiro (Jan 23 2024 at 18:47):
seems likely
Mario Carneiro (Jan 23 2024 at 18:48):
I think there are ways to make curl treat specific http response codes as a success
David Renshaw (Jan 23 2024 at 19:25):
I got the HTTP 409 error also on #9934, -- but less than 1000 instances of it.
David Renshaw (Jan 23 2024 at 19:25):
And then when I do lake exe cache get
locally for that branch, it succeeds.
David Renshaw (Jan 23 2024 at 19:26):
So I think this confirms
- Mario's token update fixed things
- the 409 response indicates cache files that have already been uploaded (typically from other commits)
David Renshaw (Jan 23 2024 at 19:27):
I suppose we'll really see whether things are fixed once the next commit lands on master
David Renshaw (Jan 23 2024 at 19:37):
actually, cache for the latest master commit seems to be working now too. :tada:
Jeremy Tan (Jan 23 2024 at 23:27):
Yes, but why are tests still taking 3+ minutes rather than about 1?
https://github.com/leanprover-community/mathlib4/actions/runs/7631758536/job/20790451359#step:22:8
Mario Carneiro (Jan 23 2024 at 23:35):
this has nothing to do with cache
Mario Carneiro (Jan 23 2024 at 23:35):
you may need to bisect to find when this happened
Mauricio Collares (Jan 24 2024 at 11:09):
(The slowdown discussion here continues at https://leanprover.zulipchat.com/#narrow/stream/287929-mathlib4/topic/Testing.20slowdown)
Last updated: May 02 2025 at 03:31 UTC