Zulip Chat Archive
Stream: mathlib4
Topic: Failing CI
Etienne Marion (Jun 24 2025 at 08:27):
CI fails on both #26264 and #26320 with the same error at the "verify that everything was available in the cache" step. I tried merging master and then rerunning CI in #26320 but it didn't work.
Damiano Testa (Jun 24 2025 at 08:33):
I don't have time to investigate further, but some of the previous steps (e.g. getting the cache) have the following warning:
Removing old Mathlib build directories prior to cache fetch...
Attempting to fetch olean for Mathlib/Init.lean from cache...
Warning: Some Mathlib ecosystem tools assume that the git remote for `leanprover-community/mathlib4` is named `upstream`. You have named it `origin` instead. We recommend changing the name to `upstream`. Moreover, `origin` should point to your own fork of the mathlib4 repository. You can set this up with `git remote add upstream https://github.com/leanprover-community/mathlib4.git`.
Damiano Testa (Jun 24 2025 at 08:34):
The fact that the main CI build uses a setup that is not standard for the "Mathlib ecosystem" seems sub-optimal.
Etienne Marion (Jun 24 2025 at 08:34):
Oh interesting, thanks! It's strange because I'm pretty sure I did the renaming... I'll check it.
Damiano Testa (Jun 24 2025 at 08:36):
Ah, sorry, I had missed that this was the setup in the fork! I did not mean to criticise your setup, I thought that it was a fault purely on the CI side!
Damiano Testa (Jun 24 2025 at 08:36):
Anyway, if this really is the issue, then
- either the error message should be clearer about this;
- or (better) CI should be able to deal with unusual names for upstream and
masterorigin.
Etienne Marion (Jun 24 2025 at 08:38):
Well in fact it's probably an issue with CI, my git remote for leanprover-community/mathlib4 is indeed called upstream and the one for my fork origin.
Damiano Testa (Jun 24 2025 at 08:38):
Ok, unfortunately, I cannot debug this further, but hopefully someone else can assist!
Etienne Marion (Jun 24 2025 at 08:39):
No problem, thanks!
Anatole Dedecker (Jun 24 2025 at 08:41):
I have the same issue on #26041
Damiano Testa (Jun 24 2025 at 08:45):
I did do some little more debugging: it seems that the Configure Lean step tries to fetch the Mathlib cache and is only 99% successful.
Robin Carlier (Jun 24 2025 at 08:45):
Perhaps we could add a --no-warn-remoteflag to cache that CI can use?
Damiano Testa (Jun 24 2025 at 08:45):
I wonder if it is able to download the cache that is on the main repo, but not the one that has just been built and uploaded to your fork.
Damiano Testa (Jun 24 2025 at 08:47):
There has been some CI movement in this direction between yesterday and today: every time that it looked like it was fixed, another problem came up. This could be the latest. Surely, everything will work once this is fixed.
Etienne Marion (Jun 24 2025 at 09:08):
Same issue on #26238.
Bryan Gin-ge Chen (Jun 24 2025 at 12:46):
Does the error go away if you merge master again? It could be related to the fix in #26335.
Etienne Marion (Jun 24 2025 at 13:04):
#26320 still fails.
Bryan Gin-ge Chen (Jun 24 2025 at 13:11):
Thanks for testing! I think getting to the bottom of Damiano's comment here may be the key (this is a job for someone willing to dig into the workings of Cache+our CI, not necessarily you) :
Damiano Testa said:
I did do some little more debugging: it seems that the
Configure Leanstep tries to fetch the Mathlib cache and is only 99% successful.
Here's the relevant line in the latest build log of yours: https://github.com/leanprover-community/mathlib4/actions/runs/15850954631/job/44684495163?pr=26320#step:3:174
In any case the warning message given there doesn't apply.
Rémy Degenne (Jun 24 2025 at 13:13):
#26351 and #26353 have similar errors to #26320
Bhavik Mehta (Jun 24 2025 at 16:46):
#26364 has the same problem
Matthew Ballard (Jun 24 2025 at 16:47):
I got pulled away from the computer when I started looking at this.
I would be nice if cache spat out like 5 modules it failed on for debugging.
Damiano Testa (Jun 24 2025 at 16:52):
I think that consistently
- in
Build,upload cache, lake reports a number of files it should upload, presumably the modified ones and the files depending on those; Post-Build,Configure Leanhas a "Mathlib cache" clickable tab and it says "Attempting to download [some number of files]", but misses the download of the same number that is should have uploaded.
Damiano Testa (Jun 24 2025 at 16:52):
So, I think that while it would be good if lake reported which files it failed to download, I think that in this case it is "all the modified files and the ones that depend on them".
Damiano Testa (Jun 24 2025 at 16:53):
I wonder whether it fails to download them, because they were not stored correctly.
Matthew Ballard (Jun 24 2025 at 17:16):
Yes. This was my suspicion but I only looked at one example very briefly. I only saw it pulling from the central cache and not user cache.
Alastair Irving (Jun 24 2025 at 19:11):
I think I'm also seeing this in #26272.
Matthew Ballard (Jun 24 2025 at 20:35):
Are we seeing trouble pulling the cache in the fork directly?
Etienne Marion (Jun 24 2025 at 20:37):
You mean running lake exe cache get! on my local branch?
Matthew Ballard (Jun 24 2025 at 20:38):
If you create a new clone of your fork and switch the branch where you should have a cache of changes, what happens when you lake exe cache get there
Matthew Ballard (Jun 24 2025 at 20:42):
I can test this myself (at some point...) but it would be easier if someone is quickly in position to check
Sébastien Gouëzel (Jun 24 2025 at 20:44):
Pulling from the cache when I'm on my fork works fine, for instance for #26361: all files are downloaded (most of them from the mathlib-community cache, corresponding to files the PR hasn't changed, and 11 of them from the fork's cache). So, the files have definitely been uploaded.
However, if I'm not on my fork, i.e., if I'm on the main mathlib4, then the result is not the same: it does not download the files from my fork. The output is
PS C:\Users\sebas\Desktop\mathlib4 - 2> gh pr checkout 26361
PS C:\Users\sebas\Desktop\mathlib4 - 2> lake exe cache get
Warning: Some Mathlib ecosystem tools assume that the git remote for `leanprover-community/mathlib4` is named `upstream`. You have named it `origin` instead. We recommend changing the name to `upstream`. Moreover, `origin` should point to your own fork of the mathlib4 repository. You can set this up with `git remote add upstream https://github.com/leanprover-community/mathlib4.git`.
Using cache from origin: leanprover-community/mathlib4
Attempting to download 6878 file(s) from leanprover-community/mathlib4 cache
Downloaded: 6867 file(s) [attempted 6878/6878 = 100%] (99% success)
Matthew Ballard (Jun 24 2025 at 20:45):
Ah that is very helpful. Thank you!
Sébastien Gouëzel (Jun 24 2025 at 20:46):
So it looks like the issue is not the uploading phase, but rather the downloading one where it does not identify the fork correctly.
Matthew Ballard (Jun 24 2025 at 20:46):
Yes and that is the logic that has changed recently
Matthew Ballard (Jun 24 2025 at 20:48):
"We are not in a fork so we don't need to check the user cache"
Matthew Ballard (Jun 24 2025 at 20:54):
No, it is that the cache was getting put in the wrong place and get relied on that not to miss.
Sébastien Gouëzel (Jun 24 2025 at 20:55):
Another example where the logic is still broken: on my fork, after
PS C:\Users\sebas\Desktop\mathlib4SG> gh pr checkout 26366
(which is not one of my PRs), then lake exe cache get still tries to download from the main cache and from my fork's cache, but not from the PR's author cache.
PS C:\Users\sebas\Desktop\mathlib4SG> lake exe cache get
Using cache from origin: sgouezel/mathlib4
Attempting to download 924 file(s) from leanprover-community/mathlib4 cache
Downloaded: 722 file(s) [attempted 924/924 = 100%] (78% success)
Attempting to download 202 file(s) from sgouezel/mathlib4 cache
Downloaded: 0 file(s) [attempted 15/202 = 7%] -- I interrupted it
Matthew Ballard (Jun 24 2025 at 21:08):
Ok L109 in Well, that is actual syntax. But it errors out here and the fallback looks off.Cache.Requests looks like the problem. "@{upstream}" is a just plain ole string. (Your lines may vary)
Matthew Ballard (Jun 24 2025 at 21:36):
Ok, I think the following is the start of a fix: https://github.com/leanprover-community/mathlib4/blob/97a13d0d0981dd42a088436ad84cd495eec51aaa/Cache/Requests.lean#L185 fails in @Sébastien Gouëzel 's situation but git for-each-ref --points-at COMMITNUM succeeds
Matthew Ballard (Jun 24 2025 at 21:36):
I am pulled away again
Johan Commelin (Jun 25 2025 at 10:32):
Maybe we should add a --repo flag to exe cache get, so that we can explicitly pull from the right cache in CI?
Eric Wieser (Jun 25 2025 at 10:45):
Doesn't the flag already exist?
Bryan Gin-ge Chen (Jun 25 2025 at 10:50):
One thing I don't understand yet is why this only affects a subset of PRs. Is there something common to the cases where this pops up?
Johan Commelin (Jun 25 2025 at 10:52):
Eric Wieser said:
Doesn't the flag already exist?
For put it does... not sure about get
Sébastien Gouëzel (Jun 25 2025 at 11:00):
Bryan Gin-ge Chen said:
One thing I don't understand yet is why this only affects a subset of PRs. Is there something common to the cases where this pops up?
Doesn't it affect all PRs?
Robin Carlier (Jun 25 2025 at 11:03):
Johan Commelin said:
Eric Wieser said:
Doesn't the flag already exist?
For
putit does... not sure aboutget
The flag exists for get (and is undocumented :melting: ) but the catch is that you have to use a --repo=... syntax (--repo ... doesn’t work) and you need to put it before the get, not after, so
lake exe cache --repo=leanprover-community/mathlib4 get).
Johan Commelin (Jun 25 2025 at 11:07):
I think we should try to use that in CI
Johan Commelin (Jun 25 2025 at 11:07):
But I don't have time to implement it in the next 3 hrs
Bryan Gin-ge Chen (Jun 25 2025 at 11:12):
Sébastien Gouëzel said:
Doesn't it affect all PRs?
We're still managing to merge some PRs into master, so at least those are getting through without throwing errors in the Post-Build step.
Johan Commelin (Jun 25 2025 at 11:17):
I think that's because bors doesn't run the same setup as PR-CI. And I think that is a usability bug. We should probably fix it.
Sébastien Gouëzel (Jun 25 2025 at 11:22):
PRs which exhibit the Post-Build step error can still be sent to bors and be successfully merged there, because merging on master only needs to fetch from the main cache. I have the impression that all PRs on forks fail this step.
Bryan Gin-ge Chen (Jun 25 2025 at 11:22):
I could be missing something, but I don't see anything in the Post-Build step that should lead to a difference here between build_fork.yml and bors.yml though.
Matthew Ballard (Jun 25 2025 at 11:23):
As @Sébastien Gouëzel pointed out, it causes problems outside of CI with particular git states. I think the logic for finding the repo is a little off. If I can ever get a solid block, it should be straightforward to fix. If someone gets to it first, I can share what I know.
Bryan Gin-ge Chen (Jun 25 2025 at 11:23):
Sébastien Gouëzel said:
PRs which exhibit the Post-Build step error can still be sent to bors and be successfully merged there, because merging on master only needs to fetch from the main cache. I have the impression that all PRs on forks fail this step.
That's what I was missing, thanks!
Matthew Ballard (Jun 25 2025 at 11:25):
Regardless, passing the repo in CI is probably the best practice. Anything that tries to cover many cases in cache will have some edge cases.
Matthew Ballard (Jun 25 2025 at 11:26):
Here you would need to update lean-action
Anatole Dedecker (Jun 25 2025 at 13:01):
Do we have a short term solution here? Should maintainers just check what the CI error is and borsify if it's just this one? Or is that not safe?
Matthew Ballard (Jun 25 2025 at 13:03):
This line https://github.com/leanprover/lean-action/blob/7bd4de6ed238b1e48a311cb641e06b34fb96f1f2/action.yml#L175 should have the fix like #26233
Matthew Ballard (Jun 25 2025 at 13:04):
I am sitting down and thinking about the current logic now. Who knows how long that state will last...
Matthew Ballard (Jun 25 2025 at 13:08):
But if someone else can update lean-action and then the lakefile in the interim, CI should be ok :fingers_crossed:
Kim Morrison (Jun 25 2025 at 23:27):
There is a temporary hacky fix up at # chore: temporary fix retrieving the cache in CI #26417, co-authored with @Thomas Murrills live during office hours. :-)
Kim Morrison (Jun 25 2025 at 23:28):
The correct fix is to improve the existing logic in lake exe cache get, so that when it wakes up during CI it correctly identifies the source repository it should be retrieving from.
Kim Morrison (Jun 25 2025 at 23:28):
Note also that the lake exe cache get step before the build is not even attempting to retrieve oleans from the PR fork, so it is unnecessarily rebuilding these on every run. We really need to fix this.
Kim Morrison (Jun 25 2025 at 23:28):
Help very much appreciated, as I am swamped today. :-)
Thomas Murrills (Jun 26 2025 at 00:30):
Just to confirm that the hacky fix works: #26415, on which CI was previously failing for this reason, is now green :)
Bryan Gin-ge Chen (Jun 26 2025 at 01:50):
Thanks to both you and Kim for the fix!
Iván Renison (Jun 26 2025 at 06:55):
I just open #26425 and I'm getting the error, so for some reason hacky fix is not working in this case
Michael Rothgang (Jun 26 2025 at 07:04):
Is your PRs master branch new enough? (Or does that not matter any more?)
Iván Renison (Jun 26 2025 at 07:05):
Ah, I think that no
Iván Renison (Jun 26 2025 at 07:05):
I will try updating it
Iván Renison (Jun 26 2025 at 07:06):
What I don't find is the button for re-running the job
Rémy Degenne (Jun 26 2025 at 07:11):
I just merged master into #26351 and the CI error at the post-build step is still there.
Kim Morrison (Jun 26 2025 at 07:36):
Yeah, this is still fallout from Matthew's PR, I think.
Damiano Testa (Jun 26 2025 at 07:37):
Trying to debug this, I am wondering about the check the cache step: it tries to get the cache, but reports that there is nothing to download. Presumably, this is because it uses the cache that was just built. Would it make sense to delete the cache and check that the cache can be donwloaded instead?
Kim Morrison (Jun 26 2025 at 07:37):
The logic I had written yesterday would have compared the checked out commit to all the PR refs, and used that to locate the relevant fork (i.e. to match what we put in --repo during lake exe cache put). That logic was deleted without anything replacing it ...
Kim Morrison (Jun 26 2025 at 07:37):
No.
Kim Morrison (Jun 26 2025 at 07:38):
Damiano Testa said:
Trying to debug this, I am wondering about the
check the cache step: it tries to get the cache, but reports that there is nothing to download. Presumably, this is because it uses the cache that was just built. Would it make sense to delete the cache and check that the cache can be donwloaded instead?
That's not what is happening.
Kim Morrison (Jun 26 2025 at 07:38):
You can see from the log https://github.com/leanprover-community/mathlib4/actions/runs/15895172477/job/44825416890?pr=26351#step:3:130 that it is failing to download olean
Kim Morrison (Jun 26 2025 at 07:39):
Because git rev-parse is returning just HEAD, and (after the change last night), lake exe cache get doesn't know what to do with that.
Damiano Testa (Jun 26 2025 at 07:39):
That is in Configure Lean, I was wondering about check the cache in the Build (fork) step: https://github.com/leanprover-community/mathlib4/actions/runs/15895172477/job/44825075239?pr=26351.
Kim Morrison (Jun 26 2025 at 07:40):
Would it make sense to delete the cache and check that the cache can be donwloaded instead?
This is already tested in the post-build step.
Kim Morrison (Jun 26 2025 at 07:40):
Please don't move this test into the main build step. We are trying to move things out.
Damiano Testa (Jun 26 2025 at 07:40):
Ok, fine.
Kim Morrison (Jun 26 2025 at 07:40):
If anything, we should just drop the "test the cache" step in the main build step.
Damiano Testa (Jun 26 2025 at 07:41):
OK, and do you think that the cache has been uploaded then and the later steps are just failing to retrieve it, or that the cache has actually not been uploaded?
Kim Morrison (Jun 26 2025 at 07:42):
Yes, the cache has been uploaded, we're just trying to retrieve it from the wrong place.
Damiano Testa (Jun 26 2025 at 07:42):
Ok, I am understanding better now!
Kim Morrison (Jun 26 2025 at 07:42):
See the output that I linked above:
Current branch: HEAD
Using cache from origin: leanprover-community/mathlib4
Attempting to download 6884 file(s) from leanprover-community/mathlib4 cache
Kim Morrison (Jun 26 2025 at 07:43):
This is wrong, we need to identify at this point that we are looking at a commit coming from a fork.
Damiano Testa (Jun 26 2025 at 07:43):
Ok, thanks! I am starting to be able to read these messages better!
Kim Morrison (Jun 26 2025 at 07:45):
What I am surprised by is that there is no tracking information. This code from Requests.lean:
-- Fall back to using the remote that the current branch is tracking
let trackingRemote ← IO.Process.output
{cmd := "git", args := #["config", "--get", s!"branch.{currentBranch.stdout.trim}.remote"], cwd := mathlibDepPath}
let remoteName := if trackingRemote.exitCode == 0 then
trackingRemote.stdout.trim
else
-- If no tracking remote is configured, fall back to origin
"origin"
was intended to try and work out the remote associated to the current branch.
But given that currentBranch is apparently HEAD in this situation, that's not going to work.
Kim Morrison (Jun 26 2025 at 07:47):
The problem is when we run
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
ref: "${{ PR_BRANCH_REF }}"
in the post build step, we end up in a detached state:
/usr/bin/git checkout --progress --force d7f45861e9b66c8b6456a5c9d2ea199b933fdd92
Note: switching to 'd7f45861e9b66c8b6456a5c9d2ea199b933fdd92'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at d7f4586 Merge branch 'master' into coveringNumbers
Kim Morrison (Jun 26 2025 at 07:47):
Prior to #26407 deleting my "match against all PR refs" logic, I think this would have worked.
Damiano Testa (Jun 26 2025 at 07:47):
I think that this is another situation where I would have used the "checkout dance" hack.
Damiano Testa (Jun 26 2025 at 07:48):
Kim Morrison said:
Prior to #26407 deleting my "match against all PR refs" logic, I think this would have worked.
Do you want to revert this, then?
Damiano Testa (Jun 26 2025 at 07:49):
(I am going to be unavailable soon, and for a couple of hours.)
Kim Morrison (Jun 26 2025 at 07:49):
Here are our options:
- work out how to checkout the branch in such a way that there is a branch name and tracking information available
- revert #26407 --- it won't revert cleanly, as today we've changed a bunch of stuff on top
- use
--repoeverywhere, as I did in #26417 and then reverted in #26421, when I thought it was fixed. :-(
Kim Morrison (Jun 26 2025 at 07:50):
my preference is 1 then 2 then 3
Kim Morrison (Jun 26 2025 at 07:50):
but I am about to be offline for a few hours to watch my kids stage "Into the Woods". :-)
Kim Morrison (Jun 26 2025 at 07:51):
My complaint against option 3 is that it is putting the logic in multiple places: sometimes cache is working out where to retrieve from, sometimes we are overriding it. I'd prefer to keep the logic in one place.
Kim Morrison (Jun 26 2025 at 07:51):
My complaint against 2 is that the old logic for comparing against all PR refs was pretty janky and complicated.
Kim Morrison (Jun 26 2025 at 07:52):
But maybe 3 is okay after all...
Kim Morrison (Jun 26 2025 at 07:55):
re: 1. Claude is claiming to me that replacing
ref: "${{ PR_BRANCH_REF }}"
with
ref: ${{ github.head_ref }}
would suffice...
Kim Morrison (Jun 26 2025 at 07:57):
(This would have to happen in both the main step and the post-build step)
Kim Morrison (Jun 26 2025 at 07:57):
I can try that this evening if no one else has got there yet. :-)
Damiano Testa (Jun 26 2025 at 07:58):
I won't do it right now, since I am about to see a student, but can try it later, when I may also be able to fix follow-up errors.
Kim Morrison (Jun 26 2025 at 07:58):
Or maybe ref: ${{ github.head_ref || github.ref }} is better.
Kim Morrison (Jun 26 2025 at 07:58):
but double checking those are not hallucinations will be helpful. :-)
Kim Morrison (Jun 26 2025 at 11:35):
Okay, I am attempting this in #26434.
Ruben Van de Velde (Jun 26 2025 at 12:46):
Uh, I think this may not have worked as intended: https://github.com/leanprover-community/mathlib4/actions/runs/15901467227/job/44847564695#step:5:63
Kim Morrison (Jun 26 2025 at 12:52):
yeah, #26438 reverts it.
Kim Morrison (Jun 26 2025 at 12:53):
It's frustrating, as #26434 was following the example given on the actions/checkout README for when you don't want a detached HEAD. :-(
Kim Morrison (Jun 26 2025 at 12:54):
Okay, time for 2. above, I think, attempting the restore some the functionality dropped in #26407. I'll see if I can slim it down.
Sébastien Gouëzel (Jun 26 2025 at 12:56):
#26407 was already trying to fix some failures, it was failing before it and #26407 was an attempt to fix it. So reverting it won't fix anything.
Damiano Testa (Jun 26 2025 at 12:57):
Well, in the good old CS tradition, it may fix something... :slight_smile:
Kim Morrison (Jun 26 2025 at 12:58):
I don't mean revert it, I mean the revert that part of it that with no explanation deleted the logic I'd written to handle exactly this case :-)
Matthew Ballard (Jun 26 2025 at 13:03):
That logic looked very broken in all the tests I ran. You probably want --points-at to find the target branch in the detached head state. I had thought but not communciated clearly that there would be more testing to see if this was needed. It is hard to know what is wanted when no one writes any behavioral tests or issues with comments in the code.
Kim Morrison (Jun 26 2025 at 13:09):
Okay, here's another attempt at it: #26439
Kim Morrison (Jun 26 2025 at 13:09):
I've tested on master, nightly-testing and my testing PR in #26435. Running on that locally gives the promising looking:
Current branch: ci_test
Using cache from PR #26435 source: kim-em/ci_test (commit 230ee37c found in PR ref)
Attempting to download 2 file(s) from kim-em/mathlib4 cache
Kim Morrison (Jun 26 2025 at 13:10):
My apologies, Matthew, for the snarky criticism. I regret the way I stated it: I've had a very tiring day, and I very much appreciate everyone's help and encouragement getting our CI working again.
Matthew Ballard (Jun 26 2025 at 13:45):
No worries. I know you have an exceptionally large plate and appreciate all your efforts to keep us alive day to day
Matthew Ballard (Jun 26 2025 at 15:16):
Kim's last PR has us green in CI. As you go about your daily business, please report back issues with the cache so we can attempt to address them.
Matteo Cipollina (Jul 21 2025 at 08:56):
I've opened #27315, and it failed post-build CI Post-Build Step (fork). Is it related to the Failing CI issues discussed in this post? I am confused on what I should do with this error. Thanks in advance
https://github.com/leanprover-community/mathlib4/pull/27315
Ruben Van de Velde (Jul 21 2025 at 08:59):
Looks like a network issue, I restarted the job
Matteo Cipollina (Jul 21 2025 at 09:03):
Ruben Van de Velde ha scritto:
Looks like a network issue, I restarted the job
it's all all green now, thank you!
Miyahara Kō (Jul 26 2025 at 02:15):
My PR(#27498) is failed post-bulid CI. Is it error of the server?
Ruben Van de Velde (Jul 26 2025 at 06:41):
Weird. Let's try again
Bryan Gin-ge Chen (Jul 26 2025 at 13:39):
This is strange. Here's the link to the latest failing run. I don't see anything suspicious in the earlier steps and it looks like this has been re-run several times so I don't think it's a network error.
I tried reproducing this locally and I'm not sure if I got my mathlib4 in the same state, but after running:
lake exe cache clean!
gh pr checkout 27498
lake exe cache --repo=Komyyy/mathlib4 get
lake build --no-build -v Mathlib
I also get an exit code of 3, with the last line of log output being:
⣟ [7002/7003] Running mathlib/Mathlib:default
Anyone recognize what might be going on here?
Damiano Testa (Jul 26 2025 at 14:19):
Could it be that the error was maybe in the uploading the cache step, before the failing step, so that re-running did not redo the step that actually had a problem, but just the check?
Bryan Gin-ge Chen (Jul 26 2025 at 14:35):
Hmm, here's the "Upload cache" step. I downloaded the raw log and checked with a regex that the only errors in the upload steps are the "The specified blob already exists." which is expected, since I guess the files were uploaded in a previous try.
Bryan Gin-ge Chen (Jul 26 2025 at 17:00):
I pushed a merge commit to the branch, in the hopes that the issue is just that some random file in the cache is corrupted and rebuilding it will fix things somehow. It'd be nice if we could have better diagnostics for what's going on in cases like this, maybe I'm missing some command line options.
edit: seems to have worked!
Miyahara Kō (Jul 27 2025 at 01:30):
@Ruben Van de Velde @Bryan Gin-ge Chen
Thank you to fix my PR!
Junyan Xu (Jul 27 2025 at 13:19):
"Post-Build Step (fork): verify that everything was available in the cache" also failed at #27482 and was fixed by merging master.
Bryan Gin-ge Chen (Jul 30 2025 at 13:13):
#27676 seems to be suffering from a similar issue, which didn't go away even after merging master once.
Bryan Gin-ge Chen (Jul 30 2025 at 17:06):
I've created #27691 to hopefully help with this issue; I still don't understand the root cause though.
Bryan Gin-ge Chen (Aug 06 2025 at 04:26):
Unfortunately, this still seems to be an issue: https://github.com/leanprover-community/mathlib4/actions/runs/16766739009/job/47473800135
In the log above, note that lake build --no-build passed on the first try after building mathlib in the "Build" job, but then failed in the "Post-Build Step" job.
Bryan Gin-ge Chen (Aug 06 2025 at 04:39):
Assuming lake build --no-build is deterministic, we must not be running it on the same set of olean files, which then suggests something is going wrong in the upload step. However, I don't see anything suspicious in the logs there. Anyone have any ideas?
Robin Carlier (Aug 06 2025 at 12:29):
FTR, #27284 seems to have the same issue. (https://github.com/leanprover-community/mathlib4/actions/runs/16772799598/job/47495772115)
Junyan Xu (Aug 06 2025 at 22:43):
Now this step seems to be printing more diagnostic info:
ℹ [129/604] Replayed Batteries.Data.String.Basic
...
ℹ [456/604] Replayed proofwidgets/widgetPackageLock
...
ℹ [6761/7030] Replayed Batteries.Data.Array.Match
...
ℹ [6762/7030] Replayed Batteries.Data.String.Matcher
...
##[error]Process completed with exit code 3.
Junyan Xu (Aug 06 2025 at 22:44):
log 6_verify that everything was available in the cache.txt from this run
Bryan Gin-ge Chen (Aug 06 2025 at 22:47):
That was added in #27691 (adding the -v command line flag to the lake build commands). I think we may need to add logging to cache somehow to make sure that everything uploaded is what we want though.
Bryan Gin-ge Chen (Aug 07 2025 at 00:45):
I've opened #28062 which will upload the contents of mathlib4 after lake build and lake build --no-build to a workflow artifact. After this gets merged, the next time this error occurs, hopefully comparing what's in that artifact with what's in the cloud cache will let us figure out what's happening.
Violeta Hernández (Aug 07 2025 at 03:29):
PR #28063 adds some module deprecation tags, and idk if I'm doing something wrong because I simply get an exit 1 code in the logs.
https://github.com/leanprover-community/mathlib4/actions/runs/16794175516/job/47561452811?pr=28063
Bryan Gin-ge Chen (Aug 07 2025 at 03:31):
Ah, it's a little confusing, but that step just combines the results of the previous two steps. There's a warning in the Counterexamples folder that you'll need to suppress: https://github.com/leanprover-community/mathlib4/actions/runs/16794175516/job/47561452811?pr=28063#step:25:16
Bryan Gin-ge Chen (Aug 07 2025 at 03:32):
Let me add a more sensible message there to one of my PRs.
Violeta Hernández (Aug 07 2025 at 03:38):
Oh thanks! I forgot to deprecate that one file.
Junyan Xu (Aug 07 2025 at 07:30):
BTW I recently found that if shake suggests removing some import from a Mathlib file, it doesn't seem to suggest adding it back in Archive/Counterexample files where it's necessary.
Yaël Dillies (Aug 07 2025 at 07:51):
This is by design: Mathlib and Archive are different libraries, and shake only works within the scope of a single library`
Bryan Gin-ge Chen (Aug 07 2025 at 17:07):
I won't have time to look into this until later, but Claude wrote me a script that helped me find the following examples where the cache verification is failing; if someone would like to dig through the workflow artifacts and see if they match the oleans in the cache, that'd be greatly appreciated!
Bryan Gin-ge Chen (Aug 07 2025 at 17:11):
Oh wait, these artifacts don't contain oleans. I forgot that they're saved somewhere else now... :man_facepalming:
Bryan Gin-ge Chen (Aug 07 2025 at 17:32):
The artifacts should be more useful after #28092 is merged.
Aaron Liu (Aug 07 2025 at 22:14):
Just encountered this on #28096 :(
Thomas Murrills (Aug 08 2025 at 01:43):
To add another case, I’m hitting this on #28066 (logs). (I tried adding an empty commit, then tried merging master, with no luck.)
Bryan Gin-ge Chen (Aug 08 2025 at 01:52):
Thanks for the report and log link! Unfortunately you may have to try merging master again after some more commits land. Since we don't know the root cause yet, I unfortunately don't have a better suggestion at the moment.
However, we might be able to learn something from the workflow artifact (1.53GB) from the run you linked; I'm going to take a look now.
Bryan Gin-ge Chen (Aug 08 2025 at 02:49):
OK, I can confirm that running lake build --no-build on the copy of mathlib contained in the artifact succeeds, but it fails if I checkout and get the cache corresponding to the commit on your branch using:
gh pr checkout 28066
git checkout 57bf93cd4cd19c879e704a22a308d0bd69921a55
rm -rf .lake/ # to wipe out anything I had in my copy of mathlib previously
lake exe cache --repo=thorimur/mathlib4 get
Here's a gist containing the results of diff -r --brief mathlib4-from-artifact/.lake/ mathlib4-from-cache/.lake/ > diff.txt. There seem to be tons of differences between the contents of the two directories, but probably most of them aren't important.
Here are the lines involving .olean* files; I suspect only these are relevant:
Files mathlib4-from-artifact/.lake/build/lib/lean/Mathlib/Algebra/ContinuedFractions/Computation/CorrectnessTerminating.olean and mathlib4-from-cache/.lake/build/lib/lean/Mathlib/Algebra/ContinuedFractions/Computation/CorrectnessTerminating.olean differ
Files mathlib4-from-artifact/.lake/build/lib/lean/Mathlib/Algebra/ContinuedFractions/Computation/CorrectnessTerminating.olean.hash and mathlib4-from-cache/.lake/build/lib/lean/Mathlib/Algebra/ContinuedFractions/Computation/CorrectnessTerminating.olean.hash differ
Only in mathlib4-from-artifact/.lake/build/lib/lean: mk_all.olean
Only in mathlib4-from-artifact/.lake/build/lib/lean: mk_all.olean.hash
Only in mathlib4-from-artifact/.lake: lakefile.olean.lock
Only in mathlib4-from-artifact/.lake/packages/proofwidgets/.lake: lakefile.olean.lock
Since the workflow upload step (log) is right before the upload cache step (log), it seems something in the upload cache step must be rebuilding olean files (?)
Bryan Gin-ge Chen (Aug 08 2025 at 09:47):
(I'm out of ideas here; if someone with more knowledge of Cache could take a look, I'd appreciate it!)
Michael Rothgang (Aug 08 2025 at 12:44):
#28070 also seems to have had such an issue; I've retried the build job now.
Bryan Gin-ge Chen (Aug 08 2025 at 12:52):
Unfortunately, I don't retrying the build will help since at this point the cache is already poisoned with non-working oleans... I think you'll have to merge master, but even that doesn't always work.
Michael Rothgang (Aug 08 2025 at 12:54):
Never mind: that job actually had a genuine failure, except that is was hidden:
here is the new log: the "build archive step" failed (one line was too long) --- but only the next step errorred. Is there a way this error can be surfaced earlier? I was a bit puzzled at first (but could certainly adjust).
Bryan Gin-ge Chen (Aug 08 2025 at 12:58):
Yes, that's an unfortunate side-effect of trying to ensure that both Archive and Counterexamples are built even if one or the other fails. Maybe there's a way of doing this which makes it more obvious, but this way seemed simplest.
Junyan Xu (Aug 08 2025 at 14:54):
Bryan Gin-ge Chen (Aug 09 2025 at 02:06):
@Mario Carneiro we're having a problem in CI where several times a day, lake build --no-build on Mathlib succeeds before uploading the cache, but then fails in a job that downloads the cache and tries to verify that lake build --no-build succeeds. Is it possible that leantar (or the way that Cache is using it) is compressing an olean into something that forces a rebuild after decompressing?
See these two messages: for a link to a workflow artifact from right before uploading the cache (where lake build --no-build succeeds) and some commands to grab the cache that was uploaded for the corresponding commit (where lake build --no-build fails).
Apologies for bugging you if it's not leantar, but I'm near my wit's end trying to figure this out. Let me know if there's anything I can provide to help here.
Thomas Murrills (Aug 09 2025 at 02:32):
I am unsure if this is related, but since it involves the cache, just in case:
I added a commit to #28133 (which had never hit this issue) which only modified a test in MathlibTest. However, although CI was green before this commit, it rebuilt Mathlib from the PR's starting point anyway after finding 0% of files in the cache. I would have expected it to find all files in the cache and just build the modified file in MathlibTest.
Mario Carneiro (Aug 09 2025 at 09:41):
@Mac Malone Is the new lake cache feature enabled?
Mario Carneiro (Aug 09 2025 at 10:47):
I think part of the issue is that grind is generating nondeterministic proofs. Working on the above branch, I can't get two versions of .lake/build/lib/lean/Mathlib/Algebra/ContinuedFractions/Computation/CorrectnessTerminating.olean that agree
Mario Carneiro (Aug 09 2025 at 11:27):
Here's a MWE: Try running this file from a fresh state (after Restart File). I observe several possible hashes at the end: 830471404, 2267473999, 1107825591
import Mathlib.Algebra.ContinuedFractions.Computation.Translations
import Mathlib.Algebra.ContinuedFractions.TerminatedStable
import Mathlib.Algebra.ContinuedFractions.ContinuantsRecurrence
import Mathlib.Order.Filter.AtTopBot.Basic
import Mathlib.Tactic.FieldSimp
import Mathlib.Tactic.Ring
namespace GenContFract
open GenContFract (of)
theorem compExactValue_correctness_of_stream_eq_some.extracted_1_3
{K : Type*} [Field K] [LinearOrder K] {v : K} [FloorRing K] :
let g := of v;
∀ (n : ℕ) (ifp_n : IntFractPair K),
let pconts := g.contsAux (n + 1);
let ppconts := g.contsAux n;
ifp_n.fr ≠ 0 →
let ppA := ppconts.a;
let ppB := ppconts.b;
let pA := pconts.a;
let pB := pconts.b;
let f := Int.fract (1 / ifp_n.fr);
↑⌊1 / ifp_n.fr⌋ * pA + ppA + pA * f = pA * (1 / ifp_n.fr) + ppA →
↑⌊1 / ifp_n.fr⌋ * pB + ppB + pB * f = pB * (1 / ifp_n.fr) + ppB →
(IntFractPair.of (1 / ifp_n.fr)).fr = f →
(ppA * ifp_n.fr + pA) / (ppB * ifp_n.fr + pB) =
if (IntFractPair.of (1 / ifp_n.fr)).fr = 0 then
(↑⌊1 / ifp_n.fr⌋ * (g.contsAux (n + 1)).a + (g.contsAux n).a) /
(↑⌊1 / ifp_n.fr⌋ * (g.contsAux (n + 1)).b + (g.contsAux n).b)
else
(↑⌊1 / ifp_n.fr⌋ * (g.contsAux (n + 1)).a + (g.contsAux n).a +
(g.contsAux (n + 1)).a * (IntFractPair.of (1 / ifp_n.fr)).fr) /
(↑⌊1 / ifp_n.fr⌋ * (g.contsAux (n + 1)).b + (g.contsAux n).b +
(g.contsAux (n + 1)).b * (IntFractPair.of (1 / ifp_n.fr)).fr) := by grind
open Lean
run_cmd
let env ← getEnv
let some (.thmInfo ci) := env.find? ``GenContFract.compExactValue_correctness_of_stream_eq_some.extracted_1_3._proof_1_1
| throwError ""
println! ci.value.hash
Bryan Gin-ge Chen (Aug 09 2025 at 11:46):
It makes sense that this started showing up when we started using grind then.
Mario Carneiro (Aug 09 2025 at 11:48):
I think it's not a full explanation though, as the lake build --no-build issue does not depend on lean being deterministic as long as the hashing process is deterministic. (I will leave it to someone else to de-mathlibify the MWE and report the nondeterminism issue to lean)
Mario Carneiro (Aug 09 2025 at 11:50):
but it does make double checking the results more difficult
Mario Carneiro (Aug 09 2025 at 12:07):
Oh I see why this is happening. We currently reuse oleans from previous partial runs, but in the example run, nothing was found in the cache and therefore all the files were rebuilt, and because of this nondeterminism issue the new oleans no longer match the upstream oleans. When we later try to clear the build directory and retrieve the cache, we get a mixture of files uploaded from the fresh run and files from the old run, which disagree about the olean that should have been produced on a given input, so lake detects the inconsistency and rebuilds
Mario Carneiro (Aug 09 2025 at 12:10):
So now the question is: why isn't the cache hitting? seems to be the root cause. (Well, lean nondeterminism is very bad, I don't think we can tolerate it in mathlib. That's also a root cause.)
Mario Carneiro (Aug 09 2025 at 12:17):
I wonder if the reason for the rebuilding is because sometimes files are first built on a fork, and then later added to mathlib, meaning that the same file hash can end up in a fork cache and then on mathlib's cache. Since mathlib doesn't copy cache files from forks, that means we can't avoid rebuilding the lean file in this situation, resulting in inconsistencies when the fork later tries to mix and match
Bryan Gin-ge Chen (Aug 09 2025 at 12:18):
I think the issue with the cache not hitting started when we moved to the fork-based contribution model and started uploading to caches associated to each fork repo — I think @Matthew Ballard was working on this? (I wasn’t following all the details at the time so I could be thinking of something else…)
Mac Malone (Aug 09 2025 at 14:18):
Mario Carneiro said:
Mac Malone Is the new lake cache feature enabled?
No.
Bryan Gin-ge Chen (Aug 09 2025 at 19:55):
Just linking another thread related to the cache not hitting: #mathlib4 > Mathlib CI doesn't get all available cache
Bryan Gin-ge Chen (Aug 10 2025 at 19:50):
Mario Carneiro said:
(I will leave it to someone else to de-mathlibify the MWE and report the nondeterminism issue to lean)
Sebastian Kumar (Aug 10 2025 at 22:05):
Bryan Gin-ge Chen (Aug 10 2025 at 23:50):
Junyan Xu said:
The immediate culprit here is the lemma RootPairing.InvariantForm.exists_apply_eq_or in LinearAlgebra.RootSystem.Finite.Lemmas, which uses grind, in particular exists_apply_eq_or._proof_1_4 turns out to be nondeterministic.
Sebastian Kumar said:
Here the issue is the lemma preCantorSet_antitone in Topology.Instances.CantorSet, which uses grind, in particular preCantorSet_antitone._proof_1_6 turns out to be nondeterministic.
Bryan Gin-ge Chen (Aug 10 2025 at 23:52):
I'm going to open a PR to revert the grind proofs in these examples so at least they won't cause future issues. (We can add back grind after lean#9825 is fixed and / or the behavior of Cache is fixed.)
Edit: I've created the branch, but as it seems there's a fix in the works, I'll hold off on the PR for now.
Kenny Lau (Aug 11 2025 at 19:29):
#28148 is now failing the same step in the CI, is this the same error?
Damiano Testa (Aug 11 2025 at 20:22):
The logs show errors in Batteries and warning for circular imports that appear to be from the new module system?
Kenny Lau (Aug 11 2025 at 20:24):
is that in like a log only the maintainers can see
Kenny Lau (Aug 11 2025 at 20:25):
hmm now i also see it for some reason
Damiano Testa (Aug 11 2025 at 20:26):
I don't think that only the maintainers can see it: if you click the triangle in the failing log, you see all the output.
Kenny Lau (Aug 11 2025 at 20:26):
what should I do?
Damiano Testa (Aug 11 2025 at 20:28):
I don't think that this is something that you can change, except possibly by merging master and retrying.
Damiano Testa (Aug 11 2025 at 20:28):
One of the CI steps that involve the cache has been changed some time today: that would be the first place I would look for issues, but I am no longer at a computer for the day...
Damiano Testa (Aug 11 2025 at 20:29):
Pinging @Bryan Gin-ge Chen, since I think that he made the change.
Damiano Testa (Aug 11 2025 at 20:30):
It might also just be some unrelated quantum behaviour of grind.
Bryan Gin-ge Chen (Aug 11 2025 at 20:34):
I didn't see errors in Batteries in the log I was looking at, but when I downloaded the artifact and checked for discrepancies in oleans per , it looks like the failure here was due to RootPairing.InvariantForm.exists_apply_eq_or in LinearAlgebra.RootSystem.Finite.Lemmas again (cf. )
Kenny Lau (Aug 11 2025 at 20:35):
why doesn't the CI tell you where the inconsistency was?
Bryan Gin-ge Chen (Aug 11 2025 at 20:36):
We haven't made it that smart yet, and I'm hoping things will be fixed upstream so quickly we won't have to...
Bryan Gin-ge Chen (Aug 11 2025 at 20:38):
Bryan Gin-ge Chen said:
I'm going to open a PR to revert the
grindproofs in these examples so at least they won't cause future issues. (We can add backgrindafter lean#9825 is fixed and / or the behavior of Cache is fixed.)Edit: I've created the branch, but as it seems there's a fix in the works, I'll hold off on the PR for now.
I've opened the PR now: #28272 Hope I'm using adaptation_note correctly!
Kim Morrison (Aug 12 2025 at 06:55):
The nondeterminism issue has been solved in lean#9867
Kim Morrison (Aug 12 2025 at 06:55):
I am planning on cutting v4.23.0-rc1 in two days time; are we okay to wait until then? There's no particular reason I couldn't just pull the release forward if this issue is sufficiently annoying right now.
Bryan Gin-ge Chen (Aug 12 2025 at 14:10):
2 days is probably fine - when the fix makes it to nightly, I'd like to confirm that the examples above are fixed.
Weiyi Wang (Aug 12 2025 at 14:26):
Possibly unrelated to the previous issue. CI for PRs seems to fail in a new way now: #27282 #27885
Iván Renison (Aug 12 2025 at 14:28):
Junyan Xu (Aug 12 2025 at 14:31):
Junyan Xu (Aug 12 2025 at 14:32):
image.png
Visiting github.com/leanprover/elan/releases/latest/ now shows there are no releases.
Weiyi Wang (Aug 12 2025 at 14:33):
:octopus: Github says no mathing today
Bernhard Reinke (Aug 12 2025 at 20:00):
Seems to be working again, could someone rerun the CI for #25966 ?
Bernhard Reinke (Aug 13 2025 at 06:36):
Thanks @Bryan Gin-ge Chen ! It seems like the "lint and suggest" job still has a spurious failure, could someone please rerun that job as well?
Emily Riehl (Aug 13 2025 at 17:10):
I have another PR #28085 that's ready to merge but is failing CI for perhaps the same reason discussed above. Is there anything an author can do to fix it?
Bryan Gin-ge Chen (Aug 13 2025 at 17:11):
It looks like it's already on the bors queue, so I think there's nothing you need to do at the moment.
Bryan Gin-ge Chen (Aug 13 2025 at 17:13):
(The standard fix for this issue has been to merge master and hope it goes away, but if you do that now, that will take the PR off the bors queue and someone will have to approve it again.)
Emily Riehl (Aug 13 2025 at 18:05):
I didn't realize the bors queue could overcome errors like this. I'll just sit tight then. Thanks.
Heather Macbeth (Aug 14 2025 at 00:19):
CI is failing on my PR #27562 at the "verify that everything was available in the cache" step, does anyone know how to fix this?
Bryan Gin-ge Chen (Aug 14 2025 at 01:01):
Until we move to the next release of Lean (tonight / tomorrow morning, I think?) the only recourse is to merge master or push more commits that force rebuilds and hope it doesn't show up again.
Bryan Gin-ge Chen (Aug 18 2025 at 01:32):
Just a heads up: we've switched over to a new cache host recently and have been seeing some download failures, e.g. at #26074 and in #nightly-testing > Mathlib `lake update` failure @ 💬
Please post here if you see this happen to your PR and let us know if your PR needs to be put back on the bors queue.
Kenny Lau (Aug 18 2025 at 23:30):
27215 is failing because the server was down, could someone please rerun this
Notification Bot (Nov 04 2025 at 13:39):
3 messages were moved from this topic to #mathlib4 > "verify that everything was available in the cache" fails CI by Bryan Gin-ge Chen.
Yaël Dillies (Nov 24 2025 at 14:18):
Is anyone else hitting reviewdog failures? https://github.com/leanprover-community/mathlib4/actions/runs/19636763847/job/56229287309?pr=31883
Michael Rothgang (Nov 24 2025 at 14:19):
Yes! See e.g. #mathlib4 > "lint and suggest" CI step failing - hopefully, things are resolved again
Last updated: Dec 20 2025 at 21:32 UTC