Zulip Chat Archive

Stream: std4

Topic: Mathlib bump patches


François G. Dorais (Aug 17 2023 at 08:24):

I've taken the habit of creating Mathlib patch PR along with each of my Std PR. The purpose is two-fold: first to check for unexpected side effects, and then to help Mathlib maintainers following Std updates. There's a few issues in my current process and I'm wondering there might be a better way.

The first part, testing a Std PR against Mathlib isn't problematic unless Mathlib is using an old version of Std. The best fix here is for Mathlib to track Std as closely as possible. Of course, the patches are helpful for exactly that!

The second part is important after the Std PR is merged. Then, I need to update the patch so that the lakefile points to Std's main branch instead of the PR branch, update the lake-manifest to point to the right version of Std, re-run CI and hand off to Mathlib maintainers. So far so good but here's the first catch: the patches need to be applied in Mathlib in the exact same order as the PRs were merged into Std. The reason for that is that the CI step won't work until the previous patch has been applied. Mathlib maintainers can roll several patches into a larger bump patch to speed up this process but the pace of Std PRs is not that hectic.

The second catch has to do with testing a Std PR against Mathlib. While waiting for the patches to be merged into Mathlib, testing becomes very difficult. The main options are to either use an old Std or to roll out your own temporary Mathlib patch. Neither option makes much sense.

I'm not sure how to fix this. The key issue is to better communicate where these Mathlib patches are so that Mathlib maintainers can easily find them and also for Std contributors to use them to roll out a temporary patch when necessary. Note that Mathlib patches are not required for Std PRs and they are not always necessary, so looking around is not a reliable way to find them.

Kevin Buzzard (Aug 17 2023 at 08:26):

Thanks a lot for thinking about this problem! With lean 3 there were fewer moving parts but now there are more and we're clearly still getting to grips with this.

Scott Morrison (Aug 17 2023 at 10:18):

@François G. Dorais more descriptive titles for these PRs would be great. Just now I'm hunting to the ones about Nat.strongRecOn, and the one about the substring recursion, but after I reach https://github.com/leanprover-community/mathlib4/pulls/fgdorais I just have to click through all of them.

Scott Morrison (Aug 17 2023 at 10:21):

Further, for something like https://github.com/leanprover-community/mathlib4/pull/6622, I would actually stack it on top of the dependent PR, so we can see that CI works.

I've just bors'd the dependent PR, and I would like to put this on the queue too, but can't yet because of the lack of green tick.

François G. Dorais (Aug 17 2023 at 12:52):

Yes, I will definitely use more descriptive titles.

The dependency issue is a bit more complex since I can't know in advance what will be merged when. It's also tricky to find other people's patches or whether they exist.

James Gallicchio (Aug 17 2023 at 15:43):

can we perhaps make a std-bump branch of mathlib which mathlib bump PRs target?

James Gallicchio (Aug 17 2023 at 15:44):

although it is easier to review each bump PR individually instead of multiple bump PRs melted together :/

James Gallicchio (Aug 17 2023 at 15:45):

We could set up a bot that tracks each std bump PR, and every time a PR is merged into Std the corresponding PR is put onto the stack of PRs a la https://github.com/ezyang/ghstack ?

James Gallicchio (Aug 17 2023 at 15:46):

But that's a lot of infrastructure for something that is not yet a huge problem

François G. Dorais (Aug 17 2023 at 16:05):

It's not a huge problem by any means, but I've delayed testing several Std PRs on Mathlib because Mathlib was behind Std. It also makes the task of testing against Mathlib tricky and confusing for new contributors.

François G. Dorais (Aug 17 2023 at 16:12):

Thanks for bringing up ghstack. It doesn't look like a good fit but it does open the realm of possibilities.

François G. Dorais (Aug 17 2023 at 16:26):

Another, perhaps tangential, issue is that it makes more sense for the Mathlib patch to be reviewed along side the Std PR. As a concrete example, I got a good question on an already merged PR this morning. It's always possible to make correction PRs when necessary but that can degenerate into a feedback loop. (Not in this particular example but it's not hard to imagine how that could happen.)

James Gallicchio (Aug 17 2023 at 17:20):

yep, agreed

James Gallicchio (Aug 17 2023 at 18:56):

mario agrees that we need a dependent PR bot. maybe some day.

Scott Morrison (Aug 24 2023 at 01:10):

I think a lot of the suffering of keeping Mathlib and Std in sync would be ameliorated by some automation that does the trivial bumps.

Proposal, round one:

  • every commit to Std master automatically generates a Mathlib branch that just runs lake update std and commits the manifest.
  • at the conclusion of CI,
    • on success, create Mathlib PR with auto-merge-after-CI
    • on failure, post on a topic here, with a link to the CI failure

Proposal, round two:

  • at every commit to Std master, check if there is an existing Mathlib branch that is labelled as corresponding to underlying Std PR
    • if so, and that branch has a green tick, edit its lakefile to point back to main on Std, run lake update std, and commit, and add the label auto-merge-after-CI
  • if such a branch doesn't exist, follow the routine of round one.

Scott Morrison (Aug 24 2023 at 01:11):

I'm not volunteering to write this quite yet, but it seems plausible we could make this happen if needed.

James Gallicchio (Aug 24 2023 at 04:05):

I'm volunteering to write it :) I'm gonna work on it this weekend

Scott Morrison (Aug 24 2023 at 04:15):

Great. This will require that Std CI can create Mathlib branches, Mathlib PRs, and label Mathlib PRs. I think I will need to create a token for you and install it as a secret on the Std repository.

Unless you tell me it needs to work differently, I will create a token under the leanprover-community-mathlib4-bot account, with relevant permissions on the Mathlib repository, and install it as secrets.MATHLIB4_TOKEN on the Std repository.

(I'm presuming here that you're going to implement everything as a github action.)

Mario Carneiro (Aug 24 2023 at 04:16):

I actually wanted to ask about this, do we have a general purpose bot we can add code to? Where is the code hosted?

Scott Morrison (Aug 24 2023 at 04:16):

No, everything is github actions.

James Gallicchio (Aug 24 2023 at 04:16):

I was gonna make a GH action/bot that works on generic Lean repos cuz I wanna also use it for other projects downstream of Std

Mario Carneiro (Aug 24 2023 at 04:16):

that sounds like a "yes"

Mario Carneiro (Aug 24 2023 at 04:17):

I don't really know how bots are piloted

Scott Morrison (Aug 24 2023 at 04:19):

Sometimes the github actions framework is a pain to work with, for perfectly reasonable security reasons.

But I think in James' case here, everything is running after a commit to master, so the action has access to the secrets.

James Gallicchio (Aug 24 2023 at 04:20):

I do need to figure out where to store some of the state/configuration we need to store for each repo... maybe a lean file in the root? :thinking:

James Gallicchio (Aug 24 2023 at 04:24):

actually maybe annotations in the lake file

Scott Morrison (Aug 24 2023 at 04:24):

What state is that? Stateless is always nice. :-)

Scott Morrison (Aug 24 2023 at 04:32):

@James Gallicchio, MATHLIB4_TOKEN should be available on the Std repo now. It expires Aug 24 2024, but that is recorded in our big list of tokens. :-)

James Gallicchio (Aug 24 2023 at 04:44):

I'm sort of hoping that I don't need it, but thank you in advance for when I do :joy:

James Gallicchio (Aug 24 2023 at 04:49):

Thank gosh we just built a simple http library! :)

James Gallicchio (Aug 24 2023 at 06:03):

Scott Morrison said:

What state is that? Stateless is always nice. :-)

Rather than having heuristics for "is this Mathlib PR associated with this Std commit/PR?" I want to just track that info explicitly. Stateless is nice though. This is a thinker :thinking:

James Gallicchio (Aug 24 2023 at 06:07):

I also need to double-check how much I can steal from a former employer's internal tooling without violating my NDA.

James Gallicchio (Aug 24 2023 at 06:10):

Nevermind, they've published a whole talk on the design :-)

Scott Morrison (Aug 24 2023 at 06:27):

I'd highly recommend doing the simple thing first (i.e. what I suggested in "round one"), and only if that works and is usable bother doing any of the complex stuff ("round two") that requires matching up PRs.

Scott Morrison (Aug 24 2023 at 06:29):

Storing any state outside of the github PRs is I think a no go. I don't think we want to make commits to the repositories, or rely on non-github servers. I think links between PRs, comments on PRs, and labels has to be enough.

Yakov Pechersky (Aug 24 2023 at 08:46):

There's also the option of using GitHub job artifacts

Yakov Pechersky (Aug 24 2023 at 08:47):

Regarding testing GitHub workflows, there is nektos/act

James Gallicchio (Aug 24 2023 at 15:10):

Scott Morrison said:

Storing any state outside of the github PRs is I think a no go. I don't think we want to make commits to the repositories, or rely on non-github servers. I think links between PRs, comments on PRs, and labels has to be enough.

at minimum I'd like to have a centralized server to do the webhook dance of propagating updates (ie letting repos "subscribe" to dependency repo's updates) to avoid the complexity of storing tokens for downstream projects in upstream projects

James Gallicchio (Aug 24 2023 at 15:11):

but yes for now we can limit ourselves to Std and Mathlib so that we can test the workflow without doing that

James Gallicchio (Aug 31 2023 at 19:11):

So, I've been thinking about this lots. I think I've settled on a very good abstraction that can be built on top of GH, but it will require lots of infrastructure and heavy control over both VC and CI in the participating repos, so I'm gonna hold off there.

From Scott's proposals, here's what I'm thinking as the goal:

in an Std PR, when given !update-mathlib comment:

  • open Mathlib branch & PR. mark with some tag like blocked-on-std-pr
  • commit to branch with lakefile pointing to the Std PR branch and manifest updated
  • reply to the PR with the link to the new Mathlib branch (use this to track the connection later)

every commit to Std main:

  • figure out if it's from a PR or was just a direct commit
    • this is .. not easy. but given that Std uses squash-merge we can get the info from the commit name I think
  • if PR:
    • find corresponding Mathlib PR link in comments (if no exist, see below)
    • remove blocked-on-std-pr tag
    • push commit to branch with lakefile pointing to Std main and manifest updated to the corresponding commit
      • note: this might change the actual dependency version, since nothing is keeping mathlib bump PRs updated when Std PRs are updated
  • if direct commit or no existing Mathlib PR:
    • open Mathlib branch & PR
    • commit an updated manifest pointing to the commit

A few questions I have:
1) How does auto-merge-after-CI work? If a PR is auto-merge-after-CI and fails CI, is the tag removed?
2) How should we handle keeping Mathlib PRs up to date as a Std PR changes? I know this was not in scope in Scott's original proposals, but it would be very helpful to me personally :)
3) How do we merge Mathlib PRs in the order they were merged in Std? And do we allow for merging two update PRs into one so that they don't need to be stacked correctly?

Mac Malone (Aug 31 2023 at 19:35):

James Gallicchio said:

  • figure out if it's from a PR or was just a direct commit
    • this is .. not easy. but given that Std uses squash-merge we can get the info from the commit name I think

Can this not be easily done with a GitHub Action on a PR merge?

James Gallicchio (Aug 31 2023 at 19:46):

but then how do I tell the direct commit action to not trigger? :stuck_out_tongue:

James Gallicchio (Aug 31 2023 at 19:47):

honestly i'd be pretty okay with just never committing directly to main, but that's mostly up to @Mario Carneiro

Mario Carneiro (Aug 31 2023 at 19:48):

Yeah, rather than looking for the commit you can just check whether github thinks the PR has been merged

Mario Carneiro (Aug 31 2023 at 19:50):

and you don't need an "every commit" action, it can just be a scheduled action which does nothing if master hasn't changed

Mario Carneiro (Aug 31 2023 at 19:52):

same thing for the PR actions, that can be a scheduled bot

James Gallicchio (Aug 31 2023 at 19:59):

sorry, I don't really understand what you mean. do you mean Mathlib have a scheduled bot to check whether the upstream Std PR has been merged?

James Gallicchio (Aug 31 2023 at 22:36):

I agree that that is the much nicer direction to go (and much simpler), but I'm not sure how we'd ensure Mathlib bump PRs are merged in the same order as Std PRs

James Gallicchio (Aug 31 2023 at 22:39):

maybe once a PR is merged into Std, we keep track of the base commit and head commit of the PR (which in the case of squash merge is the parent of the squashed commit and the squashed commit, respectively), and then we only allow a bump PR to merge if the current master's std commit matches the expected base commit.

James Gallicchio (Aug 31 2023 at 22:40):

If all commits to Std main are PRs, this would actually work quite well I think.

Mario Carneiro (Aug 31 2023 at 22:46):

the mathlib merging doesn't happen automatically anyway, the bot should only be labeling things as ready to merge

James Gallicchio (Aug 31 2023 at 22:48):

Scott seemed to indicate he'd be happy with auto-merging, but that's up to the mathlib maintainers.

Mario Carneiro (Aug 31 2023 at 22:48):

If there are no mathlib changes, an auto-merge seems okay, but if there are any changes required then it needs a review

Mario Carneiro (Aug 31 2023 at 22:49):

but if there are no mathlib changes then there aren't PR ordering issues in the first place

James Gallicchio (Aug 31 2023 at 22:51):

I want to avoid accidentally reverting the manifest in some weird race condition bors edge case :joy: but tracking the base commit seems like a reasonable solution anyways.

Mario Carneiro (Aug 31 2023 at 22:53):

this is why you never commit to master :smile:

Mario Carneiro (Aug 31 2023 at 22:54):

bors is the only one with the keys

James Gallicchio (Aug 31 2023 at 23:03):

one other thing -- do we plan to merge Std PRs before the mathlib bump PR is compiling?

suppose Std has two PRs open, A and B
Mathlib has a bump PR for A which does not compile, and a bump PR for B which does
then Std merges A and B

If we blindly updated mathlib's bump PR for B to point to the merge commit for B, then bump PR B will be broken for reasons unrelated to B

James Gallicchio (Aug 31 2023 at 23:05):

There is of course a general way to handle this, where we change the base commit for the B bump PR to point to the A bump PR (that way it's obvious why B bump PR is broken). But that's much more difficult :stuck_out_tongue:

James Gallicchio (Aug 31 2023 at 23:06):

but for Std/Mathlib it seems reasonable for us to just straight up say that we won't commit to Std until the bump PR compiles on Mathlib.

Scott Morrison (Aug 31 2023 at 23:20):

One general thing to keep in mind here is that this might be a finite process, and not warrant automation. Obviously we plan for infinitely many more commits to Std. :-) But the fraction of them that require Mathlib changes will hopefully eventually drop off, once we have:

  • migrated all the theory out of Mathlib that belongs in Std
  • established the right criteria so new material is not introduced to Mathlib that should be in Std
  • similarly for tactics.

If that's really the case, the argument for automating coordination of PRs is weaker (although obviously it is tempting not to suffer in the short term).

James Gallicchio (Aug 31 2023 at 23:23):

I have secret incentives for working out the PR coordination stuff here -- I need it for managing my already-not-small DAG of projects, which are currently unmaintainable due to the dependency coordination problem ;)

Scott Morrison (Aug 31 2023 at 23:44):

Maybe we need a "virtual mono repo" that compiles PRs down to the constituent repos!

Mario Carneiro (Sep 01 2023 at 00:35):

I think https://github.com/josh-project/josh is intended for that, although I don't know too much about it

James Gallicchio (Sep 01 2023 at 01:19):

Scott Morrison said:

Maybe we need a "virtual mono repo" that compiles PRs down to the constituent repos!

That's my bigger plan, at least. Once you add some extra structure on top of the VC ("features", which are like a branch + a base commit in a parent branch), you can have really nice monorepo-esque behavior across repos.

Josh seems to do the opposite, breaking an actual monorepo up into multiple git repos? which seems odd but //

James Gallicchio (Sep 01 2023 at 05:47):

is there a reasonable way to automatically rewrite a project lakefile from a lakescript? since it's a Lean file and we haven't actually worked out pretty-print/autoformatting I'm not sure there's a better approach than calling out to sed...

James Gallicchio (Sep 01 2023 at 05:54):

although require std from git ... is a super clean pattern to replace, so I think I'm overthinking it

Scott Morrison (Sep 01 2023 at 06:45):

If you don't care about whitespace, or human-readability, pretty printing a lakefile should be fine.

Scott Morrison (Sep 01 2023 at 06:45):

But I have to say: if you are thinking about automatically generated lakefiles ...

Scott Morrison (Sep 01 2023 at 06:45):

#xy

James Gallicchio (Sep 01 2023 at 14:30):

Not the whole lakefile, just for the step of substituting PR/main branches into the lakefile require. Alternatively, I could leave the lakefile untouched, just changing the manifest, and tell people to not lake update? Alternatively-alternatively, there could be another file tracking the desired git repo/branch and I just read from that into the lakefile

James Gallicchio (Sep 01 2023 at 14:33):

I think I most prefer having a separate json file for auto-tracked dependencies, and reading that in for the lakefile. Thoughts?

Mario Carneiro (Sep 01 2023 at 14:43):

You can safely assume no one is doing lake update

Mario Carneiro (Sep 01 2023 at 14:43):

although if you change the URL in the lake-manifest.json lake will complain that it doesn't match and suggest you run lake update

James Gallicchio (Sep 01 2023 at 14:45):

aw shucks, okay. then that's a non-option, given that most Std PRs come from forks

Mario Carneiro (Sep 01 2023 at 14:45):

the downside is that you yourself can't run lake update because it will clobber your changes, but without that you might be missing changes to other repos

Mario Carneiro (Sep 01 2023 at 14:45):

Mario Carneiro said:

although if you change the URL in the lake-manifest.json lake will complain that it doesn't match and suggest you run lake update

The complaint is a warning, it will honor the lake-manifest.json anyway

James Gallicchio (Sep 01 2023 at 14:46):

Not a great UX here for Lake to tell people to run lake update if lake update causes everything to break :laughing:

Mario Carneiro (Sep 01 2023 at 14:46):

yeah this has been discussed elsewhere, lake can't tell if you are a user or maintainer and outputs maintainer-focused messages

James Gallicchio (Sep 01 2023 at 14:47):

Mario Carneiro said:

without that you might be missing changes to other repos

Yeah -- that's partially why I think it's best to have a separate manifest and have the lakefile read from there.

Mario Carneiro (Sep 01 2023 at 14:47):

sed

Mario Carneiro (Sep 01 2023 at 14:48):

I'm pretty sure this matches most historical examples of mathlib bump patches

James Gallicchio (Sep 01 2023 at 14:48):

agree -- for v0.1 sed's fine :big_smile:

Patrick Massot (Sep 01 2023 at 15:31):

James Gallicchio said:

Not a great UX here for Lake to tell people to run lake update if lake update causes everything to break :laughing:

We should have a bot posting that message everyday instead of having to find a different human to do it.

James Gallicchio (Sep 06 2023 at 06:32):

So, small update on this -- I don't think GH Actions lets you run a scheduled job on a PR branch. So if we want to have bump PRs automatically incorporate upstream changes we'll need to go through some sort of centralized thing. Not ideal.

James Gallicchio (Sep 06 2023 at 06:35):

Now I'm wondering how dependabot does it

Eric Wieser (Sep 06 2023 at 06:52):

Scheduled jobs can run actions/checkout on any branch they like

Eric Wieser (Sep 06 2023 at 06:52):

You only have a problem if the workflow.yaml only exists on a branch

James Gallicchio (Sep 06 2023 at 06:57):

Yeah. It just means the job has to be run for every PR on the repo, instead of being run separately. I can still do it in the framework of github actions but it becomes substantially more annoying

James Gallicchio (Sep 06 2023 at 06:58):

unless there's a way to use the matrix job workflow somehow ?? I will investigate

James Gallicchio (Sep 06 2023 at 21:53):

actually, we could use the Std token to just update Mathlib bump patches directly (in addition to opening patches when PRs are opened on Std).

so then Std has a few workflows:

  • on PR opened: open a PR on Mathlib
  • on PR commit push: push update to respective bump PR
  • on PR merge: push update to respective bump PR with new manifest commit

James Gallicchio (Sep 06 2023 at 21:54):

this is much nicer than trying to iterate through PRs... I'll try to implement it on two of my libraries and keep y'all posted


Last updated: Dec 20 2023 at 11:08 UTC