Zulip Chat Archive

Stream: mathlib4

Topic: Auto-label PR topics

Damiano Testa (Aug 14 2024 at 11:54):

Following on the conversation about PR labels, does the list below seem like a good choice of auto-labels?

CI
t-algebra
t-algebraicgeometry
t-algebraictopology
t-analysis
t-categorytheory
t-combinatorics
t-computability
t-condensed
t-control
t-data
t-deprecated
t-dynamics
t-fieldtheory
t-geometry
t-grouptheory
t-informationtheory
t-init
t-lean
t-linearalgebra
t-logic
t-mathport
t-measuretheory
t-modeltheory
t-numbertheory
t-order
t-probability
t-representationtheory
t-ringtheory
t-settheory
t-tactic
t-testing
t-topology
t-util

Damiano Testa (Aug 14 2024 at 11:57):

These are just the root dirs inside of Mathlib, plus CI. I could also add t-archive and t-counterexamples, though those are probably rarer.

Yaël Dillies (Aug 14 2024 at 12:13):

Can we get an - in the middle, so t-set-theory rather than t-settheory?

Yaël Dillies (Aug 14 2024 at 12:13):

Also, those are not really topics but, so maybe f-folder-name is better than t-folder-name

Damiano Testa (Aug 14 2024 at 12:20):

If you prefer, I can also just add f-FolderName with the same capitalization as the folder.

Damiano Testa (Aug 14 2024 at 15:19):

I think that I like the idea of the bot posting the folder names, since that is useful information and also indisputable! How about the bot labeling with the subset of

`CI`                    `f-Init`
`f-Algebra`             `f-Lean`
`f-AlgebraicGeometry`           `f-LinearAlgebra`
`f-AlgebraicTopology`           `f-Logic`
`f-Analysis`                `f-Mathport`
`f-Archive`             `f-MeasureTheory`
`f-CategoryTheory`          `f-ModelTheory`
`f-Combinatorics`           `f-NumberTheory`
`f-Computability`           `f-Order`
`f-Condensed`               `f-Probability`
`f-Control`             `f-RepresentationTheory`
`f-Data`                `f-RingTheory`
`f-Deprecated`              `f-scripts`
`f-.docker`             `f-SetTheory`
`f-docs`                `f-Tactic`
`f-Dynamics`                `f-test`
`f-FieldTheory`             `f-Testing`
`f-Geometry`                `f-Topology`
`f-GroupTheory`             `f-Util`
`f-InformationTheory`           `f-widget`

that contains modified files?

Yaël Dillies (Aug 14 2024 at 15:20):

Sounds like it's going to be a lot of noise for not much gain

Damiano Testa (Aug 14 2024 at 15:21):

Ok, so then there is the need to figure out the topic from the PR automatically.

Jon Eugster (Aug 14 2024 at 15:41):

I mean you could also keep the topic labels roughly as they are right now and just asign at least one based on the folders modified.
and put most of the "random" folders (like mathport) either into t-meta or ignore them.

e.g. t-algebra would comprise quite some folders here

Violeta Hernández (Aug 14 2024 at 17:29):

I support t-settheory since it always feels weird tagging PRs about ordinals or games with t-logic

Violeta Hernández (Aug 14 2024 at 17:29):

I guess the latter could be t-combinatorics, but the former is just its own thing

Michael Rothgang (Aug 15 2024 at 10:14):

The current t-something labels are not quite identical to the root folders: for instance,

there are t-euclidean-geometry and t-differential-geometry (which both are subfolders under Geometry
there is t-measure-proabibility, which comprises measure theory and probability (as these are often related)
I'd rather map to existing labels, than changing the labels to fix into some very simple automatic structure.

Michael Rothgang (Aug 15 2024 at 10:16):

The current labels do not cover every folder. We could either

have an "uncategorized" label
have the bot post "I couldn't label this, please do so by hand")
for some PRs, the folder alone is not sufficiently telling. scripts could be CI or linter (or probably, also something else), and this needs manual inspection.

Damiano Testa (Aug 15 2024 at 11:23):

I think that even having a loose root-dir to t-label correspondence that maybe maps (a subset of) root-dirs to (a subset of) t-labels could potentially be useful (and easy enough to implement).

Jon Eugster (Aug 15 2024 at 11:27):

Ok, since we are now thinking about the implementation details; my proposal is to use the following mapping (and then maybe just assign 1 label automatically based on which folder has the most changes):

label-to-folder mapping

t-algebra
- Algebra
- FieldTheory
- RingTheory
- GroupTheory
- RepresentationTheory
- LinearAlgebra
t-algebraic-geometry
- AlgebraicGeometry
- Geometry.RingedSpace ~~(and maybe the default for Geometry.* if there's no better place?)~~
t-euclidean-geometry
- Geometry.Euclidean
t-differential-geometry
- Geometry.Manifold
t-analysis
- Analysis
t-category-theory
- CategoryTheory
t-combinatorics
- Combinatorics
t-computability
- Computability
t-condensed
- Condensed
t-data
- Data
t-dynamics
- Dynamics
t-linter
- Tactic.Linter
t-logic
- SetTheory (--> create a new t-set-theory)
- Logic
- ModelTheory
- ~~InformationTheory~~
t-measure-probability
- MeasureTheory
- Probability
- InformationTheory
t-meta
- Tactic (without Tactic.Linter)
- Lean
- Init
- Util
- Mathport
- Control
- Testing
t-number-theory
- NumberTheory
t-order
- Order
t-topology
- Topology
- AlgebraicTopology
ignored:
- Deprecated
- other Geometry._

Michael Rothgang (Aug 15 2024 at 11:28):

Sounds very good. One small suggestion:
I would rather say "unknown" for Geometry.* by default. There's a PR adding HyperbolicGeometry, which is most definitely not t-algebraic-geometry.

Damiano Testa (Aug 15 2024 at 11:29):

I am happy with the suggestion: a first attempt might even be more conservative and apply a label only when a single label is valid, before looking for more heuristics.

Damiano Testa (Aug 15 2024 at 11:29):

So, the bot would just add a single label and only when a single label applies.

Daniel Weber (Aug 15 2024 at 16:24):

Looks nice, but why is information theory under logic? I think it should be under combinatorics (or maybe probability?)

Frédéric Dupuis (Aug 15 2024 at 16:34):

Information theory should definitely be under the same label as probability.

Jon Eugster (Aug 15 2024 at 16:46):

updated the list above, thanks for looking at it

Damiano Testa (Aug 15 2024 at 18:21):

Let's see if this works! #15848

Damiano Testa (Aug 15 2024 at 18:28):

You can see the tests at #15849.

Damiano Testa (Aug 15 2024 at 18:33):

The test PR shows:

add a new file in Analysis and the bot adds the label t-analysis (which I then remove);
move the new file to AlgebraicGeometry and the bot adds the label t-algebraic-geometry (which I then remove);
finally duplicate the new file in the two folders and the bot does nothing, since there isn't a single option.

Looking at the PR, you can also see that when there is no label with the given modifications, then the bot also does nothing.

Michael Rothgang (Aug 15 2024 at 19:26):

I'm not sure if mapping Init to t-meta makes sense: I'd rather leave that unmapped and print a "I don't know" notice.

Michael Rothgang (Aug 15 2024 at 19:44):

I left a few comments on the script: I'd need more time to wrap my head around the awk/looking over that part would be appreciated.

Damiano Testa (Aug 15 2024 at 20:36):

After thinking about this a bit, the only external input to the script is the output of git diff: I'm probably going to reimplement the action in lean tomorrow, so no need to look at the awk part.

Damiano Testa (Aug 18 2024 at 16:09):

I re-implemented the label-adding script (mostly) in Lean: all the decision of which labels are applicable is done in Lean. However, actually applying the label to the PR in a workflow is done by a small amount of bash-glue.

Damiano Testa (Aug 18 2024 at 16:10):

As is, the PR is already long, even though there are approximately about 100 lines of actual code: the rest is documentation, tests, the CI action and the actual labels mentioned above.

Damiano Testa (Aug 18 2024 at 16:11):

I think that a huge improvement over the shell script is the potential for extensibility and maintainability of the Lean code.

Ruben Van de Velde (Dec 18 2024 at 23:13):

@Yaël Dillies you introduced

scripts/autolabel.lean: the following paths inside `Mathlib/` are not covered by any label: #[Mathlib/Geometry/Group] Please modify `AutoLabel.mathlibLabels` accordingly!

warnings on every PR

Yaël Dillies (Dec 18 2024 at 23:15):

Uh oh! I am already asleep, but I can have a look tomorrow

Yaël Dillies (Dec 18 2024 at 23:16):

FWIW I don't know what label I would give. Geometric group theory is not quite geometry, it's not quite algebra either. Maybe it should be a new label?

Johan Commelin (Dec 19 2024 at 10:04):

How did this happen? Is there some new top-level dir?

Yaël Dillies (Dec 19 2024 at 10:05):

No, it's a new second level dir

Damiano Testa (Dec 19 2024 at 10:06):

Oh, this is one of the tests that checks whether every dir is accounted for, whether explicitly labeled or explicitly ignored.

Johan Commelin (Dec 19 2024 at 10:07):

Hmm, so we need a CI check that helps us not merge PRs that break that check

Damiano Testa (Dec 19 2024 at 10:08):

Rather than failing CI, maybe we can add a label such as adds-dir?

Edward van de Meent (Dec 19 2024 at 10:08):

perhaps this can hitch off of the decl-diff CI?

Yaël Dillies (Dec 19 2024 at 10:09):

Yeah, why is that on the PR CI at all?

Yaël Dillies (Dec 19 2024 at 10:10):

It's rare enough that a non-covered folder is created (this is literally the first time, right?) that it could be a weekly check instead, like leanchecker

Johan Commelin (Dec 19 2024 at 10:10):

you introduced

scripts/autolabel.lean: the following paths inside `Mathlib/` are not covered by any label: #[Mathlib/Geometry/Group] Please modify `AutoLabel.mathlibLabels` accordingly!

warnings on every PR

If this is the consequence of merging a PR, then I think we should really have a check that fails CI.

Damiano Testa (Dec 19 2024 at 10:10):

The check is done as a part of the autolabel script, that runs once a PR is open.

Johan Commelin (Dec 19 2024 at 10:10):

Because then suddenly x00 PRs get warnings, and actually they shouldn't need to do anything

Yaël Dillies (Dec 19 2024 at 10:11):

My claim is that the warnings should not be on the PRs at all, but rather posted somewhere on Zulip

Damiano Testa (Dec 19 2024 at 10:11):

Ok, I think that I was a little skeptical about the check when Jon introduced it, but seeing it working now, it is very easy to miss the introduction of a new dir and hence of possible auto-labels.

Yaël Dillies (Dec 19 2024 at 10:12):

This check does nothing to improve the quality of _mathlib_. Instead it does something to improve the quality of reviewing PRs

Damiano Testa (Dec 19 2024 at 10:12):

I also see Yaël's point: the decision of what label to assign to a newly created dir may not be in scope for the creation of the dir, although this is not so clear.

Damiano Testa (Dec 19 2024 at 10:14):

The options so far seem to be:

make CI fail if a new dir is created and no label applies to it;
add a label to PRs that add an unlabeled dir;
post on Zulip when a PR adds an unlabeled dir.

Anything else?

Johan Commelin (Dec 19 2024 at 10:17):

Move the check to a separate CI workflow that runs once a week. The reviewer team can then create a PR to update the auto-labeler.

Damiano Testa (Dec 19 2024 at 10:18):

Should we poll? I think that my preference is to make CI fail and get the addition of the dir also come with a choice of auto-label.

Edward van de Meent (Dec 19 2024 at 10:18):

Johan Commelin said:

Move the check to a separate CI workflow that runs once a week. The reviewer team can then create a PR to update the auto-labeler.

that option still causes the warning on open PRs when the "bad" PR is merged, no?

Damiano Testa (Dec 19 2024 at 10:19):

After all, a PR adding a dir should not be doing much else and taking care of recreating a good working environment seems in scope.

Edward van de Meent (Dec 19 2024 at 10:19):

so maybe a "nolint" approach to silencing that might be good

Damiano Testa (Dec 19 2024 at 10:19):

Edward van de Meent said:

Johan Commelin said:

Move the check to a separate CI workflow that runs once a week. The reviewer team can then create a PR to update the auto-labeler.

that option still causes the warning on open PRs when the "bad" PR is merged, no?

I think that the proposal is to not perform the check in regular CI, but only perform the check once a week on master.

Johan Commelin (Dec 19 2024 at 10:20):

I don't really care which option. As long as other PRs don't have to worry about the new dir, and don't see warnings about it.

Damiano Testa (Dec 19 2024 at 10:23):

Since this is not something that happens very often, how about we implement the "CI fails if an unlabeled dir is created" and the next time that this happens, we review our decision, in case we dislike it?

Johan Commelin (Dec 19 2024 at 10:35):

Fine with me

Damiano Testa (Dec 19 2024 at 10:37):

Should the label be a new t-geometric-group-theory label?

Yaël Dillies (Dec 19 2024 at 10:38):

That was my suggestion as well :smile: :point_down:
Yaël Dillies said:

FWIW I don't know what label I would give. Geometric group theory is not quite geometry, it's not quite algebra either. Maybe it should be a new label?

Damiano Testa (Dec 19 2024 at 10:43):

First, the simple fix: #20061 makes autolabel aware of geometric-group-theory.

Daniel Weber (Dec 19 2024 at 13:51):

Damiano Testa said:

Should the label be a new t-geometric-group-theory label?

I feel like this is way too specific - we don't have, for example, algebraic-number-theory, or analytic-number-theory

Kevin Buzzard (Dec 19 2024 at 14:34):

No, and as a result I have to label all my algebraic number theory PRs as t-algebra and t-number-theory

Damiano Testa (Dec 19 2024 at 14:35):

Given that now the auto-labelling has been tested and seems to be working well, we can revise the decision to either add more than one label, or expand the pool of t- labels (or both!).

Damiano Testa (Dec 19 2024 at 14:36):

To me it makes perfect sense that as different/more advanced material gets formalized that the subject labels should adapt.

Kevin Buzzard (Dec 19 2024 at 14:37):

The current number theory hack suggests that Yael should be labelling their PRs t-geometry and t-group-theory but I think that expanding the list of labels is probably a more sensible reaction?

Daniel Weber (Dec 19 2024 at 14:37):

Kevin Buzzard said:

No, and as a result I have to label all my algebraic number theory PRs as t-algebra and t-number-theory

So you're claiming it should exist? Interesting. What is the cost of having many labels? Is it just that people might miss relevant labels?

Kevin Buzzard (Dec 19 2024 at 14:38):

I'm just saying that already our system is a bit weird here. Note that the t-number theory description explicitly asks people to do this.

Damiano Testa (Dec 19 2024 at 14:49):

The t-geometric-group-theory label and autolabel has been merged, so, going forwards, we should decide whether we like to have more and more descriptive label, or we should stick to few labels and mix and match.

Damiano Testa (Dec 19 2024 at 14:51):

I personally have a preference for more descriptive labels, as I do not really think that there is a cost to having many labels and scanning through the PRs I find it more useful to read algebraic-number-theory, than the combination of algebra and number-theory.

Damiano Testa (Dec 19 2024 at 14:52):

If, conversely, combining fewer labels is preferred, then I also think that the autolabelling should be allowed to add multiple labels.

Michael Stoll (Dec 19 2024 at 16:46):

Is it still true that PRs with not exactly one t-something label will show up in the "generic" thread (on mathlib reviewers) when maintainer merged/delegated? This is the reason why I stopped adding t-algebra or t-analysis to my number theory related PRs.

Damiano Testa (Dec 19 2024 at 17:39):

Yes, this is still true: all the autolabel automation revolves around the underlying assumption that there is a single applicable label. As soon as there is more than one, PRs are treated as "generic".

Damiano Testa (Dec 19 2024 at 17:39):

This decision was taken to avoid making autolabels too intrusive.

Damiano Testa (Dec 19 2024 at 17:40):

However, to me it looks like they are working well and the scope could be extended.

Damiano Testa (Jan 22 2025 at 09:34):

Partly prompted by #Infinity-Cosmos, I wonder if there are any improvements/changes that would be desired for the auto-label action.

Damiano Testa (Jan 22 2025 at 09:34):

I personally find that it works well, but there are several projects downstream of mathlib (e.g. #Infinity-Cosmos , but also #Equational, #FLT, #Carleson, #PrimeNumberTheorem+ and others that do not necessarily have a dedicated Zulip channel) that may benefit from having some form of autolabel applied to mark PRs as "coming from" that specific project.

Damiano Testa (Jan 22 2025 at 09:34):

Should there be some discussion of whether an automatic labelling of such PRs is a good idea and, if so, how to make the bot pick up such PRs?

Damiano Testa (Jan 22 2025 at 09:42):

Here is a proposal: the auto-label bot stays as is and manages the t-topic labels with its current logic.

A separate action is triggered by some catch-phrase in the PR description and, based on that, applies a (probably not t-topic) label to the PR.

So, we could have

PR XXX
We improve the API around enriched categories.

Project: Infinity-cosmos

and the bot sees Project: Infinity-cosmos and adds a further label, on top of whatever t-topic label it would have applied already.

Jon Eugster (Jan 22 2025 at 09:50):

isn't writing Project: Infinity-cosmos just as much effort as manually adding the label though? Or is the idea that a standardised Project: in the commit message helps people analysing how many commits were made from a specific project?

Damiano Testa (Jan 22 2025 at 09:54):

Yes, Project would be the catch-phrase for the bot that can be reused for each project. I agree that it is as much work as adding the label, but it would also help to have a standardized way of marking a PR in the PR description.

Damiano Testa (Jan 22 2025 at 09:55):

An alternative would be that adding the label for a project edits the PR description and writes Project: whatever!

Damiano Testa (Jan 22 2025 at 09:58):

Also, as a personal comment, I find it easier to remember about adding stuff to the PR description than later remembering about labels.

Damiano Testa (Jan 22 2025 at 09:58):

So, for me, writing a comment in the PR description that triggers the label is a safer way to ensure that the label gets added.

Bryan Gin-ge Chen (Mar 21 2025 at 20:43):

Maybe it was just a random glitch, but the auto-label run just failed on #23207: https://github.com/leanprover-community/mathlib4/actions/runs/14000264140/job/39204909640?pr=23207

For what it's worth, I had already applied the CI label to the PR when I created it.

edit: oddly enough, the run above, which was triggered by the "push" event failed, but there was also a run triggered by "pull_request" that succeeded: https://github.com/leanprover-community/mathlib4/actions/runs/14000463302/job/39205512624?pr=23207

Jon Eugster (Mar 21 2025 at 21:27):

Just from reading your message I would guess that on "push" it fails because it gets no (or an invalid) PR number to apply the label to.

I think you could (A) check the script is really supposed to run on that initial "push" - probably it's not? - and (b) add some test which actually verifies the PR number is valid and only tries to apply the labels if so

Jon Eugster (Mar 21 2025 at 21:29):

i think the script is supposed to run only on PR creation plus when the scrip's Lean file itself has been modified.

Sorry I'm not on the computer, would check myself otherwise

Bryan Gin-ge Chen (Mar 21 2025 at 21:39):

Thanks! I would have thought if that was the case more people would have run into this, but maybe I'm just the first to complain... I'll also see if I can check later.

Jeremy Tan (Mar 22 2025 at 02:42):

@Bryan Gin-ge Chen we use reviewdog, right? We must rotate all secrets immediately!

Jeremy Tan (Mar 22 2025 at 02:45):

(And maybe buy a plan from StepSecurity)

Tristan Figueroa-Reid (Mar 22 2025 at 03:07):

(security advisory) I'm investigating @reviewdog/action-suggester now - it isn't reported as a compromised package, but an initial search reveals that the action does depend on the compromised repository in its installation script.

Tristan Figueroa-Reid (Mar 22 2025 at 03:09):

It is likely that mathlib4 was compromised if @reviewdog/reviewdog as a repository was compromised.

Jeremy Tan (Mar 22 2025 at 03:13):

Where's a maintainer to help us?

Tristan Figueroa-Reid (Mar 22 2025 at 03:14):

Tristan Figueroa-Reid said:

It is likely that mathlib4 was compromised if @reviewdog/reviewdog as a repository was compromised.

The if here might be false! Taking a look at wiz's writeup, the malicious commit in question did not affect the reviewdog installation script :+1:

Jeremy Tan (Mar 22 2025 at 03:15):

But still, better buy that enterprise plan

Tristan Figueroa-Reid (Mar 22 2025 at 03:15):

Of course, this depends on whether that was the only malicious commit in reviewdog. It seems that while it is still a good idea to rotate secrets in case of insufficient investigation on the reviewdog's part, it isn't urgent.

Tristan Figueroa-Reid (Mar 22 2025 at 03:15):

Jeremy Tan said:

But still, better buy that enterprise plan

As far as I know, any report officially integrated with GitHub's security reporting system should give a security alert in any associated repository.

Tristan Figueroa-Reid (Mar 22 2025 at 03:16):

image.png
(An example of one of my personal repositories)

Tristan Figueroa-Reid (Mar 22 2025 at 03:18):

I am partially annoyed by reviewdog here because if they tagged all of their vulnerable repositories with the security advisory, it would have alerted everyone who was using said vulnerable repository rather than just their parent repository.

Jeremy Tan (Mar 22 2025 at 03:26):

This is possibly the next Solarwinds, we have to take action immediately

Bryan Gin-ge Chen (Mar 22 2025 at 04:07):

Hi, thanks for the ping. First, based on the results of the GitHub query provided in the security advisory, it looks like there's no evidence that we were affected at this time.

However, out of an abundance of caution, I have disabled all of our workflows that use reviewdog actions temporarily while we review logs to make sure that none of our keys were exposed. We'll also look into rotating our secrets ASAP as well.

Last updated: May 02 2025 at 03:31 UTC