Zulip Chat Archive
Stream: mathlib4
Topic: What to do with "slop" PRs
Violeta Hernández (Feb 27 2026 at 02:53):
These past few months there's been a sharp decrease in the quality of the average new contributor's PR. These PRs will be hundreds or thousands of lines long while proving a comparatively simple result, or they'll be filled with dubious definitions and only their trivial consequences. The effort that it takes to fix them is comparable to the effort it takes to re-do things from scratch.
I don't want to name names, but I'd imagine most maintainers have come across multiple of these PRs as of late. What's to be done about them? Should they be left unreviewed to accrue dust until they inevitably fall off the queue? Should they be brought to the maintainers' attention so they can deal with them accordingly? Should we leave some standard message like "this PR doesn't meet our quality standards, see here"? I'm worried about potentially shooing away a new contributor, but I'm also worried about the queue becoming filled with increasingly sloppy material.
Jeremy Tan (Feb 27 2026 at 04:25):
Violeta Hernández said:
These PRs will be hundreds of thousands of lines long
Just thousands, not hundreds of thousands. I feel that such PRs should be closed on sight, explaining that the new contributor must tighten their standards
Violeta Hernández (Feb 27 2026 at 04:25):
I meant hundreds or thousands, not hundreds of thousands! Thankfully we haven't seen that just yet.
Weiyi Wang (Feb 27 2026 at 04:31):
This must have been the same for many open source projects recently. How do other projects deal with this?
Jeremy Tan (Feb 27 2026 at 04:34):
Fluxer is an open-source chat client closely modelled after Discord. Its guidelines for those who want to contribute with LLMs mandate that a human-written explanation accompany each PR
Jeremy Tan (Feb 27 2026 at 04:35):
"Contributor must thoroughly understand what they're adding to the project"
Yan Yablonovskiy 🇺🇦 (Feb 27 2026 at 04:41):
Could the generality and broadness of something like mathlib / what it is trying to accomplish make such PRs more voluminous, in comparison to other "narrower" repositories?
Johan Commelin (Feb 27 2026 at 07:22):
Since it is "cheap" (in terms of the blood, sweat, and tears that the PR author needs to invest) to create such a PR, I think there has to be a much higher burden of proof that the code is useful, robust, maintainable, high quality.
One way to establish that is by showing:
- that this code has been sitting in the `ForMathlib/` folder of downstream repo `XyzzyTheory`
- was refactored and generalized `i` times in the past
- but now has been stable for `k` time (`k` > 2 months??)
- and has `N` declarations depending on it, with a DAG depth of `d` (maybe `N > 400` and `d > 10`)

If
- I know something about Xyzzy theory, and
- I recognize the theorems that are being developed in the `XyzzyTheory` repo, and
- the blueprint is passing sniff tests,

then I might be quite inclined to believe that moving (without refactoring!) this piece of code from `ForMathlib/` into Mathlib proper is a worthwhile thing, and the code has proven itself to be worth its LOC in gold.
This is a higher standard than we apply today to human contributions.
If your AI complains about that, then I'm inclined to tell it
:shrug: C'est la vie. Deal with it.
Ok, that was a bit provocative. But I'm trying to work out whether we can find some pragmatic and deterministic measures of high-quality code. The above is an attempt to have a measurable way of determining that code is useful in/for the broader ecosystem.
Ruben Van de Velde (Feb 27 2026 at 08:48):
And then hope you don't get a hit piece published about you, I guess
Vlad Tsyrklevich (Feb 27 2026 at 08:55):
Perhaps one possible approach is a policy that your first 10/20/N PRs must be entirely handwritten. After that, you can be said to have a passable enough understanding of mathlib style to decide on appropriate use of AI. At a minimum we know that you _actually know some Lean_. This requires honesty from reviewees, but for the moment it's not hard to spot AI proofs
Thomas Browning (Feb 27 2026 at 09:46):
Or put length limits on first PRs? I think that might even be good for completely human-written contributions?
Michael Rothgang (Feb 27 2026 at 12:26):
I like both Vlad's and Thomas' suggestions. (Though I bet some would object to the first requirement.)
Jeremy Tan (Feb 27 2026 at 15:50):
I would be more inclined to implement the minimum PR requirement than the length limit. I'd put the floor at 5 PRs
Shreyas Srinivas (Feb 27 2026 at 17:09):
I like the length limit idea (although this can vary depending on how git actually computes diffs, which is heuristic).
In addition, I think maybe it is a good idea to institute an "approved PR" badge. One gets this badge for their PR by proposing a PR in a dedicated zulip channel and explaining it succinctly (without sounding like an LLM). Maybe they also submit a draft PR. Maintainers can skim it and give a badge of approval. For example maintainers could tell the contributor to shorten the PR or focus on something specific. Only PRs with the approved badge are accepted from contributors with less than 10 or 20 merged PRs.
Michael Rothgang (Feb 27 2026 at 17:26):
Right now, new contributors' contributions are often implicitly approved (by virtue of being mentored by experienced community members). For those PRs, adding another hoop to jump through feels not great. (Mathlib's PR process is intimidating enough already.)
Michael Rothgang (Feb 27 2026 at 17:29):
I'd rather make it easier to quickly close slop PRs. We're seeing an increase of new members on zulip posting about AI-powered formalisation - so getting better at directing their enthusiasm in useful ways would help more.
Chris Henson (Feb 27 2026 at 17:51):
I think Michael has a good point. Certain regulations can have an asymmetric effect on well-meaning newcomers versus bad actors.
It seems Violeta's original question about a standardized message for these cases could be useful as a standard route to quickly closing these PRs. (Maybe it also triggers a queue tag for AI PRs to close??)
Thomas Murrills (Feb 27 2026 at 18:36):
Shreyas Srinivas said:
In addition, I think maybe it is a good idea to institute an "approved PR" badge. One gets this badge for their PR by proposing a PR in a dedicated zulip channel and explaining it succinctly (without sounding like an LLM).
Instead of an “approved PR” label for all new contributors, would it make sense to flip this and label LLM-generated PRs with an `LLM` label? (Or `uses-LLM`, `LLM-assisted`?) Then when `LLM` and `new-contributor` appear together, we know how to gently suggest better ways to start contributing.
It could be taken as a bit of a scarlet letter, but maybe that’s just the value-neutral cost of making review take more time? :grinning_face_with_smiling_eyes: We do want LLM use to be declared explicitly anyway.
Perhaps (only) LLM PRs specifically could then receive an additional label (attached by reviewers/maintainers) showing the PR authors are attempting in good faith to meet some standards we put in place for high-quality contributions. (`human-curated`?)
Andrew Yang (Feb 27 2026 at 22:33):
I have a tangential question here. Are there any good ways to identify LLM-generated code? While reviewing I often see some highly suboptimal use of tactics and lemmas that looks suspicious, but it might as well be a novice monkey-typing random tactics that they know until Lean says yes. (I also did this when I started, and frankly I still do it now at times; it is just that my experience gives me a fairly good random distribution that usually makes things work nicely.) Another telltale sign is weird comments like
/-- Let a be a natural number with the value 3 -/
def a : Nat := 3
But it might as well be some novice who genuinely has no idea what's going on without the comments, or who mistakenly thought that comments are required at every step of a proof. I do not want to mistakenly accuse people (especially newcomers) of LLM slop, but I fear I might be wasting my reviewing energy otherwise.
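As a hypothetical illustration (not taken from any actual PR), the "suboptimal tactics" contrast Andrew describes might look something like this: a roundabout induction proof of a fact that the idiomatic proof dispatches in one line.

```lean
-- Hypothetical sketch: a needlessly roundabout proof in the style described above.
example (n : Nat) : n + 0 = n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.succ_add, ih]

-- The idiomatic version: `n + 0 = n` holds definitionally.
example (n : Nat) : n + 0 = n := rfl
```

Of course, as the message notes, a human novice can produce the first proof just as easily as an LLM can.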
Snir Broshi (Feb 27 2026 at 22:40):
LLM-generated Lean code usually comes with an LLM-generated PR description (and the reverse is also true), which is usually easy to identify. Use of any markdown headings is a pretty good indicator (excluding the `---` in PR descriptions, which sometimes creates accidental headings).
Thomas Murrills (Feb 27 2026 at 22:47):
Making some thoughts a bit more concrete, here's a proposal for what the overall flow of social actions/responsibilities/expectations could look like.
Note: the goal of this proposal is to smooth out the "what do I do about this" problem reviewers face when encountering an LLM PR, while letting us both encourage good code and more easily close PRs that are not meeting the standards. (It's kind of a wrapper around our standards for LLM code, so we can decouple the social part of this problem.)
- triage: LLM PRs are identified and the `LLM` label is added.
  - PR authors are encouraged to add the `LLM` label themselves by commenting `LLM`
  - reviewers suspecting an LLM PR should
    - gently ask the author if there's any doubt (maybe we could have a template for doing this to help reviewers)
    - add the label if there is no doubt
    - in all cases comment encouraging the PR author to do so themselves
- then, when the `LLM` label is applied to PRs, we communicate expectations and extend trust (or not):
  - a bot comments on LLM-labeled PRs and points to [LLM Contribution Responsibilities]
    - if `new-contributor` is present as well, the bot also points to [Different ways to start contributing]
    - if the PR author has several LLM-labeled PRs already merged, the bot auto-applies the `human-curated` label, i.e. the PR author is a trusted LLM user, and this saves some further reviewer triage (note: the bot should still comment with [LLM Contribution Responsibilities] in this case to give reviewers quick access to the standards)
    - if the PR is over the length limit and does not have `human-curated`, the bot also applies `awaiting-author` and explains the length limit requirement
  - reviewers then further triage PRs with (only) the `LLM` label:
    - if the author is acting in good faith/working to meet the responsibilities, apply the `human-curated` label (the PR does not yet have to fully meet the high standards; the author just has to show that they are definitely acting in good faith to do so)
    - if not (e.g. it's been a while since the bot commented about [LLM Contribution Responsibilities] and there is still no movement in that direction from the PR author):
      - comment asking the author to please meet the [LLM Contribution Responsibilities] and apply `awaiting-author`
      - if `awaiting-author` has been removed by the author without material good-faith effort to meet the responsibilities, close the PR with a stock comment that the responsibilities haven't been fulfilled, standards are high for LLM PRs, and review time is at a premium. `new-contributor`s are told future contributions are welcome.
This is of course just a proposal to iterate on, so I've made some specific choices :)
This is a bit outside the proposal per se, but I think the [LLM Contribution Responsibilities] document/blurb that this process wraps should (at least) include the responsibilities of:
- asking the contributor to lay out the "here is why this is good code" evidence Johan mentioned in the PR description (or in a comment)
- meeting a length limit
- meeting high quality standards, with examples of what LLMs frequently get wrong
and should include
- clear communication/warning about what happens if you don't try to meet these responsibilities, i.e. that your PR may be closed promptly
- why these standards are necessary: short explanation about review time, etc. :)
Likewise the [Different ways to start contributing] document/blurb should explain why we want new contributors to contribute differently, in a welcoming way: it builds trust with the contributor and helps the contributor familiarize themselves with code standards that LLMs struggle to meet. Communicate also that LLM users are expected to closely pilot LLM contributions, and that familiarity with the ordinary contribution process saves a lot of review time.
It should also perhaps(?) gently suggest converting LLM PRs to draft PRs until the contributor is no longer a new-contributor.
Last updated: Feb 28 2026 at 14:05 UTC