Zulip Chat Archive
Stream: mathlib4
Topic: What to do with "slop" PRs
Violeta Hernández (Feb 27 2026 at 02:53):
These past few months there's been a sharp decrease in the quality of the average new contributor's PR. These PRs will be hundreds or thousands of lines long while proving a comparatively simple result, or they'll be filled with dubious definitions and only their trivial consequences. The effort that it takes to fix them is comparable to the effort it takes to re-do things from scratch.
I don't want to name names, but I'd imagine most maintainers have come across multiple of these PRs as of late. What's to be done about them? Should they be left unreviewed to accrue dust until they inevitably fall off the queue? Should they be brought to the maintainers' attention so they can deal with them accordingly? Should we leave some standard message like "this PR doesn't meet our quality standards, see here"? I'm worried about potentially shooing away a new contributor, but I'm also worried about the queue becoming filled with increasingly sloppy material.
Jeremy Tan (Feb 27 2026 at 04:25):
Violeta Hernández said:
These PRs will be hundreds of thousands of lines long
Just thousands, not hundreds of thousands. I feel that such PRs should be closed on sight, explaining that the new contributor must tighten their standards
Violeta Hernández (Feb 27 2026 at 04:25):
I meant hundreds or thousands, not hundreds of thousands! Thankfully we haven't seen that just yet.
Weiyi Wang (Feb 27 2026 at 04:31):
This must have been the same for many open source projects recently. How do other projects deal with this?
Jeremy Tan (Feb 27 2026 at 04:34):
Fluxer is an open-source chat client closely modelled after Discord. Its guidelines for those who want to contribute with LLMs mandate that a human-written explanation accompany each PR
Jeremy Tan (Feb 27 2026 at 04:35):
"Contributor must thoroughly understand what they're adding to the project"
Yan Yablonovskiy 🇺🇦 (Feb 27 2026 at 04:41):
Could the generality and broadness of something like mathlib / what it is trying to accomplish make such PRs more voluminous, in comparison to other "narrower" repositories?
Johan Commelin (Feb 27 2026 at 07:22):
Since it is "cheap" (in terms of the blood, sweat, and tears that the PR author needs to invest) to create such a PR, I think there has to be a much higher burden of proof that the code is useful, robust, maintainable, high quality.
One way to establish that is by showing:
- that this code has been sitting in the `ForMathlib/` folder of downstream repo `XyzzyTheory`
- was refactored and generalized `i` times in the past
- but now has been stable for `k` time (`k` > 2 months??)
- and has `N` declarations depending on it, with a DAG depth of `d` (maybe `N > 400` and `d > 10`)

If
- I know something about Xyzzy theory, and
- I recognize the theorems that are being developed in the `XyzzyTheory` repo, and
- the blueprint is passing sniff tests,

then I might be quite inclined to believe that moving (without refactoring!) this piece of code from `ForMathlib/` into Mathlib proper is a worthwhile thing, and the code has proven itself to be worth its LOC in gold.
This is a higher standard than we apply today to human contributions.
If your AI complains about that, then I'm inclined to tell it
:shrug: C'est la vie. Deal with it.
Ok, that was a bit provocative. But I'm trying to work out whether we can find some pragmatic and deterministic measures of high-quality code. The above is an attempt to have a measurable way of determining that code is useful in/for the broader ecosystem.
Ruben Van de Velde (Feb 27 2026 at 08:48):
And then hope you don't get a hit piece published about you, I guess
Vlad Tsyrklevich (Feb 27 2026 at 08:55):
Perhaps one possible approach is a policy that your first 10/20/N PRs must be entirely handwritten. After that, you can be said to have a passable enough understanding of mathlib style to decide on appropriate use of AI. At a minimum we know that you _actually know some Lean_. This requires honesty from reviewees, but for the moment it's not hard to spot AI proofs
Thomas Browning (Feb 27 2026 at 09:46):
Or put length limits on first PRs? I think that might even be good for completely human-written contributions?
Michael Rothgang (Feb 27 2026 at 12:26):
I like both Vlad's and Thomas' suggestions. (Though I bet some would object to the first requirement.)
Jeremy Tan (Feb 27 2026 at 15:50):
I would be more inclined to implement the minimum PR requirement than the length limit. I'd put the floor at 5 PRs
Shreyas Srinivas (Feb 27 2026 at 17:09):
I like the length limit idea (although this can vary depending on how git actually computes diffs, which is heuristic).
In addition, I think maybe it is a good idea to institute an "approved PR" badge. One gets this badge for their PR by proposing a PR in a dedicated zulip channel and explaining it succinctly (without sounding like an LLM). Maybe they also submit a draft PR. Maintainers can skim it and give a badge of approval. For example maintainers could tell the contributor to shorten the PR or focus on something specific. Only PRs with the approved badge are accepted from contributors with less than 10 or 20 merged PRs.
Michael Rothgang (Feb 27 2026 at 17:26):
Right now, new contributors' contributions are often implicitly approved (by virtue of being mentored by experienced community members). For those PRs, adding another hoop to jump through feels not great. (Mathlib's PR process is intimidating enough already.)
Michael Rothgang (Feb 27 2026 at 17:29):
I'd rather make it easier to quickly close slop PRs. We're seeing an increase of new members on zulip posting about AI-powered formalisation - so getting better at directing their enthusiasm in useful ways would help more.
Chris Henson (Feb 27 2026 at 17:51):
I think Michael has a good point. Certain regulations can have an asymmetric effect on well-meaning newcomers versus bad actors.
It seems Violeta's original question about a standardized message for these cases could be useful as a standard route to quickly closing these PRs. (Maybe it also triggers a queue tag for AI PRs to close??)
Thomas Murrills (Feb 27 2026 at 18:36):
Shreyas Srinivas said:
In addition, I think maybe it is a good idea to institute an "approved PR" badge. One gets this badge for their PR by proposing a PR in a dedicated zulip channel and explaining it succinctly (without sounding like an LLM).
Instead of an “approved PR” label for all new contributors, would it make sense to flip this and label LLM-generated PRs with an `LLM` label? (Or `uses-LLM`, `LLM-assisted`?) Then when `LLM` and `new-contributor` appear together, we know how to gently suggest better ways to start contributing.
It could be taken as a bit of a scarlet letter, but maybe that’s just the value-neutral cost of making review take more time? :grinning_face_with_smiling_eyes: We do want LLM use to be declared explicitly anyway.
Perhaps (only) LLM PRs specifically could then receive an additional label (attached by reviewers/maintainers) showing the PR authors are attempting in good faith to meet some standards we put in place for high-quality contributions. (`human-curated`?)
Andrew Yang (Feb 27 2026 at 22:33):
I have a tangential question here. Are there any good ways to identify LLM-generated code? While reviewing I often see some highly suboptimal use of tactics and lemmas that looks suspicious, but it might as well be a novice monkey-typing random tactics that they know until Lean says yes. (I also did this when I started, and frankly I still do it now at times; it is just that my experience gives me a fairly good random distribution that usually makes things work nicely.) Another telltale sign is weird comments like
/-- Let a be a natural number with the value 3 -/
def a : Nat := 3
But it might as well be some novice who genuinely has no idea what's going on without the comments, or who mistakenly thought that comments are required at every step of a proof. I do not want to mistakenly accuse people (especially newcomers) of LLM slop, but I fear I might be wasting my reviewing energy otherwise.
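As a hypothetical illustration (not taken from any actual PR), the "suboptimal tactics" contrast Andrew describes might look something like this: a roundabout induction proof of a fact that the idiomatic proof dispatches in one line.

```lean
-- Hypothetical sketch: a needlessly roundabout proof in the style described above.
example (n : Nat) : n + 0 = n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.succ_add, ih]

-- The idiomatic version: `n + 0 = n` holds definitionally.
example (n : Nat) : n + 0 = n := rfl
```

Of course, as the message notes, a human novice can produce the first proof just as easily as an LLM can.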
Snir Broshi (Feb 27 2026 at 22:40):
LLM-generated Lean code usually comes with an LLM-generated PR description (and the reverse is also true), which is usually easy to identify. Use of any markdown headings is a pretty good indicator (excluding the `---` in PR descriptions, which sometimes creates accidental headings).
Thomas Murrills (Feb 27 2026 at 22:47):
Making some thoughts a bit more concrete, here's a proposal for what the overall flow of social actions/responsibilities/expectations could look like.
Note: the goal of this proposal is to smooth out the "what do I do about this" problem reviewers face when encountering an LLM PR, while letting us both encourage good code and more easily close PRs that are not meeting the standards. (It's kind of a wrapper around our standards for LLM code, so we can decouple the social part of this problem.)
- triage: LLM PRs are identified and the `LLM` label is added.
  - PR authors are encouraged to add the `LLM` label themselves by commenting `LLM`
  - reviewers suspecting an LLM PR should
    - gently ask the author if there's any doubt (maybe we could have a template for doing this to help reviewers)
    - add the label if there is no doubt
    - in all cases comment encouraging the PR author to do so themselves
- then, when the `LLM` label is applied to PRs, we communicate expectations and extend trust (or not):
  - a bot comments on LLM-labeled PRs and points to [LLM Contribution Responsibilities]
    - if `new-contributor` is present as well, the bot also points to [Different ways to start contributing]
    - if the PR author has several LLM-labeled PRs already merged, the bot auto-applies the `human-curated` label, i.e. the PR author is a trusted LLM user, and this saves some further reviewer triage (note: the bot should still comment with [LLM Contribution Responsibilities] in this case to give reviewers quick access to the standards)
    - if the PR is over the length limit and does not have `human-curated`, the bot also applies `awaiting-author` and explains the length limit requirement
  - reviewers then further triage PRs with (only) the `LLM` label:
    - if the author is acting in good faith/working to meet the responsibilities, apply the `human-curated` label (the PR does not yet have to fully meet the high standards; the author just has to show that they are definitely acting in good faith to do so)
    - if not (e.g. it's been a while since the bot commented about [LLM Contribution Responsibilities] and there is still no movement in that direction from the PR author):
      - comment asking the author to please meet the [LLM Contribution Responsibilities] and apply `awaiting-author`
      - if `awaiting-author` has been removed by the author without material good-faith effort to meet the responsibilities, close the PR with a stock comment that the responsibilities haven't been fulfilled, standards are high for LLM PRs, and review time is at a premium. `new-contributor`s are told future contributions are welcome.
This is of course just a proposal to iterate on, so I've made some specific choices :)
This is a bit outside the proposal per se, but I think the [LLM Contribution Responsibilities] document/blurb that this process wraps should (at least) include the responsibilities of:
- asking the contributor to lay out the "here is why this is good code" evidence Johan mentioned in the PR description (or in a comment)
- meeting a length limit
- meeting high quality standards, with examples of what LLMs frequently get wrong
and should include
- clear communication/warning about what happens if you don't try to meet these responsibilities, i.e. that your PR may be closed promptly
- why these standards are necessary: short explanation about review time, etc. :)
Likewise the [Different ways to start contributing] document/blurb should explain why we want new contributors to contribute differently, in a welcoming way: it builds trust with the contributor and helps the contributor familiarize themselves with code standards that LLMs struggle to meet. Communicate also that LLM users are expected to closely pilot LLM contributions, and that familiarity with the ordinary contribution process saves a lot of review time.
It should also perhaps(?) gently suggest converting LLM PRs to draft PRs until the contributor is no longer a new-contributor.
Last updated: Feb 28 2026 at 14:05 UTC