Zulip Chat Archive
Stream: mathlib4
Topic: Potentially useful results from an AI-assisted formalization
Marcello Paris (Feb 27 2026 at 07:56):
Hi everyone, while working on an experiment of mine, I ended up developing some general-purpose formalization that isn't specific to the experiment (repo here, ~28,800 lines).
I thought some of it might be of interest to the Mathlib community, so I put together a list of candidates organized by topic.
Areas include analytic number theory (large sieve, Van der Corput, Gauss sums, Parseval for finite groups), arithmetic functions, finite group theory, and various small lemmas.
Worth noting: the Lean code was written by an agent swarm powered by Claude Opus 4.6, with my direction. I make no claims about Mathlib-readiness (conventions, style and organization of mathematical content), but if anything looks useful, I'd be happy to split things into small, focused PRs and adapt to the project's standards.
David Loeffler (Feb 27 2026 at 08:41):
My first concern, looking at your "list of candidates", is that a large fraction of these results seem to duplicate (exactly or nearly) existing functionality in mathlib. E.g. we already have Cauchy-Schwarz, Mellin transforms, periodized Bernoulli functions, the Jacobi theta transformation law, etc.
David Loeffler (Feb 27 2026 at 08:54):
A second, and rather more serious, concern is that your repository doesn't seem to actually contain what you claim. To take an example, you say you have "formalized the Jacobi transformation law". But the declaration in your file IKCh4.lean is
def JacobiThetaTransformation : Prop :=
  ∀ (y : ℝ), 0 < y →
    ∑' m : ℤ, Real.exp (-π * (m : ℝ) ^ 2 / y) =
      Real.sqrt y * ∑' n : ℤ, Real.exp (-π * (n : ℝ) ^ 2 * y)
That's not a theorem; it's the statement of a theorem, without its proof. As a rule, Mathlib is not interested in collecting statements without proofs. Unless you can contribute proofs of these theorems, not just statements, there's nothing here that's mathlib-able.
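To illustrate the difference with a toy example (hypothetical names, not taken from your repo): a def wrapping a Prop merely names a statement, whereas a theorem must carry a proof term:

```lean
-- Just a named statement: nothing here is proved.
def TwoMulComm : Prop :=
  ∀ n : ℕ, 2 * n = n * 2

-- An actual theorem: the proof term is checked by the kernel.
theorem two_mul_comm : ∀ n : ℕ, 2 * n = n * 2 :=
  fun n => Nat.mul_comm 2 n
```

Only declarations of the second kind count as formalized results.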
Most of the content of your files consists of defs like this; and sometimes you don't even have a proper statement, like this one (the next declaration after the above):
/-- Fejér kernel identity — IK (4.27). -/
def FejerKernelIdentity : Prop :=
  ∀ (y : ℝ), 0 < y → ∀ (_x : ℝ),
    True -- Poisson transform of Fejér pair
I'm afraid that reading this codebase gives the strong impression that you are mechanically copy-pasting the output of an LLM without yourself really understanding what it means to formalize a theorem in Lean. Given that the mathlib code-review process has very strict standards and requires a lot of effort by highly qualified and distinctly overworked reviewers, I'm afraid it will take quite a bit of further work on your side before any of this can be seriously considered for a mathlib PR.
Marcello Paris (Feb 27 2026 at 10:20):
Thank you for your feedback, David. To give some context: my goal with this experiment is to understand how to design agent workflows that produce acceptable code and sound mathematical organization. The quality bar I'm measuring against is precisely Mathlib's, so when you point out that def X : Prop isn't a theorem, I'm well aware of that (my own understanding isn't the point): those are concrete failure modes I need to characterize and fix in the generation process.
That said, I should have been much clearer about what the repo actually contains. Here are some stats I collected:
981 proved theorems/lemmas
~190 def X : Prop — known results from Iwaniec-Kowalski stated without proof, not claimed as formalized; they are there for the agents to continue their work
~50 def X : Prop — genuinely open hypotheses, used as antecedents in proved reduction chains (some of them since proved false)
The architecture is a reduction machine for a well-known conjecture: its difficulty is part of what allowed the process to accumulate results along the way. The def X : Prop declarations you flagged are the intentionally unproved nodes, either open conjectures or known results not yet in Mathlib that could be formalized in subsequent sessions.
None of this changes your point that much of it may not be Mathlib-ready. But I wanted to correct the impression that the repo is mostly unproved statements: the bulk is proved reductions, and that's where any potentially reusable lemmas would live.
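As a schematic sketch of what a reduction node looks like (hypothetical names and a toy hypothesis, not the actual content of the repo):

```lean
-- An intentionally unproved node: a named statement with no proof attached.
def HypothesisA : Prop :=
  ∀ n : ℕ, 2 ≤ n → ∃ p, Nat.Prime p ∧ p ≤ n

-- A proved reduction: assuming HypothesisA, the target follows.
theorem target_of_hypothesisA (hA : HypothesisA) :
    ∃ p, Nat.Prime p ∧ p ≤ 10 :=
  hA 10 (by norm_num)
```

The proved reductions are theorems of this shape; the def nodes are the open leaves they hang off.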
Johan Commelin (Feb 27 2026 at 10:49):
The following is my personal opinion. I'm not speaking with my Mathlib maintainer hat on.
@Marcello Paris It is too easy to generate code like this. It's not clear why people should spend time reviewing it. It's already been extremely generous of David that he looked at it in a fair amount of detail.
Mathlib isn't interested in "proved reductions", unless this is about reducing an honest open conjecture to another honest open conjecture.
If you formalize P → Q while there is a known proof of P, then this can happily live in a downstream repo, but not in Mathlib.
AIs currently produce somewhat uncanny-valley-like Lean code. But there is no point in risking Mathlib's carefully curated high-quality status by mass-adopting AI-generated code that hasn't been battle-tested.
AIs by construction produce code that looks useful and polished.
So it needs to go through extensive battle-testing to prove that it in fact is useful and polished.
I claim this should happen in "production" downstream of Mathlib.
Johan Commelin (Feb 27 2026 at 10:49):
See also #mathlib4 > What to do with "slop" PRs
Marcello Paris (Feb 27 2026 at 14:11):
Thank you both. To be clear: I'm not seeking to contribute to Mathlib, and I don't want to take up reviewer time. I posted this as a notification.
A few thoughts on what's behind this, though:
It's not that easy to get LLMs to generate this code: the hard part is understanding how to direct agents toward an acceptable mixture of readable reasoning and more opaque but valid solutions, ideally with some mathematical organization.
You mention "battle-tested" and "carefully curated": I find this genuinely interesting. If those criteria could be described more formally, it might eventually be possible to automate parts of quality assessment (a kind of semantic diff against Mathlib, essentially). That would help downstream repos converge toward Mathlib standards before anyone asks for reviewer time.
I take your point about battle-testing happening downstream. Whether a core-with-satellites model is the right long-term architecture is a separate question, and it could have different answers depending on whether it is approached from a mathematical or a software-engineering viewpoint.
Anyway, the situation you're describing isn't specific to my repo. The volume of AI-generated code that's almost good enough is only going to increase. How formalization communities handle that is (I guess) an interesting open question.
Last updated: Feb 28 2026 at 14:05 UTC