Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: AI-discovered proof on psi-integrals (with Lean formalization)


Johannes Schmitt (Dec 15 2025 at 13:51):

Dear all,

Two weeks ago, for the first time, an AI (namely GPT-5) solved one of the questions submitted to our AI benchmarking project IMProofBench that was considered an open problem by its author (me). It concerned intersection numbers on the moduli spaces Mbar_g,n of curves: finding, among all such descendant invariants

$$\langle \tau_{e_1} \cdots \tau_{e_n} \rangle = \int_{\overline{\mathcal{M}}_{g,n}} \psi_1^{e_1} \cdots \psi_n^{e_n} \qquad (*)$$

the insertions e1, ..., en leading to the maximal value of (*), for fixed g, n. It turns out the maximum is achieved by distributing the numbers e1, ..., en as evenly as possible (while adding up to 3g-3+n). GPT-5 found a neat proof, completely ignoring the existing theory of descendant invariants (the Witten-Kontsevich theorem etc.), using only abstract intersection-theoretic properties of the psi-classes (nefness + Khovanskii-Teissier log-concavity).
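
As a quick sanity check, using the classical genus-0 formula $\langle \tau_{e_1} \cdots \tau_{e_n} \rangle_0 = (n-3)!/(e_1! \cdots e_n!)$: for g = 0, n = 5 the exponents sum to 3g-3+n = 2, and indeed the balanced insertion beats the unbalanced one,

$$\langle \tau_1 \tau_1 \tau_0 \tau_0 \tau_0 \rangle = \frac{2!}{1!\,1!} = 2 > 1 = \frac{2!}{2!} = \langle \tau_2 \tau_0 \tau_0 \tau_0 \tau_0 \rangle.$$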

I decided to write up both the result and a bit of the context around it:

Extremal_Descendant_Integrals.pdf

Apart from presenting the (largely unedited) proof given by the AI, I also decided to provide an alternative presentation of the argument: it combines

  • a Lean-formalized and self-contained optimization result, showing that a certain class of functions D(e1, ..., en) attains its maximum on balanced vectors (a rough sketch of the statement shape follows below)
  • a geometric argument, with a proof written by myself based on the GPT-5 argument, showing that the descendant invariants belong to this class of functions
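
To give an idea of the first part, here is a minimal, self-contained sketch of the statement shape. The names and exact hypotheses below are illustrative stand-ins, not the ones in BalancedVectors.lean, and the proof is omitted:

  import Mathlib

  /-- A weak composition is balanced if any two parts differ by at most one.
  (Illustrative definition, simplified from the repo.) -/
  def IsBalanced {n : ℕ} (e : Fin n → ℕ) : Prop :=
    ∀ i j, e i ≤ e j + 1

  /-- Statement shape only (proof omitted): a positive, symmetric function on
  the compositions of `d` into `n` parts that is log-concave under moving one
  unit between two coordinates attains its maximum on a balanced composition. -/
  theorem max_on_balanced {n d : ℕ} (hn : 0 < n)
      (D : (Fin n → ℕ) → ℝ)
      (hpos : ∀ (e : Fin n → ℕ), (∑ i, e i) = d → 0 < D e)
      (hsym : ∀ (e : Fin n → ℕ) (σ : Equiv.Perm (Fin n)),
        D (fun i => e (σ i)) = D e)
      (hlc : ∀ (e : Fin n → ℕ) (i j : Fin n),
        (∑ k, e k) = d → i ≠ j → 0 < e i → 0 < e j →
        D e * D e ≥
          D (Function.update (Function.update e i (e i - 1)) j (e j + 1)) *
          D (Function.update (Function.update e i (e i + 1)) j (e j - 1))) :
      ∃ e : Fin n → ℕ, (∑ i, e i) = d ∧ IsBalanced e ∧
        ∀ f : Fin n → ℕ, (∑ i, f i) = d → D f ≤ D e := by
    sorry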

You can find the Lean blueprint here:
https://schmittj.github.io/balanced-vectors-blueprint/index.html

Lean file itself:
https://github.com/schmittj/balanced-vectors-blueprint/blob/13de9eeda635bc388e4dba68f514e3b42d339ae7/BalancedVectors.lean

Before embarking on this formalization project yesterday, I had no prior experience with Lean; all the code was generated by Claude Code (using the Lean 4 skill developed by @cameronfreer) and ChatGPT 5.2.

I would be very happy to hear feedback on any of the following:

  • the mathematical result itself
  • the chosen presentation and conventions on attribution of AI contributions
  • the formalization above (in particular any potential mis-formalization, and also how the Lean code reads for experts)

Jason Rute (Dec 15 2025 at 13:55):

Just so I’m clear: the AI proof was in natural language, the Lean proof was based on that AI proof, and it was vibe-coded by someone who doesn’t know Lean well. Is that correct?

Johannes Schmitt (Dec 15 2025 at 13:56):

Yes, that's correct. The AI proof was checked both by myself and by several colleagues, and other colleagues (with more Lean experience) have had at least a cursory look at the Lean formalization.

Eric Vergo (Dec 15 2025 at 14:23):

Johannes Schmitt said:

I would be very happy to hear feedback on any of the following:

  • the mathematical result itself
  • the chosen presentation and conventions on attribution of AI contributions
  • the formalization above (in particular any potential mis-formalization, and also how the Lean code reads for experts)

The formalization looks like a good candidate for testing out the 'AI-generated Lean proof format' proposed here: #Machine Learning for Theorem Proving > Discussion: AI-written mathematical proofs @ 💬. I have been working on the automated tool and would love to use this as a case study. Would you be alright with me refactoring this code and using it for that purpose?

Johannes Schmitt (Dec 15 2025 at 14:27):

@Eric Vergo This sounds very interesting (will read up on your precise proposal), but please do go ahead with using this as a case study! Very happy to provide any assistance you might need.

Auguste Poiroux (Dec 15 2025 at 14:28):

"Vibe-formalizing" without prior Lean knowledge, impressive, I like it.

Eric Vergo (Dec 15 2025 at 14:38):

Johannes Schmitt said:

Eric Vergo This sounds very interesting (will read up on your precise proposal), but please do go ahead with using this as a case study! Very happy to provide any assistance you might need.

Great, this will take me a day or two and I'll post here when things are ready.

Kevin Buzzard (Dec 15 2025 at 16:54):

Lean doesn't even have the definition of the moduli space of curves (indeed, we need the moduli space of elliptic curves for FLT, and it's a big hole right now), so it's hard to imagine that the Lean component is a substantial part of the overall proof?

Jason Rute (Dec 15 2025 at 17:00):

Kevin Buzzard said:

Lean doesn't even have the definition of the moduli space of curves (indeed, we need the moduli space of elliptic curves for FLT, and it's a big hole right now), so it's hard to imagine that the Lean component is a substantial part of the overall proof?

From the report above:

To our knowledge, the prerequisite results in algebraic geometry going into the proof (divisors on algebraic schemes, intersection numbers, nef classes, moduli spaces of curves, the Witten–Kontsevich theorem, etc.) are not yet formalized in Lean. Thus we decided to split the proof into a purely combinatorial optimization result (Theorem 3.1) and a geometric part proved non-formally (Theorem 3.2).

Johannes Schmitt (Dec 15 2025 at 17:09):

Maybe to give some quantitative context: the section containing the formalized proof is about 3.5 pages of the pdf; the informal version of the optimization result (that has been formalized) takes about 2.5 of these pages, the geometric result (which only has an informal proof) takes about 1 page. And I think that both of these parts have approximately the same density of claims-to-be-parsed per paragraph.

For me the advantage of the formalization is this: the optimization argument is one of those annoying, fiddly concatenations of inequalities and indices, where you have to be very careful to get everything right and not divide by zero. So the fact that this part is now fully verified is quite reassuring, and as a reader I would be happy to be able to focus on the (more interesting) geometric ingredients!
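
Concretely, the core step has roughly the following shape (my informal paraphrase): if $a_0, \dots, a_s > 0$ satisfies the log-concavity $a_t^2 \geq a_{t-1} a_{t+1}$ for $0 < t < s$, together with the palindromic symmetry $a_t = a_{s-t}$, then the ratios $a_{t+1}/a_t$ are non-increasing, so the sequence is unimodal with its maximum in the middle. Applying this along each pair of indices pushes any maximizer towards a balanced vector.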

Kevin Buzzard (Dec 15 2025 at 18:12):

Scholze's challenge problem was to prove some technical statement about Ext groups (which we didn't have at the time), but the result was reduced, in literally under ten lines of the PDF, to a concrete statement about undergraduate-level objects, whose proof was about 8 pages and which the community formalised in six months. Formalising the remaining ten lines took us over a year :-)

Kevin Buzzard (Dec 15 2025 at 18:15):

I have always pushed back against the idea that AI should be trusted to formalise definitions, and even the statement that $\mathcal{M}_{g,n}$ exists is highly nontrivial. However, it would be an interesting test of mathlib to see if we could now (manually) formalise the definition of the functor; one could then ask AI to prove the theorem that it is representable (which would be the proof that completes the definition of the scheme).

Johannes Schmitt (Dec 15 2025 at 18:59):

Kevin Buzzard said:

Scholze's challenge problem was to prove some technical statement about Ext groups (which we didn't have at the time), but the result was reduced, in literally under ten lines of the PDF, to a concrete statement about undergraduate-level objects, whose proof was about 8 pages and which the community formalised in six months. Formalising the remaining ten lines took us over a year :-)

Indeed, I would imagine that formalizing the missing page from the proof above could easily be a multi-year project. All I was saying is that, for a human expert in algebraic geometry, the time to think through that one page (during a refereeing process at a standard journal) might be comparable to the time they would spend checking all the inequalities in the formalized optimization problem. So, purely from the perspective of "check this particular result, assuming the rest of the literature", the decision to formalize the optimization part got about 50% of the benefit for about 0.01% of the work it would take to formalize everything.

Johannes Schmitt (Dec 15 2025 at 19:08):

Kevin Buzzard said:

I have always pushed back against the idea that AI should be trusted to formalise definitions, and even the statement that $\mathcal{M}_{g,n}$ exists is highly nontrivial. However, it would be an interesting test of mathlib to see if we could now (manually) formalise the definition of the functor; one could then ask AI to prove the theorem that it is representable (which would be the proof that completes the definition of the scheme).

Of course as an algebraic and enumerative geometer I would be very happy to see algebraic geometry in general being more fleshed out, in particular with some concrete goal in mind, like the definition of the moduli spaces of curves!

About potential AI involvement here: I agree that AI right now is not reliable enough to formalize extended theoretical constructions unsupervised in a way that we could trust. What I am wondering about is whether it would be acceptable for a human to formalize the chain of core statements and results, and AI to fill in the gaps. In a certain sense, the project above has about 5 definitions and 1 theorem, which is really all I care about, and which can be manually verified to be a correct formalization of the result I want. Then the AI filled in various lemmas and intermediate results, and from my perspective, as long as they make the final theorem compile, I am happy with that state of affairs.

Of course the project above is not trying to build some large theoretical machinery or represent a significant slice of mathematical literature, so it might be that in those cases there are more stringent requirements even on intermediate notions and results. But I still feel that going forward, it might be useful to consider whether there are any parts of the process that one would be willing to hand off to AI helpers.

Eric Vergo (Dec 15 2025 at 19:14):

Kevin Buzzard said:

I have always pushed back against the idea that AI should be trusted to formalise definitions...

Me too, and I think this should be the standard for the foreseeable future. The intent of the template is to quarantine all of the declarations that don't have a proof obligation (plus some other things) so they can get properly scrutinized. Kim also suggested that those declarations be pulled into Mathlib first, which makes sense to me.

Kevin Buzzard (Dec 15 2025 at 20:19):

I think it certainly makes sense that mathlib should end up as a "canonical" source for every mathematical definition being used today. This is the monorepo philosophy which has served us well so far.

Johannes Schmitt (Dec 15 2025 at 20:25):

I agree that definitions should be carefully curated and thought out. Do you have an estimate of what percentage of the development time goes into writing definitions, lemmas, and theorems that might be used elsewhere, vs. ad-hoc technical lemmas that are useful for precisely one purpose (one step in a bigger proof)?

I could imagine the latter are much more amenable to potential outsourcing. I would be curious what percentage of the work hours they represent (or maybe I am fundamentally misunderstanding how things work when building Mathlib!).

Eric Vergo (Dec 15 2025 at 20:26):

Kevin Buzzard said:

I think it certainly makes sense that mathlib should end up as a "canonical" source for every mathematical definition being used today. This is the monorepo philosophy which has served us well so far.

Absolutely. One of the reasons I am working on this is to prevent things from fracturing in the future. As the number of vibe-proven repos grows, so will the workload on the reviewers. If things get backlogged, people may start building on each other's work before it gets into Mathlib, and that seems like something we want to avoid.

Eric Vergo (Dec 16 2025 at 00:27):

I was able to put this together today: https://github.com/e-vergo/Balanced_Vectors

There are a number of deviations from the original suggestion by Kim, but that is because this repo contains 18 or so declarations that don't have a proof obligation. All of these can be found in Definitions.lean and are accompanied by a few lemmas that are used by other definitions. There are a few wrappers for things currently in Mathlib, but I don't think there are any meaningful opportunities to reduce the number of definitions.

Once everything in Definitions.lean is in Mathlib (which should go through the normal review process), the repo can be updated, reducing the review burden to the two MainTheorem files. If we want to enable people to submit AI-generated proofs in this manner, we are going to need a larger discussion about the specifics of the template/workflow; there are a number of design choices I made somewhat arbitrarily.

Feedback on all levels is welcome.

Johannes Schmitt (Dec 16 2025 at 09:52):

Thanks so much for putting this together so quickly! Some comments:

  • My guess is that to fit the pattern of the original suggestion, one would have to modify the main theorem to essentially expand out the various definitions that go into it (so it becomes standalone modulo Mathlib), and then, in the proof, first reassemble all these properties, noting that they fit the definitions from Definitions.lean. I agree that this would make the main theorem more transparent, though it would of course increase the total number of lines. I personally think it might be worth it, but again, I have very few Lean intuitions yet.
  • The definitions themselves seemed somewhat ad hoc to me; again, I am unsure how much sense it makes to put them in Mathlib. If people see some potential, I am happy to try submitting something for review.
  • I would be very happy to acknowledge your work in the paper (I am planning to upload to the arXiv for tonight's deadline). Do you think it makes sense to include a link to your repo above? Has the proposal by Kim been formalized in any way? If you would like to suggest a paste-ready formulation, please be my guest!

Also, since you had a bit of a look at the files, one curiosity I had:

  • How would you rate the quality of the Lean code? I have heard that autoformalizers tend to produce very long, redundant code that can often be condensed by a factor of 5 in the hands of an experienced Lean user. Does the code have the same redundant-slop quality, or is it mostly reasonable? Just curious!

Eric Vergo (Dec 16 2025 at 19:03):

Johannes Schmitt said:

Thanks so much for putting this together so quickly! Some comments:

  • My guess is that to fit the pattern of the original suggestion, one would have to modify the main theorem to essentially expand out the various definitions that go into it (so it becomes standalone modulo Mathlib), and then, in the proof, first reassemble all these properties, noting that they fit the definitions from Definitions.lean. I agree that this would make the main theorem more transparent, though it would of course increase the total number of lines. I personally think it might be worth it, but again, I have very few Lean intuitions yet.

I can't speak to whether or not the definitions are ad hoc, but Mathlib tends to want statements that are more general. If one were to 'upgrade' the definitions in this way, it's likely that the main theorem here could go untouched, because you can add additional lemmas saying that what you have now satisfies the more general definitions. I think this is what you are suggesting, and if you are interested in getting this result into Mathlib, that is probably the best route.

Johannes Schmitt said:

  • I would be very happy to acknowledge your work in the paper (I am planning to upload to the arXiv for tonight's deadline). Do you think it makes sense to include a link to your repo above?

If you would like to acknowledge me and/or link to the repo I have no objections, but I am certainly not requesting it.

Johannes Schmitt said:

Has the proposal by Kim been formalized in any way?

As far as I know it has not progressed past that post, and the tool I am working on is not ready yet. This is what I have so far. It still doesn't have the functionality to account for in-project definitions, as I am struggling to work through all of the exposed/public/private conditions required for using modules. If we want to standardize and adopt this, it is going to need feedback from quite a few stakeholders first. It is very much a work in progress, and in its early stages.

This is the kind of thing that I could knock out in a day if I wrote something in Python using text parsing, but I am building this purely in Lean, using the environment inspection tools. I am hoping to have something to share more broadly in the next few days, though it may take longer, as this is a bit outside of my wheelhouse. No doubt there are many people on this forum who could do a better job faster, but I get the sense that the experts here are becoming increasingly overloaded with work. I learn by getting my hands dirty, though, so this is great experience for me.
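
To give a flavor of what I mean by environment inspection, a toy command in that spirit (illustrative only, not the actual tool) might look like:

  import Lean
  open Lean Elab Command

  /-- Toy sketch (not the actual tool): walk the environment and report
  every declaration whose value depends on `sorryAx`. -/
  elab "#listSorries" : command => do
    let env ← getEnv
    for (name, info) in env.constants.toList do
      if let some value := info.value? then
        if value.getUsedConstants.contains ``sorryAx then
          logInfo m!"{name} depends on sorry"

Running #listSorries at the bottom of a file would then flag any remaining gaps.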

Johannes Schmitt said:

  • How would you rate the quality of the Lean code? I have heard that autoformalizers tend to produce very long, redundant code that can often be condensed by factors of x5 in the hands of an experienced Lean user. Does the code have the same redundant-slop quality, or is it mostly reasonable? Just curious!

It’s ok, but not perfect. Here is what my still-being-trained eyes see:

There are small things like missing docstrings and inconsistent variable naming, but those can be cleaned up pretty easily. Also, it looks like nonzeroCount is unused, so it can be discarded.

On the ‘less reasonable’ side, some of the proofs are long and difficult to follow, particularly unimodal_of_logconcave_palindromic. That proof can be broken up into smaller lemmas, roughly corresponding to its top-level ‘have’ statements; this will help with readability and is generally good practice.

Another improvement would be writing a helper lemma (or lemmas) for the following repeated pattern:

  -- e is the slice at position e.i
  have he_eq_slice : e = sliceComposition e i j hij (e i) hei_lo hei_hi := by
    refine WeakComposition.ext (fun k => ?_)
    simp only [sliceComposition]
    by_cases hki : k = i
    · simp only [hki, ite_true]
    · by_cases hkj : k = j
      · simp only [hkj, ite_true, if_neg hij.symm]; ring
      · simp only [hki, hkj, ite_false]
  -- The modified composition is the slice at position e.i - 1
  have hmod_eq_slice : e.modify i j hi hij = sliceComposition e i j hij (e i - 1) hem1_lo hem1_hi := by
    refine WeakComposition.ext (fun k => ?_)
    simp only [WeakComposition.modify, sliceComposition]
    by_cases hki : k = i
    · simp only [hki, ite_true]
    · by_cases hkj : k = j
      · simp only [hkj, ite_true, if_neg hij.symm]; ring
      · simp only [hki, hkj, ite_false]

This shows up on lines 684, 789, and 995.
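
For instance, in a simplified setting (illustrative definitions below, not the repo's actual sliceComposition API), the three-way case split can be proved once and reused:

  import Mathlib

  /-- Simplified stand-in for `sliceComposition`: overwrite `e` at `i` and
  `j` with `a` and `b`, leaving all other coordinates unchanged. -/
  def slice {n : ℕ} (e : Fin n → ℤ) (i j : Fin n) (a b : ℤ) : Fin n → ℤ :=
    fun k => if k = i then a else if k = j then b else e k

  /-- The repeated case analysis, packaged once: a function agreeing with
  `slice e i j a b` at `i`, at `j`, and away from both equals it. -/
  lemma eq_slice_of_agree {n : ℕ} (e f : Fin n → ℤ) (i j : Fin n) (a b : ℤ)
      (hij : i ≠ j) (hi : f i = a) (hj : f j = b)
      (hk : ∀ k, k ≠ i → k ≠ j → f k = e k) :
      f = slice e i j a b := by
    funext k
    simp only [slice]
    by_cases hki : k = i
    · subst hki; simp [hi]
    · by_cases hkj : k = j
      · subst hkj; simp [hki, hj]
      · simp [hki, hkj, hk k hki hkj]

Each occurrence of the pattern would then reduce to supplying the agreement hypotheses.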

Kim Morrison (Dec 16 2025 at 22:03):

I agree that this example doesn't fit cleanly into my original formulation: there are too many definitions needed for the statement of the theorem that don't really belong in Mathlib. So I like the Definitions.lean file. I think it would be good if this file, and the README, contained a big warning message: "To trust the claimed main result, and for it to be meaningful to have Lean check the actual proofs, you must read and understand the content of Definitions.lean."

Kim Morrison (Dec 16 2025 at 22:05):

@Eric Vergo, a few quick comments:

  • I got an error running lake exe cache get in the current state of the repo ("error: external command 'git' exited with code 128": @Mac Malone, could we please always print the underlying error message when lake fails in a git operation?)
  • You've used a lakefile.lean. That works, but our preference is for the ecosystem to use lakefile.toml where possible. You can just run lake translate-config toml.

Eric Vergo (Dec 16 2025 at 23:06):

Kim Morrison said:

I agree that this example doesn't fit cleanly into my original formulation: there are too many definitions needed for the statement of the theorem that don't really belong in Mathlib. So I like the Definitions.lean file. I think it would be good if this file, and the README, contained a big warning message: "To trust the claimed main result, and for it to be meaningful to have Lean check the actual proofs, you must read and understand the content of Definitions.lean."

Warning added.

Kim Morrison said:

  • I got an error running lake exe cache get in the current state of the repo

I was unable to reproduce this on two machines. Let me know if there is anything I can do about this.

Kim Morrison said:

  • You've used a lakefile.lean. That works, but our preference is for the ecosystem to use lakefile.toml where possible. You can just run lake translate-config toml.

done.

Kim Morrison (Dec 17 2025 at 00:07):

@Eric Vergo, I've just made three PRs to your repo, hopefully reducing the "surface area".

Kim Morrison (Dec 17 2025 at 00:09):

@Eric Vergo, also, you don't have any CI set up on this repo. I recommend just running lake new in a fresh directory and then copying over the .github directory from the template that it creates.

Kim Morrison (Dec 17 2025 at 00:20):

If you ping me once these PRs are merged I can do a second pass.

Kim Morrison (Dec 17 2025 at 04:08):

@Johannes Schmitt, @Eric Vergo, one way to reduce the amount that needs to be human verified here is to avoid defining the WeakComposition.modify function in Definitions.lean (because it requires a bunch of supporting lemmas), and instead just codify an IsModification predicate relating two WeakCompositions. Then SatisfiesLogConcavity can be stated much more cheaply.
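
Roughly this shape (illustrative only, stated on bare functions rather than the repo's WeakComposition):

  import Mathlib

  /-- Illustrative sketch: `e'` arises from `e` by moving one unit from
  coordinate `i` to coordinate `j`. Stated on bare functions here; the
  real version would relate two `WeakComposition`s. -/
  def IsModification {n : ℕ} (e e' : Fin n → ℕ) (i j : Fin n) : Prop :=
    i ≠ j ∧ e' i + 1 = e i ∧ e' j = e j + 1 ∧
      ∀ k, k ≠ i → k ≠ j → e' k = e k

Nothing needs to be proved to state this, and the log-concavity hypothesis can then quantify over pairs related by it.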

I could implement that easily.

Kim Morrison (Dec 17 2025 at 04:08):

(Standard naming says SatisfiesLogConcavity should be IsLogConcave, btw.)

Johannes Schmitt (Dec 17 2025 at 05:43):

Thank you so much, Eric and Kim, for your work and taking an interest! I agree with the modifications you proposed - please feel free to go ahead and/or tell me if there is something I can contribute to.

The first version of the paper is now on the arXiv ( https://arxiv.org/abs/2512.14575 ), and I included the following reference to the project above:
image.png

In a later update I would of course also acknowledge Kim's contributions, change links to point to the new refactored repository, and add a reference to Eric's TAIL project (if you like!).

Timothy Chow (Dec 17 2025 at 13:11):

Kevin Buzzard said:

I think it certainly makes sense that mathlib should end up as a "canonical" source for every mathematical definition being used today.

I agree with the sentiment here, but taking it literally would mean that riemannZeta 1 = (γ - Complex.log (4 * ↑π)) / 2 would become a canonical definition, which surely makes no sense.

Johannes Schmitt (Dec 17 2025 at 13:19):

While I cannot say anything about the feasibility of implementation, from my perspective it might be useful for some mathematical notions to have not just one canonical definition, but a tightly connected cluster of definitions, with strong heuristics for converting between them and transferring properties. When you call on such an object, you pick one of these notions and work with it, and whenever a construction needs another facet of the object, that facet can smoothly rotate into place so the proof can continue.

Indeed, for some mathematical objects it seems very hard to pick one completely canonical definition that will be optimal for every purpose (like the one given in the BOOK OF DEFINITIONS).

Eric Vergo (Dec 17 2025 at 22:06):

Kim Morrison said:

If you ping me once these PRs are merged I can do a second pass.

Merged; generated docs are here.

Notification Bot (Dec 20 2025 at 12:48):

95 messages were moved from this topic to #mathlib4 > Mathlib as a source of definitions by Oliver Nash.


Last updated: Dec 20 2025 at 21:32 UTC