Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Math, Inc. and Gauss discussion
Auguste Poiroux (Sep 11 2025 at 17:51):
This is the discussion thread for our announcement of the formalization of the Strong PNT. We want to ensure that we can make this technology useful for the community and massively accelerate formalization efforts. We are especially interested in hearing about ideal ergonomics of use, domains of expertise, documentation, and user interfaces.
Simon Sorg (Sep 11 2025 at 18:15):
Congrats on the very impressive work! Looking forward to learning more about the details and seeing new applications develop over time. I believe there is a lot of value in this approach.
Matthew Ballard (Sep 11 2025 at 18:54):
Those are some monster proof bodies.
Adam Topaz (Sep 11 2025 at 18:54):
@Auguste Poiroux could you say a bit more about how much of the Lean code was:
- Written by an LLM?
- Written by a human?
- Written by an LLM but edited by a human?
Matthew Ballard (Sep 11 2025 at 18:55):
How much of the terabytes of RAM could be saved if the agents were trained to abstract statements more?
Auguste Poiroux (Sep 11 2025 at 19:23):
@Adam Topaz yes sure. We have a few paragraphs about this in our readme file. Everything in PNT1_ComplexAnalysis.lean, PNT2_LogDerivative.lean, PNT3_RiemannZeta.lean, and PNT4_ZeroFreeRegion.lean is generated by Gauss. PNT5_StrongPNT.lean, Z0.lean, and ZetaZeroFree.lean are adapted from PNT+, i.e. a few statements were modified to fit the Strong PNT conclusion (in particular the exponent change), and proofs were repaired using Gauss.
Adam Topaz (Sep 11 2025 at 19:24):
I see, thanks! That is impressive. Can you say what "Human Supervision" means in practice?
Auguste Poiroux (Sep 11 2025 at 19:28):
Human supervision means that we refined the blueprint LaTeX statements and proofs when the agent struggled to formalize a result.
Adam Topaz (Sep 11 2025 at 19:29):
Thanks for the clarification!
Thomas Zhu (Sep 11 2025 at 20:39):
Did the LLM also generate the blueprint, or was it completely human written?
Auguste Poiroux (Sep 11 2025 at 20:43):
The blueprint is human written. Some informal proofs are LLM-generated
Ching-Tsun Chou (Sep 11 2025 at 20:55):
How do you ensure that the Lean definitions and theorem statements are a correct formalization of the blueprint?
Matteo Cipollina (Sep 11 2025 at 20:57):
Could you say whether you trained Gauss on the PNT repo, or fed it the API from the PNT repo in context (or something else)?
Jason Rute (Sep 11 2025 at 21:01):
Did the human manually break up lemmas in the blueprint based on the progress of the Gauss agent? For example, say the agent had trouble formalizing Lemma 4.123; did you therefore rewrite its proof in terms of three more basic lemmas that would be easier for the agent? (I noticed many of the lemmas are really basic, and if they are human-written, I assume that is why they are there.)
Adam Topaz (Sep 11 2025 at 21:02):
Yeah, I agree that most of the lemmas would really be thought of as implicit steps in an informal proof. I think calling it a "high level blueprint" is pretty generous. In any case, I still think this is a really impressive result!
Daniel Litt (Sep 11 2025 at 21:24):
Auguste Poiroux said:
Human supervision means that we refined the blueprint LaTeX statements and proofs when the agent struggled to formalize a result.
Is it possible to give a sense of how much of the TeX files were written by humans?
Auguste Poiroux (Sep 11 2025 at 21:27):
Ching-Tsun Chou said:
How do you ensure that the Lean definitions and theorem statements are a correct formalization of the blueprint?
That is an important part of the human supervision. We manually checked critical Lean statements and refined/adapted the corresponding informal statements in cases of misformalization.
Ching-Tsun Chou (Sep 11 2025 at 21:32):
Is there an example of the "human supervision" you referred to above? Namely, an example of a proof step that is too big for the agent to handle, what sort of feedback the human supervisor gets from the agent, and how the human supervisor breaks up the big step into small ones.
Auguste Poiroux (Sep 11 2025 at 21:33):
Jason Rute said:
Did the human manually break up lemmas in the blueprint based on the progress of the Gauss agent? For example, say the agent had trouble formalizing Lemma 4.123; did you therefore rewrite its proof in terms of three more basic lemmas that would be easier for the agent? (I noticed many of the lemmas are really basic, and if they are human-written, I assume that is why they are there.)
Gauss's performance improved over the course of the formalization. The closer you get to Strong PNT in the dependency graph, the larger the gaps are. But yes, I agree that our blueprint is quite dense in comparison to other projects. We are using the blueprint as an interface to guide Gauss in the formalization.
Auguste Poiroux (Sep 11 2025 at 21:33):
Daniel Litt said:
Auguste Poiroux said:
Human supervision means that we refined the blueprint LaTeX statements and proofs when the agent struggled to formalize a result.
Is it possible to give a sense of how much of the TeX files were written by humans?
TeX files are written by humans
Jason Rute (Sep 11 2025 at 21:41):
Gauss's performance improved over the course of the formalization. The closer you get to Strong PNT in the dependency graph, the larger the gaps are.
Did you start at the leaf nodes or the final goal? Or a mix?
Auguste Poiroux (Sep 11 2025 at 22:00):
Ching-Tsun Chou said:
Is there an example of the "human supervision" you referred to above? Namely, an example of a proof step that is too big for the agent to handle, what sort of feedback the human supervisor gets from the agent, and how the human supervisor breaks up the big step into small ones.
An example is I_is_antiderivative (upper right of the dependency graph). It has a whole dependency subgraph all to itself.
When the agent is not able to prove a statement, an analysis is provided at the end. The reasons span misformalizations, proof steps that are too big, and sometimes incorrect informal statements. As for how the human supervisor breaks up a big step, the mathematician's role is critical here: they add the intermediate statements.
Auguste Poiroux (Sep 11 2025 at 22:00):
Jason Rute said:
Gauss's performance improved over the course of the formalization. The closer you get to Strong PNT in the dependency graph, the larger the gaps are.
Did you start at the leaf nodes or the final goal? Or a mix?
It's a mix, but mainly from the leaf nodes
Notification Bot (Sep 11 2025 at 23:40):
8 messages were moved from this topic to #Machine Learning for Theorem Proving > Danger of Autoformalizers (e.g. Gauss) by Jason Rute.
Jason Rute (Sep 12 2025 at 02:19):
@Auguste Poiroux Since you start from the leaves, is your algorithm greedy, in that it takes the first formalization (and proof of that formalization) that it can prove, possibly compounding errors? Or does it handle multiple formalization candidates, not committing to a formalization until human review, or until that formalization proves itself useful for filling in many later steps in the blueprint? For example, maybe you need "1/2 < 1" as a fact about real numbers, but it formalizes it as:
theorem half_less_than_one : 1/2 < 1 := by norm_num
which, while a correct theorem, is a misformalization. When you try to continue, you will eventually find this lemma is not useful in the places you intend to use it. (This toy example might be too simple, but I hope you see my point.) Do you have to manually go in and clear out this bad lemma, or is your algorithm still able to consider alternate versions of the lemma in parallel, seeing which version is more useful for later steps?
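Spelling the toy example out in Lean (a minimal sketch; the first theorem name is made up for illustration):

```lean
import Mathlib

-- What is presumably wanted: a fact about real-number division.
theorem half_lt_one_real : (1 : ℝ) / 2 < 1 := by norm_num

-- The misformalization: with no type ascription, the literals elaborate
-- in ℕ, where 1 / 2 = 0, so this correct theorem actually states 0 < 1
-- and is useless wherever the real-valued fact is needed.
theorem half_less_than_one : 1 / 2 < 1 := by norm_num
```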
David Renshaw (Sep 12 2025 at 02:31):
I thought it would be fun to run tryAtEachStep exact? on the repo. Here are its top findings, which suggest that the code still has lots of room to be compressed (a toy illustration of the kind of compression follows the list):
- uniqueDiffWithinAt_convex_complex is a trivial consequence of uniqueDiffWithinAt_convex_of_isRCLikeNormedField.
- lem_tsum_norm_vonMangoldt_depends_on_Re_cast is a duplicate of lem_tsum_norm_vonMangoldt_depends_on_Re.
- enat_le_iff_forall_nat is a trivial consequence of ENat.forall_natCast_le_iff_le
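For anyone unfamiliar with the tool: as I understand it, tryAtEachStep re-runs a given tactic (here exact?) at every step of every proof and reports where it closes the goal outright. The lemma below is hypothetical, not from the repo:

```lean
import Mathlib

-- A hand-rolled proof of a fact Mathlib already knows.
theorem my_abs_nonneg (x : ℝ) : 0 ≤ |x| := by
  rcases le_or_lt 0 x with h | h
  · rw [abs_of_nonneg h]; exact h
  · rw [abs_of_neg h]; linarith

-- Running `exact?` at the first step finds the existing lemma instead,
-- collapsing the whole proof body to a single term.
theorem my_abs_nonneg' (x : ℝ) : 0 ≤ |x| := abs_nonneg x
```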
Kim Morrison (Sep 12 2025 at 03:11):
Something I would love to see is how easily (or otherwise) Gauss copes with upgrading this repository to the latest PrimeNumberTheoremAnd (and v4.22.0 of Lean+Mathlib).
(There is an error in PNT1_ComplexAnalysis.lean, so this is not a triviality.)
Kim Morrison (Sep 12 2025 at 03:12):
Proof repair after upgrades is a sine qua non for autoformalizers to participate in the larger ecosystem. I hope it is very doable, but we need to see it happening!
Auguste Poiroux (Sep 12 2025 at 08:25):
@Jason Rute this indeed happened a few times during our formalization of Strong PNT. In later steps, the agent sometimes simply skipped misformalized lemmas, and sometimes used them. Misformalizations are often not provable, and human supervision is necessary in these cases.
Kevin Buzzard (Sep 12 2025 at 09:40):
There was a question above which I think was unanswered: was the computer system trained on the human PNT effort led by Kontorovich/Tao before it started to help?
Auguste Poiroux (Sep 12 2025 at 10:01):
Kevin Buzzard said:
There was a question above which I think was unanswered: was the computer system trained on the human PNT effort led by Kontorovich/Tao before it started to help?
Sorry we missed the question. We did not train on the PNT+ project
Matteo Cipollina (Sep 12 2025 at 10:18):
Auguste Poiroux said:
Kevin Buzzard said:
There was a question above which I think was unanswered: was the computer system trained on the human PNT effort led by Kontorovich/Tao before it started to help?
Sorry we missed the question. We did not train on the PNT+ project
Thanks. Could you say whether, as part of the human-made blueprint, you provided an "API" blueprint in context to teach Gauss about PNT+? PNT+ is not part of generic LLM training data, except very recently, so I assume the LLM will tend to use the basic Mathlib API rather than PNT+ unless carefully prompted.
Filippo A. E. Nuccio (Sep 12 2025 at 11:14):
This is perhaps a trivial consideration, but I tried to grep the words structure and class inside StrongPNT and found 0 occurrences. To me (and I hope this is not taken as a criticism, or as an underestimation of the incredible effort!) this shows to what extent this project differs from a human one. I would easily imagine that for such a big project, a human team -- or a human-led one -- would come up with useful structures or classes to organise the code, which would in turn reflect abstractions and concepts developed during the process.
Jason Rute (Sep 12 2025 at 11:22):
I think I counted ~30 definitions and they were all really basic, sometimes trivial. I think that is a current limitation of this work (and the almost everywhere abc theorem by Trinity). I think a definition-heavy project like FLT would be a very different beast. It probably isn’t surprising that they chose theorems like this which use existing definitions and machinery.
Alfredo Moreira-Rosa (Sep 12 2025 at 11:33):
I don't think it's the AI's fault that it did not build more abstraction per se. It's just a matter of the guidance in the instructions you give it.
If you give it a set of context instructions to build abstractions, using structures and classes to abstract concepts, it will try to do so.
It's the same in my software engineering field: if I just let the AI build functional code, it will write flat code without abstractions, but when I write an instruction file that lays out the standard practices for my projects, the LLM follows them and writes proper abstractions.
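As a toy Lean illustration of what such an instruction might ask for (invented names, not from the project):

```lean
import Mathlib

-- Flat style: the same hypotheses are repeated in every lemma statement.
theorem norm_le_flat (f : ℂ → ℂ) (R : ℝ) (hR : 0 < R)
    (hf : ∀ z, ‖z‖ ≤ R → ‖f z‖ ≤ 1) (z : ℂ) (hz : ‖z‖ ≤ R) :
    ‖f z‖ ≤ 1 :=
  hf z hz

-- Abstracted style: a structure bundles the recurring data and hypotheses,
-- so downstream statements mention one object instead of four arguments.
structure BoundedOnDisk (R : ℝ) where
  f : ℂ → ℂ
  hR : 0 < R
  bound : ∀ z, ‖z‖ ≤ R → ‖f z‖ ≤ 1

theorem norm_le_structured {R : ℝ} (D : BoundedOnDisk R) (z : ℂ)
    (hz : ‖z‖ ≤ R) : ‖D.f z‖ ≤ 1 :=
  D.bound z hz
```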
Michael Bucko (Sep 12 2025 at 12:08):
Congratulations!
When Gauss used a misformalized lemma, what did the proof look like?
How did it know to skip certain lemmas?
What is the price (compute-hours) of a single misformalized lemma?
How often does Gauss reinvent the wheel?
What kind of rewards are you using? Per-step rewards (tactic correctness, goal-count delta, token length, and so on) plus sparse end-of-proof rewards (compilation success, lemma reuse, blueprint hit rate)?
Are you planning to release the model on HF?
Jason Rute (Sep 12 2025 at 12:12):
As for rewards, it is not even clear whether they are training a model at all. This could just be a large agentic system using existing models (or minimally fine-tuned models with no RL).
Alfredo Moreira-Rosa (Sep 12 2025 at 12:21):
Yes it seems pretty clear in the announcement that it's just an agentic setup.
Matthew Ballard (Sep 12 2025 at 12:29):
I asked about the lack of abstractions with the abc result at the INI talk announcing it. I don't think I was satisfied with the answer, but since then I've come to the following feeling.
If you look at the human input, it is still pretty thorough for a guide. From my experience, I would guess today that if you, as a math/Lean expert, provided a structure with parents and fields and some semantic info, it would fill that in easily. A good test would be handling a lightly sketched API for a new structure; see the sketch below.
The workflow would look not too dissimilar from today's (yesterday's?). Human experts decompose the problem and invent structures plus API to model the mathematics. The system attempts to fill things in. Based on that work, the humans gain greater understanding and tweak the structures and the approach. Then the fill-in work starts again.
It is the fill-in step that could really see the acceleration.
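Concretely, the kind of lightly sketched input I have in mind might look something like this (hypothetical names; the sorry stubs are exactly what the system would be asked to fill):

```lean
import Mathlib

/-- Hypothetical sketch a human expert hands over: the concept, its fields,
and the intended API, with the proofs left as `sorry` stubs. -/
structure DirichletLike where
  coeff : ℕ → ℂ
  abscissa : ℝ
  summable_of_gt : ∀ s : ℂ, abscissa < s.re →
    Summable fun n : ℕ => coeff n / (n : ℂ) ^ s

namespace DirichletLike

/-- Sketched API lemma: summability persists to the right of the abscissa.
Filling in this proof would be the agent's job. -/
theorem summable_of_re_le (D : DirichletLike) {s t : ℂ}
    (hs : D.abscissa < s.re) (hst : s.re ≤ t.re) :
    Summable fun n : ℕ => D.coeff n / (n : ℂ) ^ t := by
  sorry

end DirichletLike
```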
Matthew Ballard (Sep 12 2025 at 12:42):
Jason Rute said:
As for rewards, it is not even clear whether they are training a model at all. This could just be a large agentic system using existing models (or minimally fine-tuned models with no RL).
I don't think it is a coincidence that Christian mentioned the propensity of GPT-5 to insert axioms into Lean code in the same podcast where he highlighted the PNT as a formalization target.
Jason Rute (Sep 12 2025 at 13:46):
@Alex Kontorovich it would be great to hear your thoughts on this. In particular, what do you feel it would have taken to do this manually?
Alex Kontorovich (Sep 12 2025 at 14:18):
Sorry, I was traveling all day yesterday and am now about to go teach and have a slew of meetings with my grad students. I will be sure to collect my thoughts (which I've had some weeks to do, since they told me and Terry about this a while ago), and report them here, but it might take a few days... Overall I think it's a great achievement and shows just how close we are to potentially getting a lot more math formalized much more rapidly than before; but there are also many caveats... More soon!
(deleted) (Sep 13 2025 at 03:22):
How much does this cost?
Filippo A. E. Nuccio (Sep 13 2025 at 08:54):
@Auguste Poiroux I see that the Lean proofs contain a lot of comments, somewhat explaining the proof strategy. Were they inserted by Gauss, or by the humans writing the blueprint? If it was Gauss, was it explicitly instructed to do so, or is there some other reason?
(deleted) (Sep 13 2025 at 09:33):
LLMs in general insert lots of comments. In fact more than humans.
(deleted) (Sep 13 2025 at 19:52):
I am more and more certain this must be using some sort of tree search algorithm
(deleted) (Sep 13 2025 at 19:52):
To drive the LLM
(deleted) (Sep 13 2025 at 19:53):
Infinibranch is mentioned for a reason
(deleted) (Sep 13 2025 at 19:55):
Given that Infinibranch is publicly available, maybe someone should try setting up a tree search agent :eyes: and see how it goes
(deleted) (Sep 13 2025 at 19:58):
Wow. Actually, so much about the system is already publicly available. Treating the LLM as a low-level component instead of an all-powerful superintelligence really is a way to engineer an autonomous AI for formal verification.
(deleted) (Sep 13 2025 at 20:00):
This is an engineering problem. The high-level idea is rather clear.
Jason Rute (Sep 14 2025 at 12:54):
@GasStationManager would it be possible for SafeVerify to check this whole project up to the main theorem at the end? Or does the fact that it is a multi-file development mean that SafeVerify would just trust the imported files without checking them?
Jason Rute (Sep 14 2025 at 13:13):
It might be easy to concatenate this code into just one file if that is a limitation.
Jason Rute (Sep 14 2025 at 13:14):
(And for those who don't know, SafeVerify does things that other tools don't, like checking the term proofs, checking the axioms, and checking that the theorem statement's term didn't change in the presence of preceding code in the file.)
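(The axiom check, at least, is easy to approximate in plain Lean; a minimal example:)

```lean
theorem demo : 1 + 1 = 2 := rfl

-- Lists every axiom the compiled proof term depends on. A clean proof like
-- `demo` reports that it depends on no axioms, while a proof smuggling in
-- `sorry` or a custom axiom would show it here.
#print axioms demo
```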
Jason Rute (Sep 14 2025 at 16:08):
This (vague) post by Tao might (or might not) be referring to this project. https://mathstodon.xyz/@tao/115196924307085967
Jason Rute (Sep 14 2025 at 16:08):
I'm not 100% sure what he is referring to, but one thing might be Math Inc. contrasting the time this proof took (about three weeks) with the time the original Lean PNT project took (about 18 months). Obviously, the original project had a large learning component to it. The participants weren't Lean experts. The authors of other formal proofs of the PNT, such as Jeremy Avigad, John Harrison, and Mario Carneiro, probably did it faster (and could certainly do it much faster today).
Jason Rute (Sep 14 2025 at 16:08):
Also, Tao mentions upstreaming to Mathlib. Rushing to a proof is certainly not the same thing as taking the time to clean and review the code for maintainability and future application. I think both the rushed AI approach and the slow, careful human approach are valuable, with independent strengths, but again, I could see the comparison of the two projects rubbing some people the wrong way (especially if the mainstream news picks this up). I think an interesting datapoint would be how easy it is for others to clean this code and incorporate it into the original PNT project.
Jason Rute (Sep 14 2025 at 16:20):
I guess there is also the aspect that this project in some ways "scooped" the original PNT project (which again was a learning opportunity). So this means they have to shift their focus to other projects to learn Lean (which could be scooped as well). @Kevin Buzzard: Would you be happy or sad if Math Inc. scooped you on FLT?
Kevin Buzzard (Sep 14 2025 at 16:21):
I would be elated!
Kevin Buzzard (Sep 14 2025 at 16:23):
I am very unclear though about how well these things are going to be able to handle definitions, especially definitions in mathlib style, which is why I have another project focussing on formalising modern mathematical definitions in a mathlib-acceptable way (which I should really write a blog post about)
Jason Rute (Sep 14 2025 at 16:24):
I don't think this approach would currently scale to FLT. For one, @Jared Duker Lichtman would have to learn the proof. :rofl:
(deleted) (Sep 14 2025 at 17:18):
I don't think we can MCTS our way to FLT :eyes:
(deleted) (Sep 14 2025 at 17:21):
What Math, Inc did was very expensive
Ching-Tsun Chou (Sep 14 2025 at 20:34):
Kevin Buzzard said:
I am very unclear though about how well these things are going to be able to handle definitions, especially definitions in mathlib style, which is why I have another project focussing on formalising modern mathematical definitions in a mathlib-acceptable way (which I should really write a blog post about)
Please write the blog post. I very much want to learn what exactly the "mathlib style" is.
Andrew Sutherland (Sep 16 2025 at 13:19):
For those in the Cambridge, MA area, Jared Duker Lichtman and Jesse Han are giving a presentation on this result at Harvard CMSA at noon today as part of the Geometry of Machine Learning workshop, and will also be speaking at MIT on Thursday. (I think you can also watch the CMSA talk on Zoom; you can register at the link above.)
Eric Wieser (Sep 16 2025 at 13:27):
Is that and ?
Eric Rodriguez (Sep 16 2025 at 13:39):
I'm not sure if there are issues with the webinar, but I think the AlphaGeometry talk is meant to be on right now, and I only see a blank screen.
harry sanders (Sep 16 2025 at 15:14):
Eric Rodriguez said:
I'm not sure if there are issues with the webinar, but I think the AlphaGeometry talk is meant to be on right now, and I only see a blank screen.
livestream should be available now!
Zixiao Wang (Sep 16 2025 at 15:17):
Andrew Sutherland said:
For those in the Cambridge, MA area, Jared Duker Lichtman and Jesse Han are giving a presentation on this result at Harvard CMSA at noon today (Sep 16) as part of the Geometry of Machine Learning workshop and will also be speaking at MIT on Thursday (Sep 18) at 2:30. (I think you can also watch the CMSA talk on Zoom, you can register at the link above).
Thanks for sharing! I just registered for the Harvard workshop, though a bit last minute — do you know if there’s still space to attend in person?
(deleted) (Sep 16 2025 at 17:04):
Did Math Inc talk about technical details? I was outside for most of the talk
(deleted) (Sep 16 2025 at 17:10):
Judging from the questions, looks like no
Andrew Sutherland (Sep 16 2025 at 18:02):
Details on the MIT talk on Thursday are available here (note the time is 2:30). The MIT talk will not be on Zoom, but CMSA usually posts recordings within a few days.
Anthony Wang (Sep 18 2025 at 22:16):
Huỳnh Trần Khanh said:
Did Math Inc talk about technical details? I was outside for most of the talk
I attended the Math Inc talk at MIT today and it was interesting but very light on technical details. The talk was basically the same as the stuff already on the math.inc website.
Deleted User 968128 (Sep 22 2025 at 20:18):
Huỳnh Trần Khanh said:
I don't think we can MCTS our way to FLT :eyes:
It will not just be MCTS but also retrieval-augmented generation (RAG). This is where the well-formed Mathlib helps quite a bit. It's worth mentioning that DeepMind did quite well at IMO 2025 (versus OpenAI) at presenting not just proofs but readable proofs. That is something LLMs excel at, and you will find that as these tools mature, the problem of generating good Lean code will become a tractable hurdle.
The true and challenging hurdle in all of this will be how well the agentic flow can find efficacious inspiration from the repos and research it has access to in order to make the necessary and correct leaps.
And of course, it goes without saying, enfeeblement risk looms large in everything AI touches.
Wrenna Robson (Sep 25 2025 at 10:56):
Tim Shephard said:
Huỳnh Trần Khanh said:
I don't think we can MCTS our way to FLT :eyes:
It will not just be MCTS but also retrieval-augmented generation (RAG). This is where the well-formed Mathlib helps quite a bit. It's worth mentioning that DeepMind did quite well at IMO 2025 (versus OpenAI) at presenting not just proofs but readable proofs. That is something LLMs excel at, and you will find that as these tools mature, the problem of generating good Lean code will become a tractable hurdle.
The true and challenging hurdle in all of this will be how well the agentic flow can find efficacious inspiration from the repos and research it has access to in order to make the necessary and correct leaps.
And of course, it goes without saying, enfeeblement risk looms large in everything AI touches.
What is enfeeblement risk? I think I might know but it would be good to get an expert definition.
Deleted User 968128 (Sep 25 2025 at 15:07):
Wrenna Robson said:
What is enfeeblement risk? I think I might know but it would be good to get an expert definition.
I think the definition is more interesting when defined specifically for a narrow domain (like autoformalization), but generally it is the risk of weakening something along some dimension of capability due to some change.
Something I learned recently is that Socrates thought the alphabet posed enfeeblement risk. We know this because his student Plato wrote it down. :)
Wrenna Robson (Sep 25 2025 at 15:11):
Well, he was wrong about that one I guess - but the phenomenon is real in general (I am less good at using log tables than someone of my background would have been sixty years ago, I suspect).
Deleted User 968128 (Sep 25 2025 at 15:12):
To loop this back to Gauss, I think it behooves them to include a section on enfeeblement risk on their math.inc website. And perhaps a novel, but hopefully soon-to-be-solved, risk: workslop risk.
Wrenna Robson (Sep 25 2025 at 15:12):
Whether or not it's a problem remains to be seen I suppose. It is a concern for me with AI I think - I am really not enthusiastic about it as a technology/set of technologies in general - but we are where we are and there's interesting stuff to learn.
Wrenna Robson (Sep 25 2025 at 15:13):
Tim Shephard said:
Right to a degree. The phenomenon of 'book smarts' is real, I believe.
Ah, yes, I see what you mean.
Wrenna Robson (Sep 25 2025 at 15:16):
The work above is interesting but yeah my goodness this is ugly work. Cool that it can do it but the abstraction and API design is bad to the point of being nearly useless in and of itself, I would say.
Wrenna Robson (Sep 25 2025 at 15:17):
Though I mean surely that is a matter of training rather than a fundamental capacity - certainly I would not make the latter claim.
Deleted User 968128 (Sep 25 2025 at 15:25):
Wrenna Robson said:
The work above is interesting but yeah my goodness this is ugly work. Cool that it can do it but the abstraction and API design is bad to the point of being nearly useless in and of itself, I would say.
There is an idea I am noodling on, which is "Proof Axiology". Spamming/workslopping out Lean code is intriguing, but might be much less valuable than some imagine at this point.
However, if there were a deeper investigation into creating rubrics around which proofs are of interest, this could help drive autoformalization engines like Gauss in a more productive manner.
There is a similar problem right now in antibiotic discovery. Teams are spamming out new discoveries, but there are limited resources to prove what is efficacious, and this could be hindering the search for real breakthroughs rather than helping it.
Wrenna Robson (Sep 25 2025 at 15:36):
I would say that even in human-authored Lean code, "what makes a good API" is under-researched - I mean, people have opinions about it, but I am not aware of that much work (a little, I think?) to turn that into research knowledge.
Wrenna Robson (Sep 25 2025 at 15:36):
Something like that would obviously also have utility here.
Last updated: Dec 20 2025 at 21:32 UTC