Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Checking large AI generated projects


Jason Rute (Sep 15 2025 at 11:34):

The recent proof by Gauss of the Strong PNT brings up an important question, one that I briefly discussed with @Talia Ringer on X (https://x.com/TaliaRinger/status/1967148801862533359).

Jason Rute (Sep 15 2025 at 11:34):

How do we verify large AI proofs?

Jason Rute (Sep 15 2025 at 11:34):

A common answer is just:

  • Does it check in Lean?
  • Does it only use the three standard axioms (according to #print axioms)?
  • Does it check in lean4checker?
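For concreteness, the axioms check is just something like this (toy theorem standing in for the real one; only the check matters):

```lean
-- Toy stand-in for the project's final theorem; only the check below matters here.
theorem demoTheorem : 2 + 2 = 4 := rfl

-- For a real development this should report at most the three standard axioms
-- (propext, Classical.choice, Quot.sound), and in particular no sorryAx and
-- no project-specific axioms.
#print axioms demoTheorem
```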

Jason Rute (Sep 15 2025 at 11:34):

But this avoids more subtle questions:

  • Is the statement correct, and how does one know?
  • Did the AI exploit bugs in Lean (in any form, not necessarily in the kernel)? There are many kinds of exploits that lean4checker can’t catch.
  • Does the statement really mean what it appears to mean? It could be subtly off because of a bug in an AI-generated definition, a bug in a mathlib definition, misleading pretty printing, or code that changes the meaning of notation or coercions.

Jason Rute (Sep 15 2025 at 11:34):

As for checking things automatically, I’ve been very impressed with SafeVerify by @GasStationManager. I think it is really ahead of any official Lean tool in this regard. It lets you put the theorem statement (and any definitions needed to state it) in one file (with a sorry proof) and the full development in another file (including all lemmas, definitions, and proofs). Then it runs a lean4checker-like verifier on the new proof terms, checks the axioms, and also checks that the code didn’t change the meaning of the statement. I don’t think, however, that it works on multi-file developments, so that is an issue.
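To make that two-file setup concrete, here is a minimal sketch (file names, the auxiliary definition, and the trivial statement are all made up, not taken from SafeVerify’s docs):

```lean
-- Statement.lean: the reference file, containing only the statement and any
-- definitions needed to state it, with the proof left as `sorry`.
def IsNice (n : Nat) : Prop := n % 2 = 0   -- hypothetical auxiliary definition

theorem main_result : IsNice 4 := sorry
```

```lean
-- Development.lean: the full (AI-generated) development, with all lemmas and
-- real proofs. A SafeVerify-style check replays these proofs through a
-- kernel-level checker, inspects the axioms, and confirms that `main_result`
-- here has exactly the same statement as in the reference file.
def IsNice (n : Nat) : Prop := n % 2 = 0

theorem four_mod_two : (4 : Nat) % 2 = 0 := rfl

theorem main_result : IsNice 4 := four_mod_two
```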

Jason Rute (Sep 15 2025 at 11:35):

As for checking the statement of the theorem, I’m less sure about what should be standard here. Obviously, one needs to check the theorem statement and the new definitions used in it. (The Strong PNT statement only uses one non-mathlib definition, and that depends only on mathlib definitions.) But can one blindly trust definitions in mathlib? One approach is to go far back in the definition hierarchy until one gets to common definitions that are well used (with lots of theorems about them). (It would be nice to have a tool that automatically lists the hierarchy of definitions for a statement.)
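A very rough sketch of what such a tool could look like (this command does not exist in Lean or mathlib; it just lists the constants appearing in a declaration’s statement, one level deep):

```lean
import Lean

open Lean Elab Command

/-- Hypothetical command: print the constants that a declaration's statement
    refers to, as a starting point for walking down the definition hierarchy.
    A real tool would recurse and also look at the definitions' bodies. -/
elab "#statement_consts " id:ident : command => do
  let name ← resolveGlobalConstNoOverload id
  let some ci := (← getEnv).find? name
    | throwError "unknown constant '{name}'"
  logInfo m!"{ci.type.getUsedConstants.toList}"
```

Running something like this on the final theorem would at least show which definitions one actually has to audit.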

Jason Rute (Sep 15 2025 at 11:35):

Another approach to definitions is to follow the Liquid Tensor Experiment and give unit-test-like theorems that uniquely characterize each definition. One can even have an AI fill in the proofs of these theorems. One could conceivably build a library of such tests, and a list of what standard mathlib definitions have been verified this way.
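As a toy version of that idea (my own illustration with a deliberately simple definition, not something taken from the LTE), the “tests” for `Even` on ℕ might look like:

```lean
import Mathlib

-- Spot checks: cheap sanity tests that the definition means what we think.
example : Even 4 := ⟨2, rfl⟩                 -- unfolds the definition directly
example : ¬ Even 5 := by simp [Nat.even_iff]

-- A characterizing statement: ties `Even` on ℕ to an elementary description,
-- so a reader only has to trust the elementary right-hand side.
example (n : ℕ) : Even n ↔ n % 2 = 0 := Nat.even_iff
```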

Jason Rute (Sep 15 2025 at 11:35):

Ideally, (for checking correctness at least), one should not have to go through every lemma and intermediate definition used in the development (and certainly not every proof). That is the point of automatic theorem provers. But for that to really work, we have to trust the automated tools.

Jason Rute (Sep 15 2025 at 11:35):

Of course, the best test is human understanding. In this particular case, the strong PNT is a well-known fact that is 100 years old, so even if this proof were wrong, we wouldn't doubt the original theorem. Also, the original Lean PNT project plans to clean up this strong PNT proof and incorporate it into their project, which will provide a lot of human scrutiny. But as this sort of thing becomes more common, human review will be a bottleneck.

Tyler Josephson ⚛️ (Sep 15 2025 at 14:33):

I’ve wondered a lot about this, since our group is gearing up to do a large-scale autoformalization in the science domain. Here, potential semantic issues are way more prominent. Formalizing science will inevitably involve many custom definitions outside of Mathlib, as well as involve probing a domain where the literature is much less precise than mathematics.

Kelly Davis (Sep 15 2025 at 15:06):

Tyler Josephson ⚛️ said:

I’ve wondered a lot about this, since our group is gearing up to do a large-scale autoformalization in the science domain. Here, potential semantic issues are way more prominent. Formalizing science will inevitably involve many custom definitions outside of Mathlib, as well as involve probing a domain where the literature is much less precise than mathematics.

For this I'd suggest starting with much simpler, better-defined foundations (e.g. axiomatized Everettian QM, Blanchet/Turaev TQFT axioms, Wightman QFT axioms, Haag–Kastler QFT axioms, ...), then working "upward" from one or more of these.

GasStationManager (Sep 15 2025 at 17:23):

Thanks @Jason Rute for starting the discussion. It looks like multi-file autoformalizations will become more common as larger-scale autoformalization projects are attempted. Currently SafeVerify is limited to single files; I think it is very doable to extend it to multiple files, but care needs to be taken with the design and the technical implementation.

I'm not too worried about this particular proof; Morph Labs' previous project (abc almost always true) passed SafeVerify, and this one is most likely correct too. But it is worthwhile to have a SafeVerify that can check multiple files. If anyone wants to collaborate on a PR (or help review PRs), please let me know!

In particular, here's a question about how lean4checker deals with multiple files, to which perhaps @Kim Morrison and @Mario Carneiro know the answer: right now lean4checker can process multiple modules in parallel, or (with the --fresh option) replay multiple modules into a single environment. My guess is that the latter is most likely needed if the goal is to check the validity of a single theorem that depends on the other files? But I don't have a deep enough understanding of the details of Environment.replay to be sure.

Kim Morrison (Sep 17 2025 at 00:59):

Sorry, I didn't understand what the question is here.

Jason Rute (Sep 17 2025 at 07:40):

@Kim Morrison I don’t know the specifics, but the larger context is that there is a Lean program, SafeVerify (https://github.com/GasStationManager/SafeVerify), which can be used to check Lean code. It uses the environment replay from lean4checker, I think. The goal of the project is to check that the kernel proof of a theorem is correct, that the axioms are just the usual three, and that the term of the theorem statement is the same as the reference version (which just has a sorry proof and no lemmas, etc.). It’s the best thing we currently have for checking AI proofs. One thing @Mario Carneiro pointed out is that while he thought SafeVerify’s code looked good and secure, it will just trust any imports. This means that, in multi-file projects, it won’t replay the proofs of the imports (even if they are new imports not found in the reference file). @GasStationManager’s technical question is about how to extend his checker to multi-file projects, where one also needs to replay some of the imports.
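To illustrate just the last of those checks (this is not SafeVerify’s actual code, only a sketch of the idea):

```lean
import Lean

open Lean

/-- Toy sketch: given the environment of the reference file and the environment
    of the full development, check that the named theorem exists in both with
    identical statements. A real checker additionally replays the development's
    proofs through the kernel and inspects the axioms they use. -/
def statementsMatch (refEnv devEnv : Environment) (thm : Name) : Bool :=
  match refEnv.find? thm, devEnv.find? thm with
  | some refInfo, some devInfo => refInfo.type == devInfo.type
  | _, _ => false
```

Of course, comparing the statement term only helps if the definitions it refers to are also the ones from the reference file, which is exactly where the question about imports and replay comes in.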

Jason Rute (Sep 17 2025 at 07:47):

Maybe it is just enough to do both lean4checker and SafeVerify. The former will check all the term proofs, not just those in the final file. Then SafeVerify will check the other things that lean4checker is missing (like axioms, and that the final theorem statement matches the reference version).

Jason Rute (Sep 17 2025 at 07:48):

Is there a correctness gap here I’m missing? (This will require any new definitions needed for the final statement to go in the reference file.)

Paul Lezeau (Sep 17 2025 at 12:29):

GasStationManager said:

If anyone wants to collaborate on a PR, (or help reviewing PRs), please let me know!

@GasStationManager I'd be interested in helping out with this!

Mario Carneiro (Sep 18 2025 at 00:24):

GasStationManager said:

In particular, here's a question about how lean4checker deals with multiple files, to which perhaps Kim Morrison and Mario Carneiro know the answer: right now lean4checker can process multiple modules in parallel, or (with the --fresh option) replay multiple modules into a single environment. My guess is that the latter is most likely needed if the goal is to check the validity of a single theorem that depends on the other files? But I don't have a deep enough understanding of the details of Environment.replay to be sure.

The specification of the single-module version of lean4lean / lean4checker is that, assuming the imports are correct, the output is also correct. That means you can run it in parallel in order to prove that all modules are correct, and the proofs compose. If you want to check a multi-file project, you just need to make sure each module is individually validated, down to imports that you have previously checked.

Jason Rute (Sep 18 2025 at 09:42):

OK, it sounds like what we as a community need to do is the following:

  • trace the main theorem and its definitions
  • generate a template project with only those theorems/definitions (with any proofs sorried)
  • manually inspect the template project
  • run lean4checker on the full final project
  • run SafeVerify or similar to compare each template file with the final version.

AYUSH DEBNATH (Sep 24 2025 at 19:28):

Is there any repo for me to contribute to?

Deleted User 968128 (Sep 25 2025 at 02:14):

Jason Rute said:

The recent proof by Gauss of the Strong PNT brings up an important question, one that I briefly discussed with Talia Ringer on X (https://x.com/TaliaRinger/status/1967148801862533359).

Good read: https://hbr.org/2025/09/ai-generated-workslop-is-destroying-productivity

That said, there is an opportunity with Lean versus other domains, because it's about verifiable correctness. The statement-formalization problem and Lean exploits are, of course, still at risk of workslop.

(deleted) (Sep 25 2025 at 03:24):

AYUSH DEBNATH said:

Is there any repo for me to contribute to?

Mathlib. The one library to contribute to.

Martin Dvořák (Oct 05 2025 at 09:11):

AYUSH DEBNATH said:

Is there any repo for me to contribute to?

Depends on what you want to contribute.


Last updated: Dec 20 2025 at 21:32 UTC