Zulip Chat Archive

Stream: general

Topic: TAIL - Template for AI generated Lean proofs


Eric Vergo (Dec 28 2025 at 15:04):

Chris Henson said:

  • it is valuable to have a canonical, relatively user friendly tool like Comparator to evaluate Lean proofs

I have been working on something in this direction. Initiated by this discussion/comment, I put together a template for AI-generated Lean proofs, or 'TAIL', along with a verification tool that checks a repo against the template. It offers two modes: a 'strict' mode, which does not allow the introduction of new definitions, and a more relaxed mode that does.

At a high level, the intent is to automate some of the review process as well as minimize the review burden for anyone checking the proof. There are about a dozen or so things I don't like about it right now, and it's still very much a work in progress. The tool is designed so that it can be used locally while vibe-proving, to ensure that a major refactor does not need to happen before sharing. So far I have personally used it to vibe-prove a formalization of an old result of mine, as well as refactor the proof discovered here. I have found it quite useful in steering the LLM while it is vibe-proving, as it's easy for the LLM to make a catastrophic error that needs to be corrected, and this tool helps expose them.

Feedback/comments/suggestions on all levels are welcome.

Jason Rute (Dec 28 2025 at 20:02):

@Eric Vergo, I have a lot of reservations about your system.

Jason Rute (Dec 28 2025 at 20:02):

What I like:

  • Your template separates out what is being proven from the proof, making it so that the “referee” only needs to check the statement and definitions used in the statement.
  • Unlike SafeVerify and Comparator, this doesn’t make the (often unrealistic) assumption that the statement of the theorem came from an independent trusted party.
  • It should be compatible with Comparator, and you can use Comparator to check the proof (with a bit of metaprogramming to separate it into a challenge and solution project).

Jason Rute (Dec 28 2025 at 20:02):

What I don’t like:

  • Unlike Comparator and SafeVerify, which have very principled designs, I think your checks are pretty ad hoc, leading to both allowing (fake) proofs of False and rejecting perfectly legitimate developments.
  • As far as (fake) proofs of False, I haven’t read your code, but I can think of a few ways that I could trick your system into (fake) proving False given your description of the checks. (I’m not going to say my strategy, because I think even if you ban those keywords, there are ways around them.) Again, SafeVerify and Comparator work on the term proof level, so they are much more robust.
  • As far as rejecting good proofs, the thing I’m most worried about is definitions. You require definitions to go into the same place, whether they are definitions needed for the theorem statement or the proof. This is too restrictive, I think. Many proofs are very definition-heavy. Consider constructing a counterexample or developing a new mathematical object to prove a theorem. Or sometimes definitions are just needed for bookkeeping. There is no need for the “referee” to have to check each and every intermediate definition by hand (or at least no more than the need to check every lemma statement or proof). [Edit: And some definitions also depend on theorems, like definitions using choice.]
  • Less troublesome, some banned keywords, like partial/unsafe, could have legitimate uses, like making new tactics. Similarly with banning opaque, which can be useful at times (and opaque has no soundness—or fake soundness—issue that I’m aware of, right?).
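For concreteness, a definition that depends on a theorem via choice might look like the following (hypothetical) sketch:

```lean
-- A definition whose body depends on a theorem `h`: given a proof
-- that some n satisfies P, extract a witness via choice. The name
-- `witness` is illustrative only.
noncomputable def witness (P : Nat → Prop) (h : ∃ n, P n) : Nat :=
  Classical.choose h
```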

Jason Rute (Dec 28 2025 at 20:02):

So I would be supportive of something like your directory structure, but with more principled checking of the result (reducing the need to ban keywords or do other brittle checks).

Eric Vergo (Dec 29 2025 at 14:42):

Jason Rute said:

Eric Vergo, I have a lot of reservations about your system.

All of the points here are well taken. In the interest of keeping this thread focused I will start a new thread for this tool. (maybe a mod should break it off?)

Ruben Van de Velde (Dec 29 2025 at 15:24):

(I think you can do that yourself, hotkey "m" or under the kebab menu)

Notification Bot (Dec 29 2025 at 16:06):

A message was moved here from #general > Standards for Lean proofs of unsolved problems by Eric Vergo.

Eric Vergo (Dec 31 2025 at 16:54):

@Jason Rute responding in-line to your points, though a lot of the details here may be irrelevant given the proposal I make afterwards. Additionally, this is my first real interaction with the Lean backend, and this effort was partially a forcing function for me to learn about its details.

Jason Rute said:

  • Unlike Comparator and SafeVerify, which have very principled designs, I think your checks are pretty ad hoc, leading to both allowing (fake) proofs of False and rejecting perfectly legitimate developments.

They are somewhat ad hoc, and mostly arose from preventing unwanted behavior I personally witnessed an LLM engage in while vibe-proving. Of course, this should not be what drives the list of things we need to check; it was just how I got things off the ground.

Jason Rute said:

What I don’t like:

  • As far as (fake) proofs of False, I haven’t read your code, but I can think of a few ways that I could trick your system into (fake) proving False given your description of the checks.  (I’m not going to say my strategy, because I think even if you ban those keywords, there are ways around them.)  Again, SafeVerify and Comparator work on the term proof level, so they are much more robust.

I assume this unease is because the tool checks olean files for keywords and does not use the elaboration tooling directly. At one point I was checking things at the term level but found it to be much slower. There were also some challenges with the module system and with things being marked as axioms, but that may have been due to a misunderstanding on my end.
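For what it's worth, a term-level spot check need not be heavy. For example, Lean can report the axioms a specific proof term depends on (`demo` here is a hypothetical stand-in for a project's main theorem):

```lean
-- Hypothetical stand-in for a project's main theorem.
theorem demo : 1 + 1 = 2 := rfl

-- Reports every axiom the elaborated proof term depends on,
-- catching e.g. `sorryAx` or smuggled-in custom axioms.
#print axioms demo
```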

Jason Rute said:

What I don’t like:

  • As far as rejecting good proofs, the thing I’m most worried about is definitions.  You require definitions to go into the same place, whether they are definitions needed for the theorem statement or the proof.  This is too restrictive, I think.  Many proofs are very definition-heavy.  Consider constructing a counterexample or developing a new mathematical object to prove a theorem. Or sometimes definitions are just needed for bookkeeping.  There is no need for the “referee” to have to check each and every intermediate definition by hand (or at least no more than the need to check every lemma statement or proof).  [Edit: And some definitions also depend on theorems, like definitions using choice.]

Indeed. This is poorly (perhaps incorrectly) explained in the documentation and might not even be implemented correctly. As it stands, definitions with propositional type are allowed outside of the definitions directory. See this example. Maybe the right approach is to drop the definitions directory and update the language/rules to quarantine all declarations of non-propositional type instead. This could be done in a way that partitions them into those needed for the statement of the main theorem and those needed for the proof. This may create complicated import trees, as some declarations would be used in both.
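To illustrate the intended split (the names here are hypothetical):

```lean
-- Propositional type: only its statement needs review, so under the
-- rule above it may live outside the quarantined directory.
def IsPythagorean (a b c : Nat) : Prop := a ^ 2 + b ^ 2 = c ^ 2

-- Non-propositional type: quarantined, since a theorem mentioning it
-- can only be trusted after the referee inspects its body.
def magicBound : Nat := 37
```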

Jason Rute said:

  • Less troublesome, some banned keywords, like partial/unsafe, could have legitimate uses, like making new tactics.  Similarly with banning opaque, which can be useful at times (and opaque has no soundness—or fake soundness—issue that I’m aware of, right?).

My understanding is that 'unsafe' causes a declaration to be skipped during kernel checking and can introduce soundness issues, as suggested here. If it does have legitimate uses but may also introduce soundness issues, my opinion is that we should err on the side of caution and not allow it.
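A minimal sketch of the concern, as I understand it: unsafe definitions are exempt from termination checking, so they can "inhabit" any type.

```lean
-- `unsafe` definitions skip kernel checking, so unrestricted
-- recursion lets them inhabit any type, even False:
unsafe def bogus : False := bogus

-- Lean itself stays sound: safe declarations cannot reference
-- `bogus`. But a checker scanning source text rather than proof
-- terms could plausibly be misled by tricks in this vicinity.
```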

As for opaque definitions, I think you are correct that they do not create soundness issues, although it was not clear to me why when reading the documentation.
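If I understand correctly, the reason is that `opaque` requires the declared type to be inhabited, so it can never manufacture a proof of a false proposition. A minimal sketch:

```lean
-- Fine: Lean synthesizes an `Inhabited Nat` instance, so `secret`
-- stands for *some* natural number, with its body hidden.
opaque secret : Nat

-- Rejected: there is no `Inhabited False` instance, so `opaque`
-- cannot be used to conjure a proof of False.
-- opaque oops : False
```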

Eric Vergo (Dec 31 2025 at 16:55):

My proposal:

  • Update the TAIL standard as we see fit (discussion ongoing)
  • Re-architect TailVerify to focus on checking compliance with our standard using the more robust tools
  • Drop the custom-implemented soundness checks and leverage one or more of Lean4Checker/SafeVerify/Comparator/etc. instead

There are negative tradeoffs associated with this, mainly the maintenance cost of making sure the tools integrate with each other properly. What I do like about it is that once we have this standard set, it is unlikely to change much, while the other tools will likely have to update constantly as Lean evolves.


Last updated: Feb 28 2026 at 14:05 UTC