Zulip Chat Archive

Stream: lean4

Topic: A better Lean checker


Jason Rute (May 07 2025 at 01:31):

I’ve written about this in a few places, but just to make it more central, I’ve gathered my thoughts in one place.

Jason Rute (May 07 2025 at 01:31):

Lean is a great piece of software with incredible correctness guarantees, but we also know it can be easily “tricked”. The DeepSeek case was one very subtle example (hopefully now fixed), but tricking Lean can be as simple as this:

local instance instNatAdd : Add Nat := ⟨Nat.mul⟩
theorem foo : 1 + 0 = 0 := by rfl

This will pass lake build, lean4checker, and #print axioms foo, and even #check foo and #print foo will show 1 + 0 = 0. (This isn't a kernel/soundness bug, but at the same time, the officially supported tools don't catch it.) As we get more Lean written by AI trained with reinforcement learning, such an AI could easily find exploits like this and learn to repeatedly apply them. Also, as Lean becomes more of a standard for correctness in industrial settings, we need better assurance guarantees.

Jason Rute (May 07 2025 at 01:31):

Luckily, we already have two such tools: @Rob Lewis's autograder and @GasStationManager 's SafeVerify. But they are not officially supported by Lean and may rely on buggy components.

Jason Rute (May 07 2025 at 01:31):

I’d like to propose that Lean adopt something like the above tools, but which would be backed up by officially supported components slightly extending Lean’s kernel, lean4checker, and Lean4Lean. It should support these three use cases:

Jason Rute (May 07 2025 at 01:31):

  • It should check term proofs. Like lean4checker and Lean4Lean, it should check exported proofs in a trusted kernel.

Jason Rute (May 07 2025 at 01:31):

  • It should track the axioms used. In #lean4 > Checking axioms with leanchecker @Eric Wieser proposed that lean4checker have a flag checking that a proof only uses the three good axioms. I propose something slightly broader. It should be able to tell you for each declaration which axioms are used in the proof. There are legitimate uses for axioms and Lean supports them so people use them (even when maybe they aren’t the best thing). For example, the recent CombiBench benchmark uses axioms in some of its problems. (This tool should not rely on #print axioms or Lean.collectAxioms as those are user-interface code with possible bugs. That is, unless these tools are elevated to the status of the kernel and this tool can trace an axiom from the beginning of the proof.)
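
For concreteness, here is a rough sketch of the kind of traversal I mean (illustrative only: it is written against the ordinary elaborator-level API and is essentially what Lean.collectAxioms already does; the official tool would perform the same walk over the replayed, kernel-checked export):

import Lean
open Lean

/-- Rough sketch, not the proposed tool: recursively collect every axiom
reachable from a declaration's type and value. The state is a pair of
(visited constants, axioms found so far). -/
partial def axiomsUsed (env : Environment) (n : Name) :
    StateM (NameSet × NameSet) Unit := do
  if (← get).1.contains n then
    return
  modify fun (seen, axs) => (seen.insert n, axs)
  match env.find? n with
  | some (.axiomInfo _) =>
    modify fun (seen, axs) => (seen, axs.insert n)
  | some ci =>
    for c in ci.type.getUsedConstants ++ (ci.value?.map (·.getUsedConstants)).getD #[] do
      axiomsUsed env c
  | none => return

/-- The set of axioms a single declaration depends on. -/
def axiomsOf (env : Environment) (n : Name) : NameSet :=
  (Id.run ((axiomsUsed env n).run ({}, {}))).2.2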

Jason Rute (May 07 2025 at 01:32):

  • It should compare a candidate proof to a reference term. It should take the same declaration by name in both a candidate and reference export and make sure the old and new versions have the same type. This way, we can check a proof in an adversarial setting, where one party writes a reference theorem in Lean (either as a sorry stub or with a reference proof), and an untrusted agent can supply a proof. Obviously, this would only be useful if the reference term was correct. That would involve it being written independently from the solution, extensively tested for correctness, and (ideally) manually inspected at the expression term-level for correctness. But if it is correct, then this tool would be able to check the proof for correctness without fear of the above exploits or of bugs in user-interface Lean code. It could be used for evaluating AI agents, grading students, judging competitions, and rewarding bounties for solved proofs. Also, it would be good for automated applications, where one is automatically proving thousands of theorems (like in the equational project or in industrial applications). In such cases, you want to be sure all the found proofs are correct (especially if you have an AI agent involved). Here are three specific use cases, in increasing order of difficulty to support (but if only the first case is supported, that would be very helpful; a rough sketch of that first case follows after the list):
    • Take a reference theorem statement and verify a candidate proof, where the Lean code may contain many new lemmas, new definitions, and even new meta code. We have to check that such new code wouldn’t change the target theorem type. (I worry there would be challenges in practice like type-class resolution giving different-but-definitionally-equivalent types between the reference and candidate. I also worry about variables.)
    • Take both a stubbed definition D and a stubbed theorem T about D, and fill in the definition along with the proof of the theorem about it. This might be a lot more complicated. How does one verify that the theorem T is the “same” if the definition D changes from := sorry to the actual definition body? (Also, variables affect definitions even more than theorems.) But this is also very relevant for lots of benchmarks. For example, math competition problems where the user has to supply an answer as well as a proof, or verified-code examples where one has to write code according to specification(s).
    • For a blueprint-like project with sorry lemmas, check that a theorem is solved relative to existing sorry lemmas in the library, but not using additional sorries.
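
For the first of these cases, the core check could be as small as the following (a rough sketch with illustrative names, written against the ordinary elaborator-level API; the real tool would do the equivalent comparison on the kernel-checked export, and would probably want definitional rather than purely syntactic equality):

import Lean
open Lean

/-- Rough sketch: the candidate environment must contain a theorem with the
given name whose universe parameters and statement are syntactically identical
to the reference's. -/
def sameStatement (refEnv candEnv : Environment) (n : Name) : Bool :=
  match refEnv.find? n, candEnv.find? n with
  | some ref, some (.thmInfo cand) =>
    ref.levelParams == cand.levelParams && ref.type == cand.type
  | _, _ => false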

Jason Rute (May 07 2025 at 01:32):

The largest difference between my proposal and @Rob Lewis's autograder and @GasStationManager 's SafeVerify is that it would be officially supported by Lean. By that I mean the following (although I understand this is ambitious):

Jason Rute (May 07 2025 at 01:32):

  • There should be an officially supported FRO version, like lean4checker, and it should be the canonical tool to check correctness.

Jason Rute (May 07 2025 at 01:32):

  • There should be a spec, like there (sort of) already is for external type checkers, so people can build their own reimplementations (including formally verified versions).

Jason Rute (May 07 2025 at 01:32):

  • It would be in scope for @Mario Carneiro's Lean4Lean to create a verified version following the spec (which I think wouldn’t add much to that project, just the ability to trace axioms and compare theorem statements).

Jason Rute (May 07 2025 at 01:32):

  • It should be secure. It shouldn't have known exploits using tactics, macros, or other user code. (Yes, someone can mess up the reference theorem term or a hacker could modify the lean exporter code or the lean4checker code, but short of those sorts of things, no exploits.) (If the tool is extra cautious and gives a few false negatives for cases it can't support, that is fine, but no false positives.)

Jason Rute (May 07 2025 at 01:32):

  • If this ever fails, i.e. it reports that a theorem is solved relative to a reference type using only a certain set of axioms when it is not (because the proof actually uses an extra axiom or the types don’t actually match), then it should be treated like a kernel bug (even if it isn’t technically in the kernel): something people are shocked about and that the FRO quickly fixes.

Jason Rute (May 07 2025 at 01:32):

  • Extra paranoid users may not trust this code (unless it is formally verified in Lean4Lean and then only up to their trust in the full TCB stack), but they will at least recognize this covers most of the worries they care about. To increase their assurance, they could use several external checkers and/or help @Mario Carneiro complete the Lean4Lean project.

Jason Rute (May 07 2025 at 01:33):

  • Someone should feel reasonably safe to use this for adversarial situations like AI-generated proofs, proof bounties, and checking a gigantic Lean library. (Nothing will ever completely replace human introspection, but this will come close.)

Jason Rute (May 07 2025 at 01:33):

The motivation for my suggesting this is as follows:

Jason Rute (May 07 2025 at 01:33):

  • It is hard to know right now what the right way to check adversarial Lean code is. After the recent DeepSeek bug, many claimed that they should have used lean4checker, but that would not have been sufficient, since it doesn't check that the theorem even exists in the export. The emerging consensus is that we need something like the tool I propose.

Jason Rute (May 07 2025 at 01:33):

  • It was previously believed that the only way for a theorem to be skipped by the kernel or transformed into something else was for someone to write metaprogramming code in front of the theorem, but the DeepSeek bug showed that bad things can happen just with regular everyday tactics inside a proof, nothing too suspicious. Sure, the proof was a bit vacuous looking and used apply?, but what if the next bug is triggered by a powerful tactic like grind or aesop? And what if instead of not adding the theorem to the environment, it transformed the theorem term into something else? There would be little or no way for a human to manually inspect such a proof to see if it was suspicious, vs. the automation just being really good.

Jason Rute (May 07 2025 at 01:33):

  • For tools like #print axioms, it is unclear if they should be considered part of the trusted-computing base of Lean. Some say they are, and others consider them to be less integral, mere convenience tools that shouldn’t be trusted. We need agreement on what the trusted computing base of Lean even is.

Jason Rute (May 07 2025 at 01:33):

  • Automated proofs, using both AI and symbolic methods, are going to get more common. There is much talk of building large libraries of mathematics with AI, and of using AI to formally verify software and hardware. We want to make sure Lean plays well with AI and that we have backup plans to prevent AI from reward-hacking Lean too much. (“Reward hacking” is a technical term here. I don't mean to imply that the AI will have a devious conscious plan, but something more like the DeepSeek bug where a bug can be found just by randomly trying out millions of things in Lean, and then the model learns to repeatedly exploit whatever it finds that works.)

Jason Rute (May 07 2025 at 01:33):

  • People occasionally mention using Lean for a bounty system or as a way to verify highly critical software. This seems like an opportunity for bad human actors to submit bad proofs. It would be good to get ahead of this.

GasStationManager (May 07 2025 at 03:09):

This might be a lot more complicated. How does one verify that the theorem T is the “same” if the definition D changes from := sorry to the actual definition body?

I think this is actually fine; the type of definition D hasn't changed. Of course the meaning of the theorem T will change (and possibly be false) depending on the body of the definition. But that is the intended behavior for these use cases. In fact this is the format for the code-with-proof challenges on the Code with Proof Arena website, which is currently backed by SafeVerify as the judging backend. (I expect the same applies to the autograder, though I have not tested it.)

GasStationManager (May 07 2025 at 03:37):

Broadly I agree that ideally there should be an "officially supported" Lean proof-checking tool. I imagine it would not be hard to get the AI labs to support such an effort by the FRO. A safe proof checker is essential for anything involving potentially AI-submitted proofs (that is why I worked on SafeVerify in the first place). Now with the recent DeepSeek example this becomes even more obvious: without such a safe checking component, no one would be able to reliably train prover models using RL.

GasStationManager (May 07 2025 at 04:34):

On the more technical question of what needs to be checked. (Disclaimer: I am not an expert; more qualified people should jump in here.) I am basing most of my information on the discussions in this older thread. It seems like much of the hard work has been done by lean4checker and its underlying Environment.replay function, which protects against manipulations of the environment. What remains are some relatively easy checks on the "clean" versions of the submitted theorems after the replay:

  • making sure the target and submitted theorems/defs (are present and) have matching types
  • inspecting the axioms to make sure nothing besides the standard ones is added.

This is what I was aiming for in SafeVerify, though it was done pretty clumsily. Nevertheless it covers all potential exploits that I am aware of.

Beyond this, the needs may diverge depending on use case. I was focusing on coding tasks, so I wanted to make sure the submitted functions are actually "executable". I couldn't find a robust solution in Lean and ended up with the hack of banning the keywords noncomputable and implemented_by... Suggestions are welcome on more robust approaches!

Niels Voss (May 07 2025 at 05:29):

The chapter on trust in the Type Checking in Lean 4 book seems relevant: https://ammkrn.github.io/type_checking_in_lean4/trust/trust.html, in particular the part about Pollack Consistency.

Niels Voss (May 07 2025 at 05:42):

It's important to remember that Lean metacode is capable of arbitrary code execution, so running a Lean file could give you a virus. In theory, someone could make a Lean file which gives you a virus that modifies the behavior of Lean to always accept certain proofs. This could even be done under the guise of "fixing" the Lean installation, similar to how some python programs modify the python installation or pip install packages from within the script.

Niels Voss (May 07 2025 at 05:50):

So, under the most extreme circumstances, the proof will need to be checked on a separate "trusted" machine which is distinct from the "work" machine that actually runs the tactics to generate the proof. In any case, it seems nearly impossible to trust everything about Lean, since even the behavior of #print axioms can be modified. There was another thread (I can't remember which) in which someone suggested that going forward, Lean will need to be seen not as a proof checker, but as a proof compiler.

Niels Voss (May 07 2025 at 06:05):

To solve the problem of misleading definitions in external checkers, we could imagine a hypothetical load_proof_from_file tactic which takes a zip file of .oleans and uses them to prove the given statement. Then, if you, say, wanted to verify the finished proof of FLT, you could compile it using your untrusted "work" computer, zip up all the oleans and send them over to the "trusted" computer, and then from the trusted computer have Lean execute:

def flt : ∀ (a b c n : Nat), a ≠ 0 → b ≠ 0 → c ≠ 0 → 3 ≤ n → a^n + b^n ≠ c^n := by
    load_proof_from_file "flt_blob.zip"

Your trusted codebase would now include core Lean and load_proof_from_file, but doesn't need anything else.

Niels Voss (May 07 2025 at 06:16):

Basically, my understanding of your question is that you are looking for an officially supported tool which is like lean4checker but a little bit more powerful in the sense that we should be able to make reasonable assertions about what the environment contains after replaying. It seems to me that the most practical way to achieve this is if lean4checker were upgraded to have some sort of system where you can provide a single command for it to run after replaying the environment, which will let you do something like the flt example above.

Niels Voss (May 07 2025 at 06:32):

GasStationManager said:

Beyond this, the needs may diverge depending on use case. I was focusing on coding tasks, so I wanted to make sure the submitted functions are actually "executable". I couldn't find a robust solution in Lean and ended up with the hack of banning the keywords noncomputable and implemented_by... Suggestions are welcome on more robust approaches!

Unfortunately, my understanding is that verifying the behavior of compiled Lean code is orders of magnitude harder than verifying proofs written in Lean. I'm pretty sure that doing so perfectly is effectively impossible right now. If you want to be sure that the semantics of Lean programs are correct, you have to trust

  • Every single @[implemented_by] and @[extern] attribute in the entire library. This includes a ton of code marked unsafe. For example, MLList depends on an unsafe inductive type, which I think is consistent, but it does mean that we are relying on the computational properties of unsafe code. Also, all external C++ code needs to be pure. There are hundreds of these throughout Lean, and I think it's very unlikely that they are all completely correct. And unlike axioms, these aren't even tracked.
  • The Lean compiler and the C compiler. There was a bug a few months ago where the C compiler was optimizing out an infinite loop, which led to undefined behavior. Lean's compiler is much more complicated than Lean's kernel and is way more likely to have bugs.

Niels Voss (May 07 2025 at 06:43):

As a direct example of how this is problematic, suppose you gave a homework assignment which was to create a computer program which given a natural number n, returned a prime greater than n. (Basically, a constructive version of the proof of the infinitude of primes.) A student who doesn't feel like writing an actual computer program can use a nonconstructive proof by contradiction to prove something like Nonempty {f : Nat -> Nat // forall n, ... }. Then they can use the partial keyword to get a function of type {f : Nat -> Nat // forall n, ... } by setting it to loop forever, and now they have a program of the desired type and properties which doesn't use noncomputable, even though it's a useless program. If this is fine by you, then there's no reason to ban noncomputable in the first place.
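
The trick described above would look roughly like this (an illustrative sketch only, assuming Mathlib for Nat.Prime and Nat.exists_infinite_primes; cheat is a made-up name):

import Mathlib

-- A nonconstructive existence proof is enough to register a `Nonempty`
-- instance for the specification type...
instance : Nonempty {f : Nat → Nat // ∀ n, n < f n ∧ (f n).Prime} := by
  refine ⟨⟨fun n => (Nat.exists_infinite_primes (n + 1)).choose, fun n => ?_⟩⟩
  have h := (Nat.exists_infinite_primes (n + 1)).choose_spec
  exact ⟨Nat.lt_of_succ_le h.1, h.2⟩

-- ...and `partial` only requires the type to be nonempty, so this has the
-- desired type without being marked `noncomputable`, yet `cheat ()` loops
-- forever if it is ever evaluated.
partial def cheat : Unit → {f : Nat → Nat // ∀ n, n < f n ∧ (f n).Prime} :=
  fun u => cheat u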

Jason Rute (May 07 2025 at 10:18):

Niels Voss said:

It's important to remember that Lean metacode is capable of arbitrary code execution, so running a Lean file could give you a virus. In theory, someone could make a Lean file which gives you a virus that modifies the behavior of Lean to always accept certain proofs. This could even be done under the guise of "fixing" the Lean installation, similar to how some python programs modify the python installation or pip install packages from within the script.

This is a separate issue from the correctness one I have in mind, but certainly one that Lean is going to have to deal with in the future (possibly after a painful incident). Just opening a Lean file can lead to arbitrary code being run. I remember where I was when the ILOVEYOU virus came out in 2000!

Jason Rute (May 07 2025 at 10:25):

Niels Voss said:

So, under the most extreme circumstances, the proof will need to be checked on a separate "trusted" machine which is distinct from the "work" machine that actually runs the tactics to generate the proof. In any case, it seems nearly impossible to trust everything about Lean, since even the behavior of #print axioms can be modified. There was another thread (I can't remember which) in which someone suggested that going forward, Lean will need to be seen not as a proof checker, but as a proof compiler.

Yes, I'm focused more on the "proof compiler" aspect of Lean here, where once the term proof is generated, it is checked in something like a new-and-improved lean4checker (ideally on a separate computer from the one that generated the proof, yes!). I'm also suggesting #print axioms wouldn't be needed as the final source of trust, but would be replaced by better axiom-checking functionality in this new-and-improved lean4checker.

Jason Rute (May 07 2025 at 10:35):

@Niels Voss I'm not sure I see the point of the load_proof_from_file functionality. I was thinking more of the main check of correctness happening apart from Lean, in something more like lean4checker, but with the added support of comparing a compiled/exported candidate proof to a compiled/exported theorem statement. Full Lean wouldn't be part of the trusted computing base (except so far as it is used to generate the target theorem statement). This load_proof_from_file could be useful, however, for avoiding dangerous side effects of an untrusted Lean program, and for getting greater assurance that the whole process round-trips.

Jason Rute (May 07 2025 at 10:52):

And to be clear, what I'm proposing is very practical, and not theoretical, basically as practical as lean4checker if not more so. The fact that two people have already built working prototypes shows how much it is needed. (I suspect there are industrial users already using Lean with Alpha-Proof-like AI as a form of ATP for industrial problems. A system like this could be very valuable to them.) Moreover, this design of having both a candidate term proof and reference type spec is the design of @Mario Carneiro's trust-conscious and underrated MM0 project.

Jason Rute (May 07 2025 at 11:10):

Niels Voss said:

Then they can use the partial keyword to get a function of type {f : Nat -> Nat // forall n, ... } by setting it to loop forever, and now they have a program of the desired type and properties which doesn't use noncomputable, even though it's a useless program.

You don't even need partial. Just use docs#Nat.find to get the minimal n satisfying some property (if you have an existence proof).
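
For example (a sketch, assuming Mathlib's Nat.exists_infinite_primes and the decidability instance for Nat.Prime): this is perfectly computable, but it is just a brute-force search extracted from the existence proof rather than a real algorithm.

import Mathlib

-- "computable" prime-above-n, obtained by minimizing over the existence proof
def primeAbove (n : Nat) : {p : Nat // n < p ∧ p.Prime} :=
  ⟨Nat.find (Nat.exists_infinite_primes (n + 1)),
   Nat.lt_of_succ_le (Nat.find_spec (Nat.exists_infinite_primes (n + 1))).1,
   (Nat.find_spec (Nat.exists_infinite_primes (n + 1))).2⟩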

GasStationManager (May 07 2025 at 12:46):

Nat.find at least seems to emit computable code. There is the issue of computational complexity, which is important and needs to be addressed by building on top of the base proof checking. The problem with partial is that it could be used to provide infinite loops that satisfy the type requirement. That's why for my use case I've been also banning partial and unsafe (though only in the submitted file). In my case I'm trusting all imports because the submission is a single file.

Chris Bailey (May 07 2025 at 18:20):

There's a lot here, but one initial thought is that the proposal assumes a degree of commonality among the mentioned use cases that may not exist. The degree of confidence/correctness sought can become time- and resource-intensive to achieve as projects get large (see mathlib); full environment replay is suitable for checking components of a paper, claims of high-profile results, etc. It's not suitable for integration in an AI training loop or as an automatic grader on a leetcode-style website.

My sense is that the AI users are going to have a very high degree of variability in their demands based on how they want to structure, train, and pay for their model, and (this is admittedly more speculative) that those demands will take precedence over adherence to/use of a one size fits all tool any time there's a conflict.

It's not clear to me even after skimming the deepseek thread whether the actual problem is just that their reported results are wrong, or whether the model being able to fool itself is actually having a significant negative effect on the training and making the model worse.

GasStationManager (May 08 2025 at 00:10):

The model is worse in that it occasionally outputs a "proof" that is not a valid proof. It has likely learned to use the same technique in similar situations. Given the correct feedback from a more robust proof checker, it could have learned the correct way to prove.

What this incident made me appreciate is that due to the nature of RL, once a model finds an exploitable bug in the proof checker, it will use it to maximize its reward, which is the number of theorems proved. And if the proof checker has a flaw, with sufficient exploration the model is likely to find it. For better or worse, the proof checker defines the reward function, which in turn determines the quality of the learned policy.

Personally, I am worried that one day a model finds an exploit that is not caught until it got deployed into a production environment. It would be able to claim to prove that its code does A, while the code actually does B. It would be able to execute arbitrary code, while enjoying a "trusted" status. If that sounds far-fetched, to me it is less far-fetched than I thought a week ago.

Niels Voss (May 08 2025 at 00:19):

GasStationManager said:

Personally, I am worried that one day a model finds an exploit that is not caught until it got deployed into a production environment. It would be able to claim to prove that its code does A, while the code actually does B. It would be able to execute arbitrary code, while enjoying a "trusted" status. If that sounds far-fetched, to me it is less far-fetched than I thought a week ago.

Verifying the behavior of programs should be considered a completely separate problem from verifying proofs, as it is so much more difficult. Right now, I would not trust proofs that a Lean program behaves correctly, especially not in an adversarial situation.

Chris Bailey (May 08 2025 at 00:39):

GasStationManager said:

The model is worse in that it occasionally outputs a "proof" that is not a valid proof. It has likely learned to use the same technique in similar situations. Given the correct feedback from a more robust proof checker, it could have learned the correct way to prove.

That part I got, but it wasn't clear whether it did the second part (exploiting the loophole) enough during training to make the resulting model much worse than the initially reported results suggested, or whether the spotted issues with sorry/apply were only occasionally present, and the underlying model was more or less as advertised. It's interesting that it apparently didn't learn to exploit this guaranteed winning move everywhere.

Jason Rute (May 08 2025 at 00:42):

@Chris Bailey, those are good points. As far as levels of correctness vs resources, I totally agree about that tradeoff. Currently for checking that a proof is correct Lean has roughly four levels:

  • Front end: Editor, LSP, or REPL. This is the most buggy since it has parallelism and other real-world concerns. But for the most part it is sufficient.
  • Build system (with cached oleans): This is still susceptible to bugs and metaprogramming hacks, but is very reliable
  • Build system (with cached oleans) exported to lean4checker: (I think I have this right that lean4lean can use cached oleans.) This is a sweet middle ground of speed and reliability. It would only fail if there was an error in the cache or a soundness issue in the kernel.
  • Build system (no caching) exported to lean4checker: This is the gold standard, but slow and expensive. Only run this occasionally.

Jason Rute (May 08 2025 at 00:42):

I imagine the same rough levels (more or less) for the new things I want to add to the kernel (like checking axioms, checking that the final type is as expected, and even just that a proof of a given declaration even exists). Starting from the other end:

  • Build system (no caching) exported to the checker: This would replay everything from the beginning. It is mostly already built in lean4checker, but it would keep track of axioms and if terms are the same (either exactly or definitionally) across exports. Yes, it is slow, but it is the gold standard for “soundness” of the system, and it is mathematically precise enough to go in Lean4Lean.
  • Build system (with caching) exported to the checker: This would be the sweet spot. I think it would be fast enough for a lot of applications, and (as long as the adversarial agent had no ability to change the cache) it would be almost as secure as the former.
  • Build system: Not sure what this would look like or if we need anything special at this level.
  • Front end: This would be a set of tools which mimic the checker as much as possible, but is more susceptible to bugs and hacks. This could be written by either the FRO or by end users (building on existing work like #print axioms, SafeVerify and autograder).

Jason Rute (May 08 2025 at 00:42):

The front end would be the first line of defense. Then as resources allow given the need one would revert to slower but more secure methods. If the checker with cached oleans is fast enough it would be good for a lot of applications, including AI training and Leet code judging. The full checker run from the beginning would be reserved for nightly (or weekly) sanity checks and for checking final results.

Jason Rute (May 08 2025 at 00:43):

If a more secure level contradicts a less secure level, then that would either signal a bug, or a way an agent can exploit Lean in the front end (using metaprogramming). That would have to be dealt with at that level (maybe by disabling access to features like debug.skipKernelTC). But right now we have no secure kernel-like thing to act as a backup and that is what is most needed.

Jason Rute (May 08 2025 at 00:43):

It's not clear to me even after skimming the deepseek thread whether the actual problem is just that their reported results are wrong, or whether the model being able to fool itself is actually having a significant negative effect on the training and making the model worse.

I think it is probably both. But at the same time, we know it could be a lot worse. For example, if an RL agent discovers debug.skipKernelTC it would quickly lead to collapse (and we wouldn’t even know about it, since the developers wouldn’t release such a broken model :grinning:). (Occasionally you see posts by AI developers who are surprised to discover that sorry has synonyms, presumably because it led to the collapse of their training. :grinning:) However, more subtle bugs like this one can get through without the developers noticing, and it isn’t clear what effect or danger they have. It certainly erodes trust. Already people are saying we shouldn’t use AI in Lean because you cannot trust that the results will be correct (even after being checked by Lean). I’m worried I’m to blame for that message. It isn’t the message I wanted to convey.

Jason Rute (May 08 2025 at 00:48):

Also, getting back to the extensions to lean4checker that I have in mind: the more I think about it, the more I believe it wouldn’t be difficult to create. It would be much easier than writing a kernel, and it would run much faster, all things being equal.

Jason Rute (May 11 2025 at 13:53):

I think the test of whether my Lean4Checker addition suggestions are feasible is to write the following in Lean code:

  • A tool that traverses the exact same structures used by Lean4Checker, but which outputs which axioms are used for each declaration. Can it be made fast? (It isn't doing any type checking, just traversing.)
  • A tool that takes two extractions meant for Lean4Checker, one with a sorry theorem and one with that same theorem filled in (and a bunch of stuff in front of that theorem), then compares the signatures to make sure they are the same theorem.
  • A tool that does the previous one but also allows for definitions used in the final theorem statement to be sorry in the first extraction and filled in with the second extraction.

In short, can we do everything that SafeVerify does, but at the kernel level? I have to admit that I'm not so familiar with the code of Lean4Checker or SafeVerify, and if I was maybe all this would be obvious.

Justin Asher (May 11 2025 at 16:33):

SafeVerify already does most of this. However, SafeVerify is written for "leanprover/lean4:v4.14.0-rc2", so I would have to check if it is compatible with the more recent Lean versions.

Ruben Van de Velde (May 11 2025 at 16:37):

You should assume it isn't

Jason Rute (May 11 2025 at 17:15):

@Justin Asher You have a typo in the URL: https://github.com/GasStationManager/SafeVerify

Jason Rute (May 11 2025 at 17:16):

As for whether SafeVerify has the needed functionality, I think it does, but it doesn't have the safety guarantees that other tools have, like Lean4Checker, Lean4Lean, and external checkers. These are all very minimal codebases that work directly with the expression terms and are written with safety at the forefront. SafeVerify is more user-level. So what I'm proposing is that we need to make a version of the following tools with SafeVerify features:

  • Lean4Checker: a simple CLI one can run to detect environment hacking without having to worry about user-interface bugs. The current Lean4Checker doesn't really do this, since there are lots of hacks and exploits it doesn't cover (if I understand correctly), like the recent DeepSeek bug (which wasn't in the kernel), using local typeclass instances, redefining the theorem command, etc.
  • Lean4Lean: same as above, but formally proved to be correct
  • external checkers: have a spec or examples that others can copy to write their own checkers

Jason Rute (May 11 2025 at 17:18):

I'm worried that SafeVerify is susceptible to user-facing bugs in Lean. For example, in its axiom checker which uses the same code as #print axioms. I could be mistaken, but my understanding is that in the past, there have been various bugs where #print axioms doesn't print the right axioms. But they aren't at the level of soundness bugs. They are just front-end bugs. And I've exchanged messages here with a mathlib maintainer who doesn't consider #print axioms to be a part of Lean's soundness guarantees (although I was being very annoying at the time, so it might have been in reaction to that). It seems like it shouldn't be relied on at the most secure kernel level used by something like Lean4checker. One should just directly loop over the proof term to find the axioms. (And if that is what print axioms does, why does it have bugs sometimes?) (It is probably fine if the user-facing SafeVerify uses it if it is faster than looping over proof terms, but there should be a backup.)

Jason Rute (May 11 2025 at 17:19):

I'm less sure how the other parts of SafeVerify work, but it would be good to have it looked over more closely so we know if it is free of bugs. It isn't much code.

Jason Rute (May 11 2025 at 17:34):

The other thing I'm asking for is a standard of what it means to check a proof, especially one generated by an untrusted agent. The current standard seems to be to run Lean4Checker and do #print axioms, which can miss important bugs and environment hacking cases. If the standard is run SafeVerify, that would probably be ok (after it is looked over). We just need to know it will be maintained, and there is agreement in the community that this is the standard.

Justin Asher (May 11 2025 at 18:04):

Do you know why the #print axioms command was not working properly? Do you have a link to a GitHub issue, for instance, or a Zulip chat thread? I want to read up on this first.

Jason Rute (May 11 2025 at 18:13):

The two I know about are #lean4 > #print axioms does not report sorry and #lean4 > soundness bug: native_decide leakage. In both cases, the proofs wouldn't have passed Lean4Checker, but at least on the user end, #print axioms gave a misleading result. If there was some assurance the only way #print axioms could be wrong was if the proof was also wrong and wouldn't pass Lean4Checker, that would be better. I'm not sure if that is generally accepted.

GasStationManager (May 12 2025 at 00:20):

Re: #print axioms, from looking at the code (CollectAxioms), it is recursively going into terms to check for axioms, which seems to be the right way. As for why it sometimes has bugs, I'm not sure; one of the issues linked above seems to be about which version of the environment to search in.

In SafeVerify, CollectAxioms is called on the environment after Environment.replay, which is lean4checker's backend. It's more robust than straight #print axioms since it protects against environment manipulations.

As for the question of whether some or all of this functionality can be done in the kernel, I don't know. I do share the sentiment that there should be an official definition of what an acceptable proof is, and an official way to check it.

Jason Rute (May 12 2025 at 00:27):

@Thomas Zhu You expressed some worry that SafeVerify (or the autograder) uses CollectAxioms. Do you still feel that way after the above message? (Link to your reservation: #Machine Learning for Theorem Proving > DeepSeek-Prover V2 @ 💬)

Mario Carneiro (May 12 2025 at 00:58):

I don't think it's a good idea to be discouraging use of CollectAxioms. As written it is correct AFAIU, although it has weird behavior around some kinds of malformed environments. That is, it reports axioms correctly assuming that the kernel would construct the environment.

Thomas Zhu (May 12 2025 at 00:59):

The issue with collectAxioms is described and fixed in lean#8214. Basically, before this PR but after a change in asynchronous elaboration (i.e. in Lean v4.19.0), it silently ignores any constant that is in the environment but not in the kernel environment, so it could falsely ignore axioms if a previous error introduces this. This does not make SafeVerify invalid: the previous error would not have passed the kernel anyway. However, the concern I have is if in the future another exploit can suppress this error message. (Edit: sorry, I didn't understand how SafeVerify works)

Personally I think that Lean should make sure collectAxioms and hence #print axioms is sound, in that if collectAxioms c = #[a1, ..., an] for a constant c : C and axioms ai : Ai, then A1 → ... → An → C is inhabited without assuming any axioms. I also think that collectAxioms looks sound after lean#8214, but this should also be verified (i.e. something like this: 1. collectAxioms elevated to the kernel, and the use of user-defined axioms to the metatheory; or 2. restricting lean4checker to only accept the 3 standard axioms, and a way to construct c' : A1 → ... → An → C for a theorem c : C that depends on axioms Quot.sound, propext, Classical.choice, a1, ..., an, and c' can then be checked by lean4checker).

Mario Carneiro (May 12 2025 at 00:59):

To avoid environment malformedness issues you need something like Environment.replay which is what lean4checker / lean4lean use

Jason Rute (May 12 2025 at 01:00):

Ok, so if the environment is good (i.e. comes from Environment.replay, which is also what SafeVerify uses), then it is as trustworthy as the Lean kernel?

Mario Carneiro (May 12 2025 at 01:01):

I can't say I have audited the part to do with kernel.environment, this is a recent development, but I don't have a reason to distrust it

Jason Rute (May 12 2025 at 01:01):

Can it be added to Lean4Lean?

Mario Carneiro (May 12 2025 at 01:02):

what's the interface design there?

Mario Carneiro (May 12 2025 at 01:02):

You should come up with a proposal that can fit into lean4checker and lean4lean workflows

Mario Carneiro (May 12 2025 at 01:05):

One thing I should mention: lean isn't great at handling multiple environments which are not in an ancestry relation. So your proposal up thread to "compare" environments between one with a sorry and one without sounds difficult. It would be easier to have the proof be an extension of the theorem statement environment

Jason Rute (May 12 2025 at 01:05):

SafeVerify does this, but maybe it is unsound. That would be good to know.

Mario Carneiro (May 12 2025 at 01:06):

i.e.

def RiemannHypothesis : Prop := ...

-- later

theorem rh_proof : RiemannHypothesis := ...

Mario Carneiro (May 12 2025 at 01:07):

you don't need any fancy additions to lean to be able to express this, the only thing is that you need to identify rh_proof and RiemannHypothesis and say "I assert that one has the type of the other"

Mario Carneiro (May 12 2025 at 01:09):

Because you are totally right that checking that an .olean is well formed on its own is no good since it could just be empty

Jason Rute (May 12 2025 at 01:10):

That does sound like a good first approximation, but it also wouldn't fit into a lot of real-world use cases.

Mario Carneiro (May 12 2025 at 01:11):

it's certainly a minimalistic approach. What are you thinking of that wouldn't work with this?

Mario Carneiro (May 12 2025 at 01:12):

If the type check was up to defeq then you wouldn't need rh_proof to literally refer to RiemannHypothesis, and the two could even be in separate files

Mario Carneiro (May 12 2025 at 01:12):

I don't like having the statement expressed using sorry though, that makes it a bit too easy to solve the proof by referring to the sorry theorem

Jason Rute (May 12 2025 at 01:13):

One doesn't typically train AI models to produce theorems of the form theorem rh_proof : RiemannHypothesis, where the theorem statement is just a single Prop constant. Similarly with grading students. Maybe filling in blueprints is more like this, so if Lean had more of a common and highly used standard for saying "fill this in for me", then I could see this workflow using that.

Jason Rute (May 12 2025 at 01:13):

Also, does this cover the case of filling in a definition and a theorem about that definition?

Mario Carneiro (May 12 2025 at 01:14):

As mentioned, the proof doesn't have to literally use the prop constant

Mario Carneiro (May 12 2025 at 01:14):

but it does have to be re-typable as using it

Mario Carneiro (May 12 2025 at 01:15):

Jason Rute said:

Also, does this cover the case of filling in a definition and a theorem about that definition?

You could do that by bundling things into a structure or sigma type
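
Rough sketch of what that would look like (with a made-up spec):

-- the definition to fill in and the theorem about it, packaged as one
-- subtype/sigma, so the reference controls a single target type
def Task := {f : Nat → Nat // ∀ n, n ≤ f n}

-- a submission is then a single constant of that fixed type
def submission : Task := Subtype.mk (fun n => n) (fun n => Nat.le_refl n)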

Jason Rute (May 12 2025 at 01:15):

Mario Carneiro said:

but it does have to be re-typable as using it

Is this easy? Can I take a proof of:

theorem foo: <stmt of rh hypothesis>

and automatically turn it into:

theorem foo: RiemannHypothesis

easily at the code syntax level?

Jason Rute (May 12 2025 at 01:16):

Mario Carneiro said:

Jason Rute said:

Also, does this cover the case of filling in a definition and a theorem about that definition?

You could do that by bundling things into a structure or sigma type

That gets really awkward really fast.

Mario Carneiro (May 12 2025 at 01:16):

If you have

def RiemannHypothesis := <stmt of rh hypothesis>
theorem foo: <stmt of rh hypothesis>

then it's easy to ask the kernel to check that foo : RiemannHypothesis
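
For instance (sketch), at the source level that check is just one ascription, which the kernel re-verifies when the declaration is added to the environment (a standalone checker would do the corresponding definitional-equality check directly on the exported terms):

theorem foo' : RiemannHypothesis := foo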

Mario Carneiro (May 12 2025 at 01:17):

I agree that it's not great. It doesn't handle universes very well at all

Mario Carneiro (May 12 2025 at 01:17):

you can't just bundle a universe polymorphic definition into a subtype

Jason Rute (May 12 2025 at 01:19):

Again, I'm curious if the current SafeVerify (which I think handles these cases more naturally) or autograder are sound, or if they run into problems in practice with comparing two separate environments (the problems that @Mario Carneiro is mentioning).

Jason Rute (May 12 2025 at 01:22):

@Mario Carneiro do you think you could come up with an example that breaks SafeVerify?

Mario Carneiro (May 12 2025 at 01:22):

It seems like it would be hard to write a spec for SafeVerify

Jason Rute (May 12 2025 at 01:25):

Good question. I assumed it was that the second environment had to have at least the same constants as the first, and the types have to be the same (where “same” means they have the same construction tree in the environment, save for filling in sorries in the first environment). But I haven't thought about it as much. Maybe there is an obvious problem here.

Mario Carneiro (May 12 2025 at 01:26):

I should clarify, the readme is a reasonable description of behavior. I mean a formal spec, sufficient to convince someone that the result is sound and actually implies some theorem is proved

Jason Rute (May 12 2025 at 01:28):

I was trying to outline one. :) But I agree. Without a formal agreement one can't talk about correctness.

Mario Carneiro (May 12 2025 at 01:39):

Jason Rute said:

Mario Carneiro do you think you could come up with an example that breaks SafeVerify?

It looks pretty good, but one thing I spotted is that it trusts the imports of the target file. So if my submission contains an import of another file under my control, then I can put whatever I want in there (both malicious code and also just malformed declarations) and they will not be checked

Jason Rute (May 12 2025 at 03:51):

@GasStationManager, if we want to verify that the imports are the same, would it be as easy as recursing through them as well? Or, I think oleans have a hash or something that could be used to stop recursing (if we trust them). If a file is new, it would have to be recursed into.

Jason Rute (May 12 2025 at 03:56):

Also, @Mario Carneiro is asking really good questions about workflows and the exact logical spec we have in mind. I don’t want to take those questions lightly, since they probably matter a lot for the design of practical tools and user workflows, not just for AI benchmarking but also for user-facing tools and for connecting with the common Lean blueprint workflow.

GasStationManager (May 12 2025 at 17:06):

@Mario Carneiro Yes that is a good point. For my current use case I was trusting the imports because the submission is via the web, which the system then saves as a single temporary file into a fixed Lean installation.

For multi file submissions, where a main file imports the other modules, I guess the right thing would be to replay the "local" imports. Like what lean4checker --fresh does but perhaps have a whitelist of trusted imports (e.g. Mathlib) that do not have to be replayed?

Mario Carneiro (May 12 2025 at 17:08):

it's fine if you want to not check imports if this is part of a distributed checking process (you have to check the imports in a separate invocation and in the given invocation you can assume that check has already happened)

Mario Carneiro (May 12 2025 at 17:08):

but you have to take care not to run untrusted code at initialization time

Mario Carneiro (May 12 2025 at 17:09):

I can put untrusted code in initialize blocks in an upstream file and it will get executed when you try to set up the initial environment for the downstream file


Last updated: Dec 20 2025 at 21:32 UTC