Zulip Chat Archive

Stream: lean4

Topic: new bug mentioned on Twitter


Jason Rute (Dec 30 2025 at 01:45):

I see mention of a “soundness bug” on X, with no explanation, by @Elliot Glazer and @James E Hanson. Is this discussed somewhere? https://x.com/elliotglazer/status/2005798788829700259?s=46

Shreyas Srinivas (Dec 30 2025 at 01:46):

This is not a bug per se.

Shreyas Srinivas (Dec 30 2025 at 01:47):

You can always insert arbitrary declarations into the environment

Shreyas Srinivas (Dec 30 2025 at 01:49):

It just won't pass lean4checker or external checkers where every declaration is replayed. I can see the potential for misinformation though

Elliot Glazer (Dec 30 2025 at 01:56):

@Shreyas Srinivas Perhaps there should be a setting on Lean 4 Web to analyze code the way lean4checker does?

Shreyas Srinivas (Dec 30 2025 at 01:59):

That’s not my forte. But lean4checker doesn’t run in the editor afaik. What you see is a conscious design decision to not nuke the editor by replaying all declarations to check each new declaration. This has been discussed on Zulip before.

Elliot Glazer (Dec 30 2025 at 02:03):

I understand that the editor is optimized for typical use rather than adversarial code, but surely it's feasible to install a "rigorously check this proof" button that exports the code to a checker and validates it?

Jason Rute (Dec 30 2025 at 02:05):

@Elliot Glazer, I know you are very excited by Lean, but people also see you as a leader in this space because of your work on FrontierMath. I think dropping a claim that there is a “soundness bug” on X, with no prior discussion with the Lean community about these issues, is a bit rude to this community.

Shreyas Srinivas (Dec 30 2025 at 02:10):

#general > PSA about trusting Lean proofs @ 💬

Shreyas Srinivas (Dec 30 2025 at 02:10):

External checkers are necessarily going to take forever to run. So I am not sure it makes sense to add such a command or button to the editor. It's basically going to grind to a halt for more than an hour and make the editor unusable.

Shreyas Srinivas (Dec 30 2025 at 02:11):

You can run lean4checker from the terminal

Shreyas Srinivas (Dec 30 2025 at 02:13):

Mathlib and all other projects handle this by running the external checker on the CI

Shreyas Srinivas (Dec 30 2025 at 02:15):

To go slightly further, there is a property of proof assistants called "Pollack consistency", which basically no major proof assistant achieves: https://ammkrn.github.io/type_checking_in_lean4/future_work.html
Trusting Lean or any other theorem prover is slightly non-trivial, as other pages in that document mention. This is par for the course.

Elliot Glazer (Dec 30 2025 at 02:18):

@Jason Rute I didn't say "soundness bug" in my public tweets. When crediting Jannis with the initial discovery, I used "bug" for lack of a better word. Perhaps "web editor exploit" would be better?

I don't think it's unreasonable to disseminate an example that "putting code in the web editor and checking that it compiles and checking that no axioms are added" is insufficient to verify correctness of a Lean proof, considering that's the standard recent high-profile Lean proofs have been judged by. I myself didn't realize this was possible, having thought that the previously discussed exploits were either patched out, or caught with #print axioms like native_decide uses are.

Jason Rute (Dec 30 2025 at 02:20):

I’m sorry. You are right. I mixed up a direct message you sent to me with your public post. I take back what I said and am again sorry.

Shreyas Srinivas (Dec 30 2025 at 02:20):

It's not an "exploit" either. It's a design trade off that can catch someone unaware if their idea of "trust lean" is "lean says it is right so it is right". It's an editor UX limitation. It also works on your local VSCode installation.

Shreyas Srinivas (Dec 30 2025 at 02:21):

You can also add all the tactics that use Lean.ofReduceBool to the gotchas, if we are talking about people inexperienced with ITPs. This one is just slightly worse because it doesn't show up in #print axioms
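
To make the contrast concrete, here is a minimal sketch of the `Lean.ofReduceBool` gotcha (the theorem is illustrative; the exact axiom names printed can vary by Lean version):

-- `native_decide` compiles the proposition to native code and trusts the
-- compiler's verdict; that extra trust is recorded as an axiom dependency.
theorem big_pow : 2 ^ 20 = 1048576 := by native_decide

#print axioms big_pow
-- should report a dependency on `Lean.ofReduceBool`, whereas a plain
-- `decide` proof of the same statement would report no axioms

So `#print axioms` does surface `native_decide` uses; the environment-hacking issue discussed above is worse precisely because nothing comparable is printed.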

Jason Rute (Dec 30 2025 at 02:21):

But I still don’t like the modern Twitter post style of dropping something with no explanation. But I guess this is my gripe with the whole modern internet.

Shreyas Srinivas (Dec 30 2025 at 02:23):

Basically trust is a complicated thing. The good news is, lean4 is still sound as far as this issue goes.

James E Hanson (Dec 30 2025 at 02:24):

Shreyas Srinivas said:

You can also add all the tactics that use Lean.ofReduceBool to the gotchas, if we are talking about people inexperienced with ITPs. This one is just slightly worse because it doesn't show up in #print axioms

Would it be possible to add a warning to addDeclCore when the doCheck argument is used (in the same way that you get a warning whenever you use native_decide)?

Shreyas Srinivas (Dec 30 2025 at 02:25):

You could add that and someone else could bypass addDeclCore altogether or just write a modified version without the warning.

James E Hanson (Dec 30 2025 at 02:26):

Sure, I'm not saying it would be foolproof.

Shreyas Srinivas (Dec 30 2025 at 02:26):

It wouldn’t be fool-anything. You are just adding one small step

Elliot Glazer (Dec 30 2025 at 02:26):

Shreyas Srinivas said:

It's not an "exploit" either. It's a design trade off that can catch someone unaware if their idea of "trust lean" is "lean says it is right so it is right". It's an editor UX limitation. It also works on your local VSCode installation.

Do you have a preferred lingo for describing this type of editor behavior to the public? There's a lot of people working right now to post AI-generated "proofs" of major conjectures, and sooner or later someone is gonna use this exploit (even if it's just Claude deep-diving the Zulip for "insta-prove" methods). I'm gonna be expected to make some sort of adjudication on those events, and it's much healthier to have that conversation with the public using an obviously intentional example BEFORE someone seriously says they've solved the Riemann Hypothesis with 10k lines of code obscuring this exploit.

James E Hanson (Dec 30 2025 at 02:27):

Well the specific term for this is "environment hacking" but I guess that's not what you're looking for, right?

Elliot Glazer (Dec 30 2025 at 02:28):

Eh, I can go with "environmental hack."

Shreyas Srinivas (Dec 30 2025 at 02:29):

“Editor UX limitation: The Lean editor/elaborator doesn’t replay the environment. To be foolproof, a proof must be run through the environment replay tool lean4checker. Trust is further improved by running external checkers”

Elliot Glazer (Dec 30 2025 at 02:30):

Jason Rute said:

But I still don’t like the modern Twitter post style of dropping something with no explanation. But I guess this is my gripe with the whole modern internet.

I'll happily drop an explanation thread asap. What would be the most informative Zulip thread to cite?

Shreyas Srinivas (Dec 30 2025 at 02:30):

I just posted it up there

Shreyas Srinivas (Dec 30 2025 at 02:30):

I don’t think you should post it though

Shreyas Srinivas (Dec 30 2025 at 02:31):

Just write a brief explanation and link to the doc I posted

Shreyas Srinivas (Dec 30 2025 at 02:31):

Shreyas Srinivas said:

https://ammkrn.github.io/type_checking_in_lean4/future_work.html

Shreyas Srinivas (Dec 30 2025 at 02:31):

Also I wrote about this in the equational theories project paper

Shreyas Srinivas (Dec 30 2025 at 02:32):

Nothing new per se. But important info for mathematicians.

Elliot Glazer (Dec 30 2025 at 02:34):

Shreyas Srinivas said:

External checkers are necessarily going to take forever to run. So I am not sure it makes sense to add such a command or button to the editor. It's basically going to grind to a halt for more than an hour and make the editor unusable.

This doesn't convince me the feature would be a bad idea. Its use case is precisely "random onlooker sees some code and wants to know if it proves what it says it does." If that requires them to ignore a tab for an hour, so be it.

Chris Henson (Dec 30 2025 at 02:34):

@Elliot Glazer We have canonical ways (lean4checker, etc.) of checking that metaprogramming shenanigans have not created a situation that is a kernel error on replaying the environment. It's really not been a focus even for AI input, because anything touching Lean.Kernel is obscure enough that I've never seen a model try this, but I agree it could be a future concern. This is part of why I hope that we can standardize on one officially supported tool (maybe Comparator now?) that checks for these sorts of things.

Shreyas Srinivas (Dec 30 2025 at 02:35):

@Chris Henson an AI model just needs to insert that snippet in the middle of 25000 lines of proof.

Shreyas Srinivas (Dec 30 2025 at 02:35):

And if it’s available on the web…

Chris Henson (Dec 30 2025 at 02:37):

I fully agree, just citing my anecdotal experience that they don't seem to do so currently and that if they do in the future we have appropriate tools to point people to.

Shreyas Srinivas (Dec 30 2025 at 02:37):

Elliot Glazer said:

Shreyas Srinivas said:

External checkers are necessarily going to take forever to run. So I am not sure it makes sense to add such a command or button to the editor. It's basically going to grind to a halt for more than an hour and make the editor unusable.

This doesn't convince me the feature would be a bad idea. Its use case is precisely "random onlooker sees some code and wants to know if it proves what it says it does." If that requires them to ignore a tab for an hour, so be it.

It would be a UX nightmare. Much easier to use the CI action (which can be run locally in vscode).

Shreyas Srinivas (Dec 30 2025 at 02:40):

Technically you need to have lean4checker as a project dependency which is not the default thing in the math template. So that would have to happen before making the button useful.

James E Hanson (Dec 30 2025 at 02:45):

Shreyas Srinivas said:

Technically you need to have lean4checker as a project dependency

Is this documented somewhere? I didn't see this on the readme of the lean4checker github repo.

Shreyas Srinivas (Dec 30 2025 at 02:49):

Right sorry my bad. I had to check the etp repo

Shreyas Srinivas (Dec 30 2025 at 02:50):

You can download lean4checker and build it and then run it in your project directory

Jason Rute (Dec 30 2025 at 02:50):

Sorry I am out with friends and not able to respond quickly (or read everything), but the feature you want from the editor @Elliot Glazer is more or less built into the blockchain project by @GasStationManager. It uses SafeVerify so it also checks axioms and that terms match the target theorem.

Shreyas Srinivas (Dec 30 2025 at 02:51):

Same goes for lean4lean which can serve as an external checker. (Boosting lean4lean here tbh)

Shreyas Srinivas (Dec 30 2025 at 02:52):

@Elliot Glazer : you could in-principle try writing a separate lean4checker vscode extension and see how it plays with the code. I find it much easier to write a new extension than make PRs to the lean4 extension to try out ideas.

Shreyas Srinivas (Dec 30 2025 at 02:53):

It shouldn’t be hard. It took me < 10 hours to (re)-learn typescript and write the loogle lean extension. I am guessing it’s simpler with AI tools.

Jason Rute (Dec 30 2025 at 10:27):

@Elliot Glazer @James E Hanson Just to be clear: running lean4checker, while much better, will not guard against all the practical issues you are both worried about. Here are two simple ways to give a very fake proof of Fermat's Last Theorem (as two separate snippets). Both will "pass" lean4checker, since lean4checker does not check that the statement of the theorem is correct.

-- casually insert another hypotheses into the theorem statement
variable (nothing_to_see_here : False)
include nothing_to_see_here in

def fermats_last_theorem (n x y z : Nat)
  (hn : n > 2) (hx : x > 0) (hy : y > 0) (hz : z > 0) :
  x^n + y^n ≠ z^n
:= by
  contradiction

#print axioms fermats_last_theorem
-- 'fermats_last_theorem' does not depend on any axioms

#check fermats_last_theorem
-- fermats_last_theorem (nothing_to_see_here : False) (n x y z : Nat) (hn : n > 2) (hx : x > 0) (hy : y > 0) (hz : z > 0) :
--  x ^ n + y ^ n ≠ z ^ n

-- casually change the meaning of `+`
instance nothingToSeeHere : Add Nat := ⟨fun x y => 0⟩

def fermats_last_theorem (n x y z : Nat)
  (hn : n > 2) (hx : x > 0) (hy : y > 0) (hz : z > 0) :
  x^n + y^n ≠ z^n
:= by
  have h : x^n + y^n = 0 := by rfl
  grind

#print axioms fermats_last_theorem
-- 'fermats_last_theorem' depends on axioms: [propext, Classical.choice, Quot.sound]

#check fermats_last_theorem
-- fermats_last_theorem (n x y z : Nat) (hn : n > 2) (hx : x > 0) (hy : y > 0) (hz : z > 0) : x ^ n + y ^ n ≠ z ^ n
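
For what it's worth, one way a reviewer might spot the second hack is to ask Lean which `Add Nat` instance is actually in scope (a quick sketch; the instance name printed in a clean file may vary by Lean version):

-- ask Lean which `Add Nat` instance synthesis actually resolves to
#synth Add Nat
-- in a clean file this resolves to the standard instance; after the hack
-- above it resolves to `nothingToSeeHere` instead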

Jason Rute (Dec 30 2025 at 10:27):

Neither is a bug in Lean. They are both intended behavior. But the existence of both means that one has to be careful scrutinizing claimed Lean results. The gold standard right now is something like SafeVerify or Comparator which checks that the theorem statement matches a target. Hopefully the target is simple enough to understand and vet.

Jason Rute (Dec 30 2025 at 10:27):

But as others have said, there is a tradeoff of user friendliness against security. The web editor is intended for user friendliness. If someone drops a Lean "proof" of a millennium problem in our lap, this community is probably equipped to quickly vet it even if no single online tool is. (But if someone did want to try to make such an online tool, something like https://theorem-marketplace.com/ would come close. It would be a lot harder to trick. It might just be slow to run.)

Jason Rute (Dec 30 2025 at 11:15):

Also, both https://github.com/GasStationManager/SafeVerify/tree/main/SafeVerifyTest and https://github.com/leanprover/lean4checker/tree/master/Lean4CheckerTests have long lists of tests showing the sort of hacks that are known (and which these tools guard against).

Alfredo Moreira-Rosa (Dec 30 2025 at 12:15):

Could we at least have some linters that detect suspicious code, like False-hypothesis injection, overriding of definitions, or environment injection?
While not foolproof, such code would be highlighted in the UX with a warning or error.

Bartosz Naskręcki (Dec 30 2025 at 13:13):

@Elliot Glazer @Jason Rute @James E Hanson I think the kind of experiments that Elliot is advertising are important for people to realize one has to be careful with Lean 4 code. The recent example from Elliot should be considered in the category where the famous C obfuscation competition lies. I find this kind of bug far less harmful than citing a full-fledged GPT-hallucinated paper as complete evidence. BTW: I am formalizing with Aristotle a part of Bourbaki's book. So far it's going smoothly, but I've learned a ton of Lean by just doing inspections of the auto-formalized code and being suspicious in general. Elliot's example gives me this extra warning flag every time I post Lean code. So overall I think it has a very positive effect for the specialists. I agree that for a general audience it's better to keep a very positive attitude. Please also let me know if I do something wrong in the public space. I want to advertise Lean in the best possible way so that more people get engaged. Cheers!

Shreyas Srinivas (Dec 30 2025 at 13:21):

The point here is that “soundness bug” means something much more serious (although one can happen from time to time). That is why Jason is emphasising correct communication.

An ITP is ultimately checking proofs for theorems that are stated in its logic. Any reasonably powerful logical calculus allows statements that don’t make much mathematical sense (like 2 ∈ 3 in ZFC). It can’t figure out human intent. This is why I don’t take AI systems seriously beyond assisting in the proof of given statements. Being impressively capable on occasion is less important than being systematically correct at understanding math as mathematicians understand it. That is not easy even for humans.
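
Lean has junk statements of its own flavour. A small sketch, using truncated subtraction on Nat:

-- Nat subtraction is truncated at zero, so this "obviously wrong"
-- statement is provable by reduction:
example : (2 : Nat) - 5 = 0 := rfl
-- true in Lean's logic, but not what a reader expecting integer
-- subtraction would assume it means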

James E Hanson (Dec 30 2025 at 16:33):

Shreyas Srinivas said:

The point here is “soundness bug” means something much more serious (although it can happen from time to time).

As an aside, I don't really buy the idea that 2 ∈ 3 doesn't make mathematical sense, to be completely honest. People just say that it doesn't all the time.

James E Hanson (Dec 30 2025 at 16:39):

Shreyas Srinivas said:

The point here is “soundness bug” means something much more serious (although it can happen from time to time).

I understand the rationale for why this is the way that it is and I can accept that this is a reasonable design philosophy for usability, but from the point of view of an outsider it really looks pretty indistinguishable from a soundness bug.

James E Hanson (Dec 30 2025 at 16:42):

Shreyas Srinivas said:

An ITP is ultimately checking proofs for theorems that are stated in its logic.

This also isn't really what's happening here. Environment hacking isn't the same kind of thing as a junk theorem.

No formal presentation of type theory includes a proviso that you're allowed to just inject unproven theorems freely in the middle of a proof.

James E Hanson (Dec 30 2025 at 16:46):

I would say this is more analogous to the fundamental soundness issue with definitions in Metamath (i.e., the fact that definitions must be given as new axioms and there's no built-in system that checks the conservativity of these axioms although it's relatively easy to check with external tools).

James E Hanson (Dec 30 2025 at 16:52):

Obviously this depends on what counts as the 'official' proof-checker, but, again, I'm coming at this from the point of view of an outsider. For an outsider, when you boot up a local copy of Lean in your IDE or look at the web playground, it's pretty reasonable to think that that is the 'official' context in which proofs are verified by Lean.

Shreyas Srinivas (Dec 30 2025 at 17:08):

There will always be an “outsider” problem with a tool of any sophistication.

Shreyas Srinivas (Dec 30 2025 at 17:15):

Those who choose to learn and use it will necessarily have to learn the pitfalls.

Jason Rute (Dec 30 2025 at 18:26):

I sympathize a lot with your concerns @James E Hanson. I think ultimately this is a communication challenge. Communicating the design decisions of Lean, what correctness means, and how to achieve it. Here is how I see it (which are not universally shared opinions):

  • Lean values flexibility and meta-programming, which means there are always going to be ways to hack it. Reining this in doesn't seem to be a strong priority of the Lean community, which is fine, but needs to be communicated.
  • There are levels of correctness. For a lot of needs, the editor is a good measure of correctness, and there are further levels with more guarantees.
  • Similarly there are different levels of "soundness". I agree with you, it is not black and white, but grey. However the term "soundness" has a lot of sensitivity in this community. A "soundness bug" communicates something that Leo has to get up in the middle of the night to fix. In this community the term is reserved for a bug in the kernel which can be used to prove False at the term level. Anything short of that is considered at worst a bug, and at best a design decision. (I got yelled at for this once.)
  • But short of a kernel bug, there are tools to get stronger guarantees in Lean.
  • The main tool which is advertised is lean4checker. It is not that difficult to use. Short of a kernel soundness bug, lean4checker will correctly say whether a term proof is valid or not.
  • Another tool is the set of external checkers which, like lean4checker, check the term proof. (None are really that independent of lean4checker, and all, I think, have the same major design decision, which is to use bigint packages for natural number reduction. In Lean 3 it used to be possible to use Rocq as an external checker, but I don't know if this is true in Lean 4.)
  • But lean4checker (and external checkers) aren't sufficient for all needs. The largest need it is missing is the checking of axioms. It just doesn't. Sure there is #print axioms, but that has weaker correctness guarantees. #print axioms once had a bug (at least in the online editor) and it also can be environment hacked. At one point @Eric Wieser proposed Lean add some kind of axiom checking to lean4checker. (Note, lean4checker does check for native_decide axioms, but that is only because those axioms can't be executed in the kernel.)
  • Another problem with lean4checker is that it doesn't check that your theorem even exists at the term level. You can just delete the theorem from the environment (and in at least one previous bug, a certain error in the proof caused Lean to skip the theorem silently). #print axioms helps here, since #print and #print axioms would fail if the theorem doesn't exist (short of environment hacking of the print system), but the best would be a way to check that the theorem exists at the term level.
  • Finally, one needs to check that the final theorem is correct and not misformalized. This is a problem in all theorem provers, including metamath, but is trickier in Lean since there are so many ways to change the meaning of a theorem statement just by putting code in front of it. (See my above two examples.)
  • The gold standards in checking Lean are SafeVerify and Comparator which check for all of the above issues (term proof correctness, axioms, the theorem exists, and that the theorem statement is not hacked). Comparator (by the Lean FRO) is also careful about code execution (since running code could have malicious effects in the most extreme cases). But this also requires a very different way to write your Lean code where you separate out the main thing you are trying to prove and the proof into two different files. But I think as Lean is used more in conjunction with AI and as a tool for correctness, it makes sense.
  • (Also, as for checking that a theorem statement is correct, there is the obvious method of inspecting the statement by hand, but another useful method used in the Liquid tensor project is to prove auxiliary statements about all the definitions used in the theorem. For example, proving that Real is a complete real closed field, and similar.)
  • So in summary, there are many levels of "proof" in Lean, not just one. For most applications the editor is fine, but ultimately should be checked with more scrutiny with anything important. Some of these levels of scrutiny are more work and not agreed-upon standards.
  • Communication is the most important thing. A good start to that is the document https://leanprover-community.github.io/did_you_prove_it.html, (which we are having a discussion about in #general > Standards for Lean proofs of unsolved problems).
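
To sketch the statement/proof separation mentioned above (file names and workflow here are illustrative, not the actual interface of SafeVerify or Comparator):

-- Statement.lean: the trusted, human-reviewed file. It contains only the
-- claim, with no proofs.
def FermatsLastTheorem : Prop :=
  ∀ n x y z : Nat, n > 2 → x > 0 → y > 0 → z > 0 → x ^ n + y ^ n ≠ z ^ n

-- Proof.lean: imports Statement.lean and provides a term of that exact type:
--   theorem flt_proof : FermatsLastTheorem := ...
-- A checker then verifies that the proved constant's type is literally the
-- `FermatsLastTheorem` from the trusted file, so prelude tricks in the
-- proof file (extra hypotheses, shadowed instances) cannot change what
-- was actually proved.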

James E Hanson (Dec 30 2025 at 19:00):

Jason Rute said:

  • Communication is the most important thing.

Everything you're saying makes sense to me. I just think that it's fundamentally at odds with the rhetoric and marketing that is commonly employed by people promoting Lean. For instance, people often talk about Lean eliminating the need for peer review of papers, because with a Lean proof in hand you can be sure that the theorem is true, or something like that. Or just look at this statement guaranteeing "absolute correctness in mathematical proofs" that is prominently displayed on lean-lang.org

image.png

and contrast it with your statements here

Jason Rute said:

  • There are levels of correctness. For a lot of needs, the editor is a good measure of correctness, and there are further levels with more guarantees.
  • Similarly there are different levels of "soundness". I agree with you, it is not black and white, but grey.

I understand that on a technical level the statement from lean-lang.org is referring to the kernel and your statements are about the editor, but I'm talking about the rhetorical nature of these statements rather than their technical content.

James E Hanson (Dec 30 2025 at 19:02):

On a pedantic level, the statement on lean-lang.org is just false. There have now been at least two kernel-level soundness bugs in Lean, so we know for a fact that the Lean kernel did not provide a guarantee of "absolute correctness in mathematical proofs" before May of this year, at best.

Shreyas Srinivas (Dec 30 2025 at 19:03):

Well... Firstly, Lean is not going to eliminate the need for peer review. Its adoption will definitely change how peer review works, and possibly even change what questions are considered “interesting”

James E Hanson (Dec 30 2025 at 19:05):

Your phrasing suggests that you think it's a given that Lean will have mass adoption among mathematicians.

Shreyas Srinivas (Dec 30 2025 at 19:05):

It already has adoption in the PL world and I am extrapolating from that. Secondly I am not a lawyer.

Shreyas Srinivas (Dec 30 2025 at 19:06):

Secondly, there is always room for human error.

Thirdly, lean is only as reliable as the system you run it in.

Shreyas Srinivas (Dec 30 2025 at 19:06):

If your operating system is malicious and alters how lean runs for example, you can’t do anything about it.

Shreyas Srinivas (Dec 30 2025 at 19:07):

The guarantees of lean are in the same spirit as Signal’s guarantees of security and privacy. If someone hijacks your phone, it doesn’t hold.

James E Hanson (Dec 30 2025 at 19:08):

I feel like these statements don't contradict my point. The claim regarding "absolute correctness" made on the front page of lean-lang.org is false.

Shreyas Srinivas (Dec 30 2025 at 19:09):

Trying to seek absolute correctness in a three line snippet is an exercise in futility.

James E Hanson (Dec 30 2025 at 19:10):

So you're telling me that interpreting the statement made on the front page of lean-lang.org as it is literally written is an exercise in futility?

Shreyas Srinivas (Dec 30 2025 at 19:10):

Yes. And that applies to the front page of anything

James E Hanson (Dec 30 2025 at 19:11):

So I'm being silly for thinking that the statement

Lean's minimal trusted kernel guarantees absolute correctness in mathematical proof, software and hardware verification.

is saying that Lean guarantees absolute correctness in mathematical proofs?

Shreyas Srinivas (Dec 30 2025 at 19:13):

It guarantees correctness of proofs up to

  1. Soundness of the type theory w.r.t. a relevant model
  2. Correctness of the kernel’s implementation
  3. Correct behavior of your system (and a bunch of things listed by Mario in the document I linked above)

Shreyas Srinivas (Dec 30 2025 at 19:15):

And for the record, this is and will remain true of any system claiming any guarantee of correctness ever.

Shreyas Srinivas (Dec 30 2025 at 19:18):

Here is a full list : https://ammkrn.github.io/type_checking_in_lean4/trust/trust.html

Alfredo Moreira-Rosa (Dec 30 2025 at 20:57):

As you pointed out, the phrase on the website says:
"Lean's minimal trusted kernel guarantees absolute correctness in mathematical proof, software and hardware verification."

This means:
"If a mathematical statement is correct AND its context is correct, then the Lean kernel will guarantee that a validated proof is correct."

And maybe that would be a better statement to have on the landing page of Lean. But I don't think it's a marketing lie from the community. More of an oversight.

So Lean is a tool for mathematicians to check THEIR proofs. If they trust someone, that can also extend to trusting OTHERS' proofs (like for Mathlib). But cheaters will always find ways to cheat, and Lean doesn't give strong guarantees in this category.
And of course that's why we have tools like Comparator and SafeVerify to address part of these concerns.

It's also important to note that all the examples shown at the beginning of the thread and in the tweet don't refute the statement:

  • In the tweet example, it's just obfuscation to forge a False statement by telling Lean to bypass type checking. So the context is not correct anymore.
  • In the first example given by Jason, a false statement is injected into the proof context. So again the context is not correct anymore.
  • In the second example, a mathematical definition is overridden. So the statement is not correct.

But in none of these cases did the kernel validate a correct statement given a correct context.

And of course, it would be nice to have better integrated tooling to detect these adversarial attacks. Lean will get better at this, given the understandable criticism you and others are raising.

James E Hanson (Dec 30 2025 at 22:58):

Alfredo Moreira-Rosa said:

But in none of these cases did the kernel validate a correct statement given a correct context.

These examples were not the basis of me saying that the claim on lean-lang.org was incorrect. I was specifically referring to the instances of actual Kernel-level soundness bugs.

James E Hanson (Dec 30 2025 at 22:59):

On a pedantic level, we know that the statement was incorrect before May of this year. Lean's kernel simply did not guarantee absolute correctness in mathematical proofs before that point in time.

James E Hanson (Dec 30 2025 at 22:59):

I feel like this isn't really a debatable point.

Damiano Testa (Dec 30 2025 at 23:05):

Which soundness bugs are you referring to?

James E Hanson (Dec 30 2025 at 23:08):

The two found by Mario. The more clear-cut one is the issue with Level.data, which was back in May. The older one involved native_decide being able to prove False without reporting any axioms.

Damiano Testa (Dec 30 2025 at 23:16):

The second one involving native_decide is not really a Kernel-level bug: native_decide involved trusting much more than the kernel.

James E Hanson (Dec 30 2025 at 23:19):

It's still a kernel soundness issue. The kernel was trusting the compiler without reporting that it was doing so.

James E Hanson (Dec 30 2025 at 23:20):

But regardless, even if that one is not as clear-cut, the first one is unambiguously a kernel soundness bug, and that's enough for the statement on lean-lang.org to have been incorrect.

James E Hanson (Dec 30 2025 at 23:20):

"Absolute correctness" is an extremely strong claim.

Jason Rute (Dec 31 2025 at 03:58):

@James E Hanson I’m not saying this will help at all with your worries about Lean lying to the mathematical community, but I think you would be very interested in @Mario Carneiro's ongoing lean4lean work formally proving the correctness of the Lean kernel: https://github.com/digama0/lean4lean Unlike the Lean advertising, Mario is very careful with his claims and is one of the people most concerned with correctness, not just of Lean but of many other provers as well. If you search for lean4lean on Zulip you will find some progress reports from Mario on the project, as well as his thoughts on how bug-free the Lean kernel currently is.

Jason Rute (Dec 31 2025 at 04:01):

https://leanprover.zulipchat.com/#narrow/channel/236446-Type-theory/topic/Lean4Lean.20progress/with/541108980

Boris Alexeev (Dec 31 2025 at 14:58):

Chris Henson said:

I fully agree, just citing my anecdotal experience that they don't seem to do so currently and that if they do in the future we have appropriate tools to point people to.

I just wanted to mention that I have already seen AI models attempt to "escape the box" when proving theorems in Lean. They haven't been Lean kernel soundness bugs yet, of course, but it seems that a sufficiently smart LLM + RL model will, after enough compute, at some point attempt to use a bug/oversight in whatever sandbox/environment they're in to get their reward for falsely proving something. I've seen ~3 different instances of this.

Jason Rute (Dec 31 2025 at 15:00):

In another thread (since this one is a mess) would you be willing to document the cases @Boris Alexeev? I would be very interested.

Jason Rute (Dec 31 2025 at 15:00):

Maybe in #Machine Learning for Theorem Proving

Boris Alexeev (Dec 31 2025 at 15:06):

Eh, I think they're mostly artifacts of whatever checking environment is used. I think one example can get the idea of these across quickly:

def result_statement : Prop := foo bar
unsafe def my_sorry {P : Prop} : P := unsafeCast ()
unsafe def result_proof : result_statement := my_sorry
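
For contrast, here is a sketch (with the placeholder `foo bar` replaced by a concrete `True` so it is self-contained) of why this trick fails under the normal discipline of stating results as theorems:

```lean
def result_statement : Prop := True

unsafe def my_sorry {P : Prop} : P := unsafeCast ()

-- Rejected: a `theorem` may not depend on an `unsafe` declaration, so the
-- escape hatch only "works" for `unsafe def`s, which are never sent to the
-- kernel in the first place.
theorem result_thm : result_statement := my_sorry
```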

Shreyas Srinivas (Dec 31 2025 at 15:06):

Well, it’s unsafe

Shreyas Srinivas (Dec 31 2025 at 15:08):

See here

Boris Alexeev (Dec 31 2025 at 15:09):

Right, and it doesn't pass normal checks that we do.
But my claim is that when people set up RL environments for these Lean agents, (1) oftentimes they're checking weaker properties and (2) the agents are trying to hack them.
Over time, the environments are hardened against whatever exploits you know about, but I'm just saying that (2) is already happening.

Shreyas Srinivas (Dec 31 2025 at 15:11):

That's a problem with the relevant Python API then. It can always check these things

Jason Rute (Dec 31 2025 at 15:27):

Yeah, I think on top of whatever issues Lean does or does not have, some AI labs are not very good at designing robust RL Lean environments. That is how, for example, the DeepSeek bug got through. If they had done something as simple as #print axioms they would have caught it. Again, I think there isn't good communication about what correctness means in Lean, but I also think some blame falls on the AI labs as well (at least if they release proofs with easy-to-find exploits). I hope in the near future we can have good standards around this: not onerous and brittle standards with banned keywords, but robust, principled standards like what Comparator is trying to do. (Actually, RL environments may have to take shortcuts for speed, but they should understand the limitations of those shortcuts and know how to vet their final results before announcing them.)

Shreyas Srinivas (Dec 31 2025 at 15:36):

I think it's worth asking for something like the external checkers guide, which contains a nice account of trust, to be included in the lean-lang reference manual

Jason Rute (Dec 31 2025 at 16:02):

My opinion (which is buried above) is that accounts of correctness and trust in Lean (including that one) forget about practical issues like axioms, theorems not making it into the environment, and theorem statements changing by other stuff in the code before the theorem. External checkers are only as good as the terms they are checking, and it is easy in Lean to get the wrong terms. No external checker would have caught the DeepSeek bug or the unsafe bug above because those theorems are not exported or they rely on weird axioms. (I know this doc talks about Pollack consistency, but that seems like a throwaway comment. Pollack consistency is the most common issue. And tools like Comparator and SafeVerify are practical solutions which really help increase trust about that issue.)

Chris Bailey (Dec 31 2025 at 22:38):

I will add something to the book expanding on this and some of the potential solutions; there are two related but separate concerns here. The first is "we want a free-standing type checker and pretty printer that can offer a second opinion on whether something type checks, can tell us exactly what was checked, and which is completely independent of the Lean ecosystem". External type checkers do come with a pretty printer and the ability to print and spot-check the "big parts" of a Lean project (which will not type check if they rely on things like unsafe definitions, so there is protection here). Before the AI boom this was the primary concern, and as far as the "absolute correctness" semantic debate is concerned, this is the concern implicated in my opinion.

The second concern is "we want something which is robust against potentially malicious inputs which we may not have even seen without having to manually compare the pretty printer output for every declaration". This is totally different, and is much more reliant on an ability to parse Lean (or do other things that involve the elaborator) to enable robust comparison of declarations, which is why (as far as I'm aware) the existing works that do this rely on Lean itself and are therefore unable to address concern number one.

I don't think there are any magic bullets that will simultaneously address both concerns while also remaining a "push-button" tool. None of the existing tools on either side of the fence address the "theorem statements changing by other stuff in the code" part either unless you're willing to read the pretty-printer output from an external checker.

Jason Rute (Jan 01 2026 at 14:36):

@Chris Bailey I think I was a bit too mean to your document. Your document is very helpful, and when combined with https://leanprover-community.github.io/did_you_prove_it.html gives a fairly complete side of the story. My main complaint is that when various issues arise with checking Lean proofs (hacks and AI exploits), the first answer is usually "they should have used lean4checker". But in half those cases, lean4checker won't catch them.

Jason Rute (Jan 01 2026 at 14:36):

can tell us exactly what was checked

I could be wrong, but looking at the terms themselves seems like a no-go. They are much too large and complicated in Lean. Now, of course, external checkers could print other information like declarations being checked and their axioms (but I don't think any do at the moment, right?). As far as pretty printing, I think that would get complicated really fast.

Jason Rute (Jan 01 2026 at 14:36):

External type checkers do come with a pretty printer

I was unaware of this. Honestly, I thought this would be unrealistic since pretty printing is context dependent. Of course you could print the raw terms, but type class resolution blows up terms, making them unreadable. For example, twice recently I came across the issue of the norm ||x||. In the pretty printed version, it is hard to know which norm this is (sup norm, L2/Euclidean norm, operator norm?), and in the pp.all version the answer is so buried in a super long expression that you also can't read it. (See here.) Of course, if all you want to do is check that terms haven't changed, then it is just term matching (and pretty printing is not important).

Jason Rute (Jan 01 2026 at 14:36):

completely independent of the Lean ecosystem

Since you brought this up, and I agree it is related to the debate about absolute correctness, I'm not really sure current external checkers are that independent of the Lean ecosystem. An important case study would be if the current Lean 4 checkers avoided all the various kernel bugs that Mario and others have found over the years. But regardless, I think they also make the same decision to use bigint packages for reducing certain nat terms, right, which I know worry some people? It would be great to see an external checker based on metamath/mm0 and one based on coq. The former has some challenges with reduction (see here). The latter I thought was once possible (https://github.com/SkySkimmer/rocq/tree/lean-import, https://discourse.rocq-prover.org/t/alpha-announcement-coq-is-a-lean-typechecker/581), but maybe I am mistaken.

Jason Rute (Jan 01 2026 at 14:36):

we want something which is robust against potentially malicious inputs which we may not have even seen without having to manually compare the pretty printer output for every declaration

Kind of. What I think we want is a way to safeguard against exploits. And yes, manually comparing pretty printed output is not enough (as I have an example above where the pretty printed output does not change).

Jason Rute (Jan 01 2026 at 14:36):

is much more reliant on an ability to parse Lean

I don't see why. I think a lot can happen at the term level, the same level that external checkers work with.

Jason Rute (Jan 01 2026 at 14:38):

don't think there are any magic bullets that will simultaneously address both concerns while also remaining a "push-button" tool.

I think SafeVerify and Comparator come very close (and I think Comparator will soon enable use of external typecheckers as well). I'd really love to get your thoughts on these two tools. But maybe I'm misunderstanding your concerns.

The only thing that I think existing push-button tools can't do is check that the original targets are semantically correct. I think the three best ways to do this are (1) checking the code manually (at the code and pretty printer level), (2) doing what the Liquid Tensor Project did, which is to prove other characterizing facts about those objects. This helps ensure one is working with the correct mathematical objects. (But even then, I understand a typo about an edge case or a missing hypothesis could totally change the intended meaning of a statement.) And (3) using the proven theorem to prove consequences of it. That would further help ensure it is the theorem you want.

Thomas Murrills (Jan 02 2026 at 20:54):

There's another, more restricted issue here that I'm not sure anyone has mentioned yet (though this is a long thread, so please forgive me if I missed it): expectations surrounding #print axioms foo. Generally, I believe people (reasonably) expect #print axioms foo to mean "foo can be proven from the logged axioms". But #print axioms does not typecheck all declarations it uses, and therefore the tacit part of the expectation that foo is in fact proven from these axioms is not necessarily met.

So, I think it'd be reasonable to have #print axioms typecheck as well, as I believe "providing some degree of peace of mind that you have a proof, up to the other exploits mentioned here" is its actual responsibility for users. (I claim it's fine if it's not terribly performant, and won't be too bad if we keep track of which declarations have been typechecked and which haven't.)

If we do simply want #print axioms to mechanically list axioms (though I'm not sure why this would be a useful responsibility to be fulfilled by a user-facing command!), I think we ought to at least have something which fulfills the typechecking responsibility, and have it replace #print axioms in the lexicon as the first stage in the "see? I proved it!" pipeline. :)
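
(For reference, a minimal sketch of the #print axioms behavior under discussion, with hypothetical declaration names; the exact output wording may vary by Lean version:)

```lean
axiom magic : 1 = 2

theorem fine : 2 = 2 := rfl
theorem fishy : 1 = 2 := magic

#print axioms fine   -- 'fine' does not depend on any axioms
#print axioms fishy  -- 'fishy' depends on axioms: [magic]
```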

Chris Bailey (Jan 03 2026 at 00:56):

Jason Rute said:

can tell us exactly what was checked

Now, of course, external checkers could print other information like declarations being checked and their axioms (but I don't think any do at the moment, right)? As far as pretty printing, I think that would get complicated really fast.

The rust checker requires users to whitelist axioms by name up front in the execution config (so does Comparator), and you can ask for checked declarations to be printed/read back. There's a pretty printer options setting in the config format for controlling things like implicits, typeclasses, universes, etc. The difference in the external pretty printers is that there is no support for complex notation or macros (only simple prefix, infix, and postfix notation).

Jason Rute said:

completely independent of the Lean ecosystem

Since you brought this up, and I agree it is related to the debate about absolute correctness, I'm not really sure current external checkers are that independent of the Lean ecosystem. An important case study would be if the current Lean 4 checkers avoided all the various kernel bugs that Mario and others have found over the years. But regardless, I think they also make the same decision to use bigint packages for reducing certain nat terms, right, which I know worry some people? It would be great to see an external checker based on metamath/mm0 and one based on coq. The former has some challenges with reduction (see here). The latter I thought was once possible (https://github.com/SkySkimmer/rocq/tree/lean-import, https://discourse.rocq-prover.org/t/alpha-announcement-coq-is-a-lean-typechecker/581), but maybe I am mistaken.

  1. None of the kernel issues Mario reported (that I'm aware of) were replicated in the rust checker. I would argue that by definition it's "completely independent of the Lean ecosystem" in that building and running the checker does not rely on Lean or any software downstream of Lean.
  2. The bignum extension can be turned off (there's a flag in the config file "nat_extension": bool) but in practice it's no longer optional for anything downstream of core. This is the kind of feature that gets added, people start using it, and now you have terms in core that will just run indefinitely or OOM if you try to check them without the extension enabled. Not to say anything about whether this was a good or bad addition, that's just the current state of affairs.
  3. The mm0 thing is a question for Mario, but I presume it's (at least) blocked on developer time; adapting the kernel changes that have taken place in the last few years would take a lot of effort (and he would probably just use lean4lean as the type checker/reduction machine now).

Jason Rute said:

And yes, manually comparing pretty printed output is not enough (as I have an example above where the pretty printed output does not change).

Can you post this example? This should not be possible with the pretty printer included in an external checker.

Jason Rute said:

don't think there are any magic bullets that will simultaneously address both concerns while also remaining a "push-button" tool.

I think SafeVerify and Comparator come very close (and I think Comparator will soon enable use of external typecheckers as well). I'd really love to get your thoughts on these two tools. But maybe I'm misunderstanding your concerns.

The only thing that I think existing push-button tools can't do is check that the original targets are semantically correct. I think the three best ways to do this are (1) checking the code manually (at the code and pretty printer level), (2) doing what the Liquid Tensor Project did, which is to prove other characterizing facts about those objects. This helps ensure one is working with the correct mathematical objects. (But even then, I understand a typo about an edge case or a missing hypothesis could totally change the intended meaning of a statement.) And (3) using the proven theorem to prove consequences of it. That would further help ensure it is the theorem you want.

From what I can tell, Comparator does exactly what you would want to address concern 2, and is what people should be using for most of the tasks I see discussed these days. The rust checker is not currently able to do challenge/solution because I did not write it in a way that supports comparison of declarations which come from different environments. It's possible to add this, but I haven't had much time to think about what's necessary for the comparison to be sound. In my opinion, adequately addressing concern 1 requires using something that is not itself written in or in some way dependent on Lean (the extent to which lean4lean represents an exception to this is above my pay grade).

James E Hanson (Jan 03 2026 at 08:16):

Chris Bailey said:

Can you post this example? This should not be possible with the pretty printer included in an external checker.

I think Jason is referring to this example he already posted.

Eric Wieser (Jan 03 2026 at 09:32):

Right, it's the nothingToSeeHere example

Joachim Breitner (Jan 03 2026 at 10:10):

It should be possible to extend comparator to use the rust checker for “are these declarations correct”, that's orthogonal to the comparison with the challenge file, right?
(Or are you saying that ideally that functionality would also be implemented multiple times and independently?)

Jason Rute (Jan 03 2026 at 11:34):

As for the example, as others just said, it is that nothingToSeeHere example. @Chris Bailey I don't know what you mean by "the pretty printer included in an external checker", but if you just mean something similar to the verbose output in Lean of printing the statement with pp.all turned on, then yes, you could use that to compare theorem statements. (Of course, at that point you could also just compare the actual terms which is what I believe Comparator does. No need for a human to get involved, especially since fully elaborated terms can get insanely long.) But if you hide type class resolution, then you won't be able to compare the two versions I think, right?

Jason Rute (Jan 03 2026 at 11:43):

@Joachim Breitner I don't know if you are asking me or Chris, but yes, Comparator should be extendable to use external checkers for checking if the exported proofs are valid. Comparing the terms of the Challenge and Solution file does not seem to be a difficult or controversial computation in the same way that running the Lean kernel is. Which means (1) I would trust Comparator to do it correctly, (2) it wouldn't be hard to make an external tool which does the same thing, and (3) it wouldn't be hard to make a verified tool which also does it. (Note, this might not be so trivial if we are looking for something more than an exact match of terms, as has come up a few times.)

Jason Rute (Jan 03 2026 at 11:48):

Chris Bailey said:

The rust checker requires users to white list axioms by name up front in the execution config (so does Comparator) and you can ask for checked declarations to be printed/read back.

Ok, that is nicer than I realized. That would help a lot for axiom issues and for cases when the environment drops expressions. (I'll keep this in mind in discussions like this, and possible applications.)

Jason Rute (Jan 03 2026 at 12:01):

Chris Bailey said:

None of the kernel issues Mario reported (that I'm aware of) were replicated in the rust checker.

If this is true, I think this should be made more well known (and systematically checked with tests for each of Lean's previous bugs). I know Lean has "officially" at points emphasized that its approach to soundness includes external checkers. (I honestly don't know if any Lean users actually use external checkers, but I think Lean was slightly implying that if a user really wanted to be sure there were no soundness issues, they should use one.) Now that Lean can no longer say (as they used to) that no soundness bugs were ever found, they might still be able to say that no soundness bug ever found would have escaped an external type checker. If that is the case, that would be very helpful to know.

Jason Rute (Jan 03 2026 at 12:09):

@Chris Bailey said:

The bignum extension can be turned off (there's a flag in the config file "nat_extension": bool) but in practice it's no longer optional for anything downstream of core.

That is also good to know. I also agree we are sort of stuck with the feature now that people are writing proofs using it. (A side comment, which is probably best for its own thread, is that I think there are probably good post hoc ways to get rid of specific bignum reductions in a proof. One can just track all such reductions, prove them with something like norm_num, and replace the term proof with that. It won't be completely trivial, but it could be useful if there is ever a serious worry about such computations.)
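
A minimal sketch of the idea, assuming Mathlib's norm_num is available (the specific numerals are hypothetical):

```lean
import Mathlib.Tactic.NormNum

-- A large-numeral fact that the kernel would otherwise reduce via the
-- bignum extension can instead be certified by a `norm_num` proof term
-- built from the binary-representation arithmetic lemmas:
example : (123456789 : Nat) * 987654321 = 121932631112635269 := by norm_num
```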


Last updated: Feb 28 2026 at 14:05 UTC