Zulip Chat Archive

Stream: general

Topic: Automated verification of native_decide

GasStationManager (Dec 21 2025 at 13:52):

Now that AIs including Aristotle, Kimina Prover, and Claude Opus 4.5 are known to produce proofs containing native_decide, we will see more of these in the wild.

native_decide can be quite useful, and most of these uses of native_decide in AI proofs are benign; but as we know, it also asks us to trust the entire compiler, and has known exploits that prove false.

How do we verify that a native_decide is valid, not an exploit? My proposed solution: we say that a use of native_decide is verified if there exists another Lean proof of the same goal without using native_decide. And we ask AIs to help find such replacement proofs.

In ReplaceNativeDecide, I define a Claude Code subagent that takes a particular native_decide in a file, and replaces it with an explicit proof. You can prompt claude to invoke the subagent repeatedly to replace all native_decides in a file.

Contributions / suggestions are welcome! In particular, if you know of recipes for replacing native_decide when using frequently-used Mathlib features, that would be good candidates to include into the subagent's prompt! Try it out on your use cases, and let me know if it doesn't work.

Thomas Bloom (Dec 21 2025 at 15:11):

As someone who knows none of the technical details, why does native_decide even exist if sometimes it results in false proofs? Doesn't this undermine the entire point of Lean?

Yaël Dillies (Dec 21 2025 at 15:15):

The point is that it is more of an axiom scheme rather than a single axiom. native_decide means "Trust all the non-Lean implementations of the Lean functions I use".

Kevin Buzzard (Dec 21 2025 at 15:19):

Lean has uses beyond mathematical proving. native_decide is very useful for writing fast code to do computations. You can easily cheat and get it to prove false things but conversely you can just not cheat and then get it to do fast calculations accurately in the same way that python is accurate without being formally verified.

Henrik Böving (Dec 21 2025 at 15:47):

The key usage of native_decide is that you write a decision procedure and can then execute it with native speed without allowing a third party to inject code into the piece of code whose execution you trust. In these situations its still perfectly fine.

Kim Morrison (Dec 21 2025 at 18:24):

Thomas Bloom said:

As someone who knows none of the technical details, why does native_decide even exist if sometimes it results in false proofs? Doesn't this undermine the entire point of Lean?

@Thomas Bloom, you can use #print axioms foo to verify that a given proof doesn't use native_decide. Mathlib, for example, does not allow the use of native_decide.

Unfortunately this wasn't said at the top of the thread --- perhaps it is a good habit for any thread discussing native_decide to include a mention of or link to these basics, to avoid unnecessary confusion.

Thomas Bloom (Dec 21 2025 at 18:48):

Thanks everyone, all very helpful!

Mirek Olšák (Dec 22 2025 at 00:58):

Not sure how controversial this opinion is but I think that native_decide should eventually become trusted. Currently, I am aware of the following reasons why native_decide is not trusted

It relies on trusting much more code than the Lean Kernel, and that code can be a bit buggy, one issue discussed here.
Users can define their own untrusted implementations, which is what people usually refer to when they say that you can prove False with it.
(moreover) it can lead to weird outcomes, like non-standard natural numbers. This is not contradictory in any way, just a little weird.

However, these issues can get in principle resolved.

The compiler could get verified via meta-reasoning. This would be a long-term project but I don't think it is completely unrealistic.
native_decide could track which unsafe functions it relied on, possibly there could be a safe version of native decide which only runs the trusted part of the compiler.
to calm down users, they can also imagine the non-standard number as 10^100, or any other number higher than any reasonable steps of computation...

In the long-term, kernel cannot compete with fast evaluation, so it makes sense to prove things with native_decide. But sure, we are not yet in a stage where verifying Lean's compiler would be on the agenda, so attempts of replacing it in the proofs make a lot of sense too.

Weiyi Wang (Dec 22 2025 at 01:12):

If we'd like to trust native_decide, what would be the plan to prevent things like #general > Cardinality model incompatible with Lean compiler @ 💬 ? The specific instance has been fixed, but I think there needs to be an overall scheme to prevent this if we put more trust on native_decide

Mirek Olšák (Dec 22 2025 at 01:13):

Why do we need the cardinality model?

Mirek Olšák (Dec 22 2025 at 01:15):

Lean already dropped homotopy type theory in the past by making some kernel decisions. I think the cardinality model can be dropped too.

Weiyi Wang (Dec 22 2025 at 01:16):

I don't think asking specifically about cardinality model is my point. My point is, how do we remember to ask this question when trusting the compiler?

Mirek Olšák (Dec 22 2025 at 01:18):

You mean to prevent Lean from crashing / doing weird things in runtime if proofs are wrong? Like Array overflow by sorry

Mirek Olšák (Dec 22 2025 at 01:20):

Then I think you could just say that you only trust the three basic Lean's axioms (+ trusted native_decide) if you want to run Lean evaluation in a trusted way too.

Mirek Olšák (Dec 22 2025 at 01:33):

Actually, there is a little of evaluation inconsistency

axiom mySorry {α : Prop} : α

#eval let a := #[0]; a[0]'(sorry) -- aborting evaluation since the expression depends on the 'sorry'
#eval let a := #[0]; a[0]'(mySorry) -- 0

GasStationManager (Dec 22 2025 at 01:42):

@Mirek Olšák I think I broadly agree with you; in the long term I would much prefer that more and more of the compiler become fully trusted, so that native_decide becomes safer to use, and (eventually) executable code can have provable guarantees that are as safe as other regular mathematical proofs. Perhaps a goal of this current mini-project is to show that AI-assisted proofs can have a role in this endeavor.

I am not advocating to eliminate native_decide. It is perfectly reasonable to keep both the native_decide proof and the replacement proof; the former is more efficient and shorter, while the latter provides added security and peace of mind.

Mirek Olšák (Dec 22 2025 at 01:56):

Weiyi Wang said:

I don't think asking specifically about cardinality model is my point. My point is, how do we remember to ask this question when trusting the compiler?

Oh sorry, I just read the title and didn't realize you ask specifically about Aaron's example which used a specific unsafe implemented_by. That's what I addressed with point 2, there should be a trusted part of the compiler where people ensure there are no buggy implementations, and a native_decide should track if it only relies on those, or for example some user-defined code.

Henrik Böving (Dec 22 2025 at 10:52):

The compiler could get verified via meta-reasoning. This would be a long-term project but I don't think it is completely unrealistic.

As it stands we do not want to do this, having to fix proofs whenever we touch the compiler seems like too much of a burden. A more realistic alternative would be to introduce alternative native computation engines that can be trusted.

Mirek Olšák (Dec 22 2025 at 16:24):

That sounds like a challenge for any provably correct software, not just (hypothetically) Lean compiler. I don't know what is the good way of developing provably correct SW but hope to see more of it in the future.

Mirek Olšák (Dec 22 2025 at 16:27):

But nothing against detached trusted compilers.

Alfredo Moreira-Rosa (Dec 22 2025 at 16:51):

Maybe, to evoid the "having to fix proofs whenever we touch the compiler" would be to have it as native_decide? that expands to the proof terms of the meta-reasoning. This way the proof is fixed in time, and changing the compiler would not impact it ?

Weiyi Wang (Dec 22 2025 at 17:03):

I thought the whole deal of native_decide is that it can't expand to terms?

Weiyi Wang (Dec 22 2025 at 17:05):

oh I think I see what you meant. If there is a proof for the compiler, then in principle it can be transformed into proof terms of proof generated by the said compiler. Though not sure how feasible it is

Weiyi Wang (Dec 22 2025 at 17:07):

I imagine the proof term from this will be huge

Mirek Olšák (Dec 22 2025 at 17:09):

There are speed & specification issues.

So far, I am not aware of a compiler specification that mimicnative_decide being translated into axioms. At least non-standard numbers, and semantics of casting would have to be axiomatically solved. On the other hand, it would be fair to have it...
If the proof term is big, we don't gain the speed (which is why we wanted native_decide in the first place). In some cases, proofs can be shorter (which is how bashing tactics work) but I am not sure if in proofs by computation.

Tomas Skrivan (Dec 22 2025 at 17:23):

I don't really understand the theory behind all this but what about having a version of native_decide that does not allow partial nor implementedy_by(apart from very selected collection of functions that powers numbers, arrays etc.). This should exclude non standard natural numbers, no? But we could still benefit from complied speed.

Mirek Olšák (Dec 22 2025 at 17:42):

By the way, @GasStationManager , sorry for hijacking a bit your thread about prompting LLM to replace native_decide. Do you have examples / database of where you would like to do the replacement? Maybe that could help to figure out further methods beyond those already mentioned in the current prompt

GasStationManager (Dec 22 2025 at 18:30):

@Mirek Olšák , no worries, these are interesting and relevant discussions. One comment that might link back to my prompt: one part of my prompt says to "trace the computation" when other attempts fail. This amounts to (mentally) simulating the compiler to reveal the computation steps, then verifying the computation. Are there other ways of extracting a computation trace (of native_decide) from the compiler? (e.g. perhaps show the generated C code to the LLM, or is that too confusing to read; and the LLM would still need to do the proof in Lean)

So far I have tested this agent on Aristotle's IMO 2025 solutions, and a few of Kimina Prover's published MiniF2F solutions. Those proof attempts all eventually succeeded, though some took longer than others. Are there other interesting datasets of proofs containing native_decide?

Mirek Olšák (Dec 22 2025 at 18:43):

Are you extracting the proof state at the point of native_decide only, or giving the AI the full proof? I think just a local solution could work well.

Mirek Olšák (Dec 22 2025 at 18:47):

As far as I am aware, it is hard to follow native_decide proof closely with logical steps (I mean if standard decide fails). Then it means that the default logic of some function was replaced with another implementation which is not trusted (sometimes external).

Henrik Böving (Dec 22 2025 at 19:19):

The key use case for native_decide is not to have some untrusted implementation or whatever but simply performance. By using native execution instead of a low performance symbolic execution through reduction, as is done by decide, you can get speedups of several orders of magnitude while still just running the exact same code as before.

Henrik Böving (Dec 22 2025 at 19:22):

For these use cases it is also not usually sensible to get a trace of the computation to verify or something like that. The trace of the computation is simply going to be too large. If you e.g. want to replay an UNSAT certificate of hunreds of megabytes ore ven gigabytes into the kernel it's not helpful if you produce a few gigabytes of trace that certify you correctly checked the certificate. The kernel will struggle processing that trace just as much as it would've struggled to consume a regular Lean translation of the certificate (which would in fact be utterly trivial to produce).

GasStationManager (Dec 23 2025 at 00:28):

@Henrik Böving that's a good point. I think the immediate use cases I have in mind are more for double-checking proofs by AIs, which (from what I've seen) tend to use native_decide for relatively small computations of base cases.

Kim Morrison (Dec 23 2025 at 04:07):

I think the real solution here is to keep reminding the owners of such AIs that "off by default" control of whether the AI uses native_decide is desirable.

Kim Morrison (Dec 23 2025 at 08:07):

And hopefully we will have new automation next year that will cover this "relatively small computations of base cases" use case, producing efficient kernel checked proofs, and we can wean the AIs off native_decide.

Last updated: Feb 28 2026 at 14:05 UTC