Zulip Chat Archive

Stream: lean4

Topic: FRO's new verifier


Jason Rute (Oct 22 2025 at 14:40):

@Henrik Böving mentioned that he (through the Lean FRO) is developing a new Lean verifier. From context it sounds like it is more along the lines of @GasStationManager's SafeVerify. I'd love to learn more. Here are some questions:

  • What is the intended workflow and use case?
  • Is it like SafeVerify in that it is intended for checking a full implementation against a reference implementation? SafeVerify checks that the terms of the theorems are the same, that the proof terms replay (using same tools as Lean4Checker) and that the axioms are correct.
  • Will this be an official Lean verifier supported by the FRO?
  • Will this supersede Lean4Checker?
  • When will it be released?
  • SafeVerify doesn't work with multiple files. Will this cover that case?
  • SafeVerify doesn't allow custom axioms. Will this have support for that?
  • Can this be used to easily check human and AI generated developments?
  • Is it robust to known attacks (short of a bug in the kernel)?

Jason Rute (Oct 22 2025 at 14:42):

Henrik Böving said:

Jason Rute said:

Besides checking the statements carefully, it would be nice practice to also check the code carefully with:

  • print axioms
  • lean4checker
  • SafeVerify: In particular, you can make two versions of the file, one with just the theorem statements and the definitions needed to state the theorems, and then the whole file. SafeVerify will compare them, checking that the statements have the same terms and that the proofs/axioms are correct. If SafeVerify is happy, then you only need to check the smaller file for mathematical correctness of the definitions and statements.

The verifier that I've been developing at the FRO says yes :)

Henrik Böving (Oct 22 2025 at 14:51):

Jason Rute said:

  • What is the intended workflow and use case?
  • Is it like SafeVerify in that it is intended for checking a full implementation against a reference implementation? SafeVerify checks that the terms of the theorems are the same, that the proof terms replay (using same tools as Lean4Checker) and that the axioms are correct.

In a similar vein to SafeVerify, comparing a challenge and a solution file and certifying that the solution is correct

  • Will this be an official Lean verifier supported by the FRO?

It will be at some point yes

  • Will this supersede Lean4Checker?

I think it fits into a bit of a different niche in that lean4checker has no concept of a challenge

  • When will it be released?

I don't have a concrete date for that yet unfortunately

  • SafeVerify doesn't work with multiple files. Will this cover that case?

It doesn't right now but that seems like a pretty trivial change to me

  • SafeVerify doesn't allow custom axioms. Will this have support for that?

Yes

  • Can this be used to easily check human and AI generated developments?

That feels like a bit of an open question: what does this even mean?

  • Is it robust to known attacks (short of a bug in the kernel)?

Yes, it is designed in a very defensive manner that also anticipates (and negates) some attacks which are only a theoretical possibility so far. On top of that it's also built in a way where integrating third-party kernels is quite easy if people want to check with, say, the Lean kernel and nanoda. That said, it hasn't been stress tested yet by anyone apart from me, so when we release it there is going to be a call for testing during which you should maybe not yet trust it completely :)

Jason Rute (Oct 22 2025 at 15:00):

That feels like a bit of an open question: what does this even mean?

Fair question. I see the following needs:

  • Recently there have been a lot of long proofs (sometimes with multiple files), created by a combination of AI and humans. I'm more worried about the ones created by reinforcement-learning-trained AI where there may be subtle hacks in the proof. We don't want to have to check thousands of lines of AI-generated Lean code by hand. So it would be good to figure out how to check all the code and get the proof down to a small chunk of Lean for the theorem statement and definitions that we can check by hand. This could itself be an automated process where one inputs the theorem name(s) they are interested in and we get a "challenge" project with only those theorems and dependencies.
  • With human developments like blueprints (or even mathlib), it would be good to check/report what axioms are used.
  • With training and benchmarking AI, one often has a challenge file/project with sorrys that they want filled in. Then they want to check that the AI solution is correct. Ideally this could be fast enough to use during training (so one wants an option where it doesn't recheck all of mathlib say).

Jason Rute (Oct 22 2025 at 15:00):

Here are some exploits I know about: changing the meaning of notation in a theorem using local instances, using debug.skipKernelTC to skip kernel checking, exploiting native_decide, using metaprogramming to change the behavior of keywords or add fake theorems to the environment, crashing Lean in just the right way so that a theorem isn't added to the environment (and hence Lean and Lean4Checker are happy), and exploiting a bug in #print axioms so it says no axioms but axioms are used.
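
For illustration only (not from the thread), a minimal sketch of the local-instance exploit with made-up names: a later local instance shadows the usual Add Nat instance, so the statement below no longer means what it appears to say, yet it is accepted. As noted further down in the thread, the challenge/solution comparison catches this, since the statement's type in such a solution would involve a different instance constant than the trusted challenge.

section
-- A malicious local instance makes `+` on Nat always return 0,
-- so this statement looks false but is provable by rfl.
local instance : Add Nat := ⟨fun _ _ => 0⟩

theorem looks_false (a b : Nat) : a + b = 0 := rfl
end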

Henrik Böving (Oct 22 2025 at 16:01):

  • Recently there have been a lot of long proofs (sometimes with multiple files), created by a combination of AI and humans. I'm more worried about the ones created by reinforcement-learning-trained AI where there may be subtle hacks in the proof. We don't want to have to check thousands of lines of AI-generated Lean code by hand. So it would be good to figure out how to check all the code and get the proof down to a small chunk of Lean for the theorem statement and definitions that we can check by hand. This could itself be an automated process where one inputs the theorem name(s) they are interested in and we get a "challenge" project with only those theorems and dependencies.

Automatically generating these challenge files from an untrusted input source such as an AI seems like an easy correctness trap to me. The fundamental assumption of the checker is that the challenge file is to be trusted under all circumstances.

  • With human developments like blueprints (or even mathlib), it would be good to check/report what axioms are used.

I guess it could also pretty easily be molded in a way that collects axioms instead of just checking them against an allowset, yeah.

  • With training and benchmarking AI, one often has a challenge file/project with sorrys that they want filled in. Then they want to check that the AI solution is correct. Ideally this could be fast enough to use during training (so one wants an option where it doesn't recheck all of mathlib say).

The current setup is not built for this. However, one thing you could do (which is basically the trick of the module system) is to say: when we encounter a theorem within a trusted module, export it as an axiom and dynamically add that theorem to the allowset instead. That way you would at least be skipping proofs (though not definitions, as that is of course impossible) that come from a trusted source.
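
For illustration only, a minimal sketch of the shape of that trick (made-up names; not the tool's actual mechanism): a theorem whose proof was already checked in a trusted module is replayed as a proof-free axiom, and its name is added to the allowlist, so its proof term never needs to be rechecked.

-- In the trusted module, the proof was already kernel-checked:
-- theorem trusted_lemma (n : Nat) : n + 0 = n := Nat.add_zero n

-- During replay the same statement is instead taken on faith as an
-- axiom, and `trusted_lemma` is added to the axiom allowlist:
axiom trusted_lemma (n : Nat) : n + 0 = n

-- Downstream proofs keep working unchanged:
theorem uses_it : 3 + 0 = 3 := trusted_lemma 3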

In general, the goal of the tool in its current incarnation is to provide as high a certainty as possible that the challenge file is guaranteed to match the solution file statement-wise and that the solution file is type correct. It is completely biased towards this in the way it is engineered and accepts slowdowns that are not necessary unless you are really very paranoid, as if an OpenBSD person wrote a Lean checker really :D

Here are some exploits I know about: changing the meaning of notation in a theorem using local instances, using debug.skipKernelTC to skip kernel checking, exploiting native_decide, using metaprogramming to change the behavior of keywords or add fake theorems to the environment, crashing Lean in just the right way so that a theorem isn't added to the environment (and hence Lean and Lean4Checker are happy), and exploiting a bug in #print axioms so it says no axioms but axioms are used.

Regarding the exploits:

  • notation changes via local instances, skipping kernel checking, and changing keywords are all avoided by construction through this challenge/solution mechanism afaict?
  • adding fake theorems is avoided by replaying the environment into the kernel
  • crashing Lean in just the right way is avoided because we interact with the olean files produced by the solution file, so if something is not in there we also don't believe it
  • native_decide will be as exploitable as ever if it is on your axiom allowlist; if it is not, you should be good

Jannis Limperg (Oct 22 2025 at 16:45):

FYI, the training use case is quite important in practice, specifically for reinforcement learning, where we want to reward an AI only for correct proofs. This use case is sensitive to latency since the GPUs are idle while we check each batch of AI-generated proofs, and idle GPUs = big waste of money. So it would be very nice if the checker had an option to just trust some provided Mathlib oleans.

Jannis Limperg (Oct 22 2025 at 16:48):

... and actually perhaps also preload Mathlib (à la Pantograph) so that we don't have to load it for every sample. I have a tool that works like this (can maybe open-source at some point), but it would be nicer if the official checker could support this use case directly.

Henrik Böving (Oct 22 2025 at 17:34):

I do understand that people are interested in this use case, but in its current state this is not the type of service it aims to provide. It does a bit of additional work to ensure that really nothing fishy is going on under any circumstances. I do see ways to fit something like a fast mode into its architecture, but this feels wrong to me. We end up in a situation where it is again not fully clear what running the official tool actually means unless you inspect it for yourself. In my mind this tool should be the ultimate judge: if you have the challenge file properly set up and point it at a set of theorems and a set of legal axioms, it tries as hard as in any way possible to guarantee that the result is correct, so that if you can say this tool checks your result, there is no trapdoor; it is just correct.

Oliver Dressler (Oct 22 2025 at 17:40):

I am very excited about this news! I have been working on a configurable verifier. While it will work standalone, the goal is to expose it as a tool for agents. lean4checker is a dependency, to reuse some of its battle-tested replay logic. Looking forward to integrating the new verifier eventually and throwing out some custom code!

Jason Rute (Oct 22 2025 at 18:03):

I agree it is a tricky balance between completeness and speed, and that the official tool should be the final arbiter. If someone claims to prove RH, we should know how to check it.

Jason Rute (Oct 22 2025 at 18:03):

But I also agree with @Jannis Limperg that we need a fast tool as well, even if this isn't that one, or the official one. Ideally it should still borrow as much as possible from the official tooling so it is as trustworthy as speed allows. Otherwise, without a fast tool, we will train our models to exploit weaknesses in Lean's front end, and not notice them. (A simple example would be a model which receives feedback that a proof passed tactics but failed the kernel, and then just disables the kernel to make the proof pass. This might be rare enough that it would go unnoticed that the model has this behavior.)

Jason Rute (Oct 22 2025 at 18:03):

After your tool is released, I think there could be a lot of community support for modifying it into a fast (but less secure) version, maybe using the axiom trick you mentioned above. But the slow version should still be the default, and the one used to check anything important (final benchmark results, full proofs, any contract you are paying money for).

Jason Rute (Oct 22 2025 at 18:04):

But also to be honest, I think we will find this challenge-solution approach will only go so far. I think a large number of times the challenge will have mistakes in it. And many other times we will be looking to judge a proof which never had a challenge to begin with. The best we will be able to hope for is that we can distill a challenge post-hoc from the solution.

Jannis Limperg (Oct 22 2025 at 18:09):

We end up in a situation where it is again not fully clear what running the official tool actually means unless you inspect it for yourself.

The tradeoffs can be documented clearly, with the default being as secure as possible, and then we don't have a problem, no? If you're saying you don't want to make the checker too complex, I can understand that (complexity is the enemy of security), but then the complexity will just live in a non-official repo and some functionality will be duplicated. I can live with that, but I'd argue it's not the optimal outcome.

Jason Rute (Oct 22 2025 at 18:15):

Jason Rute said:

The best we will be able to hope for is that we can distill a challenge post-hoc from the solution.

I guess another approach, if the theorem is easy enough to state, is to restate the theorem statement independently of the solution. Make that the "challenge". And then use the original solution to prove the new challenge, making a new solution which matches the new challenge.

Arthur Paulino (Oct 22 2025 at 18:22):

What is a "challenge" in this context?

Henrik Böving (Oct 22 2025 at 18:26):

Two things are important here, I think: for one, yes, the complexity part, but also the fact that if someone tells you it passes the checker there are still multiple levels to what that could mean, which I would like to avoid. Besides that, the checker logic without all of the additional guard rails is 176 LoC. I guess we could share that if we want to, but it's really not that fancy.

I can give a brief primer of what it actually does. The basic idea here is that you trust the challenge file to be fine, the solution file to be arbitrarily evil, and from there the following things happen:

  1. we parse the config file which contains names of theorems to verify and an allowlist of axioms
  2. we compile and load the challenge file into the verifier process
  3. we compile the solution file in a landrun sandbox to avoid whatever tricks an IO-executing metaprogram in the solution file could perform
  4. we run lean4export on the solution file output in a separate process (also in a landrun sandbox) to extract the olean contents. This way, even if the attacker managed to produce an olean file that is able to exploit something (recall that olean files basically just contain pointers that we follow blindly), the best they will be able to do is exploit the lean4export process, which in turn can only send input to our verifier but nothing else. If they were e.g. running in the same process they might be able to pull a funky trick where they trick us into printing accept when we would not really accept
  5. The verifier parses the lean4export output
  6. (this is where the mentioned 176 LoC start) the verifier checks that no axioms outside the allowlist occur in the parsed solution
  7. the verifier checks that all constants in the types of the target theorems (and constants they use transitively) match between the two parsed environments
  8. we run the kernel on the solution
  9. we accept

Really, the thing that takes the most time here is steps 1-5, which are precisely the high-assurance steps. If you use the axiom trick, 6 and 7 will be fast, but then you basically throw away just about everything that is interesting about this tool :D
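
For illustration only, a hypothetical challenge/solution pair with made-up names (the real config and file layout may differ): the config would list goal as a target theorem along with an allowlist of axioms, step 7 would require the type of goal (and the constants it uses transitively) to be syntactically identical in both files, and step 8 would replay the solution's proof in the kernel.

-- Challenge.lean (trusted): states the target, proof left open.
theorem goal (n : Nat) : n + 0 = n := sorry

-- Solution.lean (untrusted): must state `goal` with a syntactically
-- identical type; only the proof differs, and it is kernel-checked.
theorem goal (n : Nat) : n + 0 = n := Nat.add_zero n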

Jason Rute (Oct 22 2025 at 18:29):

So this doesn't run any kernel on the proof terms (or am I missing that)?

Henrik Böving (Oct 22 2025 at 18:29):

Oh sorry I forgot, we do of course run the kernel

Henrik Böving (Oct 22 2025 at 18:31):

And note that because this already uses lean4export anyway, one thing you can do for cheap in this setup is to just feed the already-exported thing to multiple kernels.

Jason Rute (Oct 22 2025 at 18:36):

When we say this is slow, how slow are we talking?

Jason Rute (Oct 22 2025 at 18:37):

Is it rechecking all of mathlib in a kernel for example?

Henrik Böving (Oct 22 2025 at 18:37):

Totally

Alex Meiburg (Oct 22 2025 at 18:39):

I actually have no sense of how slow that is. I know how slow it is to build Mathlib, but not to check it.

Jason Rute (Oct 22 2025 at 18:40):

Can this be cached? Many solutions would use the same mathlib (and many would even use the same challenge). Since this is happening in a separate process anyway, can we not safely cache our kernel check of mathlib?

Henrik Böving (Oct 22 2025 at 18:41):

With a fetched mathlib cache, checking the Erdős thing from today on our FRO dev machine with an AMD Ryzen 9 7950X3D 16-Core Processor takes 20 seconds.

Henrik Böving (Oct 22 2025 at 18:41):

Jason Rute said:

Can this be cached? Many solutions would use the same mathlib (and many would even use the same challenge). Since this is happening in a separate process anyway, can we not safely cache our kernel check of mathlib?

As I've alluded to multiple times above, you can do this using the axiom trick that the module system uses as well.

Henrik Böving (Oct 22 2025 at 18:56):

(is that actually fast? I feel like it's horribly slow)

Kevin Buzzard (Oct 22 2025 at 19:04):

If you only have to do it once then it's acceptable! (at least to me.) The answer I was fearing was "an hour".

Henrik Böving (Oct 22 2025 at 19:04):

Oh I mean certainly acceptable for a final judge use yeah, but not for AI use lol

Henrik Böving (Oct 22 2025 at 19:07):

I don't really have an intuition for how long AI training cycles take, but I would imagine it has to be on the order of 1 second or below to be good.

Jason Rute (Oct 22 2025 at 19:14):

20 seconds on 16 cores is probably too slow for many RL applications (although one only has to check a proof if it builds, so the amortized time is lower if most of your proofs fail). But for other applications, like checking final results, it isn't as bad as I thought it would be. (And it is much faster than I thought you were going to say, so I might just be anchored.) Also, this "axiom trick", as much as you are discouraging it, sounds easy to set up, not that much of a soundness risk, and natural to do, so I think many would just find a way to do that for stuff that needs to be really fast.

Henrik Böving (Oct 22 2025 at 19:15):

No no, it's 20 seconds on 1 of those 16 cores; this whole process doesn't really parallelize well.

Jason Rute (Oct 22 2025 at 19:18):

That is better.

Henrik Böving (Oct 22 2025 at 19:19):

Is it though?

Henrik Böving (Oct 22 2025 at 19:20):

Whether you need 16 cores to make it this fast or just 1 shouldn't matter, should it? The main point is that (and again, this is just based on my experience with HPC) I would imagine 1 output cycle on your GPU is much faster than 20 seconds, so this would be the bottleneck in the training feedback pipeline.

Snir Broshi (Oct 22 2025 at 22:54):

Henrik Böving said:

The basic idea here is that you trust the challenge file to be fine, the solution file to be arbitrarily evil

This sounds exactly like Mario's vision: Metamath Zero
Is he involved in some way?

Joseph Myers (Oct 22 2025 at 23:37):

Do the theoretical attacks this avoids include ones based on Mario's list of divergences between lean4lean and the Lean kernel? For example, anything involving an adversarial prelude used by the claimed solution (though maybe it's impossible to use an adversarial prelude and still pass the comparison of types to the challenge file).

Is the comparison between the challenge and solution based on definitional equality or syntactic equality? (Mathematically it doesn't make much difference. Defeq may have advantages when changed imports mean instances are found through different paths that are defeq but not syntactically the same. Syntactic equality might be closer to the ideal if an answer is supposed to be given in an exact syntactic form, e.g. a numeric literal, rather than an expression that might be defeq to the desired answer.)

Does the verifier handle the case where the challenge file has multiple sorrys to fill in, one depending on another (that is, a "determine" problem where an answer must be filled in, along with a proof of a type that references the answer, and a human is expected to check that the answer is legitimate rather than just restating the problem)? Or is it expected in such cases that the human checks legitimacy of the claimed answer first, and puts it in a copy of the challenge file, before running the verifier? (This is only an issue for problems where the correct answer could be expressed in many different legitimate forms rather than having one clear canonical form; IMO 2025 P4 is an example of such a problem. If there's a clear canonical form for the answer, say a numeric literal, then the challenge file used with the verifier can just give the answer in that form, and the solver can be told to put the answer in the canonical form in their solution.)

Niels Voss (Oct 23 2025 at 05:53):

If we do choose to implement a fast mode, can I request that we make sure it has a different name than the full version, so that there's absolutely no confusion? My worry is that if the full version is e.g. named the comparator, and we make another tool called the "fast comparator", then people might start saying that something has been checked by the comparator when it has only been checked by the fast comparator. Then if the fast comparator has a flaw in it, it will degrade trust in the full comparator just because they happen to share the same name.

Henrik Böving (Oct 23 2025 at 07:29):

Do the theoretical attacks this avoids include ones based on Mario's list of divergences between lean4lean and the Lean kernel? For example, anything involving an adversarial prelude used by the claimed solution (though maybe it's impossible to use an adversarial prelude and still pass the comparison of types to the challenge file).

If your challenge file uses the built-in prelude, a solution that modifies the prelude would indeed be detected, yes. One situation that is imaginable is one where the challenge file does not use a part of the prelude to state the challenge but the solution does, and can thus in principle modify this part of the prelude without being caught. I think this is quite an unlikely vector of attack given that most of these problems will likely contain the relevant parts of the prelude in their challenge. That being said, due to the architecture already using lean4export, it is in principle quite easy to just feed it to another checker (like nanoda or lean4lean) if that checker also supports the export format, so even this unlikely vector is totally fixable.

Is the comparison between the challenge and solution based on definitional equality or syntactic equality? (Mathematically it doesn't make much difference. Defeq may have advantages when changed imports mean instances are found through different paths that are defeq but not syntactically the same. Syntactic equality might be closer to the ideal if an answer is supposed to be given in an exact syntactic form, e.g. a numeric literal, rather than an expression that might be defeq to the desired answer.)

syntactic
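
For illustration only, a made-up pair that syntactic comparison would reject even though the two statements are definitionally equal (the kernel reduces the Nat arithmetic):

-- Challenge.lean: the target statement uses the form `2 + 2 = 4`.
theorem answer : 2 + 2 = 4 := rfl

-- Solution.lean: `4 = 4` is definitionally equal to the above but not
-- syntactically equal, so this statement would not match the challenge.
theorem answer : 4 = 4 := rfl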

Does the verifier handle the case where the challenge file has multiple sorrys to fill in, one depending on another (that is, a "determine" problem where an answer must be filled in, along with a proof of a type that references the answer, and a human is expected to check that the answer is legitimate rather than just restating the problem)? Or is it expected in such cases that the human checks legitimacy of the claimed answer first, and puts it in a copy of the challenge file, before running the verifier? (This is only an issue for problems where the correct answer could be expressed in many different legitimate forms rather than having one clear canonical form; IMO 2025 P4 is an example of such a problem. If there's a clear canonical form for the answer, say a numeric literal, then the challenge file used with the verifier can just give the answer in that form, and the solver can be told to put the answer in the canonical form in their solution.)

So you are talking about a situation like the following:

def foo : MyType := sorry

theorem bar : P foo := sorry

yes? It would not accept such a problem at the moment because the definition of a constant used in the target theorem bar changed. I see two ways to address this:

  1. Additionally inject a list of things whose body need not be inspected. This is going in a similar direction to the axiom trick above, really.
  2. Rephrase the problem as something like def bar : { x : MyType // P x } and then give that as a challenge? (A concrete sketch follows below.)
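
For illustration only, a concrete made-up instance of option 2: the challenge fixes only the subtype, and the solver supplies the answer together with its proof in one definition.

-- Challenge.lean: only the shape of the answer is fixed.
def bar : { x : Nat // x + 1 = 2 } := sorry

-- Solution.lean: the solver provides the answer and its proof at once.
def bar : { x : Nat // x + 1 = 2 } := ⟨1, rfl⟩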

Niels Voss said:

If we do choose to implement a fast mode, can I request that we make sure it has a different name than the full version, so that there's absolutely no confusion? My worry is that if the full version is e.g. named the comparator, and we make another tool called the "fast comparator", then people might start saying that something has been checked by the comparator when it has only been checked by the fast comparator. Then if the fast comparator has a flaw in it, it will degrade trust in the full comparator just because they happen to share the same name.

This was one of the two points I was trying to make above, yes.

Jason Rute (Oct 23 2025 at 12:43):

You meant to write theorem bar : P foo := sorry, right?

Henrik Böving (Oct 23 2025 at 12:55):

yes


Last updated: Dec 20 2025 at 21:32 UTC