Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: A more robust RL pipeline?
GasStationManager (May 10 2025 at 18:16):
From our discussions here about the exploit found by DeepSeek Prover V2, one recommendation was that we (the AI researchers) need to deploy more robust proof checking.
I think this is definitely needed; but the question is whether it would be sufficient. From what I can tell, the recent RL-trained prover models (DeepSeek Prover V2, Kimina Prover, AlphaProof) all use the natural setup where the proof checker provides direct reward signals to the model. That is fine if we know the proof checker is perfect. But if the proof checker has a flaw, with sufficient exploration the models will find it. And if that flaw helps the model pass the proof checker, the model will exploit it to maximize its reward, which is the number of theorems "proved". So we will be in a constant arms race.
I think we should accept the fact that our proof checker will occasionally contain bugs, and design our RL pipelines accordingly to be more robust. I'm just thinking aloud here, and want to throw out a couple of ideas for discussion.
- Canaries. We can put theorem statements that are known to be false in the training set. If the model finds a "proof" of one that passes the proof checker, we know the proof checker is faulty. A variant of this is to put a theorem statement and its negation in the training set, and check whether both are "proved" (a minimal sketch follows this list). I think this can and should be done; one potential worry is that if we throw away the model weights that found the exploits and got caught by canaries, we may be left with models that "evolved" to avoid the canary traps. I don't know if that is a realistic scenario; RL experts may want to comment...
- LLMs could detect bugs in the proof checker the way humans do: by informal reasoning. The question is whether we can design the rewards to elicit truthful reporting of the bugs. Ideas from game theory/mechanism design could be relevant here. For example: we might have two LLMs, the prover and the verifier. The prover is trained in an RL environment, same as above, and produces a Lean proof that passes the Lean proof checker of the environment. The verifier LLM then takes this proof and verifies it via informal reasoning (verify step by step), without access to Lean, and returns whether it believes the proof is correct or not. We could train/fine-tune the verifier with proper scoring rules to incentivize it to report its beliefs truthfully (see the sketch after this list).
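To make the canary idea concrete, here is a minimal sketch of what a canary pair might look like (the names and statements are hypothetical, purely for illustration):

```lean
-- Hypothetical canary pair: the second statement is true and the first
-- is false, so if the RL pipeline ever reports *both* as proved, the
-- proof checker is unsound. The `sorry`s stand in for whatever proof
-- attempts the model produces.
theorem canary_pos : ∃ n : Nat, n * n = 2 := sorry
theorem canary_neg : ¬ ∃ n : Nat, n * n = 2 := sorry
```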
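For the verifier's reward, one textbook option (my example, not a worked-out design) is a strictly proper scoring rule such as the Brier score: if the verifier reports probability p that the proof is valid, and a slower rigorous check later yields ground truth y ∈ {0, 1}, reward it with

$$ \text{reward} = 1 - (p - y)^2 $$

Because this rule is strictly proper, reporting its true belief is the verifier's unique way to maximize expected reward.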
Jason Rute (May 10 2025 at 18:22):
I don’t know the details, but as for the canary idea, I think the TPTP/CASC/CADE competitions (I forget the correct name) start with a soundness check of the theorem provers involved. I do think it is more difficult with LLMs, because they may not try to prove a blatantly false statement, but may try a less obviously false one. For example, we haven’t seen the reasoning traces for the DeepSeek bug, but it may have looked, in the reasoning trace, as if the model “thought” it was a legitimate reasoning step.
Jason Rute (May 10 2025 at 18:28):
Also, I question this idea that we will always get bugs. If we do things right, the attack surface will be really small: just a small amount of heavily inspected code. It can be improved with other type checkers, periodic translations to other logics (using Coq or Metamath as a Lean type checker, for example), and formally verifying the kernel. The big problem in the DeepSeek case is that the authors trusted the Lean front end too much, and the Lean backend (the lean4checker kernel) didn’t have the needed API for this sort of checking. If Lean just deletes the theorem, then the kernel saying “no errors found” can’t tell you whether the proof even exists. One needs a few more standard parts, namely a way to check axioms in the kernel and a way to compare the theorems before and after checking.
Jason Rute (May 10 2025 at 18:29):
Of course, perfect trust is hard, so we should put a lot of thought into this, constantly looking for exploits and the weakest links.
Jason Rute (May 10 2025 at 18:32):
But I think if the ML for Lean community comes together and creates good standards then we won’t be dealing with these sorts of issues much anymore unless we really get some sort of ingenious exploit like finding a true bug in the actual kernel or doing some really crazy thing.
Jason Rute (May 10 2025 at 18:32):
I’m all for creating those standards right here and now. It shouldn’t be too hard.
Jason Rute (May 10 2025 at 18:48):
My proposal is simple:
- Make an extension of the lean4checker kernel which also checks axioms and theorem-statement consistency; a sketch of the axiom check follows this list. (It could be its own program. I would like it to be owned by the FRO eventually.) I think even a naive implementation would be pretty fast.
- Improve something like your SafeVerify to be a solid and fast checker for real usage. Reduce the possibility of bugs by having experts look over the code.
- Make a full REPL-like pipeline that incorporates all of the above. Also make it easy to run the more extensive lean4checker-like tests on theorems periodically, to spot bugs in the fast checker. All final results (like claimed benchmarks) should go through the more rigorous checks, but RL loops and user tools can use the fast one.
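For a sense of what the axiom check would automate: Lean can already report a theorem's axiom dependencies per declaration, e.g.

```lean
-- `#print axioms` lists every axiom a compiled proof depends on.
-- A checker could flag anything beyond the standard trio
-- (propext, Classical.choice, Quot.sound), e.g. sorryAx, or an
-- axiom smuggled in by an adversarial proof.
#print axioms Nat.add_comm
-- expected: 'Nat.add_comm' does not depend on any axioms
```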
After building the tools it is just a communication issue. We tell everyone we won’t accept their results if not checked with lean4checker+.
Jason Rute (May 10 2025 at 18:50):
See https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/A.20better.20Lean.20checker for more of my idea, but I don’t think I’m communicating it well to be honest.
Justin Asher (May 11 2025 at 06:14):
I would be happy to contribute to this. I was already considering writing similar software; however, I think it would be good if it were organized through the FRO for maintenance purposes.
Jason Rute (May 11 2025 at 13:41):
@Justin Asher Maybe it would be best to move this conversation back to #lean4 > A better Lean checker since I've co-opted this thread which is about a related-but-different topic.
GasStationManager (May 11 2025 at 17:51):
@Jason Rute I agree that what you are proposing (or something like it) is very necessary. And if we set the trust region as kernel + standard axioms, that might even be enough, for now. But:
- there will be use cases that demand a wider trust region. What if someone really wants to use `native_decide`? What if (like me) someone wants provable guarantees on the behavior of executable code, and therefore needs to be able to trust the compiler in addition to the kernel? (See the illustration after this list.)
- One day, the AIs will be smarter than us. Nevertheless, humanity will still be able to trust the AIs' output, with the help of formal verification. In that scenario, the proof checker becomes arguably the most high-stakes piece of software humanity owns. If you accept this framing, it is not hard to see that the current RL training pipelines for prover models are extremely brittle: they incentivize the models to break the proof checker, but when the models do break it, we are not notified.
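To illustrate the first point, a minimal example of the trade-off (assuming current Lean syntax; both tactics should close this goal):

```lean
-- `native_decide` evaluates the proposition using compiled native code,
-- so accepting this proof means trusting the Lean compiler and code
-- generator, not just the kernel.
example : 2 ^ 20 = 1048576 := by native_decide

-- The kernel-only alternative: slower in general, but the trust
-- region stays small.
example : 2 ^ 20 = 1048576 := by decide
```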
Jason Rute (May 11 2025 at 17:59):
@GasStationManager Those are fair worries. Moreover, most things in the world can't be formally verified, so formal verification can't be the only check we have. I think you are asking good questions.
Notification Bot (May 12 2025 at 00:55):
42 messages were moved from this topic to #general > Trusting native_decide by Mario Carneiro.
GasStationManager (May 24 2025 at 16:37):
I wanted to test out one of the ideas mentioned above: can an LLM read a Lean proof and detect (via informal reasoning) whether it is a valid proof, without access to the Lean proof checker?
I gave the "apply? proof" from DeepSeek Prover V2's miniF2F solutions to two LLMs, OpenAI's o3 and Anthropic's newly released Claude Opus 4, and asked them to verify the proof step by step and determine whether it is valid. Their responses are below. To summarise:
- Both were able to identify the `apply?` line as an issue, saying that it does not close the goal.
- Both commented that the rest of the proof doesn't seem to advance the reasoning in a clear direction.
- o3's response was slightly more noisy, as it complains about type coercions in the arithmetic operations, which doesn't seem to be a critical issue.
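For reference, here is a rough reconstruction of the theorem statement in question (my paraphrase; the theorem name and exact phrasing are guesses, not the original miniF2F statement):

```lean
import Mathlib

-- Hypothetical reconstruction: if n nonnegative reals have product 1,
-- their sum is at least n (a classic AM-GM corollary).
theorem sum_ge_of_prod_eq_one (a : ℕ → NNReal) (n : ℕ)
    (h₀ : Finset.prod (Finset.range n) a = 1) :
    (n : NNReal) ≤ Finset.sum (Finset.range n) a := by
  sorry
```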
Opus 4's response:

I'll analyze this Lean 4 proof step by step. The theorem states that for any non-negative real-valued function on natural numbers, if the product of the first n values equals 1, then their sum is at least n. This is a classic application of the AM-GM inequality.

Let me examine each part of the proof:

1. The AM-GM Inequality Definition

The proof starts by defining `am_gm_inequality`, but there are several issues here:

```lean
have am_gm_inequality : ∀ (x : ℕ → NNReal) (m : ℕ), Finset.sum (Finset.range m) x / m ≥ (Finset.prod (Finset.range m) x) ^ (1 / m)
```

The proof attempts to establish this through `step0`, `step1`, and `step2`, but:

- `step0` contains an `apply?` which is incomplete
- The steps are circular - they don't actually prove the AM-GM inequality
- `h₁` in `step0` defines an unused function

Issue: The AM-GM inequality is not actually proven; it's just restated multiple times.

2. Geometric Mean Equals One

```lean
have geometric_mean_eq_one : (Finset.prod (Finset.range n) a) ^ (1 / n) = 1
```

This should follow directly from `h₀` (which states the product equals 1) and the fact that `1^(1/n) = 1`. However, the proof:

- Evaluates `am_gm_inequality` on constant functions for various values
- Uses `linarith`, which won't work here since this requires understanding that `1^(1/n) = 1`

Issue: The proof doesn't use the hypothesis `h₀` and the reasoning is incorrect.

3. Arithmetic Mean Greater Than or Equal to One

```lean
have arithmetic_mean_ge_one : Finset.sum (Finset.range n) a / n ≥ 1
```

This should follow from applying AM-GM with the fact that the geometric mean equals 1. The proof again evaluates `am_gm_inequality` on various constant functions but doesn't properly connect this to the actual problem.

Issue: The connection between AM-GM and the specific instance isn't established.

4. Final Step

```lean
have sum_ge_n : Finset.sum (Finset.range n) a ≥ n
```

This should follow from multiplying both sides of `arithmetic_mean_ge_one` by n. However, the proof again just evaluates `am_gm_inequality` on various inputs and uses `nlinarith`.

Conclusion

This proof is not valid. The main issues are:

- The AM-GM inequality is never actually proven - it's just asserted through circular reasoning
- The key steps don't use the hypothesis `h₀` properly
- The arithmetic manipulations are not justified - the proof relies on tactics like `linarith` and `nlinarith` that won't work for the required reasoning
- The proof structure is repetitive and unclear - it evaluates the same inequality on constant functions multiple times without clear purpose

A correct proof would:

- Either import AM-GM from Mathlib or prove it properly
- Use `h₀` to show that `(∏ aᵢ)^(1/n) = 1^(1/n) = 1`
- Apply AM-GM to get `(∑ aᵢ)/n ≥ (∏ aᵢ)^(1/n) = 1`
- Multiply by n to get `∑ aᵢ ≥ n`

The current proof appears to be an incomplete attempt that relies on tactics that won't succeed for the mathematical reasoning required.
Mario Carneiro (May 24 2025 at 16:47):
lol, I want to use this now to grade those AI slop RH proofs
Thomas Zhu (May 24 2025 at 20:52):
It seems o3's response hallucinates some errors: it claims that there is no coercion from `Nat` to `NNReal`, that `h₁` … `h₅` do not type check, that `1 / m` is a real number where `m : Nat`, and some typeclass search failures, none of which are true. On the other hand, Opus 4 misses some errors: in particular, `_ ^ (1 / m)` in the original proof (1) uses Nat division where Real division is intended and (2) doesn't deal with the case `m = 0`.
GasStationManager (May 24 2025 at 22:43):
@Thomas Zhu good catches! To check the LLMs' accuracy on the other side of the coin, I gave them the same prompt but a different theorem & proof: DeepSeek Prover V2's solution to miniF2F's mathd_numbertheory_221; this was the one that passed the proof checkers but required quite a bit of computation and RAM to check.
Their responses are below.
- Both have a good grasp of the mathematical argument underlying the proof. It is not clear whether they already knew it before reading DeepSeek's proof sketch.
- Both had trouble parsing the chained tactics starting with `rfl` ...
So both concluded that the proof is not valid.
To be fair, it is pretty hard to figure out what these chained tactics are doing, without any assistance from the proof checker...
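As a toy illustration (mine, not DeepSeek's actual proof) of why such proofs can be expensive to check: tactics like `rfl` and `decide` make the kernel evaluate the whole computation, which is fine for small instances but can blow up in time and RAM at scale.

```lean
-- `rfl` forces the kernel to fully evaluate both sides to the same
-- numeral. Cheap here; proofs that push heavy computation through the
-- kernel this way are what make checking slow and memory-hungry.
example : 2 ^ 10 * 2 ^ 10 = 2 ^ 20 := by rfl
```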
Opus 4's response:

I'll analyze this Lean 4 proof step by step to verify its reasoning and check if it's valid.

Overview

The theorem claims that there are exactly 11 natural numbers less than 1000 that have exactly three distinct positive integer divisors.

Key Mathematical Insight

A natural number has exactly 3 divisors if and only if it's the square of a prime number. This is because:

- If n = p² where p is prime, then the divisors of n are {1, p, p²}, which gives exactly 3 divisors.
- Any other form would give a different number of divisors.

Step-by-Step Analysis

1. First Have Statement (h₁)

```lean
have h₁ : Finset.card (Finset.filter Nat.Prime (Finset.Icc 2 31)) = 11 := by rfl
```

Issue: The proof uses `rfl` (reflexivity), which would only work if this is definitionally equal. The comment mentions "3-smooth numbers", which is completely unrelated to the problem. 3-smooth numbers are numbers whose prime factors are only 2 and 3, which has nothing to do with counting numbers with exactly 3 divisors.

Verification: The primes between 2 and 31 are: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31. That's indeed 11 primes, but the proof doesn't properly establish this.

2. Second Have Statement (h₂)

```lean
have h₂ : ∀ p : ℕ, p ∈ Finset.filter Nat.Prime (Finset.Icc 2 31) → p^2 < 1000
```

This correctly establishes that for all primes p between 2 and 31, we have p² < 1000. Since 31² = 961 < 1000 and 32² = 1024 > 1000, this is the correct bound.

Issue: The proof uses `interval_cases`, which would need to check all values from 2 to 31, but this should work.

3. Third Have Statement (h₃)

```lean
have h₃ : ∀ p : ℕ, p ∈ Finset.filter Nat.Prime (Finset.Icc 2 31) → (p^2).divisors.card = 3
```

This correctly states that the square of a prime has exactly 3 divisors. This is mathematically correct.

4. Fourth Have Statement (h₄)

```lean
have h₄ : S.card = Finset.card (Finset.filter Nat.Prime (Finset.Icc 2 31))
```

Major Issue: This is the critical step that should establish that S consists exactly of the squares of primes between 2 and 31. However, the proof doesn't properly establish this bijection. It attempts to show that S equals the set of numbers between 1 and 999 with exactly 3 divisors, but then jumps to the conclusion without proving that these are exactly the squares of primes.

Overall Assessment

The proof is NOT valid for several reasons:

- Incorrect mathematical claim in comments: The comment about 3-smooth numbers is completely wrong and unrelated to the problem.
- Missing crucial bijection: The proof doesn't establish that the set S (numbers < 1000 with exactly 3 divisors) is exactly the set of squares of primes ≤ 31.
- Incomplete proof steps: Many steps end with tactics like `decide` and `rfl` that wouldn't actually complete the proof as written.
- Logic gap: The proof doesn't show that:
  - Every number with exactly 3 divisors is a square of a prime
  - Every square of a prime ≤ 31 is in S
  - No other numbers are in S

Correct Approach

A valid proof would need to:

- Prove that a number has exactly 3 divisors iff it's the square of a prime
- Find all primes p such that p² < 1000 (which are the primes from 2 to 31)
- Count these primes (there are 11 of them)
- Conclude that |S| = 11

The mathematical result is correct (there are indeed 11 such numbers), but this proof doesn't properly establish it.
GasStationManager (May 26 2025 at 19:36):
I have tried the prompt on a few other proofs, a somewhat random selection from DeepSeek Prover V2's and Kimina Prover's published solutions to miniF2F.
- o3 and Opus 4 generally have a good grasp of what is going on in the proofs, including the tactics. The above two may be somewhat extreme cases.
- Even for the above two examples, where they sometimes have trouble following the tactics, o3 and Opus 4 can be prompted to verify at the proof-sketch level: I asked them to check whether each `have` statement is correct, assuming that the previous ones are correct, and to check it in two ways: by testing with sample values (a manual `plausible`, if you will), and by providing an informal proof. They were able to confirm that the proof sketches are correct. The `apply?` line was a supposed-to-fail attempt at proving a true `have` statement (the AM-GM inequality).
GasStationManager (May 26 2025 at 19:58):
Also, maybe people already know this, but I just found out that 11 of Kimina Prover's miniF2F solutions use `native_decide`. (I found this out while running SafeVerify on the solutions; all the others passed.) Now I'm somewhat conflicted about `native_decide`; it may be reasonable to allow its use, as long as that is decided at the outset before training starts and one is aware of its potential issues.
Now that we have these proofs, it may be fair to ask: are these benign uses of `native_decide`, or the other kind, where you can use it to prove False? In the spirit of this thread, I asked our LLMs. Here's one example:
https://chatgpt.com/share/6834c6f6-2698-800a-bcd5-2710b373dd88
o3 was helpful in showing that `native_decide` was used to calculate the cardinality of a small set. This made me feel more comfortable about the proof.
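For comparison, a sketch of the benign pattern (my own example, not Kimina's proof; assuming both tactics succeed on a computation this small): a finite calculation where `native_decide` and the kernel-checked `decide` prove the same goal, so the fast proof can be cross-checked against the more trustworthy one.

```lean
import Mathlib

-- Count the multiples of 3 below 100: 0, 3, ..., 99, i.e. 34 of them.
-- `native_decide` runs this through compiled code (trusts the compiler),
example : (Finset.filter (fun k => k % 3 = 0) (Finset.range 100)).card = 34 := by
  native_decide

-- while `decide` runs the same computation inside the kernel.
example : (Finset.filter (fun k => k % 3 = 0) (Finset.range 100)).card = 34 := by
  decide
```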
Justin Asher (May 26 2025 at 20:05):
@GasStationManager Jason previously mentioned something similar regarding `native_decide` for CombiBench:

> This seems somewhat common in the Kimina models.
Justin Asher (May 26 2025 at 20:28):
@GasStationManager I see you were already aware of this other thread. Oops. Thanks for pointing out the miniF2F solutions containing `native_decide`.
Last updated: Dec 20 2025 at 21:32 UTC