Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Tool to verify a solution solves the actual problem?


Nehal Patel (Nov 25 2025 at 16:20):

Hi,

My high-level question is: suppose I have a Problem.lean file containing a theorem whose proof is a single sorry, and I have a Solution.lean which purports to provide a full proof of the theorem. Assume Solution.lean compiles cleanly without warnings, but there are still questions of validity: has the theorem been slightly perturbed, have axioms been introduced, etc.?

I understand that this is a nuanced topic and there are likely various solutions depending on one's level of paranoia and patience, and that the difficulty depends in part on how exactly Problem.lean has been structured, but I am also assuming this has come up before with regard to AI-generated proofs (or even FLT) (though I could not find a prior discussion...), so I'll leave it here in case someone has a quick pointer to an easy-to-use recipe for checking that Solution actually solves Problem.
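
For concreteness, a minimal hypothetical pair illustrating the setup (names invented) might be:

-- Problem.lean (hypothetical)
theorem add_comm' (a b : Nat) : a + b = b + a := by
  sorry

-- Solution.lean (hypothetical): the same statement with the proof filled in.
-- The question is how to confirm mechanically that this statement is exactly
-- the one in Problem.lean and that the proof introduces no new axioms.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b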

Thank you! nehal


... But as a concrete example, here is a problem/solution pair from DeepMind's formalimo. The solution has re-ordered some content (in a natural way), added some lemmas, etc. Is there a tool that can confirm the types have not been mutated? In this case, formalimo uses annotations that likely make robust checking viable, or at least easier, via a special tool, but what about when these annotations do not exist?

/-
Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-/

import Imo.ProblemImports

open scoped Real
open scoped Nat

/-- $a,b,c$ are real with $a$ non-zero. $x_1,x_2,\ldots,x_n$ satisfy the $n$ equations:

$ax_i^2+bx_i+c=x_{i+1}$, for $1\le i< n$

$ax_n^2+bx_n+c=x_1$

\noindent Prove that the system has zero, $1$ or $>1$ real solutions according as $(b-1)^2-4ac$ is $< 0,=0$, or $>0$.
-/
@[imo_problem_subject algebra]
theorem imo_1968_p3 (n : ℕ) (hn : 2 < n) [NeZero n] (a b c : ℝ) (ha : a ≠ 0) (S : Set (Fin n → ℝ))
    (hS : S = {x : Fin n → ℝ | ∀ i, a * x i ^ 2 + b * x i + c = x (i + 1)}) (d : ℝ)
    (hd : d = (b - 1) ^ 2 - 4 * a * c) :
    (d < 0 → S = ∅) ∧ (d = 0 → ∃ s, S = {s}) ∧ (d > 0 → ∃ s t, s ≠ t ∧ s ∈ S ∧ t ∈ S) := by
  sorry

theorem imo_1968_p3.parts.negative (n : ℕ) (hn : 2 < n) [NeZero n] (a b c : ℝ) (ha : a ≠ 0)
    (hd : (b - 1) ^ 2 - 4 * a * c < 0) :
    ¬∃ x : Fin n → ℝ, ∀ i, a * x i ^ 2 + b * x i + c = x (i + 1) := by
  sorry

theorem imo_1968_p3.parts.zero (n : ℕ) (hn : 2 < n) [NeZero n] (a b c : ℝ) (ha : a ≠ 0)
    (hd : (b - 1) ^ 2 - 4 * a * c = 0) :
    ∃! x : Fin n → ℝ, ∀ i, a * x i ^ 2 + b * x i + c = x (i + 1) := by
  sorry

theorem imo_1968_p3.parts.positive (n : ℕ) (hn : 2 < n) [NeZero n] (a b c : ℝ) (ha : a ≠ 0)
    (hd : (b - 1) ^ 2 - 4 * a * c > 0) :
    ∃ x y : Fin n → ℝ,
      x ≠ y ∧ ∀ i, a * x i ^ 2 + b * x i + c = x (i + 1) ∧ a * y i ^ 2 + b * y i + c = y (i + 1) := by
  sorry

And here is a solution (by claude-code, opus4.5)

[PastedText.txt](/user_uploads/3121/cGBI7xfO7fnsr6Vq-53LAxqR/PastedText.txt)

Kevin Buzzard (Nov 25 2025 at 20:51):

Search this forum for SafeVerify ( https://github.com/GasStationManager/SafeVerify )-- this is one tool which does this job at a high level of paranoia. And here #lean4 > FRO's new verifier @ 💬 is a thread talking about a second tool (which might not exist yet but is being developed by the FRO).

Kim Morrison (Nov 25 2025 at 23:31):

Kevin is referring to https://github.com/leanprover/comparator as the tool provided by the FRO (specifically, by Henrik Böving).

Jason Rute (Nov 25 2025 at 23:35):

@Henrik Böving is this ready to go?

Henrik Böving (Nov 25 2025 at 23:36):

Well as the README says it has not yet been pressure tested by public users so it could still have bugs but it does very much work yes.

Jason Rute (Nov 25 2025 at 23:40):

Also, note all these tools have edge cases where they are arguably too strict, marking stuff as wrong because of extremely minor syntactic differences. @Eric Wieser gave a good example in the discussion of Pantograph’s similar tool. https://leanprover.zulipchat.com/#narrow/channel/219941-Machine-Learning-for-Theorem-Proving/topic/Track.20checking.20function.20in.20Pantograph/with/554519378
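
For intuition, here is a hypothetical example of that failure mode: the two statements below are definitionally equal (Ne x y unfolds to ¬(x = y)), but their elaborated types are not syntactically identical, so an overly strict comparison could reject the second as a mutation of the first.

theorem as_stated (x y : Nat) (h : x ≠ y) : y ≠ x := fun e => h e.symm

-- Same content, but the head of the elaborated type is Not rather than Ne.
theorem as_submitted (x y : Nat) (h : ¬(x = y)) : ¬(y = x) := fun e => h e.symm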

Jason Rute (Nov 25 2025 at 23:41):

Also, SafeVerify doesn't check multi-file proofs (or, more accurately, it trusts any imported files), but I think comparator does handle that correctly.
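
As a toy illustration (invented names) of why trusting imports matters: an axiom hidden in trusted code lets a "solution" compile sorry-free without proving anything.

-- If an imported helper (or any code the checker trusts) contains
axiom cheat : ∀ (p : Prop), p
-- then the solution file itself stays sorry-free and compiles cleanly:
theorem hard_problem : 2 + 2 = 5 := cheat _

A checker that inspects only the solution file and trusts its imports would accept this.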

Eric Wieser (Nov 25 2025 at 23:53):

Not even syntactic differences; identical syntax in different files produces different things if the auxiliary declaration cache is populated with different names

Jason Rute (Nov 26 2025 at 12:48):

Eric Wieser said:

Not even syntactic differences; identical syntax in different files produces different things if the auxiliary declaration cache is populated with different names

Can you give an example? (And what is this "auxiliary declaration cache"?)

Jason Rute (Nov 26 2025 at 12:49):

Also @Eric Wieser, considering that you have clearly been thinking about this, do you have a recommended solution?

Oliver Dressler (Nov 27 2025 at 12:11):

I've been compiling test cases (including some based on the example by @Eric Wieser, thx!) and running them against some verifiers here.

The tool itself (LeanParanoia) is not what you asked for; it works without comparing to a reference statement. But the verifier comparison should still give an overview of the capabilities of some of the available tools.

Keep in mind that both LeanParanoia and the test cases have not really been tested publicly yet.

Jason Rute (Dec 02 2025 at 23:16):

Henrik Böving said:

Well as the README says it has not yet been pressure tested by public users so it could still have bugs but it does very much work yes.

Two places to pressure test your tool (if you haven't already), which I assume are exactly the applications you had in mind:

  1. A number of theorems from formal conjectures (I think 4 of them in total now) have been proved automatically by various AI companies.
  2. PutnamBench has a new SoTA. It would be good to partner with @George Tsoukalas to get him to stress test your tool in his verification process.

Both would also help determine if false negatives are common, where the tool rejects a correct proof because of trivial syntactic or file issues. (And hopefully not, but maybe your tool will also help find legitimate issues in the submitted proofs.)

Nehal Patel (Dec 06 2025 at 10:44):

I've found SafeVerify very nice and easy to use. (I use it in script mode, e.g. lean --run SafeVerify.lean prob.olean solution.olean.)
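
A lighter-weight sanity check that can complement it (not a replacement, since it does not compare statements) is to print the axioms a proof depends on; the module and theorem names below are just placeholders:

import Solution

#print axioms imo_1968_p3
-- Ordinary Mathlib proofs report only propext, Classical.choice and Quot.sound;
-- Lean.ofReduceBool shows up when native_decide was used, and any other
-- unexpected axiom deserves a closer look.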

Here are some datapoints that folks might find interesting, for solutions that compile sorry-free but fail SafeVerify:

claude-code-opus-4.5, sampled problems from fate-m/h/x, formalimo, putnambench, minif2f:

dataset_name  compiles  safeverified  total
fate-h              19            15     50
fate-m              48            46     50
fate-x               2             2     50
formalimo           14            12     50
minif2f             45            34     50
putnambench         11             9     50

agent   model       compiles  safeverified
claude  opus-4.5         234           203
codex   gpt-5.1           92            90
claude  sonnet-4.5        45            39

Most of the failures are due to using nativeDecide. Occasionally sonnet-4.5 introduces axioms even though it has been told it shouldn't.

I'll be writing more about this here: https://g-research-innovation.github.io/lean4-llm/blog3_three_horsemen

Deleted User 968128 (Dec 07 2025 at 07:38):

Curious, is gpt-5.1 preferable to codex-max (high)?

Such a moving target. They say GPT5.5 might come out before EOY.

Nehal Patel (Dec 07 2025 at 14:22):

ahh --that's a bug in the table above.... 'gpt-5.1' should be 'gpt-5.1-codex', (w/ high thinking effort) -- will try gpt-5.1-codex-max, ultra-high in a few days.

Bas Spitters (Dec 07 2025 at 16:25):

@Nehal Patel what's the issue with SafeVerify and NativeDecide? Is it just that SafeVerify does not trust it, or is there another reason?
https://github.com/GasStationManager/SafeVerify

Nehal Patel (Dec 07 2025 at 17:24):

@Bas Spitters , safeverify does not "trust" it -- more precisely it reports an error like:

found a problem in /home/ubuntu/src/lean-agents/data/scratch/putnambench/putnam_verify/workspace/lean4/submission_ec4d5266-82df-4dde-85f3-1d1551f6528f.olean
uncaught exception: (kernel) (interpreter) unknown declaration 'putnam_1990_a6._nativeDecide_1_1'

If I understand correctly, nativeDecide means "run compiled native code and trust its result", which I cannot imagine a minimal kernel would re-check -- but I think others can comment better than I on this.

The models could easily have been prompted to avoid using nativeDecide, but I had never heard of it before, so it was a learning opportunity for me. In general, nativeDecide is not really fair play for something like Putnam -- contestants are not allowed to simply brute-force solutions. However, many of the models' usages are innocuous:

....
have hgcd : Nat.gcd 90 92 = 2 := by native_decide
....

other instances are egregious:

theorem putnam_1990_a6 :
    ((Finset.univ : Finset <| Finset (Set.Icc 1 10) × Finset (Set.Icc 1 10)).filter
      fun ⟨S, T⟩ ↦ (∀ s ∈ S, T.card < s) ∧ (∀ t ∈ T, S.card < t)).card =
    ((17711) : ℕ) :=
  by
    -- The search space is finite (only 2¹⁰ choices for each subset), so we verify the count by exhaustive computation.
    native_decide

:rolling_on_the_floor_laughing:
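
(For innocuous cases like the gcd fact above, a kernel-checked alternative is usually a one-token change, e.g.

have hgcd : Nat.gcd 90 92 = 2 := by decide

which avoids the Lean.ofReduceBool axiom entirely, is perfectly feasible for small numerals like these, and keeps SafeVerify happy.)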

Deleted User 968128 (Dec 08 2025 at 01:09):

Nehal Patel said:

ahh --that's a bug in the table above.... 'gpt-5.1' should be 'gpt-5.1-codex', (w/ high thinking effort) -- will try gpt-5.1-codex-max, ultra-high in a few days.

I am attempting to parse your results but they seem to be conflicting with the data I've seen elsewhere.

I.e., Opus is very bad at math. But perhaps its persistence and doggedness in Claude Code are making up for that? Would a more persistent re-trying agent around gpt-5.1 achieve better results (GPT-5-pro, for example)?

What are token costs (importantly, including cache hits, which can reduce costs significantly)?

For practical though temporary reasons, your results might be what matters, but also this data could provide an inaccurate picture to people unaware of the underlying reasons for it.

Another way of looking at this problem is splitting model utilization into plan($$$ model)/act($ model). You can configure most agentic code clients to do this.

If I am missing something here, please let me know. What you're trying to do here is really important and quite useful.

Nehal Patel (Dec 08 2025 at 08:00):

@Tim Shephard
90% cache hits; a full PB run costs between $500 and $1000 depending on the model (which is the same order of magnitude as the costs reported for Ax-Prover and Aleph, and given Herald uses Gemini, I think it would be similar). The main point here is that, from both the tech write-ups and the blogs documenting costs, it seems clear (to me) that the leaderboard is topped by custom agentic loops, and it seemed (to me) worth pointing out that there are already very nicely engineered agentic loops -- loops that anybody can use.

There is no easy way to use the tools on the leaderboard, but anybody can download claude-code, build PB, pick a problem, tell it to clear the sorries, and have a 33% chance of success. And you can use the same tool not just for theorem proving but also for refactoring, adding comments, teaching you Lean 4, etc. Sometimes we can collectively fail to see simple solutions; or perhaps these simple solutions are improving very quickly.

(Why and how these agentic approaches work is a topic for another thread.)

Deleted User 968128 (Dec 08 2025 at 14:48):

Agreed, very pragmatic and useful, I don't disagree with that at all.

Deleted User 968128 (Dec 08 2025 at 14:58):

Still, and correct me if I am wrong, doesn't it say something about the state of things that you are concerning yourself with a class of problems less at the frontier than, say, https://critpt.com/ and https://epoch.ai/frontiermath?

Have the benchmarks gotten ahead of every day usage? Or is that class of usage just too expensive?

Kevin Buzzard (Dec 08 2025 at 15:42):

Tim, we're talking about formal datasets here; FrontierMath is informal mathematics which is a completely different ballgame.

Deleted User 968128 (Dec 09 2025 at 03:28):

Kevin Buzzard said:

Tim, we're talking about formal datasets here; FrontierMath is informal mathematics which is a completely different ballgame.

Kevin Buzzard, I think a more efficacious and perhaps more diplomatic way of stating this might be that tools such as gpt-5-pro and gemini-3-deep-think perform poorly on Lean benchmarks despite their strong showings on informal reasoning and the fact that they are grounded in tool calling and web search.

Unfortunately, I couldn't find such benchmarks. Also, as I mentioned here - echoed by Aristotle in their paper here #Machine Learning for Theorem Proving > Kaggle, 38/50 already via open models, $2M prize fund @ 💬 and in the latest DeepSeek math paper #Machine Learning for Theorem Proving > DeepSeekMath_V2 @ 💬 - the approach being used is informal reasoning to help guide formalization.

As I've explained elsewhere and in DMs to you specifically, the question is what gains can come from RLVR versus scaling on human generated informal reasoning.

There are fundamental issues (lack of continual learning in AI) and non-fundamental issues (Mathlib coverage, compute requirements for compilation, training datasets) involved in that question. Decomposition strategies play a large role, and with minimal (though expert) hinting they could uplift results immensely.

Which is why I suggested above to consider splitting into plan/act as several coding clients provide out of box. It would be interesting to at least hear some anecdotal experiences around that.

Deleted User 968128 (Dec 09 2025 at 03:34):

#Machine Learning for Theorem Proving > Planner + Generator Trend in Recent-ish NTP models is another good area to discuss this.

Deleted User 968128 (Dec 09 2025 at 03:42):

If someone were to provide a link to a small set of well-formed problems that would give them some idea of capability, I can run them through as above. It would not be a full benchmark, but there would be some anecdotal evidence.

The concept of this dataset would be a 'smoke test' set of perhaps 5 problems covering the full spectrum of difficulty. We could report results on each problem. Given the constantly shifting model versions, this might be more practical. The smoke-test benchmark should be updated periodically as it is saturated.

I have personally done this manually (not through an agentic loop), and I found Opus painfully lacking. I am curious, though, to hear examples to the contrary.

TongKe Xue (Dec 09 2025 at 03:55):

Dumb question: Can this be easily solved by

  1. running some lean program to dump elaborated terms (from the original .lean file)
  2. running some lean kernel checker that only accepts elaborated terms

[I don't know the command for either step, but I'm guessing they must exist.]

?

(or am I missing something very obvious)
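
To make my guess concrete, a very rough sketch of the statement-comparison half I have in mind (names are invented and the equality test is simplified; the real tools are surely far more careful about axioms, environments and definitional unfolding):

import Lean
open Lean

-- Hypothetical helper: do the problem and solution environments assign the
-- declaration `n` the exact same elaborated type?
def sameStatement (probEnv solEnv : Environment) (n : Name) : Bool :=
  match probEnv.find? n, solEnv.find? n with
  | some p, some s => p.type == s.type  -- purely syntactic comparison of elaborated types
  | _, _ => false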

Simon Sorg (Dec 09 2025 at 08:44):

Deleted User 968128 said:

Which is why I suggested above to consider splitting into plan/act as several coding clients provide out of box. It would be interesting to at least hear some anecdotal experiences around that.

See the Draft, Sketch, and Prove paper, which has been quite influential and had exactly this goal in mind.

I believe plan / act is normally done via sketching first, then filling in gaps, but there is absolutely no reason why this could not be achieved with a different scaffold that for instance first writes a todo-list, then sketches, and then does tool calls to an ATP model for individual parts of the proof sketch.

Granted, there are a lot of open questions about what scaffold works best empirically, or whether we need stringent scaffolds at all.


Last updated: Dec 20 2025 at 21:32 UTC