Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: DeepSeek-Prover V2

Andy Jiang (Apr 30 2025 at 15:43):

Looks like Deepseek's new model set a new record on PutnamBench ->https://x.com/zhs05232838/status/1917600755936018715

George Tsoukalas (Apr 30 2025 at 15:45):

49 problems solved is a huge improvement over previous bests. I've reached out to the DeepSeek team so they can send me the proofs to validate, so I can add the model to the leaderboard. (Though I only sent to their service email, if any members of the team see this, reach out!)

Auguste Poiroux (Apr 30 2025 at 17:52):

I like the super long Lean proof starting page 23 and ending page 32 of the report :exploding_head:

Floris van Doorn (Apr 30 2025 at 18:58):

It has certainly learned how to break proofs into small steps!

[...]
have h30 : (p : Z) = 2 := by omega
have h31 : (q : Z) ⩾ 3 := by omega
have h32 : (r : Z) ⩾ 4 := by omega
have h33 : (p : Z) * q * r − 1 = 2 * ((p − 1) * (q − 1) * (r − 1)) := by linarith
have h34 : (p : Z) = 2 := by omega
have h35 : (q : Z) ⩾ 3 := by omega
have h36 : (r : Z) ⩾ 4 := by omega
have h37 : (p : Z) * q * r − 1 = 2 * ((p − 1) * (q − 1) * (r − 1)) := by linarith
have h38 : False := by
  have h39 : (p : Z) = 2 := by omega
  have h40 : (q : Z) ⩾ 3 := by omega
  have h41 : (r : Z) ⩾ 4 := by omega
[...]

Floris van Doorn (Apr 30 2025 at 18:58):

But joking aside, the performance and these formal proofs look impressive! That is very different than some of the previous proofs that some models gave, which basically felt like "we learned how to provide 2 useful arguments to nlinarith".

Jason Rute (Apr 30 2025 at 19:00):

Maybe this should be its own topic?

Notification Bot (Apr 30 2025 at 19:02):

6 messages were moved here from #Machine Learning for Theorem Proving > PutnamBench by Floris van Doorn.

Jason Rute (Apr 30 2025 at 19:56):

This is impressive work! I look forward to understanding it better, especially in relation to the recent Kimina work.

Joseph Myers (Apr 30 2025 at 23:59):

Does validation for PutnamBench include making sure that the proofs do solve the intended informal problem (rather than showing up incorrect formal statements in the benchmark)?

The report also refers to solving 12 problems from CombiBench, which even more needs assessment of whether those problems have correct formal statements.

Joseph Myers (May 01 2025 at 00:02):

Given this work, does DeepSeek intend to make formal-to-formal attempts at the problems from the next few IMOs, Putnams etc. when they come out (preferably with independent formal statements shared between all AIs making such attempts for better comparability than each AI having its own Lean versions of the problems)?

Li Xuanji (May 01 2025 at 05:36):

One unfortunate thing I noticed from the paper is that it's evaluated using Lean 4.9 (which was released last July) and benchmark problems which are incompatible with Lean 4.9 are excluded. It's not clear to me whether this is due to model limitations or for better testing control (since DeepSeek-Prover-1.5 is also evaluated with Lean 4.9)

Alok Singh (May 01 2025 at 06:47):

Li Xuanji said:

One unfortunate thing I noticed from the paper is that it's evaluated using Lean 4.9 (which was released last July) and benchmark problems which are incompatible with Lean 4.9 are excluded. It's not clear to me whether this is due to model limitations or for better testing control (since DeepSeek-Prover-1.5 is also evaluated with Lean 4.9)

I just opened my phone to post the same point, but you were there first

Kelly Davis (May 01 2025 at 08:04):

Jason Rute said:

This is impressive work! I look forward to understanding it better, especially in relation to the recent Kimina work.

They speak a bit to this point in the Cold Start by Synthetic Data subsection of Section 2.2

Auguste Poiroux (May 01 2025 at 08:06):

Li Xuanji said:

One unfortunate thing I noticed from the paper is that it's evaluated using Lean 4.9 (which was released last July) and benchmark problems which are incompatible with Lean 4.9 are excluded. It's not clear to me whether this is due to model limitations or for better testing control (since DeepSeek-Prover-1.5 is also evaluated with Lean 4.9)

I would say it is probably neither about "model limitations" or "better testing control", but just about their technical setup. They are using https://github.com/xinhjBrant/mathlib4/tree/deepseek with a fixed version of a forked REPL. Their tooling and datasets are probably built around that.

Albert Jiang (May 01 2025 at 10:59):

George Tsoukalas said:

49 problems solved is a huge improvement over previous bests. I've reached out to the DeepSeek team so they can send me the proofs to validate, so I can add the model to the leaderboard. (Though I only sent to their service email, if any members of the team see this, reach out!)

Let me ask Zhihong about this.

Jason Rute (May 01 2025 at 11:03):

@Albert Jiang you may want to also point out all the mistakes people found in their benchmark: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/ProverBench/

Wang Jingting (May 01 2025 at 11:21):

Well, I tried the 7B model locally, and it worked really great. But there's something I feel that I have to point out: In the Appendix B of the paper, both of the proofs there has used apply?, and the proof seems not to actually work? (Since aside from apply?, it seems that nothing non-trivial has been proved, so I'm wondering if they have mistaken that as having proved the problem?)

Jason Rute (May 01 2025 at 12:47):

Wang Jingting said:

Well, I tried the 7B model locally, and it worked really great. But there's something I feel that I have to point out: In the Appendix B of the paper, both of the proofs there has used apply?, and the proof seems not to actually work? (Since aside from apply?, it seems that nothing non-trivial has been proved, so I'm wondering if they have mistaken that as having proved the problem?)

I agree it looks wrong. The proof doesn't use most of the assumptions. @George Tsoukalas Please look at this one closely when you get a chance. Is this problem wrong in PutnamBench, or do they have a buggy proof-checking harness? If the latter, I'm worried that their numbers might be much too high due to bugs in their infrastructure.

Auguste Poiroux (May 01 2025 at 12:53):

A wrong apply? proof being accepted closely resembles to the bugs of the tactic mode of the REPL that have been recently fixed (second fix specific to wrong self-application).
However, I would be surprised if they used the REPL tactic mode for DeepSeekProver-V2 given that they only used the command mode for DeepSeek-Prover v1.5 as far as I know. And from what I understand in their DeepSeek-Prover V2 report, they are also very likely using the command mode only.

Jason Rute (May 01 2025 at 13:03):

It would be good to go through their miniF2F solutions and find proofs with apply?. (And also to independently check all their solutions in Lean 4.9 or whatever it is that they use.)

Mantas Baksys (May 01 2025 at 14:00):

There is another Putnam problem using apply? that only formalizes variations of the answer statement in have statements without proof and then uses the final one to close the goal. I am really use about the results of this paper now, I would speculate that their REPL is dodgy and allows for these proofs to pass. Just a quick mention that the Kimina Lean server (https://github.com/project-numina/kimina-lean-server) catches these proofs as incomplete.

Marco Dos Santos (May 01 2025 at 14:01):

Yes, for example the first proof of Appendix B doesn't pass our verification with the Kimina Lean Server:
{'has_error': False, 'is_valid_no_sorry': False, 'is_valid_with_sorry': True}
It is a valid proof, but it counts as a proof using sorry.

Jason Rute (May 01 2025 at 14:02):

This could be really bad for them. Is anyone willing to check all there minif2f results? (They haven’t posted the Putnam bench results yet.)

Mert Ünsal (May 01 2025 at 14:16):

Let me try to validate their MiniF2F using Kimina Lean Server.

Jason Rute (May 01 2025 at 14:17):

(They officially use Lean 4.9 I think, so that may create issues.)

Pietro Monticone (May 01 2025 at 14:19):

I tested both proofs published in Appendix B using Lean v4.9.0 and I can confirm they're both incomplete.

Auguste Poiroux (May 01 2025 at 14:32):

Just checked all their minif2f proofs using LeanInteract with Lean v4.9.0. Unless I made a mistake in my code, no issues to report. They probably just made a mistake in their appendix.

Code

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

from lean_interact import AutoLeanServer, FileCommand, LeanREPLConfig, TempRequireProject
from lean_interact.interface import LeanError

config = LeanREPLConfig(lean_version="v4.9.0", project=TempRequireProject("mathlib"))


# Helper function to process a single file
def process_file(path):
    server = AutoLeanServer(config)
    response = server.run(FileCommand(path=path))
    if isinstance(response, LeanError) or not response.lean_code_is_valid(allow_sorry=False):
        print(f"Error in {path}: {response}")
        return False
    else:
        return True


# run all the lean files inside the `test` and `valid` folders
root_dir = os.path.dirname(os.path.abspath(__file__))
for split in ["test", "valid"]:
    files_to_run = []
    for root, dirs, files in os.walk(os.path.join(root_dir, split)):
        for file in files:
            if file.endswith(".lean"):
                path = os.path.join(root_dir, root, file)
                files_to_run.append(path)

    counter_correct = 0
    counter_total = 0

    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(process_file, path): path for path in files_to_run}
        for future in as_completed(futures):
            counter_total += 1
            if future.result():
                counter_correct += 1
            print(f"Correct: {counter_correct}/{counter_total}")

Pietro Monticone (May 01 2025 at 14:37):

Minor note: there is an apply?-incomplete proof in the valid set.

Auguste Poiroux (May 01 2025 at 14:42):

algebra_amgm_prod1toneq1_sum1tongeqn.lean? ~~It looks ok in the online Lean IDE, although there is a weird message: mathlib-demo.lean:8:0 AddConstAsyncResult.commitConst: constant has level params [u_1] but expected []~~

Simon Sorg (May 01 2025 at 14:45):

Auguste Poiroux said:

algebra_amgm_prod1toneq1_sum1tongeqn.lean? ~~It looks ok in the online Lean IDE, although there is a weird message: mathlib-demo.lean:8:0 AddConstAsyncResult.commitConst: constant has level params [u_1] but expected []~~

One of the messages is declaration uses 'sorry'

Auguste Poiroux (May 01 2025 at 14:46):

Yes indeed, I read apply? as exact? in my head. Sorry for the confusion

Auguste Poiroux (May 01 2025 at 14:52):

Well, then it seems that the REPL is not reporting this declaration uses 'sorry' message. This might be the real issue here. Thanks for reporting this @Pietro Monticone.

Deming Xu (May 01 2025 at 15:17):

https://huggingface.co/playground
https://openrouter.ai/chat

I found this model (deepseek-ai/DeepSeek-Prover-V2-671B) on Hugging Face Playground and OpenRouter! Hugging Face Playground should support free use of this model online. It seems like unlike previous Lean models, the API providers decided to add support for this one.

Deming Xu (May 01 2025 at 15:58):

I ran these tests:

import Mathlib.Tactic
import Mathlib.Data.Real.Basic


/--
Formalize and prove this in Lean:

"A implies B", if and only if "either not A or B"
-/
theorem imp_iff_not_or' (A B : Prop) : (A → B) ↔ (¬ A ∨ B) := by
  apply Iff.intro
  · -- Prove the forward direction: (A → B) → (¬ A ∨ B)
    intro h
    by_cases hA : A
    · -- Case: A is true
      have hB : B := h hA
      exact Or.inr hB
    · -- Case: A is false
      exact Or.inl hA
  · -- Prove the backward direction: (¬ A ∨ B) → (A → B)
    intro h
    intro hA
    cases h with
    | inl hnA =>
      -- Case: ¬ A is true
      exfalso
      exact hnA hA
    | inr hB =>
      -- Case: B is true
      exact hB

/--
Formalize and prove this in Lean:

For a real number x != 0, abs(x + 1/x) >= 2
-/
theorem abs_x_plus_inv_x_ge_two (x : ℝ) (hx : x ≠ 0) : |x + 1 / x| ≥ 2 := by
  have h₁ : (x + 1 / x) ^ 2 ≥ 4 := by
    have h₂ : (x - 1 / x) ^ 2 ≥ 0 := by nlinarith
    calc
      (x + 1 / x) ^ 2 = (x - 1 / x) ^ 2 + 4 := by
        -- Prove that (x + 1/x)^2 = (x - 1/x)^2 + 4
        field_simp [hx]
        ring
        <;>
        nlinarith
      _ ≥ 0 + 4 := by nlinarith
      _ = 4 := by norm_num
  -- Now we know (x + 1/x)^2 ≥ 4, so |x + 1/x| ≥ 2
  have h₂ : |x + 1 / x| ≥ 2 := by
    -- Use the fact that if y^2 ≥ 4, then |y| ≥ 2
    have h₃ : (x + 1 / x) ^ 2 ≥ 4 := h₁
    have h₄ : |x + 1 / x| ≥ 2 := by
      -- Use the property of squares and absolute values
      have h₅ : 2 ≤ |x + 1 / x| := by
        by_contra h₅
        -- If |x + 1/x| < 2, then (x + 1/x)^2 < 4
        have h₆ : |x + 1 / x| < 2 := by linarith
        have h₇ : (x + 1 / x) ^ 2 < 4 := by
          calc
            (x + 1 / x) ^ 2 = |x + 1 / x| ^ 2 := by
              -- Since (x + 1/x)^2 = |x + 1/x|^2
              nlinarith [abs_mul_abs_self (x + 1 / x)]
            _ < 2 ^ 2 := by
              gcongr <;> nlinarith
            _ = 4 := by norm_num
        linarith
      linarith
    exact h₄
  exact h₂


/--
Formalize and prove this in Lean:

For a real number b, if there is not x such that
x^2 + b * x + 1 = 0
then abs(b) < 2
-/
theorem no_real_root_implies_abs_lt_two (b : ℝ) : (¬ ∃ (x : ℝ), x^2 + b * x + 1 = 0) → abs b < 2 := by
  intro h
  have h₁ : b^2 - 4 < 0 := by
    by_contra h₂
    -- If b² - 4 ≥ 0, then the quadratic has a real root, contradicting h.
    have h₃ : b^2 - 4 ≥ 0 := by linarith
    have h₄ : ∃ (x : ℝ), x^2 + b * x + 1 = 0 := by
      -- Use the quadratic formula to find a root.
      use (-b + Real.sqrt (b^2 - 4)) / 2
      have h₅ : b^2 - 4 ≥ 0 := by linarith
      have h₆ : Real.sqrt (b^2 - 4) ≥ 0 := Real.sqrt_nonneg (b^2 - 4)
      nlinarith [Real.sq_sqrt (by linarith : 0 ≤ b^2 - 4), sq_nonneg (b - Real.sqrt (b^2 - 4))]
    exact h h₄
  have h₂ : abs b < 2 := by
    -- Prove that |b| < 2 using h₁.
    have h₃ : b^2 < 4 := by linarith
    have h₄ : abs b < 2 := by
      rw [abs_lt]
      constructor <;> nlinarith [sq_nonneg b, sq_nonneg (b - 2), sq_nonneg (b + 2)]
    exact h₄
  exact h₂

/--
Formalize and prove this in Lean:

For a real number x, if x^3 = 8, then x = 2.
-/
theorem cube_root_of_eight (x : ℝ) (h : x^3 = 8) : x = 2 := by
  have h₁ : (x - 2) * (x^2 + 2 * x + 4) = 0 := by
    have h₂ : x^3 - 8 = 0 := by linarith
    have h₃ : x^3 - 8 = (x - 2) * (x^2 + 2 * x + 4) := by ring
    rw [h₃] at h₂
    exact h₂
  have h₂ : x - 2 = 0 ∨ x^2 + 2 * x + 4 = 0 := by
    apply eq_zero_or_eq_zero_of_mul_eq_zero h₁
  cases h₂ with
  | inl h₂ =>
    have h₃ : x = 2 := by linarith
    exact h₃
  | inr h₂ =>
    have h₃ : x^2 + 2 * x + 4 = 0 := h₂
    have h₄ : (x + 1)^2 + 3 = 0 := by nlinarith
    have h₅ : False := by nlinarith
    exfalso
    exact h₅

Deming Xu (May 01 2025 at 16:02):

image.png

Thomas Zhu (May 01 2025 at 16:49):

Besides fixing REPL issues of not reporting the sorry, in my opinion the Lean side should also be changed: apply? shouldn't close a goal with sorry if no suggestions close the goal. (This was discussed.)

Patrick Massot (May 01 2025 at 17:05):

I don't think we should make Lean less nice to use for humans for the sake of agents. There could be an opt-in option for agents though.

Thomas Zhu (May 01 2025 at 17:25):

I think there is some misunderstanding: I am suggesting that making apply? not admit the goal when no suggestions close the goal is beneficial both for human users and for LLMs. At least this is the impression I got from this thread and my linked thread.

Eric Wieser (May 01 2025 at 18:02):

Why is having tactics use sorry a problem?

Auguste Poiroux (May 01 2025 at 18:21):

Auguste Poiroux said:

Well, then it seems that the REPL is not reporting this declaration uses 'sorry' message. This might be the real issue here. Thanks for reporting this Pietro Monticone.

Coming back to this, it seems that it's not a REPL issue. In Lean v4.9.0, when apply? is used in a proof we don't get a sorry message. Do any of you know when it was introduced? Additionally, are there other tactics that should be reported as sorry? I will update LeanInteract to check for these tactics when looking for sorries.

Eric Wieser (May 01 2025 at 18:22):

I don't think this is a trustworthy way to determine if a proof is sorry-free (but this is a user-interface bug)

Jason Rute (May 01 2025 at 18:23):

Eric Wieser said:

Why is having tactics use sorry a problem?

For one, it is frustrating to debug. Where did the sorry come from? It isn't like Lean will tell you which tactic was the cause of the sorry. For a novice user, they may not even suspect the ? tactics. The above thread is an example of someone getting confused why apply? closes the goal. Conversely, how does it help for a typical tactic? (Obviously not sorry or a well-known synonym, although I would also love for sorry synonyms to be made more sorry-like in appearance, as well as tactics which introduce sorries, no?). But I'm not a Lean power-user. Is there an advantage to using sorry in this case or similar tactics?

Eric Wieser (May 01 2025 at 18:26):

I think the argument for using sorry is to avoid:

the user seeing an annoying "unsolved goals" error on a goal they are actively working on
a warning that the apply? tactic does nothing

Thomas Zhu (May 01 2025 at 18:32):

Eric Wieser said:

I don't think this is a trustworthy way to determine if a proof is sorry-free (but this is a user-interface bug)

This I absolutely agree. Would you say a more future proof way is to check the set of axioms used is a subset of the standard three?

Auguste Poiroux (May 01 2025 at 18:32):

Eric Wieser said:

I don't think this is a trustworthy way to determine if a proof is sorry-free

Well, yes, you are right, I agree with you. A better way would be to enhance the REPL command mode to do an explicit check then. It would probably reuse the getProofStatus method introduced to check proofs in tactic mode. What do you think? I might try to implement that in the coming weeks.

Pietro Monticone (May 01 2025 at 18:36):

The two PutnamBench problems in Appendix B would not compile even without the apply? issue, right?

Pietro Monticone (May 01 2025 at 18:41):

In particular, as far as I can tell,

putnam_2005_a4 would not compile even if you replaced apply? with sorry. The whole calc block is invalid.
putnam_2007_b4 doesn’t compile because of an error in the rw preceding the apply?. Do you happen to know why it wasn’t detected?

Leni Aniva (May 01 2025 at 18:57):

Pantograph can check whether there's a sorry immediately after a tactic finishes, so there is that

Jason Rute (May 01 2025 at 18:58):

Can you use pantograph to check proofs like these?

Leni Aniva (May 01 2025 at 18:58):

Jason Rute said:

Can you use pantograph to check proofs like these?

e.g.?

We need to parse these text proofs into a proof tree if you want to do something like that

Auguste Poiroux (May 01 2025 at 18:59):

Pietro Monticone said:

In particular, as far as I can tell,

putnam_2005_a4 would not compile even if you replaced apply? with sorry. The whole calc block is invalid.

putnam_2007_b4 doesn’t compile because of an error in the rw preceding the apply?. Do you happen to know why it wasn’t detected?

I don't know what's happening with these PutnamBench proofs, I cannot manage to get them to compile, neither in VSCode or using the REPL...

Jason Rute (May 01 2025 at 18:59):

@Leni Aniva Any of the proofs in: https://github.com/deepseek-ai/DeepSeek-Prover-V2/blob/main/minif2f-solutions.zip

Eric Wieser (May 01 2025 at 19:00):

Surely we should just be checking these proofs with leanchecker + an axioms check?

Leni Aniva (May 01 2025 at 19:00):

Jason Rute said:

Leni Aniva Any of the proofs in: https://github.com/deepseek-ai/DeepSeek-Prover-V2/blob/main/minif2f-solutions.zip

probably not, if you want the precise location of the tactics producing sorrys

Eric Wieser (May 01 2025 at 19:00):

Leni Aniva said:

Pantograph can check whether there's a sorry immediately after a tactic finishes, so there is that

Does it detect

theorem foo : 1 = 0 := by
  let rec bar : 1 = 0 := by apply?
  exact bar

Leni Aniva (May 01 2025 at 19:00):

Eric Wieser said:

Pantogr

Leni Aniva said:

Pantograph can check whether there's a sorry immediately after a tactic finishes, so there is that

Does it detect
theorem foo : 1 = 0 := by
  let rec bar : 1 = 0 := by apply?
  exact bar

Yes, but you would have to run it with goal_tactic (what an agent would call when searching for the proof)

from pantograph.server import Server
from pantograph.expr import TacticHave

if __name__ == '__main__':
    server = Server(imports=['Init'])
    state0 = server.goal_start("1 = 0")
    state1 = server.goal_tactic(state0, goal_id=0, tactic=TacticHave("1 = 0"))
    print(state1)
    state1b = server.goal_tactic(state1, goal_id=0, tactic="apply?")
    print(state1b)

This fails on line -2.

Jason Rute (May 01 2025 at 19:02):

Eric Wieser said:

Surely we should just be checking these proofs with leanchecker + an axioms check?

Someone needs to write a document for AI researchers telling them about common pitfalls, axioms, checking proofs, errors in benchmarks, and other things which would embarrass them if they don't handle correctly.

Jason Rute (May 01 2025 at 19:05):

(Also because it wastes everyone's time.)

Auguste Poiroux (May 01 2025 at 19:10):

Eric Wieser said:

Leni Aniva said:

Pantograph can check whether there's a sorry immediately after a tactic finishes, so there is that

Does it detect
theorem foo : 1 = 0 := by
  let rec bar : 1 = 0 := by apply?
  exact bar

If we switch to tactic mode, the REPL detect this issue. The output is a bit long, but you get proof_status='Incomplete: contains metavariable(s)'.

Code

from lean_interact import AutoLeanServer, Command, LeanREPLConfig, ProofStep

conf = LeanREPLConfig(lean_version="v4.9.0")
server = AutoLeanServer(conf)

server.run(Command(cmd="theorem foo : 1 = 0 := by sorry"))
response = server.run(
    ProofStep(
        tactic="(\n  let rec bar : 1 = 0 := by apply?\n  exact bar)",
        proof_state=0,
    )
)
print(response)

Output

ProofStepResponse(proof_status='Incomplete: contains metavariable(s)', messages=[Message(data='Try this: refine Nat.eq_zero_of_add_eq_zero_left ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_zero_of_le_zero ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.le_zero.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_zero_of_not_pos ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Int.ofNat_eq_zero.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_zero_of_zero_dvd ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.zero_dvd.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_zero_of_gcd_eq_zero_left ?H', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_zero_of_gcd_eq_zero_right ?H', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_zero_of_add_eq_zero_right ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_of_beq_eq_true ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.le_antisymm ?h₁ ?h₂', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Lean.ofReduceNat 1 0 ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data="Try this: refine Nat.succ_inj'.mp ?a✝", severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.add_right_cancel ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.pred_inj ?a✝ ?a✝¹ rfl', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_of_mul_eq_mul_left ?hn ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_iff_le_and_ge.mpr ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.add_left_cancel ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_of_mul_eq_mul_right ?hm ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_of_lt_succ_of_not_lt ?hmn ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_of_le_of_lt_succ ?h₁ ?h₂', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.le_antisymm_iff.mpr ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Int.negSucc_inj.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Int.ofNat_inj.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.dvd_antisymm ?a✝ ?a✝¹', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.mul_left_cancel ?np ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.add_right_cancel_iff.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.add_left_cancel_iff.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine (Nat.mul_right_cancel_iff ?p 1 0).mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.mul_right_cancel ?mp ?h', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine (Nat.mul_left_cancel_iff ?p 1 0).mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Fin.mk.inj_iff.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.eq_of_testBit_eq ?pred', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0)), Message(data='Try this: refine Nat.compare_eq_eq.mp ?a✝', severity='info', start_pos=Pos(line=0, column=0), end_pos=Pos(line=0, column=0))], goals=[], proof_state=1)

Leni Aniva (May 01 2025 at 19:10):

There's nothing wrong if a proof step contains mvars. Either the error message is misleading in the repl, or its being overeager in detecting errors.

Leni Aniva (May 01 2025 at 19:11):

Im not familiar with lean_interact. Does this example assume the proof is finished?

Auguste Poiroux (May 01 2025 at 19:12):

Leni Aniva said:

There's nothing wrong if a proof step contains mvars. Either the error message is misleading in the repl, or its being overeager in detecting errors.

The check is for expression metavariables with hasExprMVar

Leni Aniva (May 01 2025 at 19:12):

Auguste Poiroux said:

Leni Aniva said:

There's nothing wrong if a proof step contains mvars. Either the error message is misleading in the repl, or its being overeager in detecting errors.

The check is for expression metavariables with hasExprMVar

so? What Im saying is unless we declare a proof to be fully complete, goals can have mvars, proof terms can have mvars, and these would not be a problem.

Auguste Poiroux (May 01 2025 at 19:16):

Leni Aniva said:

Im not familiar with lean_interact. Does this example assume the proof is finished?

This is not specific to LeanInteract, I am just using it for the interface to the REPL and for using Lean v4.9.0.
This example assume the proof is finished because there are no goals left. But there is a hole, hence the hasExprMvar check that fails

Notification Bot (May 01 2025 at 19:38):

11 messages were moved from this topic to #lean4 > #print axioms does not report sorry by Eric Wieser.

Jason Rute (May 01 2025 at 20:19):

Just so it is clear, that proof of algebra_amgm_prod1toneq1_sum1tongeqn is likely an example of an AI agent finding and exploiting a bug in Lean. The bug is the one mentioned in the above message. Even if DeepSeek were diligent and used #print axioms ... it would not have triggered anything suspicious. (Maybe it is the same with the putnam problems when we see the actual code. That could explain why the agent is using Cardinal.toNat.) If this is the case, this is really interesting and should be made more well known!

Jason Rute (May 01 2025 at 20:20):

(Edit: I'm calling it a bug now, not a "soundness bug".)

Jason Rute (May 01 2025 at 20:28):

(Our graph2tac paper found aN issue, but that was just an axiom of False that didn't take care to check for. This is much more crazy.)

Patrick Massot (May 01 2025 at 20:28):

Jason, I think it would be much wiser to remove that bold sentence and stick to facts.

Patrick Massot (May 01 2025 at 20:30):

This is not at all what people call a soundness bug. This has a precise meaning.

Jason Rute (May 01 2025 at 20:30):

Is the precise meaning based on external checking?

Jason Rute (May 01 2025 at 20:30):

Or the kernel?

Jason Rute (May 01 2025 at 20:31):

Of is #print axioms also part of that?

Jason Rute (May 01 2025 at 20:31):

Because there did seem to be a bug in #print axioms?

Patrick Massot (May 01 2025 at 20:31):

The kernel.

Jason Rute (May 01 2025 at 20:32):

What should I call it? A student could use this same trick to cheat on a Lean assignment (if not using an extenal checker).

Patrick Massot (May 01 2025 at 20:32):

#print axiom is a convenient command to quickly check something is axiom free, and it would indeed be nice to make it more reliable. But this has nothing to do with a soundness bug.

Patrick Massot (May 01 2025 at 20:33):

You can call it a bug in a convenience command.

Jason Rute (May 01 2025 at 20:36):

"Has nothing to do with" is too strong. There is a name for this. Let me look it up.

Jason Rute (May 01 2025 at 20:41):

I thought it was "Pollack consistency" but that is slightly different. Okay, I'll just call it a "bug in Lean".

Sebastian Ullrich (May 01 2025 at 20:42):

I commented on the non-severity in the other thread.

Patrick Massot (May 01 2025 at 20:45):

Jason, you make it sound like the bug in this command is enough, but the main thing is the bug in the AI pipeline that simply doesn’t check the Lean kernel accepts the proof.

Jason Rute (May 01 2025 at 20:46):

My point is this. It is well known you can trick Lean to make a bad proof look right. You can mess with the tactic environment and such. Sure, if you do external proof checking, it would be caught, but 99% of Lean users don't do that. this is a case of an AI agent finding a very subtle bug in Lean such that #print axioms doesn't say sorry is used. I think it is a big deal. Lean's kernel is fine and external type checking still works, but for AI research it isn't enough to rely on Lean's command line + print axioms. You have to export the proofs. Is this balanced and fair?

Jason Rute (May 01 2025 at 20:47):

The bug in the AI pipeline could have just been that they didn't use external type checking.

Jason Rute (May 01 2025 at 20:49):

Oh, I see. I should try to run this proof on the command line. That apparently fails according to @Sebastian Ullrich. Let me do that.

Luigi Massacci (May 01 2025 at 20:52):

Jason Rute said:

The bug in the AI pipeline could have just been that they didn't use external type checking.

Surely this is the most reasonable answer. Besides I'm not sure I agree that 99% of lean users don't know this. Anyone who uses lean mostly as a human will inevitably run into issues like these, and they come up all the time on Zulip. There's plenty of threads suggesting to use done at the end of proofs, etc... I think only people that interact with lean mostly via REPL will not have run into something similar.

Thomas Zhu (May 01 2025 at 21:00):

I think I have to disagree with @Jason Rute here:

There are two bugs: a Lean bug of #print axioms and a REPL bug of apply?.
I wrote the #print axioms line myself to debug the seeming absence of squiggly lines of apply? (which turned out to be a VS Code issue)
#print axioms did break, but as far as I know, #print axioms is not used by DeepSeek's team to check correctness. The agent probably didn't find/exploit whatever bug I found.
The original proof in the zip file released by DeepSeek was indeed incorrect, and some bug was exploited by the DeepSeek-Prover-V2 model. But this bug is the apply? bug: it is something like Lean REPL not reporting the sorry warning message in apply? (I have not read all the details above); and I don't have any evidence this apply? issue is at all related to the #print axioms issue. Since this is probably a REPL issue, this should be categorized as a bug in the AI pipeline and not a bug in Lean.
DeepSeek-Prover-V2 did not use apply? on any other problem in miniF2F. I think their numbers on miniF2F are still broadly sound (except for this 0.4% difference on miniF2F-valid). But the proofs of PutnamBench should warrant a more detailed check.
I would say this "bug discovery" is me trying to debug the apply? issue and (because DeepSeek-Prover-V2 wrote the Type*) accidentally triggering the #print axioms bug during debugging. It's interesting that besides the sorry warning, the AddConstAsyncResult.commitConst: constant has level params [u] but expected [] error message was also not caught by DeepSeek-Prover-V2's checking (perhaps because there is also some message upper limit for REPL like VS Code?) but I don't expect DeepSeek-Prover to have put the Type* there to deliberately trigger the #print axioms bug for humans at all (see Appendix B of their paper).

Leni Aniva (May 01 2025 at 21:02):

Lean doesn't have a single message for "this proof is not correct", and the engineers at DeepSeek did not cover all of the cases where a proof fails. That is my impression about this problem

Thomas Zhu (May 01 2025 at 21:04):

I would say it is impressive that DeepSeek-Prover-V2 cheated their pipeline (perhaps using REPL) of checking proofs. But reward cheating in RL happens all the time: it only takes the agent to write an apply? tactic once to find it.

Jason Rute (May 01 2025 at 21:05):

#print axioms did break, but as far as I know, #print axioms is not used by DeepSeek's team to check correctness. The agent probably didn't find/exploit whatever bug I found.

Maybe. But in that case, I'm surprised apply? wasn't used more by the AI. You could solve everything with it.

Leni Aniva (May 01 2025 at 21:06):

they probably havent trained it enough to collapse its output distribution yet

Deming Xu (May 01 2025 at 21:08):

#lean4 > #print axioms does not report sorry

Just to be clear, you mentioned in that thread that the error will not come up in Lean version 4.18.0, and it might be "an unintended side effect of parallelism", which is probably a new feature? And you mentioned that DeepSeek was using Lean 4.9, so this #print axioms bug itself probably does not affect the DeepSeek team, right?

I tested your code snippet on my machine.

universe u

-- declaration uses 'sorry'
-- AddConstAsyncResult.commitConst: constant has level params [u] but expected []
theorem f : False := by
  let a {α : Type u} : α → α := id
  sorry

-- 'f' does not depend on any axioms
#print axioms f

-- Let's try using this `f`
-- (kernel) unknown constant 'f'
theorem g : 1 + 1 = 3 := False.elim f

When my lean-toolchain file is set to leanprover/lean4:v4.19.0-rc3 it prints no axioms. When I change it to leanprover/lean4:v4.18.0 it prints the usual 'f' depends on axioms: [sorryAx].

In Lean 4 Web (https://live.lean-lang.org), it lets me change the version of Lean. The default "Latest Mathlib" version prints no axioms, while "Mathlib stable" prints the sorryAx axiom.

Auguste Poiroux (May 01 2025 at 21:09):

In Lean v4.9.0, at least, I didn't get a sorry message for apply?. So I wouldn't say it is a REPL issue since the command mode is more or less mirroring what we get in the IDE. It was used with the belief that it does the type-checking. I agree that we should improve the REPL in this regard.

Jason Rute (May 01 2025 at 21:09):

You are all correct that I'm making assumptions about how this arose. Maybe it just happens to be that the one proof using apply? (which can solve any goal) was also the one proof for which print axioms won't point out that apply? introduces sorry. (Well there are also the two putnam proofs using apply?, but we can't seem to get those to work, although they do have that weird cardinality stuff.)

Jason Rute (May 01 2025 at 21:12):

Ok, there is a slight sarcasm there, but un-sarcastically, I think the best would be to get DeepSeek to explain more how this arose and give more details on the putnam stuff. Also @George Tsoukalas should make sure he uses external type checking for checking PutnamBench problems.

Thomas Zhu (May 01 2025 at 21:12):

Jason Rute said:

#print axioms did break, but as far as I know, #print axioms is not used by DeepSeek's team to check correctness. The agent probably didn't find/exploit whatever bug I found.

Maybe. But in that case, I'm surprised apply? wasn't used more by the AI. You could solve everything with it.

I suspect: the bug in REPL (or whatever DeepSeek used) was similar to squiggly lines in VS Code in that there is an upper limit to the number of messages, so the declaration uses sorry and the AddConstAsyncResult.commitConst messages were not caught. This depends on apply? suggesting enough suggestions to exceed this upper limit. I suspect REPL will find that example : False := by apply? is wrong but won't find that example : a very long untrue prop := by apply? is wrong. (Edit: check below)

Thomas Zhu (May 01 2025 at 21:13):

I think it's complete coincidence that the apply? REPL bug and #print axioms bug coincided in the same miniF2F problem. This is because of their two incorrect examples in Appendix B didn't add any Type* in the proof.

Jason Rute (May 01 2025 at 21:14):

And (unsarcastically), I am being too sensationalist. Sorry.

Jason Rute (May 01 2025 at 21:15):

For Appendix B, I would like to see the original code and why they think it checks.

Leni Aniva (May 01 2025 at 21:15):

I think tactics should not rely on error messages to indicate they have failed. All tactics should fail via throwing exceptions.

Aaron Liu (May 01 2025 at 21:16):

Leni Aniva said:

I think tactics should not rely on error messages to indicate they have failed. All tactics should fail via throwing exceptions.

What's the difference between an error message and an exception?

Jason Rute (May 01 2025 at 21:16):

Thomas Zhu said:

I think it's complete coincidence that the apply? REPL bug and #print axioms bug coincided in the same miniF2F problem. This is because of their two incorrect examples in Appendix B didn't add any Type* in the proof.

Would Cardinality.toNat add Type*? (Both Putnam examples use this.)

import Mathlib

theorem foo (n m : Nat) : n = m := by
  have h0 : n = m := by calc
    n = Cardinal.toNat (n) := by simp [Nat.cast_inj]
    _ = Cardinal.toNat (m) := by apply?
    _ = m := by simp [Nat.cast_inj]
  exact h0

Leni Aniva (May 01 2025 at 21:17):

Aaron Liu said:

Leni Aniva said:

I think tactics should not rely on error messages to indicate they have failed. All tactics should fail via throwing exceptions.

What's the difference between an error message and an exception?

error messages are just strings in the message log. an exception forces whatever metaprogramming tool interacting with Lean to handle the error, and can't be silently ignored.

Aaron Liu (May 01 2025 at 21:18):

Jason Rute said:

Thomas Zhu said:

I think it's complete coincidence that the apply? REPL bug and #print axioms bug coincided in the same miniF2F problem. This is because of their two incorrect examples in Appendix B didn't add any Type* in the proof.

Would Cardinality.toNat add Type*?
import Mathlib

theorem foo (n m : Nat) : n = m := by
  have h0 : n = m := by calc
    n = Cardinal.toNat (n) := by simp [Nat.cast_inj]
    _ = Cardinal.toNat (m) := by apply?
    _ = m := by simp [Nat.cast_inj]
  exact h0

It introduces a universe metavariable ~~which gets converted into a universe parameter after the declaration finishes elaborating~~. This causes the bug where something bad happens if a declaration has universe parameters not mentioned in its type.

Jason Rute (May 01 2025 at 21:20):

Is that bug related to either the apply? bug or the print axioms bug?

Aaron Liu (May 01 2025 at 21:20):

Leni Aniva said:

Aaron Liu said:

Leni Aniva said:

I think tactics should not rely on error messages to indicate they have failed. All tactics should fail via throwing exceptions.

What's the difference between an error message and an exception?

error messages are just strings in the message log. an exception forces whatever metaprogramming tool interacting with Lean to handle the error, and can't be silently ignored.

I think most tactics do this already. When a nested tactic failure happens, it propagates up the monad.

Leni Aniva (May 01 2025 at 21:23):

Aaron Liu said:

Leni Aniva said:

Aaron Liu said:

Leni Aniva said:

I think tactics should not rely on error messages to indicate they have failed. All tactics should fail via throwing exceptions.

What's the difference between an error message and an exception?

error messages are just strings in the message log. an exception forces whatever metaprogramming tool interacting with Lean to handle the error, and can't be silently ignored.

I think most tactics do this already. When a nested tactic failure happens, it propagates up the monad.

but what if a tactic just sets the message log and silently returns. ring_nf behaves like this if it can't make progress.

Auguste Poiroux (May 01 2025 at 21:23):

Thomas Zhu said:

Jason Rute said:

#print axioms did break, but as far as I know, #print axioms is not used by DeepSeek's team to check correctness. The agent probably didn't find/exploit whatever bug I found.

Maybe. But in that case, I'm surprised apply? wasn't used more by the AI. You could solve everything with it.

I suspect: the bug in REPL (or whatever DeepSeek used) was similar to squiggly lines in VS Code in that there is an upper limit to the number of messages, so the declaration uses sorry and the AddConstAsyncResult.commitConst messages were not caught. This depends on apply? suggesting enough suggestions to exceed this upper limit. I suspect REPL will find that example : False := by apply? is wrong but won't find that example : a very long untrue prop := by apply? is wrong.

~~example : False := by apply? will not raise any warning (sorry message) when using Lean v4.9.0. Same happens for the REPL using this Lean version.~~
~~However, when using more recent Lean versions, the REPL will complain that the declaration is using sorry. This whole discussion would have never happened if DeepSeek used Lean v4.15.0 for example.~~
~~But this raised the important problem that we should type check the proof correctly and that the REPL is currently not doing it in command mode :+1:~~

Thomas Zhu (May 01 2025 at 21:25):

Thanks for the fact check!

Aaron Liu (May 01 2025 at 21:26):

Leni Aniva said:

ring_nf behaves like this if it can't make progress.

Can you provide an example? I can't seems to reproduce this.

Leni Aniva (May 01 2025 at 21:27):

Aaron Liu said:

Leni Aniva said:

ring_nf behaves like this if it can't make progress.

Can you provide an example? I can't seems to reproduce this.

https://github.com/stanford-centaur/PyPantograph/issues/97

Aaron Liu (May 01 2025 at 21:29):

So ring_nf doesn't throw a message.

Pietro Monticone (May 01 2025 at 21:29):

When using Lean v4.9.0 this snippet:

example : False := by apply?

raises the following warning message:

declaration uses 'sorry'

Auguste Poiroux (May 01 2025 at 21:32):

@Thomas Zhu I wasn't testing your hypothesis about example : a very long untrue prop := by apply?. I was just testing if DeepSeek proof apply? issue is now caught with more recent Lean versions and the Lean REPL.
I assumed the same would apply with example : False := by apply?, but your hypothesis is confirmed: in Lean v4.9.0 we have a sorry warning with example : False := by apply? but not with long enough declarations/proofs using apply?.

Thomas Zhu (May 01 2025 at 21:35):

Pietro Monticone said:

When using Lean v4.9.0 this snippet:
example : False := by apply?
raises the following warning message:
declaration uses 'sorry'

Just to be clear: you mean Lean REPL v4.9.0, and not just Lean v4.9.0?

Pietro Monticone (May 01 2025 at 21:36):

No, I mean Lean v4.9.0.

Thomas Zhu (May 01 2025 at 21:41):

Yes, I am sure that an unsuccessful apply? raises the declaration uses 'sorry' message in any proof in any Lean version. The question is whether this message will be caught by Lean REPL (and whatever DeepSeek uses to check their proofs). It appears from the above messages that it isn't (at least in some theorems) but I have not tested myself. Also, completely unrelatedly, the warning message doesn't have an orange squiggly line in VS Code due to the 500 message limit.

Auguste Poiroux (May 01 2025 at 21:42):

@Thomas Zhu your hypothesis "I suspect REPL will find that example : False := by apply? is wrong but won't find that example : a very long untrue prop := by apply? is wrong." is always true even in recent Lean versions. Sorry for the mix up, I added strikethrough to both my messages above. It seems that whenever we have a situation where we don't have the sorry squiggles in the IDE (i.e. no sorry message is raised), the same applies to the REPL

Pietro Monticone (May 01 2025 at 21:44):

[PS: I think the undetected error preceding the apply? in putnam_2007_b4 is at least as interesting]

Thomas Zhu (May 01 2025 at 21:47):

Pietro Monticone said:

[PS: I think the undetected error preceding the apply? in putnam_2007_b4 is at least as interesting]

I think that is the same as the AddConstAsyncResult.commitConst error in algebra_amgm_prod1toneq1_sum1tongeqn: the error is detected, but the error message is somehow flooded by the many suggestions from apply?

Eric Wieser (May 01 2025 at 21:53):

I'd encourage people in future to create new threads in other channels and link to them here when you find a weird Lean behavior; remember that many people very familiar with the internals of Lean may not bother reading threads in this channel.

Aaron Liu (May 01 2025 at 21:54):

Leni Aniva said:

Aaron Liu said:

Leni Aniva said:

ring_nf behaves like this if it can't make progress.

Can you provide an example? I can't seems to reproduce this.

https://github.com/stanford-centaur/PyPantograph/issues/97

#24529

Auguste Poiroux (May 01 2025 at 21:54):

Maybe DeepSeek-Prover V2 specifically learned through RL to generate apply? to flood messages with info messages to hide errors in preceding code. Since they are doing a recursive subgoal decomposition, it is possible that using apply? was the only way for the model to get unstuck.

Thomas Zhu (May 01 2025 at 22:06):

Auguste Poiroux said:

Thomas Zhu your hypothesis "I suspect REPL will find that example : False := by apply? is wrong but won't find that example : a very long untrue prop := by apply? is wrong." is always true even in recent Lean versions. Sorry for the mix up, I added strikethrough to both my messages above. It seems that whenever we have a situation where we don't have the sorry squiggles in the IDE (i.e. no sorry message is raised), the same applies to the REPL

That is very weird to me: the number 500 is hard-coded by VS Code. (source)

Thomas Zhu (May 01 2025 at 22:26):

@Auguste Poiroux do you have an example of the "very long untrue prop := by apply?" that tricks REPL? Somehow I couldn't reproduce it

Auguste Poiroux (May 01 2025 at 22:27):

O_O thanks for looking into this. This makes me dig into the REPL. So:

it turns out that the REPL outputs all the messages.
On the algebra_amgm_prod1toneq1_sum1tongeqn problem, the declaration use sorry message is in the list of the messages for Lean v4.18.0, but not for Lean v4.9.0.

This may explain also why I was confused in my previous messages (it is also late where I am, so I am increasingly making more mistakes as well ^^)

Thomas Zhu (May 01 2025 at 22:29):

So if I understand correctly, this is a bug in REPL v4.9.0 that was fixed before v4.18.0?

Thomas Zhu (May 01 2025 at 22:34):

And thanks for grounding my speculations by doing so many tests! There are many seemingly related issues here that aren't really related.

Auguste Poiroux (May 01 2025 at 22:34):

Well, I don't think so. This surely complicates things, but I am using a backport of the recent REPL in Lean v4.9.0. So the REPL code between v4.18.0 and v4.9.0 should be very similar.
But maybe we should check with the original v4.9.0 of the REPL to make sure that we are not dealing with other issues on top of the original ones. (although it is likely that the issue will be the same since DeepSeek has the same undetected apply? issue)

Auguste Poiroux (May 01 2025 at 22:36):

Thomas Zhu said:

And thanks for grounding my speculations by doing so many tests! There are many seemingly related issues here that aren't really related.

You speculate quite well, hopefully we can fix this issue soon :smile:

Frederick Pu (May 01 2025 at 22:45):

there should be an option for setting which axioms you allow or disallow. I feel like having to manually parse for sorries is a bit sketchy

Auguste Poiroux (May 01 2025 at 22:49):

Indeed, I am just trying to understand what's happening about these sorry messages right now.
But we should implement something rigorous for proof checking in the REPL.

Justin Asher (May 01 2025 at 22:52):

Auguste Poiroux said:

Indeed, I am just trying to understand what's happening about these sorry messages right now.
But we should implement something rigorous for proof checking in the REPL.

Could this be a separate package for benchmarking / automated theorem proving? I was already considering writing this but have been working on other things. Perhaps we should create a new thread to discuss creating a rigorous proof checking system in the REPL.

Leni Aniva (May 02 2025 at 00:10):

we have this kind of proof checking system in pantograph

Justin Asher (May 02 2025 at 00:28):

Leni Aniva said:

we have this kind of proof checking system in pantograph

Via sorry collection? E.g., frontend.process

Justin Asher (May 02 2025 at 00:30):

Can you also check for axioms in pantograph? I am worried about LLMs cheating in more complicated, e.g., document based setups where they try to prove a theorem over many lines and declarations.

Leni Aniva (May 02 2025 at 00:51):

Justin Asher said:

Leni Aniva said:

we have this kind of proof checking system in pantograph

Via sorry collection? E.g., frontend.process

No. During proof search, tactics generating sorrys or unsafe definitions will be immediately rejected. At the end of a proof, if there are sorrys or unsafes the proof will be rejected as well.

Leni Aniva (May 02 2025 at 00:52):

Justin Asher said:

Can you also check for axioms in pantograph? I am worried about LLMs cheating in more complicated, e.g., document based setups where they try to prove a theorem over many lines and declarations.

The way I intended is that unfiltered inputs should never go into frontend.process. All LLM interactions need to be done via goal.{tactic,continue} functions, where we have more control over the environment.

Leni Aniva (May 02 2025 at 00:53):

If the LLM can interact with frontend.process, it is inherently unsafe, and the model can cheat by adding new axioms

Deming Xu (May 02 2025 at 01:24):

I downloaded the minif2f solution folder they released.

https://github.com/deepseek-ai/DeepSeek-Prover-V2

With a grep search, I found a single file containing the apply?.

minif2f-solutions % rg -F 'apply?'
valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean
23:        apply?
minif2f-solutions %

So I think it is not that bad for MiniF2F?

The file
minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean
looks like this

import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

/-- Show that for any real-valued function $a$ on the natural numbers such that $\forall i \in \mathbb{N}, a_i \geq 0$, if $prod_{i=0}^{n-1} a_i = 1$, then $sum_{i=0}^{n-1} a_i \geq n$.-/
theorem algebra_amgm_prod1toneq1_sum1tongeqn (a : ℕ → NNReal) (n : ℕ)
  (h₀ : Finset.prod (Finset.range n) a = 1) : Finset.sum (Finset.range n) a ≥ n := by
  have am_gm_inequality : ∀ (x : ℕ → NNReal) (m : ℕ),
    Finset.sum (Finset.range m) x / m ≥ (Finset.prod (Finset.range m) x) ^ (1 / m) := by
    have step0 : ∀ (x : ℕ → NNReal) (m : ℕ), Finset.sum (Finset.range m) x / m ≥ (Finset.prod (Finset.range m) x) ^ (1 / m) := by
      intro x m
      -- (Removed to prevent data leakage...)
      have h₂ : Finset.sum (Finset.range m) x / m ≥ (Finset.prod (Finset.range m) x) ^ (1 / m) := by
        -- NOTE: The `apply?` line is here
        apply?
      exact h₂
    have step1 : ∀ (x : ℕ → NNReal) (m : ℕ), Finset.sum (Finset.range m) x / m ≥ (Finset.prod (Finset.range m) x) ^ (1 / m) := by
      -- (Removed to prevent data leakage...)

Thomas Zhu (May 02 2025 at 01:25):

@Deming Xu Yes, this is the one we extensively talked about above. The other solutions probably don't have this issue

Jason Rute (May 02 2025 at 01:35):

I’m unclear. Has someone put all the miniF2F solutions through leanchecker? Is this that only one which fails?

Thomas Zhu (May 02 2025 at 01:35):

Someone should definitely do that; I am saying other proofs don't have the apply? issue.

Jason Rute (May 02 2025 at 01:36):

I know. Sorry I started a new topic.

Deming Xu (May 02 2025 at 01:38):

lean_proj % lake env lean ./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean
warning: Canonical: repository '././.lake/packages/Canonical' has local changes
Try this: refine (NNReal.le_div_iff' ?_).mpr ?_
Try this: refine NNReal.le_of_forall_lt_one_mul_le ?_
found a partial proof, but the corresponding tactic failed:
  refine (NNReal.div_le_div_left ?_ ?_ ?_).mp ?_
Try this: refine NNReal.le_of_forall_pos_le_add ?_

... (about 800 lines)

Try this: refine WithUpperSet.toUpperSet_specializes_toUpperSet.mp ?_
Try this: refine WithUpperSet.toUpperSet_le_iff.mp ?_

(Look at the following line. It mentions the sorry.)

./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:9:8: warning: declaration uses 'sorry'
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:10:3: warning: unused variable `h₀`
note: this linter can be disabled with `set_option linter.unusedVariables false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:49:8: warning: this tactic is never executed
note: this linter can be disabled with `set_option linter.unreachableTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:50:8: warning: this tactic is never executed
note: this linter can be disabled with `set_option linter.unreachableTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:70:10: warning: this tactic is never executed
note: this linter can be disabled with `set_option linter.unreachableTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:71:10: warning: this tactic is never executed
note: this linter can be disabled with `set_option linter.unreachableTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:72:10: warning: this tactic is never executed
note: this linter can be disabled with `set_option linter.unreachableTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:89:8: warning: this tactic is never executed
note: this linter can be disabled with `set_option linter.unreachableTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:90:8: warning: this tactic is never executed
note: this linter can be disabled with `set_option linter.unreachableTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:49:8: warning: 'norm_num at h₁ h₂ h₃ h₄ h₅' tactic does nothing
note: this linter can be disabled with `set_option linter.unusedTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:50:8: warning: 'linarith' tactic does nothing
note: this linter can be disabled with `set_option linter.unusedTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:70:10: warning: 'simp_all [h₀, geometric_mean_eq_one]' tactic does nothing
note: this linter can be disabled with `set_option linter.unusedTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:71:10: warning: 'norm_num' tactic does nothing
note: this linter can be disabled with `set_option linter.unusedTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:72:10: warning: 'linarith' tactic does nothing
note: this linter can be disabled with `set_option linter.unusedTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:89:8: warning: 'norm_num at h₁ h₂ h₃ h₄ h₅ h₆ h₇ h₈ h₉ h₁₀ h₁₁ h₁₂' tactic does nothing
note: this linter can be disabled with `set_option linter.unusedTactic false`
./LeanProj/minif2f-solutions/valid/algebra_amgm_prod1toneq1_sum1tongeqn.lean:90:8: warning: 'nlinarith' tactic does nothing
note: this linter can be disabled with `set_option linter.unusedTactic false`

Thomas Zhu (May 02 2025 at 01:40):

Jason Rute said:

I know. Sorry I started a new topic.

No, it's definitely something that should be done in light of this error. Thanks for pointing out!

Deming Xu (May 02 2025 at 01:46):

I ran lake env lean xxx.lean.
The warning: declaration uses 'sorry' is in the output

Jason Rute (May 02 2025 at 01:50):

@Deming Xu I meant run all of the proofs in Lean (using whatever lean version they used), and moreover use leanchecker and check for axioms.

Thomas Zhu (May 02 2025 at 01:53):

@Deming Xu I notice the "Canonical" warning in your output; this probably means you are not using their Lean version of v4.9.0? And also you should use lean4checker instead of manually watching the warnings of lake env lean.

Jason Rute (May 02 2025 at 01:57):

https://github.com/leanprover/lean4checker

Deming Xu (May 02 2025 at 02:04):

I will try to run the check later. Thanks for your tips!

Eric Wieser (May 02 2025 at 02:05):

Jason Rute said:

and moreover use leanchecker and check for axioms.

I created #lean4 > Checking axioms with leanchecker @ 💬 regarding this

GasStationManager (May 02 2025 at 02:32):

I'll mention https://github.com/GasStationManager/SafeVerify
which was mostly based on lean4checker, but additionally: throws error if axioms other than the standard ones are used.

Thomas Zhu (May 02 2025 at 02:33):

@GasStationManager can you put the message to the separate thread Eric posted? Thanks!

Last updated: May 02 2025 at 03:31 UTC