Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Can this GPT 5.1 be simplified?


Jonathan Crall (Nov 29 2025 at 20:35):

I've been learning lean on the side, and I've finally been able to get a working proof for a small result that I found: https://arxiv.org/abs/2305.00594

On one hand, I'm amazed that GPT 5.1 was finally able to get the formalization of the proof, I've been struggling with it. Previous GPT models 4o and 5 were not able to get there. I think this is a neat example of something simple, but somewhat non-trivial being proven using AI.

Simply stated, given:

/-- Precision / PPV = TP / (TP + FP) -/
def PPV (TP FP : ) :  :=
  TP / (TP + FP)

/-- Recall / TPR = TP / (TP + FN) -/
def TPR (TP FN : ) :  :=
  TP / (TP + FN)

/-- Fowlkes–Mallows index: geometric mean of precision and recall. -/
def FM (TP FP FN : ) :  :=
  Real.sqrt (PPV TP FP * TPR TP FN)

/-- Matthews Correlation Coefficient (MCC). We treat all entries as reals. -/
def MCC (TP TN FP FN : ) :  :=
  let num := TP * TN - FP * FN
  let den := Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  num / den

I have the proof for the statement:

/--
  Main target theorem:

    lim_{TN→∞} MCC(TP,TN,FP,FN) = FM(TP,FP,FN)

  under non-degeneracy and `0 ≤ TP`.

  The definition of the main theroem (given in full later) is written as:
-/

theorem tendsto_MCC_atTop_eq_FM
    {TP FP FN : }
    (hTPFPpos : 0 < TP + FP)
    (hTPFNpos : 0 < TP + FN)
    (hTP_nonneg : 0  TP)
    (hFP_nonneg : 0  FP)
    (hFN_nonneg : 0  FN) :
    Tendsto (fun TN :  => MCC TP TN FP FN) atTop (𝓝 (FM TP FP FN))

The working proof is: https://github.com/Erotemic/paper-g1-and-mcc/blob/main/attempt6/Attemp6/Basic.lean

But the proof is 654 lines, which is a lot longer than I thought it would need to be. On one hand: great, that explains why I was having such a hard time getting this, but on the other hand, I suspect that given that GPT struggled to get it right, it probably isn't going about it optimally.

My questions are:

  1. How typical is a length like this for a statement of this complexity?
  2. Can anyone else do better?

Kevin Buzzard (Nov 29 2025 at 21:02):

It's best to ask questions as a #mwe so here is one:

import Mathlib.Tactic

noncomputable section

/-- Precision / PPV = TP / (TP + FP) -/
def PPV (TP FP : ) :  :=
  TP / (TP + FP)

/-- Recall / TPR = TP / (TP + FN) -/
def TPR (TP FN : ) :  :=
  TP / (TP + FN)

/-- Fowlkes–Mallows index: geometric mean of precision and recall. -/
def FM (TP FP FN : ) :  :=
  Real.sqrt (PPV TP FP * TPR TP FN)

/-- Matthews Correlation Coefficient (MCC). We treat all entries as reals. -/
def MCC (TP TN FP FN : ) :  :=
  let num := TP * TN - FP * FN
  let den := Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  num / den

end

open Filter Topology

/--
Main target theorem:

    lim_{TN→∞} MCC(TP,TN,FP,FN) = FM(TP,FP,FN)

  under non-degeneracy and `0 ≤ TP`.

  The definition of the main theroem (given in full later) is written as:
-/
theorem tendsto_MCC_atTop_eq_FM
    {TP FP FN : }
    (hTPFPpos : 0 < TP + FP)
    (hTPFNpos : 0 < TP + FN)
    (hTP_nonneg : 0  TP)
    (hFP_nonneg : 0  FP)
    (hFN_nonneg : 0  FN) :
    Tendsto (fun TN :  => MCC TP TN FP FN) atTop (𝓝 (FM TP FP FN)) := by
  sorry

Kevin Buzzard (Nov 29 2025 at 21:08):

Mathematically I think the question is to prove that the limit as TNT_N tends to ++\infty of TPTNFPFN(TP+FP)(TP+FN)(TN+FP)(TN+FN)\frac{T_PT_N-F_PF_N}{\sqrt{(T_P+F_P)(T_P+F_N)(T_N+F_P)(T_N+F_N)}} is TPTP+FP×TPTP+FN.\sqrt{\frac{T_P}{T_P+F_P}\times\frac{T_P}{T_P+F_N}}.

Kevin Buzzard (Nov 29 2025 at 21:09):

I would estimate that, as is currently par for the course with AI (edit: by which I mean "generic LLMs such as ChatGPT"; I should not have said "AI" here, as is clear from the below post), your vibe-coded proof is at least a factor of 10 larger than a human expert proof.

Lawrence Wu (llllvvuu) (Nov 30 2025 at 00:45):

This seems like the kind of thing that there should be a tactic like asymp_simp that would just close it. But in the meantime, here's Aristotle's 18-line proof which was returned from our CLI in 6 minutes:

import Mathlib.Tactic


noncomputable section

/-- Precision / PPV = TP / (TP + FP) -/
def PPV (TP FP : ) :  :=
  TP / (TP + FP)

/-- Recall / TPR = TP / (TP + FN) -/
def TPR (TP FN : ) :  :=
  TP / (TP + FN)

/-- Fowlkes–Mallows index: geometric mean of precision and recall. -/
def FM (TP FP FN : ) :  :=
  Real.sqrt (PPV TP FP * TPR TP FN)

/-- Matthews Correlation Coefficient (MCC). We treat all entries as reals. -/
def MCC (TP TN FP FN : ) :  :=
  let num := TP * TN - FP * FN
  let den := Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  num / den

end

open Filter Topology

/--
Main target theorem:

    lim_{TN→∞} MCC(TP,TN,FP,FN) = FM(TP,FP,FN)

  under non-degeneracy and `0 ≤ TP`.

  The definition of the main theroem (given in full later) is written as:
-/
theorem tendsto_MCC_atTop_eq_FM
    {TP FP FN : }
    (hTPFPpos : 0 < TP + FP)
    (hTPFNpos : 0 < TP + FN)
    (hTP_nonneg : 0  TP)
    (hFP_nonneg : 0  FP)
    (hFN_nonneg : 0  FN) :
    Tendsto (fun TN :  => MCC TP TN FP FN) atTop (𝓝 (FM TP FP FN)) := by
  -- Simplify the expression for the MCC as TN approaches infinity.
  have h_simplify : Filter.Tendsto (fun TN :  => (TP * TN - FP * FN) / Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))) Filter.atTop (𝓝 (TP / Real.sqrt ((TP + FP) * (TP + FN)))) := by
    -- We divide numerator and denominator by $TN$:
    suffices h_suff'' : Filter.Tendsto (fun TN :  => (TP - FP * FN / TN) / Real.sqrt ((TP + FP) * (TP + FN) * (1 + FP / TN) * (1 + FN / TN))) Filter.atTop (𝓝 (TP / Real.sqrt ((TP + FP) * (TP + FN)))) by
      -- By simplifying, we can see that the two expressions are equivalent for large TN.
      have h_eq :  TN : , TN > 0  (TP * TN - FP * FN) / Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) = (TP - FP * FN / TN) / Real.sqrt ((TP + FP) * (TP + FN) * (1 + FP / TN) * (1 + FN / TN)) := by
        intro TN hTNpos
        field_simp [hTNpos];
        rw [ Real.sqrt_div ( by positivity ), Real.sqrt_sq hTNpos.le ] ; ring;
        -- By simplifying, we can see that the two expressions are equal.
        field_simp [hTNpos]
        ring;
      exact h_suff''.congr' ( by filter_upwards [ Filter.eventually_gt_atTop 0 ] with TN hTN using h_eq TN hTN  rfl );
    exact le_trans ( Filter.Tendsto.div ( tendsto_const_nhds.sub ( tendsto_const_nhds.div_atTop Filter.tendsto_id ) ) ( Filter.Tendsto.sqrt <| Filter.Tendsto.mul ( Filter.Tendsto.mul ( tendsto_const_nhds.mul <| tendsto_const_nhds ) <| tendsto_const_nhds.add <| tendsto_const_nhds.div_atTop Filter.tendsto_id ) <| tendsto_const_nhds.add <| tendsto_const_nhds.div_atTop Filter.tendsto_id ) <| ne_of_gt <| Real.sqrt_pos.mpr <| by positivity ) <| by norm_num;
  convert h_simplify using 1;
  unfold FM PPV TPR; ring;
  field_simp;
  rw [ Real.sqrt_div ( by positivity ), Real.sqrt_sq ( by positivity ) ] ; ring

Once cleaned up / simplified by a human it should be even shorter.

TongKe Xue (Nov 30 2025 at 20:55):

  1. This proof is longer than the Aristotle proof, but shorter than the ChatGPT proof; but I think the proof "follows the paper" fairly accurately.
  2. This problem turned out to be more "little pebbles" than I ever imagined. (dealing with division in Reals, sqrt, filter definition of limits).
  3. I have a short video walk through at https://www.youtube.com/watch?v=SdZyIvnACnE
  4. I was running into MathLib issues on lean4.26.0-rc2; so I had to downgrade everything to lean4.25.2. No what I misconfigured to cause the pain.
    Basic.lean

  5. Great learning experience for reading up on parts of MathLib. This one problem hits: limit / Tendsto, Sqrt, GroupWithZero, Div

The core "pebbles" in formalizing this were:

  1. The human proof informally uses "a * T / T = a". We can't do this i nLean4 because LHS is not defined when "T = 0", so we have to carry around "T \ne 0" proofs everywhere. How do we get "T \ne 0"? We exploit the neighborhood definition, to say we only consider the "TN >= 1" which implies "TN > 0" which implies "TN \ne 0".
  2. Lean4 is very strict on sqrt. So anything involving sqrt involves 0 <= x proofs. Luckily, this is generally dispatched via "by positivity"
  3. (This could just be my fault for misreading things). I don't know what to make of Tendsto. On one hand, it's pretty intuitive; on the other hand, many of the lemmas for Tendsto seems to be spread out across multiple files, so it was not clear "what can we use" at first glance.
  4. (not a fault of Lean4): LLMs were hit & miss. Sometimes they were brilliant and taught me new tactics. Othertimes, they gave me lemma's that did not exist. The problem with the latter is I didn't know when (1) the LLM was hallucinating and (2) when it was a misconfiguration on my end. Especially fun when it gave me stuff from Nat.*when all my variables were \Real.

Besides that, everything else was fairly straight forward.

TongKe Xue (Nov 30 2025 at 21:00):

Lawrence Wu (llllvvuu) said:

This seems like the kind of thing that there should be a tactic like asymp_simp that would just close it.

Do you have a "test set" in mind for this?

  1. After manually writing the proof, I think having an asymp_simp would be very useful.
  2. I'm also interested in learning how to write Lean4 tactics
  3. Do you have a road map / test dataset for this ?
  4. Intuitively, I feel having a "bigger bank" of lemmas for manipulating sqrt / frac / trig / ... under limit/Tendsto would go a long way to simplifying alot of these proofs, and also let the tactic "operate at a higher level". [In particular, a more automated system for tracking >= 0, neq 0, ... ]

Last updated: Dec 20 2025 at 21:32 UTC