Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Can this GPT 5.1 be simplified?

Jonathan Crall (Nov 29 2025 at 20:35):

I've been learning lean on the side, and I've finally been able to get a working proof for a small result that I found: https://arxiv.org/abs/2305.00594

On one hand, I'm amazed that GPT 5.1 was finally able to get the formalization of the proof, I've been struggling with it. Previous GPT models 4o and 5 were not able to get there. I think this is a neat example of something simple, but somewhat non-trivial being proven using AI.

Simply stated, given:

/-- Precision / PPV = TP / (TP + FP) -/
def PPV (TP FP : ℝ) : ℝ :=
  TP / (TP + FP)

/-- Recall / TPR = TP / (TP + FN) -/
def TPR (TP FN : ℝ) : ℝ :=
  TP / (TP + FN)

/-- Fowlkes–Mallows index: geometric mean of precision and recall. -/
def FM (TP FP FN : ℝ) : ℝ :=
  Real.sqrt (PPV TP FP * TPR TP FN)

/-- Matthews Correlation Coefficient (MCC). We treat all entries as reals. -/
def MCC (TP TN FP FN : ℝ) : ℝ :=
  let num := TP * TN - FP * FN
  let den := Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  num / den

I have the proof for the statement:

/--
  Main target theorem:

    lim_{TN→∞} MCC(TP,TN,FP,FN) = FM(TP,FP,FN)

  under non-degeneracy and `0 ≤ TP`.

  The definition of the main theroem (given in full later) is written as:
-/

theorem tendsto_MCC_atTop_eq_FM
    {TP FP FN : ℝ}
    (hTPFPpos : 0 < TP + FP)
    (hTPFNpos : 0 < TP + FN)
    (hTP_nonneg : 0 ≤ TP)
    (hFP_nonneg : 0 ≤ FP)
    (hFN_nonneg : 0 ≤ FN) :
    Tendsto (fun TN : ℝ => MCC TP TN FP FN) atTop (𝓝 (FM TP FP FN))

The working proof is: https://github.com/Erotemic/paper-g1-and-mcc/blob/main/attempt6/Attemp6/Basic.lean

But the proof is 654 lines, which is a lot longer than I thought it would need to be. On one hand: great, that explains why I was having such a hard time getting this, but on the other hand, I suspect that given that GPT struggled to get it right, it probably isn't going about it optimally.

My questions are:

How typical is a length like this for a statement of this complexity?
Can anyone else do better?

Kevin Buzzard (Nov 29 2025 at 21:02):

It's best to ask questions as a #mwe so here is one:

import Mathlib.Tactic

noncomputable section

/-- Precision / PPV = TP / (TP + FP) -/
def PPV (TP FP : ℝ) : ℝ :=
  TP / (TP + FP)

/-- Recall / TPR = TP / (TP + FN) -/
def TPR (TP FN : ℝ) : ℝ :=
  TP / (TP + FN)

/-- Fowlkes–Mallows index: geometric mean of precision and recall. -/
def FM (TP FP FN : ℝ) : ℝ :=
  Real.sqrt (PPV TP FP * TPR TP FN)

/-- Matthews Correlation Coefficient (MCC). We treat all entries as reals. -/
def MCC (TP TN FP FN : ℝ) : ℝ :=
  let num := TP * TN - FP * FN
  let den := Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  num / den

end

open Filter Topology

/--
Main target theorem:

    lim_{TN→∞} MCC(TP,TN,FP,FN) = FM(TP,FP,FN)

  under non-degeneracy and `0 ≤ TP`.

  The definition of the main theroem (given in full later) is written as:
-/
theorem tendsto_MCC_atTop_eq_FM
    {TP FP FN : ℝ}
    (hTPFPpos : 0 < TP + FP)
    (hTPFNpos : 0 < TP + FN)
    (hTP_nonneg : 0 ≤ TP)
    (hFP_nonneg : 0 ≤ FP)
    (hFN_nonneg : 0 ≤ FN) :
    Tendsto (fun TN : ℝ => MCC TP TN FP FN) atTop (𝓝 (FM TP FP FN)) := by
  sorry

Kevin Buzzard (Nov 29 2025 at 21:08):

Mathematically I think the question is to prove that the limit as $T_N$ tends to $+\infty$ of $\frac{T_PT_N-F_PF_N}{\sqrt{(T_P+F_P)(T_P+F_N)(T_N+F_P)(T_N+F_N)}}$ is $\sqrt{\frac{T_P}{T_P+F_P}\times\frac{T_P}{T_P+F_N}}.$

Kevin Buzzard (Nov 29 2025 at 21:09):

I would estimate that, as is currently par for the course with AI (edit: by which I mean "generic LLMs such as ChatGPT"; I should not have said "AI" here, as is clear from the below post), your vibe-coded proof is at least a factor of 10 larger than a human expert proof.

Lawrence Wu (llllvvuu) (Nov 30 2025 at 00:45):

This seems like the kind of thing that there should be a tactic like asymp_simp that would just close it. But in the meantime, here's Aristotle's 18-line proof which was returned from our CLI in 6 minutes:

import Mathlib.Tactic


noncomputable section

/-- Precision / PPV = TP / (TP + FP) -/
def PPV (TP FP : ℝ) : ℝ :=
  TP / (TP + FP)

/-- Recall / TPR = TP / (TP + FN) -/
def TPR (TP FN : ℝ) : ℝ :=
  TP / (TP + FN)

/-- Fowlkes–Mallows index: geometric mean of precision and recall. -/
def FM (TP FP FN : ℝ) : ℝ :=
  Real.sqrt (PPV TP FP * TPR TP FN)

/-- Matthews Correlation Coefficient (MCC). We treat all entries as reals. -/
def MCC (TP TN FP FN : ℝ) : ℝ :=
  let num := TP * TN - FP * FN
  let den := Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  num / den

end

open Filter Topology

/--
Main target theorem:

    lim_{TN→∞} MCC(TP,TN,FP,FN) = FM(TP,FP,FN)

  under non-degeneracy and `0 ≤ TP`.

  The definition of the main theroem (given in full later) is written as:
-/
theorem tendsto_MCC_atTop_eq_FM
    {TP FP FN : ℝ}
    (hTPFPpos : 0 < TP + FP)
    (hTPFNpos : 0 < TP + FN)
    (hTP_nonneg : 0 ≤ TP)
    (hFP_nonneg : 0 ≤ FP)
    (hFN_nonneg : 0 ≤ FN) :
    Tendsto (fun TN : ℝ => MCC TP TN FP FN) atTop (𝓝 (FM TP FP FN)) := by
  -- Simplify the expression for the MCC as TN approaches infinity.
  have h_simplify : Filter.Tendsto (fun TN : ℝ => (TP * TN - FP * FN) / Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))) Filter.atTop (𝓝 (TP / Real.sqrt ((TP + FP) * (TP + FN)))) := by
    -- We divide numerator and denominator by $TN$:
    suffices h_suff'' : Filter.Tendsto (fun TN : ℝ => (TP - FP * FN / TN) / Real.sqrt ((TP + FP) * (TP + FN) * (1 + FP / TN) * (1 + FN / TN))) Filter.atTop (𝓝 (TP / Real.sqrt ((TP + FP) * (TP + FN)))) by
      -- By simplifying, we can see that the two expressions are equivalent for large TN.
      have h_eq : ∀ TN : ℝ, TN > 0 → (TP * TN - FP * FN) / Real.sqrt ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) = (TP - FP * FN / TN) / Real.sqrt ((TP + FP) * (TP + FN) * (1 + FP / TN) * (1 + FN / TN)) := by
        intro TN hTNpos
        field_simp [hTNpos];
        rw [ Real.sqrt_div ( by positivity ), Real.sqrt_sq hTNpos.le ] ; ring;
        -- By simplifying, we can see that the two expressions are equal.
        field_simp [hTNpos]
        ring;
      exact h_suff''.congr' ( by filter_upwards [ Filter.eventually_gt_atTop 0 ] with TN hTN using h_eq TN hTN ▸ rfl );
    exact le_trans ( Filter.Tendsto.div ( tendsto_const_nhds.sub ( tendsto_const_nhds.div_atTop Filter.tendsto_id ) ) ( Filter.Tendsto.sqrt <| Filter.Tendsto.mul ( Filter.Tendsto.mul ( tendsto_const_nhds.mul <| tendsto_const_nhds ) <| tendsto_const_nhds.add <| tendsto_const_nhds.div_atTop Filter.tendsto_id ) <| tendsto_const_nhds.add <| tendsto_const_nhds.div_atTop Filter.tendsto_id ) <| ne_of_gt <| Real.sqrt_pos.mpr <| by positivity ) <| by norm_num;
  convert h_simplify using 1;
  unfold FM PPV TPR; ring;
  field_simp;
  rw [ Real.sqrt_div ( by positivity ), Real.sqrt_sq ( by positivity ) ] ; ring

Once cleaned up / simplified by a human it should be even shorter.

TongKe Xue (Nov 30 2025 at 20:55):

This proof is longer than the Aristotle proof, but shorter than the ChatGPT proof; but I think the proof "follows the paper" fairly accurately.
This problem turned out to be more "little pebbles" than I ever imagined. (dealing with division in Reals, sqrt, filter definition of limits).
I have a short video walk through at https://www.youtube.com/watch?v=SdZyIvnACnE
I was running into MathLib issues on lean4.26.0-rc2; so I had to downgrade everything to lean4.25.2. No what I misconfigured to cause the pain.
Basic.lean
Great learning experience for reading up on parts of MathLib. This one problem hits: limit / Tendsto, Sqrt, GroupWithZero, Div

The core "pebbles" in formalizing this were:

The human proof informally uses "a * T / T = a". We can't do this i nLean4 because LHS is not defined when "T = 0", so we have to carry around "T \ne 0" proofs everywhere. How do we get "T \ne 0"? We exploit the neighborhood definition, to say we only consider the "TN >= 1" which implies "TN > 0" which implies "TN \ne 0".
Lean4 is very strict on sqrt. So anything involving sqrt involves 0 <= x proofs. Luckily, this is generally dispatched via "by positivity"
(This could just be my fault for misreading things). I don't know what to make of Tendsto. On one hand, it's pretty intuitive; on the other hand, many of the lemmas for Tendsto seems to be spread out across multiple files, so it was not clear "what can we use" at first glance.
(not a fault of Lean4): LLMs were hit & miss. Sometimes they were brilliant and taught me new tactics. Othertimes, they gave me lemma's that did not exist. The problem with the latter is I didn't know when (1) the LLM was hallucinating and (2) when it was a misconfiguration on my end. Especially fun when it gave me stuff from Nat.*when all my variables were \Real.

Besides that, everything else was fairly straight forward.

TongKe Xue (Nov 30 2025 at 21:00):

Lawrence Wu (llllvvuu) said:

This seems like the kind of thing that there should be a tactic like asymp_simp that would just close it.

Do you have a "test set" in mind for this?

After manually writing the proof, I think having an asymp_simp would be very useful.
I'm also interested in learning how to write Lean4 tactics
Do you have a road map / test dataset for this ?
Intuitively, I feel having a "bigger bank" of lemmas for manipulating sqrt / frac / trig / ... under limit/Tendsto would go a long way to simplifying alot of these proofs, and also let the tactic "operate at a higher level". [In particular, a more automated system for tracking >= 0, neq 0, ... ]

Last updated: Feb 28 2026 at 14:05 UTC