Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Gemini 2.5 Pro 06/05

Justin Asher (Jun 05 2025 at 16:16):

The new Gemini model seems to be better at Lean. You can access it in the AI Studio. For reference, the previous 2.5 Pro model seemed to make little progress. I had it write a summary:

Hey everyone, I just finished proving formal_3336 and wanted to share the process. It's a neat little number theory problem that demonstrates some cool mathlib features.

Problem Statement

The theorem asks us to prove that A + M + C = 14, given that A, M, and C are single-digit natural numbers (< 10), and they satisfy this equation:
Nat.ofDigits 10 [0, 1, C, M, A] + Nat.ofDigits 10 [2, 1, C, M, A] = 123422

Since Nat.ofDigits is little-endian, this is equivalent to saying that the sum of the number 10000*A + 1000*M + 100*C + 10 and 10000*A + 1000*M + 100*C + 12 is 123422.

Proof Strategy

My approach was a three-step process:

Algebraic Simplification: First, I expanded the Nat.ofDigits terms into their polynomial forms. The main hypothesis h₄ simplifies very cleanly into a standard base-10 representation: 100 * A + 10 * M + C = 617.
Uniqueness of Digital Representation: With the simplified equation, I could leverage the uniqueness of base-10 representations. The left side, 100 * A + 10 * M + C, is what Nat.ofDigits 10 [C, M, A] represents. The right side, 617, is Nat.ofDigits 10 [7, 1, 6]. Since all the digits involved are < 10, the lists of digits themselves must be equal: [C, M, A] = [7, 1, 6].
Substitution and Final Proof: From the list equality, it directly follows that C = 7, M = 1, and A = 6. Plugging these values into the goal A + M + C = 14 makes it 6 + 1 + 7 = 14, which is trivial.

Lean Implementation

Here's how I implemented this in Lean 4:

For Step 1, I used simp [ofDigits_cons] and ring to expand the polynomials. Then linarith did the heavy lifting on the linear arithmetic.
For Step 2, the key lemma was ofDigits_inj_of_len_eq from mathlib, which I found using LeanExplore (a great tool!). This lemma formalizes the "uniqueness of digits" idea. The side goals for this theorem were proven using the hypotheses h₀, h₁, h₂ and norm_num.
For Step 3, I used the injection tactic on the list equality h_list_eq. This is perfect for deconstructing equalities of inductive types. It gave me h_C_eq : C = 7 and a new equality for the rest of the list. Applying injection again, and then rw [List.singleton_inj] on the final list equality [A] = [6], gave me all the individual values. A final rw with the new equalities solved the goal.

Here's the final proof:

import Mathlib.Data.Nat.Digits
import Mathlib.Tactic.Linarith

open Nat

theorem formal_3336
  (A M C : ℕ)
  (h₀ : A < 10)
  (h₁ : M < 10)
  (h₂ : C < 10)
  (h₃ : 0 < A)
  (h₄ : Nat.ofDigits 10 [0, 1, C, M, A] + Nat.ofDigits 10 [2, 1, C, M, A] = 123422) :
  A + M + C = 14 := by
  -- Step 1: Simplify the equation in h₄ to a standard base-10 representation.
  have h_eq_base : 100 * A + 10 * M + C = 617 := by
    -- Expand the `Nat.ofDigits` terms into their polynomial form.
    have h_eval1 : Nat.ofDigits 10 [0, 1, C, M, A] = 10000 * A + 1000 * M + 100 * C + 10 := by
      simp [ofDigits_cons]; ring
    have h_eval2 : Nat.ofDigits 10 [2, 1, C, M, A] = 10000 * A + 1000 * M + 100 * C + 12 := by
      simp [ofDigits_cons]; ring
    -- Substitute the expanded forms into h₄.
    rw [h_eval1, h_eval2] at h₄
    -- Simplify the resulting linear equation using linarith.
    have h_lineq: 20000 * A + 2000 * M + 200 * C = 123400 := by linarith [h₄]
    -- Factor out 200 from both sides and cancel to solve for the base-10 expression.
    rw [show 20000 * A + 2000 * M + 200 * C = 200 * (100 * A + 10 * M + C) by ring] at h_lineq
    rw [show 123400 = 200 * 617 by norm_num] at h_lineq
    exact Nat.eq_of_mul_eq_mul_left (by norm_num : 0 < 200) h_lineq

  -- Step 2: Use the uniqueness of base-10 representation to find the values of A, M, C.
  have h_ofDigits_eq : Nat.ofDigits 10 [C, M, A] = Nat.ofDigits 10 [7, 1, 6] := by
    -- The expression `100 * A + 10 * M + C` corresponds to the number with
    -- little-endian digits [C, M, A].
    have h_lhs : Nat.ofDigits 10 [C, M, A] = C + 10 * M + 100 * A := by
      simp [ofDigits_cons]; ring
    rw [h_lhs]
    rw [show C + 10 * M + 100 * A = 100 * A + 10 * M + C by ac_rfl]
    rw [h_eq_base]
    -- The RHS is 617, written in terms of ofDigits.
    norm_num [ofDigits_cons]

  have h_list_eq : [C, M, A] = [7, 1, 6] := by
    apply ofDigits_inj_of_len_eq (by norm_num : 1 < 10)
    · rfl -- Prove that lengths are equal.
    · -- Prove that all digits in [C, M, A] are less than 10.
      intro d hd
      rw [List.mem_cons, List.mem_cons, List.mem_singleton] at hd
      rcases hd with rfl | rfl | rfl
      · exact h₂
      · exact h₁
      · exact h₀
    · -- Prove that all digits in [7, 1, 6] are less than 10.
      intro d hd
      rw [List.mem_cons, List.mem_cons, List.mem_singleton] at hd
      rcases hd with rfl | rfl | rfl <;> norm_num
    · -- The equality of the numbers from the previous step.
      exact h_ofDigits_eq

  -- Step 3: From the list equality, deduce the values of A, M, C and prove the goal.
  -- Deconstruct the list equality step-by-step using `injection`.
  injection h_list_eq with h_C_eq h_M_A_eq
  injection h_M_A_eq with h_M_eq h_A_eq

  -- h_A_eq is `[A] = [6]`. `List.singleton_inj` converts this to `A = 6`.
  rw [List.singleton_inj] at h_A_eq

  -- Substitute all found equalities into the goal.
  rw [h_A_eq, h_M_eq, h_C_eq]

Justin Asher (Jun 05 2025 at 16:19):

I had to give it the LSP feedback, but this is still better than what Claude Sonnet 4 was able to do previously when I tested it. (I did not test Opus since I thought the API cost was too high for practical usage.)

Eric Wieser (Jun 05 2025 at 16:28):

FWIW, "formal_3336" looks to be amc12a_2003_p5 from miniF2F

Justin Asher (Jun 05 2025 at 16:29):

Yes, that seems right.

Alex J. Best (Jun 05 2025 at 20:46):

Yeah I guess that comes from the harmonic versions https://github.com/harmonic-ai/datasets/blob/main/minif2f%2Ftest.json#L9

GasStationManager (Jun 06 2025 at 02:10):

I've also had some recent positive experience with Gemini 2.5 Pro,
for a Lean coding task. Details here.

What impressed me the most was that it can be very agentic: it was able to make use of tools (including LeanTool ) to keep improving its solution.

Justin Asher (Jun 06 2025 at 02:29):

This is really neat! Thanks for sharing. My observation was that the new Gemini model seemed to get stuck a lot less often on things like "No goals remaining" errors. It also seemed to think more thoroughly about the Lean code and its implications (perhaps this is because they are working on grounding). Maybe I am wrong, but intuitively this just felt like the case.

Justin Asher (Jun 06 2025 at 02:29):

We are so close to autoformalization...

Kevin Buzzard (Jun 07 2025 at 06:38):

Let me know when the system does an unclaimed task on the FLT dashboard.

Justin Asher (Jun 07 2025 at 13:01):

@Kevin Buzzard Working on it!

Justin Asher (Jun 09 2025 at 00:00):

I managed to get ChatGPT o4-mini to output a proof to the same problem which is much shorter:

theorem formal_3336
  (A M C : ℕ)
  (h₀ : A < 10)
  (h₁ : M < 10)
  (h₂ : C < 10)
  (h₃ : 0 < A)
  (h₄ : Nat.ofDigits 10 [0, 1, C, M, A] + Nat.ofDigits 10 [2, 1, C, M, A] = 123422) :
  A + M + C = 14 := by
  -- expand and simplify the `ofDigits` sum
  simp only [Nat.ofDigits, ofDigits_cons, ofDigits_nil] at h₄
  simp only [add_assoc, add_comm 2, add_mul, mul_add, mul_comm] at h₄
  -- deduce the compact linear equation
  have h₅ : C + 10 * M + 100 * A = 617 := by linarith [h₄]
  -- pinpoint each digit via bounds
  have hA : A = 6 := by linarith [h₅, h₀, h₃]
  have hM : M = 1 := by linarith [h₅, h₁]
  have hC : C = 7 := by linarith [h₅, h₂]
  -- conclude the desired sum
  simp [hA, hM, hC]

This proof was also produced by a completely automated system that I made, meaning no human in the loop. I would imagine the frontier models will rather soon be able to most of the formalization process.

(deleted) (Jul 10 2025 at 11:03):

my skeptical mind says such a leap... how can we extrapolate from this very simple problem to more difficult but still self contained problems... and in fact grind is able to obliterate this problem anyway

Justin Asher (Jul 10 2025 at 14:22):

Eh, things move fast. It’s not that it can solve a simple problem, but rather how it solves that problem that matters.

Andy Jiang (Jul 10 2025 at 17:32):

Would like to see a comprehensive summary of how good the various models are now at producing Lean code, including Deepseek prover v2, kimina prover, grok 4, o3, gemini 2.5 pro. I guess miniF2F is kind of such a benchmark but a more similar to standard usage one would be good (like the sorrybench mentioned elsewhere?) maybe I just don't know where to look

Lenny Taelman (Jul 10 2025 at 17:46):

Andy Jiang said:

Would like to see a comprehensive summary of how good the various models are now at producing Lean code, including Deepseek prover v2, kimina prover, grok 4, o3, gemini 2.5 pro. I guess miniF2F is kind of such a benchmark but a more similar to standard usage one would be good (like the sorrybench mentioned elsewhere?) maybe I just don't know where to look

The SorryDB project aims exactly at doing that, but is work in progress. If you want something available that is closer to measuring performance on "standard usage", I'd recommend checking out miniCTX.

Last updated: Feb 28 2026 at 14:05 UTC