Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Gemini 2.5 Pro 06/05
Justin Asher (Jun 05 2025 at 16:16):
The new Gemini model seems to be better at Lean. You can access it in the AI Studio. For reference, the previous 2.5 Pro model seemed to make little progress. I had it write a summary:
Hey everyone, I just finished proving formal_3336 and wanted to share the process. It's a neat little number theory problem that demonstrates some cool mathlib features.
Problem Statement
The theorem asks us to prove that A + M + C = 14, given that A, M, and C are single-digit natural numbers (< 10), and they satisfy this equation:
Nat.ofDigits 10 [0, 1, C, M, A] + Nat.ofDigits 10 [2, 1, C, M, A] = 123422
Since Nat.ofDigits is little-endian, this is equivalent to saying that the sum of the number 10000*A + 1000*M + 100*C + 10 and 10000*A + 1000*M + 100*C + 12 is 123422.
Proof Strategy
My approach was a three-step process:
- Algebraic Simplification: First, I expanded the
Nat.ofDigitsterms into their polynomial forms. The main hypothesish₄simplifies very cleanly into a standard base-10 representation:100 * A + 10 * M + C = 617. - Uniqueness of Digital Representation: With the simplified equation, I could leverage the uniqueness of base-10 representations. The left side,
100 * A + 10 * M + C, is whatNat.ofDigits 10 [C, M, A]represents. The right side,617, isNat.ofDigits 10 [7, 1, 6]. Since all the digits involved are< 10, the lists of digits themselves must be equal:[C, M, A] = [7, 1, 6]. - Substitution and Final Proof: From the list equality, it directly follows that
C = 7,M = 1, andA = 6. Plugging these values into the goalA + M + C = 14makes it6 + 1 + 7 = 14, which is trivial.
Lean Implementation
Here's how I implemented this in Lean 4:
- For Step 1, I used
simp [ofDigits_cons]andringto expand the polynomials. Thenlinarithdid the heavy lifting on the linear arithmetic. - For Step 2, the key lemma was
ofDigits_inj_of_len_eqfrommathlib, which I found using LeanExplore (a great tool!). This lemma formalizes the "uniqueness of digits" idea. The side goals for this theorem were proven using the hypothesesh₀, h₁, h₂andnorm_num. - For Step 3, I used the
injectiontactic on the list equalityh_list_eq. This is perfect for deconstructing equalities of inductive types. It gave meh_C_eq : C = 7and a new equality for the rest of the list. Applyinginjectionagain, and thenrw [List.singleton_inj]on the final list equality[A] = [6], gave me all the individual values. A finalrwwith the new equalities solved the goal.
Here's the final proof:
import Mathlib.Data.Nat.Digits
import Mathlib.Tactic.Linarith
open Nat
theorem formal_3336
(A M C : ℕ)
(h₀ : A < 10)
(h₁ : M < 10)
(h₂ : C < 10)
(h₃ : 0 < A)
(h₄ : Nat.ofDigits 10 [0, 1, C, M, A] + Nat.ofDigits 10 [2, 1, C, M, A] = 123422) :
A + M + C = 14 := by
-- Step 1: Simplify the equation in h₄ to a standard base-10 representation.
have h_eq_base : 100 * A + 10 * M + C = 617 := by
-- Expand the `Nat.ofDigits` terms into their polynomial form.
have h_eval1 : Nat.ofDigits 10 [0, 1, C, M, A] = 10000 * A + 1000 * M + 100 * C + 10 := by
simp [ofDigits_cons]; ring
have h_eval2 : Nat.ofDigits 10 [2, 1, C, M, A] = 10000 * A + 1000 * M + 100 * C + 12 := by
simp [ofDigits_cons]; ring
-- Substitute the expanded forms into h₄.
rw [h_eval1, h_eval2] at h₄
-- Simplify the resulting linear equation using linarith.
have h_lineq: 20000 * A + 2000 * M + 200 * C = 123400 := by linarith [h₄]
-- Factor out 200 from both sides and cancel to solve for the base-10 expression.
rw [show 20000 * A + 2000 * M + 200 * C = 200 * (100 * A + 10 * M + C) by ring] at h_lineq
rw [show 123400 = 200 * 617 by norm_num] at h_lineq
exact Nat.eq_of_mul_eq_mul_left (by norm_num : 0 < 200) h_lineq
-- Step 2: Use the uniqueness of base-10 representation to find the values of A, M, C.
have h_ofDigits_eq : Nat.ofDigits 10 [C, M, A] = Nat.ofDigits 10 [7, 1, 6] := by
-- The expression `100 * A + 10 * M + C` corresponds to the number with
-- little-endian digits [C, M, A].
have h_lhs : Nat.ofDigits 10 [C, M, A] = C + 10 * M + 100 * A := by
simp [ofDigits_cons]; ring
rw [h_lhs]
rw [show C + 10 * M + 100 * A = 100 * A + 10 * M + C by ac_rfl]
rw [h_eq_base]
-- The RHS is 617, written in terms of ofDigits.
norm_num [ofDigits_cons]
have h_list_eq : [C, M, A] = [7, 1, 6] := by
apply ofDigits_inj_of_len_eq (by norm_num : 1 < 10)
· rfl -- Prove that lengths are equal.
· -- Prove that all digits in [C, M, A] are less than 10.
intro d hd
rw [List.mem_cons, List.mem_cons, List.mem_singleton] at hd
rcases hd with rfl | rfl | rfl
· exact h₂
· exact h₁
· exact h₀
· -- Prove that all digits in [7, 1, 6] are less than 10.
intro d hd
rw [List.mem_cons, List.mem_cons, List.mem_singleton] at hd
rcases hd with rfl | rfl | rfl <;> norm_num
· -- The equality of the numbers from the previous step.
exact h_ofDigits_eq
-- Step 3: From the list equality, deduce the values of A, M, C and prove the goal.
-- Deconstruct the list equality step-by-step using `injection`.
injection h_list_eq with h_C_eq h_M_A_eq
injection h_M_A_eq with h_M_eq h_A_eq
-- h_A_eq is `[A] = [6]`. `List.singleton_inj` converts this to `A = 6`.
rw [List.singleton_inj] at h_A_eq
-- Substitute all found equalities into the goal.
rw [h_A_eq, h_M_eq, h_C_eq]
Justin Asher (Jun 05 2025 at 16:19):
I had to give it the LSP feedback, but this is still better than what Claude Sonnet 4 was able to do previously when I tested it. (I did not test Opus since I thought the API cost was too high for practical usage.)
Eric Wieser (Jun 05 2025 at 16:28):
FWIW, "formal_3336" looks to be amc12a_2003_p5 from miniF2F
Justin Asher (Jun 05 2025 at 16:29):
Yes, that seems right.
Alex J. Best (Jun 05 2025 at 20:46):
Yeah I guess that comes from the harmonic versions https://github.com/harmonic-ai/datasets/blob/main/minif2f%2Ftest.json#L9
GasStationManager (Jun 06 2025 at 02:10):
I've also had some recent positive experience with Gemini 2.5 Pro,
for a Lean coding task. Details here.
What impressed me the most was that it can be very agentic: it was able to make use of tools (including LeanTool ) to keep improving its solution.
Justin Asher (Jun 06 2025 at 02:29):
This is really neat! Thanks for sharing. My observation was that the new Gemini model seemed to get stuck a lot less often on things like "No goals remaining" errors. It also seemed to think more thoroughly about the Lean code and its implications (perhaps this is because they are working on grounding). Maybe I am wrong, but intuitively this just felt like the case.
Justin Asher (Jun 06 2025 at 02:29):
We are so close to autoformalization...
Kevin Buzzard (Jun 07 2025 at 06:38):
Let me know when the system does an unclaimed task on the FLT dashboard.
Justin Asher (Jun 07 2025 at 13:01):
@Kevin Buzzard Working on it!
Justin Asher (Jun 09 2025 at 00:00):
I managed to get ChatGPT o4-mini to output a proof to the same problem which is much shorter:
theorem formal_3336
(A M C : ℕ)
(h₀ : A < 10)
(h₁ : M < 10)
(h₂ : C < 10)
(h₃ : 0 < A)
(h₄ : Nat.ofDigits 10 [0, 1, C, M, A] + Nat.ofDigits 10 [2, 1, C, M, A] = 123422) :
A + M + C = 14 := by
-- expand and simplify the `ofDigits` sum
simp only [Nat.ofDigits, ofDigits_cons, ofDigits_nil] at h₄
simp only [add_assoc, add_comm 2, add_mul, mul_add, mul_comm] at h₄
-- deduce the compact linear equation
have h₅ : C + 10 * M + 100 * A = 617 := by linarith [h₄]
-- pinpoint each digit via bounds
have hA : A = 6 := by linarith [h₅, h₀, h₃]
have hM : M = 1 := by linarith [h₅, h₁]
have hC : C = 7 := by linarith [h₅, h₂]
-- conclude the desired sum
simp [hA, hM, hC]
This proof was also produced by a completely automated system that I made, meaning no human in the loop. I would imagine the frontier models will rather soon be able to most of the formalization process.
(deleted) (Jul 10 2025 at 11:03):
my skeptical mind says such a leap... how can we extrapolate from this very simple problem to more difficult but still self contained problems... and in fact grind is able to obliterate this problem anyway
Justin Asher (Jul 10 2025 at 14:22):
Eh, things move fast. It’s not that it can solve a simple problem, but rather how it solves that problem that matters.
Andy Jiang (Jul 10 2025 at 17:32):
Would like to see a comprehensive summary of how good the various models are now at producing Lean code, including Deepseek prover v2, kimina prover, grok 4, o3, gemini 2.5 pro. I guess miniF2F is kind of such a benchmark but a more similar to standard usage one would be good (like the sorrybench mentioned elsewhere?) maybe I just don't know where to look
Lenny Taelman (Jul 10 2025 at 17:46):
Andy Jiang said:
Would like to see a comprehensive summary of how good the various models are now at producing Lean code, including Deepseek prover v2, kimina prover, grok 4, o3, gemini 2.5 pro. I guess miniF2F is kind of such a benchmark but a more similar to standard usage one would be good (like the sorrybench mentioned elsewhere?) maybe I just don't know where to look
The SorryDB project aims exactly at doing that, but is work in progress. If you want something available that is closer to measuring performance on "standard usage", I'd recommend checking out miniCTX.
Last updated: Dec 20 2025 at 21:32 UTC