Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: FrontierMath: Formalising Benchmark Problems
Pietro Monticone (Dec 02 2024 at 23:34):
Hi everyone,
Lorenzo Luccioli and I recently had the opportunity to discuss with some of the authors of the FrontierMath paper the possibility of formalising the benchmark problems they developed.
The authors expressed potential interest in collaborating with Lean formalisers to formalise the released sample problems, and more generally, problems of the sort that are represented in the FrontierMath benchmark. They have asked us to gauge the level of interest within the community.
If you are interested in contributing to this project, please reach out to Elliot Glazer (Project Lead) at elliot@epochai.org.
Jason Rute (Dec 03 2024 at 12:28):
I think there are pros and cons to formalizing the problems. First, let me make sure something is clear. Having the problems in informal form doesn't prevent an AI agent from using Lean/Coq/Isabelle to solve them. Moreover, since the problems have a clear numerical answer (a number, matrix, list, etc.), they are easy to grade automatically, which natural language proofs are not. Indeed, there is already a proof-of-concept for solving natural language math questions with formal systems in the middle, namely the Don't Trust: Verify (DTV) paper. Also, while AlphaProof's authors did manually translate the IMO problems into Lean, that manual translation wasn't essential to their approach. For training, they did have an auto-formalizer which translated the problems into Lean code. I could easily see the next version of AlphaProof doing this translation automatically.
Jason Rute (Dec 03 2024 at 12:28):
Now we may not yet have auto-formalizers that can reliably formalize such questions, and Lean's formal library may be missing important areas of math needed to even formalize a FrontierMath question, much less solve it, but that is part of the difficulty and charm of FrontierMath. FrontierMath is really about how we can attack a math problem relying on the existing math literature (or lack thereof), not just on the much smaller amount of literature formalized in Lean. If we want to make formal math relevant, we really need to up our game and figure out how to automatically formalize large chunks of mathematics.
Jason Rute (Dec 03 2024 at 12:29):
Having said that, here are the pros to formalizing these problems in Lean:
- At least translating the 10 open problems to Lean will provide good training data, and test how easy it is to translate and solve these problems in Lean with the current state of Mathlib.
- If one does translate the test problems too, then one could try powerful tools like DeepSeek-Prover, AlphaProof, etc. out of the box.
- Having an automatically checkable proof is better than having an AI which happens to get the right answer. What if the AI just developed good mathematical intuition and guessed the number correctly (or narrowed it down to two options and guessed)? Or what if the AI could just write a program to exhaustively check the numbers up to a certain value (maybe using some mathematical insights to make the program tractable)? (Note, I'm not sure either of these is clearly an issue, since both are important mathematical skills, but having a proof is still better.)
Jason Rute (Dec 03 2024 at 12:29):
Cons against having this translated into Lean:
- It excludes lots of existing AIs: AI for informal math, AI for Coq/Isabelle, and even AI for other versions of Lean.
- Lean and Mathlib change a lot! If you fix your translation at Lean 4.14.0, what happens when Lean makes a breaking change in Lean 4.17, or Mathlib refactors its entire algebraic library? Do you keep your translations up-to-date? Will there be an organization overseeing this?
- It's been shown many times (in MiniF2F, PutnamBench, ProofNet, etc.) that these formal translations can be wrong quite often, even with extensive proofreading. One misses cases, makes an off-by-one error, or thinks a statement means something different from what it actually means in Mathlib. One would really have to have a way to ensure the correctness of the translation to prevent formalizing the wrong problem. (A toy illustration of this kind of slip appears after this list.)
- Many problems have multiple valid and correct translations. Some translations may be easier to prove formally. Some may be more faithful to the problem. Others may provide insight, especially if one takes the AlphaProof approach of considering alternate or simpler versions of the problem. (Of course, the AI agent doesn't have to be married to that particular Lean translation in the course of solving the problem. It could solve it with a different translation and then show its translation is equivalent to yours, or transfer its proof to your formulation.)
- If one translated the hidden problems to Lean, either the statement only or the whole proof (to better ensure the correctness of the translation), then that would open up some issues with the hidden nature of the problems:
- More people would see the hidden problems, maybe people who later decide to work on AI for Math.
- Would any of the translation work end up being upstreamed to Mathlib? Say one needed properties of n-dimensional Gaussian distributions to state or prove a result; after encoding a lot of Gaussian distribution facts just to get the problem stated correctly, would one have a strong desire to add those facts to Mathlib? Would such upstreaming skew the results slightly? (What if it were a more obscure area of math like Schnorr randomness or metastable convergence bounds?)
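As a toy illustration of the kind of slip described above (an editorial sketch with made-up names, not drawn from any benchmark problem, assuming a recent Mathlib where the `∑ k ∈ s, f k` notation is available): the two definitions below differ only in the range being summed over, yet only one of them matches the informal sentence "the sum of the first n positive integers".
import Mathlib

-- `Finset.range n` is `{0, 1, ..., n - 1}`, so this version silently drops `n` itself.
def sumFirstN_wrong (n : ℕ) : ℕ := ∑ k ∈ Finset.range n, k

-- The faithful version sums over the closed interval `{1, ..., n}`.
def sumFirstN (n : ℕ) : ℕ := ∑ k ∈ Finset.Icc 1 n, k

example : sumFirstN_wrong 4 = 6 := by decide   -- 0 + 1 + 2 + 3
example : sumFirstN 4 = 10 := by decide        -- 1 + 2 + 3 + 4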
Jason Rute (Dec 03 2024 at 13:38):
As I think about this, I guess all my objections only apply to the case where one is trying to turn this benchmark into a Lean benchmark (or to formalize the hidden problems). If one is just doing formalization work to get training data and get a feel for solving these types of problems in Lean, then I have no objections and I think that is great!
Notification Bot (Dec 04 2024 at 13:15):
14 messages were moved from this topic to #Machine Learning for Theorem Proving > Is Autoformalization needed? by Jason Rute.
Bolton Bailey (Dec 07 2024 at 04:35):
I heard about this benchmark at a small conference yesterday. Here is my best shot at formalizing sample problems 1, 3, 4, and 5. The concept of "irreducible components" in problem 2 feels like the sort of thing that would be in Mathlib somewhere, but I'm not sure where.
If the goal is to formalize the solutions as well, that seems like it would take some work: some 2020s papers are referenced, and my sense is that to formalize that kind of material you need at least a few formalizers familiar with it.
import Mathlib

open BigOperators

-- Multiplicative order of `a` modulo `p`, as a natural number.
noncomputable def orderOfZMod (p : ℕ) (a : ℕ) (ha : Nat.Coprime a p) : ℕ :=
  orderOf (ZMod.unitOfCoprime (n := p) a ha : Units (ZMod p))

-- Helper for sample problem 1: the part of the order of `a` mod `p` supported on
-- primes `q ≤ x`, times the part of `p - 1` supported on primes `q > x`.
noncomputable def ord (p : ℕ) (x : ℕ) (a : ℕ) :=
  if ha : Nat.Coprime a p then
    (∏ q ∈ (Finset.Icc 0 x).filter Nat.Prime, q ^ (padicValNat q (orderOfZMod p a ha))) *
      (∏' q : ℕ, if q.Prime ∧ q > x then q ^ (padicValNat q (p - 1)) else 1)
  else 0

theorem frontiermath_sample_1
    (S : ℕ → Set ℕ)
    (hS : ∀ x, S x = {p : ℕ | p.Prime ∧ ord p x 2 > ord p x 3})
    (d : ℕ → ℝ)
    (hd : ∀ x, d x = ((Nat.card (S x)) : ℝ) / Nat.card {p : ℕ | p.Prime ∧ p ≤ x})
    (d_inf : ℝ)
    (hd_inf : Filter.Tendsto d Filter.atTop (nhds d_inf)) :
    ⌊(10^6 * d_inf)⌋ = 367707 := sorry

theorem frontiermath_sample_3
    (a : ℕ → ℕ)
    (ha : ∀ n, a n = 198130309625 * a (n - 1) + 354973292077 * a (n - 2) - 427761277677 * a (n - 3) + 370639957 * a (n - 4))
    (a_0 : a 0 = 0)
    (a_1 : a 1 = 1)
    (a_2 : a 2 = 2)
    (a_3 : a 3 = 3)
    (p : ℕ)
    (hp : p = sInf {p : ℕ | p ≡ 4 [MOD 7] ∧ p.Prime ∧
      if hp : p.Prime then
        let fact_p : Fact (Nat.Prime p) := by exact ⟨hp⟩;
        ∃ f : ℤ_[p] → ℤ_[p], Continuous f ∧ ∀ (n : ℕ), f (n : ℤ_[p]) = (a n : ℤ_[p])
      else
        False}) :
    p = 9811 := sorry

theorem frontiermath_sample_4
    (pos_multiples_of_five : Set ℤ)
    (hpos_multiples_of_five : pos_multiples_of_five = {n | n > 0 ∧ n % 5 = 0})
    (S : Set (Fin 4 → Units (Matrix (Fin 1000) (Fin 1000) ℂ)))
    (hS : S = {A |
      (∀ i, A i * A i = 1) ∧
      (∀ i j, {3 * (j : ℤ) - i, 3 * i - j} ∩ pos_multiples_of_five = ∅ → A i * A j = A j * A i) ∧
      (∀ i j, {3 * (j : ℤ) - i, 3 * i - j} ∩ pos_multiples_of_five ≠ ∅ → A i * A j * (A i)⁻¹ * (A j)⁻¹ = A j * A i)})
    (G_action : MulAction (Units (Matrix (Fin 1000) (Fin 1000) ℂ)) S)
    (hG_action : ∀ B A, G_action.smul B A = fun i => B * A.val i * B⁻¹)
    (orbits : Set (Set S))
    (horbits : orbits = {O | ∃ A, O = @MulAction.orbit _ _ _ G_action A}) :
    Nat.card orbits = 625243878951 := sorry

instance : Fact (Nat.Prime 5) := by exact ⟨Nat.prime_five⟩

theorem frontiermath_sample_5
    (points : Set (Fin 3 → GaloisField 5 18))
    (hpoints : points = {p | p ≠ 0 ∧ (p 0)^3 * (p 1) + (p 1)^3 * (p 2) + (p 2)^3 * (p 0) = 0 ∧ (p 0 = 0 ∨ p 0 = 1)}) :
    Nat.card points = 3814708984376 := sorry
Will Sawin (Dec 23 2024 at 02:57):
Jason Rute said:
- It's been shown many times (in MiniF2F, PutnamBench, ProofNet, etc.) that these formal translations can be wrong quite often, even with extensive proofreading. One misses cases, makes an off-by-one error, or thinks a statement means something different from what it actually means in Mathlib. One would really have to have a way to ensure the correctness of the translation to prevent formalizing the wrong problem.
The fact that the problems have one-number answers is actually convenient for this: Most errors of this form would either produce a problem with a different numerical answer, or multiple correct answers, or no correct answer. Formally solving the problem by proving there is a unique correct answer and finding it would thus be a powerful check against errors formalizing the statement.
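To make this concrete, here is a minimal toy sketch (a made-up mini-problem, not an actual FrontierMath statement, assuming a recent Mathlib). The answer-only form merely certifies the claimed value; the unique-answer form also certifies that nothing else satisfies the predicate, which is the check against mis-stated problems described above.
import Mathlib

-- Toy stand-in problem: "find the integer strictly between 10 and 20 that is divisible by 7".

-- Answer-only form: certifies that the claimed value works.
example : 10 < 14 ∧ 14 < 20 ∧ 14 % 7 = 0 := by decide

-- Unique-answer form: additionally certifies that no other value works, so a problem
-- statement with several (or no) valid answers would make this unprovable.
example : ∃! n : ℕ, 10 < n ∧ n < 20 ∧ n % 7 = 0 := by
  refine ⟨14, by decide, fun m hm => ?_⟩
  obtain ⟨h1, h2, h3⟩ := hm
  omega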
Jason Rute (Dec 23 2024 at 04:01):
I'm missing something. Yes, formalizing the solution would help guarantee that the formalization of the problem is correct, but in all those examples I mentioned, they didn't take the time to do that. What about having a one-number answer would make it easier to formalize the solution? In many ways, I think it would be much harder to produce formalized solutions for FrontierMath.
Qian Hong (Dec 24 2024 at 06:37):
Will Sawin said:
Jason Rute said:
- It's been shown many times (in MiniF2F, PutnamBench, ProofNet, etc.) that these formal translations can be wrong quite often, even with extensive proofreading. One misses cases, makes an off-by-one error, or thinks a statement means something different from what it actually means in Mathlib. One would really have to have a way to ensure the correctness of the translation to prevent formalizing the wrong problem.
The fact that the problems have one-number answers is actually convenient for this: Most errors of this form would either produce a problem with a different numerical answer, or multiple correct answers, or no correct answer. Formally solving the problem by proving there is a unique correct answer and finding it would thus be a powerful check against errors formalizing the statement.
Good idea! This approach of (in)validating formal translations could also be applied to programming tasks. Imagine using one agent (let’s call it Agent A) to translate an informal programming task description into a formal specification. Then, a second agent (Agent B) could generate code based on that formal specification. If the generated code passes all the test cases, it indicates that the initial translation from informal to formal was likely successful.
This approach could allow us to cherry-pick successful informal-to-formal translations to generate a synthetic corpus. This corpus could then be used to bootstrap a more robust informal-to-formal translation model, hopefully making it applicable to everyday programming tasks.
GasStationManager (Dec 24 2024 at 15:32):
I had a similar approach in my autoformalization project, except that agent B is not needed: I directly compare test case outputs with the formal specification. You can think of the test case outputs as a surrogate for agent B.
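Here is a minimal sketch of that idea (a toy task with hypothetical names, not taken from the project above, assuming a recent Mathlib): a formal specification plus recorded test cases, with the recorded outputs checked directly against the specification.
import Mathlib

-- Spec for the toy task "return the largest element of a nonempty list of naturals".
abbrev MaxSpec (input : List ℕ) (output : ℕ) : Prop :=
  output ∈ input ∧ ∀ x ∈ input, x ≤ output

-- Recorded (input, expected output) pairs, e.g. produced by a trusted reference run.
def testCases : List (List ℕ × ℕ) :=
  [([3, 1, 2], 3), ([5], 5), ([2, 7, 7, 1], 7)]

-- If the autoformalized spec disagrees with any recorded test case, this check fails,
-- flagging a bad informal-to-formal translation before any prover is involved.
example : ∀ tc ∈ testCases, MaxSpec tc.1 tc.2 := by decide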
Joseph Myers (Dec 24 2024 at 18:55):
I suspect that in practice a complicated programming task has various edge cases that are disproportionately likely to be missing from the tests, missed by autogeneration (or human generation) of a formal specification, and missed by autogeneration (or human generation) of code; quite possibly the informal description doesn't mention those cases either, so you need to figure out the best handling for the underlying problem being solved. A common cause could result in the edge cases being missed at all those stages. The same issue can of course affect formalizing mathematics: the definition in the literature might not be quite right in some edge case, especially when you're trying to write mathlib by generalizing as you go along to cover cases outside the scope of the original informal work.
Jason Rute (Feb 01 2025 at 04:19):
o3-mini came out today. It gets 9.2% of FrontierMath problems correct on the first try using the high-compute mode. (I assume, given the name o3-mini, that it is not as powerful as the full o3, but it is faster and cheaper, which explains why this number is lower than the numbers from December.) There is also an interesting footnote about it being given access to a "Python tool" (which I assume means Python with sympy):
OpenAI o3-mini with high reasoning performs better than its predecessor [o1] on FrontierMath. On FrontierMath, when prompted to use a Python tool, o3-mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.
Adam Kurkiewicz (Feb 01 2025 at 13:34):
Do we trust the numbers in the benchmark? I read today that there was a controversy over a lack of transparency about FrontierMath's funding:
https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-lessons-from-the-openai-frontiermath-debacle
Jason Rute (Feb 01 2025 at 13:44):
@Adam Kurkiewicz I trust the results more or less (both these and the original o3 FrontierMath results), and I think OpenAI's results will be reproduced both with OpenAI models and other newer reasoning models. But yes, some other researchers are more skeptical. I know @Kevin Buzzard and Sam Harris have proposed making their own competing benchmark problems. Time will tell. (This has mostly been discussed in #Machine Learning for Theorem Proving > Open open benchmarks and Kevin's recent blog post along with the follow-up in #general > XenaProject blog posts.)
Last updated: May 02 2025 at 03:31 UTC