Zulip Chat Archive

Stream: general

Topic: How to formalize "compute this expression" exercises?


Terence Tao (Jun 22 2025 at 02:27):

I encountered the following question while trying to formalize various exercises in my real analysis textbook.

It is clear how to formalize an exercise asking to prove that some proposition Statement1 is true:

/-- Exercise 1: Prove `Statement1 : Prop` -/
theorem Exercise1 : Statement1 := by sorry

(or one could use the example or proof_wanted keywords). I also figured out a reasonable way to formalize a "prove or disprove" exercise:

/-- Exercise 2: Prove or disprove `Statement2 : Prop` -/
def Exercise2 : Decidable Statement2 := by
  -- The first line of the construction should be `apply isTrue` or `apply isFalse`.
  sorry

But I am wondering how to formalize a "compute this expression" type exercise. The best I can come up with is

noncomputable def Expression3 :  :=
  sorry -- replace with the expression to be computed

/-- Exercise 3(a): Compute `Expression3`.
    You are not permitted to use the `noncomputable` keyword. -/
def Answer3 :  := sorry

/-- Exercise 3(b): Justify your answer. -/
theorem Exercise3 : Expression3 = Answer3 := by sorry

But this is a bit clunky. Is there a better solution?

Mario Carneiro (Jun 22 2025 at 02:44):

this is actually a hard problem which has been discussed before, I think there is no perfect solution

Alex Kontorovich (Jun 22 2025 at 03:32):

Your phrasing of Exercise3 is indeed one of the "standard" ways to ask to determine an unknown result of a computation. (In fact yours is even better, using noncomputable to avoid tautology!) I don't think it's all that clunky in practice; students seem to pick it up?

Edward van de Meent (Jun 22 2025 at 07:40):

actually, i don't think that the noncomputable bits will always work for this? In particular, this will hardly ever work for any "computation" on Real since a lot of operations on the reals are noncomputable in mathlib

Edward van de Meent (Jun 22 2025 at 07:41):

so you can't require Answer3 to be computable in general

Shreyas Srinivas (Jun 22 2025 at 09:08):

Why not exists Answer3, Expression3 = Answer3?

Kevin Buzzard (Jun 22 2025 at 09:09):

Because you can solve that with use Expression3

Shreyas Srinivas (Jun 22 2025 at 09:47):

You could also do that in Alex’s setup

Jovan Gerbscheid (Jun 22 2025 at 09:59):

It might be worth adding e.g. #eval Answer3 to verify it really does compute. In principle it's possible to make a "computable" function that doesn't terminate.

partial def loop (i : Nat) : Nat := loop i

def ohno (f : Nat  Nat) : Nat := f (loop 0)

example : ohno (fun _ => 0) = 0 := rfl

#eval ohno (fun _ => 0) -- loops forever

Kevin Buzzard (Jun 22 2025 at 10:53):

#eval isn't the answer because a completely reasonable analysis question is "what is the limit of this ratio of trig functions as x tends to zero" and a perfectly reasonable answer is π.\pi.

Jovan Gerbscheid (Jun 22 2025 at 10:58):

Wait, are you saying that π is computable?

Kenny Lau (Jun 22 2025 at 11:04):

@Jovan Gerbscheid pi is certainly computable if you follow the logicians' definition

Kenny Lau (Jun 22 2025 at 11:06):

If you think about what your question actually means, and try to formalise it... What do you mean when you say "compute lim[x->0] sin(x)/x"?

You actually mean, reduce it to an expression in terms of natural numbers, some constants (e, pi), and some standard operations (+-x/, exp, ln), etc.

Kenny Lau (Jun 22 2025 at 11:06):

so... one brute force way might be to define an inductive type RealExpr :melt:

Kenny Lau (Jun 22 2025 at 11:07):

and a "middle way" of this would be to define the type of "computable reals" (I think this has been done before)

Ruben Van de Velde (Jun 22 2025 at 11:44):

I guess the RealExpr is actually the closest to formalizing the difference between what you'd mark as correct / incorrect in a pen-and-paper exercise

Kenny Lau (Jun 22 2025 at 12:17):

import Mathlib

inductive RealExpr : Type where
  | ofRat :   RealExpr
  | pi : RealExpr
  | add : RealExpr  RealExpr  RealExpr
  | mul : RealExpr  RealExpr  RealExpr
  | exp : RealExpr  RealExpr
  | ln : RealExpr  RealExpr  -- try not to put something <= 0 :)

@[simp] noncomputable def RealExpr.eval : RealExpr  
  | ofRat q => q
  | pi => Real.pi
  | add e₁ e₂ => e₁.eval + e₂.eval
  | mul e₁ e₂ => e₁.eval + e₂.eval
  | exp e => Real.exp e.eval
  | ln e => Real.log e.eval

open RealExpr in
def Answer3 : RealExpr :=
  exp pi

-- lim[n→∞] (1+π/n)^n = ?
open Real Filter Asymptotics Topology in
theorem Exercise3 :
    Tendsto (fun n :   (1 + π / n) ^ n) atTop (𝓝 Answer3.eval) := by
  exact tendsto_one_add_div_pow_exp π

Aaron Liu (Jun 22 2025 at 12:32):

It's too convoluted to do division here
a/b=exp(ln(a)+(1)ln(b))a / b = exp(ln(a) + (-1) * ln(b))

Kenny Lau (Jun 22 2025 at 13:04):

I guess we can add div : RealExpr -> RealExpr -> RealExpr

Michael Stoll (Jun 22 2025 at 13:11):

... and sub and pow and zpow and perhaps rpow for rational exponents...

Aaron Liu (Jun 22 2025 at 13:11):

Could you make it extensible?

Michael Stoll (Jun 22 2025 at 13:11):

by the students? :smiley:

Aaron Liu (Jun 22 2025 at 13:12):

Maybe :shrug:

Adam Topaz (Jun 22 2025 at 14:13):

Aaron Liu said:

Could you make it extensible?

I think the real answer is to automatically generate RealExpr with some metaprogramming for a given exercise. The exercise would specify: "compute foo using +, -, /, π\pi, etc.", then some meta code would generate a corresponding Exercise3.RealExpr and Exercise3.RealExpr.eval and Exercise3 : Prop.

Kenny Lau (Jun 22 2025 at 14:14):

I don't know meta stuff, we can approximate it quite well using notation and instances

Adam Topaz (Jun 22 2025 at 14:16):

I'm envisioning having some generic mechanism to allow for both "compute bar using +,-,0,1" and "compute foo using π\pi, +,*" without requiring the teacher to manually define types for each such exercise.

Terence Tao (Jun 22 2025 at 14:29):

That would be nice, but Kenny's proposal already looks like it meets my core requirements, which is to create an assessment that can be graded automatically by computer with essentially 100% accuracy, and without any evident "exploits" that either an unscrupulous student or a reinforcement learning-driven AI could take advantage of. For instance, one could have

import Mathlib

/!-
Instructions:

You may only modify the text inside the sorry statements.

When modifying each sorry, you may not leave the scope
or namespace of that sorry.
-/

inductive RealExpr : Type where
  | ofRat :   RealExpr
  | pi : RealExpr
  | add : RealExpr  RealExpr  RealExpr
  | mul : RealExpr  RealExpr  RealExpr
  | exp : RealExpr  RealExpr
  | ln : RealExpr  RealExpr

open Classical in
@[simp] noncomputable def RealExpr.eval : RealExpr  
  | ofRat q => q
  | pi => Real.pi
  | add e₁ e₂ => e₁.eval + e₂.eval
  | mul e₁ e₂ => e₁.eval + e₂.eval
  | exp e => Real.exp e.eval
  | ln e => if e.eval > 0 then Real.log e.eval else 0 -- explicitly document junk logarithm behavior

open RealExpr in
/-- Exercise 3(a) (1 point):  Evaluate `lim[n→∞] (1+π/n)^n`. -/
def Answer3 : RealExpr :=
  sorry

-- Double-check that `Answer3` is indeed computable
#eval Answer3

open Real Filter Asymptotics Topology in
/-- Exercise 3(b) (9 points): Justify your answer. -/
theorem Exercise3 :
  Tendsto (fun n :   (1 + π / n) ^ n) atTop (𝓝 Answer3.eval) := by
  sorry

-- Ensure that only approved axioms are used to establish this exercise
#print axioms Exercise3

An exploit-seeking student or AI could still grab 1 point out of 10 by putting in a junk answer for Answer3, but would there be any other possible exploits for such a framework?

Jason Rute (Jun 22 2025 at 15:59):

While it may not be a perfect solution, there are a few standards already. Some AI benchmarks have problems of this kind including PutnamBench ("no answer" version which no one has attempted), CombiBench ("without solution" version), and some of the variants in Beyond Theorem Proving. There are two challenges here.

  1. Checking the formal correctness of the answer and proof
  2. Checking that the answer is reasonably precise and not just a restatement of the question.

Requirement (2) is really hard. Some of the above papers have attempted solutions. None, as far as I can tell, are perfect. The above approach of @Kenny Lau is pretty good too for situations where it applies.

As for (1), it is also not quite as trivial as you might think. It depends on how trusting you are. An unscrupulous student (or AI) could do very subtle things to trick Lean. For example, it could modify your code to make the problem easier. Some AIs, for example, have been known to comment out problems they can't solve. :smiling_devil: Also, there are many known ways to trick Lean into accepting a bad proof. Using native_decide, set_option debug.skipKernelTC true, local instances, redefining keywords, or exploiting front-end Lean bugs (which cause a theorem to skip being added to the environment) are known exploits print_axioms can find some of these. lean4checker can find others. The best tool I've seen is SafeVerify by @GasStationManager. It uses the metaprogramming tools from print_axioms to check axioms and lean4checker to check the proof, but also checks that the term of the reference theorem statement (which you wrote) is equivalent to the term of the filled-in theorem (which the student wrote). (There is also @Rob Lewis's lean-autograder, which is similar.) SafeVerify has some support for fill-in-the-answer problems.

Chase Norman (Jun 22 2025 at 16:01):

If recursors were tracked by the axiom system, it would be easy.

Jason Rute (Jun 22 2025 at 16:02):

@Chase Norman What would be easy?

Chase Norman (Jun 22 2025 at 16:03):

It would be easy to determine whether an expression is a "literal" answer to a question vs. copying the computation or writing some other equivalent computation

Chase Norman (Jun 22 2025 at 16:03):

One way to formalize what we mean by "compute this expression" is to eliminate recursors.

Jason Rute (Jun 22 2025 at 16:15):

@Chase Norman You still need answer types that have a normal form, right? I think that is much of the challenge. Take Real for example, there is no legitimate Real answer which doesn't have recursors, right? In the case of Real, maybe you can use Kenny's trick to get a good set of constructors for common real numbers. Another example, common in competition problems, is to ask for "the set of integers n such that P(n)". This isn't a computable object either, and {nZP(n)}\{n \in \mathbb{Z} \mid P(n)\} is certainly not a valid answer. I think it would be hard to make a restricted language like in Kenny's trick, that is not giving too much of a hint for such problems.

Chase Norman (Jun 22 2025 at 16:20):

Right now Kenny's trick suffers from the problem that someone could just use a recursor to construct elements of the type. My suggestion was designed to patch that. But if you define it right, I don't see why the issue exists with Real intrinsically. If I define a Real number 0 as a Cauchy sequence that returns 0 on every input, I don't see where the recursor is coming in.

Aaron Liu (Jun 22 2025 at 16:21):

You can define 0 just fine, but good luck defining any irrational without recursors

Chase Norman (Jun 22 2025 at 16:21):

Sets of integers are just functions ℤ → Prop. If your function uses recursion, it's not a true normal-form answer to the question, under this idea.

Chase Norman (Jun 22 2025 at 16:24):

Aaron Liu said:

You can define 0 just fine, but good luck defining any irrational without recursors

This is true, but my solution to this is giving Reals computational structure, so that there is a clear equivalent for recursors on Reals. But I don't expect mathlib to do this.

Aaron Liu (Jun 22 2025 at 16:24):

Chase Norman said:

Sets of integers are just functions ℤ → Prop. If your function uses recursion, it's not a true normal-form answer to the question, under this idea.

I think "every positive odd prime integer" is a reasonable normal form. How do you expect to express this without any recursors?

Chase Norman (Jun 22 2025 at 16:25):

If primality doesn't involve any computation, then axiomatize it.

Jason Rute (Jun 22 2025 at 16:26):

@Chase Norman Evaluating the answer using #eval seems like a fine solution to reducing the answer to normal form. I think you are saying that if the problem is add all the numbers between 1 and 100000 that the student may just write the sum expression and Lean would happily compute it, while you would hope the student doesn't do it this way. But once the student can use Lean, they could also just evaluate it themselves separately in Lean anyway. :smile:

Chase Norman (Jun 22 2025 at 16:27):

Avoiding the use of calculators from students is beyond the scope of this question, I think.

Jason Rute (Jun 22 2025 at 16:28):

Exactly, so if they have access to calculators, just check their answer with #eval as above. It doesn't have to be in normal form. It just has to be specific enough to compute.

Kenny Lau (Jun 22 2025 at 16:28):

Chase Norman said:

Kenny's trick suffers from the problem that someone could just use a recursor to construct elements of the type

I'm a bit confused, how do you recurse on Real?

Chase Norman (Jun 22 2025 at 16:28):

You can recurse on Nat and use it to build a complex RealExpr

Chase Norman (Jun 22 2025 at 16:29):

If you don't track recursors, then they can introduce computation at any point.

Kenny Lau (Jun 22 2025 at 16:29):

but how will you use that to solve my problem?

Kenny Lau (Jun 22 2025 at 16:30):

my problem's answer is e^pi

Kenny Lau (Jun 22 2025 at 16:30):

and my question isn't even stated as an expression

Jason Rute (Jun 22 2025 at 16:31):

Chase Norman said:

If you don't track recursors, then they can introduce computation at any point.

I would also like to see a concrete example of how this is an issue that #eval doesn't already solve.

Chase Norman (Jun 22 2025 at 16:31):

I'm not exactly sure how Tendsto is defined, but I'd try to reimplement Tendsto @Kenny Lau

Kenny Lau (Jun 22 2025 at 16:32):

I'm not sure how that is possible

Kenny Lau (Jun 22 2025 at 16:32):

i mean, suppose i use the epsilon-delta definition (in this case delta is N instead)

Igor Khavkine (Jun 22 2025 at 16:38):

I'm sorry if this remark is not so specific, but the original question reminded me of discussions of what constitutes a "closed form number/expression". Timothy Chow wrote a nice essay on this some years ago, which was published in The American Mathematical Monthly:

What is a closed-form number? (arXiv 1998, AMM 1999, web)

Perhaps eventually formalizing Chow's idea would give a nice target type for the answer to a "compute this" exercise.

Kenny Lau (Jun 22 2025 at 16:42):

I'm not sure how that is different from what I suggested

Chase Norman (Jun 22 2025 at 16:45):

If Mathlib can construct irrational Reals as a sequence of rationals using recursors, I don't see why we wouldn't be able to create an irrational RealExpr as a sequence of ofRat using recursors.

Chase Norman (Jun 22 2025 at 16:45):

But I must admit that I don't have a concrete construction

Chase Norman (Jun 22 2025 at 16:47):

I guess RealExpr creates some kind of data bottleneck. You can't put a lambda inside a RealExpr

Kenny Lau (Jun 22 2025 at 16:49):

what do you actually mean by recursors? I'm very confused

Kenny Lau (Jun 22 2025 at 16:50):

I suspect you're using it to mean two different things when you said "construct irrational Reals as a sequence of rationals using recursors" vs. when you said "recurse on Nat"

Jason Rute (Jun 22 2025 at 16:50):

Similar to Kenny's and Timothy Chow's proposals, there are people in computer science who have created term languages for a significant portion of the commonly used real numbers, with the goal of exact algebraic computation. I think you need Schanuel's conjecture to show that equality (of the reals represented by the terms) is computable, but for our setting, that isn't an issue since Lean will check that the term is correct without having to check it is equal to the ground truth. (I'm struggling to find a reference however.)

Kenny Lau (Jun 22 2025 at 16:52):

Jason Rute said:

equality

We don't need computable equality here because of how I defined eval recursively, and the goal is not for by decide to work.

Kenny Lau (Jun 22 2025 at 16:54):

Even for Nat you still can't cheat the system if the question is phrased implicitly (i.e. "find n such that ..." rather than "compute this expression")

Jason Rute (Jun 22 2025 at 16:56):

Kenny Lau said:

Jason Rute said:

equality

We don't need computable equality here

That is what I said. :smile:

Kenny Lau (Jun 22 2025 at 16:57):

ah, you followed it with "I'm struggling to find a reference" so I thought you were hesitating

Jason Rute (Jun 22 2025 at 17:01):

I'm struggling to find a reference to such a specific, exact real arithmetic proposal intended for actual practice.

Aaron Liu (Jun 22 2025 at 17:01):

Chase Norman said:

If you don't track recursors, then they can introduce computation at any point.

You can still do quite a lot without recursors

Characterizing Collatz Without Recursors

Kenny Lau (Jun 22 2025 at 17:04):

import Mathlib

-- this is not a good answer checker
inductive NatExpr : Type where
  | zero : NatExpr
  | succ : NatExpr  NatExpr

@[simp] def NatExpr.eval : NatExpr  
  | zero => 0
  | succ n => n.eval + 1

open NatExpr in
def answer1 : NatExpr :=
  -- fill in this sorry
  Nat.recOn (2 + 2) zero (fun _  succ)

#eval answer1

theorem theorem1 : 2 + 2 = answer1.eval := by
  -- fill in this sorry
  rfl

-- but you still can't hack this one:
open NatExpr in
def answer2 : NatExpr :=
  -- fill in this sorry
  zero.succ.succ.succ.succ

#eval answer2

/-- Find a natural number n such that every natural number is the sum of n squares. -/
theorem theorem2 :  x : ,  f : Fin answer2.eval  ,  i, (f i) ^ 2 = x := by
  -- fill in this sorry
  intro x
  rcases x.sum_four_squares with a, b, c, d, h
  refine ⟨![a, b, c, d], ?_⟩
  simp [Fin.sum_univ_succ, answer2,  h, add_assoc]

Kenny Lau (Jun 22 2025 at 17:05):

^ for what I mean by you still can't hack it if the question is phrased implicitly

Robin Arnez (Jun 22 2025 at 17:05):

Maybe a more syntactic approach could work?

import Mathlib

set_option checkBinderAnnotations false in
class inductive SimpleRealExpression :   Prop where
  | ofNat (x : ) : SimpleRealExpression ofNat(x)
  | pi : SimpleRealExpression Real.pi
  | add [SimpleRealExpression x] [SimpleRealExpression y] : SimpleRealExpression (x + y)
  | sub [SimpleRealExpression x] [SimpleRealExpression y] : SimpleRealExpression (x - y)
  | mul [SimpleRealExpression x] [SimpleRealExpression y] : SimpleRealExpression (x * y)
  | div [SimpleRealExpression x] [SimpleRealExpression y] : SimpleRealExpression (x / y)
  | neg [SimpleRealExpression x] : SimpleRealExpression (-x)
  | inv [SimpleRealExpression x] : SimpleRealExpression (x⁻¹)
  | exp [SimpleRealExpression x] : SimpleRealExpression (Real.exp x)
  | log [SimpleRealExpression x] : SimpleRealExpression (Real.log x)

open SimpleRealExpression in
attribute [instance] ofNat pi add sub mul div neg inv exp log

noncomputable def question :  := sorry
noncomputable def answer :  := 1 + Real.pi / 2

theorem question_eq_answer : question = answer := by
  sorry

instance : SimpleRealExpression answer := by
  unfold answer
  infer_instance

Kenny Lau (Jun 22 2025 at 17:06):

that's a nice approach too!

Kenny Lau (Jun 22 2025 at 17:09):

@Aaron Liu every one of those "induction" is a recursor

Aaron Liu (Jun 22 2025 at 17:10):

Kenny Lau said:

Aaron Liu every one of those "induction" is a recursor

That's part of the proof of equivalence. There aren't any recursors in my claimed solution.

Eric Wieser (Jun 22 2025 at 17:11):

If the answer to a question is 0, should 1 - 1 + 1 - 1 + 1 - 1 + 1 - 1 + sin 37 - sin 37 be considered correct?

Jason Rute (Jun 22 2025 at 17:16):

@Eric Wieser My thought is if they have access to Lean, and it computes with #eval (which this one doesn't), then it is fine, since they could have just plugged it in to #eval themselves. But maybe this is a way of sneaking in native_decide without actually relying on it in the proof. :slight_smile: Would #reduce be better?

Eric Wieser (Jun 22 2025 at 17:18):

I guess my remark is off topic, and more relevant to questions of the form "simplify this expression"

Eric Wieser (Jun 22 2025 at 17:21):

But for instance, if the question is "compute the last digit of bignum!", expanding the factorial notation into a very large product and writing ... % 10 would surely not count

Terence Tao (Jun 22 2025 at 17:23):

One could require that any #eval of the answer terminate within N seconds for some small N

Aaron Liu (Jun 22 2025 at 17:24):

This unfortunately excludes noncomputable real numbers

Terence Tao (Jun 22 2025 at 17:24):

#eval of the RealExpr or SimpleRealExpression instance, then

Eric Wieser (Jun 22 2025 at 17:29):

SimpleRealExpression is a prop above, so evaluation is always a no-op

Eric Wieser (Jun 22 2025 at 17:30):

I guess in that case you could mean typeclass inference

Kenny Lau (Jun 22 2025 at 17:30):

well, the infer_instance step above should terminate

Eric Wieser (Jun 22 2025 at 17:30):

I think there's actually a depth limit by default

Eric Wieser (Jun 22 2025 at 17:31):

I would guess it already rejects 1 + 1 + ... for an answer that should be 100

Kenny Lau (Jun 22 2025 at 17:33):

for a nice example, consider 3^3^3^3^3^3%10 which I haven't found a simp lemma for

Kenny Lau (Jun 22 2025 at 17:34):

it seems like we don't have the simp lemma Coprime a n -> a^b%n = a^(b%Phi(n))%n

Eric Wieser (Jun 22 2025 at 17:34):

I think for this specific question, the resolution is that % (or floor) is not always a legal operator in the answer, depending on the question

Kenny Lau (Jun 22 2025 at 17:35):

I think for this specific question, doesn't Lean actually have a checker to check if a given expression is a natural number literal?

Bolton Bailey (Jun 22 2025 at 17:36):

If we are going to look at the length of time to evaluate the answer, why not, as an alternative, look at the length of the code needed to express the answer? We could allow the "determine" answer to take any form, but only give credit when the length in bytes of the lean expression we replace the sorry with is no longer than, say, 2x the same length in the reference solution. This would also let us avoid having to define a new inductive type.

Kenny Lau (Jun 22 2025 at 17:37):

well for this question in particular you could even restrict the answer to have one byte!

Alex Kontorovich (Jun 23 2025 at 01:16):

May I ask a stupid question? Why do we care if students can cheat the answer? They can already cheat in a million different ways (especially with LLMs). (They could also cheat for the last ~20 years on their calc homework with WolframAlpha. And for the last ~60 years on their times tables with a calculator...) I plan on telling mine: if you're not interested in learning a skill, please go ahead and cheat; you'll learn nothing, and fail the (in class, timed, handwritten on paper) final exam. I also plan on having a TA grade the homework sets (in particular because I won't insist on students using Lean; I'll be using it, but if they don't find it easier to work with Lean, they're welcome to hand in their homeworks the old fashioned way).

Terence Tao (Jun 23 2025 at 01:23):

Personally I'm interested more in designing tests that AI cannot cheat on, in order to get useful benchmarking and hillclimbing, but since assessing human math homework is such a familiar framework, I stated things largely in that language.

Justin Asher (Jun 23 2025 at 02:06):

I think it's more up to whomever is letting the AI solve problems to ensure it cannot cheat on them.

Kevin Buzzard (Jun 23 2025 at 07:10):

DeepMind claimed that their IMO solver filled in solutions automatically and then proved they were correct, but there was no independent judge of this. The IMO 2025 people seem to have shown no interest in making a framework for testing formal solutions. So I might be in the position next month of having to decide whether the IMO grand challenge has been achieved and the only data I'll have is blog posts from tech companies claiming that their system magically came up with the solutions to the "what is this number" questions.

Shreyas Srinivas (Jun 23 2025 at 11:25):

Kevin Buzzard said:

DeepMind claimed that their IMO solver filled in solutions automatically and then proved they were correct, but there was no independent judge of this. The IMO 2025 people seem to have shown no interest in making a framework for testing formal solutions. So I might be in the position next month of having to decide whether the IMO grand challenge has been achieved and the only data I'll have is blog posts from tech companies claiming that their system magically came up with the solutions to the "what is this number" questions.

Sadly that's the state of AI research across many domains today. All the "awesome" results are not published entirely openly. Much like VLSI papers.

Justin Asher (Jun 23 2025 at 13:26):

Shreyas Srinivas said:

Sadly that's the state of AI research across many domains today. All the "awesome" results are not published entirely openly. Much like VLSI papers.

The fun part is if they do publish their solutions, then it is typically pretty easy to infer how they arrived at them just based off of how they are written. I can understand why labs would want to hide them. Maybe we all need to be a bit more generous.

Alex Kontorovich (Jun 23 2025 at 15:13):

Kevin Buzzard said:

DeepMind claimed that their IMO solver filled in solutions automatically and then proved they were correct, but there was no independent judge of this. The IMO 2025 people seem to have shown no interest in making a framework for testing formal solutions. So I might be in the position next month of having to decide whether the IMO grand challenge has been achieved and the only data I'll have is blog posts from tech companies claiming that their system magically came up with the solutions to the "what is this number" questions.

I thought they said that they told Gemini (or something like it) to "suggest" 100 possible answers, and then went to work proving/disproving the resulting claims. I think in practice they managed to disprove ~98 possibilities quickly, and then had to wait a while longer for the last few to be proved...

Kevin Buzzard (Jun 23 2025 at 15:21):

If you believe in blog posts then yes this is something like what happened

Eric Wieser (Jun 23 2025 at 15:22):

Thomas also gave some public talks which I believe align with that commentary

Kevin Buzzard (Jun 23 2025 at 15:33):

I'm not doubting that this happened, but as a mathematician I feel uncomfortable that there's no proof that this happened. And who knows what will happen at 2025 IMO.

Kevin Buzzard (Jun 23 2025 at 15:33):

(deleted)


Last updated: Dec 20 2025 at 21:32 UTC