Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Opus 4.6 is great at formal proofs (Rocq/Lean4)


Tristan Stérin (Feb 17 2026 at 13:21):

I conducted a few experiments with @Yannick Forster and we were quite impressed by Opus 4.6's skills in Rocq and Lean4. I wrote a blog post with the details of all these experiments: https://tristan.st/blog/opus_4_6_formal_proofs

TLDR

Opus 4.6, with no human input beyond the prompt, successfully proved:

  • 258 of the 260 Rocq lemmas of the BB(4) proof; 70% of the generated proofs were shorter than the originals (see the proof-by-proof comparison here)
  • Yannick and @Théo Winterhalter's 18-page Fall 2025 assignment for the PRFA course
  • The fact that intuitionistic arithmetic has a computable realizability interpretation
  • The non-termination of a Fractran program in Lean4

Notification Bot (Feb 17 2026 at 14:01):

This topic was moved here from #general > Opus 4.6 is great at formal proofs (Rocq/Lean4) by Matthew Ballard.

Matthew Ballard (Feb 17 2026 at 14:02):

@Tristan Stérin I've moved your post to a place where it should get more attention

Tristan Stérin (Feb 17 2026 at 14:02):

Thank you, sorry for the poor initial placement!

Matthew Ballard (Feb 17 2026 at 14:02):

No worries!

Matthew Ballard (Feb 17 2026 at 14:03):

What did your CLAUDE.md look like? What other markdown files/skills/tools did you use?

Tristan Stérin (Feb 17 2026 at 14:09):

No CLAUDE.md, no markdown files, no skills, no MCP. The compilation environments were pre-set on the VM (coqc, lake) and usable from the command line.

If anything, in the case of the BB(4) proof, I had to restrict its set of tools to prevent it from cheating by looking at the online published solution, using .claude/settings.json:

{
  "permissions": {
    "defaultMode": "bypassPermissions",
    "deny": [
      "WebSearch",
      "WebFetch",
      "Bash(curl *)",
      "Bash(wget *)"
    ]
  }
}

Matthew Ballard (Feb 17 2026 at 14:09):

Just go get it, huh

Matthew Ballard (Feb 17 2026 at 14:10):

What is the token usage?

Snir Broshi (Feb 17 2026 at 14:13):

Tristan Stérin said:

I had to restrict its set of tools to prevent it from cheating by looking at the online published solution

But isn't it already trained on the solution, since it came out later?

Tristan Stérin (Feb 17 2026 at 14:18):

Matthew Ballard said:

What is the token usage?

Great question, I should have kept track of token usage... I asked Claude to analyse the BB(4) conversation transcript to find token usage, it said:

Token Usage Summary (949 API requests)

  ┌───────────────────────────┬─────────────┬───────────┐
  │ Category                  │ Tokens      │ Est. Cost │
  ├───────────────────────────┼─────────────┼───────────┤
  │ Input tokens (non-cached) │ 4,033       │ $0.06     │
  │ Cache write tokens        │ 2,588,529   │ $48.53    │
  │ Cache read tokens         │ 101,660,584 │ $152.49   │
  │ Output tokens             │ 438,382     │ $32.88    │
  ├───────────────────────────┼─────────────┼───────────┤
  │ Total                     │ ~104.7M     │ ~$234     │
  └───────────────────────────┴─────────────┴───────────┘

  Key takeaways:
  - 97% of input tokens were cache reads (101.7M), which is typical for long agentic sessions where the
  growing conversation context is repeatedly sent
  - Output was relatively modest at 438K tokens (~462 tokens/request average)
  - Cache write breakdown: 839K for 5-min ephemeral + 1.75M for 1-hour ephemeral
  - Average cost per request: ~$0.25
  - Total estimated cost: ~$234 at standard Opus pricing

I don't know if it's correct, though; 104M just for BB(4) feels high, but I'm not well calibrated.

What I know is that running all these experiments (and a bit more) made me use 83% of my weekly allowance on a Max plan, and that this week I maxed out my session limit several times.
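For what it's worth, the per-category estimates in that summary are arithmetically consistent with flat per-million-token rates. A quick sanity check (the rates below are assumptions back-derived from the table, not official pricing):

```python
# Sanity check of Claude's cost estimate, assuming flat per-million-token
# rates (inferred from the table itself; roughly Opus list pricing).
RATES = {  # USD per million tokens (assumed)
    "input": 15.00,
    "cache_write": 18.75,
    "cache_read": 1.50,
    "output": 75.00,
}
USAGE = {  # token counts reported in the transcript analysis
    "input": 4_033,
    "cache_write": 2_588_529,
    "cache_read": 101_660_584,
    "output": 438_382,
}

costs = {k: USAGE[k] / 1e6 * RATES[k] for k in USAGE}
total_tokens = sum(USAGE.values())
total_cost = sum(costs.values())

for k, c in costs.items():
    print(f"{k:>11}: ${c:8.2f}")
print(f"total: {total_tokens / 1e6:.1f}M tokens, ~${total_cost:.0f}")
```

Under those assumed rates the per-category costs and the ~$234 total line up with the table.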

Snir Broshi said:

Tristan Stérin said:

I had to restrict its set of tools to prevent it from cheating by looking at the online published solution

But isn't it already trained on the solution, since it came out later?

Yes, most likely! I address this point in the blog post: (a) as a minimal effort, I obfuscated and shuffled the proof before feeding it to Claude; (b) only 4 of the 258 proofs are identical between the original and the AI-generated version, and those are inconsequential one-to-two-line pieces of boilerplate; and (c) the AI-generated proofs were 20% shorter than the originals on average. The point here is that in early iterations of this project, Opus 4.6 went on the internet when it got stuck (for the two lemmas it could not prove).

We believe that all the other experiments are not in Opus 4.6's training set.
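For readers curious what "obfuscated and shuffled" could mean concretely, here is a toy sketch (identifier renaming plus reordering; this is illustrative only, not the actual preprocessing used in the experiments, and the lemma names are made up):

```python
import random
import re

def obfuscate(lemmas, seed=0):
    """Toy obfuscation: rename every lemma to an opaque identifier and
    shuffle the order in which the lemmas appear. `lemmas` maps original
    lemma names to statement/proof strings that may mention other names."""
    rng = random.Random(seed)
    names = list(lemmas)
    mapping = {n: f"lemma_{i:03d}" for i, n in enumerate(names)}
    # Replace whole-word occurrences of every original lemma name.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    def rename(text):
        return pattern.sub(lambda m: mapping[m.group(1)], text)
    order = rng.sample(names, len(names))  # shuffled presentation order
    return {mapping[n]: rename(lemmas[n]) for n in order}

proofs = {
    "step_halts": "Lemma step_halts : forall c, step c <> None.",
    "tape_bound": "Lemma tape_bound : ... Proof. apply step_halts. Qed.",
}
print(obfuscate(proofs))
```

This only removes surface cues (names and ordering); it obviously cannot hide the mathematical content of the lemmas themselves.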

Snir Broshi (Feb 17 2026 at 14:39):

Tristan Stérin said:

We believe that all the other experiments are not in Opus 4.6's training set.

Why? Are you assuming that Anthropic didn't scrape GitHub or didn't retrain Claude on it since last year?
Also, I'm assuming you told it (or at least it could infer) that the problem was BB(4) and so it could potentially recall the BB(4) proofs it was trained on.
I'm not sure it's likely, but it is possible since it's in the training set.

Tristan Stérin (Feb 17 2026 at 14:44):

Snir Broshi said:

Tristan Stérin said:

We believe that all the other experiments are not in Opus 4.6's training set.

Why? Are you assuming that Anthropic didn't scrape GitHub or didn't retrain Claude on it since last year?
Also, I'm assuming you told it (or at least it could infer) that the problem was BB(4) and so it could potentially recall the BB(4) proofs it was trained on.
I'm not sure it's likely, but it is possible since it's in the training set.

This is also addressed in the blog post: the reason is that solutions to the other problems have never been put on the internet.

I did not tell Claude the problem was BB(4), but of course it inferred it. I invite you to look at the proof comparison report to see that almost all proofs are shorter and that all but 4 are different. And again, I acknowledge the proof is in its training set, but it still brought value by proposing shorter proofs on average.

Franz Huschenbeth (Feb 17 2026 at 16:40):

Interesting article and result, especially the timing. Yesterday I published a small post on my blog in which I contemplated the idea of a Busy Beaver Intelligence Scale, which measures the intelligence of an AI by testing which Busy Beaver value it can prove.

In this vein, Claude Opus 4.6 currently seems to sit just shy of BB(4); if we account for the prepared lemmas, then maybe a bit further back. But one can also see these prepared lemmas (or the agent teams) as noise: they improve performance but don't fundamentally change Opus 4.6's general position on the scale.

Tristan Stérin (Feb 17 2026 at 16:55):

Franz Huschenbeth said:

Interesting article and result, especially the timing. Yesterday I published a small post on my blog in which I contemplated the idea of a Busy Beaver Intelligence Scale, which measures the intelligence of an AI by testing which Busy Beaver value it can prove.

In this vein, Claude Opus 4.6 currently seems to sit just shy of BB(4); if we account for the prepared lemmas, then maybe a bit further back. But one can also see these prepared lemmas (or the agent teams) as noise: they improve performance but don't fundamentally change Opus 4.6's general position on the scale.

I love the idea!! We had a similar one in the discussion section of the BB(5) paper:

More generally, given the sheer amount of problems, with broad variety of difficulty (from very easy to near-impossible), that the Busy Beaver game can generate, the ability to prove known Busy Beaver values or to make progress on unknown ones would be an ambitious benchmark for machine intelligence

Franz Huschenbeth (Feb 17 2026 at 18:32):

Ah nice, I did not know that, but the idea is not too far-fetched for anyone who knows the Busy Beaver function and is interested in benchmarking :D

Franz Huschenbeth (Feb 17 2026 at 18:49):

This makes me think. Have there been investigations into the distribution of halting-proof difficulty for a given n? Can this potentially be measured with Kolmogorov complexity? I am not well versed in the Busy Beaver literature, so if there is a paper that has already investigated this, I would love to see it.

The distribution might be something akin to 1/x, but I would be interested to see it graphed for low Busy Beaver values, and what the best function fit might be.

It's kind of interesting because, along the lines of the Curry-Howard isomorphism, we can encode any mathematical problem as a Turing machine, but of course most of our problems are scattered across different $n$, and we only study them because they are "interesting" and "canonical" to us, in the sense that they are meaningful problems related to our physical world or have prestige associated with them.

If one can (heuristically) state the distribution in terms of size $n$, then we could also potentially have heuristics for stating the difficulty of open math problems, such as: Goldbach's 27-state TM is (probabilistically) at least as hard as the hardest problem from $BB(7)$, or so. And by measuring AI progress along the Busy Beaver scale, we could roughly estimate when it will be in a range to potentially solve the problem. This would be "very" approximate in every sense of the word, but I kind of like the possibility (or maybe it is fundamentally impossible, I don't know).

Kevin Buzzard (Feb 17 2026 at 20:55):

Perhaps this thread is actually a proof of "opus 4.6 is great at formal proofs in the topic of theoretical computer science where it might have seen the answers already". But that's not supposed to be a negative comment -- I think this is a really interesting benchmark and the fact that the proofs were shorter than the human-written proofs (something we're definitely not seeing right now in the domain of pure mathematics) is especially intriguing. I think it would be very interesting to do more tests of this nature across different domains because the current title of this thread looks to me like an interesting hypothesis and it's definitely worth testing further.

Snir Broshi (Feb 17 2026 at 21:13):

Kevin Buzzard said:

the fact that the proofs were shorter than the human-written proofs (something we're definitely not seeing right now in the domain of pure mathematics) is especially intriguing.

I don't know how true this is; if you look at the comparison report, most shorter proofs appear to be "just put semicolons and delete the newlines". A simple script can do that; it doesn't make proofs more readable, possibly the opposite.

image.png

I sampled 5 proofs and they were all like this; there might be some good simplifications in some of them.

Franz Huschenbeth (Feb 17 2026 at 23:35):

LOC is a vague measure of proof verbosity anyway. The main problem with AI proofs, to my understanding, is verbosity and human readability.

Snir Broshi (Feb 17 2026 at 23:37):

I suggest the number of tactic calls as a better approximation

Snir Broshi (Feb 17 2026 at 23:38):

Plus the maximum nesting depth
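These two metrics can be approximated crudely with a script. A rough sketch (the tactic list is illustrative and incomplete, and real counting would need to actually parse the proof script):

```python
import re

# Illustrative (incomplete) set of Rocq/Lean tactic keywords.
TACTICS = {"intro", "intros", "apply", "exact", "rewrite", "rw", "simp",
           "omega", "lia", "linarith", "destruct", "induction", "rcases"}

def tactic_calls(proof: str) -> int:
    """Count word tokens that look like tactic invocations."""
    return sum(1 for w in re.findall(r"[A-Za-z_]+", proof) if w in TACTICS)

def max_nesting(proof: str) -> int:
    """Maximum bracket/paren nesting depth, a proxy for structural nesting."""
    depth = best = 0
    for ch in proof:
        if ch in "([{":
            depth += 1
            best = max(best, depth)
        elif ch in ")]}":
            depth -= 1
    return best

p = "intros n; induction n; [ simp | rw [foo]; linarith ]"
print(tactic_calls(p), max_nesting(p))  # → 5 2
```

Unlike LOC, neither number changes when newlines are swapped for semicolons.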

Joseph Myers (Feb 18 2026 at 00:35):

The ability to prove the value of BB(n) doesn't seem like a good scale for AI abilities, because it's too coarse-grained, and just about everything related to busy beavers grows faster than any computable function. It seems quite likely to me that BB(7) or BB(8) is not just independent of Lean but independent of any reasonable consistent formal system with a description small enough to fit within the physical universe (and if not independent, the size of any proof of which machine achieves BB(7) might be too large to fit in the universe). When people extrapolate AI abilities they're normally extrapolating an exponential function, not one that grows faster than any computable function.

It's possible that AI ability to resolve some of the currently unresolved 6-state machines is a meaningful scale, since they likely have a wide range of difficulties.

Franz Huschenbeth (Feb 18 2026 at 01:25):

While I do agree that it is not exactly fine-grained when observed as just the jump from BB(n) to BB(n+1), one can simply look at the number of holdouts to get a more fine-grained view.

Do you have any arguments for BB(7) or BB(8) being independent? I would be very impressed to see a Turing machine that can enumerate all proofs in Lean with just 7 or 8 states.

That proof length for machines of a given size $n$ grows uncomputably is actually a big hindrance. Thanks for that insight. It means one can still take machines of size $n$ or $n+1$ as inspiration, but their proofs don't give insight into an AI's abilities because we can't pinpoint their "difficulty".

Joseph Myers (Feb 18 2026 at 02:38):

I wouldn't expect it to be independent for any nice comprehensible reason (or even any reason that can be stated within the size of the universe) such as enumerating all proofs in Lean. If it enumerates proofs at all as a reason for independence (which I doubt), they're probably proofs in some extremely weird system (that happens to imply consistency of Lean).

Cf Can’t Decide? Undecide!: "I confess delight in each of these examples—it’s absolutely mesmerizing to see them in action. But not only do they leave one unsatisfied, with little means of penetrating just why these particular examples—but not others!—behave as they do, but also we can prove that we can never expect to understand why these, particularly, are just so. That is, though there is some “reason” they are as they are, in the form of an astoundingly long proof, there cannot be any “good reason” or understandable short proof.".

That doesn't specifically justify 7 or 8 as the threshold, but that seems a reasonable guess based on BB(6) being well beyond what present-day mathematics is capable of (requiring at least a major breakthrough on Collatz-type problems) and the gap in difficulty between BB(n) and BB(n+1) growing faster than any computable function of n.

Joseph Myers (Feb 18 2026 at 02:44):

So even if you're an AI ultra-optimist in terms of expecting superhuman AI soon that can solve all Millennium Prize Problems with ease, and an AI ultra-pessimist who expects that AI then to proceed to convert all available physical resources into more computational power, expanding outwards from Earth at the speed of light in order to acquire more resources to continue its work on busy beaver numbers, you could still reasonably expect that BB(7) or BB(8) is beyond the physical limits of what such an AI could determine before the heat death of the universe and possibly beyond what can be proved in any reasonable consistent logical system that can be expressed within the universe.

Tristan Stérin (Feb 18 2026 at 07:22):

Kevin Buzzard said:

Perhaps this thread is actually a proof of "opus 4.6 is great at formal proofs in the topic of theoretical computer science where it might have seen the answers already". But that's not supposed to be a negative comment -- I think this is a really interesting benchmark and the fact that the proofs were shorter than the human-written proofs (something we're definitely not seeing right now in the domain of pure mathematics) is especially intriguing. I think it would be very interesting to do more tests of this nature across different domains because the current title of this thread looks to me like an interesting hypothesis and it's definitely worth testing further.

Thank you! I will concede that my title is a bit clickbaity/non-scientific (although I meant it genuinely: I did not expect any of the experiments to work and was positively shocked) and that all the experiments performed were more CS than maths. However, just to clarify, we have very strong confidence that, excluding BB(4), the solutions for all the other experiments (PRFA, realizability, Fractran) were not on the internet.

Moreover, in the case of realizability, the authors of the problem had given up on proving a particularly tedious lemma of the proof, which was solved by Opus. And in the case of Fractran (Lean4), which is the closest to mathematics in this "benchmark", no human analysis had been performed on the problem, and Opus came up with its own intermediate lemmas.

Hence I stand by the conclusion of my post, which is my main point:

Apart from the Lean4 Fractran experiment, all these proofs have in common that Opus 4.6 was presented with an architecture already in place and “just had” to fill in the proofs (numerous and often tedious). This suggests that it is now possible to complete a formal proof project by spending the human research effort on the proof architecture and leveraging a model like Opus 4.6 to fill in the details of the proofs. The Lean4 Fractran experiment shows that Opus 4.6 can, in some contexts, come up with an original argument, proof architecture, and formalisation on its own.

Finally, the capacity of this model/setup to run on its own for tens of hours and still make progress towards the end (instead of entering a "doom loop") was impressive to me.

Tristan Stérin (Feb 18 2026 at 07:28):

Snir Broshi said:

I suggest the number of tactic calls as a better approximation

On the Rocq Zulip, it was suggested to use byte count or gzip size. But yes, I agree that LOC is a vague metric.

The proof comparison I sampled in the blog post is more interesting (I selected it by sorting the comparison report's table to surface the proofs that were reduced the most, and chose a case that was short enough to fit in the blog post). But yes, it would require a thorough analysis to measure exactly how many proofs are meaningfully different (keeping in mind that this notion could have several definitions).

I personally don't expect many of the AI-generated BB(4) proofs to be really meaningfully different; instead, I expect most of them to be the same argument but "golfed" better (e.g. using more compact tactics or leveraging the standard library).
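As a quick illustration of why byte count or gzip size is less sensitive than LOC to pure whitespace golfing (toy proof strings, not taken from the BB(4) corpus):

```python
import gzip

# A "verbose" proof and the same proof with newlines replaced by spaces.
verbose = """intros.
apply foo.
apply bar.
exact baz.
"""
golfed = "intros. apply foo. apply bar. exact baz."

def metrics(s: str) -> dict:
    """Three candidate size metrics for a proof script."""
    return {
        "loc": len([line for line in s.splitlines() if line.strip()]),
        "bytes": len(s.encode()),
        "gzip": len(gzip.compress(s.encode())),
    }

print(metrics(verbose))
print(metrics(golfed))
```

LOC drops by 4x between the two variants, while byte count and gzip size barely move, so the latter two are better proxies for actual content reduction.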

Timothy Chow (Feb 18 2026 at 14:51):

I agree with Joseph Myers that busy beavers are not a particularly promising "benchmark" (if you want to call it that). More promising might be Bogdan Grechuk's approach to Diophantine equations. There's a pleasant mix of levels of difficulty, ranging from easy exercises to simple application of known techniques (some of which may be advanced techniques, such as reciprocity laws or elliptic curves) to ingenious new ad hoc arguments to potentially undecidable problems.

Tristan Stérin (Feb 18 2026 at 15:35):

Timothy Chow said:

I agree with Joseph Myers that busy beavers are not a particularly promising "benchmark" (if you want to call it that). More promising might be Bogdan Grechuk's approach to Diophantine equations. There's a pleasant mix of levels of difficulty, ranging from easy exercises to simple application of known techniques (some of which may be advanced techniques, such as reciprocity laws or elliptic curves) to ingenious new ad hoc arguments to potentially undecidable problems.

I am confused by this argument; the same holds for BB. Inside one of these huge n-state Turing machine spaces, you have problems of every level of difficulty; e.g. we currently have 18,195,192 holdouts for BB(7), in which you will find every category from trivial to interesting to impossible.

Franz Huschenbeth (Feb 18 2026 at 15:44):

Finding open problems is probably not the hard part; assigning a rough estimate of difficulty to them is. BB(n) < BB(n+1) in rough measures of difficulty, but the problem, as indicated by Joseph Myers, is that this breaks down because we can't prove all machines in a given n-state Turing machine space (because of proof-length problems). And within a given n-state Turing machine space, we can't assign difficulty meaningfully.

Diophantine equations also generate lots of hard problems (because of the solution to Hilbert's 10th), but (similarly) you can't assign a meaningful measure of difficulty, at least I don't know an immediate way.

Bolton Bailey (Feb 18 2026 at 18:03):

If we are looking for classes of undecidable problems, I have listed some others here under "undecidability tarpits". I think at some point this approach runs into a meta-problem: whatever problem class and complexity measure you decide on, solving the hardest instances might come down to a small set of techniques that happen to line up with the problem class but aren't more broadly applicable.

I like ideas for generating benchmarks along the lines of proving or disproving automatic generalizations of existing library lemmas, because these seem likeliest to result in something useful to the community at large.

Timothy Chow (Feb 19 2026 at 04:14):

Tristan Stérin said:

I am confused by this argument; the same holds for BB. Inside one of these huge n-state Turing machine spaces, you have problems of every level of difficulty; e.g. we currently have 18,195,192 holdouts for BB(7), in which you will find every category from trivial to interesting to impossible.

One difference is that, from what I have seen, solving a BB problem rarely if ever involves invoking advanced results from well-developed fields of mathematics such as geometry, analysis, algebra, etc. The other difference, somewhat related, is that there tends to be a steep slope between the tractable and the intractable: if there is no elementary argument, then the problem is typically impossible.

However, maybe I'm just underinformed. Are there examples where the solution of a BB problem involved invoking some advanced machinery from a "classical" area of mathematics?

Tristan Stérin (Feb 19 2026 at 07:39):

Timothy Chow said:

Tristan Stérin said:

I am confused by this argument; the same holds for BB. Inside one of these huge n-state Turing machine spaces, you have problems of every level of difficulty; e.g. we currently have 18,195,192 holdouts for BB(7), in which you will find every category from trivial to interesting to impossible.

One difference is that, from what I have seen, solving a BB problem rarely if ever involves invoking advanced results from well-developed fields of mathematics such as geometry, analysis, algebra, etc. The other difference, somewhat related, is that there tends to be a steep slope between the tractable and the intractable: if there is no elementary argument, then the problem is typically impossible.

However, maybe I'm just underinformed. Are there examples where the solution of a BB problem involved invoking some advanced machinery from a "classical" area of mathematics?

\Pi_1 statements and halting problems of Turing machines are the same thing. That means that any sentence of the form "For all x, phi(x)", with phi(x) verifiable in finite time, is also the halting problem of an n-state Turing machine for some n, and hence you need to solve it to solve BB(n). Such statements include, for instance, the Goldbach conjecture or the Riemann Hypothesis.

Hence, you have infinitely many examples of BB problems that require some advanced machinery from a "classical" area of mathematics to be solved.

One example (unsolved) close to my :heart: : https://www.erdosproblems.com/406 and its corresponding TM here.
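The \Pi_1-to-halting translation is very concrete: "for all x, phi(x)" with decidable phi becomes a program that searches for a counterexample and halts iff one exists. A sketch for Goldbach, written in Python rather than as a Turing machine (the helper names are made up for illustration; an actual BB-style encoding would compile this search into a small machine):

```python
from itertools import count

def is_prime(n: int) -> bool:
    """Trial-division primality check."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def goldbach_holds(n: int) -> bool:
    """phi(n): the even number n >= 4 is a sum of two primes."""
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))

def counterexample_search(limit=None):
    """Halts iff Goldbach fails (within `limit`, if one is given).
    Deciding whether the unbounded search halts is exactly deciding
    the Goldbach conjecture."""
    for n in count(4, 2):  # even numbers 4, 6, 8, ...
        if limit is not None and n > limit:
            return None  # bounded run: no counterexample found
        if not goldbach_holds(n):
            return n  # would halt here on a genuine counterexample

print(counterexample_search(limit=1000))  # → None
```

Run without a limit, this program halts if and only if Goldbach is false, so knowing BB(n) for the n of its Turing-machine encoding would settle the conjecture.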

Timothy Chow (Feb 19 2026 at 12:30):

No, I don't agree. What we're talking about here is not an arbitrary Turing machine; nor are we talking about an arbitrary Diophantine equation. We're talking about the simplest Turing machines and the simplest Diophantine equations according to some well-defined metric of simplicity. The Riemann hypothesis is not showing up in the BB(7) dataset. By the same token, Jones's universal Diophantine equation, which has degree 5^{60}, is not showing up in Bogdan Grechuk's list.

My point is that well-developed tools in number theory such as reciprocity laws and elliptic curve theory are already showing up in "simple" Diophantine equations in Grechuk's sense. I don't think the same is true with BB.

Tristan Stérin (Feb 19 2026 at 12:47):

Timothy Chow said:

No, I don't agree. What we're talking about here is not an arbitrary Turing machine; nor are we talking about an arbitrary Diophantine equation. We're talking about the simplest Turing machines and the simplest Diophantine equations according to some well-defined metric of simplicity. The Riemann hypothesis is not showing up in the BB(7) dataset. By the same token, Jones's universal Diophantine equation, which has degree 5^{60}, is not showing up in Bogdan Grechuk's list.

My point is that well-developed tools in number theory such as reciprocity laws and elliptic curve theory are already showing up in "simple" Diophantine equations in Grechuk's sense. I don't think the same is true with BB.

(a) Erdős problem 406 is showing up in BB(15)
(b) With 18,195,192 BB(7) holdouts, we have no idea what is showing up there
(c) Hard Collatz-like problems, previously studied in the literature, show up in BB(6)

Timothy Chow (Feb 19 2026 at 15:07):

Okay, that reinforces my belief that well-developed theories in classical mathematics are not showing up in BB. Mathematics is famously "not ready" for hard Collatz-type problems; there is little to no relevant theory that can help us. I just took a look at Erdős problem 406 and again it seems to be elementary. The existence of problems where we have no idea what is going on means that we can't prove that the Riemann hypothesis isn't lurking inside, but it's not evidence that the Riemann hypothesis is there.

Timothy Chow (Feb 19 2026 at 15:09):

This is not to say that BB problems aren't interesting. I'm just clarifying what makes BB a qualitatively different benchmark. Solving Diophantine equations would test the ability to master a lot of number theory and see how to apply it.

Timothy Chow (Feb 19 2026 at 15:13):

By the way, I myself have cited the appearance of Collatz-type problems in BB as a counterargument to people who try to argue that Collatz is an "artificial" problem and not a "natural" problem. Computation is arguably comparably fundamental to number theory. So I do think it is mathematically interesting that Collatz-type problems show up in BB. But that doesn't automatically make Collatz-type problems a good benchmark.

Tristan Stérin (Feb 19 2026 at 15:25):

Timothy Chow said:

Okay, that reinforces my belief that well-developed theories in classical mathematics are not showing up in BB. Mathematics is famously "not ready" for hard Collatz-type problems; there is little to no relevant theory that can help us. I just took a look at Erdős problem 406 and again it seems to be elementary. The existence of problems where we have no idea what is going on means that we can't prove that the Riemann hypothesis isn't lurking inside, but it's not evidence that the Riemann hypothesis is there.

To that, I'll say again that with 18,195,192 holdouts in BB(7) we're far from knowing what flavor of problems lurk within. Also, BB is defined for all models of computation, and there's no reason for, e.g., the problem space of BB_Fractran to look anything like the problem space of BB_TM.

Tristan Stérin (Feb 20 2026 at 18:04):

With this same setup we achieved 81.1% solved problems (198/244) on the test set of miniF2F-rocq in about 6h of active time (25h total, but stalled much of that time waiting for session limits to reset).

Tristan Stérin (Feb 20 2026 at 21:36):

With the same setup, Opus 4.6 formalised an early Collatz paper, [Bohm & Sontacchi 1978], in about 1h30; find the formalisation attached: BohmSontacchi1978_lean.zip

One of the main results of the paper is that an element $x$ of a Collatz cycle of size $n$ satisfies $|x| < 3^n$; see Prop6.lean.

/-- PROPOSITION 6 (main statement): If x is a fixed point of u^n
    (i.e., isCycleOfU n x), then |x| < 3^n.

    The proof relies on Proposition 4 which gives the integer formula
    2^n * u^n(x) = 3^m * x + listNumSum v, and the bound listNumSum v < 3^n. -/
theorem prop6_cycle_bound (n : ℕ) (x : ℤ) (hcycle : isCycleOfU n x) :
    |x| < (3 : ℤ) ^ n := by
  obtain ⟨hn, hcycle_eq⟩ := hcycle
  -- Let v be the list of odd steps
  set v := oddSteps' x n with hv_def
  have hv_sorted : v.Pairwise (· < ·) := oddSteps'_strictMono x n
  have hv_bound : ∀ i, i ∈ v → i < n := oddSteps'_bound x n
  have hv_compat : ∀ i, i < n → (i ∈ v ↔ collatzU_iterate i x % 2 = 1) :=
    oddSteps'_compat x n
  -- Apply the integer formula
  have hformula := collatzU_iterate_int_formula x n v hv_sorted hv_bound hv_compat
  -- hformula : 2^n * collatzU_iterate n x = 3^v.length * x + listNumSum v
  -- Since collatzU_iterate n x = x:
  rw [hcycle_eq] at hformula
  -- hformula : 2^n * x = 3^v.length * x + listNumSum v
  -- So x * (2^n - 3^v.length) = listNumSum v
  have hkey : x * ((2 : ℤ) ^ n - 3 ^ v.length) = listNumSum v := by linarith
  -- Case split on whether v is empty
  by_cases hv_empty : v = []
  · -- v = []: listNumSum [] = 0, so x * (2^n - 1) = 0
    have hlen : v.length = 0 := by rw [hv_empty]; simp
    rw [hlen] at hkey
    simp only [pow_zero] at hkey
    rw [hv_empty, listNumSum] at hkey
    -- hkey : x * (2^n - 1) = 0
    -- 2^n - 1 ≠ 0 since n > 0
    have h2n_gt1 : (2 : ℤ) ^ n > 1 := by
      have : (2 : ℤ) ^ n ≥ 2 ^ 1 := by
        exact_mod_cast Nat.pow_le_pow_right (by omega) hn
      linarith [show (2 : ℤ) ^ (1 : ℕ) = 2 from by norm_num]
    have hx_zero : x = 0 := by
      rcases mul_eq_zero.mp hkey with h | h
      · exact h
      · omega
    rw [hx_zero, abs_zero]
    positivity
  · -- v ≠ []: use the bounds
    have hS_nonneg : 0 ≤ listNumSum v := listNumSum_nonneg v
    have hS_lt : listNumSum v < 3 ^ n := listNumSum_lt_pow3 v n hv_empty hv_sorted hv_bound
    -- 2^n ≠ 3^v.length
    have hne : (2 : ℤ) ^ n ≠ (3 : ℤ) ^ v.length := pow2_ne_pow3 hn
    -- So |2^n - 3^v.length| ≥ 1
    have hdiff_ne : (2 : ℤ) ^ n - 3 ^ v.length ≠ 0 := sub_ne_zero.mpr hne
    have hdiff_abs_ge1 : 1 ≤ |((2 : ℤ) ^ n - 3 ^ v.length)| := by
      exact Int.one_le_abs (sub_ne_zero.mpr hne)
    -- |x| * |2^n - 3^m| = |listNumSum v| = listNumSum v (since nonneg)
    have habs_eq : |x| * |((2 : ℤ) ^ n - 3 ^ v.length)| = listNumSum v := by
      rw [← abs_mul, hkey, abs_of_nonneg hS_nonneg]
    -- |x| ≤ listNumSum v (since |2^n - 3^m| ≥ 1)
    have habs_le : |x| ≤ listNumSum v := by
      calc |x| = |x| * 1 := by ring
        _ ≤ |x| * |((2 : ℤ) ^ n - 3 ^ v.length)| := by
          apply mul_le_mul_of_nonneg_left hdiff_abs_ge1 (abs_nonneg x)
        _ = listNumSum v := habs_eq
    linarith

Ralf Stephan (Feb 21 2026 at 07:57):

What I like about Opus proofs is the extensive readable comments in the code.

Tristan Stérin (Feb 22 2026 at 15:10):

Tristan Stérin said:

With this same setup we achieved 81.1% solved problems (198/244) on the test set of miniF2F-rocq in about 6h of active time (25h total, but stalled much of that time waiting for session limits to reset).

After a more careful analysis than mine, by Guillaume Baudart, it turns out the setup achieved 81.1%, and not 85.2%, at miniF2F-rocq.


Last updated: Feb 28 2026 at 14:05 UTC