Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Illusion of Thinking
Julian Berman (Jun 10 2025 at 16:41):
The Illusion of Thinking made its way around the internet a few days ago. It doesn't reference formal proofs at all from what I saw, but I'm curious to hear whether anyone in this venue has general thoughts on the paper, or specific thoughts or data points on how its results relate to formalization and Lean. Even more specifically, I'm curious about the paper's introductory remarks on why the authors chose synthetic puzzles:
Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically—by adjusting puzzle elements while preserving the core logic—and inspect both solutions and internal reasoning (Fig. 1, top). These puzzles: (1) offer fine-grained control over complexity; (2) avoid contamination common in established benchmarks; (3) require only the explicitly provided rules, emphasizing algorithmic reasoning; and (4) support rigorous, simulator-based evaluation, enabling precise solution checks and detailed failure analyses.
(Contamination aside, as I've seen in a few of the benchmark threads here that this is still a fun problem to avoid.) Is there some general way, given a formalization problem, to vary its complexity systematically in an interesting enough way?
And secondly, more broadly: how believable is it that the "reliability falls off a cliff at a sufficient level of complexity" conclusion also holds for formalization? And how likely is it that fixing it in one domain helps the others?
Justin Asher (Jun 10 2025 at 17:06):
I don't know if we can extrapolate anything consequential at this point about formalization. Back in December, everyone was talking about how we were running out of data, so the models would not keep getting smarter…at least this was said in some naive sense (a lot of people I knew outside of AI thought this). Obviously, that turned out not to be true. By the time we try to systematically study varying complexity for formalization, the models may be smart enough that our study is obsolete and formalization may be solved.
I think the hardest part of formalization is that your models have to deal with a vast amount of Lean code that they may or may not have seen before, unless you are continuously training them on your autoformalized data. This is also an easily solvable problem (search, RAG, etc.); see the sketch below.
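(Purely as a toy illustration of the retrieval half, under the assumption of a small corpus of Lean declarations; a real system would use BM25 or embedding search, but the shape is the same: fetch the relevant Lean the model may not have memorized and put it in context.)

```python
# Toy retrieval over a hypothetical corpus of Lean declarations.
# Real RAG setups use BM25 or embedding search; this keyword-overlap
# scorer just shows the shape of the idea.
corpus = {
    "Nat.add_comm": "theorem Nat.add_comm (n m : Nat) : n + m = m + n",
    "List.length_append":
        "theorem List.length_append (as bs : List α) : "
        "(as ++ bs).length = as.length + bs.length",
}

def retrieve(query: str, k: int = 1):
    """Return the k declarations sharing the most words with the query."""
    q = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q & set(kv[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

# The retrieved snippets would then be prepended to the model's prompt.
print(retrieve("length of list append"))
```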
Justin Asher (Jun 10 2025 at 17:10):
My guess is we can autoformalize all written mathematics by the end of the year.
Tyler Josephson ⚛️ (Jun 10 2025 at 21:45):
I like the paper. Here’s an earlier one showing unreliable reasoning traces in puzzle environments: https://arxiv.org/abs/2505.13775
There’s an important way this is not directly applicable to questions about autoformalization of proofs into Lean, which doesn't usually involve tasking LLMs with “computing” their way step by step through proofs. If the LLM generates Lean code (or even Python code, or calculator input, FWIW), it's not fundamentally limited by its ability to do the computation reliably, because it outsources part of the work to an external tool. LLMs can't faithfully multiply really long numbers using chain-of-thought reasoning, but that's a different question from their ability to prepare calculator input and extract the answer.
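To make that concrete, here is a minimal sketch of the "outsourcing" move (the operands are arbitrary examples, not from any benchmark): the model only has to emit the expression, and the tool does the exact arithmetic.

```python
# Minimal sketch of outsourcing arithmetic to a tool rather than doing
# it token-by-token in a chain of thought. Python ints have arbitrary
# precision, so the product is exact regardless of how many digits the
# operands have.
a = 31415926535897932384626433832795028841971
b = 27182818284590452353602874713526624977572
print(a * b)  # exact result; no step-by-step "mental" multiplication needed
```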
Lawrence Wu (llllvvuu) (Jun 11 2025 at 02:22):
Different proofs feel (at least to a human) like they require different amounts of “precomputation” versus being able to use Lean as a thinking tool. I suspect this is one of the reasons (besides the de Bruijn factor) that combinatorics is hard for AI.
Chase Norman (Jun 11 2025 at 03:33):
Justin Asher said:
My guess is we can autoformalize all written mathematics by the end of the year.
Are you including proofs in this, or just statements? Because this would imply that any formalization project with over a year remaining (including Kevin's Fermat's Last Theorem project) is misguided. As a significantly easier challenge, I'd like to see a model that can convert my unsafe Rust code to use lifetimes. Jules failed miserably.
Srivatsa Srinivas (Jun 12 2025 at 16:24):
@Chase Norman I'd like to see a model sum a list of 100 pairs of 100-digit numbers accurately
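(For reference, the task itself is trivial for a tool. A hypothetical harness, with random operands standing in for whatever inputs one might actually use:)

```python
import random

# Hypothetical harness for the challenge: 100 pairs of 100-digit numbers.
# Python's arbitrary-precision ints make the reference sums exact, so the
# hard part is entirely on the model's side.
random.seed(0)  # reproducible example inputs
pairs = [
    (random.randrange(10**99, 10**100), random.randrange(10**99, 10**100))
    for _ in range(100)
]
reference = [a + b for a, b in pairs]

def score(model_answers):
    """Fraction of the 100 sums the model got exactly right."""
    return sum(m == r for m, r in zip(model_answers, reference)) / len(reference)

print(score(reference))  # a perfect answer scores 1.0
```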
Justin Asher (Jun 12 2025 at 16:44):
Chase Norman said:
Are you including proofs in this, or just statements?
Seems like both are possible: https://github.com/morph-labs/lean-abc-true-almost-always
Chase Norman (Jun 12 2025 at 19:33):
@Srivatsa Srinivas I get what you're saying, but it's not like mathematicians are known for their accuracy in arithmetic. We might have to call this one a draw. The challenge I pose is not merely a test, but is actually something I would use the technology for, if it existed.
Niels Voss (Jun 12 2025 at 19:59):
I think it is likely that we will see LLM usage gradually increase among general contributors to mathlib long before we achieve general autoformalization. It would seem pretty strange to me if, in the span of less than a year, LLMs went from being used only occasionally by mathlib contributors to all of a sudden being able to formalize everything.
Jason Rute (Jun 12 2025 at 20:18):
I think this tweet summarizes the two sides nicely: https://x.com/boazbaraktcs/status/1878083403318231537?s=46
Anh Nguyễn (Jun 13 2025 at 00:25):
Regarding the question asked at the beginning of the discussion, there is a paper that criticizes the arguments made in The Illusion of Thinking paper.
Last updated: Dec 20 2025 at 21:32 UTC