Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: DeepMind's Aletheia Erdos problem
Justin Asher (Jan 31 2026 at 18:59):
It looks like Google DeepMind has a new agent, Aletheia, which has resolved an outstanding Erdős problem! Here is the preprint:
https://arxiv.org/abs/2601.21442
The result was then formalized in Lean by @Kevin Barreto, which you can see in the relevant thread:
https://www.erdosproblems.com/forum/thread/1051
They also claim to have resolved many other open problems, and more:
https://x.com/lmthang/status/2017317250055999926
Ralf Stephan (Feb 03 2026 at 12:01):
Here are the second preprint and X post:
https://arxiv.org/abs/2601.22401
https://x.com/lmthang/status/2018355663404187830
Bryan Wang (Feb 03 2026 at 12:20):
The part most relevant for Lean:
Large Language Models can easily generate candidate solutions, but the number of experts who can judge the correctness of a solution is relatively small, and even for experts, substantial time is required to carry out such evaluations. In particular, it would have been infeasible for our team to evaluate model outputs on all of the problems marked ‘Open’.
[...]
An alternative approach to the evaluation problem is via formal verification, such as through the Lean language. This has also led to a handful of success cases, but has limitations. First, because only a tiny proportion of the math research literature is formalized in Lean, this significantly restricts the model’s toolkit for solving problems. Second, many problem statements in the database are open-ended or susceptible to misinterpretation. An expert is still required to interpret the argument in natural language to determine if a formally verified proof addresses the intended mathematical meaning (see Appendix A for an interesting case study on this issue).
Joseph Myers (Feb 03 2026 at 22:33):
That text sounds like the target audience is people not particularly familiar with formalization or the open-ended nature of many research problems. I think it's fairly clear to people familiar with formalization that AIs formalizing their own work can help distinguish the 137 Fundamentally Flawed solutions from the 63 Technically Correct, not so much with deciding whether what the solutions had proved was interesting. (And the better AIs are at finding their own mistakes, the more potential there is for them to try the problems again autonomously after doing so.)
Justin Asher (Feb 04 2026 at 04:19):
I am surprised that their Erdős problem pipeline includes human grading. I guess the necessary mathematics has not yet been formalized in Lean for these solutions to all be autoformalized? This further seems indicative that the AI is not quite good enough to self-grade, and hence self-improve, yet; although I would guess that we are getting close to that point.
I am not sure why the AI struggles to see the mistakes in its own proofs unless each part is put under a microscope. Early last year I was running experiments where I would have the AI check each sentence in its proof, and then the overall proof, for correctness, which was pretty successful. Perhaps it is the nature of the attention mechanism and a limited output size that is prohibitive here. I would imagine agentic systems will naturally start to fill in these gaps.
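The sentence-by-sentence checking loop described above can be sketched roughly as follows. This is a minimal sketch, not the actual experiment: `query_model` is a hypothetical stand-in for an LLM call, stubbed here so the control flow is runnable.

```python
# Rough sketch of a sentence-level proof-auditing loop. `query_model` is a
# hypothetical placeholder for an LLM endpoint, stubbed here so the loop runs.

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; this stub just flags hand-wavy steps."""
    return "FLAW" if "clearly" in prompt.lower() else "OK"

def audit_proof(sentences: list[str]) -> list[int]:
    """Check each proof sentence in isolation; return indices of flagged steps."""
    flagged = []
    for i, sentence in enumerate(sentences):
        verdict = query_model(f"Check this proof step for correctness: {sentence}")
        if verdict != "OK":
            flagged.append(i)
    # A final whole-proof pass (as in the experiment described above) would go here.
    return flagged

proof = [
    "Let p be the smallest prime factor of n.",
    "Clearly the sum converges.",  # a hand-wave a real checker should question
]
print(audit_proof(proof))  # → [1]
```

One design point worth noting: checking each sentence in isolation sidesteps the long-context problem mentioned above, at the cost of losing cross-sentence dependencies, which is why a final whole-proof pass is still needed.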
I am also curious whether there has been any progress made by Google DeepMind on their formal conjectures repository, as it seems like the natural testing ground for AI.
I do think that this paradigm of attacking individual problems is flawed, as there appears to be little attempt at building up proper theory as a mathematician might. Strong technical ability is needed, but beyond that it is industrialization. The research I have seen in this area over the past few months, where open problems have been claimed to be solved by AI, has often focused on things that do not require significant long-term planning or construction.
Yan Yablonovskiy 🇺🇦 (Feb 04 2026 at 04:59):
Justin Asher said:
I am also curious whether there has been any progress made by Google DeepMind on their formal conjectures repository, as it seems like the natural testing ground for AI.
Ironically, it has a lot to teach about the interactions of Lean and AI, but so far not in the way you'd expect.
In particular, it has been a sort of barometer of how easily someone can blindly use commercially available AI to formalise the conjectures in Lean; there are a lot of one-time PRs from people just using AI blindly.
Over time it has gotten a lot better at it, and people there sometimes just take what the knowledgeable reviewer says and feed it to the AI as middlemen, successfully.
Ralf Stephan (Feb 05 2026 at 11:37):
Justin Asher said:
I am surprised that their Erdős problem pipeline includes human grading.
As can be seen from misformalizations in the #<Formal conjectures> project, one problem is that (from a formal perspective) Erdős was sloppy with his statements. He had hidden assumptions that made it all consistent and valuable but, purely from his writing, it is impossible in many cases to extract unambiguous statements.
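As an illustration of the kind of ambiguity at stake (a made-up example, not one of the actual misformalizations): an informal phrase like "P(n) holds for large n" admits two inequivalent Lean readings.

```lean
-- Two inequivalent formalizations of the informal phrase
-- "P n holds for large n" (an illustrative example only):

-- Reading 1: infinitely many n satisfy P
def InfinitelyOften (P : Nat → Prop) : Prop :=
  ∀ N, ∃ n, N ≤ n ∧ P n

-- Reading 2: all sufficiently large n satisfy P
def EventuallyAlways (P : Nat → Prop) : Prop :=
  ∃ N, ∀ n, N ≤ n → P n

-- Reading 2 implies Reading 1, but not conversely: for example,
-- P n := "n is even" holds infinitely often but not eventually always.
```

A formally verified proof of the wrong reading is exactly the failure mode the preprint's Appendix A and the quoted passage above are worried about.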
Ralf Stephan (Feb 05 2026 at 11:41):
So there is not only AI slop but also Erdős slop, and we have to deal with both.
Kevin Barreto (Feb 05 2026 at 11:54):
Justin Asher said:
I am surprised that their Erdős problem pipeline includes human grading. I guess the necessary mathematics has not yet been formalized in Lean for these solutions to all be autoformalized?
I made a strong push for doing autoformalisation, but unfortunately was shot down on this due to the effort needed to verify that statements were formalised correctly.
Justin Asher (Feb 12 2026 at 13:25):
Here is the relevant blog article:
Franz Huschenbeth (Feb 12 2026 at 17:02):
I mean, they do know that if AI capabilities continue on this trajectory, formalization is the only way one can realistically have a handle on verifying all of these results.
It's probably more of a publicity stunt, as currently the natural-language math abilities are still ahead of the formalized abilities, meaning that if you want to get headlines by saying you solved some research problems, you have to do it the old-school way.
I don't think it really matters, in the sense that it shows that AI can do math and that it keeps getting better at it. I think we all know that already; time and time again, we have been surprised.
One can argue that doing math research is fundamentally harder / different, that theory building requires other skills and vision. And while that is certainly true, it just implies that it might take longer to get to that level; it does not provide any evidence that AI will be fundamentally unable to do math at such a level. In the end it is just compute plus an engineering problem, and there are enough smart people working on it that we can be sure we will get there. It's just a question of how fast.
Last updated: Feb 28 2026 at 14:05 UTC