Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: the-story-of-erdos-problem-1026


Deleted User 968128 (Dec 10 2025 at 07:10):

Very enjoyable blog post by Terry - https://terrytao.wordpress.com/2025/12/08/the-story-of-erdos-problem-126/ - a must read. Also, https://cloud.google.com/blog/products/ai-machine-learning/alphaevolve-on-google-cloud has been released into private preview. Soon!

I actually meant this post to be an appendage to #Machine Learning for Theorem Proving > Erdos 124 AI Solution. Unfortunately, Zulip isn't very supportive of edits. Strangely, I was on that thread when I posted this. Odd UI.

Deleted User 968128 (Dec 10 2025 at 08:19):

I left a comment on Terry's blog, though not sure it got in, so I will post it here.

Terry, this was a very inspiring read, as was seeing the acceleration of community around these problems - a community enabled and empowered by AI.

It was particularly fascinating to see how the Aristotle Lean proof triggered the cascade of ideas that followed, despite not being seen as particularly novel. But I think that is the entire point and promise of what could be built around this.

Critically - I think there is room for a lot of AI-based enhancements to TFBloom's website, integrating all external artifacts, such as the git repos, RAG, and all other relevant community and paper-based resources.

OpenAI has group chat https://openai.com/index/group-chats-in-chatgpt/ but I think this calls for something more boutique and custom built. An agentic flow that can properly cite, track, synthesize, and integrate in real time the contributions of a diverse set of mathematicians.

It would not just look to help spark ideas and solve problems, but aggressively identify possibilities for re-use in related questions. It could perhaps help prioritize mathlib coverage.

Most important in this agent flow will be the citation capability, providing strong motivation and incentives for people to get credibly involved.

The architecture of the flow should be built in a way that domain experts can enhance and customize its thinking. The idea is to take care of all the glue-code problems so experts don't need to worry about that.
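As a rough illustration of the glue code meant here - a minimal contribution ledger an agentic flow could use to track contributions and credit everyone upstream of a result. All names, fields, and URLs below are hypothetical, purely a sketch:

```python
from dataclasses import dataclass, field

@dataclass
class Contribution:
    author: str
    summary: str
    source_url: str  # paper, git commit, or chat message being cited
    cites: list = field(default_factory=list)  # ids of earlier contributions

ledger = {}  # contribution id -> Contribution

def record(author, summary, source_url, cites=()):
    """Append a contribution and return its id."""
    cid = len(ledger)
    ledger[cid] = Contribution(author, summary, source_url, list(cites))
    return cid

def credit_chain(cid):
    """Walk citations transitively so every upstream contributor gets credit."""
    seen, stack = set(), [cid]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(ledger[c].cites)
    return sorted({ledger[c].author for c in seen})

a = record("Alice", "key asymptotic lemma", "https://example.org/comment/1")
b = record("Bob", "Lean formalisation of the lemma", "https://example.org/pr/2",
           cites=[a])
```

Here `credit_chain(b)` returns both authors, which is the point: credit propagates to everyone whose small part was built upon.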

I really do hope you will think about reaching out to the labs to get them to help invest in something like this. Perhaps with the guarantee that they can use it as free training material they might be interested in providing credits, though I'd like to think the halo effect should be value enough.

At this point, there is one of two outcomes I think we can almost guarantee will happen. Either the above will eventually get built, or ASI in math will occur, making it redundant and unnecessary.

Thomas Bloom (Dec 10 2025 at 09:49):

Hi Tim - I don't want to add any "AI based enhancements" to the site (at least not for the foreseeable future). There is clearly a lot that can be, and has been, done with such tools, but I'd like to keep the site relatively simple as a "purely human benchmark".

For example, I am not using AI in any way to collect or transcribe the problems, and certainly not to generate 'new' Erdos-style problems.

In these days, when so much online content is generated or altered with AI somehow, I'd like to do what I can to keep one small corner human. (I suspect this is also beneficial for the AIs scraping the site, lest they start training on their own output too much.)

Many users are using AI tools in their own preferred way to help them generate solutions, which is great, but I'd like to leave this up to individuals and let the site be a clean place where the problems and any progress can be recorded.

Of course there are a couple of spin-off projects that use the site (e.g. the Formalised Conjectures github), and one can imagine someone else setting up their own external site that uses AI in the way you describe. But that someone won't be me.

Deleted User 968128 (Dec 10 2025 at 11:32):

I'm not thrilled with the idea of ASI making people largely redundant, but if it happens I will try to adapt.

Hopefully things will plateau and AI will stop at a place where it can just surface and connect knowledge and ideas that people generate, revealing them to each other without ego and rapidly accelerating the process of discovery.

There is an intriguing opportunity in that. Greater global collaboration, where big problems and puzzles are solved by many people working very efficiently together in open science, getting the credit they deserve for their small part in pushing the rock up the hill.

Terry illuminates it well - "One striking feature of this story for me is how important it was to have a diverse set of people, literature, and tools to attack this problem....It was only through the combined efforts of all the contributors and their tools that all these key inputs were able to be assembled within 48 hours. It seems plausible that a more traditional effort involving just one or two mathematicians and simpler programming and literature search tools may eventually have been able to put all these pieces together, but I believe this process would have taken much longer (on the order of weeks or even months)."

It's something you learn when developing ML models: blending diverse approaches is very compelling. It's fairly standard practice to look for a small, poorly performing, but strange model that finds answers where the others cannot.
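A toy illustration of that blending effect, with fabricated data: three weak classifiers that err on *disjoint* samples, so a simple majority vote corrects every individual mistake. The numbers and "models" are made up purely to show the mechanism:

```python
truth = [1, 0, 1, 0, 1, 0, 1, 0, 1]

def corrupt(labels, bad_indices):
    """Flip the answer on the given indices to simulate a model's errors."""
    return [1 - y if i in bad_indices else y for i, y in enumerate(labels)]

# Each model is only 6/9 accurate, but they fail on different samples --
# the small "strange" model (c) errs exactly where the others succeed.
pred_a = corrupt(truth, {0, 1, 2})
pred_b = corrupt(truth, {3, 4, 5})
pred_c = corrupt(truth, {6, 7, 8})

def accuracy(pred):
    return sum(p == y for p, y in zip(pred, truth)) / len(truth)

# Majority vote: each sample has at most one wrong model, so the vote wins.
vote = [1 if a + b + c >= 2 else 0
        for a, b, c in zip(pred_a, pred_b, pred_c)]

acc_single = accuracy(pred_a)  # 6/9 for each individual model
acc_vote = accuracy(vote)      # 9/9: the blend recovers every answer
```

The disjoint-errors setup is of course idealized; the practical lesson is just that decorrelated mistakes are what make a weak, odd model worth keeping in the blend.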

Deleted User 968128 (Dec 11 2025 at 12:42):

https://arxiv.org/pdf/2506.19923 "Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs" might be one way to do this...

"Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. [...] Instead, by generating auxiliary lemmas, the agent can gradually construct an effective proof strategy in a bottom-up manner, even when the full structure is not initially apparent."

These auxiliary lemmas could be stored for RAG / re-use potential as well. I think for AI to start solving in a more interesting way it needs more exploration and less exploitation. This would be more expensive, but if re-use could be a byproduct it might be worth it (some massaging may be required, but it seems doable).
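A minimal sketch of that bottom-up loop with a pooled lemma store. All function bodies are stand-ins (the real informal LLM, prover model, and Lean feedback calls are stubbed out), and the names are hypothetical, not the paper's API:

```python
lemma_store = {}  # statement -> formal proof; pooled across problems for re-use

def propose_lemmas(goal):
    """Stand-in for the informal-reasoning LLM: suggest special cases / subgoals."""
    if goal.startswith(("special_case_of", "helper_fact_for")):
        return []  # treat auxiliary lemmas as leaves in this toy version
    return [f"special_case_of({goal})", f"helper_fact_for({goal})"]

def try_formal_proof(statement, store):
    """Stand-in for the prover model + Lean feedback: succeed once the
    auxiliary lemmas the statement depends on are already in the store."""
    if statement in store:
        return store[statement]  # retrieval hit: re-used, not re-proved
    missing = [s for s in propose_lemmas(statement) if s not in store]
    if missing:
        return None  # Lean would reject: dependencies not yet proved
    return f"proof({statement})"

def prove(goal, max_rounds=5):
    """Bottom-up loop: attempt the goal, else prove and store auxiliary lemmas."""
    for _ in range(max_rounds):
        proof = try_formal_proof(goal, lemma_store)
        if proof is not None:
            lemma_store[goal] = proof  # byproduct: the goal itself is re-usable
            return proof
        for lemma in propose_lemmas(goal):
            if lemma not in lemma_store:
                sub = try_formal_proof(lemma, lemma_store)
                if sub is not None:
                    lemma_store[lemma] = sub
    return None

result = prove("erdos_style_goal")
```

The store is the point of the sketch: every auxiliary lemma proved along the way persists, so a later, related goal can retrieve it instead of re-inventing it.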

What's a bit tragic with these closed-source solutions coming out is that they might all be re-inventing a lot of auxiliary lemmas that could be pooled. Reminds me a bit of the duelling Italians and the cubic equation...

Deleted User 968128 (Dec 11 2025 at 13:03):

https://deepmind.google/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/ does something similar, but at a larger scale and, I think, less curated...

These folks seem kinda skeptical though: https://arxiv.org/abs/2410.20274 https://arxiv.org/pdf/2504.03048 https://arxiv.org/pdf/2507.22069 Makes for very interesting reading...

My suspicion is they are probably right, but it's a combination of poor build for re-use and poor retrieval. I don't see a lot of research being done into the former, though the latter seems better covered.

Oliver Nash (Dec 11 2025 at 16:09):

@Tim Shephard Please try harder to ensure the content you post is relevant to Lean.

There is plenty of space in the web for open-ended discussion of applications of AI to mathematics. This is good (I read many of these myself) but it is also important that there are corners of the web where we try to be more focussed; in our case the focus is Lean and Mathlib.

Deleted User 968128 (Dec 11 2025 at 18:09):

I mean, the papers are all about lean? Did you read them?

Junyan Xu (Dec 11 2025 at 18:19):

This is the single time Lean is mentioned across the four links you shared, as far as I can see:
image.png
Why are you wasting our time?

Deleted User 968128 (Dec 11 2025 at 18:23):

Prover Agent is based on Lean. Library learning is a very important concept that the entire Lean movement is working on. If re-use is failing in autoformalization, it seems pretty relevant.

Deleted User 968128 (Dec 11 2025 at 18:30):

It's interesting, one of the papers seems to dismiss the retrieval results because it was failing to find lemmas with the same names in the re-use library. I guess they didn't think it counted if the LLM was using 90% of the ideas but not performing exact string matches.


Last updated: Dec 20 2025 at 21:32 UTC