Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Current state of AI for Mathematics and what could come n...


Anh Nguyễn (May 29 2025 at 05:15):

Hi, I joined Lean's Zulip chat at the beginning of March and quickly became very interested in the potential of Lean to help make LLMs more rigorous and, potentially, to help with mathematical research. I have asked questions here before, but they were mostly vague and low-level (sorry for that). In this question I will try my best to be detailed and straightforward. Thank you for your great patience with new members.
My rough understanding of AI for mathematics is that there are currently two main kinds of approaches: informal reasoning (a large language model reasons about and answers mathematical questions in natural language) and formal reasoning (an LLM, or some other system, writes Lean code, so we can be sure the answer is correct). Apart from these, there is also a third approach in which AI is used as a tool for mathematicians, as in AlphaEvolve or the applications of neural networks to knot theory, graph theory, and so on. But that approach is not really my interest, and I don't think it is the Lean community's focus.
On formal reasoning, we already have models that are very good at solving IMO problems, so let's say the IMO is no longer a challenge. There are other formal-reasoning benchmarks, such as PutnamBench, in the same category of competition problems. Putnam is more challenging for AI models, but we have seen great progress.
My questions are:

  1. On formal reasoning: how could formal reasoning models scale to harder problems? What exactly is the main challenge here, in terms of both technicalities (computational resources, etc.) and lack of training data? We do not have nearly as much Lean code as, say, Python, so perhaps LLMs cannot yet write Lean as fluently as they write Python, and formal reasoning systems may be limited by that.
  2. Informal reasoning can generally achieve better results in mathematics, though I think evaluating those models' mathematical capabilities is much more controversial and vague. There are also many different opinions about what they can actually do (guessing answers, not providing sound reasoning to back up their answers, and so on).
  3. What needs to be done in the near future (months) and over a broader timeframe (2-3 years)? Do we have a plan for it yet, and what can we hope for in that timeframe? My naive sense is that once we have enough Lean code, we could train an LLM to write high-quality Lean code automatically, and then the problem of automated theorem proving is "solved". Is that really the case, and what else makes it so hard?

Anh Nguyễn (May 29 2025 at 05:16):

If necessary, could you tell me more about the technicalities? I want to understand the technical details that make building an automatic mathematical reasoning system so hard.

Anh Nguyễn (May 29 2025 at 05:16):

Thank you so much

Anh Nguyễn (May 29 2025 at 06:16):

Hi, I joined Lean's Zulip chat at the beginning of March and quickly became very interested in the potential of Lean to help make LLMs more rigorous and, potentially, to help with mathematical research. I have asked questions here before, but they were mostly vague and low-level (sorry for that). In this question I will try my best to be detailed and straightforward. Thank you for your great patience with new members.
My rough understanding of AI for mathematics is that there are currently two main kinds of approaches: informal reasoning (a large language model reasons about and answers mathematical questions in natural language) and formal reasoning (an LLM, or some other system, writes Lean code, so we can be sure the answer is correct). Apart from these, there is also a third approach in which AI is used as a tool for mathematicians, as in AlphaEvolve or the applications of neural networks to knot theory, graph theory, and so on. But that approach is not really my interest, and I don't think it is the Lean community's focus.
On formal reasoning, we already have models that are very good at solving IMO problems, so let's say the IMO is no longer a challenge. There are other formal-reasoning benchmarks, such as PutnamBench, in the same category of competition problems. Putnam is more challenging for AI models, but we have seen great progress.

  1. On formal reasoning: how could formal reasoning models scale to harder problems? What exactly is the main challenge here, in terms of both technicalities (computational resources, etc.) and lack of training data? We do not have nearly as much Lean code as, say, Python, so perhaps LLMs cannot yet write Lean as fluently as they write Python, and formal reasoning systems may be limited by that.
  2. Informal reasoning can generally achieve better results in mathematics, though I think evaluating those models' mathematical capabilities is much more controversial and vague. There are also many different opinions about what they can actually do (guessing answers, not providing sound reasoning to back up their answers, and so on).
  3. What needs to be done in the near future (months) and over a broader timeframe (2-3 years)? Do we have a plan for it yet, and what can we hope for in that timeframe? My naive sense is that once we have enough Lean code, we could train an LLM to write high-quality Lean code automatically, and then the problem of automated theorem proving is "solved". Is that really the case, and what else makes it so hard?

Anh Nguyễn (May 29 2025 at 06:17):

I have asked this question in a different chat, but maybe it is reasonable to ask here

Scott Carnahan (May 29 2025 at 06:27):

When you say “we”, who specifically do you have in mind?

Anh Nguyễn (May 29 2025 at 06:35):

By “we”, I mean the community

Anh Nguyễn (May 29 2025 at 06:35):

Lean community

Kevin Buzzard (May 29 2025 at 06:41):

My naive model is that building an automatic math reasoning system is hard because (a) maths is hard and (b) systems are poor at rigorous reasoning. My belief in (a) is based on decades of experience, but it might be wrong; I might be underestimating AI.

But (b) is certainly true right now. There's a dichotomy between informal and formal models. Current informal models can be good at "solving high-level problems" but these problems are always of the form "give a short answer which can easily be graded by a computer" (e.g. "how many complicated math things have this complicated property; the answer is less than 10^10") and it seems to me that it's extremely difficult to write such a problem where getting the right answer is a proof that the model has done rigorous reasoning accurately; it might have just become a good guesser. A question for which the solution is a succinct witness to the fact that you have actually done a lot of accurate mathematical reasoning correctly (including proofs) is a question which is doing a remarkable compression task, and these are hard to come by. My belief is that the informal community are slightly confused right now, conflating "our system got the number right for this question which most PhD students couldn't even understand (because they work in a different area)" with "our system is actually doing high-level mathematical reasoning correctly".

As for formal systems, here the level is much lower. The area is still in its infancy. You have to choose a formal system; if you want to do math then probably Lean is the best choice given the canonical math library, but even this library (mathlib) does not cover most of the definitions in modern research, due to its "from the ground up" approach and the fact that we have reached a major bottleneck in getting stuff into the library (the amount of material appearing in the library is still growing more than linearly, but the amount of PRs backed up and not being merged is also growing more than linearly). A concrete consequence of this is that there are many math questions which an informal model could spout out an answer to but which could not even be formally stated in Lean. But even if the library did contain a lot of modern definitions, the technology is still not there; Kimina's demo https://demo.projectnumina.ai/ is very nice and can reliably translate informal questions at easy school level into Lean and then autonomously find Lean proofs, but its level right now is "sub-IMO"; DeepMind's model AlphaProof can (apparently -- this is closed source with no way to play with it) solve IMO-level problems but we've seen no evidence that it can get to undergraduate mathematics (although one would guess that right now it can do this). As I already said, it's hopeless to think that these systems can be used on most open math problems because they cannot be stated in either Lean or any other system, although this will change; however we are still a year or two at least away from AI systems being able to reliably prove theorems autonomously. I think an easier question is to get a system to translate a human mathematical proof into Lean, and even this is hard. I have been publicly running my FLT project for several months now and I've not had anyone attempting to use their AI system to solve any of the tasks which I give out on the project dashboard (tasks are all of the form "translate this informal LaTeX proof into Lean"); it's always humans that claim them.

Kevin Buzzard (May 29 2025 at 06:50):

Please don't double-post. I'll move your other question here (so now you'll have asked it twice).

Notification Bot (May 29 2025 at 06:51):

7 messages were moved here from #general > Discussion: Current capabilities, technologies of AI for ... by Kevin Buzzard.

Anh Nguyễn (May 29 2025 at 06:58):

Suppose that at some point in the future we have a nearly complete digital mathematical library (whether through autoformalization, assuming it becomes good enough, or through many mathematicians becoming interested in writing Lean). Could that, theoretically, automate mathematics the same way AlphaProof crushed the IMO problems?
Or is something currently missing for performing mathematics at that level?

Kevin Buzzard (May 29 2025 at 07:05):

what is missing is that (a) maths is hard. There are plenty of humans who understand a lot of mathematics but they can't just resolve arbitrary hard problems at the boundary. Why do you think computers will be any different?

I think "we have AI therefore we can solve the Riemann Hypothesis" is just the same as "we have a hammer and so now everything is a nail".

Kevin Buzzard (May 29 2025 at 07:07):

"AlphaProof crushed the IMO problems" -- these are problems which can be solved by a schoolchild, don't forget that. These are a huge distance from unsolved mathematical problems, they are born solved.

David Michael Roberts (May 29 2025 at 07:44):

By 'born solved' I think Kevin means: "they are deliberately written to be solvable using standard techniques that are taught to clever high school kids all over the world, plus a bit of trying some tricks and functionally reverse engineering the clever idea that the problem poser used to obfuscate the solution". Open research questions are very much not like this.

Anh Nguyễn (May 29 2025 at 07:53):

How about building a system that can explore and meaningfully contribute to mathematical research (here I am not referring to the neural networks that have been used for a long time)? Maybe we are not currently at that stage. By "explore", I mean giving the model only the knowledge available before Galois theory was developed and asking it to come up with the Galois group and all the ideas that made the proof possible. (Is that at the level of AGI?) Maybe at that level it could solve RH.

David Michael Roberts (May 29 2025 at 08:00):

Coming up with the Galois group idea is a looooooooong way from solving RH. I mean, people have done much more and more impressive 'refactors' of mathematics than Galois did (Grothendieck, for one, but also other people that I can't think of right now), and we have more tools and more understanding now than in Riemann's time (which was decades after Galois) and we have no idea how to attack RH (modulo Alain Connes, but I don't think experts are very confident in his ideas actually solving it). A computer that only had pre-Galois mathematics would have to replicate nearly two centuries of thinking from thousands upon thousands of mathematicians just to get to where we are...

David Michael Roberts (May 29 2025 at 08:19):

For interest: Galois didn't invent the concept of Galois group out of thin air: https://mathshistory.st-andrews.ac.uk/Projects/Brunk/chapter-2/ Were a piece of software able to invent all of Galois' work and Artin's modern reworking of it based only on prior art, then I would be very impressed.

Anh Nguyễn (May 29 2025 at 08:52):

David Michael Roberts said:

For interest: Galois didn't invent the concept of Galois group out of thin air: https://mathshistory.st-andrews.ac.uk/Projects/Brunk/chapter-2/ Were a piece of software able to invent all of Galois' work and Artin's modern reworking of it based only on prior art, then I would be very impressed.

I heard that in the early days, theorem proving systems were only capable of proving very simple things, like the fact that adding two even numbers yields an even number.
Those systems improved slowly for a long time; just 5 years ago, IMO problems were still out of reach.
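For concreteness, a statement of that flavour is essentially a one-liner in Lean 4 with Mathlib today. A minimal sketch (assuming Mathlib's `Even` and, if I remember the name right, its `Even.add` lemma):

```lean
import Mathlib

-- The sum of two even integers is even, via the library lemma `Even.add`.
example (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) :=
  ha.add hb

-- The same fact proved by hand from the definition `Even a ↔ ∃ r, a = r + r`.
example (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  obtain ⟨m, hm⟩ := ha
  obtain ⟨n, hn⟩ := hb
  exact ⟨m + n, by rw [hm, hn]; ring⟩
```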

Anh Nguyễn (May 29 2025 at 08:54):

(deleted)

Anh Nguyễn (May 29 2025 at 08:57):

Kevin Buzzard said:

what is missing is that (a) maths is hard. There are plenty of humans who understand a lot of mathematics but they can't just resolve arbitrary hard problems at the boundary. Why do you think computers will be any different?

I think "we have AI therefore we can solve the Riemann Hypothesis" is just the same as "we have a hammer and so now everything is a nail".

How about easier open problems, not necessarily RH?

Kevin Buzzard (May 29 2025 at 12:50):

Your argument is tantamount to an extrapolation which I think is invalid. AI trains on known ideas so can find proofs based on known ideas. How are you getting from this to unknown ideas?

Kevin Buzzard (May 29 2025 at 12:51):

I agree that it's gone from "can't do things that humans can do and which are explained on the internet" to "can do things which humans can do and which are explained on the internet" but how do you get from there to "can do things which humans can't do and aren't explained on the internet"? These things aren't doing magic, they're just looking through training data.

Damiano Testa (May 29 2025 at 13:10):

Kevin, I agree with you, but I also think that humans don't do magic either!

Alex Meiburg (May 29 2025 at 16:23):

I think that saying they're "just looking through training data" is overly reductive. Machine learning can produce systems that generalize, in quite a number of different meanings of 'generalize'. (I say "can" because it's been observed in many systems in many domains, but it's also often observed that systems fail to generalize in other, unexpected ways.)

A system that is only trained on e.g. textbook text will probably be a very bad mathematician: it will mimic a textbook and act something like an encyclopedia of math. A system that is trained on the process of mathematical discovery should, I believe, make new mathematical discoveries.

That isn't to say it's easy.

Kevin Buzzard (May 29 2025 at 20:14):

Humans, though, can have decisive new mathematical ideas which have never been written down before; we have observed this throughout history. In my opinion we're yet to observe machines doing this, but I agree that this is not an entirely well-defined question. The whole conversation is subjective really. I'm explaining my mental model as to why I think it's a ridiculous idea right now that computers can do interesting new mathematics autonomously. Of course things may change in the future but, given that we have zero evidence suggesting otherwise right now, I'm currently very happy with my mental model. Others may well have different opinions of course!

Kevin Buzzard (May 29 2025 at 20:16):

I should probably add that part of my motivation for formalising mathematics is a hope that things will change in the future; I'm just reporting on my perception of the present.

Joseph Myers (May 29 2025 at 23:28):

"the process of mathematical discovery" isn't generally available for AIs to train on. By the time something reaches the stage of a public preprint, a lot of the discovery process (ideas that were discarded or substantially modified, etc., while the work was being done) isn't visible there.

Anh Nguyễn (May 29 2025 at 23:42):

I think there’s not much “thinking” data in a paper

Jason Rute (May 30 2025 at 00:37):

It's not clear what level of problems we are talking about AI proving here. Sure, I would be skeptical of the RH, but I would also be skeptical of a given human solving it as well, even Terry Tao. But there is a huge breadth of mathematics out there, even in the category of "unsolved open problems". Computers have been proving theorems since, ... well, the advent of electronic computers. There has to be a niche that neural AI can fill. Maybe it is combing the literature for ideas and possible (counter)-examples. Maybe it is working through a tedious problem from start to finish. Maybe it is coming up with ideas in a slightly non-human way. Maybe it is mimicking the style and tools of other mathematicians. Maybe it is just getting lucky (as many humans do). There are already examples (and rumors of examples) of LLMs being helpful for idea generation in research mathematics. I think we are just getting started, not on solving the Collatz conjecture, but on solving certain theorems that humans can't solve.

Anh Nguyễn (May 30 2025 at 00:38):

Do you think there is any relation between LLMs and the "mathematicians" we see in nature?
Bees do not "know" any mathematics, but they can optimize and find the best honeycomb structure. I think there are two main components here: complexity and evolution. Intelligent-seeming behaviour arises out of a highly complex system, but that intelligence does not imply understanding.
LLMs may be the same, I think.

Alex Meiburg (May 30 2025 at 00:40):

@Joseph Myers : I agree, and this is a big part of why it's hard! This isn't to say that such data can't eventually be found, created, or something synthetic done that is close enough to the real thing.

The same problem exists with "thinking tokens" in the new generation of reasoning LLMs. The base models are trained on e.g. textbooks, solution guides, etc. that present a way to do things, free of errors. There is almost no natural data online of someone starting to answer a question, noticing an error mid-sentence, and changing course to try again.

Key word: almost. There are a few examples of this, e.g. from video transcripts of someone talking through a problem live. There are discussion threads where someone corrects their post, in another post later.

And, certainly, some major labs wrote synthetic examples of self-reflection in the face of errors and course correction. (We may never know exactly what methods were used to jump start the RL for thinking, but there are many reasonable guesses that probably all help.)

I think there's a similar picture for "the process of discovery" (be that mathematical, scientific, philosophy, literary criticism...) that looks largely the same, maybe just with lots of numbers about ten times bigger.

Alex Meiburg (May 30 2025 at 00:41):

@Anh Nguyễn : this may be getting too far afield, but it sounds like you're asking about non-neural intelligence. See e.g. https://www.kitp.ucsb.edu/activities/brainless26

David Michael Roberts (May 30 2025 at 00:49):

I watched a talk by the philosopher of mathematics Colin McLarty last night, and he was talking about the state of research literature in the 19th century, specifically in Germany. It was a total mess. Paul Gordan would do a calculation, and Max (and Emmy) Noether would write up the actual proof, and they didn't actually fully grasp what was going on (and the proofs are today totally opaque). Hilbert's first proof of his famous basis theorem was wrong, Gordan 'refereed' it, believed the theorem but pointed out the proof was not up to scratch, and Hilbert got salty and told the journal to publish it anyway, which it did. I can't recall the mathematician, but it was someone famous, who would read a paper of Riemann, believe the theorem, but work out their own proof because they couldn't grasp Riemann's. People in Berlin weren't familiar enough with the methods used by Göttingen mathematicians to understand their proofs. People in Germany didn't read proofs by mathematicians in England. We know the famous papers from the time, today, because they have been digested and understood and form the basis for our own thinking. But the average paper at the time was super flaky.

What is incredible is the advances people made at that time in spite of this super non-rigorous and hodge-podge mathematical culture. Train an LLM on the entirety of the pre-1850 mathematics literature and it would likely produce the kind of garbage you would invariably get as of a few years ago. There's an anecdote of a mathematics professor lecturing and saying A, writing B, meaning C, with the correct statement being D. People really had to have a lot of insight to push through this stuff.

Justin Asher (May 30 2025 at 00:50):

I think the biggest unsolved problem in mathematics is defining a clear objective function for mathematics. Yes, you can naively define proving open conjectures as the training criterion, but that does not get to the heart of the matter.

What is a problem worth pursuing? And how can we quantify this? Or is a powerful statement unquantifiable?

Once we solve this, then I think an RL type approach can succeed.

My approach is more towards “mining” mathematics, which means trying to prove a lot of small statements which add up to some unexpected big statement, perhaps a bit like Grothendieck here. Proving generalizations, for instance. The reason calculators (and computers in general) work well is because they can do many small things fast. If we can do the same for LLMs, e.g., by allowing them to work in parallel, then this could also succeed. This is what I am focused on.

Anh Nguyễn (May 30 2025 at 01:16):

Justin Asher said:

I think the biggest unsolved problem in mathematics is defining a clear objective function for mathematics. Yes, you can naively define proving open conjectures as the training criterion, but that does not get to the heart of the matter.

What is a problem worth pursuing? And how can we quantify this? Or is a powerful statement unquantifiable?

Once we solve this, then I think an RL type approach can succeed.

My approach is more towards “mining” mathematics, which means trying to prove a lot of small statements which add up to some unexpected big statement, perhaps a bit like Grothendieck here. Proving generalizations, for instance. The reason calculators (and computers in general) work well is because they can do many small things fast. If we can do the same for LLMs, e.g., by allowing them to work in parallel, then this could also succeed. This is what I am focused on.

That’s what I really care about

Anh Nguyễn (May 30 2025 at 04:05):

Maybe we can approximate something like an autoformalizer.

David Michael Roberts (May 30 2025 at 04:30):

There have been a worrying number of papers listed under math.CT on the arXiv recently that are really obviously AI-generated. They are full of banal unjustified generalities. Some of them at least have been removed from that arXiv category into the hold-all math.GM, possibly due to me complaining about them. Category theory is not the same as algebraic number theory or arithmetic geometry or harmonic analysis; seeing these papers, I have a hard time imagining what the models that produce them would spew out in those fields. It's not even clear how this could turn into formalisable mathematics, let alone getting AI to formalise it. As Kevin says, maybe in the future, but right now the evidence is strongly against getting anything decent at research level.

Anh Nguyễn (May 30 2025 at 09:09):

Have we ever before witnessed such high hopes for automated mathematics (before LLMs)?

Anh Nguyễn (May 30 2025 at 09:17):

https://research.google/pubs/a-promising-path-towards-autoformalization-and-general-artificial-intelligence/
Maybe this was the paper that marked the beginning of the hopes we see today.

Andy Jiang (May 30 2025 at 15:41):

Alex Meiburg said:

Joseph Myers : I agree, and this is a big part of why it's hard! This isn't to say that such data can't eventually be found, created, or something synthetic done that is close enough to the real thing.

The same problem exists with "thinking tokens" in the new generation of reasoning LLMs. The base models are trained on e.g. textbooks, solution guides, etc. that present a way to do things, free of errors. There is almost no natural data online of someone starting to answer a question, noticing an error mid-sentence, and changing course to try again.

Key word: almost. There are a few examples of this, e.g. from video transcripts of someone talking through a problem live. There are discussion threads where someone corrects their post, in another post later.

And, certainly, some major labs wrote synthetic examples of self-reflection in the face of errors and course correction. (We may never know exactly what methods were used to jump start the RL for thinking, but there are many reasonable guesses that probably all help.)

I think there's a similar picture for "the process of discovery" (be that mathematical, scientific, philosophy, literary criticism...) that looks largely the same, maybe just with lots of numbers about ten times bigger.

In agreement here: I think the lack of training data on the human problem-solving process is actually not so important. The recent progress in reasoning models has not come from finding much more human data but from better training paradigms. In that light it's probably safe to assume that, as with all other fields, AI progress in math research will not slow down dramatically just because it is approaching human-expert level, especially given that reinforcement learning is expected to "just work" eventually on formal math.

Andy Jiang (May 30 2025 at 15:52):

My own mental model for expectations of AI capability in math is that a datacenter of H100s is simply much smarter than any human in raw ability, and that the gap to human-expert performance is simply a matter of guidance, training methods, and algorithm design. So I don't expect there are any remaining large hurdles that need to be cleared in order to surpass human performance. But it seems like, timeline-wise, there's not a big disagreement; 1-2 years seems reasonable.

Anh Nguyễn (May 31 2025 at 04:56):

Could you tell me what steps we should take to achieve that?

Anh Nguyễn (May 31 2025 at 05:17):

And would that be with informal or formal methods?

Martin Dvořák (Jun 02 2025 at 08:20):

Anh Nguyễn said:

On formal reasoning, we already have models that are very good at solving IMO problems, so let's say the IMO is no longer a challenge.

You are far from the truth. We have seen one instance of AI formally solving IMO well and, as far as we can tell from the little info DeepMind released, it was very challenging for their system.

And, what's more important, all AI systems that are currently available for the community are weaker than AlphaProof by several orders of magnitude.

Overall, I think the SOTA in automated theorem proving is much lower than you think it is.

Anh Nguyễn (Jun 02 2025 at 08:52):

What if we train a model specifically to deal with one particular problem (like RH)?

Anh Nguyễn (Jun 02 2025 at 08:56):

(deleted)

Shreyas Srinivas (Jun 02 2025 at 09:06):

Since I brought this up on Discord, I realise this theoretical result might also interest some of us on this thread (it is a limited, worst-case sort of result): https://arxiv.org/pdf/2402.08164v2

Martin Dvořák (Jun 02 2025 at 10:13):

Anh Nguyễn said:

What if we train a model specifically to deal with one particular problem (like RH)?

Well, since nobody knows how to attack RH, nobody knows how to train a specialized AI to solve RH, either.

Shreyas Srinivas (Jun 02 2025 at 11:54):

If you want to get a practical sense of how capable publicly available systems are, here is an experiment I did. I took this basic lemma, which says that a cycle, in the sense that mathlib defines it, must be a walk of length at least 3 (i.e. have at least three edges). I asked GitHub's free Copilot to explain the proof to me. The first thing it gave me was a Lean 3 proof, even though the Lean 4 proof was right below the theorem. I then asked it to read the proof and explain it line by line. Its explanation was correct, but very superficial. Here is the transcript.

Shreyas Srinivas (Jun 02 2025 at 11:56):

I would have hoped it could sound more natural than this. This is basically the explanation I would expect from a student who can read and infer what the Lean tactics are doing but hasn't actually studied graph theory (although the summary at the end sort of makes up for it).

Shreyas Srinivas (Jun 02 2025 at 11:58):

Lines like "If p is empty, hp' (which is the non-nil property) is contradicted by simp, so this case cannot actually happen" are a bit ... inaccurate. simp is a tactic, not a mathematical statement. In mathspeak it would be far clearer to say "If p were empty, that would contradict the non-nil assumption hp'".
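For reference, the statement under discussion is roughly the following; this is a sketch from memory, and the exact Mathlib lemma name may differ slightly:

```lean
import Mathlib

-- A cycle, in the sense Mathlib defines it, is a walk with at least three edges.
-- The name `three_le_length` is recalled from memory and may not match current Mathlib exactly.
example {V : Type*} {G : SimpleGraph V} {u : V} {p : G.Walk u u}
    (hp : p.IsCycle) : 3 ≤ p.length :=
  hp.three_le_length
```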

Jason Rute (Jun 02 2025 at 12:08):

@Shreyas Srinivas I don’t think your transcript shows via that link.

Jason Rute (Jun 02 2025 at 12:14):

Also, I think the exact model you are using matters a lot. I haven’t heard good things about Lean and GPT-4.1.

Shreyas Srinivas (Jun 02 2025 at 12:19):

I got it to extract the conversation into plain HTML which can be opened in a browser. I am not 100% sure that the transcription is exact, but I have attached the HTML file.
lean_cycle_theorem_conversation_Version2.html

Shreyas Srinivas (Jun 02 2025 at 12:20):

I wouldn't say it was bad after I asked it explicitly to explain the proof that followed. But I still came away with the feeling that it was superficially reading off the tactic script.

Anh Nguyễn (Jun 02 2025 at 14:06):

Are all the models that achieve good scores on benchmarks fully autonomous?

Anh Nguyễn (Jun 02 2025 at 14:06):

Or do they need human intervention?

Miheer Dewaskar (Jun 08 2025 at 20:18):

Potentially relevant to this conversation: https://www.scientificamerican.com/article/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai/

Jason Rute (Jun 08 2025 at 20:22):

@Miheer Dewaskar That article is a bit confusing. It makes it sound like the AI solved an open problem, which I don't think it did. Here is a more balanced account from someone who was there: https://x.com/zjasper666/status/1931481071952293930?s=46

Miheer Dewaskar (Jun 08 2025 at 22:17):

Jason Rute said:

Miheer Dewaskar That article is a bit confusing. It makes it sound like the AI solved an open problem, which I don't think it did. Here is a more balanced account from someone who was there: https://x.com/zjasper666/status/1931481071952293930?s=46

Thank you for the comment: the account you mentioned seems more realistic. There is so much hype surrounding AI. Even if AI does become as good as (or better than) humans at proving theorems, there is a question of how we can trust its work. There is a somewhat funny quote related to this in the previously mentioned Scientific American article:

While sparring with o4-mini was thrilling, its progress was also alarming. Ono and He express concern that the o4-mini’s results might be trusted too much. “There’s proof by induction, proof by contradiction, and then proof by intimidation,” He says. “If you say something with enough authority, people just get scared. I think o4-mini has mastered proof by intimidation; it says everything with so much confidence.”

Having AI output its work in a formal language like Lean can obviously guard against this. Indeed, if we have a powerful AGI that can do math, why would we want anything less than formal proofs for its claims?


Last updated: Dec 20 2025 at 21:32 UTC