Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Discussion: AI-written mathematical proofs


Kevin Buzzard (Nov 08 2025 at 01:34):

Interesting last slide in Emily's talk:

Screenshot from 2025-11-08 01-34-20.png

Jason Rute (Nov 08 2025 at 01:52):

Provocative. (Not sure I agree with the solution, and I don’t know if Emily does either, or is just being provocative, but I commend the discussion of the problem.)

Rado Kirov (Nov 08 2025 at 02:05):

does "communicated in natural language text" mean 1) something akin to code comments, proving intuition and motivations, but non-rigorous 2) closer to current standard of rigorous math proofs but not formal?

Kevin Buzzard (Nov 08 2025 at 02:38):

my impression was that it just meant "looks like a math paper".

Timothy Chow (Nov 08 2025 at 04:19):

I have a pet peeve about the term "AI system." What does that mean? For decades, the mathematical literature has been accumulating computer-assisted proofs. Are these no longer to be considered proofs until they are certified by an ITP? Or do they get a pass as long as the computer does something other than generating "text"? What about Doron Zeilberger's trusty co-author Shalosh B. Ekhad, which often generates "mathematical text" to accompany WZ-style proofs? I understand that we want to do something to prevent LLM-generated texts from being uncritically accepted as valid mathematical arguments, but being sloppy about what we mean by an "AI system" is an inauspicious start.

Alex Meiburg (Nov 08 2025 at 04:30):

It specifically says "artificially generated mathematical text" as the domain where this applies. Output from a system like Vampire or Mathematica can certainly still be a proof under this definition.

Timothy Chow (Nov 08 2025 at 04:44):

As I said, Shalosh B. Ekhad generates mathematical text. And aren't formulas also "text"? Mathematica typically generates human-readable output (which is what I think of as "text"), not binary files.

Timothy Chow (Nov 08 2025 at 04:49):

In any case, I don't think the crucial word is "text". It's whether the content has been verified in a satisfactory manner. If a mathematical argument has been properly vetted then its origin doesn't matter, surely? Focusing on the origin seems to be focusing on the wrong thing.

Timothy Chow (Nov 08 2025 at 05:04):

To put it another way, if Mathematica spits out an integral, then I think we can have a healthy discussion about whether to trust it, but I don't want to frame the discussion as a debate about "is Mathematica AI?" or "is an equation text?"

Terence Tao (Nov 08 2025 at 05:25):

My view is that the traditional human-powered way of generating a mathematical proof does not simply produce the final proof text; the process that the human mathematicians go through to reach that proof, trying different strategies, comparing against prior literature, drawing insights from failed attempts, etc., creates significant additional value. While most of that value would only be of private benefit to the researchers through their increased skill and experience, some of it also flows through into the way these researchers write and present their proof, sometimes in subtle ways such as selecting the order in which different parts of the argument are introduced, changes in notation compared to previous literature, etc.

In contrast, machine-generated proofs (whether coming from LLMs, "good old fashioned AI", CASs, etc.) can produce proof artefacts which may indeed be completely correct and pass all verification tests, yet lack the penumbra of additional insight and understanding that the human-powered process organically generates. (cf. Thurston's well known essay on proof and progress in mathematics, or my own short post on the three stages of mathematics which Emily also quotes from.) My interpretation of Emily's proposal is that it is intended to ensure that contributions to the mathematical literature continue to generate both types of valuable mathematical outputs even when the traditional mechanism for generating the latter output is absent.

Notification Bot (Nov 08 2025 at 10:47):

11 messages were moved here from #general > Lean in the wild by Floris van Doorn.

Floris van Doorn (Nov 08 2025 at 10:47):

This seems like an interesting discussion, and there is a lot to say about it, so I've split it into a separate thread.

Timothy Chow (Nov 10 2025 at 02:22):

Terence Tao said:

In contrast, machine-generated proofs (whether coming from LLMs, "good old fashioned AI", CASs, etc.) can produce proof artefacts which may indeed be completely correct and pass all verification tests, yet lack the penumbra of additional insight and understanding that the human-powered process organically generates. ... My interpretation of Emily's proposal is that it is intended to ensure that contributions to the mathematical literature continue to generate both types of valuable mathematical outputs even when the traditional mechanism for generating the latter output is absent.

I certainly agree that it's important to generate both types of valuable mathematical outputs. But I just listened carefully to Emily's entire talk, and I don't think that's her main point. She seems deeply concerned with "vibe proving" and its potential to generate vast amounts of correct-sounding but incorrect mathematics, overwhelming our normal systems for vetting mathematics. The proposal is an attempt to stem the tide of "AI slop" by insisting that it be formally verified.

My first instinct, as hinted at above, is to worry that a double standard is being applied, with formal verification being demanded only for "AI-generated text" (whatever that means), and not for other mathematics. But thinking about it more, I'm now reconsidering: maybe a double standard isn't such a bad thing after all. The mathematical community (by and large) sees no value in ITPs, but (by and large) also loathes LLM-generated "proofs." Maybe the loathing can be parlayed into support for ITPs, if they perceive ITPs to be an antidote for LLMs.

Yan Yablonovskiy 🇺🇦 (Nov 10 2025 at 03:29):

Timothy Chow said:

My first instinct, as hinted at above, is to worry that a double standard is being applied, with formal verification being demanded only for "AI-generated text" (whatever that means), and not for other mathematics. But thinking about it more, I'm now reconsidering: maybe a double standard isn't such a bad thing after all. The mathematical community (by and large) sees no value in ITPs, but (by and large) also loathes LLM-generated "proofs." Maybe the loathing can be parlayed into support for ITPs, if they perceive ITPs to be an antidote for LLMs.

I personally think it will be a matter of pragmatism: as soon as ITPs are verifiably and indispensably used not only to formalise but to heavily assist in a proof of a result many are curious about (not trying to discount old-fashioned AI, so perhaps specifically a more modern/expressive ITP), interest could surge among practicing mathematicians. Mathematics tends to be heavily divided into subdisciplines, though.

And I think that could come from LLMs combined with ITPs.

Joseph Myers (Nov 10 2025 at 10:33):

Terence Tao said:

My view is that the traditional human-powered way of generating a mathematical proof does not simply produce the final proof text; the process that the human mathematicians go through to reach that proof, trying different strategies, comparing against prior literature, drawing insights from failed attempts, etc., creates significant additional value. While most of that value would only be of private benefit to the researchers through their increased skill and experience, some of it also flows through into the way these researchers write and present their proof, sometimes in subtle ways such as selecting the order in which different parts of the argument are introduced, changes in notation compared to previous literature, etc.

This reminds me of Peter Naur's description of programming as being about building a shared mental model of how and why a program is designed, not just about producing program text, which people have referenced lately (example) in discussions of what is appropriate or inappropriate use of AI code generation when concerned with long-term maintainability.

With formalization as a hybrid of mathematics and programming, we can apply such ideas to what would be good or bad uses of autoformalization in library building. When we build up theory in mathlib for some part of mathematics, we're not just building a text, but a common understanding of how and why that part of mathematics is set up that way in mathlib, what the design tradeoffs involved were, why one approach for expressing concepts formally works better than another, how certain kinds of statements and proofs should be expressed formally, what the common patterns are in statements and proofs that may lead to new definitions and lemmas and tactics that don't appear directly in the informal literature. So a key question for large-scale autoformalization would be how to use it without losing those things. (An AI might readily produce 1000 proofs based on awkwardly set up definitions, and indeed clean up those proofs so that each one is individually high-quality Lean, without complaining that there are common things that are inconvenient in many of those proofs that indicate the initial definitions and API would better have been set up differently.)

Alex Kontorovich (Nov 10 2025 at 13:35):

Some further thoughts on this topic appear here, FYI: https://arxiv.org/pdf/2510.15924

Timothy Chow (Nov 10 2025 at 13:59):

On the topic of "AI slop" overwhelming the system, the CS arXiv recently instituted a policy of rejecting review articles and position papers because it's too easy to create them using LLMs. Unfortunately, ITPs can't really help with this problem.

Eric Vergo (Nov 10 2025 at 20:03):

Terence Tao said:

In contrast, machine-generated proofs (whether coming from LLMs, "good old fashioned AI", CASs, etc.) can produce proof artefacts which may indeed be completely correct and pass all verification tests, yet lack the penumbra of additional insight and understanding that the human-powered process organically generates.

I agree if you restrict yourself to considering only the final output of current systems, but that is not the only artifact generated in some workflows. For instance, LLMs interacting with Lean via the lean-lsp MCP generate reasoning traces, code, error message logs, and other intermediate outputs. This means every proof attempt, strategy change, success, failure, and notation modification can be stored and inspected by both humans and other language models. I see no reason why insights extracted from these artifacts could not be used to influence how the LLM writes the final proof and what information is communicated in other domains.

I would even argue that a mature version of this system (though none exists today) has an advantage in this regard. Humans have imperfect memories and chalkboards get erased. Everything the LLM does in this workflow is text that can be stored in a lossless, immutable, and transparent manner.

Terence Tao (Nov 11 2025 at 06:30):

This matches my recent experience with using AlphaEvolve: the traces of intermediate AE generation attempts, and our tweaking of the prompts based on the success or failure of those attempts, were actually rather valuable and in some ways more interesting than the final results (even when they beat previous literature). We'll try to make some of these traces available online in the near future. More generally, perhaps moving towards a culture of not just disclosing AI use in research, but showing more of the AI process, e.g., sharing the actual prompts and conversation with an LLM rather than just the final output, may ameliorate some of the critiques against current AI usage practices, though this of course is not a panacea for all AI-related issues.

Eric Vergo (Nov 11 2025 at 22:21):

Terence Tao said:

More generally, perhaps moving towards a culture of not just disclosing AI use in research, but showing more of the AI process, e.g., sharing the actual prompts and conversation with an LLM rather than just the final output...

Given the nature of math and formal verification, setting a standard of ‘full transparency’ seems prudent and consistent with other values driving norms in this space. Are there good arguments against full transparency beyond the overhead added when producing proofs?

Terence Tao said:

(this) may ameliorate some of the critiques against current AI usage practices, though this of course is not a panacea for all AI-related issues.

We are in full agreement on this point. I have spent a significant amount of time thinking about this and discussing it with mathematicians over the last few years, which is why I decided to jump in. It’s nice to see the conversation happening more broadly. This is one of the many challenges, both technical and not, that I do not see how to fully overcome. Moreover, I am of the opinion that we as a community have far less time to prepare for this than the mathematicians I have spoken to think we do. The scaling laws, coupled with better RL techniques, are producing consistent trends that are likely to hold. Given that much larger data centers are going to be coming online in 2026, it would be foolish to assume things are not going to change, and fast.

As for Emily’s talk/slide, I detect no sensationalism in her communication and believe she is right to be concerned. I fully endorse her proposal.

Screenshot 2025-11-11 at 3.03.35 PM.png

Timothy Chow (Nov 12 2025 at 02:16):

Eric Vergo said:

Are there good arguments against full transparency beyond the overhead added when producing proofs?

What exactly do you mean by "full transparency"? Ignoring computers for a moment, traditional mathematical publications do not record every misstep and blind alley along the path to the destination. It may be that we overdo the process of erasing these missteps, but recording everything doesn't seem to me to be the right answer either; it can clutter the exposition and make things much harder to understand.

Given that almost everyone reading this is a fan of formalized mathematics to some extent, we probably all support "full transparency" of all the formal steps in a proof. But I'm not sure what "full transparency" means beyond that.

As for Emily’s talk/slide, I detect no sensationalism in her communication and believe she is right to be concerned. I fully endorse her proposal.

Does this mean that you think Shalosh B. Ekhad's papers should not be accepted as proofs until they are formalized in an ITP?

Niels Voss (Nov 12 2025 at 09:18):

I use LLMs sort of like a search engine for things like algebraic structures. People don't publish their browsing histories when writing papers; however, they do publish the papers they eventually ended up referencing. I think asking mathematicians to publish every conversation they had with an LLM is akin to asking them to publish their whole search history, which is very noisy. I agree we should move "towards a culture of ... showing more of the AI process", but that's different from publishing everything.

In any case, the question of whether mathematics should shift culturally towards more transparency in AI usage is different from the question of whether AI transparency should be enforced on an institutional level, and these two questions might have the same answer or they might have different answers.

Emily Riehl (Nov 12 2025 at 23:13):

This is meant to be provocative but also to stake out a preliminary proposal I'm happy to defend. As @Timothy Chow notes, this is deliberately a higher standard than for human-generated proofs, for the reason that @Terence Tao describes.

In practice, there is more "evidence" for human-generated proofs than appears in the paper. The Thurston essay contains (among other gems) a very nice discussion of all that is in fact lost in translating a proof from one's head to speech to writing. I think this explains why human-generated proofs are more reliable in practice than is evident from the text: (i) Missing arguments can often be supplied. (ii) Even when the written proof contains a mistake, the theorem is often true.

Emily Riehl (Nov 12 2025 at 23:17):

I'm not entirely sure what this screenshot was from. Perhaps this talk, whose slides are here. I gave an extended version of the same talk to a somewhat different audience a few months later, though the slide Kevin screenshotted does not appear there. In any case, this was meant as a proposal to defend the mathematical literature against "vibe proving."

Emily Riehl (Nov 12 2025 at 23:18):

I'm very happy to see this discussed, both affirmatively and critically. Please let me know if you have further feedback/comments!

Timothy Chow (Nov 12 2025 at 23:22):

Emily Riehl said:

As Timothy Chow notes, this is deliberately a higher standard than for human-generated proofs, for the reason that Terence Tao describes.

I think your proposal is on the right track, and I do think that so-called "AI slop" is a problem that needs to be addressed. My concern is that if we're going to formulate an official policy, one needs to address the fact that there is not a clean separation between "human-generated proof" and "AI-generated proof."

For a concrete example, I was on the Robbins Prize committee when we decided to award the prize to the paper that proved the q-TSPP conjecture. During our deliberations, we worried a bit that the proof relied essentially on a Mathematica notebook, which (at the time anyway) was not something that just anyone could run; you had to pay for a Mathematica license (and even then, you don't get access to the source code). On top of that, when I tried to download and run everything, I found that there was a crucial component that I had to email one of the authors to obtain. Despite our reservations, we decided to award the prize anyway. Under the proposed policy, should we have declared that the "proof" was "not a real proof"? It was neither entirely "human-generated" nor entirely "AI-generated".

Jason Rute (Nov 12 2025 at 23:22):

What about the setting where the human author understands the proof, but the origin is AI? (I think in the past year, this has become more commonplace.) Not Annals-level proofs, but basic lemmas and other things which would go in a paper? And would it matter if the author was senior or junior? I agree there is a large danger of someone believing a mathematical fact because an AI said it, not unlike Wikipedia, which also has mistakes. But I think the first solution would be to give a rigorous proof yourself and understand all the details. Lean is nice (and I hope AI can make Lean easier), but it is a high bar for many areas of mathematics to get your proof into Lean, much less the entire background needed. And if we don’t let someone publish a proof they found with AI, but which they understand, will it just become folklore (like I’ve heard used to happen with Soviet math conferences)?

Terence Tao (Nov 12 2025 at 23:53):

I can envisage hybrid levels of formalization beginning to appear in math papers, where one particularly intricate part of the argument is formalized in (say) Lean using some preparatory lemmas as axioms, but these lemmas themselves are justified by more traditional methods, such as appeal to past literature, without having to go through the extremely time-consuming task of formalizing all prerequisite literature as well. We already do this sort of hybrid justification when using computer algebra packages to justify one or two specific lemmas (e.g., numerically evaluating an integral) in a paper, while the majority of the paper remains human-generated and written in informal mathematics.
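For concreteness, a minimal Lean sketch of this pattern (the statements and names are placeholders, not from any actual paper): the preparatory lemma is stated as an axiom, and only the delicate step is formalized on top of it.

```lean
import Mathlib

/-- A preparatory lemma justified by appeal to the literature rather than
formalized; stating it as an `axiom` makes the dependence explicit.
(Placeholder statement, purely illustrative.) -/
axiom keyEstimate : ∀ x : ℝ, 0 ≤ x → 1 + x ≤ Real.exp x

/-- The delicate step: the only part actually formalized. -/
theorem delicateStep (x : ℝ) (hx : 0 ≤ x) : 1 ≤ Real.exp x := by
  have h := keyEstimate x hx
  linarith

-- Lists `keyEstimate` (alongside Lean's standard axioms), so a reader can
-- see exactly which inputs were taken on trust from the literature.
#print axioms delicateStep
```

The `#print axioms` report is what keeps the hybrid honest: the unformalized inputs are visible rather than buried.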

Emily Riehl (Nov 13 2025 at 14:25):

I don't have much personal experience with proofs relying on some computational software so I'd rather hear more discussion with perspectives like @Timothy Chow's on how to vet them.

Emily Riehl (Nov 13 2025 at 14:28):

I agree with @Terence Tao that a hybrid approach makes sense in a lot of scenarios. @Sophie Morel advocates for something similar here.

Emily Riehl (Nov 13 2025 at 14:37):

Jason Rute said:

What about the setting where the human author understands the proof, but the origin is AI? (I think in the past year, this has become more commonplace.) Not Annals-level proofs, but basic lemmas and other things which would go in a paper? And would it matter if the author was senior or junior? I agree there is a large danger of someone believing a mathematical fact because an AI said it, not unlike Wikipedia, which also has mistakes. But I think the first solution would be to give a rigorous proof yourself and understand all the details. Lean is nice (and I hope AI can make Lean easier), but it is a high bar for many areas of mathematics to get your proof into Lean, much less the entire background needed. And if we don’t let someone publish a proof they found with AI, but which they understand, will it just become folklore (like I’ve heard used to happen with Soviet math conferences)?

This is an interesting question. I feel like an important factor is how easy the argument is in retrospect. And one factor in whether something is "easy" is how long it is: e.g., is this really just a one-paragraph proof of one lemma, or is it several pages, or a whole section, or the whole paper?

I don't mean to be overly pedantic. In the case of a lemma that is simple in retrospect, you could imagine everyone being happy with the natural language proof. (On the other hand, if it's so simple, why not just formalize it?)

Our conventional language ("we show...") acknowledges that written mathematical communication involves some collaboration between the writers and the readers. In some cases, it feels like the balance is off: the authors haven't explained enough, so the readers have to do too much work to understand an argument. As mathematical writing becomes easier to produce, I think we should shift a little more of the burden back onto the authors from the readers, not by having longer papers but by having some sort of supplemental formalization of the most delicate arguments, which may then liberate the natural language version to focus on details that are more high level.

Timothy Chow (Nov 13 2025 at 14:59):

On the topic of more conventional computer-assisted proofs, there are several MathOverflow questions that provide some suggestions, such as this one and this one. I don't think it's a completely solved problem, though.

Timothy Chow (Nov 13 2025 at 15:32):

Jason Rute said:

Lean is nice (and I hope AI can make Lean easier), but it is a high bar for many areas of mathematics to get your proof into Lean, much less the entire background needed.

Yes, if it was a "multi-day struggle" for Alexeev and Mixon to get a Lean proof of a fact that is so simple that it would not even require proof in a conventional combinatorics paper, we can forget about Lean novices formalizing entire papers, at least in the near future.

And if we don’t let someone publish a proof they found with AI, but which they understand, will it just become folklore (like I’ve heard used to happen with Soviet math conferences)?

Yes, enforcing standards that most people find too hard to meet can have the unintended consequence that people will be incentivized to lie about whether they used an LLM for assistance. If admitting that I used an LLM sentences me to months of hard labor fighting with a proof assistant, whereas lying gives me a free pass, we can predict what most people will do.

Jason Rute (Nov 13 2025 at 16:49):

As for lying: I met a math PhD at JMM 2025 who told me he used an LLM to get a key research idea for solving his thesis problem, but didn’t want to admit it, for fear of being looked down upon.

Terence Tao (Nov 13 2025 at 16:56):

A related point: if we don't normalize and encourage responsible disclosure of AI tools in our work, and instead a large fraction of the community ends up using AI clandestinely (and being occasionally exposed as such, e.g., due to the various "tells" of AI-generated output), it becomes harder to take the moral high ground when denouncing high-volume, low-quality AI-generated mathematics from non-experts.

David Michael Roberts (Nov 14 2025 at 00:07):

Niels Voss said:

they do publish the papers they eventually ended up referencing.

I'm not sure what you mean. But it put me in mind of this: https://mathoverflow.net/questions/227171/what-is-the-status-of-arthurs-book

David Michael Roberts (Nov 14 2025 at 00:20):

@Terence Tao there's obviously a strong difference between brainstorming with an AI and using it to output text, where only the latter is going to have a tell. The former is closer to something like "The author thanks x, y and z for useful discussions" in the acknowledgements. People do sometimes acknowledge their software tools; in category theory, for instance, citing Paul Taylor's diagram package for TeX was a not uncommon sight, and sometimes people point out the various free-as-in-speech software that was used in the course of the work. So there's a middle ground to find as to how one acknowledges use of AI qua tool-for-the-research-process, rather than "The author thanks OpenAI's/Google's/Anthropic's chat bot for useful discussions".

Jason Rute (Nov 14 2025 at 00:45):

I think there is going to be a continuum of useful discussions with AI, ranging from the equivalent of chatting with a rubber duck (a common trope in CS) to having the AI produce long, complicated, elegant, and seemingly novel proofs. In the latter case, what if the human rewrites the proof in his/her own words and thinks they understand it? I think Emily would still be concerned by this. I also doubt there is going to be a clear dividing line between the former innocuous use (silly human thinking AI is doing something) and the latter (wow, AI solved a theorem!). I think lots of mathematicians are going to use AI for understanding areas of math they are less familiar with. The AI is going to produce something obvious and trivial to the expert, but new and miraculous to the non-expert (mathematician). It will be difficult to know how much such an encounter needs a greater degree of scrutiny.

Timothy Chow (Nov 14 2025 at 08:36):

David Michael Roberts said:

Niels Voss said:

they do publish the papers they eventually ended up referencing.

I'm not sure what you mean.

I think all he means is that if I do a literature search and find a Paper X by Author A whose results I use in my paper, then I will explicitly cite Paper X by Author A. But in my paper, I won't explicitly write, "I used Google Scholar with version X of Chrome running on Windows 10 at 3:47 p.m. on January 12, 2025, typing the following search terms ... and noticed that the fifth search result was ..."

Timothy Chow (Nov 14 2025 at 14:19):

Jason Rute said:

It will be difficult to know how much such an encounter needs a greater degree of scrutiny.

Yes, if the history of computers and mathematics is anything to go by, attempts to legislate what is or is not a proof will likely fail. IMO, the most we can realistically hope to do is encourage the mathematical community to subject LLM-generated "proofs" to a high degree of scrutiny, and let the community figure out what that means.

We can, however, strongly encourage (without enforcing) the use of ITPs for this purpose. As I said above, pitching ITPs as a weapon against the evil hordes of AI slop may gain more traction among some segments of the community than some of the usual arguments for ITPs (which often fall on deaf ears). Another amusing possibility is to build a library of striking examples of incorrect but plausible LLM-generated proofs, to serve as cautionary tales.

Eric Vergo (Nov 15 2025 at 07:33):

Timothy Chow said:

Another amusing possibility is to build a library of striking examples of incorrect but plausible LLM-generated proofs, to serve as cautionary tales.

I nominate this for the first entry. It contains a complete, type-checked proof of a statement involving Chebyshev polynomials and roots of unity. This involved writing 3000+ lines of Lean over 10 files. In addition, it contains all the standard things: automatically generated documentation, a blueprint and dependency graph, an associated research paper, additional expository items, all packaged and displayed on a nice landing page. Most importantly, it is a 100% "vibe-proven" project from beginning to end, per @Emily Riehl 's definition.

It is also completely riddled with errors. But if you squint, it looks the part. It may also be trivially easy to prove given what's in Mathlib; I'm genuinely not sure. Regardless, that is not the point of this exercise.

To kick things off, I used GitHub Copilot to write a short Python script containing code capturing the basic idea. Other than that, every line of every file written between that moment and the repo's current state was written with Claude Code. It also executed essentially every GitHub action along with every other software engineering task. All I did was prompt it, with a few exceptions. For instance, I had to manually execute some terminal commands to meet permissions requirements on my machines, but did so at Claude Code's direction. I also did some manual editing to deal with linter warnings and other minor tasks of that nature. I estimate that I had Claude Code actively doing something for ~100 clock hours over the course of the last week. These claims warrant skepticism, which I welcome. To that end, I offer the following:

The repo includes this directory which contains a complete chat history from every Claude Code session used while vibe-proving. As far as I can tell, when paired with the GitHub history, this is the most complete record of the working sessions that can be practically obtained by the user. If I have done things correctly, this should contain every token both sent and received from Claude Code. In addition, I will reach out to Anthropic directly to see if they can independently verify these claims and also if they can provide more complete, easier-to-inspect documents.


What this is:

  • Planting a flag in the ground and saying "This is what models are capable of producing today—obvious slop or not"
  • A study in what it means to do mathematics in a maximally transparent way
  • An opportunity to study and document the myriad of ways in which things can go wrong if we are overly reliant on these tools, even when using Lean
  • An effort to stimulate discussion
  • An object that enables us to point at something concrete and make observations, rather than discuss hypotheticals.
  • A claim that in some limited but nonetheless meaningful ways, "the rubber is meeting the road"

What this is not:

  • A suggestion that what is being shared should be made the standard

This experience has caused me to ask quite a few new questions (new to me at least), but I am more interested in what all of you have to say. I look forward to the comments, questions, and discussion this will hopefully generate.

Chebyshev_N5.mp4

Alex Meiburg (Nov 15 2025 at 18:43):

I'm kind of confused what the lesson is. You say it's "riddled with errors". Is this:

  • A project that AI vibe-coded, that successfully proves an interesting mathematical theorem, but the design and layout is terrible to the point that no one should use this?
  • The Lean code compiles, but the AI-generated proof statement is wrong in a way that makes the whole repository essentially useless as a proof artifact?
  • The whole repo "looks plausible", but none of it actually builds?
  • Or something else?

Eric Vergo (Nov 15 2025 at 23:11):

Everything builds, and the Lean compiler says things successfully type check. But I make no claims that it is a valid proof. Here's why:

I am barely a mathematician, and far from competent with Lean. At best, I am a well-intentioned amateur. I cannot stress this enough: This is not false modesty, but recognition that the bar is high and I do not yet meet it. I am not qualified to judge whether this is a legitimate formalized proof. To put a fine point on it, the proof uses objects I have never encountered before, and there are statements and proof techniques I do not understand. There are things going on in Lean whose syntax and purpose I do not understand. Claude Code decided to use all of that and wrote all of the Lean without my understanding what was going on with the specifics.

Despite all of that, I was able to make this with tools available today. I knew that the relationship between roots of unity and Chebyshev polynomials was correct before I started this project (it is a known and established relationship), but I have little to no confidence that what is contained in this repo accurately captures that idea. I would go as far as to say that I was intentionally as careless as possible while "vibe proving" the repo, with the goal being to "get all the checkmarks and make it look good."

All of that being said, the "lesson" is this: 

This is an existence proof that, in some very limited sense, one can produce seemingly correct mathematical proofs in Lean almost exclusively by prompting a tool-using LLM. This should inform the earlier discussion in this thread about the pace of change and the potential volume problems facing the community. For a while I thought that Lean was going to be able to naturally defend itself from AI slop, but I am growing less confident of that over time. Some questions I think we need to ask:

  • If I am able to do this now in 100 hours, how easy will it be for non-experts in a year? Two? Five?
  • Can the Lean community handle 10 submissions like this a day? 100? 1,000?
  • How soon will a language model be able to replace a human like me? Can it do that today? Has anyone tried?
  • What if we simply don't have enough humans to review the rapidly increasing volume of submissions?
  • Should we allow the humans we do have to use AI tools during submission evaluation? How might this go wrong?
  • Are we going to enter an era of mass-produced mathematics? What does that look like? How do we prepare for it?

Terence Tao (Nov 15 2025 at 23:49):

By the way, I have tried to initiate a conversation on these topics over at https://ai-math.zulipchat.com/#narrow/channel/539992-Web-public-channel---AI-Math/topic/Best.20practices.20for.20incorporating.20AI.20etc.2E.20in.20papers/with/547591803 , although that Zulip is considerably lower volume than this one. Still it might be worth having some crossover between the two discussions.

Timothy Chow (Nov 16 2025 at 13:30):

Terence Tao said:

Still it might be worth having some crossover between the two discussions.

I'm not sure I want to sign up for an account over there, but I have a couple of comments. First, thanks for taking an initial crack at a list of best practices. I can imagine that a (suitably expanded and honed) version of this list could amass signatories and be published in something like the Notices of the AMS.

Second, I think that one needs to distinguish between things that people should do, and things that should be explicitly discussed in the paper. Take the example of hallucinated references. Obviously, any reference suggested by an LLM should be verified, to avoid those egg-on-the-face situations where a nonexistent reference is cited. However, I'm not sure it makes sense for every paper to contain an explicit discussion of this point and an affirmation that the authors did the right thing. Quite apart from LLMs, people already do sloppy things like copy a reference from someone else's bibliography without checking that it's correct. This is bad practice, and a list of best practices should condemn it, but I don't think anything is gained from cluttering every paper with an affirmation that the authors didn't do this. On the other hand, there are some things that warrant explicit discussion, such as efforts to perform certain intricate calculations in at least two different ways to make sure the answer is correct.

Third, while some of the "best practices" apply across the board to all kinds of tools, there are some that apply primarily if not exclusively to certain tools. It would be good for the list of best practices to have separate sections addressing different types of tools. For example, for an ITP, getting the formal definitions exactly right, and ensuring that others can exactly replicate the proof verification, are very important, but don't apply (at least, not in exactly the same way) to LLMs.

Timothy Chow (Nov 16 2025 at 13:45):

Another thing to think about is whether one wants to expand the list to include "best practices" in general, not just best practices for computer tools. What about some of the things that Kevin Buzzard sometimes cites (or used to cite?) in his talks: practices that lead to gaps and errors in the literature?

If we go down the route of publishing a list of recommendations, then we should be mindful that mathematicians hate being told what to do. I'm reminded of the brouhaha that Jaffe and Quinn's article on Theoretical Mathematics generated. Atiyah responded by saying that even though he agreed with much of what they said, "I rebel against their general tone and attitude which appears too authoritarian." In order to minimize backlash, we will need to avoid sounding too authoritarian.

Kevin Buzzard (Nov 16 2025 at 14:16):

It's just the fact that humans do mathematics which leads to gaps and errors in the literature.

Kim Morrison (Nov 18 2025 at 00:26):

I think Eric Vergo's example above is interesting to look at. There's no doubt that the AIs are already capable of creating repositories with correct (perhaps not interesting) mathematics, checked by Lean.

Kim Morrison (Nov 18 2025 at 00:26):

Let me restrict attention for a moment to projects where essentially all the definitions required for stating the final result are in Mathlib (I think this is true for Eric's example). (Implicitly saying that either I don't trust AIs to assist with projects where this isn't true, or at least that I expect a human to PR the definitional parts to Mathlib before I care to even consider the AI generated proof.)

Kim Morrison (Nov 18 2025 at 00:26):

I think we should agree, and provide associated tooling, that such projects "don't count" unless:

  1. They contain a file MainTheorem.lean, which has no imports outside of Mathlib, and contains the main result of the repository as a def StatementOfTheorem : Prop := ....
  2. They contain a file ProofOfMainTheorem.lean containing nothing besides theorem mainTheorem : StatementOfTheorem := ... (and which may import arbitrary files from the project).
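Concretely, the layout might look like this (both files shown in one block; the statement is a deliberately trivial placeholder, and the project import name is hypothetical):

```lean
-- MainTheorem.lean: no imports outside Mathlib; isolates the statement so
-- a reviewer can check *what* is claimed without reading any project code.
import Mathlib

def StatementOfTheorem : Prop :=
  ∀ p : ℕ, p.Prime → 2 ≤ p

-- ProofOfMainTheorem.lean: nothing besides the main theorem; it may also
-- import arbitrary project files, e.g. `import ChebyshevProject.KeyLemmas`.
import Mathlib
import MainTheorem

theorem mainTheorem : StatementOfTheorem :=
  fun _ hp => hp.two_le
```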

Kim Morrison (Nov 18 2025 at 00:26):

In fact, when the module system arrives (hopefully later today!) we can even ask that in ProofOfMainTheorem.lean, the only public imports are from Mathlib (i.e. all the repository imports are only used in the private context, e.g. proofs). This has the consequence that import ProofOfMainTheorem will add precisely two declarations to the public context: the statement and proof. This significantly reduces the complexity for a human reviewer!
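Roughly, under the module system (a sketch only; `module` headers and `public import` were brand new at this point, so treat the exact keywords as assumptions rather than settled syntax):

```lean
-- ProofOfMainTheorem.lean, module-system variant (sketch).
module

public import Mathlib
public import MainTheorem          -- public so the statement stays visible
import ChebyshevProject.KeyLemmas  -- hypothetical project file: private,
                                   -- usable in proofs, not re-exported

public theorem mainTheorem : StatementOfTheorem :=
  fun _ hp => hp.two_le
```

If that works as advertised, import ProofOfMainTheorem exposes essentially just the statement and the proof to the public context.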

Kim Morrison (Nov 18 2025 at 00:28):

What do people think?

I think we could write up a document explaining these criteria, and write a simple tool that verifies that a project fulfills these requirements (and probably runs the various sanity checkers too).

Then whenever someone advertises a new AI generated repository, it might be as simple as @mentioning a bot that runs the tool and, if it fails, replies with a link to the document. :-)

Eric Vergo (Nov 18 2025 at 01:03):

Kim Morrison said:

What do people think?

I think we could write up a document explaining these criteria, and write a simple tool that verifies that a project fulfills these requirements (and probably runs the various sanity checkers too).

Then whenever someone advertises a new AI generated repository, it might be as simple as @mentioning a bot that runs the tool and, if it fails, replies with a link to the document. :-)

I think this is worth exploring and would be happy to do some testing. What do you think about this: let people run the tool themselves while vibe proving, possibly even through callable tools like the lean-lsp MCP? This would allow people to ensure their repo is going to pass the checks before submitting.

Johan Commelin (Nov 18 2025 at 06:05):

In that case I think it should be a local tool. So that the bot doesn't get clogged by AIs.

Eric Vergo (Nov 18 2025 at 20:39):

Kim Morrison said:

I think Eric Vergo's example above is interesting to look at. There's no doubt that the AIs are already capable of creating repositories with correct (perhaps not interesting) mathematics, checked by Lean.

For what it's worth, I think the math here is interesting. I say this not to put up a stink, but because I think it's relevant to the discussion we are having: Who gets to decide what it means for some piece of mathematics to be interesting?

Timothy Chow said:

If we go down the route of publishing a list of recommendations, then we should be mindful that mathematicians hate being told what to do. I'm reminded of the brouhaha that Jaffe and Quinn's article on Theoretical Mathematics generated. Atiyah responded by saying that even though he agreed with much of what they said,  "I rebel against their general tone and attitude which appears too authoritarian." In order to minimize backlash, we will need to avoid sounding too authoritarian.

This is front and center for me. I know that I am a bit of an outsider here, so I would like to take a minor detour and tell you a bit about my background. I do this not to draw attention to it, but in the hope that it helps you understand where I am coming from, adds some credence to my thoughts, and gives you confidence that I mean what I say. I know who all of you are, and given the nature of the conversation I think it makes sense that you know who I am.

In August of 2024 I was interviewed on a podcast, where I describe the process of leaving my mechanical engineering career at Apple to pursue mathematics. It is surprising how many things I say during that interview that are relevant to the conversation we are having right now, but the important part is said at the 13:00 mark (timestamped link at the end of this post; it lasts 5 mins). Additionally, in October of 2023 I wrote “Math is now an Engineering Problem” on my now defunct blog. The critical part of that post is this:

“Just as Napster was the first warning shot to the music industry that everything was about to change, Sagredo is the first warning shot to mathematicians. 

While an incredible achievement, there are numerous opportunities to make this technology significantly more effective; specifically trained LLMs will do a better job of generating error free code, optimization will make the ‘discussion’ between the two subsystems faster, more compute will lead to speed ups. The list goes on. But that list is supplanted in importance by the following observation: everything on that list is a matter of engineering – not mathematics. And, in just the same way the LLM uses the iterative process to ‘engineer’ a proof, we will be able to apply the iterative process to engineer the system itself. What this means is that the ability of this system to produce proofs is a function of our ability to engineer the system, rather than our ability to understand mathematics – it’s engineering all the way down.”

Is it fair to say that the repo I shared earlier in this thread was made with Sagredo++? I think it is.

The core responsibility I had as an engineer was to take product designs/architectures which had been validated at low volume and get the product design refined to the point where it can be put into customers hands. This involved designing things that could be manufactured, sometimes in the many millions of units, while maintaining high quality standards. If my time as an engineer taught me anything, it’s that more is different. Different in a way that is not “we keep finding new corner cases that break the kernel” or “This process is the latest bottleneck in build time, that’s a first”. I mean different in a way that no one can predict, different in a way that simply cannot happen unless you are doing things at scale and have a large number of things interacting. When I looked at the coupling of large language models and formal verification systems, I saw something that could be automated and scaled, which led to that blog post. Based off of what I have seen between then and now, I fully expect the tools to become increasingly powerful. What happens as a result of that is very hard to predict.

Like a lot of people here, I am drawn to and excited by the exacting and brutal standards that formal verification imposes. What excites me more is the fact that I will get to express my creativity while being grounded by Lean. LLMs will be a part of my workflow moving forward and it’s clear that this is true for others as well. I look forward to developing best practices around this that fully respect the rigor that Lean demands, and to do so in a way that increases accessibility while adhering to open source philosophies. I do not know “how to do this right”, and I don’t think anyone does. But we will figure something out.

https://www.youtube.com/watch?v=-3TZG1NiFKA&t=780s

Timothy Chow (Nov 19 2025 at 13:36):

Eric Vergo said:

Everything builds, and the Lean compiler says things successfully type check. But I make no claims that it is a valid proof. Here's why:

I am barely a mathematician, and far from competent with Lean. At best, I am a well-intentioned amateur.

I'm afraid that I'm still confused about which of Alex Meiburg's two categories your example is supposed to fall into. Recall that the options were:

  • A project that AI vibe-coded, that successfully proves an interesting mathematical theorem, but the design and layout is terrible to the point that no one should use this.
  • The Lean code compiles, but the AI-generated proof statement is wrong in a way that makes the whole repository essentially useless as a proof artifact.

"Riddled with errors" sounds like the latter option, but I'm not sure. Maybe the "engineering versus mathematics" cultural difference is causing some communication problems. From a mathematician's point of view, if the Lean code consistently compiles, then there are only two possible "errors": (1) the statement of the theorem is not what it seems to be, or (2) there is a major bug in Lean (or the trusted code base). Are you using "error" in some other sense of the word?

Eric Vergo (Nov 19 2025 at 18:06):

I see the source of the confusion now; I was not clear when sharing the repo. It doesn’t exactly meet any of Alex’s definitions, so I will try to cover everything here:

  • The Lean project builds and there is no reason to assume that there is a kernel error or major bug
  • As for “the statement of the theorem is not what it seems to be”, I don’t think there are mistakes of this nature, but I am genuinely unsure if there are. Catching those types of mistakes is hard in my experience, and some of the underlying math is new to me, which makes it even more difficult.
  • The ‘riddled with errors’ comment was directed at the other files in the repo. There are obvious errors and inaccuracies such as broken links, improper citations, etc. The details might be worth studying, but in many ways this repo is not to be trusted.

On reflection, I should have placed more emphasis on the following: I spent ~100 hours vibe proving this, and while that is a non-trivial amount of time, this is something that would have taken me many months to produce without LLMs. We should expect that people are going to find that sort of time-saving irresistible. In the same way that journal submission teams are underwater from AI-generated slop, we should assume that the team reviewing mathlib PRs will be too. It’s hard to say when this will happen, but it seems like something that we will face in the relatively near future. The tool Kim suggested will help, but there are serious problems caused by volume that it might not be able to counter.

As for Emily’s proposal, I am still in support of it, but there are arguments against it. For instance, this would further increase the number of AI-generated mathlib submissions and add additional strain on that team. There are a lot of people who are using LLMs when writing traditional math papers, and we might not be able to handle the influx of submissions caused by the requirement of having things formally verified.

Timothy Chow (Nov 19 2025 at 22:25):

Thanks for the clarification. Let me clarify what I intended by my "amusing possibility." I had in mind LLM-generated proofs written entirely in natural language that sound highly plausible to a professional mathematician, but which are incorrect in the mathematician's sense: i.e., the proof contains huge gaps or even outright false statements. Emily gave an example (maybe not highly plausible, but at least somewhat plausible to a non-expert) early on in her talk.

Your repo also serves as an important cautionary tale, but of a slightly different kind from what I had in mind.

Timothy Chow (Nov 19 2025 at 22:40):

Yet another kind of cautionary example would be of seemingly formally verified results that are wrong in a strong sense. If you search for "maliciously doctored" in Mark Adams's slides on "Flyspecking Flyspeck" then you'll see an example of what I mean. One can also imagine maliciously doctoring GMP or some other infrequently scrutinized section of the trusted code base to create a false proof that builds without errors.
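Adams's example involves tampering with the checker itself, which no in-language snippet can reproduce, but a milder cousin of the same failure mode is easy to exhibit in Lean (illustrative names only):

```lean
-- Buried in some large, rarely inspected dependency:
axiom smuggled : False

-- Elsewhere, any statement whatsoever now "builds without errors":
theorem fakeResult : 1 = 2 := smuggled.elim

-- The defense: this reports `smuggled`, which is why auditing axioms
-- (and the trusted code base beneath the kernel) matters.
#print axioms fakeResult
```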

Eric Vergo (Nov 20 2025 at 00:11):

Tremendous. Thank you for both requesting the clarifications and working through my explanations. Shared understanding is important and I did a poor job of making that possible. I’d like to build on that and tie up a few loose threads.

Timothy Chow said:

Eric Vergo said:

Are there good arguments against full transparency beyond the overhead added when producing proofs?

What exactly do you mean by "full transparency"? 

By full transparency I mean two different things. First, and much less of a concern on my end, is end-user mathematicians disclosing the use of “AI” tools, along with the logs and techniques involved in doing so. Indeed, it is not the norm now for mathematicians to publish failed attempts and time spent on wrong ideas, but I would argue that this is, in part, a function of the fact that doing so has been impractical. The introduction of these tools significantly reduces the amount of work needed to execute on that, so maybe we should consider it. As others in the thread have noted, people may engage in undesired behavior when required to disclose AI use, and this needs to be taken into account. I don’t have solutions for much, if any, of what is highlighted here, but I do share the concerns that others have voiced.

The far greater concern, and the intent of my previous posts, is to push for full transparency from the companies producing the tools that we are going to use. What I mean by that is this: right now we are relying on the good graces of frontier model makers to have access to things like reasoning traces, a full history of actions taken, outputs, intermediate artifacts, tool calls, logs, etc. There is no guarantee that they will do this in the future, and there are already examples of end users having unexpectedly reduced access to these types of things. I am deliberately not including examples to avoid finger pointing, but a quick search will confirm this. I’m bringing this up to illustrate a point: there are implicit assumptions being made that we will always have the option of inspecting every token involved in producing a proof, and the reality is that the option may not always be there. I am not making the claim that we should make it the norm that mathematicians share all of their relevant LLM interactions moving forward, but I could imagine a future where it is. Moreover, if the norm settles at a ‘less invasive’ level, some may still choose to share as much as possible. If that is something we want to enable, we should make that clear now.

These companies are anticipating that this technology will allow them to capture trillions of dollars in value over the coming years. We should expect them to act in a way that is rational when evaluated against the incentive structures surrounding them. This is not ascribing malice to their actions, simply observing that they are operating in a competitive, high stakes environment. If we want something that is in tension with what they want, we may have to fight hard for it.

Timothy Chow said:

Thanks for the clarification. Let me clarify what I intended by my "amusing possibility." I had in mind LLM-generated proofs written entirely in natural language that sound highly plausible to a professional mathematician, but which are incorrect in the mathematician's sense: i.e., the proof contains huge gaps or even outright false statements. Emily gave an example (maybe not highly plausible, but at least somewhat plausible to a non-expert) early on in her talk.

You are very welcome. I thought Lean was going to be able to easily insulate itself from the type of things Emily shared, but it might not. 

Timothy Chow said:

Your repo also serves as an important cautionary tale, but of a slightly different kind from what I had in mind.

A cautionary tale indeed. Even though they are providing legitimate value, LLM-based tools and their outputs need to be treated in an adversarial way.

Timothy Chow said:

Yet another kind of cautionary example would be of seemingly formally verified results that are wrong in a strong sense. If you search for "maliciously doctored" in Mark Adams's slides on "Flyspecking Flyspeck" then you'll see an example of what I mean. One can also imagine maliciously doctoring GMP or some other infrequently scrutinized section of the trusted code base to create a false proof that builds without errors.

This may not even require the malice of a human pilot, just more reward hacking and someone who is not paying attention or is unaware of what is going on.

Kim Morrison (Nov 20 2025 at 01:29):

Eric Vergo said:

Just as Napster was the first warning shot to the music industry that everything was about to change, Sagredo is the first warning shot to mathematicians.

Thanks for the Sagredo shout-out. :-) I'm glad someone noticed it!

David Michael Roberts (Nov 20 2025 at 05:51):

@Eric Vergo :

These companies are anticipating that this technology will allow them to capture trillions of dollars in value over the coming years. We should expect them to act in a way that is rational when evaluated against the incentive structures surrounding them. This is not ascribing malice to their actions, simply observing that they are operating in a competitive, high stakes environment. If we want something that is in tension with what they want, we may have to fight hard for it.

And this is, I think, one of Michael Harris' main points around the move towards increasing formalisation.

Eric Vergo (Nov 20 2025 at 15:50):

Kim Morrison said:

Eric Vergo said:

Just as Napster was the first warning shot to the music industry that everything was about to change, Sagredo is the first warning shot to mathematicians.

Thanks for the Sagredo shout-out. :-) I'm glad someone noticed it!

I showed it to so many people and no one 'got it'. I couldn't believe what I was seeing, or people's lack of reaction. I'm sure you are busy with the modules launch (which looks awesome!), but I do think we should develop the tool you mentioned if/when you get some time. I'd be happy to use this repo and others for some stress testing :)

Timothy Chow (Nov 20 2025 at 19:35):

I want to mention two posts from Peter Smith's "Logic Matters" blog (this one and this one) that are potentially relevant to this discussion. They were written shortly after Holmes and Wilshaw announced the formal proof of the consistency of NF. Though the posts (and the accompanying comments/discussion) do not address LLM-generated content, they do address the role of ITPs in validating the correctness of mathematical proofs. In particular, Rowsety Moid expressed concerns about ITPs playing the role of a referee. Can ITPs be trusted, and even if they can, could their use damage mathematics by making mathematics less of a human activity?

As you'll see if you click through and read everything, I disagree with much of what Rowsety Moid says, but if we are going to be pushing ITPs as a virtual referee or guardian, then it's important to understand what kinds of objections and pushback we might encounter.

Anh Nguyễn (Nov 21 2025 at 00:54):

https://cdn.openai.com/pdf/4a25f921-e4e0-479a-9b38-5367b47e8fd0/early-science-acceleration-experiments-with-gpt-5.pdf#page8

Eric Vergo (Nov 22 2025 at 21:14):

David Michael Roberts said:

And this is, I think, one of Michael Harris' main points around the move towards increasing formalisation.

Indeed. The introduction of generative AI makes this point much sharper. I am not familiar with everything he has written but have been following his blog for a while. It's rare I find myself disagreeing with him, but I do expect the technology to get better much faster than he does. “Is it time to sell out” is one of his better pieces IMO and this line is particularly relevant: “But I have to confess that the prospect of foisting the work off on an AI agent, even Tao’s imagined GPT, looks deeply appealing.” I am deeply conflicted about this technology. The appeal is clearly there, but the concerns about what it will do to mathematics (and the world at large) are there too.

Siddhartha Gadgil (Nov 24 2025 at 11:59):

Don't want to cross post so will just link to my post: #mathlib4 > Discussion thread for Aristotle API @ 💬

Notification Bot (Nov 30 2025 at 10:29):

8 messages were moved from this topic to #Machine Learning for Theorem Proving > Erdos 124 AI Solution by Jason Rute.

Johannes Schmitt (Dec 15 2025 at 14:36):

Since it might be relevant for the discussion above: in #Machine Learning for Theorem Proving > AI-discovered proof on psi-integrals (with Lean formalizat.) I describe a recent project where GPT-5 found a proof of an open problem on descendant invariants on the moduli space of stable curves. The post includes a write-up of this project which features both the AI-generated proof and an AI-assisted formalization of a significant component of that proof in Lean. All mathematical results have been checked both by myself and by several colleagues - I would also be very grateful for any comments on the formalization. Any other feedback (on mathematics, presentation, etc.) is also very welcome!


Last updated: Dec 20 2025 at 21:32 UTC