Zulip Chat Archive

Stream: FLT

Topic: accelerating FLT (with tactics or AI)


Kevin Buzzard (May 07 2025 at 11:35):

I am continually asked by people how to accelerate FLT. What I personally have learnt in the first 6 months of the project is that I need to write more LaTeX, state more sorries, and set more tasks, because this does seem to be an efficient way of getting other people involved. I intend to concentrate on this part of the process a lot more in the coming months. My main problem is non-FLT-related distractions (e.g. I gave a lecture in Cambridge and a lecture in Oxford in the last week). But this question isn't about that.

The questions I get asked are of two forms. Firstly people ask "what tactics do you need to accelerate FLT" and secondly people ask "what AI tools could you use to accelerate FLT". I have 0 good answers to either of these questions. The code I've written myself -- I knew how to write it and I just wrote it, and never once did I think "I wish a tool were available to make this process quicker". But perhaps other people have different experiences? Does anyone else reading, who has been actively involved with the project so far, have any suggestions?

Yaël Dillies (May 07 2025 at 11:37):

My experience (not on FLT but Toric, but same same) is that Copilot is reasonably good at autocompletion when doing semi-repetitive tasks (i.e. tasks that are repetitive, but not repetitive enough that it's worth automating). I wouldn't say it makes me more than 5% faster, though.

Yaël Dillies (May 07 2025 at 11:42):

IMO the turning point will be when LLMs can write a full API with less than something like one Lean error per two declarations, and be receptive to feedback akin to what I would write on someone's PR (sketched in Lean below), i.e.:

  • "Use this existing definition"
  • "Make this argument implicit/explicit"
  • "Generalise from this set of typeclasses to this other set"
  • "Turn this equality/iff around"

Yaël Dillies (May 07 2025 at 11:45):

Something that AI people seem to be very focused on is having their LLMs turn PDFs into code. IMHO this is too ambitious currently, and I wouldn't personally expect an AI to do it, given that I wouldn't know how to do this myself one-shot! Instead, it seems to me much more useful to be able to give the AI a piece of trivial mathematics (like defining the categories of algebras, bialgebras, Hopf algebras) and let it write the corresponding API, so that I can focus on the hard bits.
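
For illustration, a toy version of "a definition plus the corresponding API" (invented definition, nothing to do with Hopf algebras):

```lean
-- An invented definition together with the "obvious" API lemmas a
-- reviewer would expect to accompany it.
def double (n : Nat) : Nat := 2 * n

@[simp] theorem double_zero : double 0 = 0 := rfl

@[simp] theorem double_succ (n : Nat) : double (n + 1) = double n + 2 := by
  unfold double; omega

theorem double_eq_add (n : Nat) : double n = n + n := by
  unfold double; omega
```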

Chris Birkbeck (May 07 2025 at 11:45):

You'd maybe want LLMs to help clean up Lean code for PRs? It's quite hard to turn text into Lean code, but maybe turning "bad" Lean code into "better" Lean code would help save time. We already have things like ImProver, which do something like this, but I've not really tested them.

Yaël Dillies (May 07 2025 at 11:45):

Make the trivial things trivial, worry about the harder stuff later

Yaël Dillies (May 07 2025 at 11:46):

Chris Birkbeck said:

You'd maybe want LLMs to help clean up Lean code for PRs?

That would be a huge timesaver for me given how much time I spend reviewing style! If we could have advanced autosuggestions on mathlib, I would have more time to work on difficult projects like Toric and FLT.

Chris Birkbeck (May 07 2025 at 11:51):

Yeah, I want to just have an LLM mathlib bot that cleans and PRs stuff for me, which I know is maybe ambitious, but if it could do a first pass (golfing, naming things correctly, fixing typos in docstrings, etc.) that would be a nice start.

Ruben Van de Velde (May 07 2025 at 11:56):

As somewhat more of an "AI" skeptic, I think that "telling people what you want to see proved" is probably going to be more efficient for now, at least :)

Yakov Pechersky (May 07 2025 at 16:55):

On GitHub, GitHub Copilot does somewhat useful code reviews and is able to highlight unexpected changes. Something like this that is Lean/Mathlib-cognizant, offering comments like "this should be an implicit variable instead" or "naming is off" or things of that nature, could help in review.

Shreyas Srinivas (May 07 2025 at 18:12):

Given recent events around DeepSeek, please consider the risk of allowing AI programs to write proofs in your project. Even if you review everything, the maintenance headache will be nontrivial.

Shreyas Srinivas (May 07 2025 at 18:13):

I also recommend finding a maintainer structure similar to mathlib's, where you have subproject tags and different people responsible for different tags.

Zack Weinberg (May 07 2025 at 18:24):

I am, among other things, a cognitive scientist, and I can say with very high confidence that LLMs are Not What You Want if your goal is correct proofs. They are "just" Markov processes with much larger context than traditional Markov processes. They don't have any internal notion of anything besides "it is probable that this block of text should be continued by adding these words in sequence". This makes them very good at aping the style of human writing and very bad at anything that requires some kind of correctness.

Eric Wieser (May 07 2025 at 18:44):

But writing _lean_ proofs doesn't require any correctness; you can just keep trying until you get one that lean says is "correct"!

Shreyas Srinivas (May 07 2025 at 19:18):

See the DeepSeek v2 threads.

Jason Rute (May 08 2025 at 11:37):

I really hope the takeaway from the DeepSeek bug is NOT that we can't trust AI proofs in Lean. The takeaway should be that we just need a slightly modified notion of correctness. (Namely: passes the kernel, doesn't have any special axioms, and the theorem statement matches a reference theorem decided ahead of time.) If we do this right, and it is not too hard to get it right, then no bugs in the front end or in metaprogramming will get in the way of correctness. We can trust AI proofs as much as human proofs. (Whether they are useful is another matter, of course, and that is Kevin's main question right now.)
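
Part of this check is already mechanical in Lean; for instance, with a toy theorem (invented name):

```lean
theorem two_add_two : 2 + 2 = 4 := rfl

-- `#print axioms` lists every axiom the stored proof term depends on;
-- a proof that had been closed with `sorry` would report `sorryAx` here.
#print axioms two_add_two
```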

Alex Kontorovich (May 14 2025 at 21:57):

Chris Birkbeck said:

Yeah, I want to just have an LLM mathlib bot that cleans and PRs stuff for me, which I know is maybe ambitious, but if it could do a first pass (golfing, naming things correctly, fixing typos in docstrings, etc.) that would be a nice start.

Have you tried, e.g., Claude on tasks like this? You could feed it the whole file (or even one lemma at a time) and ask if it has suggestions for better lemma names (it's remarkably good at following conventions), writing better docstrings, etc...

Chris Birkbeck (May 20 2025 at 16:25):

So I tried this before with ChatGPT and Gemini and found they kept giving me broken Lean code, even when trying to do simple cleanup, but maybe that's too ambitious. But this was some time ago, and I've also not tried Claude; maybe that's better for these things.

Kevin Buzzard (May 20 2025 at 16:48):

I can't tell all these LLMs apart, but people who use these things definitely have preferences about which ones are or aren't good for Lean.

Kim Morrison (May 21 2025 at 02:40):

Claude 3.7 is dramatically better than 3.5. Gemini 2.5 is good (and very good at long informal mathematical reasoning), previous Geminis are bad. As far as I can tell, all the OpenAI models are still not good at Lean.

TongKe Xue (Dec 10 2025 at 08:43):

I know nothing about FLT besides its statement and that Andrew Wiles proved it.
I stumbled across this thread while doing research on Lean4 UI and using AI / crowdsourcing to speed up proofs.

I have some really basic questions:

  1. What is the current bottleneck in formalizing FLT? Is it defining theorem : ... := by sorry, or is it filling in the sorry with proofs? (See the sketch below.)
  2. If the latter, have "LLM provers" and/or crowdsourcing been of use?
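
Schematically, the distinction in question 1 (toy lemma, invented names):

```lean
-- Activity 1: stating the lemma, proof deferred with sorry.
theorem my_lemma (n : Nat) : n ≤ n + 1 := by sorry

-- Activity 2: filling in the sorry (by a human or a prover).
theorem my_lemma' (n : Nat) : n ≤ n + 1 := Nat.le_succ n
```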

From the above thread, it seems that, as of 6 months ago, "LLM provers" had not been very useful; but I'm curious (1) what the current state is, and (2) why they're not more useful -- i.e. if LLM provers can solve IMO/Putnam/some Erdős problems, it seems they might be able to eliminate theorems here and there.

Thanks!

Kevin Buzzard (Dec 10 2025 at 09:14):

There are two different questions: what is the bottleneck in giving an arbitrary proof of FLT, and what is the bottleneck in doing what I am doing. For the first question, I would imagine that it's the same bottleneck as formalizing an arbitrary hard theorem in the literature -- too much human-written stuff, not enough Lean. For what I'm doing, there is no bottleneck, but what I am doing is writing idiomatic code and growing mathlib and developing a proof which is not easily extractable from the human literature and training people and doing all sorts of other things. Solving an Erdős problem which takes 5 minutes to explain to a smart undergraduate is extremely different from solving FLT, which takes hundreds of hours to explain to an expert.

TongKe Xue (Dec 10 2025 at 09:34):

I need to clarify a misunderstanding.

I'm not comparing FLT to an IMO/Putnam/Erdős problem -- that would be insulting. :-)

I'm comparing a single lemma of the form theorem : ... := by sorry that (1) has been stated and (2) assigned to some human to fill in. My (very limited) understanding is that you plus some small core group break FLT down into many small lemmas (like chapters, sections, subsections, paragraphs in a book), and these "small lemmas" get solved by individuals. I'm comparing IMO/Putnam/Erdős problems to these small lemmas, and wondering "hey, can these small lemmas be solved by Aristotle / other LLM provers?"

Another misunderstanding.

I think I misunderstood the purpose of the FLT project. I incorrectly thought the sole goal was to formalize FLT. Now I'm starting to see that the greater vision/impact is to use formalizing FLT as a way to (1) train the next generation, (2) build up mathlib, (3) etc ...

So I guess in this case crowdsourcing / LLM provers are less useful, as much of the value comes from the 'side effects' of formalizing FLT.

Kevin Buzzard (Dec 10 2025 at 12:00):

So these small lemmas do not really exist right now; most of the holes are big theorems.

Occasionally I break down a big theorem into some other big theorems, and the glue is some small lemmas. Then the small lemmas are typically filled in by me or other humans, often quite quickly.

Probably an AI could attempt to solve some of these small lemmas. However, there are two (unrelated) problems here.

1) The small lemmas are all about technical mathematical objects whose definitions are not in mathlib and whose definitions might only have just been made by me in the FLT repo. So the AI doesn't know anything about these definitions, which might make things harder.

2) If an AI comes along and writes a crappy unidiomatic ten-times-as-many-lines-as-it-should-be proof of a lemma, then in my mind this is a step backwards not forwards, because it will take years to prove FLT and now I have some crappy unidiomatic badly-proved lemma which I have to maintain for a long time. I would rather have a human proof. "Filling in proofs of small lemmas" is not the bottleneck.

Time and time again people come along and say "I am good at AI, let me help you prove FLT" and I always say the same thing to them, and I'll say it to you too: if we want a full proof then we need much more than just the Wiles and Taylor--Wiles paper (which my team are working on, we're doing fine and don't need any AI help right now). Right now we are assuming a whole bunch of material which was in the references for the Wiles and Taylor--Wiles papers and nobody is working on this stuff. Here is a concrete example:

https://github.com/ImperialCollegeLondon/FLT/blob/main/FLT/Assumptions/Mazur.lean

The proof of the axiom in that file is a 100 page paper by Mazur whose reference is in the docstring. In reality it is big problems like that which are the bottleneck. If you can get an AI to read the 100 page paper by Mazur and all the references of that paper and then formalize a proof of the axiom, that would be great. I feel like the answer to your original question "what are the bottlenecks" is really this: the bottlenecks are all the assumptions which we are going to make and which nobody is working on.
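
Schematically, such an assumption file pins the missing result down as an axiom whose docstring points at the literature (a rough sketch only; the real Mazur.lean states a precise mathematical proposition rather than a placeholder):

```lean
/-- Placeholder standing in for the precise statement in the real file. -/
def MazurStatement : Prop := sorry

/-- Proved in a roughly 100-page paper of Mazur (full reference in the
real file's docstring); assumed as an axiom in the FLT project for now.
Discharging it means formalizing that paper and its references. -/
axiom mazur_assumption : MazurStatement
```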

From January to March I will be giving a course on FLT, and I will be adding many more of these assumptions.

TongKe Xue (Dec 10 2025 at 12:59):

Thank you for the detailed reply. I have learned:

  1. filling in "small lemmas" is not the bottleneck -- they often don't exist yet, and when they do they are quickly filled
  2. machine "proofs" can often have negative value if they are difficult to read and maintain
  3. the real problem is all the material assumed in the Wiles paper (but not yet formalized)

Kevin Buzzard said:

Time and time again people come along and say "I am good at AI, let me help you prove FLT" and I always say the same thing to them, and I'll say it to you too: if we want a full proof then we need much more than just the Wiles and Taylor--Wiles paper (which my team are working on, we're doing fine and don't need any AI help right now).

I should clarify. This was not my intention at all (but my fault for not being clear when posting here).

I was continuing my train of thought along the lines of #general > LeanGameServer x Undergrad Math ?, and was wondering: "Why is it that autoformalization tools haven't formalized all of the Springer undergrad/grad texts, given that the difficulty of a single lemma/theorem there is often <= an IMO/Putnam/Erdős problem?" Mentally, I could not resolve these two claims:

  1. LLM provers are good at solving IMO/Putnam/Erdős problems.
  2. LLM provers have not yet autoformalized all the Springer undergrad/grad texts.

That's what led me here, wondering what the bottlenecks in formalizing FLT are.

What I have realized through this discussion is that formalizing Springer math books is very different from formalizing FLT, in that: (1) Springer texts are self-contained, while FLT relies on many external papers; (2) Springer texts have nicely spaced lemmas, while in FLT these lemmas are still being formulated.

Thank you for patiently sharing your experiences. I learned quite a bit from this. :-)

PS: Still surprised that, despite the power of these LLM provers, we don't have a full autoformalization of all the Springer books.

Thomas Bloom (Dec 10 2025 at 13:52):

LLM provers are not 'good' at solving Erdős problems (at least, not at the moment). They have solved (with varying degrees of human intervention) a couple of simple questions that Erdős happened to have asked, which therefore qualify as Erdős problems in a broad sense. These questions are on the level of IMO-style problems.

These should be distinguished from the actually important Erdős problems, those that have driven progress in number theory and combinatorics for decades, and which many people have worked on and contributed to. LLM provers have yet to say anything new about problems of this type.

Kevin Buzzard (Dec 10 2025 at 14:20):

There are two different things here. There's machines doing new things autonomously (where there has been recent incremental progress, but still nothing which is going to make researchers start sweating -- yet) and there's machines translating the literature. I am not an expert in either of these things -- I am an algebraic number theorist manually translating interesting mathematics (not all of which is publicly available) into Lean, and right now I have no particular desire to try to get machines to help, because I think that, given the current abilities of such systems, it is not time-effective for me to go down that path.

I think it's a reasonable question to ask "why aren't machines formalizing all the textbooks?", and it is not a question I am qualified to answer (nor one which is on topic for this channel). But certainly it is true that machines are not yet formalizing all the textbooks, and it is also true that the machine-written mathematics which I have seen is not of high quality (yet), so it's not entirely clear to me whether formalizing a textbook is worth the cost (we won't want it in mathlib, and it will cost money to maintain).

My RenPhil grant https://www.renaissancephilanthropy.org/a-dataset-of-modern-formalized-theorem-statements is an attempt to make some progress here: getting humans to formalize definitions, so that at least we can state what's going on in the textbooks in an idiomatic way (I would not trust machines to do this yet). Then, if machines do start making progress on the textbooks, we will hopefully end up with idiomatic statements at least, even if the proofs remain horrible and unmaintainable.

Bryan Wang (Dec 11 2025 at 02:21):

"Why is it that autoformalization tools haven't formalized all of the springer undergrad/grad texts -- because the difficulty of a single lemma/theorem there is often <= an IMO/Putnam/Erdos problem."

For me personally, I think the biggest bottleneck to formalising anything from Springer graduate texts to Mazur's theorem is getting human experts (by 'expert' I just mean humans who understand the underlying mathematics well enough) to work on the formalisation, with or without AI assistance. There is only so much one human can do in a finite amount of time; yet conversely, if there is no human expert in the loop to at least look through the AI output, then I think we are back to the proverbial "tree falling in a forest".

Unfortunately it is true that in formalisation one does have to spend a relatively disproportionate amount of time and space on details which the average human expert in informal maths would consider 'obvious' or 'easy' (noting of course that these are highly subjective words). Therefore to me the ideal use case for AI tools is to help with the formalisation of these 'routine' prerequisite details, which would save even expert formalisers some time and effort.

However, because essentially the only consistently reliable reward signal that AI models get is "does the Lean kernel give me a double blue tick", and not "can a human actually read my formalisation and give me a double blue tick", the current state (at least as far as I'm aware) is that AI models produce extremely long, inscrutable, and/or unidiomatic statements and proofs (at least, relative to what a human formaliser would write).

Now imagine you're a human expert in informal maths with no prior experience in Lean, and you see the AI model produce thousands and thousands of lines of inscrutable code, all to prove some sublemma which was 'obvious' to you anyway. Would you want to continue plowing through the entire formalisation? The end product could be hundreds of thousands or even millions of lines of code, which you're not even sure if any other human would ever want to go through, all to formalise some 'standard material' for which there is no doubt about the correctness anyway.

In other words, formalisation of known material, at least to me, is not just about "getting that double blue tick", there is also a notion of style, coherence and understanding involved (which is to me the main point of mathlib and its review process), and this will always be the bottleneck in the near-term, with or without AI assistance.

I liken it to the following: formalising textbook material is not unlike writing a new reference textbook on the topic (the go-to example is Stacks, but there are many other examples even within traditional literature). What's the point of a new textbook if no other student or researcher wants to read, use or even maintain it? It certainly wouldn't sell well, that's for sure.

Anyway, this discussion is perhaps tangential to 'top-down' FLT (but maybe less so for 'bottom-up' FLT), and probably can be moved elsewhere...


Last updated: Dec 20 2025 at 21:32 UTC