Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Is Autoformalization needed?


Joseph Myers (Dec 03 2024 at 14:39):

Jason Rute said:

If we want to make formal math relevant, we really need to up our game and figure out how to automatically formalize large chunks of mathematics.

A reasonable statement of the high-level goal here might be "mathlib contains most of the dependencies for current research and other mathematical activities, across a broad range of areas of mathematics".

We need to figure out how to grow mathlib to 10 or 100 times its present size to get there. Autoformalization may be part of that, but only part. Another part might be attracting the interest of 100 grad students in different areas of mathematics who then formalize the content of all the graduate-level courses they take and contribute it to mathlib. Another part might be getting funding agencies to believe that building libraries of mathematics for computers is a useful thing to support. (And another part would be speeding up Lean and improving our infrastructure so it can handle 100-times-larger mathlib.)

People working on autoformalization can consider "build an autoformalizer that produces substantial amounts of material that's close enough to being mathlib-ready to save humans lots of time in preparing mathlib PRs with such new material" as the basic challenge here. It is, of course, not so readily quantifiable as most benchmarks are, and as much about producing a usable product as about the underlying basic research in AI.

Jason Rute (Dec 03 2024 at 14:42):

Do you think Mathlib is only 10 to 100 times too small to do current research? If so, then scaling by humans may be feasible. It could certainly be feasible to cover standard undergrad/grad course material. But math research is really broad and diverse.

Jason Rute (Dec 03 2024 at 14:49):

Joseph Myers said:

It is, of course, not so readily quantifiable as most benchmarks are, and as much about producing a usable product as about the underlying basic research in AI.

I think in 2025 and 2026 we are going to see a major shift from AI for Math as a pie-in-the-sky research question to AI for Math as a usable product. (Both in industry and open source.)

Joseph Myers (Dec 03 2024 at 14:52):

At some point we might need to do better at working with separate specialized libraries on top of mathlib (while still keeping the close integration of current versions of everything that we get with the mathlib monorepo, so that people can use the current versions of several such libraries together - the libraries might be considered logically part of mathlib, but not all in the same repository). At 100 times I think it's still plausible to put everything in a monorepo (with better tools to avoid people needing 500GB of .olean files in every project when the project only uses a small proportion of that). We'd still want very good (possibly AI-assisted) refactoring tools to help in e.g. updating a 100-times larger mathlib the next time we change the definition of a group.

When we get to 10 or 100 times the present size, we'll have more information to reassess how large mathlib might end up.

For "formalize the entire extant mathematics literature, including things which little to nothing depends upon" (much more than what might end up in mathlib), I expect autoformalization (with reasonable abilities to correct minor mistakes and gaps as it goes along, while logging more significant issues it couldn't fix for the subsequent attention of anyone interested) is definitely needed.

Jason Rute (Dec 03 2024 at 23:18):

Jason Rute said:

If we want to make formal math relevant, we really need to up our game and figure out how to automatically formalize large chunks of mathematics.

I realize I was sloppy with this statement. I think formal math is already relevant in general and getting more so. I was really talking about formal math being relevant to applying AI to advanced mathematics like that found in FrontierMath, which I did not make clear. I think we can’t expect AI to generate a formal proof of an advanced mathematical topic without the background math also being formalized. Either that background needs to be formalized on the fly (auto-formalization) or formalized ahead of time (which seems to be a big ask without auto-formalization assisting the effort). To that extent I guess I’m also talking about the effort to formalize most of modern mathematics (which may not be the current goal of Mathlib, but is certainly the dream of the community).

Jason Rute (Dec 03 2024 at 23:22):

But maybe @Joseph Myers’s point is that I’m overestimating the background math needed for modern math (or for FrontierMath).

Kim Morrison (Dec 03 2024 at 23:24):

No, I don't think you are. And even if 10x present size were sufficient, I'm not convinced we're seeing exponential growth in Mathlib at present, and I don't want to wait 50 years. :-) Auto-formalization, whether autonomous or more likely collaborative, is on the critical path.

Joseph Myers (Dec 04 2024 at 00:50):

My point wasn't to express confidence in a particular estimate how much background mathematics is needed in future mathlib, but rather to express lack of confidence in what form the contributors building such a future mathlib might take.

If we can achieve a 100x larger mathlib through exponential growth, it will also have a 100x larger contribution rate. An individual contributor who is 100x as productive as the most productive individual contributors today is surely an AI or very heavily assisted by one. But that's not the only way to get a 100x larger contribution rate. You could have 100x as many individual contributors - formalization is still sufficiently niche at present that if it became a mainstream thing for a large proportion of mathematics students to do, there's probably room for that. Or 10x as many individual contributors, each 10x as productive through AI assistance. Or 25x as many individual contributors, each 2x as productive through AI assistance and 2x as productive through more traditional automation and smarter tactics, for example. (Both AI and more traditional automation also have a lot of scope to help with the PR review process, for example, as well as with initially writing Lean code.)

I don't think the amount of productivity improvement from autoformalization is readily predictable at present. If anything I'd be more confident in "grow the number of human contributors 10x" being on the critical path for integrating 100x the rate of contributions, even if most of the code in those contributions is AI-written and AI is significantly assisting in the review process.

Oliver Nash (Dec 04 2024 at 09:21):

I agree that autoformalisation is obviously important (and I currently enjoy using it). However at least for me (and I suspect in general) the main thing that makes developing Mathlib slow is refactoring the informal literature. Sometimes I can do this on the fly in my head but often I need to sit down with paper and pencil.

This Zulip instance is full of threads of people doing this. E.g., two "live" ones I find interesting are
https://leanprover.zulipchat.com/#narrow/channel/116395-maths/topic/Moreira-2001.20.28generalization.20of.20Sard's.20theorem.29/near/485461646 and https://leanprover.zulipchat.com/#narrow/channel/116395-maths/topic/Refactor.20.60krullTopology.60.3F/near/485552967

Oliver Nash (Dec 04 2024 at 09:22):

Almost always the informal literature is heavily optimised (e.g., level of generality) for the exact direction that the author wants to travel but when writing Mathlib I want to create something that is good no matter which direction we want to travel.

Oliver Nash (Dec 04 2024 at 09:22):

So I think one of the most interesting things to attack in autoformalisation is this "writing Mathlib" job where you refactor as you go.

Oliver Nash (Dec 04 2024 at 09:22):

I want to call this out explicitly because there is an understandable naive viewpoint in which the job seems to be an almost-idea-free "translation" LaTeX -> Lean. This is far from the case. I might also note that I don't think Bourbaki were bottlenecked on typesetting.

Oliver Nash (Dec 04 2024 at 09:23):

Speaking of bottlenecks, I suspect Mathlib's bottleneck is actually review bandwidth (which does have a positive trend, albeit noisy as key people get temporarily busy) so we can hope for superlinear growth.

David Renshaw (Dec 04 2024 at 12:55):

I agree that autoformalisation is obviously important (and I currently enjoy using it)

I'm guessing that @Oliver Nash is referring to proprietary autoformalization tools here, and I wonder when something similarly enjoyable might be available for everyone.

Notification Bot (Dec 04 2024 at 13:15):

14 messages were moved here from #Machine Learning for Theorem Proving > FrontierMath: Formalising Benchmark Problems by Jason Rute.

Jason Rute (Dec 04 2024 at 13:16):

I moved the messages about autoformalization to a new topic.

Jason Rute (Dec 04 2024 at 13:23):

@David Renshaw Both AlphaProof (which I assume is what @Oliver Nash is referring to) and DeepSeek-Prover use auto-formalization to translate problems for training. The latter has been released here: https://huggingface.co/collections/deepseek-ai/deepseek-prover-66beb212ae70890c90f24176 I think the same model is used for both tactic prediction and autoformalization, so one could try to see if it works for autoformalization. Any advice on how to do that, @Huajian Xin?

Jason Rute (Dec 04 2024 at 13:23):

I think the other best option is to use public LLMs, but those often struggle with Lean 3 vs Lean 4 syntax. Or @Siddhartha Gadgil used to maintain an autoformalization tool, LeanAide, which hooked into Lean for feedback. I don't know what became of that. Again, this is part of the bigger issue that most of the research papers are not practical tools which are maintained for real users.

Oliver Nash (Dec 04 2024 at 13:34):

@David Renshaw Jason is right: proprietary. Unfortunately that's all I can say for now.

Siddhartha Gadgil (Dec 04 2024 at 13:39):

Jason Rute said:

I think the other best option is to use public LLMs, but those often struggle with Lean 3 vs Lean 4 syntax. Or Siddhartha Gadgil used to maintain an autoformalization tool, LeanAide, which hooked into Lean for feedback. I don't know what became of that. Again, this is part of the bigger issue that most of the research papers are not practical tools which are maintained for real users.

Thanks for mentioning LeanAide. It is actively maintained and under active development - the team shrank to just me in practice but has expanded again. It still has some way to go to be production-ready but is not bad: for instance, it correctly translates about half of the PutnamBench problems as of now.

Siddhartha Gadgil (Dec 04 2024 at 13:40):

To me tooling is just as important as performance. Hopefully that aspect will improve too.

Huajian Xin (Dec 04 2024 at 13:56):

Yes. DeepSeek-Prover V1.5 (both the SFT and RL versions) can be used to perform autoformalization tasks with the same prompt given in the V1 paper (image.png). Few-shot examples may also be useful.
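For readers who want to try this, a minimal sketch of how such a few-shot autoformalization prompt might be assembled (the template, wording, and worked example here are placeholders invented for illustration, not the actual prompt from the DeepSeek-Prover V1 paper):

```python
# Hypothetical few-shot prompt builder for informal -> formal (Lean 4)
# translation. The example pair and template text are placeholders.
FEW_SHOT_EXAMPLES = [
    (
        "Prove that the sum of two even integers is even.",
        "theorem sum_even (a b : Int) (ha : Even a) (hb : Even b) :\n"
        "    Even (a + b) := by\n"
        "  exact ha.add hb",
    ),
]

def build_prompt(informal_statement: str) -> str:
    """Assemble a translation prompt with few-shot examples prepended."""
    parts = ["Translate the following problems from natural language to Lean 4.\n"]
    for informal, formal in FEW_SHOT_EXAMPLES:
        parts.append(f"Informal:\n{informal}\nFormal:\n{formal}\n")
    # The target problem ends with an open "Formal:" for the model to complete.
    parts.append(f"Informal:\n{informal_statement}\nFormal:\n")
    return "\n".join(parts)

prompt = build_prompt("Show that every natural number is either even or odd.")
print(prompt.count("Informal:"))  # 2 (one example plus the target)
```

The resulting string would then be passed to the model for completion; the actual prompt format used for DeepSeek-Prover should be taken from the V1 paper.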

Joseph Myers (Dec 04 2024 at 21:09):

Oliver Nash said:

So I think one of the most interesting things to attack in autoformalisation is this "writing Mathlib" job where you refactor as you go.

This job is more than "just" generalizing for use outside the original context of the sources; it includes writing all the API lemmas that don't appear in informal mathematics because they're too obvious to humans, or because humans think of variant statements as being the same thing. For example:

  • Preemptively building API for a definition immediately after adding that definition, by adding all the manipulation lemmas and obvious facts you can think of.
  • Adding API lemmas all over mathlib whenever you discover one is missing (often in a part of mathlib you're using but that isn't the focus of your work). Any significant formalization project is liable to come up with many of these, which are best PRed into mathlib as you go along rather than accumulating a backlog. (I've done more than 30 PRs like this so far from AperiodicMonotilesLean without even starting to PR any of the new definitions and associated API there relating to tilings. Even when formalizing competition problems, three of the five IMO 2024 formalizations I've done so far involved such additions to mathlib proper.) Getting this from autoformalization would need the AI to have a sense of "this step ought to be in mathlib separately, and should be generalized this way", or else a human to look over the AI output pretty closely to find such cases.
  • Adding multiple versions of a result that seem like the same thing informally but whose formal statements differ in ways that help with applying the result in different contexts. As with small API lemmas, this can be either preemptive (at the time the original lemma is added) or reactive (discovering an existing lemma is inconvenient to apply and abstracting the conversion to a more convenient variant into a new lemma).
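To make the kind of API-building described above concrete, here is a toy Lean 4 sketch (the definition and lemma names are invented for illustration; real Mathlib API is of course far richer):

```lean
-- A toy definition, followed by the "obvious" API lemmas that informal
-- mathematics would never state explicitly.
def double (n : Nat) : Nat := n + n

@[simp] theorem double_zero : double 0 = 0 := rfl

theorem double_succ (n : Nat) : double (n + 1) = double n + 2 := by
  unfold double; omega

-- Two "morally identical" variants whose formal statements differ;
-- each is more convenient in a different rewriting context.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double; omega

theorem double_eq_mul_two (n : Nat) : double n = n * 2 := by
  unfold double; omega
```

None of these lemmas would appear in an informal text, yet each is exactly the kind of thing a downstream formalization ends up needing.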

Mario Carneiro (Dec 04 2024 at 22:17):

Joseph Myers said:

  • Preemptively building API for a definition immediately after adding that definition, by adding all the manipulation lemmas and obvious facts you can think of.

This one is at least conceivably within the realm of AI automation, although I don't know if it's been attempted as a benchmark task

Mario Carneiro (Dec 04 2024 at 22:18):

Joseph Myers said:

  • Adding multiple versions of a result that seem like the same thing informally but whose formal statements differ in ways that help with applying the result in different contexts.

Present AIs seem to be terrible at this, because they are too quick to gloss over the details and treat morally equivalent statements as actually equivalent

Jason Rute (Dec 04 2024 at 22:38):

I think right now autoformalization is best as a cannon firing scattershot. A lot of it won’t hit the target, but if we use it not to build clean libraries but rather libraries of material that can be processed and cleaned later, then it will have (and already has) tremendous value.

Jason Rute (Dec 04 2024 at 22:39):

And I don’t just mean via rejection sampling. We can get so much value from slightly wrong or slightly different translations.

Xiyu Zhai (Dec 07 2024 at 09:49):

Autoformalization is not just an AI problem. It's about system design, version control, and a whole bunch of other things. LLMs or machine learning alone obviously can't deliver a satisfactory result.


Last updated: May 02 2025 at 03:31 UTC