Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: How close are we to semi-automated Mathlib bumping?
Terence Tao (May 13 2025 at 07:12):
As we are discovering in #Equational > Bump to v20.0-rc5, doing a Mathlib bump from a version that is more than a month out of date is surprisingly painful when the codebase becomes sufficiently large and complex. On the other hand, out of all the challenging autoformalization tasks out there, the task of converting a proof in "Lean + Mathlib version n" to a proof in "Lean + Mathlib version n+h" (for smallish values of h) seems like the lowest hanging fruit. How close are we to being able to develop tooling to accelerate this particular task? It seems that all the existing Lean repositories might already contain suitable training data for this task that could be scraped.
Michael Rothgang (May 13 2025 at 07:14):
To begin with, 80% of the work (deprecations and imports) can be done without any AI help, just by a simple script. There exist various versions of scripts for each of these (deprecated lemmas, deprecated modules, fixing missing imports); centralising them into one would be great.
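For concreteness, a minimal sketch (in Python) of the kind of "simple script" meant here: harvest deprecation warnings from the lake build log and apply the suggested renames as whole-word textual replacements. The exact warning format matched by the regex is an assumption, and the replacement strategy is deliberately naive; the UpdateDeprecations tool mentioned later in this thread does this more carefully.

```python
import re
import subprocess
from pathlib import Path

# Assumed warning shape: warning: <file>:<line>:<col>: `Old.name` has been
# deprecated, use `New.name` instead.  Adjust to the real output before use.
WARNING_RE = re.compile(
    r"warning: \S+:\d+:\d+: `(?P<old>[\w.']+)` has been deprecated[,:]? "
    r"use `(?P<new>[\w.']+)` instead"
)

def collect_renames(build_log: str) -> dict[str, str]:
    """Map each deprecated name to its replacement, as reported by the build."""
    return {m["old"]: m["new"] for m in WARNING_RE.finditer(build_log)}

def apply_renames(root: Path, renames: dict[str, str]) -> None:
    """Whole-word replace deprecated names in every .lean file under root."""
    for path in root.rglob("*.lean"):
        text = path.read_text(encoding="utf-8")
        new_text = text
        for old, new in renames.items():
            # Only replace full dot-separated names, not substrings of longer ones.
            new_text = re.sub(rf"(?<![\w.']){re.escape(old)}(?![\w.'])", new, new_text)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")

if __name__ == "__main__":
    log = subprocess.run(["lake", "build"], capture_output=True, text=True)
    apply_renames(Path("."), collect_renames(log.stdout + log.stderr))
```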
Michael Rothgang (May 13 2025 at 07:16):
(I don't know the answer about the remaining 20%. Certainly, not all of these repeat: if a Lean core change happens, there are probably few past changes that are exactly similar to learn from. That said, these changes should require relatively little manual work.)
Justin Asher (May 13 2025 at 17:33):
@Michael Rothgang If we can integrate those tools, we can then try layering an LLM on top of them, to which we feed the GitHub diff of Lean vA.B.C -> vX.Y.Z so it can handle the remaining 20%. Let's say it goes one error at a time, tries to resolve it, and then submits a pull request. We will probably have to automatically show it the previous changes it made related to the problem it is currently working on. This all seems doable. I know that in certain cases Lean version changes can make updating metaprogramming code difficult (this was an issue LeanDojo ran into), but bumping proofs should be fine. What do you think?
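A rough sketch of that loop, with everything LLM-specific left as a hypothetical ask_llm placeholder (prompt contents, patch format and naming are all illustrative, not an existing tool):

```python
import subprocess
from pathlib import Path

def first_error(build_output: str) -> str | None:
    """Return the first `error:` block from the build log (plus a bit of context)."""
    lines = build_output.splitlines()
    for i, line in enumerate(lines):
        if "error:" in line:
            return "\n".join(lines[i : i + 20])
    return None

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; wire this up to whatever model/API you actually use."""
    raise NotImplementedError

def bump_loop(repo: Path, upstream_diff: str, max_iters: int = 100) -> None:
    history: list[str] = []  # patches already applied, shown back to the model
    for _ in range(max_iters):
        build = subprocess.run(["lake", "build"], cwd=repo,
                               capture_output=True, text=True)
        err = first_error(build.stdout + build.stderr)
        if err is None:
            break  # the project builds; a human (or CI) can now open the PR
        prompt = (
            "We are bumping this project to a new Lean/Mathlib version.\n"
            f"Relevant upstream diff:\n{upstream_diff}\n\n"
            f"Previous fixes in this bump:\n{chr(10).join(history)}\n\n"
            f"Current error:\n{err}\n\n"
            "Propose a minimal unified diff that fixes only this error."
        )
        patch = ask_llm(prompt)
        subprocess.run(["git", "apply", "-"], cwd=repo,
                       input=patch, text=True, check=True)
        history.append(patch)
```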
Michael Rothgang (May 13 2025 at 17:36):
That seems like the natural thing to try, if one is throwing an LLM at it. I'm personally a bit sceptical whether an LLM can do well on those --- the root causes will often be different, and you won't be trying to diagnose those, right? I might be wrong, of course.
Michael Rothgang (May 13 2025 at 17:37):
I like the long-term vision of lean exe autobump, where one writes automatic migrations for each breaking change in mathlib, and Lean applies that --- ideally, that's fully type-safe and correct. (It means spending effort on these migrations, though.)
Shreyas Srinivas (May 14 2025 at 13:50):
Migrations as a concept have long pre-dated LLMs. For changes of names and default tactic configs, there is usually a simple translation from the old tactic + config to the new tactic + config. These should be reasonably simple (although in Lean, accurately parsing where one application of a tactic ends can be tricky). For example, if the new default behaviour of simp excluded decide, then a migration would replace every existing application of simp with simp +decide. But there are downsides to simple syntactic replacement.
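As a toy illustration of such a purely syntactic rule (a sketch only: the regex sidesteps the hard part mentioned above, namely knowing where one tactic application ends, so the result still has to be rebuilt and reviewed):

```python
import re
from pathlib import Path

# Shreyas's hypothetical example: if simp stopped calling decide by default,
# rewrite every bare `simp` to `simp +decide`.  The lookarounds avoid simpa,
# simp_all, simp?, simp! and calls that already say +decide; anything subtler
# (where exactly a tactic call ends) is out of scope for a regex.
SIMP_RE = re.compile(r"\bsimp\b(?![\w?!_])(?!\s*\+decide)")

def migrate_file(path: Path) -> None:
    text = path.read_text(encoding="utf-8")
    new_text = SIMP_RE.sub("simp +decide", text)
    if new_text != text:
        path.write_text(new_text, encoding="utf-8")

for lean_file in Path(".").rglob("*.lean"):
    migrate_file(lean_file)
```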
Mario Carneiro (May 15 2025 at 01:24):
Michael Rothgang said:
I like the long-term vision of lean exe autobump, where one writes automatic migrations for each breaking change in mathlib, and Lean applies that --- ideally, that's fully type-safe and correct. (It means spending effort on these migrations, though.)
See also #mathlib4 > leanup design discussion
Michael Rothgang (May 15 2025 at 06:15):
Thanks for linking - that was what I meant :-)
Oliver Nash (May 15 2025 at 08:19):
IMHO, given that we've seen AI agents come up with non-trivial proofs from scratch, it's certain that we could automate the proof repair component of bumping an upstream repository. I believe it just requires somebody with sufficient motivation to build such a tool. I also believe that what is needed is not a "proof of concept" but rather a production-ready tool.
Kevin Buzzard (May 15 2025 at 08:24):
I guess one idea would be that you train on the mathlib bump diff.
Oliver Nash (May 15 2025 at 09:01):
My guess is that there wouldn't be nearly enough such diffs for this to be more useful than just throwing a general purpose Lean proving agent at the problem.
Kim Morrison (May 15 2025 at 11:21):
A few such diffs may be useful in a prompt.
Shreyas Srinivas (May 15 2025 at 11:24):
The problem of proof repair is still a serious research question. I wonder if it will be that simple and produce consistently correct results.
Oliver Nash (May 15 2025 at 11:27):
I guess I'm just repeating myself in different words but I'm convinced that it is not a serious research question to build a very useful tool.
Eric Wieser (May 15 2025 at 11:34):
I think it's more of a difficult engineering problem than an interesting research question
Shreyas Srinivas (May 15 2025 at 11:53):
Based on the recent literature, I disagree (about the non-research nature of practical proof repair tools). But the problem in Lean upgrades is a much more classic and simple one that can be solved by migration rules applied systematically to syntax trees. Ideally you don't actually want any non-local changes to be made in any migration. For example, you don't want an old proof in mathlib replaced by a new one. You merely want pieces of the old proof, such as definition names or tactic calls, to be altered in a systematic way. In other words, modern AI tools might actively make things more complicated than they need to be, especially when the same changes need to be made across dozens or hundreds of files systematically.
Kim Morrison (May 15 2025 at 11:56):
Less talk, more code.
Huajian Xin (May 16 2025 at 10:02):
I think version migration across Mathlib releases definitely seems like one of the more tangible and data-rich proof engineering challenges—especially since breakages are often local but still nontrivial to resolve. In practice, many of these edits involve small but semantically meaningful adjustments: adapting to renamed declarations, modified type signatures, reorganized imports, or even subtle changes in tactic behavior.
To study this kind of task more systematically, we've started working on a benchmark called APE-Bench I, which focuses on instruction-guided, file-level proof edits mined from real Mathlib4 commit history. While it doesn't explicitly target version bumps (yet), many of the tasks in the benchmark are exactly the kinds of localized changes that arise during library migration or maintenance.
The long-term plan is to extend this into project-scale benchmarks (multi-file, version-aware) and eventually support tools that can assist with version migration workflows more directly. Ideally, this would culminate in LLM-based agents capable of proposing and validating edits in-context. We’re also exploring packaging this as a service layer to support interactive or automated maintenance. Paper link here for anyone interested: https://arxiv.org/abs/2405.06308
Terence Tao (May 16 2025 at 14:57):
I'm glad to see some effort put into this. In another thread, @Kevin Buzzard pointed out that once one stops maintaining a project by continually keeping Mathlib up to date, a previous successful Lean formalization project may gradually become increasingly incompatible with the latest version of Mathlib, if the proofs are not upstreamed to Mathlib in a timely fashion. This sort of "Lean proof rot" is not desirable in the long-term, so anything that reduces the friction of bumping would be welcome.
Ralph Furman (May 19 2025 at 03:07):
Do you think for each bump there will always be a moderate-sized set of rules that can be used to fix the code? Assuming there's some basic intelligence in applying the rules (e.g. in @Shreyas Srinivas's example, if it works without error then keep it as simp, but if it fails then try simp +decide; a sketch of this appears after the list below).
If so, then what level of automation is needed for building the migration rules:
- Built manually and explicitly (could be listed in PRs and then automatically collected together)
- Built interactively (for each error, the user is prompted to suggest a fix, which is then broadly applied)
- Learned to generalize from seeing examples of a few fixes
- Task an LLM or proof search system to fix the errors and generate rules
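A sketch of what the "built manually and explicitly" option could look like, wired to the keep-it-only-if-it-builds fallback from the message above; the rule format and the example rules are hypothetical:

```python
import re
import subprocess
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Rule:
    """One manually written migration rule, e.g. collected from the bump PRs."""
    description: str
    pattern: str      # regex over the old source
    replacement: str

# Hypothetical rule database for one bump; in the "built manually" model each
# breaking change would contribute an entry like these.
RULES = [
    Rule("simp no longer calls decide by default", r"\bsimp\b(?!\s*\+decide)", "simp +decide"),
    Rule("Foo.bar was renamed to Foo.baz", r"\bFoo\.bar\b", "Foo.baz"),
]

def builds(repo: Path) -> bool:
    return subprocess.run(["lake", "build"], cwd=repo, capture_output=True).returncode == 0

def migrate(repo: Path, rules: list[Rule]) -> None:
    """Whole-project version of the fallback policy: leave code that still
    builds alone; otherwise apply rules one at a time, keeping each rule only
    if the project builds afterwards.  A real tool would work per declaration
    and compare error counts instead of demanding a fully green build."""
    if builds(repo):
        return
    for rule in rules:
        changed: dict[Path, str] = {}
        for path in repo.rglob("*.lean"):
            old = path.read_text(encoding="utf-8")
            new = re.sub(rule.pattern, rule.replacement, old)
            if new != old:
                changed[path] = old
                path.write_text(new, encoding="utf-8")
        if changed and not builds(repo):
            for path, old in changed.items():
                path.write_text(old, encoding="utf-8")  # revert this rule
```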
Michael Rothgang (May 19 2025 at 07:07):
I absolutely think a tool doing "build manually" would work. (Search for leanup; Mario Carneiro sketched a tool like this.) Now, somebody "just" needs to implement that, and all PR authors need to add migrations for their breaking changes (or we need a way to detect that these are happening).
Shreyas Srinivas (May 19 2025 at 12:16):
Is there a WIP repo for the leanup project?
Mario Carneiro (May 19 2025 at 12:46):
I think I made the repo but did not put anything in it. Most of the work done on leanup is in the zulip thread
Shreyas Srinivas (May 19 2025 at 12:55):
Is it really a bad idea to manipulate terms at the level of TSyntax with sufficient context?
Shreyas Srinivas (May 19 2025 at 12:55):
I know it is brittle
Shreyas Srinivas (May 19 2025 at 12:56):
But I don't see stronger alternatives
Mario Carneiro (May 19 2025 at 13:01):
it's not really significantly easier to get TSyntax than full elaborator results if you want a 100% correct answer, because the elaborator can change the parser
Shreyas Srinivas (May 19 2025 at 13:03):
On the other end of the spectrum, if I take a fully elaborated proof term Expr and change it, while leaving the surface tactic syntax untouched, I haven't really "migrated" the proof have I?
Shreyas Srinivas (May 19 2025 at 13:03):
at what point of the elaboration should the change be made?
Shreyas Srinivas (May 19 2025 at 13:08):
For reference: https://dl.acm.org/doi/10.1145/3453483.3454033 EDIT: arXiv: https://arxiv.org/abs/2010.00774
Justin Asher (May 19 2025 at 21:57):
Has anyone considered using some of the new tools being released, like Claude Code, OpenAI's Codex, or Google's Jules, for tasks related to bumping (or ATP in general)? I have not used coding agents yet, and I am curious to hear other people's takes. I know that, for instance, Jules runs the code in a VM before submitting a PR, and since I do not yet have access, I am uncertain whether this would work for Lean 4 code (I am presuming not).
Mario Carneiro (May 20 2025 at 12:01):
Shreyas Srinivas said:
On the other end of the spectrum, if I take a fully elaborated proof term Expr and change it, while leaving the surface tactic syntax untouched, I haven't really "migrated" the proof have I?
No, that was never the suggestion. The idea is to use elaborator information to migrate syntax to syntax and pretty print it, or even better, construct diffs and apply them so that you don't muck up the formatting of the rest of the file while applying migrations.
Mario Carneiro (May 20 2025 at 12:02):
This is essentially the functionality of lake exe refactor, except leanup repackages it into a form where the migrations are applied from a database based on where your code is and where it wants to be.
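The "construct diffs and apply them" step is essentially careful text surgery; a minimal sketch, assuming the edit ranges come from elaborator/syntax information (here they are plain character offsets):

```python
from dataclasses import dataclass

@dataclass
class Edit:
    """One replacement over the [start, stop) character range of the original file."""
    start: int
    stop: int
    new_text: str

def apply_edits(source: str, edits: list[Edit]) -> str:
    """Apply non-overlapping edits from the end of the file backwards, so earlier
    offsets stay valid and untouched text keeps its exact formatting."""
    out = source
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        out = out[: e.start] + e.new_text + out[e.stop :]
    return out

# Toy usage: rename one identifier without re-printing the whole file.
src = "theorem foo : 1 + 1 = 2 := by simp\n"
print(apply_edits(src, [Edit(8, 11, "one_add_one")]))
```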
agniv (Aug 18 2025 at 18:52):
Michael Rothgang said:
To begin with, 80% of the work (deprecations and imports) can be done without any AI help, just by a simple script. There exist various versions of scripts for each of these (deprecated lemmas, deprecated modules, fixing missing imports); centralising them into one would be great.
@Michael Rothgang + others, could you link some of these? I haven't been able to easily find stuff online :(
Michael Rothgang (Sep 19 2025 at 07:31):
Sure: firstly, there are
- deprecation warnings on renamed lemmas (which stay around for at least six months, so as long as you bump e.g. at most 3 Lean versions at a time, you'll get to use them)
- deprecated modules for files renamed or moved
- for mere import changes, you currently have to (1) add import Mathlib at the beginning of the file, then (2) use #min_imports at the end of the file to minimise imports again
These are just about warnings, not yet about auto-fixing them.
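The mechanical half of that import dance is easy to script; a small sketch (the file path is hypothetical, and copying the #min_imports output back into the file remains a manual step):

```python
from pathlib import Path

def widen_imports(path: Path) -> None:
    """Step (1)/(2): put `import Mathlib` at the top (keeping existing imports)
    and `#min_imports` at the bottom, so Lean can report a minimal import list."""
    text = path.read_text(encoding="utf-8")
    if "import Mathlib\n" not in text:
        text = "import Mathlib\n" + text
    if "#min_imports" not in text:
        text = text.rstrip("\n") + "\n\n#min_imports\n"
    path.write_text(text, encoding="utf-8")

widen_imports(Path("MyProject/SomeFile.lean"))  # hypothetical path
```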
Michael Rothgang (Sep 19 2025 at 07:33):
Updating deprecations: most of it can be done with UpdateDeprecations (see #general > update_deprecations and #general > Auto-replace deprecated aliases?, and also #PR reviews > #13483 update_deprecations). That tool is in the prototype stage, but may be helpful already. Building a robust tool to do this would be very welcome.
Michael Rothgang (Sep 19 2025 at 07:33):
(And yes, that would just be engineering work - but a very useful one.)
Michael Rothgang (Sep 19 2025 at 07:34):
I don't think there is tooling for auto-replacing deprecated modules. This shouldn't be too hard to add to UpdateDeprecations.
Jason Rute (Sep 19 2025 at 07:40):
Anything that can automate notation changes, say, \sum x in Foo to \sum x \in Foo?
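That particular change is regular enough that a single rewrite rule gets most of the way; a hedged sketch that only handles the simple single-binder form of the big-operator notation and still needs a rebuild (and a diff review) afterwards:

```python
import re
from pathlib import Path

# Rewrite the old binder notation `∑ x in s, _` / `∏ x in s, _` to the newer
# `∑ x ∈ s, _` / `∏ x ∈ s, _`.  Anything more unusual may be mangled, so
# review the resulting diff and rebuild.
BIG_OP_RE = re.compile(r"([∑∏]\s*[^,\n]*?)\s+in\s+")

def migrate(path: Path) -> None:
    text = path.read_text(encoding="utf-8")
    new_text = BIG_OP_RE.sub(r"\1 ∈ ", text)
    if new_text != text:
        path.write_text(new_text, encoding="utf-8")

for lean_file in Path(".").rglob("*.lean"):
    migrate(lean_file)
```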
Michael Rothgang (Sep 19 2025 at 07:45):
There definitely was a custom deprecation warning, telling you how to fix the code. I believe that was not automated.
Michael Rothgang (Sep 19 2025 at 07:46):
(Making linters suggest the right replacement is tricky, currently. But we're slowly getting there, IIUC.)
Deleted User 968128 (Sep 23 2025 at 22:52):
Michael Rothgang said:
(And yes, that would just be engineering work - but a very useful one.)
For anything repetitive / mechanical, Claude Code or OpenAI Codex should be able to handle it cleanly. Where it might start to break down is proof repair. That said, I've heard good things about gpt-5-pro if you have the budget.
Deleted User 968128 (Sep 23 2025 at 23:17):
Justin Asher said:
I am uncertain whether this would work for Lean 4 code (I am presuming not).
Afaik, all recent frontier models are fairly well versed in Lean 4 itself. However, they do hallucinate some of the more obscure Std things and some Mathlib symbols despite being trained on it, especially when proofs get complex and you start going past transformer sliding windows. For that type of work, you'll want a good MCP installed.
I can't recommend enough using agentic search like o3-search or gpt5-search to find updated resources. I know math can move slowly, but things are starting to accelerate and change quite rapidly.
Last updated: Dec 20 2025 at 21:32 UTC