Zulip Chat Archive

Stream: mathlib4

Topic: RFC: porting strategies


Daniel Selsam (Aug 03 2021 at 16:12):

As many of you know, @Mario Carneiro and I have been working on automated porting tools. The two main tools are:

  • binport: produces lean4 .olean files from lean3 .tlean files
  • synport: produces lean4 .lean files from lean3 .ast.json files, that look great but will not elaborate/type-check in general

We are starting to think about both how best to connect them and how best to structure the overall porting effort. There are many possible strategies, for example:

  1. "top-down"

    • produce .olean files from binport
    • working backwards from the leaves:
      • produce non-working .lean file from synport that imports the previous binported files
      • tweak it manually so it elaborates
      • replace the binport .olean with the new .lean file
  2. "bottom-up"

    • start from the beginning (as in Mathlib4 repo currently)
    • in dependency order:
      • produce non-working .lean file from synport that imports only working .lean files
      • tweak it manually so it elaborates
  3. "parallel"

    • produce .olean files from binport
    • stage 1: in parallel,
      • produce non-working .lean file from synport that imports the previous binported files
      • tweak it manually so it elaborates
    • stage 2: in parallel,
      • replace the binportted .oleans with the working .lean files, and propagate the changes forward as necessary
  4. "hybrid"

    • somehow interleave manual + binport, and introduce new commands that add new rules/alignments to binport

Notes:

  • All strategies have "-sorry" variants, where nontrivial proofs are sorryed out until the final, parallel stage. Microsoft is prepared to run a lot of proof search to fill in these gaps. It might not be all that hard, since we have the original tactic scripts and the binported kernel terms, and we can mine a lot of useful hints from both.

  • There are also opportunities to use binport to help generate better .lean files. For example, binport can be used to delaborate defs/theorem statements to guarantee roundtripping. The new delaborator (https://github.com/leanprover/lean4/pull/575) has a very high roundtrip success rate while staying concise in general. However, this approach is complicated by the fact that the binported terms won't in general be exactly def-eq to the terms desired in the .lean file. Some things will delab in undesirable ways, e.g. definitions compiled by lean3's equation compiler.

  • It is also not clear how to structure the repositories, and this depends on which strategy is adopted. For example, all approaches require synport to depend on binport (for resolving/translating names), automatic proof-patching must depend on Mathlib/Tactic, and the hybrid approach requires Mathlib4 and binport to be mutually dependent.

  • We can gather more empirical information before we make any big decisions. For example, for simple binport-enhanced synport, how many squigglies are there outside of proofs? When using the binported .oleans, how problematic are the "weird" binport relics (e.g. definitions compiled with lean3's equation compiler)?

Yakov Pechersky (Aug 03 2021 at 16:25):

Is it possible to have an Okazaki fragments like approach, for the "parallel" (3) approach? That is, select landmark files as ones that are prioritized for having in .lean format. Those could be tactics, or highly imported files (finset, big_operators). Those are made in .lean, relying on imported .oleans. Then we work backward to port the imported .oleans to .lean.

This is different than just the "bottom-up" approach because once we have a good baseline of tactics/notations, the second-level porting might go smoother. We'd also find corner cases sooner with this approach than a pure "bottom-up" approach.

Patrick Massot (Aug 03 2021 at 16:32):

Daniel, where do you see the non-trivial tactic writing happening in those scenario? I mean ring, linarith, abel, tidy, norm_cast...

Patrick Massot (Aug 03 2021 at 16:33):

The top-down approach looks like it minimizes the risk of divergence. But probably we can't decide on the best strategy without trying a bit of top-down and a bit of bottom-up.

Daniel Selsam (Aug 03 2021 at 16:36):

Patrick Massot said:

Daniel, where do you see the non-trivial tactic writing happening in those scenario? I mean ring, linarith, abel, tidy, norm_cast...

Most of these (or alternatives/variants) will need to be written. When their absence will become a brick wall depends on the porting strategy adopted. For example, we can squeak by without them for a while longer if we adopt a sorry-variant approach and worry about the proofs in a second pass.

Daniel Selsam (Aug 03 2021 at 16:39):

Patrick Massot said:

The top-down approach looks like it minimizes the risk of divergence. But probably we can't decide on the best strategy without trying a bit of top-down and a bit of bottom-up.

Yes, I think "top-down" is the most incremental and the least risky. It also maintains the invariant that there is a working (if not easily refactorable) Mathlib4.

Gabriel Ebner (Aug 03 2021 at 16:39):

We can gather more empirical information before we make any big decisions. For example, for simple binport-enhanced synport, how many squigglies are there outside of proofs?

I think this is an important question. Can you make a repo with the output of the "basic" binport-enhanced synport?

Patrick Massot (Aug 03 2021 at 16:40):

Another big question (but maybe this was already answered and I missed that): what is the current status of the "old structure command debate"? Did we decide anything? What are the porting tools producing?

Daniel Selsam (Aug 03 2021 at 16:42):

Patrick Massot said:

Another big question (but maybe this was already answered and I missed that): what is the current status of the "old structure command debate"? Did we decide anything? What are the porting tools producing?

Leo is investigating whether there is a sub-astronomically-complicated way of supporting efficient structure-nesting in lean4's structure command.

Daniel Selsam (Aug 03 2021 at 16:43):

Patrick Massot said:

What are the porting tools producing?

Binport merely translates the kernel terms, which are whatever was produced by lean3. Synport is (at best) producing some kind of structure syntax that won't elaborate.

Patrick Massot (Aug 03 2021 at 16:44):

We will hit that issue very soon, whatever the strategy.

Gabriel Ebner (Aug 03 2021 at 16:54):

Ultimately all of these strategies have the same flag-day problem I think. We can't change the parts we've already ported or this will mess up the porting tools. No matter whether we do top-down or bottom-up, we'll need to essentially freeze mathlib3 when we seriously start the port. (Or, phrased differently, manually port all PRs made after the flag day.)

Gabriel Ebner (Aug 03 2021 at 16:57):

"top-down"

This requires that synport correctly translates pretty much all of the mathlib stuff (and in exactly the same way that the mathlib4 tactics expect):

  • locales
  • auto-params
  • simps

Replacing the binported olean with the tweaked lean file almost certainly requires fixing up the reverse dependencies. This could very well be a quadratic cost.

Gabriel Ebner (Aug 03 2021 at 17:00):

"bottom-up"

This is somewhat appealing because of the tactics. We could first port the basic stuff, then the tactics, then the rest. synport could then use the new oleans to delaborate the tactics proofs (in the rest part).

That said, I think it is completely infeasible to write "ABI-compatible" Lean 4 files. synport would most likely fail to combine the newly compiled oleans with the remaining binported oleans.

Gabriel Ebner (Aug 03 2021 at 17:10):

where nontrivial proofs are sorryed out until the final, parallel stage.

I think it's important that the proofs are in the ported files, even if they don't compile. Concretely, I'd prefer a sorrying (by simp) over sorry, where sorrying stx elaborates to sorry and ignores stx.

Manually fixing up proofs is much easier if you don't have to hunt down the original version in mathlib and copy it to mathlib4.

Patrick Massot (Aug 03 2021 at 19:59):

What would be even greater would be to have a way to record the Lean 3 tactic state, but I don't see how to do that without having a lot of comments in the auto-ported code.

Wojciech Nawrocki (Aug 03 2021 at 20:07):

If auto-generated Lean 4 source files included references to Lean 3 source locations, and it would be useful, we could probably add a (temporary, hacky) go-to-def that jumps to the right mathlib3 position.

Eric Rodriguez (Aug 03 2021 at 20:38):

the ast jsons that synport makes do store all the position information, so that is probably doable

Eric Rodriguez (Aug 03 2021 at 20:39):

(well, only the starts of things sadly... but that's good enough for this case)

Eric Rodriguez (Aug 03 2021 at 20:39):

I'm quite in favour of sorrying a lot of things and seeing how it goes from there

Mario Carneiro (Aug 03 2021 at 21:03):

Gabriel Ebner said:

"top-down"

This requires that synport correctly translates pretty much all of the mathlib stuff (and in exactly the same way that the mathlib4 tactics expect):

  • locales
  • auto-params
  • simps

We want to support all of this anyway. Tactics are being manually ported so in the same way there will be extensions to support mathlib-specific automation whether it goes via attributes or user commands.

Replacing the binported olean with the tweaked lean file almost certainly requires fixing up the reverse dependencies. This could very well be a quadratic cost.

We're thinking about #align annotations on tweaked mathlib files to keep binport from diverging too badly.

Manually fixing up proofs is much easier if you don't have to hunt down the original version in mathlib and copy it to mathlib4.

Currently synport will print the lean 3 declaration unmodified if it encounters a fatal error during translation of an individual command, for this exact reason. (Well, actually it prints a reconstruction of the lean 3 ast, but it has access to the original span and could go look up the actual text from the original file.)

Mario Carneiro (Aug 03 2021 at 21:03):

Patrick Massot said:

What would be even greater would be to have a way to record the Lean 3 tactic state, but I don't see how to do that without having a lot of comments in the auto-ported code.

It doesn't exist yet in the current version, but this is part of our "stage 2" plan. It will require some additional logging in lean 3, but I don't expect it to be too bad, and then the tactic information can be used to help guide the syntax generation. Just putting the state as comments in the ported code could work but I imagine it would be more harm than good in most cases since it's so noisy, so I would try to avoid it.

Mario Carneiro (Aug 03 2021 at 21:03):

@Wojciech Nawrocki said:

If auto-generated Lean 4 source files included references to Lean 3 source locations, and it would be useful, we could probably add a (temporary, hacky) go-to-def that jumps to the right mathlib3 position.

That would be great, but requires some support from the server (so maybe I should ask you?). How about an attribute like @[link "mathlib/src/foo.lean" 20 1] that can be attached to a definition, such that ctrl-clicking on link takes you to the specified location? Then mathport could generate those on each declaration.

Eric Rodriguez (Aug 03 2021 at 21:07):

Mario Carneiro said:

Patrick Massot said:

What would be even greater would be to have a way to record the Lean 3 tactic state, but I don't see how to do that without having a lot of comments in the auto-ported code.

It doesn't exist yet in the current version, but this is part of our "stage 2" plan. It will require some additional logging in lean 3, but I don't expect it to be too bad, and then the tactic information can be used to help guide the syntax generation. Just putting the state as comments in the ported code could work but I imagine it would be more harm than good in most cases since it's so noisy, so I would try to avoid it.

what's the plan for this? would probably be nice for Alectryon, too

Wojciech Nawrocki (Aug 03 2021 at 21:32):

Mario Carneiro said:

Wojciech Nawrocki said:

If auto-generated Lean 4 source files included references to Lean 3 source locations, and it would be useful, we could probably add a (temporary, hacky) go-to-def that jumps to the right mathlib3 position.

That would be great, but requires some support from the server (so maybe I should ask you?). How about an attribute like @[link "mathlib/src/foo.lean" 20 1] that can be attached to a definition, such that ctrl-clicking on link takes you to the specified location? Then mathport could generate those on each declaration.

Definitely possible. We currently require that declarations we jump to come from a "real" Lean module, but this could be (hopefully temporarily) relaxed to arbitrary files.

Mario Carneiro (Aug 03 2021 at 21:40):

The mechanism here is more like an unstructured hyperlink, like the kind you find in markdown links in comments. I don't see why there should be any connection to "real" lean modules, and I also don't think it needs to be a temporary thing - having generic links is useful for lots of things

Mario Carneiro (Aug 03 2021 at 21:41):

it doesn't have to be triggered by "go to definition" if that helps

Wojciech Nawrocki (Aug 03 2021 at 21:55):

Oh, then in VSCode you can just put the link in a comment (say line number is 10):

-- file:///path/to/file#10

For arbitrary editors I guess we'd need server support.

Mario Carneiro (Aug 03 2021 at 22:03):

one sadness is that the links will probably be machine-specific, so they aren't likely to make it to PR'd mathlib4 files

Floris van Doorn (Aug 03 2021 at 22:08):

Patrick Massot said:

What would be even greater would be to have a way to record the Lean 3 tactic state, but I don't see how to do that without having a lot of comments in the auto-ported code.

An easier solution is just to have 2 editors open: one with a compiled mathlib3 proof (so you can step through the tactic state) and one with a broken mathlib4 proof. I do that regularly for big refactors, too.

Mac (Aug 05 2021 at 06:55):

Mario Carneiro said:

one sadness is that the links will probably be machine-specific, so they aren't likely to make it to PR'd mathlib4 files

could/would making the paths relative help?

Eric Wieser (Aug 05 2021 at 07:16):

One perhaps crazy idea; what if the lean 4 code was inserted as special comments into the lean 3 code, and extracted and tested as part of the CI process. This would ensure it can't diverge from the lean 3 version, and would allow lean3 development to continue in parallel, so long as PR authors also moved around / fixed the corresponding lean 4 code. There'd still be a flag day, but on that day all that happens is the lean3 code and lean 4 comment markers are removed.

Eric Wieser (Aug 05 2021 at 07:25):

That also has the bonus off leaving behind a useful git history for mathlib that makes tracking blames through the lean4 switch not a complete pain

Daniel Selsam (Aug 05 2021 at 14:31):

Gabriel Ebner said:

Ultimately all of these strategies have the same flag-day problem I think. We can't change the parts we've already ported or this will mess up the porting tools. No matter whether we do top-down or bottom-up, we'll need to essentially freeze mathlib3 when we seriously start the port. (Or, phrased differently, manually port all PRs made after the flag day.)

Suppose we do top-down. Once doing new things in lean4 with binported oleans is almost as easy as doing new things in lean3, we could define a border: new files go in Mathlib4, old files are changed in lean3 and only when absolutely necessary. Binport would be re-run when this happens. Every successful top-down lean4 .lean file generated (by synport + automation + delab + humans) would shift the border.

Daniel Selsam (Aug 05 2021 at 14:36):

Gabriel Ebner said:

Replacing the binported olean with the tweaked lean file almost certainly requires fixing up the reverse dependencies. This could very well be a quadratic cost.

I agree that this is a concern. Two thoughts:

  • A lot of the 'not-necessarily-strictly-defeq' changes will probably not require any changes downstream. For example, replacing binported equation-compiler definitions/lemmas with the lean4 ones may be transparent.
  • Perhaps this is a good argument for a sorry-staged approach, so that at least any quadratic changes do not break proofs.

Patrick Massot (Aug 05 2021 at 14:41):

Eric Wieser said:

would allow lean3 development to continue in parallel, so long as PR authors also moved around / fixed the corresponding lean 4 code.

I think that aiming to continue parallel development of mathlib in Lean 3 is a catastrophic idea. The porting problem is already hard enough, and we will need every available quantum of energy to work on porting or cleaning up the auto-port.

Eric Wieser (Aug 05 2021 at 14:54):

I think that aiming to continue parallel development of mathlib in Lean 3 is a catastrophic idea

Isn't the alternative to throw away every open PR that exists on "flag day"? We certainly shouldn't encourage parallel development, but I expect there are or will be users interested in finishing up their old contributions with zero interest in helping with porting. With my suggestion, the lean 4 porters would be able to continue with their porting, and attach via automated tools a @[lean4 /-" lean4 code here "-/] attribute or similar to the definitions from new lean3 PRs whenever they feel like it.

Patrick Massot (Aug 05 2021 at 15:05):

I honestly think we should stop reviewing PRs on flag day.

Daniel Selsam (Aug 05 2021 at 15:07):

Patrick Massot said:

I honestly think we should stop reviewing PRs on flag day.

In the "border" proposal, mathlib3 would only accept PRs that (a) only touched files on the lean3 side of the border and (b) were motivated by the lean4 side, i.e. various kinds of backports.

Patrick Massot (Aug 05 2021 at 15:09):

I agree, but I don't think this is what Eric had in mind.

Patrick Massot (Aug 05 2021 at 15:18):

I have pretty strong feelings about this. Almost every current significant contributer to mathlib started contributing at a time when it was perfectly clear that Lean 4 was coming and that porting mathlib would be a huge task. As far as I understand we still don't have any guarantee that the easiest way won't be to redo everything from scratch (and we won't know for sure before the port is over). Almost everybody knew this (or at least should have understood this) before starting. So I think that having a total freeze time would be much much better than the expectation.

Scott Morrison (Aug 15 2021 at 23:44):

I agree with Patrick here. mathlib3 is a dead-end, and its only long-term value is in helping us build mathlib4. (Well... I guess Lean4 could fail and mathematicians could conceivably keep using mathlib3, but this is a very bad outcome.)

Obviously at this point we're not ready to close mathlib3 to new content PRs, nor do we even have much of an estimate of when that day might usefully come. We would probably want to do a merge sprint around the time of "flag day", where we stop allowing new content PRs, and then triage the existing PRs --- get everything merged that can be in reasonable time, and then close (ie abandon forever) all the remaining WIP, help-wanted, please-adopt etc PRs. Some of that work --- taking a hard look at the full list of PRs and saying to authors "now or never" --- can begin already.


Last updated: Dec 20 2023 at 11:08 UTC