Zulip Chat Archive

Stream: Equational

Topic: Refactoring the Lean file structure

Terence Tao (Sep 27 2024 at 15:10):

We are rapidly approaching the point where the trivial Lean file structure we have right now of putting everything in a single Basic.lean file is not going to be sustainable. So I would like to open a discussion on how best to structure the Lean code in a way that can sustainably handle many submissions (both large and small) from various contributors working on different portions of the graph (as opposed to the situation until yesterday in which the human contributors were mostly focused on a small (~20 vertex) portion of the graph).

Here is what I am thinking as a first pass of a file structure:

A Magma.lean file to store the basic API for Magmas (which currently is extremely rudimentary, but perhaps will become more sophisticated as per @Adam Topaz 's proposed refactoring)
A Equations.lean file, importing Magma.lean, that contains definitions of all the 4694 equations, with optional docstrings.
Various contributed lean files, importing Equations.lean, that contain some set of implications and anti-implications, which could be computer-generated, human-generated, or some mix of both. For instance one could have a project to completely map out some designated portion of the implication graph, in which one imports some large computer-generated lean files to settle all the "easy" implications and then have the human participants fill out the remaining ones. The human proofs thus generated could then serve as templates to create further automated files to locate all other implications that can be resolved the same techique.

The point is that these latter lean files could be worked on almost independently, and we would not need to coordinate via a single "Outstanding tasks" thread (which is rapidly becoming unwieldy) but could instead split into smaller groups.

One technical issue with this structure is that it is possible that the same implication might be proven by two different groups, so that we have the same theorem with the same name proven in two different files. This creates a conflict when importing from both files. Perhaps there is some workaround for this issue using namespaces? I'd be open to suggestions on this.

Joachim Breitner (Sep 27 2024 at 15:24):

That's exactly the structure I am already assuming in https://github.com/teorth/equational_theories/pull/19.
The splitting is also very helpful if some files are slow to process, to process them in parallel.

Refinement suggestions: Put all generated lean files in a subdirectory Generated/

Joachim Breitner (Sep 27 2024 at 15:24):

The duplicate names is only a problem if you try import all files into one. Maybe in the first phase it suffices if the lemma statements follow a certain pattern that can be read using a python script, when producing graphs and other aggregate information?

Floris van Doorn (Sep 27 2024 at 15:39):

I recommend one additional layer: 1 or more files that import all the files with the implications and non-implications. These can

check for duplicate lemmas
automatically generate Hasse-diagrams, todo-list, ...

Joachim Breitner (Sep 27 2024 at 15:41):

I wonder if that should really be a lean file at this point. Processing the auto-generated lean files can be very slow, and you don't really need a lean import here - hence my suggestion to implement that part pragmatically with python. No need to actually check all the proofs if you just want to update the diagrams and todo-lists. Proof checking can be done on CI, and shouldn't slow down anyone else working with the data.

Floris van Doorn (Sep 27 2024 at 15:46):

One option is that this last layer just outputs a list of pairs of proven implications and a list of pairs of disproven implications, and then programs in other languages can use these lists for data analysis. That seems more robust than trying to parse Lean using Python...
We can also have a CI step run Lean with the option set_option debug.byAsSorry true, which will not verify any proofs but still can generate this data.

Joachim Breitner (Sep 27 2024 at 15:53):

If you say so… I expect you don’t gain that much from gathering the information of what is proved where from lean over something simple textual, but will run into otherwise avoidable hurdles. Anyways, as usually, power to those who do actually do it :-)

Joachim Breitner (Sep 27 2024 at 16:03):

How would you envision the all-importing file to collect the proven and dispoven implications?

Checking for the existence of theorems with an expected name for each implication/nonimplication probably won’t scale (because we probably do not want to explicit list all nonimplications) and will cause problems with duplicates.

So maybe go through the environment (or through a list of explicitly tagged theorems) and look at their type, looking for certain shapes? e.g. those of the form

∀ (G: Type*) [Magma G], Equation4513 G → Equation4512 G
∃ (G: Type) (_: Magma G), Equation387 G ∧ ¬ Equation42 G
∃ (G : Type) (_ : Magma G), Facts G [1, 2, …] [4, 5, …]

(using an atttribute is probably better, as that attribute can complain if the theorem statement isn't understood)

We can even introduce amagma_theorem command with a syntax like example so that the poor user doesn’t have to come up with a name for each of these theorems, when the type is enough really :)

Floris van Doorn (Sep 27 2024 at 16:48):

I would imagine putting all results in a namespace, and then analyze the types of all declarations in that namespace.
If we go with the magma_theorem route, we could also let that command put the data in an environment extension.

Joachim Breitner (Sep 27 2024 at 16:50):

Namespace works as well indeed, guess that's a nice and cheap way to simulate an attribute.

David Renshaw (Sep 27 2024 at 17:03):

"everything under a certain namespace that matches a certain shape" makes a lot of sense to me

Edward van de Meent (Sep 27 2024 at 17:11):

i think having it in an environment extension might be useful too though, i think that allows you to find the data in the .olean files, which might be easier to parse?

Edward van de Meent (Sep 27 2024 at 17:12):

i'm not an expert regarding this though, i just went "i read about this a few weeks ago, that might be useful"

Edward van de Meent (Sep 27 2024 at 17:12):

i don't know if it is practical

Terence Tao (Sep 27 2024 at 17:14):

I like the namespace containment proposal, as well as the idea of a "ground truth" lean file that imports all the other files and somehow computes the raw list of implications that can be obtained from the other files, which can then be post-processed by both Lean and non-Lean tools. I also like not having to rely on specific naming conventions for implications, but using Lean's internal processing to figure out what has and has not yet been proved.

In the immediate term, perhaps we can refactor Basic.lean into Magma.lean, Equations.lean, and say Subgraph.lean (this is a proposed name for the subproject to explore a designated small subgraph of the entire dependency graph - other suggestions welcome), and place all the implications in Subgraph.lean in a namespace? Then we can expand Equations.lean to contain all 4694 equations instead of just 20 or so. The incoming large collections of automatically generated implications could then also have a similar structure to Subgraph.lean, and then one would have some other lean file (name TBD, e.g., Implications.lean) to import all of these and process them to create the ground truth of Lean implications.

Nicholas Carlini (Sep 27 2024 at 17:22):

For automatically-generated proofs, I would suggest we also have somewhere organized to store the corresponding code that generated these proofs so that everything can be reproduced (and understood).

Terence Tao (Sep 27 2024 at 17:24):

Nicholas Carlini said:

For automatically-generated proofs, I would suggest we also have somewhere organized to store the corresponding code that generated these proofs so that everything can be reproduced (and understood).

I would also add that every automatically generated proof file should come with a chapter of the blueprint that explains the automatic generation process. This can largely be a LaTeX file with very few metadata links to lean code.

Joachim Breitner (Sep 27 2024 at 17:31):

Nicholas Carlini said:

For automatically-generated proofs, I would suggest we also have somewhere organized to store the corresponding code that generated these proofs so that everything can be reproduced (and understood).

Of course! And - at least if possible - CI should check that they are actually the expected output of these tools.

Nicholas Carlini (Sep 27 2024 at 17:35):

Joachim Breitner said:

Of course! And - at least if possible - CI should check that they are actually the expected output of these tools.

This should be possible in principle for most cases. In my case, I'd have to re-run my massive refutation set because it's randomized and I didn't set an initial seed. (It takes <4 hours to run, so it wouldn't take that long to do.)

Although I can imagine some cases where it might be hard to guarantee "deterministic builds" if there's anything that ever were to involve some larger systems that's not a simple python script.

Terence Tao (Sep 27 2024 at 19:29):

I have gone ahead and refactored Basic.lean into Magma.lean, Equations.lean, and Subgraph.lean, but I have not had time to verify that it actually works (for some reason Lean wanted to recompile all of Mathlib, which I am trying now to talk it out of). Hopefully the commit I just made passes CI, but if not someone may need to fix it. I have kept Basic.lean for now but noted that it is deprecated, and will remove it later.

David Renshaw (Sep 27 2024 at 19:57):

Hopefully the commit I just made passes CI

Nope, but this fixes it: https://github.com/teorth/equational_theories/pull/24

Shreyas Srinivas (Sep 27 2024 at 20:02):

David Renshaw said:

Hopefully the commit I just made passes CI

Nope, but this fixes it: https://github.com/teorth/equational_theories/pull/24

CI running

Shreyas Srinivas (Sep 27 2024 at 20:04):

Do you want me to run it on the other PR as well?

Shreyas Srinivas (Sep 27 2024 at 20:04):

Oh wait it failed

Nicholas Carlini (Sep 27 2024 at 20:35):

Terence Tao said:

I have gone ahead and refactored Basic.lean into Magma.lean, Equations.lean, and Subgraph.lean, but I have not had time to verify that it actually works (for some reason Lean wanted to recompile all of Mathlib, which I am trying now to talk it out of). Hopefully the commit I just made passes CI, but if not someone may need to fix it. I have kept Basic.lean for now but noted that it is deprecated, and will remove it later.

Do you want equations.lean to have all of the equations, and each set of theorems to just bulk load them all from? In my PR I tried splitting out only those equations necessary for each proof because it takes ~20 minutes for lean to prove a few easy theorems when you load all 4k equations, compared to ~30 seconds if you just load the ones you need. But it's ugly to have to re-define each equation in multiple places.

An alternate to consider would be to have a EquationXYZ.lean file per equation so just the ones you need to import for your proofs could be loaded.

Joachim Breitner (Sep 27 2024 at 20:48):

I'd be very surprised that simply loading the file with 4000 defs is causing a slow down. On my PR I have all of them in one file and it's fine.
Just because all equations are in one file doesn't mean you have to process all of them in one downstream file, if that's the bottleneck.

Nicholas Carlini (Sep 27 2024 at 20:50):

I was surprised too. But I have two files that differ only in the that one has all the definitions and the other doesn't, and one takes 30 seconds and the other 20 minutes. Maybe I'm invoking lean wrong and there's a "--go-fast" argument? My first time downloading lean was 6 hours ago...

Joachim Breitner (Sep 27 2024 at 20:52):

Not at the computer right now. If you point to a PR with your work I can have a look (tomorrow).

Vlad Tsyrklevich (Sep 27 2024 at 22:19):

Nicholas Carlini said:

I was surprised too. But I have two files that differ only in the that one has all the definitions and the other doesn't, and one takes 30 seconds and the other 20 minutes. Maybe I'm invoking lean wrong and there's a "--go-fast" argument? My first time downloading lean was 6 hours ago...

I just learned about this, you want to preprocess the defs by putting them in a separate file and using lean -c to generate an .olean file, then include that.

Joachim Breitner (Sep 27 2024 at 22:49):

Are you not using lake?

Nicholas Carlini (Sep 27 2024 at 22:51):

Thanks both. I'm running with time lake env lean -c equational_theories/rewrites2/rewrite_ux_vz_wz_zx.lean. If it's cached, though, then I only ran it once and didn't require the definitions cross-file, I just added them to the top of the file on each timing test. I will use lake and/or -c and it sounds like that should fix things.

Terence Tao (Sep 27 2024 at 23:55):

Hmm, so if we end up putting all 4694 equations in a single Equations.lean file, is there a way to compile the file into an .olean file once and have that available for all contributors to use without having to recompile it? I presume we would not touch the Equations.lean file very often (and if we do, we could make an .olean again).

Without having a common Equations.lean file, it's going to be difficult to integrate together many independent contributed Lean files proving different sets of implications... as suggested above one could instead have 4694 separate Equation files, and create various further Equations files for various specific sets of Equations, including an Equations_all.lean that imports all of them, but this feels clunky to me. What is the best approach here?

Joachim Breitner (Sep 28 2024 at 07:02):

A file with all equations should be fine? I have that on my branch and it works well. The first compilation take a bit (maybe a minute or two), but once it's built I can import and use it quickly without issues. 4000 defs isn't thst large, compared to mathlib which you also import regularly… I don't quite see the issue here (but maybe I'm missing something or there is a big difference in workflow)

Joachim Breitner (Sep 28 2024 at 07:13):

That said, for a nicer onboarding for contributers that open files before building it may make sense to have two files, one with a few equations (maybe those with comments, and those for which we have manual implication proofs) and one that imports that and adds all other equations, typically imported by automated files.

Joachim Breitner (Sep 28 2024 at 08:20):

I’ll just do that now :-)

https://github.com/teorth/equational_theories/pull/48, and merged.

Notification Bot (Sep 28 2024 at 08:44):

2 messages were moved from this topic to #Equational > A lean list of all equations? by Joachim Breitner.

Last updated: May 02 2025 at 03:31 UTC