Zulip Chat Archive

Stream: lean4

Topic: olean dump tool

Mario Carneiro (Jun 21 2023 at 20:00):

I've just created https://github.com/digama0/oleandump, a fancier version of https://github.com/gebner/oleanparser which also uses type information so that it's not just a sea of Obj.ctor nodes and you can actually kind of see what's going on. The output looks something like this:

let xed6e668c{168} : Lean.Import := Lean.Import.mk #[`Init.Prelude, Bool.false]
let xf2bec02{96} : Lean.Import := Lean.Import.mk #[`Init.SizeOf, Bool.false]
let x43828d99{304} : Array.{0} Lean.Import := array 2 #[xed6e668c{168}, xf2bec02{96}]
let x3e10dd5c{216} : Lean.Name := Lean.Name.num #[`_private.Init.Core, 0]
let x4128399e{248} : Lean.Name := Lean.Name.str #[x3e10dd5c{216}, "Quotient"]
let xf8f7d484{320} : Lean.Name := Lean.Name.str #[x4128399e{248}, "rel"]
let xeeaa3d5f{80} : Lean.Name := Lean.Name.str #[x4128399e{248}, "rel_of_eq"]
let x913b1e34{144} : Lean.Name := Lean.Name.num #[`Init.Core._auxLemma, 1]
let x461040dc{32} : Lean.Name := Lean.Name.str #[xf8f7d484{320}, "refl"]
let x86ae8f45{32} : Lean.Name := Lean.Name.str #[xf8f7d484{320}, "proof_1"]
let x5ee03cc8{32} : Lean.Name := Lean.Name.str #[x3e10dd5c{216}, "Thunk"]
let x177fca44{104} : Lean.Name := Lean.Name.str #[x5ee03cc8{32}, "fn"]
let x32363a17{63560} : Array.{0} Lean.Name :=
array 1012 #[`PSum.inr.sizeOf_spec,
   `«term_⊕_»._closed_5._cstage2,
   `«term_≈_»._closed_7._cstage2,
   `«term_!=_»._closed_4._cstage2,
   `Subtype.instInhabitedSubtype._rarg._cstage2,
   `Quotient.liftOn,
   `_aux_Init_Core___unexpand_Sum_1._lambda_1._cstage2,
   `«term_≠_»._closed_1._cstage2,
   `Task.casesOn,
   `decidable_of_decidable_of_eq._rarg._cstage2,
   `Prod.lexLtDec._cstage1,
    ...
]
...
let xefdc7e35{1775856} : Lean.ModuleData :=
Lean.ModuleData.mk #[x43828d99{304}, x32363a17{63560}, x7d25df70{1246112}, xe5438f9e{5448}, x2ef2ecf6{460384}]
xefdc7e35{1775856}

The references like xefdc7e35{1775856} combine a Merkle hash of the contents (not pointer values), which makes it a bit more diff-able, together with the byte size of the object (here 1775856 bytes), which is useful for finding what is taking up all the space in the file. I already managed to use this to find one bug where a lean value was mis-typed.

One of the main goals here was to track down the strangely small olean file savings in #4573, and I think the explanation is simply that the new compiler is quite light on annotations when the file is full of theorems and has nothing worth compiling. For example, Mathlib.Algebra.Group.Basic goes from 1158472 to 1155088 bytes, where the difference comes from two environment extensions:

Lean.Compiler.LCNF.specExtension: 184 bytes -> 56 bytes
Lean.Compiler.LCNF.baseExt: 3424 bytes -> 168 bytes

Only two definitions are marked for compilation in the first place, SubtractionMonoid.toSubNegZeroMonoid and DivisionMonoid.toDivInvOneMonoid, which is why it takes up almost no space to begin with. Here is the top level size breakdown for elements of ModuleData for Mathlib.Algebra.Group.Basic:

imports (declarations of imported modules): 448 bytes
constNames (redundant with constants, just the names): 31808 bytes
constants (the ConstantInfo data, including all the proofs): 997040 bytes
extraConstNames (extra constants not "officially" in the environment, used by the compiler): 40 bytes
entries (environment extensions): 129088 bytes
There are 109 extensions in this file. The largest environment extensions are:
- Lean.declRangeExt (declaration ranges): 61808 bytes
- Lean.Meta.simpExtension (the @[simp] attribute, mainly simp discrimination keys): 18680 bytes
- Mathlib.Prelude.Rename.renameExtension (the #align command): 16464 bytes

This all seems pretty typical for a "small" file, although it still manages to add up to over a megabyte. I'm contemplating a zip/unzip-like tool that will pack oleans losslessly into a lean 3 style serialized format, and unzip them at lake exe cache time rather than at load time. I am hopeful that a sufficiently high performance implementation of the deserializer will be faster than just downloading it over the network and unzipping, unless you have a really fast connection.

Sebastian Ullrich (Jun 21 2023 at 20:25):

This is great, I was dreading having to write this tool myself :big_smile: . How do you account for sharing, is it included in the size only in the first path leading to a shared object?

Sebastian Ullrich (Jun 21 2023 at 20:26):

Would also be interesting to know what percentage of the file is referenced exclusively from proofs

Patrick Massot (Jun 21 2023 at 20:28):

Mario Carneiro said:

I'm contemplating a zip/unzip-like tool that will pack oleans losslessly into a lean 3 style serialized format, and unzip them at lake exe cache time rather than at load time. I am hopeful that a sufficiently high performance implementation of the deserializer will be faster than just downloading it over the network and unzipping, unless you have a really fast connection.

This would be great but wouldn't completely solve the problem for teaching when you don't want to tell students they need 2Gb of storage to work on exercises. For this we would really want a mode where Lean does the unpacking during import, just like in Lean 3.

Mario Carneiro (Jun 21 2023 at 20:41):

Sebastian Ullrich said:

How do you account for sharing, is it included in the size only in the first path leading to a shared object?

Yes. I thought about doing something like counting up the total size of everything in the transitive closure of a given object, but that's a lot of computation and also somewhat difficult to reason about - it's an easier mental model if things just add up.

Mario Carneiro (Jun 21 2023 at 21:14):

Sebastian Ullrich said:

Would also be interesting to know what percentage of the file is referenced exclusively from proofs

For the Mathlib.Algebra.Group.Basic test file, it is 1158472 bytes normally and 422872 bytes without proofs, i.e. it is 63% proof bodies

Mario Carneiro (Jun 21 2023 at 22:07):

I did the same calculation on all of mathlib. Here's the full data dump: proof-size-olean.txt, and here's the beginning of it:

with prfs  w/o proofs prf % file
2638588696 1829564816 .3066 <- total
13744040   12881840   .0627 Mathlib/LinearAlgebra/TensorProduct.olean
13064600   12263384   .0613 Mathlib/RingTheory/TensorProduct.olean
11742536   11426464   .0269 Mathlib/Tactic/Ring/Basic.olean
10997048   9833712    .1057 Mathlib/Analysis/NormedSpace/Multilinear.olean
11203032   9412040    .1598 Mathlib/Analysis/NormedSpace/OperatorNorm.olean
9131928    8843960    .0315 Mathlib/Algebra/Lie/BaseChange.olean
8939736    8604072    .0375 Mathlib/Tactic/NormNum/Basic.olean
8547120    8378648    .0197 Mathlib/CategoryTheory/Limits/Cones.olean
8488096    8100696    .0456 Mathlib/Algebra/Lie/Classical.olean
9000344    7827536    .1303 Mathlib/Algebra/Module/LocalizedModule.olean
8764496    7699544    .1215 Mathlib/Topology/Algebra/Module/Basic.olean
7642368    7639832    .0003 Mathlib/Tactic/Linarith/Preprocessing.olean
7639864    7078056    .0735 Mathlib/CategoryTheory/Limits/Shapes/Pullbacks.olean
6984672    6776568    .0297 Mathlib/RepresentationTheory/Action.olean
8503752    6339272    .2545 Mathlib/GroupTheory/Subgroup/Basic.olean

Mario Carneiro (Jun 21 2023 at 22:09):

Here are the files that benefit the most from dropping proofs:

1992472 56976  .9714 Mathlib/Algebra/GroupPower/Identities.olean
694784  41104  .9408 Mathlib/RingTheory/DiscreteValuationRing/TFAE.olean
1351176 84344  .9375 Mathlib/AlgebraicTopology/DoldKan/Faces.olean
585184  36640  .9373 Mathlib/Data/Complex/ExponentialBounds.olean
586520  40200  .9314 Mathlib/RingTheory/Coprime/Ideal.olean
1203336 94248  .9216 Mathlib/Analysis/Convex/SpecificFunctions/Basic.olean
859464  68928  .9198 Mathlib/Topology/Category/Profinite/CofilteredLimit.olean
1025448 83200  .9188 Mathlib/Analysis/Convex/Slope.olean
1920376 160552 .9163 Mathlib/RingTheory/FinitePresentation.olean
1164112 116920 .8995 Mathlib/NumberTheory/DiophantineApproximation.olean
1216992 125248 .8970 Mathlib/Geometry/Euclidean/Triangle.olean
744712  76896  .8967 Mathlib/AlgebraicTopology/DoldKan/Degeneracies.olean

I would assume these have no or few defs and some big tactic proofs.

Mario Carneiro (Jun 21 2023 at 22:12):

Quite a lot of the category theory files have 0% savings, most likely because this is only looking specifically at dropping the value argument of TheoremVal.mk, not all propositions anywhere in the term, and the category theory files presumably have all defs and no theorems

Scott Morrison (Jun 21 2023 at 22:13):

Yes, proofs tend to be embedded in the fields of def structures in the category theory library.

Scott Morrison (Jun 21 2023 at 22:14):

i.e. is is much more "data bundled with proofs about it" than is typical for mathlib.

Scott Morrison (Jun 21 2023 at 22:14):

~~Or am I misunderstanding and you're dropping proof fields already in these counts?~~

Mario Carneiro (Jun 21 2023 at 22:15):

In lean 3 it would automatically out-line proof parts of definitions, but I guess lean 4 doesn't do that anymore?

Gabriel Ebner (Jun 21 2023 at 23:05):

Lean 4 does that as well. Maybe it's the code generator that blows up the file so much?

Mario Carneiro (Jun 21 2023 at 23:58):

It could also be that they are being marked as defs incorrectly

Mario Carneiro (Jun 21 2023 at 23:59):

indeed, all these proof_n definitions are defs

Floris van Doorn (Jun 22 2023 at 10:17):

Would it be possible that we optionally remove proofs from the olean files? That would significantly reduce the olean file size (especially after we mark proof_n as theorems), and almost everything should still work normally (#print is one of the few exceptions, I think).
If Lean 4 still has the concept of trust level, this should not be accepted at trust level 0, but it seems fine at higher trust levels.
It would also require some tooling additions. I'm imagining a world where lake exe cache get gives you smaller oleans without the proofs, but you can optionally request the larger oleans with the proof.

Scott Morrison (Jun 23 2023 at 00:48):

Mario Carneiro said:

indeed, all these proof_n definitions are defs

This is now fixed in the latest nightly. Bump PRs are at https://github.com/leanprover/std4/pull/163 and #5409.

Scott Morrison (Jun 23 2023 at 01:46):

(The mathlib4 bump looks like it will have to wait for a Lean4 bug fix.)

Mario Carneiro (Jun 23 2023 at 01:48):

~~do you have a MWE?~~ lean4#2282

Mario Carneiro (Jun 23 2023 at 01:48):

simp changed behavior in the last nightly bump, so I expect some minor fixes

Scott Morrison (Jun 23 2023 at 01:53):

This problem causes pretty extensive problems across mathlib4. :-(

Floris van Doorn (Jul 17 2023 at 14:29):

Last week with the help of Mario I looked through 3 files of Mathlib4 and see how big the individual parts of the olean files are. Here is a summary of these olean files. All sizes are given as multiples of 1 kilobyte (to be more precise: 1000 bytes).

Floris van Doorn (Jul 17 2023 at 14:29):

Data.List.Basic.olean (lean file is 4500 lines)

6175k total size (uncompressed)

4985k all declarations
3240k proofs of theorems
238k unnecessary data stored by a bug, fixed in lean4#2321
largest declarations are meaningless because of that.

1092k Environment extensions.
The environment extensions larger than 10k:
441k `Lean.Elab.Structural.eqnInfoExt
129k `Lean.declRangeExt
122k `Lean.Compiler.LCNF.monoExt
119k `Lean.Compiler.LCNF.baseExt
95k `Mathlib.Prelude.Rename.renameExtension (#align statements)
71k `Lean.IR.declMapExt
45k `Lean.Meta.simpExtension

Topology.Algebra.Module.Basic.olean (lean file is 2700 lines)

8823k total size (uncompressed)

7460k all declarations
1085k proofs of theorems
largest declarations:
242k `ContinuousLinearEquiv.continuousSemilinearEquivClass.proof_3
155k `ContinuousLinearMap.addCommGroup._cstage1

1229k Environment extensions.
The environment extensions larger than 10k:
495k `Lean.Compiler.LCNF.baseExt
274k `Lean.IR.declMapExt
109k `Lean.Compiler.LCNF.monoExt
83k `Lean.declRangeExt
79k `Lean.Meta.simpExtension
37k `Mathlib.Prelude.Rename.renameExtension (#align statements)
24k `Lean.IR.UnreachableBranches.functionSummariesExt
21k `Lean.Compiler.specExtension
17k docstrings + module doc
15k `Lean.Meta.instanceExtension
10k `Lean.Compiler.LCNF.specExtension
10k `Simps.structureExt

Geometry.Manifold.ContMDiff.olean (lean file has 2200 lines)

4002k total size (uncompressed)

3839k all declarations
2249k proofs of theorems
largest declarations:
238k `isLocalStructomorphOn_contDiffGroupoid_iff
145k `contMDiffOn_iff
124k `ContMDiffWithinAt.comp
122k `isLocalStructomorphOn_contDiffGroupoid_iff_aux
118k `contDiffWithinAt_localInvariantProp

136k Environment extensions.
The environment extensions larger than 10k:
68k `Lean.declRangeExt
31k `Mathlib.Prelude.Rename.renameExtension (#align statements)
16k docstrings + module doc

Floris van Doorn (Jul 17 2023 at 14:29):

Some notes:

We currently store all proofs of theorems
We currently store runnable versions of all computable definitions, even those inside noncomputable section
We currently store information to be able to later generate equation lemmas for declarations (in extension eqnInfoExt). We currently store this even for theorems, which is useless (this is a bug).

Floris van Doorn (Jul 17 2023 at 14:29):

I think it's safe to say that 2 very big contributors to the olean sizes with only minimal value are:

proofs of theorems
compilation of "mathematical" defs

I think it would be nice if we have the option to get olean files without these two components.

What you would lose, as far as I can tell:

Cannot #print the value on a theorem in an imported file
Cannot run #print axioms on a theorem
Cannot #eval mathematical definitions.

2 is definitely useful sometimes, but I rarely use 1 and 3.

What do I mean by mathematical definition? How about anything in a noncomputable section except things that are whitelisted (e.g. by marking it specifically as computable?). Note that we still want to be able to run some of these definitions, e.g. delaborators could be defined in noncomputable sections (by the notation command).

Kevin Buzzard (Jul 17 2023 at 16:37):

It probably goes without saying but I essentially never do any of 1 to 3 and I bet I'm not alone in the mathematical community with this property.

Alex J. Best (Jul 17 2023 at 18:05):

I definitely use print axioms on theorems I just wrote in my file (that maybe depends on other files I wrote) to check if it is now sorry free.
And I do occasionally use eval.
That's not to say that reduced oleans are a bad idea, just I'm curious what the use case is.
Using reduced oleans would also reduce the utility of some linters or similar tooling I guess.

Floris van Doorn (Jul 17 2023 at 18:11):

The use case is reducing the olean file size by ~2x.

Mario Carneiro (Jul 17 2023 at 18:12):

the print axioms issue specifically can be fixed by precalculating the result of print axioms on everything

Floris van Doorn (Jul 17 2023 at 18:12):

In my ideal world you can have lake exe cache get (which gives you the small oleans) and lake exe cache get --all (or something) which gives you the full oleans (current behavior). Similarly there is a flag for building a Lean file with all/partial info.

James Gallicchio (Jul 17 2023 at 18:18):

do people care about elaborated proofs for anything other than axiom usage..?

Floris van Doorn (Jul 17 2023 at 18:19):

I opened a RFC here: lean4#2328

Kyle Miller (Jul 17 2023 at 18:23):

@James Gallicchio Floris's to_additive attribute operates on elaborated proofs for example (but that's not an olean concern, since the attribute is added in the same module (always?))

Floris van Doorn (Jul 17 2023 at 19:16):

There are a few exceptions in Algebra.Group.Defs and Init.ZeroOne, but other than that they are indeed always added in the same file AFAIK.

Mac Malone (Jul 18 2023 at 01:55):

Another thing to consider: If proofs were removed from some variety of olean, Lake would not rebuild downstream files if the proofs changed (e.g., are golfed), which could be quite nice in many situations.

Mario Carneiro (Jul 18 2023 at 02:13):

unfortunately I doubt that would work because the olean files also include the declarationRanges data, i.e. line/col for every declaration, which is invalidated by most non-whitespace changes

Mario Carneiro (Jul 18 2023 at 02:13):

the ilean too (this even contains the locations of local variables in the code IIRC)

James Gallicchio (Jul 18 2023 at 05:28):

ooc, why is that in the olean?

Mario Carneiro (Jul 18 2023 at 05:28):

so that go-to-def works

James Gallicchio (Jul 18 2023 at 05:28):

ooh!

James Gallicchio (Jul 18 2023 at 05:29):

that explains why out of sync olean files cause the go to def to be all funky :)

Joachim Breitner (Jul 18 2023 at 10:18):

Would it be feasible to put that information into a separate file, only read by interactive tools, but not when just compiling downstream modules? This way, some recompilation could be avoided without losing that feature.

Sebastian Ullrich (Jul 18 2023 at 10:32):

Yes, the .ilean files are already a first step in that direction. This is all part of the module system plan (TBD).

Joachim Breitner (Jul 18 2023 at 12:58):

Exciting! :-)

Joachim Breitner (Jan 08 2025 at 19:38):

Floris van Doorn said:

Last week with the help of Mario I looked through 3 files of Mathlib4 and see how big the individual parts of the olean files are. Here is a summary of these olean files. All sizes are given as multiples of 1 kilobyte (to be more precise: 1000 bytes).

A few of the issues you found back then have likely been fixed. Do you still have the tool/script around that generated these reports? What does it say these days?

Floris van Doorn (Jan 09 2025 at 11:20):

The tool we used was ~~https://github.com/digama0/leangz~~ https://github.com/digama0/oleandump and manually looking at the output (manually investigating nodes that had a large size).
I'd be interested to do this again, but cannot promise that I'll do anything this month.

Last updated: May 02 2025 at 03:31 UTC