Zulip Chat Archive
Stream: lean4
Topic: olean dump tool
Mario Carneiro (Jun 21 2023 at 20:00):
I've just created https://github.com/digama0/oleandump, a fancier version of https://github.com/gebner/oleanparser which also uses type information so that it's not just a sea of Obj.ctor
nodes and you can actually kind of see what's going on. The output looks something like this:
let xed6e668c{168} : Lean.Import := Lean.Import.mk #[`Init.Prelude, Bool.false]
let xf2bec02{96} : Lean.Import := Lean.Import.mk #[`Init.SizeOf, Bool.false]
let x43828d99{304} : Array.{0} Lean.Import := array 2 #[xed6e668c{168}, xf2bec02{96}]
let x3e10dd5c{216} : Lean.Name := Lean.Name.num #[`_private.Init.Core, 0]
let x4128399e{248} : Lean.Name := Lean.Name.str #[x3e10dd5c{216}, "Quotient"]
let xf8f7d484{320} : Lean.Name := Lean.Name.str #[x4128399e{248}, "rel"]
let xeeaa3d5f{80} : Lean.Name := Lean.Name.str #[x4128399e{248}, "rel_of_eq"]
let x913b1e34{144} : Lean.Name := Lean.Name.num #[`Init.Core._auxLemma, 1]
let x461040dc{32} : Lean.Name := Lean.Name.str #[xf8f7d484{320}, "refl"]
let x86ae8f45{32} : Lean.Name := Lean.Name.str #[xf8f7d484{320}, "proof_1"]
let x5ee03cc8{32} : Lean.Name := Lean.Name.str #[x3e10dd5c{216}, "Thunk"]
let x177fca44{104} : Lean.Name := Lean.Name.str #[x5ee03cc8{32}, "fn"]
let x32363a17{63560} : Array.{0} Lean.Name :=
array 1012 #[`PSum.inr.sizeOf_spec,
`«term_⊕_»._closed_5._cstage2,
`«term_≈_»._closed_7._cstage2,
`«term_!=_»._closed_4._cstage2,
`Subtype.instInhabitedSubtype._rarg._cstage2,
`Quotient.liftOn,
`_aux_Init_Core___unexpand_Sum_1._lambda_1._cstage2,
`«term_≠_»._closed_1._cstage2,
`Task.casesOn,
`decidable_of_decidable_of_eq._rarg._cstage2,
`Prod.lexLtDec._cstage1,
...
]
...
let xefdc7e35{1775856} : Lean.ModuleData :=
Lean.ModuleData.mk #[x43828d99{304}, x32363a17{63560}, x7d25df70{1246112}, xe5438f9e{5448}, x2ef2ecf6{460384}]
xefdc7e35{1775856}
The references like xefdc7e35{1775856}
combine a Merkle hash of the contents (not pointer values), which makes it a bit more diff-able, together with the byte size of the object (here 1775856
bytes), which is useful for finding what is taking up all the space in the file. I already managed to use this to find one bug where a lean value was mis-typed.
One of the main goals here was to track down the strangely small olean file savings in #4573, and I think the explanation is simply that the new compiler is quite light on annotations when the file is full of theorems and has nothing worth compiling. For example, Mathlib.Algebra.Group.Basic
goes from 1158472
to 1155088
bytes, where the difference comes from two environment extensions:
Lean.Compiler.LCNF.specExtension: 184 bytes -> 56 bytes
Lean.Compiler.LCNF.baseExt: 3424 bytes -> 168 bytes
Only two definitions are marked for compilation in the first place, SubtractionMonoid.toSubNegZeroMonoid
and DivisionMonoid.toDivInvOneMonoid
, which is why it takes up almost no space to begin with. Here is the top level size breakdown for elements of ModuleData
for Mathlib.Algebra.Group.Basic
:
imports
(declarations of imported modules): 448 bytesconstNames
(redundant withconstants
, just the names): 31808 bytesconstants
(theConstantInfo
data, including all the proofs): 997040 bytesextraConstNames
(extra constants not "officially" in the environment, used by the compiler): 40 bytes-
entries
(environment extensions): 129088 bytes
There are 109 extensions in this file. The largest environment extensions are:Lean.declRangeExt
(declaration ranges): 61808 bytesLean.Meta.simpExtension
(the@[simp]
attribute, mainly simp discrimination keys): 18680 bytesMathlib.Prelude.Rename.renameExtension
(the#align
command): 16464 bytes
This all seems pretty typical for a "small" file, although it still manages to add up to over a megabyte. I'm contemplating a zip/unzip-like tool that will pack oleans losslessly into a lean 3 style serialized format, and unzip them at lake exe cache
time rather than at load time. I am hopeful that a sufficiently high performance implementation of the deserializer will be faster than just downloading it over the network and unzipping, unless you have a really fast connection.
Sebastian Ullrich (Jun 21 2023 at 20:25):
This is great, I was dreading having to write this tool myself :big_smile: . How do you account for sharing, is it included in the size only in the first path leading to a shared object?
Sebastian Ullrich (Jun 21 2023 at 20:26):
Would also be interesting to know what percentage of the file is referenced exclusively from proofs
Patrick Massot (Jun 21 2023 at 20:28):
Mario Carneiro said:
I'm contemplating a zip/unzip-like tool that will pack oleans losslessly into a lean 3 style serialized format, and unzip them at
lake exe cache
time rather than at load time. I am hopeful that a sufficiently high performance implementation of the deserializer will be faster than just downloading it over the network and unzipping, unless you have a really fast connection.
This would be great but wouldn't completely solve the problem for teaching when you don't want to tell students they need 2Gb of storage to work on exercises. For this we would really want a mode where Lean does the unpacking during import, just like in Lean 3.
Mario Carneiro (Jun 21 2023 at 20:41):
Sebastian Ullrich said:
How do you account for sharing, is it included in the size only in the first path leading to a shared object?
Yes. I thought about doing something like counting up the total size of everything in the transitive closure of a given object, but that's a lot of computation and also somewhat difficult to reason about - it's an easier mental model if things just add up.
Mario Carneiro (Jun 21 2023 at 21:14):
Sebastian Ullrich said:
Would also be interesting to know what percentage of the file is referenced exclusively from proofs
For the Mathlib.Algebra.Group.Basic test file, it is 1158472 bytes normally and 422872 bytes without proofs, i.e. it is 63% proof bodies
Mario Carneiro (Jun 21 2023 at 22:07):
I did the same calculation on all of mathlib. Here's the full data dump: proof-size-olean.txt, and here's the beginning of it:
with prfs w/o proofs prf % file
2638588696 1829564816 .3066 <- total
13744040 12881840 .0627 Mathlib/LinearAlgebra/TensorProduct.olean
13064600 12263384 .0613 Mathlib/RingTheory/TensorProduct.olean
11742536 11426464 .0269 Mathlib/Tactic/Ring/Basic.olean
10997048 9833712 .1057 Mathlib/Analysis/NormedSpace/Multilinear.olean
11203032 9412040 .1598 Mathlib/Analysis/NormedSpace/OperatorNorm.olean
9131928 8843960 .0315 Mathlib/Algebra/Lie/BaseChange.olean
8939736 8604072 .0375 Mathlib/Tactic/NormNum/Basic.olean
8547120 8378648 .0197 Mathlib/CategoryTheory/Limits/Cones.olean
8488096 8100696 .0456 Mathlib/Algebra/Lie/Classical.olean
9000344 7827536 .1303 Mathlib/Algebra/Module/LocalizedModule.olean
8764496 7699544 .1215 Mathlib/Topology/Algebra/Module/Basic.olean
7642368 7639832 .0003 Mathlib/Tactic/Linarith/Preprocessing.olean
7639864 7078056 .0735 Mathlib/CategoryTheory/Limits/Shapes/Pullbacks.olean
6984672 6776568 .0297 Mathlib/RepresentationTheory/Action.olean
8503752 6339272 .2545 Mathlib/GroupTheory/Subgroup/Basic.olean
Mario Carneiro (Jun 21 2023 at 22:09):
Here are the files that benefit the most from dropping proofs:
1992472 56976 .9714 Mathlib/Algebra/GroupPower/Identities.olean
694784 41104 .9408 Mathlib/RingTheory/DiscreteValuationRing/TFAE.olean
1351176 84344 .9375 Mathlib/AlgebraicTopology/DoldKan/Faces.olean
585184 36640 .9373 Mathlib/Data/Complex/ExponentialBounds.olean
586520 40200 .9314 Mathlib/RingTheory/Coprime/Ideal.olean
1203336 94248 .9216 Mathlib/Analysis/Convex/SpecificFunctions/Basic.olean
859464 68928 .9198 Mathlib/Topology/Category/Profinite/CofilteredLimit.olean
1025448 83200 .9188 Mathlib/Analysis/Convex/Slope.olean
1920376 160552 .9163 Mathlib/RingTheory/FinitePresentation.olean
1164112 116920 .8995 Mathlib/NumberTheory/DiophantineApproximation.olean
1216992 125248 .8970 Mathlib/Geometry/Euclidean/Triangle.olean
744712 76896 .8967 Mathlib/AlgebraicTopology/DoldKan/Degeneracies.olean
I would assume these have no or few defs and some big tactic proofs.
Mario Carneiro (Jun 21 2023 at 22:12):
Quite a lot of the category theory files have 0% savings, most likely because this is only looking specifically at dropping the value
argument of TheoremVal.mk
, not all propositions anywhere in the term, and the category theory files presumably have all def
s and no theorem
s
Scott Morrison (Jun 21 2023 at 22:13):
Yes, proofs tend to be embedded in the fields of def
structures in the category theory library.
Scott Morrison (Jun 21 2023 at 22:14):
i.e. is is much more "data bundled with proofs about it" than is typical for mathlib.
Scott Morrison (Jun 21 2023 at 22:14):
Or am I misunderstanding and you're dropping proof fields already in these counts?
Mario Carneiro (Jun 21 2023 at 22:15):
In lean 3 it would automatically out-line proof parts of definitions, but I guess lean 4 doesn't do that anymore?
Gabriel Ebner (Jun 21 2023 at 23:05):
Lean 4 does that as well. Maybe it's the code generator that blows up the file so much?
Mario Carneiro (Jun 21 2023 at 23:58):
It could also be that they are being marked as def
s incorrectly
Mario Carneiro (Jun 21 2023 at 23:59):
indeed, all these proof_n
definitions are def
s
Floris van Doorn (Jun 22 2023 at 10:17):
Would it be possible that we optionally remove proofs from the olean files? That would significantly reduce the olean file size (especially after we mark proof_n
as theorems), and almost everything should still work normally (#print
is one of the few exceptions, I think).
If Lean 4 still has the concept of trust level, this should not be accepted at trust level 0, but it seems fine at higher trust levels.
It would also require some tooling additions. I'm imagining a world where lake exe cache get
gives you smaller oleans without the proofs, but you can optionally request the larger oleans with the proof.
Scott Morrison (Jun 23 2023 at 00:48):
Mario Carneiro said:
indeed, all these
proof_n
definitions aredef
s
This is now fixed in the latest nightly. Bump PRs are at https://github.com/leanprover/std4/pull/163 and #5409.
Scott Morrison (Jun 23 2023 at 01:46):
(The mathlib4 bump looks like it will have to wait for a Lean4 bug fix.)
Mario Carneiro (Jun 23 2023 at 01:48):
do you have a MWE? lean4#2282
Mario Carneiro (Jun 23 2023 at 01:48):
simp changed behavior in the last nightly bump, so I expect some minor fixes
Scott Morrison (Jun 23 2023 at 01:53):
This problem causes pretty extensive problems across mathlib4. :-(
Floris van Doorn (Jul 17 2023 at 14:29):
Last week with the help of Mario I looked through 3 files of Mathlib4 and see how big the individual parts of the olean files are. Here is a summary of these olean files. All sizes are given as multiples of 1 kilobyte (to be more precise: 1000 bytes).
Floris van Doorn (Jul 17 2023 at 14:29):
Data.List.Basic.olean
(lean file is 4500 lines)
6175k total size (uncompressed)
4985k all declarations
3240k proofs of theorems
238k unnecessary data stored by a bug, fixed in lean4#2321
largest declarations are meaningless because of that.
1092k Environment extensions.
The environment extensions larger than 10k:
441k `Lean.Elab.Structural.eqnInfoExt
129k `Lean.declRangeExt
122k `Lean.Compiler.LCNF.monoExt
119k `Lean.Compiler.LCNF.baseExt
95k `Mathlib.Prelude.Rename.renameExtension (#align statements)
71k `Lean.IR.declMapExt
45k `Lean.Meta.simpExtension
Topology.Algebra.Module.Basic.olean
(lean file is 2700 lines)
8823k total size (uncompressed)
7460k all declarations
1085k proofs of theorems
largest declarations:
242k `ContinuousLinearEquiv.continuousSemilinearEquivClass.proof_3
155k `ContinuousLinearMap.addCommGroup._cstage1
1229k Environment extensions.
The environment extensions larger than 10k:
495k `Lean.Compiler.LCNF.baseExt
274k `Lean.IR.declMapExt
109k `Lean.Compiler.LCNF.monoExt
83k `Lean.declRangeExt
79k `Lean.Meta.simpExtension
37k `Mathlib.Prelude.Rename.renameExtension (#align statements)
24k `Lean.IR.UnreachableBranches.functionSummariesExt
21k `Lean.Compiler.specExtension
17k docstrings + module doc
15k `Lean.Meta.instanceExtension
10k `Lean.Compiler.LCNF.specExtension
10k `Simps.structureExt
Geometry.Manifold.ContMDiff.olean
(lean file has 2200 lines)
4002k total size (uncompressed)
3839k all declarations
2249k proofs of theorems
largest declarations:
238k `isLocalStructomorphOn_contDiffGroupoid_iff
145k `contMDiffOn_iff
124k `ContMDiffWithinAt.comp
122k `isLocalStructomorphOn_contDiffGroupoid_iff_aux
118k `contDiffWithinAt_localInvariantProp
136k Environment extensions.
The environment extensions larger than 10k:
68k `Lean.declRangeExt
31k `Mathlib.Prelude.Rename.renameExtension (#align statements)
16k docstrings + module doc
Floris van Doorn (Jul 17 2023 at 14:29):
Some notes:
- We currently store all proofs of theorems
- We currently store runnable versions of all computable definitions, even those inside
noncomputable section
- We currently store information to be able to later generate equation lemmas for declarations (in extension
eqnInfoExt
). We currently store this even for theorems, which is useless (this is a bug).
Floris van Doorn (Jul 17 2023 at 14:29):
I think it's safe to say that 2 very big contributors to the olean sizes with only minimal value are:
- proofs of theorems
- compilation of "mathematical" defs
I think it would be nice if we have the option to get olean files without these two components.
What you would lose, as far as I can tell:
- Cannot
#print
the value on a theorem in an imported file - Cannot run
#print axioms
on a theorem - Cannot
#eval
mathematical definitions.
2 is definitely useful sometimes, but I rarely use 1 and 3.
What do I mean by mathematical definition? How about anything in a noncomputable section
except things that are whitelisted (e.g. by marking it specifically as computable
?). Note that we still want to be able to run some of these definitions, e.g. delaborators could be defined in noncomputable sections (by the notation
command).
Kevin Buzzard (Jul 17 2023 at 16:37):
It probably goes without saying but I essentially never do any of 1 to 3 and I bet I'm not alone in the mathematical community with this property.
Alex J. Best (Jul 17 2023 at 18:05):
I definitely use print axioms on theorems I just wrote in my file (that maybe depends on other files I wrote) to check if it is now sorry free.
And I do occasionally use eval.
That's not to say that reduced oleans are a bad idea, just I'm curious what the use case is.
Using reduced oleans would also reduce the utility of some linters or similar tooling I guess.
Floris van Doorn (Jul 17 2023 at 18:11):
The use case is reducing the olean file size by ~2x.
Mario Carneiro (Jul 17 2023 at 18:12):
the print axioms issue specifically can be fixed by precalculating the result of print axioms on everything
Floris van Doorn (Jul 17 2023 at 18:12):
In my ideal world you can have lake exe cache get
(which gives you the small oleans) and lake exe cache get --all
(or something) which gives you the full oleans (current behavior). Similarly there is a flag for building a Lean file with all/partial info.
James Gallicchio (Jul 17 2023 at 18:18):
do people care about elaborated proofs for anything other than axiom usage..?
Floris van Doorn (Jul 17 2023 at 18:19):
I opened a RFC here: lean4#2328
Kyle Miller (Jul 17 2023 at 18:23):
@James Gallicchio Floris's to_additive
attribute operates on elaborated proofs for example (but that's not an olean concern, since the attribute is added in the same module (always?))
Floris van Doorn (Jul 17 2023 at 19:16):
There are a few exceptions in Algebra.Group.Defs
and Init.ZeroOne
, but other than that they are indeed always added in the same file AFAIK.
Mac Malone (Jul 18 2023 at 01:55):
Another thing to consider: If proofs were removed from some variety of olean, Lake would not rebuild downstream files if the proofs changed (e.g., are golfed), which could be quite nice in many situations.
Mario Carneiro (Jul 18 2023 at 02:13):
unfortunately I doubt that would work because the olean files also include the declarationRanges
data, i.e. line/col for every declaration, which is invalidated by most non-whitespace changes
Mario Carneiro (Jul 18 2023 at 02:13):
the ilean too (this even contains the locations of local variables in the code IIRC)
James Gallicchio (Jul 18 2023 at 05:28):
ooc, why is that in the olean?
Mario Carneiro (Jul 18 2023 at 05:28):
so that go-to-def works
James Gallicchio (Jul 18 2023 at 05:28):
ooh!
James Gallicchio (Jul 18 2023 at 05:29):
that explains why out of sync olean files cause the go to def to be all funky :)
Joachim Breitner (Jul 18 2023 at 10:18):
Would it be feasible to put that information into a separate file, only read by interactive tools, but not when just compiling downstream modules? This way, some recompilation could be avoided without losing that feature.
Sebastian Ullrich (Jul 18 2023 at 10:32):
Yes, the .ilean files are already a first step in that direction. This is all part of the module system plan (TBD).
Joachim Breitner (Jul 18 2023 at 12:58):
Exciting! :-)
Last updated: Dec 20 2023 at 11:08 UTC