Zulip Chat Archive

Stream: lean4

Topic: quadratic growth of olean cache from environment extensions?

Scott Morrison (Mar 15 2023 at 22:59):

Do we expect to see quadratic growth of the olean cache (in terms of LOC), coming from PersistentEnvironmentExtension?

It seems if each file is adding something, and the extension saves its data at each olean, we expect that olean data to be linear in the number of imports, and hence the total size to be quadratic.

If that is the case, how much of a worry is it for existing extensions? I made a few quick graphs, and there's certainly no obvious sign of a quadratic term dominating!

If I try stashing the library_search data (i.e. a key-value pair for every declaration) in a persistent environment extension, could it become a problem then? By the end of mathlib, every olean file would essentially contain a list of every imported declaration. Opinions welcome on:

This is obviously a bad idea.
Err, try it and see.
No worries, mate.

My back of the envelope calculation says it's a bad idea: mathlib3 has 150000 declarations, so with say 100 bytes in the cache for each declaration, that would add in the worst case 15mb to an olean file. A typical file that doesn't import so much would optimistically average say 1mb, but over 3000 files that adds 3gb to the olean cache, which is untenable.

Jannis Limperg (Mar 16 2023 at 00:14):

For precisely this reason, the builtin extensions only export entries which were added in the current module. This way each module's olean contains only module-local data. You then have to merge the data during import, but that seems better than the alternative you describe.

Scott Morrison (Mar 16 2023 at 01:53):

But isn't that just hiding the problem in a quadratic growth in compile time? In each file, we need to do linearly much work to import the extension.

Scott Morrison (Mar 16 2023 at 01:55):

I guess implicit in what you've said is that when a PersistentEnvironmentExtension is handed an Array (Array α) of data to reassemble, that outermost array is indexed by all transitive imports of the current file. I had been assuming it was just the immediate imports.

Scott Morrison (Mar 16 2023 at 01:59):

It seems for a really "heavy" environment extension like library_search, neither option is really good enough.

An alternative would be to:

during CI, have a file with import Mathlib, and then assembles the full library_search DiscrTree, and writes it to disk
put this somewhere persistent (git or azure, I guess)
a mechanism like lake exe cache get to obtain it?
low down in mathlib, we import this DiscrTree, and then all calls to library_search can be fast. (in fact, the DiscrTree would be returned the Names of theorems that hadn't even been imported yet, but these would fail fast)

James Gallicchio (Mar 16 2023 at 02:44):

The alternative would also mean it’s only useful for finding mathlib lemmas.

Scott Morrison (Mar 16 2023 at 03:20):

I mean, we could keep the existing mechanisms available for other projects. It's not like anything comparable in size to mathlib is on the horizon.

Sebastian Ullrich (Mar 16 2023 at 08:28):

Scott Morrison said:

In each file, we need to do linearly much work to import the extension.

Some extensions do not actually have to import all data -- if the .olean structure is already sufficient for e.g. finding specific entries via binary search, the OS should in theory only have to load those O(log n) accessed pages into memory. See MapDeclarationExtension for the generic representative example for this. But I guess that doesn't apply to assembling the DiscrTree.

Sebastian Ullrich (Mar 16 2023 at 08:30):

Similarly to the declaration .ilean data, a global solution sounds interesting because it would allow you to find theorems you have not imported yet; on the flip side, you need to be careful not to construct self-referential proofs, such as solving a theorem with itself.

Last updated: May 02 2025 at 03:31 UTC