Zulip Chat Archive

Stream: lean4

Topic: exporting and importing a `DiscrTree`


Scott Morrison (Apr 04 2023 at 00:57):

I'd like to experiment with doing some on-disk caching of the library_search cache. If I just want to write out a DiscrTree Name s to disk, where do I start?

Scott Morrison (Apr 04 2023 at 00:58):

I don't want to use environment extensions here, because the intention is to have a single cache for everything in mathlib. This will be a largish file; if we put it in environment extensions it would essentially be stored separately in every olean.

Mario Carneiro (Apr 04 2023 at 01:07):

assuming you want it to be memory mapped like oleans are, I don't think it is possible unless you make it a fake module so that you can get lean_save_module_data to do the heavy lifting

Mario Carneiro (Apr 04 2023 at 01:08):

It would be great if there was a lean API for the object_compactor

Gabriel Ebner (Apr 04 2023 at 01:13):

For testing purposes you can just unsafeCast to ModuleData and use readModuleData and saveModuleData.

Mario Carneiro (Apr 04 2023 at 01:13):

yeah I was just thinking about making an API which did that

Mario Carneiro (Apr 04 2023 at 01:28):

easy peasy:

import Std.Util.TermUnsafe

open Lean System

def pickle {α : Type} (path : FilePath) (x : α) (key : Name := by exact decl_name%) : IO Unit :=
  saveModuleData path key (unsafe unsafeCast x)

unsafe def unpickle (α : Type) (path : FilePath) : IO (α × CompactedRegion) := do
  let (x, region) ← readModuleData path
  pure (unsafeCast x, region)

unsafe def withUnpickle [Monad m] [MonadLiftT IO m] {α β : Type}
    (path : FilePath) (f : α → m β) : m β := do
  let (x, region) ← unpickle α path
  let r ← f x
  region.free
  pure r

#eval do
  pickle "test.mylean" (3, 14)
  withUnpickle "test.mylean" fun x : Nat × Nat =>
    println! x -- (3, 14)

Gabriel Ebner (Apr 04 2023 at 01:29):

.anonymous is a bad choice in pickle. That will disable mmapping if you have more than one pickled object.

Mario Carneiro (Apr 04 2023 at 01:30):

hm, yeah that is awkward. The user would need to provide a random seed

Gabriel Ebner (Apr 04 2023 at 01:31):

Or just a name, like `librarySearch.cache`.

Mario Carneiro (Apr 04 2023 at 01:31):

heh, it could be decl_name%

Gabriel Ebner (Apr 04 2023 at 01:34):

I think in many cases you want to create more than one pickle file per function. Like if you want to have one file per Lean module.

Mario Carneiro (Apr 04 2023 at 01:35):

well it's just the default, the user can override it to mix in their own key
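
For example, a hypothetical call overriding the default key from the pickle definition above (the file names and cache names here are illustrative only):

#eval do
  -- Distinct keys keep each pickled file mmappable even when several are loaded,
  -- per the remark above about .anonymous.
  pickle "cache1.mylean" (#[1, 2, 3] : Array Nat) (key := `MyLib.someCache)
  pickle "cache2.mylean" ("hello", 42) (key := `MyLib.otherCache)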

Mario Carneiro (Apr 04 2023 at 01:36):

I don't think the other inputs make a good cache key

Scott Morrison (Apr 04 2023 at 02:23):

Wow, okay, this is pretty good. Building the entire DiscrTree for current mathlib4 takes about 47 seconds. Pickling it takes about 15, and unpickling it is essentially instant.

Gabriel Ebner (Apr 04 2023 at 02:24):

Just how big is it??

Scott Morrison (Apr 04 2023 at 02:24):

~150mb uncompressed, 20mb compressed

Gabriel Ebner (Apr 04 2023 at 02:25):

It takes 15 seconds to write a 150M olean file? That's much slower than I'd expected.

Scott Morrison (Apr 04 2023 at 02:25):

I'm tempted to try to distribute it alongside the olean cache in some way, so users of library search aren't constructing it themselves.

Scott Morrison (Apr 04 2023 at 02:26):

I think it would require extra features in cache, we can't just sneak this into the existing setup...?

Gabriel Ebner (Apr 04 2023 at 02:26):

It would fit in better if there was one cache per module.

Scott Morrison (Apr 04 2023 at 02:27):

Even then, I'm worried that we'd lose all the advantage: we either have to merge all those DiscrTrees, or search for matches in all of them.

Scott Morrison (Apr 04 2023 at 02:27):

Either way it's probably slow again.

Gabriel Ebner (Apr 04 2023 at 02:27):

With one cache per mathlib, it's a bit of a new thing. What do we do if the build fails? How is the cache updated when I locally change something?

Scott Morrison (Apr 04 2023 at 02:29):

My inclination would be to keep it simple: don't produce anything if the build fails, and the cache isn't updated when you change something! If the cache file is absent, library_search will just follow the current behaviour.
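
A minimal sketch of that fallback behaviour, assuming the pickle/unpickle helpers above (loadOrBuild and buildCache are hypothetical names, not the actual library_search code):

unsafe def loadOrBuild (α : Type) (path : FilePath) (buildCache : IO α) : IO α := do
  if ← path.pathExists then
    -- Cache file present: use it. The CompactedRegion is deliberately kept
    -- alive (never freed), since the returned value points into the mapped memory.
    let (cache, _region) ← unpickle α path
    pure cache
  else
    -- No cache on disk: fall back to the current behaviour and build it.
    buildCache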

Scott Morrison (Apr 04 2023 at 02:29):

And we'll have a script users can run locally to rebuild it if desired.

Scott Morrison (Apr 04 2023 at 02:30):

I feel like mostly library_search is not being used for finding lemmas in the files you are authoring today. :-)

Gabriel Ebner (Apr 04 2023 at 02:30):

I guess you'd always have a cache file lying around. But when you're switching to a branch where CI fails, you'd still be searching in the old one.

Gabriel Ebner (Apr 04 2023 at 02:31):

Scott Morrison said:

I feel like mostly library_search is not being used for finding lemmas in the files you are authoring today. :-)

That's a valid point. Maybe we could also turn this into a feature and have library_search recommend imports for lemmas you don't have yet.

Scott Morrison (Apr 04 2023 at 02:31):

The cache only refers to Names. So obviously if things get renamed/added/etc., it's not going to find those lemmas. But everything still goes through name resolution.

Scott Morrison (Apr 04 2023 at 02:32):

So a partially dirty cache is still completely usable.

Scott Morrison (Apr 04 2023 at 02:33):

For lemmas in files that aren't imported, obviously we can't apply them. So suggesting things from files that haven't been imported yet could only be on the rather weak basis of the DiscrTree returning them.

Gabriel Ebner (Apr 04 2023 at 02:33):

Scott Morrison said:

And we'll have a script users can run locally to rebuild it if desired.

I don't think you'd get a chance to use it often, since it would only work once everything builds, which is exactly when you no longer need library_search.

Scott Morrison (Apr 04 2023 at 02:34):

Sure, but if you are off on some weird branch with no CI, at least you can build a cache before you start work.

Scott Morrison (Apr 04 2023 at 02:35):

Perhaps for a first implementation I could just try hacking cache to look for this one extra file, and store it under some hash that mixes in the hash for Mathlib.lean.
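
Something along these lines, as a rough sketch (mixHash comes from core Lean; the file-name scheme and names here are made up):

-- Hypothetical: derive the cache file name from the hash `cache` already has
-- for Mathlib.lean, mixed with a tag for this one extra file.
def extraCacheName (mathlibHash : UInt64) : String :=
  let tag := hash "librarySearch"
  s!"librarySearch-{mixHash mathlibHash tag}.pickle"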

Mario Carneiro (Apr 04 2023 at 04:16):

Gabriel Ebner said:

It takes 15 seconds to write a 150M olean file? That's much slower than I'd expected.

It's not just writing the file, it is also compacting the data structure

Mario Carneiro (Apr 04 2023 at 04:16):

I would not be surprised if that is comparable to the time it takes to construct in the first place

Sebastian Ullrich (Apr 04 2023 at 08:08):

The benchmarks say ~100s for serialising mathlib4's 1.2GB of .olean files, so that ratio sounds roughly correct


Last updated: Dec 20 2023 at 11:08 UTC