Zulip Chat Archive

Stream: mathlib4

Topic: lean-cache: lake exe cache rewrite


Mario Carneiro (Jul 20 2023 at 05:44):

I have an initial version of the lake exe cache rewrite available now at https://github.com/digama0/lean-cache . Some comparative statistics:

$ rm -rf build/lib/Mathlib* build/ir/Mathlib*
$ time lake exe cache get
No files to download
Decompressing 3589 file(s)
unpacked in 5646 ms
________________________________________________________
Executed in   10.20 secs    fish           external
   usr time   30.16 secs    3.72 millis   30.15 secs
   sys time    8.30 secs    0.00 millis    8.30 secs
$ rm -rf build/lib/Mathlib* build/ir/Mathlib*
$ time lake_ext
________________________________________________________
Executed in    1.02 secs      fish           external
   usr time  929.58 millis    0.46 millis  929.12 millis
   sys time   93.50 millis    4.11 millis   89.39 millis
$ time lean-cache get
Nothing to download
Decompressing 3369 file(s)
unpacked in 3128 ms
________________________________________________________
Executed in    4.16 secs    fish           external
   usr time   20.92 secs    0.70 millis   20.92 secs
   sys time    6.00 secs    3.63 millis    6.00 secs

Mario Carneiro (Jul 20 2023 at 05:50):

The main differences are:

  • It's written in rust, not lean
  • It doesn't run inside a lake environment, which means it needs data from lake regarding the workspace configuration. This cuts about a second off the processing time, but if the configuration changes then you have to run lake first to update (and unfortunately this is often necessary when switching branches or otherwise doing things that might necessitate using lake-cache in the first place). Support for the necessary export still needs to be upstreamed but for now there is a separate application lake_ext that processes the lakefile into a JSON format and puts it in the build directory.
  • It doesn't support all the subcommands of lake exe cache, are people actually using these?

Mac Malone (Jul 20 2023 at 05:54):

@Mario Carneiro I'm curious: What is the primary cause of the 2x time save?

Mario Carneiro (Jul 20 2023 at 05:55):

It's hard to tell. Running anything through lake seems to involve some time just to do the workspace setup, and other activities are also slower than seems appropriate

Mario Carneiro (Jul 20 2023 at 05:57):

If you stick an exit 0 at the beginning of cache, lake exe cache get drops to 1.1 sec

Yury G. Kudryashov (Jul 20 2023 at 06:13):

Why did it unpack 200 less files?

Sebastian Ullrich (Jul 20 2023 at 07:22):

Yikes, that sure is a bit of duplication with Lake code. @Mario Carneiro What directions do you see for getting Cache and Lake closer together? The simplest and loosest coupling I can see would be specifying an executable in the lakefile that Lake calls just before trying to build a module, but that would probably destroy download parallelism. The next point on the Pareto curve I see would be for Lake to first collect all this data and call the executable with it only once.

Mario Carneiro (Jul 20 2023 at 07:22):

@Yury G. Kudryashov It appears that the 220 missing files are from Std, ProofWidgets etc, since the command I used above only cleared build/lib/Mathlib and not lake-packages/Std/build/lib as well. Another difference:

  • It calculates lake hashes for files, and skips downloading and unpacking for files that are already up to date

Mario Carneiro (Jul 20 2023 at 07:23):

I would really like lake to export some of this as a library but @Mac didn't seem interested in that

Scott Morrison (Jul 20 2023 at 07:25):

I think it would be great to be able to access hashes of files for various REPL / Infotree-extraction things we want to do down the track.

Mac Malone (Jul 20 2023 at 07:26):

My approach would be to try and integrate the per-file downloading into Lake as pre-build step, but that might not get the performance @Mario Carneiro desires. However, it could be potentially somewhat be offset by the ability to do this interactively on VS Code file loads.

Sebastian Ullrich (Jul 20 2023 at 07:27):

Note that having Lake call Cache as part of the build would have two immediate upsides: 1) it's impossible to forget to call Cache 2) we only download the files we actually need. Of course a downside is that we query the cache more often, the overhead of which would have to be measured.

Mario Carneiro (Jul 20 2023 at 07:27):

yeah, a major part of wanting to bring the cost of cache down was so that it could be used as a first step in other tools that just want things to be up to date

Mario Carneiro (Jul 20 2023 at 07:28):

like lake build

Mac Malone (Jul 20 2023 at 07:28):

I should note that I have mentioned integrating cache pre-build is already possible with extraDepTargets, but no one has taken me up on that approach.

Mario Carneiro (Jul 20 2023 at 07:28):

I'm not saying no to that, but we need an extra dep to target first

Mac Malone (Jul 20 2023 at 07:29):

@Mario Carneiro huh?

Sebastian Ullrich (Jul 20 2023 at 07:29):

@Mac This would not provide upside 2), would it?

Mario Carneiro (Jul 20 2023 at 07:29):

@Mac if we want to run lean-cache as an extraDepTarget it has to exist first

Mac Malone (Jul 20 2023 at 07:30):

@Sebastian Ullrich Not through Lake (but cache would still work as it does). To do that in Lake, that would require another planned changed: custom module build facets.

Mario Carneiro (Jul 20 2023 at 07:30):

so far I haven't yet worked out how to integrate this into user workflows

Mac Malone (Jul 20 2023 at 07:31):

@Mario Carneiro the existing cache could also be integrated in the short term.

Mario Carneiro (Jul 20 2023 at 07:31):

the current implementation is mainly targeted at being a drop in replacement for lake exe cache get

Mac Malone (Jul 20 2023 at 07:31):

i.e., just have a target that does the equivalent of lake exe cache get and use it in extraDepTargets

Mario Carneiro (Jul 20 2023 at 07:32):

is that an exe or a script?

Mac Malone (Jul 20 2023 at 07:32):

extraDepTargets are targets -- i.e. arbitrary build code (specfically, IndexBuildM (BuildJob a)).

Sebastian Ullrich (Jul 20 2023 at 07:33):

Mac said:

To do that in Lake, that would require another planned changed: custom module build facets.

I don't see the relation yet. Caching is something that enhances the existing facets, it should not require changing the build invocation

Mario Carneiro (Jul 20 2023 at 07:34):

yeah, adding it as a depTarget would mess up the hashes

Mac Malone (Jul 20 2023 at 07:34):

@Sebastian Ullrich It depends on whether the cache is integrated as part of the default build facet or as add-on. In the former case, the custom facet is unnecessary. In the later, the custom is just the default + cache add-on.

Mac Malone (Jul 20 2023 at 07:34):

Mario Carneiro said:

yeah, adding it as a depTarget would mess up the hashes

You can just have the cache target produce a nilHash no matter what (regardless of whether the cache is used).

Sebastian Ullrich (Jul 20 2023 at 07:37):

Any kind of automatic caching definitely needs an opt-out like --offline. To me that suggests it should at least be a built-in notion of Lake even if the actual cache retrieval is done by an external process.

Mac Malone (Jul 20 2023 at 07:38):

@Sebastian Ullrich Lake already supports custom configuration options through -K, so I do not think that would be a limiting factor.

Mario Carneiro (Jul 20 2023 at 07:39):

configuration options also mess up the cache

Mac Malone (Jul 20 2023 at 07:39):

However, I do think integration should be the eventual goal. Especially, when we have establish an official package repository for cloud builds.

Mac Malone (Jul 20 2023 at 07:39):

Mario Carneiro said:

configuration options also mess up the cache

They do not have to?

Mario Carneiro (Jul 20 2023 at 07:39):

I don't think we need to establish an official package repository, just a caching protocol

Mario Carneiro (Jul 20 2023 at 07:40):

e.g. in mathlib's lakefile

cache_server "https://lakecache.blob.core.windows.net/mathlib4"

should be enough

Mac Malone (Jul 20 2023 at 07:42):

Sure, but I do think a package repository is an important end goal for a langauge.

Mario Carneiro (Jul 20 2023 at 07:42):

maybe, but I don't think we have any current takers for that infra

Sebastian Ullrich (Jul 20 2023 at 07:42):

To be honest I'm not sure how we got on this custom facet track when my initial assertion already was that retrieving the cache for each file individually is probably too slow

Mario Carneiro (Jul 20 2023 at 07:43):

That's another part of the caching protocol that could be improved

Mario Carneiro (Jul 20 2023 at 07:43):

specifically, it would be nice to send the list of files to download to the server and have it send all of them

Mac Malone (Jul 20 2023 at 07:44):

@Sebastian Ullrich Oh, sorry, I just kind of assuming individual file caching, since that is what cache is currently doing (afaik).

Mario Carneiro (Jul 20 2023 at 07:44):

unfortunately that's not naturally part of static file serving

Mario Carneiro (Jul 20 2023 at 07:46):

currently there are two kinds of files stored on the server:

CACHE_URL/f/1234.ltar: individual file hash (not tied to any particular commit but hashing all the deps)
CACHE_URL/c/1234: list of files contained in one commit

Mario Carneiro (Jul 20 2023 at 07:46):

the latter kind of file is unused AFAIK

Sebastian Ullrich (Jul 20 2023 at 07:47):

I don't remember if we benchmarked this, but just doing the retrieval from a single process like we're already doing has the tremendous benefits of HTTP multiplexing over long-lived connections. That's what I meant by calling the cache executable only once.

Mario Carneiro (Jul 20 2023 at 07:47):

There is some code to download directory listings from CACHE_URL/f, but it maxes out at 5000 files so it's not very useful for determining in advance whether something is on the server

Mario Carneiro (Jul 20 2023 at 07:48):

I have not yet implemented the version which links to curl's API, this will help with overlapping unpacking and downloading

Mario Carneiro (Jul 20 2023 at 07:48):

the current implementation is just calling curl --parallel like lake exe cache does

Sebastian Ullrich (Jul 20 2023 at 07:56):

Minor downer in this respect: the direct blob storage does not actually support HTTP/2.0 :sweat_smile: . In summary: we need more benchmarks!

Mario Carneiro (Jul 20 2023 at 07:57):

do you mean the storage supported by azure?

Sebastian Ullrich (Jul 20 2023 at 07:57):

Yes https://learn.microsoft.com/en-us/rest/api/storageservices/http-version-support

Arthur Paulino (Jul 20 2023 at 09:52):

Mario Carneiro said:

currently there are two kinds of files stored on the server:

CACHE_URL/f/1234.ltar: individual file hash (not tied to any particular commit but hashing all the deps)
CACHE_URL/c/1234: list of files contained in one commit

@Gabriel Ebner asked for the second type of stored files for garbage collection. We'd like to keep the files that are bound to some commit on master, only. Everything else would be deleted after some time window

Mario Carneiro (Jul 28 2023 at 01:26):

I didn't write a readme so here's how you use it:

  • To compile:
    • cd lake_ext; lake build for the lean part
    • cargo build --release for the rust part
  • To run, from the target project directory, run:
    • /lake_ext/build/bin/lake_ext to generate some metadata (only needs to be rerun if the project configuration changes)
    • /target/release/lean-cache get

Scott Morrison (Jul 28 2023 at 01:26):

Sorry, @Mario Carneiro, could I have a hint about running this. I did a cargo install --path ., after which

% time cache-rs
...
cache-rs  0.00s user 0.00s system 61% cpu 0.004 total

looks very healthy, but then

% time cache-rs get
thread 'main' panicked at 'workspace-manifest hash does not match, run 'lake-ext'', src/main.rs:532:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
cache-rs get  0.00s user 0.00s system 63% cpu 0.004 total

Scott Morrison (Jul 28 2023 at 01:29):

First step in compilation instructions need to be cd lake-ext; lake update; lake build (hyphen instead of underscore, add lake update).

Scott Morrison (Jul 28 2023 at 01:31):

In the later steps, should be:

/lake-ext/build/bin/lake_ext

and

/target/release/cache-rs get

Mario Carneiro (Jul 28 2023 at 01:32):

just pushed some name changes so that it's /lake-ext/build/bin/lake-ext and /target/release/lean-cache get respectively

Scott Morrison (Jul 28 2023 at 01:32):

But with those changes, it runs and is instant.

Mario Carneiro (Jul 28 2023 at 01:33):

how instant?

Scott Morrison (Jul 28 2023 at 01:33):

% time ../lean-cache/target/release/cache-rs get
Nothing to do
../lean-cache/target/release/cache-rs get  0.18s user 0.31s system 94% cpu 0.519 total

Mario Carneiro (Jul 28 2023 at 01:34):

ah okay, yes Nothing to do is pretty fast, hopefully that part can be upstreamed to lake so that lake can also do it in less than 1s

Mario Carneiro (Jul 28 2023 at 01:34):

the main job there is the olean hash computation

Mac Malone (Jul 28 2023 at 23:56):

Scott Morrison said:

First step in compilation instructions need to be cd lake-ext; lake update; lake build (hyphen instead of underscore, add lake update).

this can be shorten to lake -d lake-ext build -U

Mario Carneiro (Jul 29 2023 at 00:07):

@Mac has a change to the trace computation landed? I'm having issues with lake exe cache get (both the old one and new one) in mathport, the unpacked files have incorrect traces and are rebuilt by lake build. For example, Std.Data.Option.Init.Lemmas unpacked from the mathlib cache for commit https://github.com/leanprover-community/mathlib4/commit/6220e1d0219e245327c603be771b67aea0fdeb4d has a .trace file containing 17557947031058207531, and I reliably get this number whether unpacking via lake exe cache get, but lake build wants the trace file to read 11933920017627886034 and rebuilds the file to make it so

Mario Carneiro (Jul 29 2023 at 00:08):

It's not clear to me how this would happen since lake exe cache get isn't even doing its own trace computation, it uses the lake bundled with lean to build the files

Mac Malone (Jul 29 2023 at 00:11):

@Mario Carneiro I haven't changed Lake in almost a month; however, the Lake in Lean was bumped to the latest prerelease when it was merged into the Lean repo a week ago. If the behavior is less than a week new, then nothing has changed that I know of. Otherwise, it is possible I accidently broke something as Lake does not have rich trace tests, but no intentional changes come to mind.

Mario Carneiro (Jul 29 2023 at 00:12):

hm, I just tried the same test in mathlib itself, and both lake build and lake exe cache get agree that lake-packages/std/build/lib/Std/Data/Option/Init/Lemmas.trace should contain 17557947031058207531

Mario Carneiro (Jul 29 2023 at 00:13):

so maybe lake's trace depends on an aspect of the parent project?

Mac Malone (Jul 29 2023 at 00:14):

Possibly

Mario Carneiro (Jul 29 2023 at 00:14):

aha, mathlib bumped its lean-toolchain, mathport was on 07-15 instead of 07-24

Mario Carneiro (Jul 29 2023 at 00:17):

although that's a bit sad, lake used to be more forgiving about slight mismatches in lean version as long as everything builds but if it's in the hash then everything has to exactly match

Mac Malone (Jul 29 2023 at 00:43):

Mario Carneiro said:

although that's a bit sad, lake used to be more forgiving about slight mismatches in lean version as long as everything builds but if it's in the hash then everything has to exactly match

Lake can build across different lean-toolchain (i.e., a root package and dependency can have different toolchains and Lake will use the root's without complaint). However, builds themselves are meant to be as locked and reproducible as possible, so any change in the significant configuration (like the Lean version) causes a rebuild.

Mac Malone (Jul 29 2023 at 00:46):

In fact, when Lake was not properly doing this, that was filed as a bug (e.g., !lake#26, !lake#62).

Mario Carneiro (Jul 29 2023 at 00:56):

This basically means that build caching is impossible when you have diamond issues in your dependencies

Mario Carneiro (Jul 29 2023 at 00:57):

that is, any lean version bump is "semver breaking" and you have to make sure all deps update in lock-step

Mario Carneiro (Jul 29 2023 at 00:59):

which is a major challenge if you have a rarely updated dep of a frequently updated package, like proofwidgets -> mathlib

Mac Malone (Jul 30 2023 at 19:04):

Mario Carneiro said:

This basically means that build caching is impossible when you have diamond issues in your dependencies

Yes, hence @Gabriel Ebner 's first comment in the original cache RFC for Lake:

The opportunity to reuse caches from std4 is extremely limited, since the Lean toolchain versions need to match up exactly.

Mac Malone (Jul 30 2023 at 19:05):

The only reason it works for mathlib is because deps are downstream packages mirror mathlib's toolchain. Outside mathlib, cache generally will miss.

Mac Malone (Jul 30 2023 at 19:07):

This is why I have often noted that cache does not generalize well outside mathlib.

Mac Malone (Jul 30 2023 at 19:10):

It is also important to note that this is not just a Lake limitation, Lean can segfault easily if it loads oleans from a different toolchain version.

Mac Malone (Jul 30 2023 at 19:13):

To make caching generally feasible, Lean needs to have infrequent versioned milestone releases that project can reasonably be regularly updated to and their builds cached. Or, we need a spacious centralized build repository were multiple machines can automatically upload builds for every nightly / toolchain version.

Mario Carneiro (Aug 28 2023 at 00:04):

I'd like to make some more progress on this. lean-cache is now working with a built-in curl (so no more curl --parallel issues) and the nothing-to-download case is about 120ms, which is acceptably fast to use as a pre-build step for other tools.

The next step is distribution, and this is more challenging than for leantar because the program isn't running through lake, it's a standalone executable. (We can of course set up lake exe cache to be a wrapper which downloads and runs lean-cache, possibly even doing version management like it is doing for leantar, but this would bring back a lot of the performance issues. I would prefer users run lean-cache directly.) It would be great if elan could manage versions of lean-cache as well and putting it in the path, but I don't know if that is possible without bundling it into lean4 repo. @Sebastian Ullrich thoughts?

Sebastian Ullrich (Aug 28 2023 at 12:43):

If it becomes a pre-build step, we've already paid the Lake startup cost, no?

Mario Carneiro (Aug 28 2023 at 12:46):

yes, although the lake startup cost is not well distributed, any time you call lake you have to pay it again, and it's difficult for lean-cache to benefit from that time spent

Mario Carneiro (Aug 28 2023 at 12:48):

lean-cache still needs lake sometimes, when there is a change to the lakefile or build configuration, and currently this is being handled by a second executable which links to lake and just writes the workspace configuration to disk so that lean-cache can make use of it

Mario Carneiro (Aug 28 2023 at 12:48):

it would be great if lake could subsume that tool

Mario Carneiro (Aug 28 2023 at 12:49):

otherwise I will have to decide whether to try to merge lean-cache and lake-ext by linking lean+lake into lean-cache, or else ship two binaries and have double the distribution fun

Sebastian Ullrich (Aug 28 2023 at 12:50):

/cc @Mac Malone

Mario Carneiro (Aug 28 2023 at 12:52):

I'm also considering writing a lake-lite utility, which parses lakefiles directly (without lean) and recognizes some common patterns. If it can be made to cover a significant fraction of lakefiles in the wild, then I think it could cut processing time by 10x, because most of the time there is spent in lean initialization, building simp sets and other nonsense

Mario Carneiro (Aug 28 2023 at 12:52):

and we can always fall back on full lake if the lakefile is complex

Joachim Breitner (Aug 28 2023 at 17:06):

I wonder how likely it is that the lean ecosystem will go the same way as Haskell’s – starting with a build description (Setup.hs) in our new language (because it’s a great language, so let’s use it), to later notice that non-declarative build descriptions stand in the way of many features, so a (non-executable) description for the 90% case emerges, until in a few years hardly anybody remembers that Setup.hs, but it’s complexity keeps haunting the ecosystem :-)

Julian Berman (Aug 28 2023 at 20:11):

That's the Python story as well, though we remember setup.py because of how badly it tortured us.

Scott Morrison (Aug 29 2023 at 01:40):

(The baroque atrocities of .sbt files you would encounter in every Scala project was part of what drove me away from Scala, years ago. The ASCII art in build files didn't help either.)

Sky Wilshaw (Aug 30 2023 at 12:32):

I feel that Rust's build system is pretty well-designed: the Cargo.toml file is a declarative description of how to compile a given crate, and you can add a build.rs file (which can even have its own dependencies) to run arbitrary code at build time, e.g. compiling and linking C libraries.


Last updated: Dec 20 2023 at 11:08 UTC