Zulip Chat Archive
Stream: lean4
Topic: Lake cache get and put speeds
Simon Sorg (Jan 02 2026 at 13:00):
Currently, lake cache put and lake cache get iterate over all artifacts and send curl requests one by one to the S3 bucket. While this works, even with unlimited upload speeds it would take latency * number of files seconds to upload. This means that, for instance, uploading a recent Mathlib commit to a lake cache whose bucket is in the US, from an EU country, takes around 17 minutes of pure latency (using 130ms as the latency, a number I took off the internet, not meant to be exact). In practice, lake cache put takes me 90 minutes for a recent Mathlib commit even with high upload speeds, due to authentication + signing time. The same latency calculation applies to lake cache get, but empirically lake cache get is a lot closer to 17 minutes, as there is no authentication overhead.
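(The back-of-envelope estimate above can be checked with a quick calculation; both numbers below are illustrative assumptions for a large project, not exact Mathlib measurements:)

```python
# Rough lower bound for strictly sequential requests:
# total time ~= per-request latency * number of files.
# Both inputs are assumptions for illustration, not measurements.

latency_s = 0.130   # assumed round-trip latency, EU client -> US bucket
num_files = 8000    # assumed artifact count for a large project

total_min = latency_s * num_files / 60
print(f"{total_min:.1f} minutes")  # ~17 minutes of pure latency
```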
As repos grow, users of cache put / cache get spend an increasingly long time waiting for cache results.
To mitigate this, one could expose a --num-threads option on lake cache put and get, and internally send num-threads curl / HTTP GET requests in parallel.
Alternatively, one could rely on LEAN_NUM_THREADS as a cap to avoid sending too many requests in parallel.
I assume the major beneficiaries of this feature will be users with their own lake cache, which is admittedly not used much yet, but it becomes increasingly relevant for large formalization projects, of which we are seeing more and more.
Also, the current very slow speed is surely a reason to be hesitant to use lake cache get and put in the first place, so I hope this could help with adoption.
On the downside, this would make the cache upload logic a bit more complex, as one would have to adjust uploadArtifacts and downloadArtifacts to submit exactly num-threads tasks at a time. If LEAN_NUM_THREADS is used to limit concurrency instead, the map would only change slightly: use pure tasks and wait for all of them to finish.
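(The bounded-concurrency pattern described here is language-agnostic; a minimal Python sketch of the idea, with a placeholder standing in for the real per-file curl upload, and a hypothetical num_threads parameter mirroring the proposed --num-threads / LEAN_NUM_THREADS cap:)

```python
from concurrent.futures import ThreadPoolExecutor

def upload_artifact(name: str) -> str:
    # Placeholder for the real per-file upload (e.g. a signed S3 PUT via curl).
    return f"uploaded {name}"

def upload_all(artifacts: list[str], num_threads: int) -> list[str]:
    # The executor keeps at most num_threads uploads in flight at once;
    # map preserves input order and blocks until all tasks finish.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(upload_artifact, artifacts))

results = upload_all([f"artifact{i}" for i in range(10)], num_threads=4)
print(results[0])
```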
Is this still a feature worth adding?
Snir Broshi (Jan 02 2026 at 13:26):
#mathlib4 > lake exe cache get very slow
Snir Broshi (Jan 02 2026 at 13:28):
FWIW I disagree with the latency * number of files calculation, as a file ≠ a single packet, but I agree with the conclusion about threads
Snir Broshi (Jan 02 2026 at 13:29):
Are you sure this isn't multithreaded already?
Simon Sorg (Jan 02 2026 at 13:30):
(Oh yeah totally, sorry, it was only a very poor lower bound)
This is extremely cool, but it's done on Mathlib's caching, not Lake's. I believe there's also value in doing this for lake.
Simon Sorg (Jan 02 2026 at 13:32):
Are you sure this isn't multithreaded already?
https://github.com/leanprover/lean4/blob/4eb5b5776da2c386e0808e8844abb2b6eec0a8e3/src/lake/Lake/Config/Cache.lean#L441C12-L441C27
This function takes the full list from the mappings and maps the upload over them, but there is no task spawning anywhere inside the map function, so I believe this is single-threaded, though note I am not too familiar with threading in Lean (yet).
Snir Broshi (Jan 02 2026 at 13:38):
Simon Sorg said:
(Oh yeah totally, sorry, it was only a very poor lower bound)
This is extremely cool, but it's done on Mathlib's caching, not Lake's. I believe there's also value in doing this for lake.
Are you sure? Isn't lake exe cache a wrapper around lake cache? They're currently trying to optimize leangz, so as long as Lake's cache also uses it, it applies there too
Simon Sorg (Jan 02 2026 at 13:49):
From what I can tell, lake exe cache get has its own curl command that is parallelised here: https://github.com/leanprover-community/mathlib4/blob/master/Cache/Requests.lean#L379
Ruben Van de Velde (Jan 02 2026 at 14:07):
Yeah, lake exe cache predates lake cache
Mac Malone (Jan 03 2026 at 06:09):
Simon Sorg said:
Also, the currently very slow speed is for sure a reason to be hesitant to use lake cache get and put in the first place, so I hope this could help with adoption.
Honestly, I am a little surprised people are adopting lake cache already! I expected people to start exploring it after it saw some use in Lean core or Mathlib (and there was more information on how to use it). You are correct that lake cache get / lake cache put are currently prohibitively slow for general use. This is something I am planning to fix in the coming quarter.
Last updated: Feb 28 2026 at 14:05 UTC