Zulip Chat Archive

Stream: lean4

Topic: leangz: olean file compressor


Mario Carneiro (Jun 23 2023 at 13:28):

As advertised earlier, I've implemented https://github.com/digama0/leangz , a tool that compresses olean files into a new file format, .lgz, and decompresses them back. Because it is performance-sensitive, I wrote it in Rust instead of Lean, with the idea being that it would replace the current tgz call in lake exe cache.

$ cargo build --release
$ time target/release/leangz ../lean4/build/release/stage1/lib/lean/Init/Core.olean
________________________________________________________
Executed in   88.34 millis    fish           external
   usr time   84.27 millis    1.58 millis   82.69 millis
   sys time    4.34 millis    0.40 millis    3.94 millis
$ time target/release/leangz -x ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz
________________________________________________________
Executed in   25.72 millis    fish           external
   usr time   18.91 millis    1.15 millis   17.75 millis
   sys time    7.10 millis    0.00 millis    7.10 millis
$ target/release/leangz ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz.olean
$ gzip -k ../lean4/build/release/stage1/lib/lean/Init/Core.olean
$ ls -al ../lean4/build/release/stage1/lib/lean/Init/Core.olean*
-rw-rw-r-- 1 mario mario 1772440 Jun 22 21:53 ../lean4/build/release/stage1/lib/lean/Init/Core.olean
-rw-rw-r-- 1 mario mario  554831 Jun 22 21:53 ../lean4/build/release/stage1/lib/lean/Init/Core.olean.gz
-rw-rw-r-- 1 mario mario  361867 Jun 23 09:09 ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz
-rw-rw-r-- 1 mario mario 1772440 Jun 23 09:09 ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz.olean
-rw-rw-r-- 1 mario mario  361867 Jun 23 09:12 ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz.olean.lgz
$ diff ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz.olean.lgz
$ diff ../lean4/build/release/stage1/lib/lean/Init/Core.olean ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz.olean
Binary files ../lean4/build/release/stage1/lib/lean/Init/Core.olean and ../lean4/build/release/stage1/lib/lean/Init/Core.olean.lgz.olean differ

As you can see, the olean.lgz file is 20% the size of the original olean, and 34% smaller than the gzipped olean file. (Note that lgz includes a gzip step internally; the lgz file without compression is 757152 bytes, or 42% of the original olean.)

One thing that has not yet been solved is whether it is possible to get bit-for-bit identical oleans back. As you can see, the olean -> lgz -> olean process does not yield exactly the same result, although lgz -> olean -> lgz does. I believe this is because the olean compactor in lean core uses a different traversal strategy which yields a different ordering for the nodes and hence different pointer values. This also causes a problem for using it in lake exe cache, since either lake needs to run an extra olean -> lgz -> olean roundtrip before calculating the trace, or we need to calculate new traces, or else lake will refuse to use the unpacked olean files. And even if it does use them because we overrode the trace, any locally compiled oleans will not match that trace and cause a bunch of recompiles.

Sebastian Ullrich (Jun 23 2023 at 13:49):

Why do you have a custom traversal order? Can't you traverse the objects in the order they appear in the file, which is already topologically sorted? And have you thought about compressing n-ary applications yet?

Sebastian Ullrich (Jun 23 2023 at 13:52):

If we have the full library ecosystem of Rust available, we can also move from gzip to zstd

Mario Carneiro (Jun 23 2023 at 14:33):

The original traversal order is lost because lgz stores the nodes in a different way which avoids allocating names for nodes that are used only once. It should still be possible to recover the olean traversal order, though, since decompression is much like the original compaction stage. It is a bit annoying that the order isn't what I am generating now: a very canonical postorder DFS
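
(To illustrate what "canonical" means here — a minimal Rust sketch of postorder numbering over the object graph, not the actual leangz code:)

// Every node is numbered only after all of its children, so each
// node's index is larger than those of everything it points to.
fn postorder(node: usize, children: &[Vec<usize>], seen: &mut [bool], order: &mut Vec<usize>) {
    if seen[node] {
        return; // shared subterm, already numbered via an earlier parent
    }
    seen[node] = true;
    for &c in &children[node] {
        postorder(c, children, seen, order); // children first...
    }
    order.push(node); // ...then the node itself gets the next index
}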

Mario Carneiro (Jun 23 2023 at 14:40):

Sebastian Ullrich said:

And have you thought about compressing n-ary applications yet?

So far it is only doing the type-oblivious thing, traversing the object graph, which means it can't do anything expr-specific. I would like to give it type info, at least enough to identify exprs, because you can do a lot better with types: you don't need to say how many arguments each constructor has, and most importantly you can remove the hash field, which is aggressively bad for compression.

Sebastian Ullrich (Jun 23 2023 at 15:26):

From my previous numbers, objects with constructor tag 5, which are presumably almost entirely applications, are really all that matters. So if adding type info is nontrivial, you could simply check, for any such object, whether it has an app shape and whether the redundant data checks out, and in that case discard the data and pack the spine together.
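
Roughly this (a Rust sketch; the three-field layout and the recomputable cached data are assumptions for illustration, not the real olean object layout):

struct Obj {
    tag: u8,
    fields: Vec<u64>, // pointer fields followed by scalar data
}

// If a tag-5 object looks like an application `f a` and its cached
// scalar data can be recomputed from the children, drop the data and
// keep only the spine; otherwise fall back to the generic encoding.
fn try_pack_app(o: &Obj, recompute: impl Fn(u64, u64) -> u64) -> Option<(u64, u64)> {
    if o.tag != 5 || o.fields.len() != 3 {
        return None; // not shaped like an application node
    }
    let (f, a, cached) = (o.fields[0], o.fields[1], o.fields[2]);
    if recompute(f, a) == cached {
        Some((f, a)) // redundant data checks out: emit just the spine
    } else {
        None
    }
}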

Mario Carneiro (Jun 24 2023 at 07:21):

The olean reproducibility issue is now fixed (although I cheated a bit by changing lean instead of leangz: lean4#2286)

Yury G. Kudryashov (Jun 24 2023 at 12:32):

Did you compare gz to, e.g., xz for the generic compression step?

Yury G. Kudryashov (Jun 24 2023 at 12:34):

BTW, it seems that the .lgz extension is taken on Windows by the MS Windows Application Log.

Mario Carneiro (Jun 24 2023 at 12:46):

The flate2 library I am using has three available compression formats (DEFLATE, gzip, and zlib), but a basic test didn't show any major differences in compression quality. I have not tried a large-scale test though. There is also a quality knob (fast <-> best) which can be tweaked depending on how much we think it is worth squeezing extra bytes out.
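
For concreteness, the knob I'm talking about looks like this in flate2 (gzip shown; the DeflateEncoder and ZlibEncoder types are drop-in analogues):

use flate2::{write::GzEncoder, Compression};
use std::io::Write;

// Compress a byte slice at a given quality level (1 = fast, 9 = best).
fn gzip_bytes(data: &[u8], level: u32) -> std::io::Result<Vec<u8>> {
    let mut enc = GzEncoder::new(Vec::new(), Compression::new(level));
    enc.write_all(data)?;
    enc.finish() // returns the Vec<u8> of compressed bytes
}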

Yury G. Kudryashov (Jun 24 2023 at 17:55):

What if you run an external xz to compare? I remember that moving from tar.gz to tar.xz saved us some space (hence, download time) with Lean 3 cache.

Alex J. Best (Jun 24 2023 at 18:02):

I also wonder if zlib with a preconfigured dictionary would do a good job; after all, we are serving individual files now, right? So a dictionary might be more relevant than before

Mario Carneiro (Jun 25 2023 at 12:02):

Woohoo, I implemented @Sebastian Ullrich's idea to heuristically identify Name / Level / Expr constructors in the olean files, and the resulting file is 2.7x smaller than the old implementation and 13.3x smaller than the original .olean file (or 4.1x smaller than the olean.gz files we are using in lake exe cache). At this point I think it is competitive with the old lean 3 .xz files.

1772440 Init/Core.olean
 554831 Init/Core.olean.gz
 361866 Init/Core.olean.lgz
 133153 Init/Core.olean.lgz2
1772440 Init/Core.olean.lgz2.olean
1772440 Init/Core.olean.lgz.olean

The reason I didn't use the type-aware version, at least for now, is that types are not always available, for example in dynamically typed environment extensions, and it is good to be able to compress exprs that might appear in those terms as well.

Sebastian Ullrich (Jun 25 2023 at 12:04):

That's fantastic!

Mario Carneiro (Jun 25 2023 at 15:41):

After a bit more tweaking, I managed to get the compression ratio up to 14.8x on Lean. On Mathlib the compression ratio is even higher, a 16.9x reduction from 2.72GB to 160MB. Does anyone know how that compares to mathlib3?

Mario Carneiro (Jun 25 2023 at 15:52):

The choice of compression algorithm doesn't seem to make too big of a difference, compared to the wins I was getting with just adjusting the encoding. These are the numbers for compressing Lean with various algorithms and quality settings: (I've been using gz 7 for most of the numbers in this thread.)

deflate 1: 599922104 / 50503772 = 11.878758
zlib    1: 599922104 / 50508068 = 11.877748
gz      1: 599922104 / 50516660 = 11.875728
deflate 7: 599922104 / 40508075 = 14.809939
zlib    7: 599922104 / 40512371 = 14.808368
gz      7: 599922104 / 40520963 = 14.805228
deflate 9: 599922104 / 40460111 = 14.827495
zlib    9: 599922104 / 40464407 = 14.825921
gz      9: 599922104 / 40472999 = 14.822774

Mario Carneiro (Jun 25 2023 at 15:57):

Unfortunately, Patrick is right that this doesn't really solve many of the problems associated with lean's large disk usage: nightlies will still be big (although they can be shipped to you smaller), so you still have to wipe your elan cache regularly, and copies of mathlib in your local projects will still be big. The only things that are directly impacted are that the .mathlib cache will be smaller, and cache download times will be shorter.

Mario Carneiro (Jun 25 2023 at 16:00):

To go the last mile would require lean itself to support reading these compressed files in memory, which unfortunately kills a lot of the advantage of mmap's instant load times. Maybe that's a tradeoff users are willing to make, although I'm not sure how to expose it as an option. Maybe it would be lake's responsibility?

Mario Carneiro (Jun 25 2023 at 16:08):

Mario Carneiro said:

After a bit more tweaking, I managed to get the compression ratio up to 14.8x on Lean. On Mathlib the compression ratio is even higher, a 16.9x reduction from 2.72GB to 160MB. Does anyone know how that compares to mathlib3?

I just checked, and recent mathlib3 tar.xz files are 92MB, so I guess there is still some way to go.

Henrik Böving (Jun 25 2023 at 17:08):

This isn't quite a fair comparison, though, due to the extra size the new code generator adds to the oleans. If you want an actual comparison to mathlib3, you should probably add the option that disables it to the lakefile and compile with that

Mario Carneiro (Jun 25 2023 at 17:12):

the code generator only adds about 5% to the oleans though

Mario Carneiro (Jun 25 2023 at 17:12):

at least in mathlib

Mario Carneiro (Jun 25 2023 at 17:14):

I think that most of the rest of the gains are to be found deeper in lean, though; I expect we are really just producing larger or more complex expressions for certain things

Gabriel Ebner (Jun 25 2023 at 17:35):

Mario Carneiro said:

The choice of compression algorithm doesn't seem to make too big of a difference, compared to the wins I was getting with just adjusting the encoding. These are the numbers for compressing Lean with various algorithms and quality settings:

I hope you realize that deflate, zlib, and gz are essentially the same algorithm, right? You should definitely try zstd instead.

Mario Carneiro (Jun 25 2023 at 17:39):

I have very little knowledge about how those compression algorithms relate, they just happened to be the ones provided by the library I'm using (which makes sense, if they are all similar)

Mario Carneiro (Jun 25 2023 at 18:06):

Okay, zstd does improve the result:

zstd 3:  599922104 / 41122345 = 14.588713
zstd 15: 599922104 / 38674486 = 15.512090
zstd 18: 599922104 / 35631733 = 16.836737
zstd 21: 599922104 / 35565313 = 16.868180
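
(For comparison with the flate2 snippet above, the zstd crate call is essentially a one-liner:)

// Compress a byte slice with zstd at the given level.
fn zstd_bytes(data: &[u8], level: i32) -> std::io::Result<Vec<u8>> {
    zstd::stream::encode_all(data, level)
}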

Henrik Böving (Jun 25 2023 at 18:11):

Unclear whether I am supposed to react with :upwards_trend: or :downwards_trend: to this...

Henrik Böving (Jun 25 2023 at 18:12):

What does the magic number after zstd mean? Is that the parameter for how hard it is supposed to compress?

Mario Carneiro (Jun 25 2023 at 18:12):

yes, it apparently ranges from 1-21 with a default of 3

Mario Carneiro (Jun 25 2023 at 18:12):

unlike the others that go from 1-9

Mario Carneiro (Jun 25 2023 at 18:13):

I'm not sure what we are using on mathlib CI but it's probably cranked to max

Gabriel Ebner (Jun 25 2023 at 19:39):

We don't use zstd on mathlib ci, only on Lean 4 CI. There we use 19 from what I can tell.

Patrick Massot (Jun 25 2023 at 19:48):

Mario, thanks a lot for working on this! For teaching I would clearly prefer slow loading times with small olean files to fast loading with large oleans.

Patrick Massot (Jun 25 2023 at 19:49):

The growing elan cache isn't an issue for teaching since we usually don't want students to update Lean or mathlib during a course.

Scott Morrison (Jun 26 2023 at 00:47):

Regarding the compression trade-off, the ideal situation for me would be tightly compressed oleans over the network, but then storing oleans uncompressed in the cache for fast switching between branches. (I'm sure I decompress the same oleans multiple times every day.)

Scott Morrison (Jun 26 2023 at 00:47):

Perhaps there could even be two layers of the cache if we implement the auto-cleaning feature discussed earlier. E.g. I'd happily keep X gigabytes of uncompressed oleans, plus a further Y gigabytes of compressed oleans, etc.

Mario Carneiro (Jun 26 2023 at 01:18):

Scott Morrison said:

Regarding the compression trade-off, the ideal situation for me would be tightly compressed oleans over the network, but then storing oleans uncompressed in the cache for fast switching between branches. (I'm sure I decompress the same oleans multiple times every day.)

I'll get you some numbers soon, but I'm hoping it won't be necessary to do this. Decompression is quite fast; you should not notice it compared to the current tar performance. (It might even be faster; we'll see.)

Gabriel Ebner (Jun 26 2023 at 01:47):

I'm not sure if this can completely replace the tar stuff. We also cache ileans and c files.

Mario Carneiro (Jun 26 2023 at 02:01):

Hm, maybe I should take the compression out of the lgz format and instead use tar.xz on uncompressed lgz + ilean + c? There could still be performance gains by just doing that all in-process instead of shelling out to thousands of commands as is currently done

Yury G. Kudryashov (Jun 26 2023 at 02:27):

Are we shelling out or just exec'ing extra processes? I mean, do we run /bin/sh (or an equivalent) in addition to /usr/bin/tar?

Mario Carneiro (Jun 26 2023 at 02:27):

pretty sure it's just exec

Mario Carneiro (Jun 26 2023 at 02:32):

One thing that I would really like to add to that process, though, is checking the trace and not unpacking the file if it's already up to date. Right now that is the bottleneck when you call lake exe cache and there is nothing to do
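
Conceptually something like this (a sketch under the assumption that each build product has a .trace file holding a decimal hash, per the discussion further down; the file naming is illustrative):

use std::path::Path;

// Skip unpacking when the on-disk trace already matches the expected hash.
fn is_up_to_date(olean: &Path, expected_hash: u64) -> bool {
    let trace = olean.with_extension("trace");
    match std::fs::read_to_string(&trace) {
        Ok(s) => s.trim().parse::<u64>().ok() == Some(expected_hash),
        Err(_) => false, // no trace yet, so the file needs unpacking
    }
}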

Moritz Firsching (Jun 26 2023 at 06:02):

perhaps it would be worth trying brotli instead of zstd as well?!

Mario Carneiro (Jun 26 2023 at 06:06):

isn't brotli for text?

Mario Carneiro (Jun 26 2023 at 06:07):

I was thinking about that; dictionary approaches may well help, but we probably want a custom dictionary, and I'm not sure if any of these compression algs support that

Mario Carneiro (Jun 26 2023 at 07:39):

I have performance numbers for unpacking mathlib now:

baseline: 12.6 s compress,  1.13 s decompress, 2749684104 / 372063364 = 7.39x
gz 7:     34.1 s compress, 13.69 s decompress, 2749684104 / 162315766 = 16.94x
zstd 19: 216.5 s compress, 11.16 s decompress, 2749684104 / 143815597 = 19.12x

Here zstd 19 means that we used zstd compression, and baseline is the case of no off-the-shelf compression, just the manually written stuff. Compare this with the approximately 15 seconds of tar time spent in lake exe cache when there is nothing to do. But note also that this is single-threaded performance; it can be divided by 8 or however many cores you have in practice. That brings the baseline performance to around 140ms, which I think is small enough to be usable by lean itself during startup, assuming we can resolve the linking issues with bringing in this library (or rewrite the decompressor in C++...). @Sebastian Ullrich do you have any thoughts on how one might use leangz in the lean4 repo? I think it would be best to keep the codec as a separate project, and statically link it with lean so that it can unpack directly into the mmap regions on startup.
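
(The "divide by cores" estimate assumes a driver along these lines — a hypothetical sketch, not leangz code:)

use std::path::PathBuf;

// Split the file list across the available cores and decompress
// each chunk on its own thread.
fn unpack_all(files: Vec<PathBuf>, decompress_one: fn(&PathBuf)) {
    let n = std::thread::available_parallelism().map_or(8, |n| n.get());
    let chunk = ((files.len() + n - 1) / n).max(1);
    std::thread::scope(|s| {
        for part in files.chunks(chunk) {
            s.spawn(move || part.iter().for_each(|f| decompress_one(f)));
        }
    });
}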

Sebastian Ullrich (Jun 26 2023 at 08:37):

I don't think uncompressing at Lean run time is a good approach; the memory overhead of parallel Lean processes would be prohibitive unless we implemented explicit interprocess sharing (which we get for free when the source is uncompressed files)

Mario Carneiro (Jun 26 2023 at 09:26):

I didn't suggest parallel lean processes; rather, lean would link to leangz, which may use threads to load files in parallel

Mario Carneiro (Jun 26 2023 at 09:28):

actually even that isn't really necessary, it could be just a plain single-threaded function on the leangz side and lean does the multithreading using tasks

Sebastian Ullrich (Jun 26 2023 at 09:29):

No, what I mean is that you don't want to keep a separate, uncompressed copy of mathlib in memory for each open Lean file

Mario Carneiro (Jun 26 2023 at 09:29):

oh I see, that sharing is not being managed by lean

Mario Carneiro (Jun 26 2023 at 09:30):

you could do it by creating a temp file and mmapping that
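
i.e., something along these lines (a sketch assuming the tempfile and memmap2 crates; in practice the file would need to outlive the process, which is exactly the cleanup question below):

use std::io::Write;

// Write the decompressed olean image to a file, then mmap it read-only.
// Every process mapping the same file shares the physical pages via the
// OS page cache, which is the sharing that mmap'd oleans get for free.
fn share_decompressed(bytes: &[u8]) -> std::io::Result<memmap2::Mmap> {
    let mut file = tempfile::NamedTempFile::new()?;
    file.write_all(bytes)?;
    unsafe { memmap2::Mmap::map(file.as_file()) }
}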

Mario Carneiro (Jun 26 2023 at 12:04):

zstd 19:        199.77 s compress, 10.66 s decompress, 2749684104 / 143833925 = 19.11x
zstd 19 + dict: 100.64 s compress, 10.22 s decompress, 2749684104 / 139903272 = 19.65x

I managed to figure out how to use the dictionary creation feature of zstd, and it improves things on all axes: significantly faster compress time, better compression, and slightly faster decompress time
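
The relevant calls in the zstd crate, roughly (the sample set and dictionary size here are made-up placeholders):

// Train a shared dictionary from sample payloads, then compress with it.
fn compress_with_dict(samples: &[Vec<u8>], input: &[u8]) -> std::io::Result<Vec<u8>> {
    let dict = zstd::dict::from_samples(samples, 110 * 1024)?; // ~110 KB dictionary
    let mut enc = zstd::stream::Encoder::with_dictionary(Vec::new(), 19, &dict)?;
    std::io::copy(&mut &input[..], &mut enc)?;
    enc.finish() // the compressed bytes
}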

Mario Carneiro (Jun 26 2023 at 12:07):

the only things I can recognize in the dictionary are the names of environment extensions like protectedExt, customEliminatorExt etc. Makes sense that these would show up in all the olean files

Sebastian Ullrich (Jun 26 2023 at 12:50):

Mario Carneiro said:

you could do it by creating a temp file and mmaping that

Then who will remove these files? If we need to come up with any kind of cache eviction policies, these should be in cache, not core.

Sebastian Ullrich (Jun 26 2023 at 12:56):

How about this: cache keeps a global directory of unpacked .olean files bounded by some sensible-sounding size like say 10GB and hardlinks these files into project directories as desired. When the size limit is reached, we start deleting files with link count 1. Which might even work on Windows.
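
In Rust terms the deletion pass could look like this (a Unix-only sketch, ignoring the size bound and the order of deletions):

use std::os::unix::fs::MetadataExt;

// Delete cached files no project directory hardlinks to anymore:
// a link count of 1 means only the cache's own name remains.
fn evict_unlinked(cache_dir: &std::path::Path) -> std::io::Result<()> {
    for entry in std::fs::read_dir(cache_dir)? {
        let path = entry?.path();
        if path.metadata()?.nlink() == 1 {
            std::fs::remove_file(&path)?;
        }
    }
    Ok(())
}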

Mario Carneiro (Jun 26 2023 at 13:11):

I think it would be fine even if lean itself never cleans up those files and something else is doing garbage collection

Mario Carneiro (Jun 26 2023 at 13:11):

the main issue right now is that you can't really remove oleans without causing massive rebuilds

Mario Carneiro (Jun 26 2023 at 13:12):

if they are just build outputs that can be easily regenerated on demand and cleaned up as required, the equation changes a bit

Mario Carneiro (Jun 26 2023 at 13:14):

Sebastian Ullrich said:

How about this: cache keeps a global directory of unpacked .olean files bounded by some sensible-sounding size like say 10GB and hardlinks these files into project directories as desired. When the size limit is reached, we start deleting files with link count 1. Which might even work on Windows.

How would this work if there are multiple oleans with the same path (e.g. different versions of the same file on different branches of mathlib)?

Mario Carneiro (Jun 26 2023 at 13:16):

But I take your point that if the temp file generation has to happen anyway, it doesn't have to be lean doing the unpacking, and lake is probably a better place for it to live

Mario Carneiro (Jun 26 2023 at 13:18):

If lake was driving the process then you could pretty well ensure that these temp files really are suitably temporary

Moritz Firsching (Jun 26 2023 at 13:46):

Mario Carneiro said:

I was thinking about that; dictionary approaches may well help, but we probably want a custom dictionary, and I'm not sure if any of these compression algs support that

I thought brotli supports custom dictionaries: https://github.com/google/brotli/blob/master/research/dictionary_generator.cc

Sebastian Ullrich (Jun 26 2023 at 13:52):

Mario Carneiro said:

Sebastian Ullrich said:

How about this: cache keeps a global directory of unpacked .olean files bounded by some sensible-sounding size like say 10GB and hardlinks these files into project directories as desired. When the size limit is reached, we start deleting files with link count 1. Which might even work on Windows.

How would this work if there are multiple oleans with the same path (e.g. different versions of the same file on different branches of mathlib)?

Same way as with the current tar cache, use the hash as a name. The renaming can be done as part of the hardlinking.

Mario Carneiro (Jun 26 2023 at 15:09):

Okay, it's just about ready for use in lake exe cache now. I added a leantar utility that has a similar interface to what cache is expecting: a command to pack olean + ilean + ... into a .ltar file, compressing stuff along the way; and a single command that takes a list of .ltar files and unpacks everything which is not up to date. Here are the test results for an end-to-end test (local only, no network stuff):

129.3 s compress
3196 files, 167.852 MB
90ms to do nothing (everything up to date)
3.260 s to unpack everything

Compared to 15.969 s for running lake exe cache, it is a 4.9x improvement when actually unpacking, and a 177x improvement when there is nothing to do.

Mario Carneiro (Jul 04 2023 at 09:32):

The leantar-enabled cache is now up for review: #5710. The https://github.com/digama0/leangz repo is now cutting releases for all arches supported by lean4, and lake exe cache will automatically download leantar and use it, same as it currently does for curl.

Mario Carneiro (Jul 04 2023 at 09:34):

If this looks good we should probably transfer leangz over to the leanprover-community org

Sebastian Ullrich (Jul 04 2023 at 10:23):

This somewhat raises the question of whether it wouldn't be easier to write all of Cache in Rust, in which case we could use libcurl and link it into a single binary. I'm not sure if any plans of integrating it into Lake are affected by that; the Rust code could still export a sensible FFI

Mario Carneiro (Jul 04 2023 at 10:24):

yeah, I have been seriously considering that. We might want to think about it in conjunction with the rewrite for moving it to lake

Mario Carneiro (Jul 04 2023 at 10:25):

This initial release is mainly focused on preserving the existing interface, modulo the differing file extensions and such

Sebastian Ullrich (Jul 04 2023 at 10:26):

Rust would also help with doing "advanced" FS operations like the hardlinking approach mentioned above

Scott Morrison (Jul 04 2023 at 10:27):

I just got, after a partial build, lake exe cache pack then lake exe cache get:

Decompressing 401 file(s)
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/tar.rs:87:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
uncaught exception: leantar failed with error code 101

Mario Carneiro (Jul 04 2023 at 10:29):

that particular error message indicates that it tried to open an .ltar file which is 0 bytes

Scott Morrison (Jul 04 2023 at 10:29):

I notice that packing the cache is substantially slower than it was previously. This isn't a huge problem, as mostly this is work done by CI, but it's perhaps worth flagging. Is there some trivial parallelism available? It seems to be doing this single-threaded.

Mario Carneiro (Jul 04 2023 at 10:29):

yes, zstd is slower to pack

Mario Carneiro (Jul 04 2023 at 10:30):

at least on the relatively high compression setting in use

Mario Carneiro (Jul 04 2023 at 10:31):

When packing, the intention was for lean to handle the multithreading and launch many leantar processes, one per file

Mario Carneiro (Jul 04 2023 at 10:32):

(I actually haven't tested pack at all, I wasn't really sure whether it was possible to test without sending stuff to azure)

Mario Carneiro (Jul 04 2023 at 10:33):

oh I see, the pack is indeed serialized

Scott Morrison (Jul 04 2023 at 10:34):

This is probably worth fixing. A normal build takes ~30s to pack and upload, and your build is still going at 10 minutes.

Mario Carneiro (Jul 04 2023 at 10:41):

I did; it's now 1.7 min running 12 threads in parallel

Scott Morrison (Jul 04 2023 at 10:41):

It finished uploading. I tried lake exe cache get, and it downloads, but then gives me 3500 copies of:

Scott Morrison (Jul 04 2023 at 10:41):

stack backtrace:
   0:        thread '0x<unnamed>10e09ad36' panicked at ' - called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }__mh_execute_header',
src/tar.rs: 103 : 541

Mario Carneiro (Jul 04 2023 at 10:42):

oh that looks a bit funny, maybe the streams are interleaving?

Mario Carneiro (Jul 04 2023 at 10:45):

oh I see, all the threads are panicking in parallel with a not found error

Sebastian Ullrich (Jul 04 2023 at 10:56):

It's a mass panic?

Mario Carneiro (Jul 04 2023 at 11:10):

What is the expected behavior when one of the cache files is empty, i.e. 0 bytes?

Mario Carneiro (Jul 04 2023 at 11:12):

Currently this just causes the extractor to report an error but continue and unpack other things (but still report an error for the overall execution), similar to the behavior you get with a bunch of tar calls

Mario Carneiro (Jul 04 2023 at 11:13):

but lake exe cache doesn't really introspect on such files; it just assumes they are fine and does not try to redownload them or clean them up or anything

Sebastian Ullrich (Jul 04 2023 at 11:21):

Where are these empty files coming from?

Mario Carneiro (Jul 04 2023 at 11:41):

I believe it happens when a writing process is terminated or curl dies

Mario Carneiro (Jul 04 2023 at 11:42):

or possibly if curl loses connection in the middle

Arthur Paulino (Jul 04 2023 at 12:41):

Instead of rewriting Cache in Rust, I would consider improving Lean 4 so it matches Rust's power more closely (where possible), even if it's a longer-term goal.

Mathlib is almost fully ported, meaning that Lean 4 is probably almost as good a theorem prover as it needs to be (modulo optimizations for compilation and UX improvements, which can go on forever).

Of course, I am out of context at this point so I don't know the long-term strategy. But my impression is that once the community considers the port fully finished, a lot of burden will be lifted and there will be plenty of room to think big in different ways

Patrick Massot (Jul 04 2023 at 12:56):

I understand your point but beware that mathematicians won't feel qualified to write that kind of code, so the end of the port won't free that much energy in this specific direction.

Arthur Paulino (Jul 04 2023 at 12:59):

Yeah, I mean in the sense that the core Lean 4 devs are focusing on providing support for the port. Once the port is finished, mathematicians will come back to expanding mathlib and the core devs will be able to worry less about mathlib

Arthur Paulino (Jul 04 2023 at 13:01):

But again, I'm out of the loop. I'm sure the core devs have a good plan. I'm speaking as someone (a programmer) who would be happy to see Lean 4 catch up with Rust in some of the aspects mentioned above.

Sebastian Ullrich (Jul 04 2023 at 13:15):

Cache is so small that I don't think it would even be contradictory to have the short-term goal of rewriting it in Rust and the longer-term goal of rewriting it in Lean

Sebastian Ullrich (Jul 04 2023 at 13:20):

Having said that, I have been thinking about what the smallest possible change to core would need to consist of to unblock interfacing with external libraries such as libcurl

Mario Carneiro (Jul 04 2023 at 17:22):

The PR now seems to be working: the last build completed successfully, and running lake exe cache get and lake exe cache get! in various combinations seems to unpack correctly. Unfortunately, since mathlib is still on leanprover/lean4:nightly-2023-06-20, it doesn't include the upstream changes to the olean file format needed to keep parity between leangz and lean, which means that lean will still recompile mathlib after unpacking the cache. This PR will need to depend on the lean4 bump PR #5409.

Mario Carneiro (Jul 04 2023 at 17:39):

It's working well enough to take some timing measurements. A lake exe cache get when there is nothing to do takes 4.6 seconds, of which 60ms is spent in the decompressor (this part was previously ~15 seconds); essentially all of the rest is spent computing hashes and starting up lean.

To me this is just not acceptable performance, and to your point @Arthur Paulino, while I understand why we should make every attempt to make lean as good a language as it can be, I am a staunch believer in using the best tool for the job, and currently lean simply isn't delivering. Maybe it will in the future and we can rewrite this in lean, but in the meantime this is just wasting time and energy on users' machines. TBH I knew that this would come up when I first decided to write the decompressor in something that isn't lean; people have this sense that we should be doing everything in lean because it's a programming language, but it's not the best programming language for very predictable reasons (it's really hard to compete with the big leagues), and making the lean experience better by not using lean sounds like a good tradeoff to me.

Mauricio Collares (Jul 04 2023 at 17:52):

Slightly off-topic: Since the cache file name is changing anyway, can it be changed to use a fixed number of hex digits? Base 10 for hashes just looks weird

Mario Carneiro (Jul 04 2023 at 17:57):

those are separate issues

Mario Carneiro (Jul 04 2023 at 17:57):

base 10 for hashes is determined by lake, cache can't do anything about it

Mario Carneiro (Jul 04 2023 at 17:58):

oh nvm the file names are also base 10

Mario Carneiro (Jul 04 2023 at 18:05):

I don't think there is even a function to get hex digits of fixed length

Mac Malone (Jul 04 2023 at 19:18):

Sebastian Ullrich said:

Having said that, I have been thinking about what the smallest possible change to core would need to consist of to unblock interfacing with external libraries such as libcurl

While this is certainly a longer-term goal, interfacing with libcurl does seem like it could eventually provide many benefits for a number of projects that wish to interact with web APIs.

Mac Malone (Jul 04 2023 at 19:21):

Mario Carneiro said:

base 10 for hashes is determined by lake, cache can't do anything about it

And Lake's reason for using base 10 is just that it is what Lean's toString/toNat? does. Also, at least in Lake, there isn't a strong motivation to do anything else, because there is not likely to be a significant performance difference between different bases.

Mario Carneiro (Jul 04 2023 at 19:26):

No, but I agree it looks a little peculiar. Incidentally, leangz does have to read trace files so it also has to implement those semantics

Mario Carneiro (Jul 04 2023 at 19:27):

in any case, I updated cache to write the numbers in fixed-width hex

Gabriel Ebner (Jul 04 2023 at 20:39):

Mario Carneiro said:

base 10 for hashes is determined by lake, cache can't do anything about it

The cache filenames are purely our own making, they have zero relation to Lake and are not computed using Lake code either. We could switch to base32 or whatever, but aesthetics aside I don't see any reason to switch.

Kevin Buzzard (Jul 04 2023 at 20:43):

If people are unhappy with base 10 then I could recommend base 37. It's 0-9, then a-z, then . I think it's ideally suited to Lean.

Mario Carneiro (Jul 04 2023 at 20:50):

@Gabriel Ebner I wrote a function to do fixed-width hex in cache, which really should be in Lean. Sadly we don't have fancy format strings in lean, so hex formatting is painful right now and the path of least resistance uses decimal
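
(For contrast, in Rust this is a format-string one-liner:)

// Fixed-width, zero-padded hex: e.g. 48879 -> "000000000000beef".
fn hex16(h: u64) -> String {
    format!("{:016x}", h)
}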


Last updated: Dec 20 2023 at 11:08 UTC