Zulip Chat Archive

Stream: mathlib4

Topic: Download oleans that are bitwise identical?


Joachim Breitner (Nov 23 2023 at 22:26):

Consider this:

I had a fresh cache at master^, then updated to mathlib master, where one definition was added to Calculus.ContDiff.

Then cache get downloaded 162 files, but only 22 of them are actually different afterwards.

This makes me wonder:

  • Does cache get really have to download the 140 .oleans that are bitwise identical? Wouldn’t it save a s-load of cloud storage, download bandwidth and time, and local storage if .oleans were downloaded content-addressed? That is, cache get would first fetch a mapping from trace(?) to olean content hash, and only download an olean if that content isn't present locally already (see the sketch at the end of this message).

  • Why did these files change:

Mathlib/Analysis/Calculus/BumpFunction/FiniteDimension.olean
Mathlib/Analysis/Calculus/ContDiff/Basic.olean
Mathlib/Analysis/Calculus/ContDiff/Bounds.olean
Mathlib/Analysis/Calculus/ContDiff/FiniteDimension.olean
Mathlib/Analysis/Calculus/FDeriv/Analytic.olean
Mathlib/Analysis/Calculus/IteratedDeriv.olean
Mathlib/Analysis/Calculus/Series.olean
Mathlib/Analysis/Distribution/SchwartzSpace.olean
Mathlib/Analysis/InnerProductSpace/Calculus.olean
Mathlib/Analysis/SpecialFunctions/SmoothTransition.olean

I would assume these oleans still have the exact same semantic content. Maybe some global counter shows up somewhere that could be avoided or normalized before serialization? (But even if they are not always bitwise identical when they could be, it still seems there is something to be saved.)
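
Roughly what I have in mind, as a minimal sketch (the store layout, getContentHash and downloadOlean are made up for illustration; this is not how cache works today):

```lean
open System (FilePath)

/-- Hypothetical content-addressed store: one file per olean, named by its content hash. -/
def storePath (contentHash : UInt64) : FilePath :=
  System.mkFilePath [".cache", "mathlib", "cas", s!"{contentHash}.olean"]

/-- Hypothetical: ask the server which olean content a given trace hash maps to.
    In reality this would be a small HTTP request; stubbed out here. -/
def getContentHash (traceHash : UInt64) : IO UInt64 :=
  pure (hash traceHash)  -- stand-in for the server's answer

/-- Hypothetical: fetch the olean with the given content hash into the local store. -/
def downloadOlean (contentHash : UInt64) : IO Unit :=
  IO.println s!"would download {contentHash} to {storePath contentHash}"

/-- The proposed `cache get` step: resolve trace hash to content hash (cheap),
    then only hit the network for contents missing from the local store. -/
def fetchIfMissing (traceHash : UInt64) : IO FilePath := do
  let contentHash ← getContentHash traceHash
  let path := storePath contentHash
  unless (← path.pathExists) do
    downloadOlean contentHash
  pure path
```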

Mauricio Collares (Nov 24 2023 at 07:17):

Does Lake have an option to compute all traces without building anything? If so, I agree content-addressing would be a good improvement. As I understand it, cache currently hashes the inputs (Nix-style) instead of being content-addressed.

Joachim Breitner (Nov 24 2023 at 07:22):

I think it's mostly orthogonal. You'd still query the cache as now, but instead of fetching the olean directly, you ask for the olean's hash (a small response), so the trace-related part stays mostly the same. Then a new second step checks whether that hash is already present locally, and only downloads the real file if it isn't.

Every problem can be solved by an extra layer of indirection, except the problem of too many indirections.

Joachim Breitner (Nov 24 2023 at 09:03):

For my second question, it’s (at least) hygienic names. Using oleandump to compare the changed Mathlib/Analysis/SpecialFunctions/SmoothTransition.olean:

 let x927492f2{176} : Lean.Expr := Lean.Expr.app #[xd8a3bb26{144}, x23b53e6f{392}]
 let xaf761576{208} : Lean.Expr := Lean.Expr.app #[x927492f2{176}, xc62553d9{5768}]
 let x6a66db6b{24} : Lean.Expr := Lean.Expr.sort #[x870a583e{24}]
-let x1c66e28e{368} : Lean.Name := Lean.Name.num #[`inst._@.Mathlib.Analysis.Calculus.ContDiff.Defs._hyg, 26695]
+let x75a663ef{368} : Lean.Name := Lean.Name.num #[`inst._@.Mathlib.Analysis.Calculus.ContDiff.Defs._hyg, 26824]
 let x91159395{32} : Lean.Expr := Lean.Expr.const #[`NontriviallyNormedField, x75ba0d01{24}]
 let xf266cb60{64} : Lean.Expr := Lean.Expr.app #[x91159395{32}, xf87d595c{24}]
-let xe6bbd2a8{32} : Lean.Name := Lean.Name.num #[`inst._@.Mathlib.Analysis.Calculus.ContDiff.Defs._hyg, 26699]
+let xacaad2af{32} : Lean.Name := Lean.Name.num #[`inst._@.Mathlib.Analysis.Calculus.ContDiff.Defs._hyg, 26828]
 let x7914a10d{32} : Lean.Expr := Lean.Expr.const #[`NormedAddCommGroup, x75ba0d01{24}]
 let xa540fd54{64} : Lean.Expr := Lean.Expr.app #[x7914a10d{32}, xf87d595c{24}]

I notice that some hygienic names in Mathlib.Analysis.Calculus.ContDiff.Defs have changed. Not unexpected, a bit unfortunate, and probably not easy to work around. Although if cache deduplicated oleans, the motivation to solve this differently would increase.

Mario Carneiro (Nov 24 2023 at 17:32):

Joachim Breitner said:

Does cache get really have to download the 140 .oleans that are bitwise identical? Wouldn’t it save a s-load of storage in the cloud, download bandwidth and time, and local storage if .oleans were downloaded content-addressed, and cache get would first get a mapping from trace(?) to olean content hash, and only download if it doesn't have it already?

They are already doing essentially what you suggest. So this indicates rather that something in the hash changed: not necessarily the file itself, but one of its dependencies. We can't calculate the content hash for files with missing dependencies because we don't have the olean file yet. But we could in principle download only the things directly needed, rerun the hash computation, and see if that knocks out some additional requests, rather than computing the whole to-download list upfront and sending it all to curl, at the cost of some lost parallelism.
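
In case it helps to see the shape of that: a sketch of the staged variant (all three helpers are hypothetical; today cache get effectively does only the final fallback, one upfront list handed to curl):

```lean
open System (FilePath)

/-- Sketch (not the real `cache` code) of the staged strategy: fetch in rounds and
    recompute the stale set after each round, since freshly downloaded files can
    reveal that some of the remaining requests are unnecessary. -/
partial def fetchInRounds
    (stillStale : IO (List FilePath))             -- recompute what still looks stale
    (fetchable : List FilePath → List FilePath)   -- the subset whose dependencies are already present
    (downloadAll : List FilePath → IO Unit) :     -- one parallel curl batch
    IO Unit := do
  let stale ← stillStale
  if stale.isEmpty then return
  let batch := fetchable stale
  if batch.isEmpty then
    -- nothing more can be decided locally; fetch everything that is left,
    -- which is what `cache get` does upfront today
    downloadAll stale
    return
  downloadAll batch
  fetchInRounds stillStale fetchable downloadAll
```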

Mario Carneiro (Nov 24 2023 at 17:33):

We can't look things up by the hash of the olean file, because we don't have the olean file until we have requested it; that would be a circular dependency. So we hash the lean file instead, and we have to take all dependencies into account there. So it is always possible that we look up an olean after a change to the lean files which could potentially invalidate it, but turns out not to, and we end up downloading an olean bitwise identical to the one we already have. There is no way to predict that the server will send the same file until we actually get it.
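
The Nix-style input hash Mauricio mentioned above then looks roughly like this (a sketch of the idea only, not the actual code in Mathlib's cache tool, which also mixes in more inputs):

```lean
open System (FilePath)

/-- Minimal sketch of input ("trace") hashing: the key for a module is derived from its
    own source plus, recursively, the keys of everything it imports.
    `imports` is a hypothetical stand-in for the real import graph. -/
partial def inputHash (imports : FilePath → List FilePath) (leanFile : FilePath) :
    IO UInt64 := do
  let src ← IO.FS.readFile leanFile
  let depHashes ← (imports leanFile).mapM (inputHash imports)
  pure <| depHashes.foldl mixHash (hash src)
```

So any edit upstream changes the key of every module downstream of it, even when the olean stored on the server under the new key is bitwise identical to the old one, which is exactly the 162-vs-22 situation at the top of this thread.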

Joachim Breitner (Nov 24 2023 at 17:41):

Sorry, I wasn't clear. All of that happens as now. But where you currently download an olean from the server, you instead download the hash of the olean (which the server of course needs to provide). Then you check whether you already have it locally, and only if not, download the actual olean.

Mario Carneiro (Nov 24 2023 at 17:41):

We could download the olean hash first if we happen to have an olean lying around and suspect it is already up to date, but this is again a data dependency which reduces parallelism. And asking the server for an 8-byte hash file, comparing the received olean hash with the one we have, and only requesting the (5-100 KB) .ltar file if it differs would be a significant overhead.

Mario Carneiro (Nov 24 2023 at 17:44):

Forming a new server request for a file is about half of the cost; the other half is sending the data. So I would expect this strategy not to be worth it unless the oleans are very large. We could try to be clever about this, since we know how big the expected file is, but it would need some testing to find the right cutoff point. I would expect just doing this for every file to be a net negative.
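
To put a rough number on that cutoff (a toy model only, built on the half-request/half-data estimate above and ignoring parallelism and pipelining): let p be the fraction of requested files whose local copy is in fact still current, and measure cost in units of one request:

```lean
/-- Toy cost model in "one request ≈ transferring one typical .ltar" units
    (that equivalence is the estimate above, not a measurement).
    `p` is the fraction of requested files whose local copy is still current. -/
def costs (p : Float) : Float × Float :=
  ( 1 + (1 - p) * 2  -- hash-first: always one small request, plus a full fetch on a mismatch
  , 2 )              -- direct: one request plus the data, every time

#eval costs (140.0 / 162.0)  -- ≈ (1.27, 2.0) for the example at the top of this thread
```

Under this crude model hash-first only wins when p > 1/2, so it would have paid off for that particular commit, but whenever most of the requested files really did change it is a net loss, consistent with the expectation above.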

Mario Carneiro (Nov 24 2023 at 17:48):

We are also constrained by the fact that the cache server is a plain file server with no "smarts", and it is unclear whether Reservoir would have the same restriction. For a plain file server this would necessitate doubling the number of files we store, and the .ltar files we store are already "uncomfortably small and numerous", so this may cause issues on the server too.

Mario Carneiro (Nov 24 2023 at 17:48):

Basically, .ltar files are too small for this to be worth it.

Joachim Breitner (Nov 24 2023 at 18:09):

Hmm, I see. Thanks for entertaining the idea!

Ruben Van de Velde (Nov 24 2023 at 18:34):

Another idea could be that the client sends the hash of the file it has locally, and if the server sees that the hash matches the file it would send, it sends a "no change" response rather than the entire olean. Not sure whether that's possible or worthwhile given the current server architecture, but it could reduce the amount of data sent over the network without increasing round trips.

Arthur Paulino (Nov 24 2023 at 18:41):

Note that "the server" is just Azure's cloud API. We push and request stuff without any specific logic on the other side besides what Azure offers

Joachim Breitner (Nov 24 2023 at 18:44):

Sounds like HTTP's ETag could be useful there. In that variant you need to make an educated guess about which local file to compare against, rather than getting a hash and then looking it up in the local store. But it does indeed save a round trip.
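
For what it's worth, curl (which cache already shells out to) can do that dance natively on reasonably recent versions; a sketch, with the URL and file layout made up:

```lean
open System (FilePath)

/-- Sketch of an ETag-based conditional download via curl (paths and usage hypothetical;
    needs a curl recent enough to support --etag-compare/--etag-save).
    curl sends `If-None-Match` from `etagFile`; on `304 Not Modified` the server
    sends no body and we keep the local copy. -/
def fetchWithEtag (url : String) (dest etagFile : FilePath) : IO Unit := do
  let tmp := dest.withExtension "part"
  let out ← IO.Process.output {
    cmd := "curl"
    args := #["--silent", "--show-error", "--write-out", "%{http_code}",
              "--etag-compare", etagFile.toString, "--etag-save", etagFile.toString,
              "--output", tmp.toString, url]
  }
  unless out.exitCode == 0 do
    throw <| IO.userError s!"curl failed: {out.stderr}"
  if out.stdout.trim == "200" then
    IO.FS.rename tmp dest          -- fresh content: replace the local copy
  else if ← tmp.pathExists then
    IO.FS.removeFile tmp           -- 304: nothing new, drop the temp file
```

As far as I know Azure Blob Storage already returns ETags and honours If-None-Match, so no extra server-side smarts would be needed; the remaining catch is the per-file conditional request overhead discussed above.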

Eric Wieser (Nov 24 2023 at 19:12):

Mauricio Collares said:

Does Lake have an option to compute all traces without building anything?

This would also be useful for doing a nix build, but then opening the folder in VS Code and having Lake know that everything is already built.


Last updated: Dec 20 2023 at 11:08 UTC