Zulip Chat Archive
Stream: general
Topic: Potential data loss from `lake clean` with 4.2.0-rc2/3
Sebastian Ullrich (Oct 21 2023 at 20:20):
All, please see the updated release notes of 4.2.0-rc4 if you have previously used rc2 or rc3, or go back to a project/branch still using them. For more information also see the thread #lean4 > lake build'ing mathlib HEAD.
Kalle Kytölä (Oct 22 2023 at 20:39):
I think I hit this now while already on 4.2.0-rc4. I (really believe I) had deleted the lakefile.olean
as instructed before lake update
ing to 4.2.0-rc4. [UPDATE: Confirmed on command history, but see below for why this was not a solution.]
I had some issues in VS Code suddenly --- the same config worked for some time before. I then ran lake clean
, and it erased the entire project directory.
Eric Wieser (Oct 22 2023 at 20:41):
I think you maybe have to delete it after running lake update
Eric Wieser (Oct 22 2023 at 20:41):
Otherwise lake update
will recreate it (pre-update)
Kalle Kytölä (Oct 22 2023 at 20:41):
Oh, I had misunderstood the instructions...
Eric Wieser (Oct 22 2023 at 20:43):
I think that's on the instructions, not on you:
Fixes a bug that could lead to data loss from lake clean when switching between this version and previous stable versions. Users upgrading a project from 4.2.0-rc2 or 4.2.0-rc3 should manually remove their lakefile.olean beforehand.
Eric Wieser (Oct 22 2023 at 20:44):
The obvious reading is that "beforehand" modifies "upgrading a project", but I think it actually is intended to modify "lake clean
"
Kalle Kytölä (Oct 22 2023 at 20:44):
I just checked the same and was about to request a clarification there.
(That said, I don't blame those instructions, because I just sloppily followed the problem on Zulip instead of carefully reading the instructions.)
Kalle Kytölä (Oct 22 2023 at 20:46):
I should have everything (except from a very small amount of recent work) on an external repo, so I should be fine.
But this can be really bad for Lean users. There is absolutely nothing left of the directory.
Julian Berman (Oct 22 2023 at 20:58):
Not that I assume you saw it, and Sebastian I'm sure will update the official instructions, but I changed the TLDR at the top of https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/lake.20build'ing.20mathlib.20HEAD/near/397850852 to hopefully make that even clearer there.
Sebastian Ullrich (Oct 22 2023 at 21:14):
I'm confused, everyone is talking about lake update
upgrading Lean but it doesn't change lean-toolchain
, you still have to do that yourself. Whether you remove lakefile.olean
right before or after touching that file doesn't really matter. I've clarified the instructions nevertheless.
Yaël Dillies (Oct 22 2023 at 21:20):
Now it's my turn to be confused. I'm pretty sure lake update
updated lean-toolchain
last time I used it (= this morning).
Kevin Buzzard (Oct 22 2023 at 21:21):
I hope you can also see that having a confusing system for upgrading leads to confusion.
(I should probably add that knowing this community well I fully expect these issues to be ironed out in time!)
Sebastian Ullrich (Oct 22 2023 at 21:22):
Then I may be mistaken, did I miss some feature automating this?
Yaël Dillies (Oct 22 2023 at 21:28):
People have been talking about it, so I wouldn't be surprised. Nothing else comes to mind though.
Julian Berman (Oct 22 2023 at 21:31):
I honestly cannot remember what it does, but it seems not to here from just testing. The language I edited in myself was probably imprecise, will edit to match yours.
Scott Morrison (Oct 22 2023 at 21:42):
No, I'm pretty sure Yaël must be mistaken here --- I don't know of anything tooling that modifies a lean-toolchain
file (except for the CI automation that manages the nightly-testing
and lean-pr-testing-NNNN
branches).
Yaël Dillies (Oct 22 2023 at 21:43):
Yeah sorry I can't check now
Sebastian Ullrich (Oct 23 2023 at 06:55):
(we want lake update
to appropriately update Lean of course and we will get there, I was just checking whether my instructions for the current Lean were correct!)
Utensil Song (Oct 23 2023 at 08:52):
Maybe add something like: Back up your whole Lean project elsewhere before running any lake
command, particularly lake clean
.
Shreyas Srinivas (Oct 23 2023 at 11:14):
does this lake clean
bug also end up deleting the git history and the .vscode folder?
Shreyas Srinivas (Oct 23 2023 at 11:16):
Also, what toolchain should I update to if I haven't been bitten by this bug yet. Currently I am using 4.2 rc1 and 4.1.
Mario Carneiro (Oct 23 2023 at 11:17):
4.2.0-rc4 is the fixed version, if you just skip over 4.2.0-rc{2,3} you will not have any problems
Eric Wieser (Oct 23 2023 at 11:17):
Yes, it deletes the entire working directory, which includes .git
and .vscode
; the bug is that what lake things is the build directory is actually the source directory
Mario Carneiro (Oct 23 2023 at 11:19):
alternatively you can just not use lake clean
, I don't think I ever use this command and do rm -rf build
instead
Shreyas Srinivas (Oct 23 2023 at 11:21):
Is there/can there be an option for lake update
that only updates to stable releases like 4.1 or 4.2 instead of intermediate rc
releases?
Eric Wieser (Oct 23 2023 at 11:30):
As already discussed upthread, lake update
never updates the release at all
Eric Wieser (Oct 23 2023 at 11:30):
So wish somewhat granted, lake update
will never move you to a rc
release
Jireh Loreaux (Oct 23 2023 at 12:35):
Mario, the stupid part about that is that lake clean
should be the safer operation, because if I mistype with rm -rf
there's no telling what I might delete.
Mario Carneiro (Oct 23 2023 at 16:37):
no, my point is exactly that lake clean
is another way to spell rm -rf <something not immediately visible>
and this has consequences
Jireh Loreaux (Oct 23 2023 at 18:42):
I'm sorry, that sucks. If the directory to rm
were just stored in some json file that was tracked in the repo this would never happen. And if lake clean
meant rm -rf <directory read from json file>
, then there would be no problem and it would be safer than rm -rf
manually. The only reason this issue came up is because lake
was / is relying on a cache file lakefile.olean
(for important performance reasons I expect) which is not stored in the git repo. So in my mind, this again boils down to the complexity of lake
.
In years of using make
files for my LaTeX projects, I have exactly 0 times deleted something unwanted by running my make clean
operation, but I have definitely rm -rf
'ed a number of things on accident (in fact, as a safety precaution, I generally type asdf rm -rf stuff
and only after checking stuff
is correct do I delete asdf
).
For newbie / math users, it's even worse because they may not even understand rm -rf
, or maybe (in the future) they will only interact with lake clean
by using the VS Code extension; and the potential for deleting the user's entire repo should be virtually non-existent in that case (requiring lake clean
to never do anything bad, hence safer than rm -rf stuff
).
Julian Berman (Oct 23 2023 at 18:49):
I'll add that my personal intention is likely to use git clean -xi
going forward, which both supports a --dry-run / -n
option as well as -i
for interactively deleting only after showing me what I'm deleting. FWIW, rm
also has such a flag, it's rm -i
or rm -I
to only ask for confirmation if it's going to remove a lot of stuff
Julian Berman (Oct 23 2023 at 18:51):
I hesitate to say anything because of how little I know, and the fact that it seems this issue itself isn't fully "clear and out of the water yet", but to me the really worrying thing is how this seems to have happened -- namely that .olean files seem to contain serialized arbitrary objects, and that when deserializing them, there's some sort of positional layout applied. That's very worrying to me naively, and reminiscent of the situation in Python where -- I claim good Python programmers know "never use Pickle", which has some of these issues. I could be way off here on the "root" issue, it's just understanding I got from reading the other thread, but to me that would be what I would be wary of. ISTM Lake was just the first unlucky thing to hit that. Again though, just raw thoughts from seeing the other Zulip thread.
Julian Berman (Oct 23 2023 at 18:53):
It would certainly seem to be "safer" to have a "downward facing FilePath object" -- i.e. one where it cannot traverse further upward than its root path (i.e. the kind of path that in other places mean you cannot do Path("/foo/bar").child(someUntrustedString).remove()
and be vulnerable to path traversal attacks) -- but so long as it's possible to have the kind of deserialization problem that I gathered (hopefully correctly), it'd still be worrying -- because you could get some other random object in the spot you expected, making anything fair game again.
Shreyas Srinivas (Oct 23 2023 at 19:04):
Mario Carneiro said:
no, my point is exactly that
lake clean
is another way to spellrm -rf <something not immediately visible>
and this has consequences
User interfaces on top of existing tools provide safety for non-expert and non-malicious users by hiding complexity and footguns. By that standard you are suggesting, how does one trust rm
itself rather than writing a program with all the syscalls it uses?
Mario Carneiro (Oct 23 2023 at 20:22):
The reason I trust rm
and make clean
more than lake clean
at a technical and social level is because they are old and well tested pieces of software used by millions of people. I have never even heard a report of these tools deleting unwanted data except by user error (i.e. calling rm
with the wrong argument or in the wrong directory), let alone seen it first hand. I'm sure this used to happen more, but that was 20+ years ago. lake
is a young tool under active development, built on a young language under active development. Building stable software takes time and effort, and that doesn't come easily. In fact, if I was to write a program with direct syscalls I would still trust it less than rm
because I would have just created another young and buggy program.
Mario Carneiro (Oct 23 2023 at 20:28):
Obviously, if lake clean
has no bugs and does exactly what I am hoping it does it is superior to rm -rf build
. But the complexity of lake ties directly to both lowering the probability that it has no bugs, as well as the probability that it does what I want it to. rm -rf build
has a very simple to understand spec, at least for someone like me who knows how CLIs work. I totally agree that for people who don't know their way around a terminal it is better to use lake clean
via vscode commands (assuming it doesn't have critical bugs in it).
Jireh Loreaux (Oct 23 2023 at 20:30):
This I totally understand. I guess from your previous message I had interpreted "you should not eventually trust lake clean
more than your fat fingers." The fact that it has bugs early on is unfortunate, but understandable.
Damiano Testa (Oct 23 2023 at 20:32):
... and more than a little unlucky, since the bug could have been something less dramatic than wiping out the current mathlib directory!
Mario Carneiro (Oct 23 2023 at 20:33):
I think we have had less dramatic ABI breakage issues many times already
Mario Carneiro (Oct 23 2023 at 20:34):
oleans make for a bad interchange format on many levels, but the siren song of mmap
is hard to compete with
Mario Carneiro (Oct 23 2023 at 20:36):
Let's take a file from disk and - without even parsing it - map it directly into memory and start dereferencing pointers in it. What could possibly go wrong?
Eric Wieser (Oct 23 2023 at 20:38):
Presumably this doesn't manifest anywhere else because we always read the source file and check it is in sync with the olean before loading the olean?
Eric Wieser (Oct 23 2023 at 20:38):
But lake skipped that step for performance reasons, and we're now paying the price?
Shreyas Srinivas (Oct 23 2023 at 20:38):
Eric Wieser said:
Presumably this doesn't manifest anywhere else because we always read the source file and check it is in sync with the olean before loading the olean?
How does that work? We compile each lean file every time?
Mario Carneiro (Oct 23 2023 at 20:39):
Lake uses a hash of the source file and the toolchain and lots of other stuff to decide whether to rebuild the olean
Mario Carneiro (Oct 23 2023 at 20:40):
Unfortunately it doesn't do this for lakefile.olean, it only uses modification times
Jireh Loreaux (Oct 23 2023 at 20:41):
Is this just for performance so it doesn't have to incur startup costs?
Mario Carneiro (Oct 23 2023 at 20:41):
yes, that's the main reason, but I think the lack of hashing is just because there isn't an obvious place to put it
Mario Carneiro (Oct 23 2023 at 20:42):
oleans should probably (definitely) store the toolchain name or an ABI version number or something
Shreyas Srinivas (Oct 23 2023 at 20:42):
This hash needs to be computed once for one lake file.olean when it is asked to run a command that probably checks many oleans? Why is it costly?
Mario Carneiro (Oct 23 2023 at 20:43):
right now they are a non-self-describing binary format which means you need to have some external knowledge to even know how to parse them
Mario Carneiro (Oct 23 2023 at 20:44):
the hash is not costly, it was just not done
Mario Carneiro (Oct 23 2023 at 20:44):
the hash of all the lean files in the project is somewhat costly but that's not what we're talking about
Mario Carneiro (Oct 23 2023 at 20:45):
There is no question that hashing the lakefiles and/or adding the toolchain version is an important thing to do and I know Mac is planning on working on it soon because we don't want to enshrine ABI stability of lake
Mario Carneiro (Oct 23 2023 at 20:46):
rc4 was just an emergency measure to prevent data loss while we fix the underlying issue
Mario Carneiro (Oct 23 2023 at 20:47):
Shreyas Srinivas said:
This hash needs to be computed once for one lake file.olean when it is asked to run a command that probably checks many oleans? Why is it costly?
The thing that is costly which motivated the lakefile.olean in the first place is elaboration of the lakefiles
Mario Carneiro (Oct 23 2023 at 20:48):
It's less costly than it used to be, but still a decent fraction of lake's overall runtime
Shreyas Srinivas (Oct 23 2023 at 20:49):
Is this hash stored in the olean when it is created from the respective lean file and cross checked by recomputing the hash on the lean file afterwards?
Shreyas Srinivas (Oct 23 2023 at 20:51):
Maybe I will understand this better if I read how this hash is built, stored, and checked. Where should I look in the source/docs?
Mario Carneiro (Oct 23 2023 at 20:57):
If you look in the build
directory you will notice there are .trace
files everywhere
Mario Carneiro (Oct 23 2023 at 20:59):
these are recursive hashes of all the inputs to each build output, and when you run lake build
it re-hashes everything, checks if the result matches the stored .trace
file and only if it agrees does it trust the corresponding build output file (.olean
, .c
, .o
etc)
Mario Carneiro (Oct 23 2023 at 21:00):
lakefile.olean
is special because it is done before lake
proper is set up (the workspace hasn't been built yet), so it has to go through a different code path and that one doesn't have the same hashing infrastructure
Shreyas Srinivas (Oct 23 2023 at 21:28):
Okay, now the pieces are falling in place in my head. Thanks :)
Shreyas Srinivas (Oct 23 2023 at 21:37):
Basically, before this bootstrapping of Lake is complete, the only way to be sure that lakefile.lean matches its olean is compiling it, which is expensive. So Lake has been faithfully reading lakefile.olean, which is essential to said bootstrapping.
Shreyas Srinivas (Oct 23 2023 at 21:41):
I am guessing a two stage boostrapping process is one possible solution?
Mario Carneiro (Oct 23 2023 at 21:47):
please don't propose that, I just mentioned that complexity is a major issue and that will not help
Shreyas Srinivas (Oct 23 2023 at 21:52):
Not at all. Not proposing anything. At this point, I am just fascinated by how the system design process is hashed out (pun intended) in practice. Fwiw, I suspect Lake is going to get more complex when dependency management with version constraints enters the picture.
Mac Malone (Oct 23 2023 at 23:54):
Julian Berman said:
I'll add that my personal intention is likely to use
git clean -xi
going forward, which both supports a--dry-run / -n
option as well as-i
for interactively deleting only after showing me what I'm deleting. FWIW,rm
also has such a flag, it'srm -i
orrm -I
to only ask for confirmation if it's going to remove a lot of stuff
Both of these sound like nice features to add to Lake! :grinning_face_with_smiling_eyes: Would you be willing to make an issue for them? One design question I can think of is whether Lake should traverse the directories it intends to delete and list all files or just list the top-level directory if it intends to delete the whole thing.
Julian Berman (Oct 24 2023 at 00:34):
One design question I can think of is whether Lake should traverse the directories it intends to delete and list all files or just list the top-level directory if it intends to delete the whole thing.
This is indeed a good question, considering git clean -xd
will not, but rm -i
will :) -- I think either is probably reasonable behavior personally. But yes sure, done as lean4#2740 and lean4#2741!
Last updated: Dec 20 2023 at 11:08 UTC