Zulip Chat Archive

Stream: general

Topic: Potential data loss from `lake clean` with 4.2.0-rc2/3


Sebastian Ullrich (Oct 21 2023 at 20:20):

All, please see the updated release notes of 4.2.0-rc4 if you have previously used rc2 or rc3, or go back to a project/branch still using them. For more information also see the thread #lean4 > lake build'ing mathlib HEAD.

Kalle Kytölä (Oct 22 2023 at 20:39):

I think I hit this now while already on 4.2.0-rc4. I (really believe I) had deleted the lakefile.olean as instructed before lake updateing to 4.2.0-rc4. [UPDATE: Confirmed on command history, but see below for why this was not a solution.]

I had some issues in VS Code suddenly --- the same config worked for some time before. I then ran lake clean, and it erased the entire project directory.

Eric Wieser (Oct 22 2023 at 20:41):

I think you maybe have to delete it after running lake update

Eric Wieser (Oct 22 2023 at 20:41):

Otherwise lake update will recreate it (pre-update)

Kalle Kytölä (Oct 22 2023 at 20:41):

Oh, I had misunderstood the instructions...

Eric Wieser (Oct 22 2023 at 20:43):

I think that's on the instructions, not on you:

Fixes a bug that could lead to data loss from lake clean when switching between this version and previous stable versions. Users upgrading a project from 4.2.0-rc2 or 4.2.0-rc3 should manually remove their lakefile.olean beforehand.

Eric Wieser (Oct 22 2023 at 20:44):

The obvious reading is that "beforehand" modifies "upgrading a project", but I think it actually is intended to modify "lake clean"

Kalle Kytölä (Oct 22 2023 at 20:44):

I just checked the same and was about to request a clarification there.

(That said, I don't blame those instructions, because I just sloppily followed the problem on Zulip instead of carefully reading the instructions.)

Kalle Kytölä (Oct 22 2023 at 20:46):

I should have everything (except from a very small amount of recent work) on an external repo, so I should be fine.

But this can be really bad for Lean users. There is absolutely nothing left of the directory.

Julian Berman (Oct 22 2023 at 20:58):

Not that I assume you saw it, and Sebastian I'm sure will update the official instructions, but I changed the TLDR at the top of https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/lake.20build'ing.20mathlib.20HEAD/near/397850852 to hopefully make that even clearer there.

Sebastian Ullrich (Oct 22 2023 at 21:14):

I'm confused, everyone is talking about lake update upgrading Lean but it doesn't change lean-toolchain, you still have to do that yourself. Whether you remove lakefile.olean right before or after touching that file doesn't really matter. I've clarified the instructions nevertheless.

Yaël Dillies (Oct 22 2023 at 21:20):

Now it's my turn to be confused. I'm pretty sure lake update updated lean-toolchain last time I used it (= this morning).

Kevin Buzzard (Oct 22 2023 at 21:21):

I hope you can also see that having a confusing system for upgrading leads to confusion.

(I should probably add that knowing this community well I fully expect these issues to be ironed out in time!)

Sebastian Ullrich (Oct 22 2023 at 21:22):

Then I may be mistaken, did I miss some feature automating this?

Yaël Dillies (Oct 22 2023 at 21:28):

People have been talking about it, so I wouldn't be surprised. Nothing else comes to mind though.

Julian Berman (Oct 22 2023 at 21:31):

I honestly cannot remember what it does, but it seems not to here from just testing. The language I edited in myself was probably imprecise, will edit to match yours.

Scott Morrison (Oct 22 2023 at 21:42):

No, I'm pretty sure Yaël must be mistaken here --- I don't know of anything tooling that modifies a lean-toolchain file (except for the CI automation that manages the nightly-testing and lean-pr-testing-NNNN branches).

Yaël Dillies (Oct 22 2023 at 21:43):

Yeah sorry I can't check now

Sebastian Ullrich (Oct 23 2023 at 06:55):

(we want lake update to appropriately update Lean of course and we will get there, I was just checking whether my instructions for the current Lean were correct!)

Utensil Song (Oct 23 2023 at 08:52):

Maybe add something like: Back up your whole Lean project elsewhere before running any lake command, particularly lake clean.

Shreyas Srinivas (Oct 23 2023 at 11:14):

does this lake clean bug also end up deleting the git history and the .vscode folder?

Shreyas Srinivas (Oct 23 2023 at 11:16):

Also, what toolchain should I update to if I haven't been bitten by this bug yet. Currently I am using 4.2 rc1 and 4.1.

Mario Carneiro (Oct 23 2023 at 11:17):

4.2.0-rc4 is the fixed version, if you just skip over 4.2.0-rc{2,3} you will not have any problems

Eric Wieser (Oct 23 2023 at 11:17):

Yes, it deletes the entire working directory, which includes .git and .vscode; the bug is that what lake things is the build directory is actually the source directory

Mario Carneiro (Oct 23 2023 at 11:19):

alternatively you can just not use lake clean, I don't think I ever use this command and do rm -rf build instead

Shreyas Srinivas (Oct 23 2023 at 11:21):

Is there/can there be an option for lake update that only updates to stable releases like 4.1 or 4.2 instead of intermediate rc releases?

Eric Wieser (Oct 23 2023 at 11:30):

As already discussed upthread, lake update never updates the release at all

Eric Wieser (Oct 23 2023 at 11:30):

So wish somewhat granted, lake update will never move you to a rc release

Jireh Loreaux (Oct 23 2023 at 12:35):

Mario, the stupid part about that is that lake clean should be the safer operation, because if I mistype with rm -rf there's no telling what I might delete.

Mario Carneiro (Oct 23 2023 at 16:37):

no, my point is exactly that lake clean is another way to spell rm -rf <something not immediately visible> and this has consequences

Jireh Loreaux (Oct 23 2023 at 18:42):

I'm sorry, that sucks. If the directory to rm were just stored in some json file that was tracked in the repo this would never happen. And if lake clean meant rm -rf <directory read from json file>, then there would be no problem and it would be safer than rm -rf manually. The only reason this issue came up is because lake was / is relying on a cache file lakefile.olean(for important performance reasons I expect) which is not stored in the git repo. So in my mind, this again boils down to the complexity of lake.

In years of using make files for my LaTeX projects, I have exactly 0 times deleted something unwanted by running my make clean operation, but I have definitely rm -rf'ed a number of things on accident (in fact, as a safety precaution, I generally type asdf rm -rf stuff and only after checking stuff is correct do I delete asdf).

For newbie / math users, it's even worse because they may not even understand rm -rf, or maybe (in the future) they will only interact with lake clean by using the VS Code extension; and the potential for deleting the user's entire repo should be virtually non-existent in that case (requiring lake clean to never do anything bad, hence safer than rm -rf stuff).

Julian Berman (Oct 23 2023 at 18:49):

I'll add that my personal intention is likely to use git clean -xi going forward, which both supports a --dry-run / -n option as well as -i for interactively deleting only after showing me what I'm deleting. FWIW, rm also has such a flag, it's rm -i or rm -I to only ask for confirmation if it's going to remove a lot of stuff

Julian Berman (Oct 23 2023 at 18:51):

I hesitate to say anything because of how little I know, and the fact that it seems this issue itself isn't fully "clear and out of the water yet", but to me the really worrying thing is how this seems to have happened -- namely that .olean files seem to contain serialized arbitrary objects, and that when deserializing them, there's some sort of positional layout applied. That's very worrying to me naively, and reminiscent of the situation in Python where -- I claim good Python programmers know "never use Pickle", which has some of these issues. I could be way off here on the "root" issue, it's just understanding I got from reading the other thread, but to me that would be what I would be wary of. ISTM Lake was just the first unlucky thing to hit that. Again though, just raw thoughts from seeing the other Zulip thread.

Julian Berman (Oct 23 2023 at 18:53):

It would certainly seem to be "safer" to have a "downward facing FilePath object" -- i.e. one where it cannot traverse further upward than its root path (i.e. the kind of path that in other places mean you cannot do Path("/foo/bar").child(someUntrustedString).remove() and be vulnerable to path traversal attacks) -- but so long as it's possible to have the kind of deserialization problem that I gathered (hopefully correctly), it'd still be worrying -- because you could get some other random object in the spot you expected, making anything fair game again.

Shreyas Srinivas (Oct 23 2023 at 19:04):

Mario Carneiro said:

no, my point is exactly that lake clean is another way to spell rm -rf <something not immediately visible> and this has consequences

User interfaces on top of existing tools provide safety for non-expert and non-malicious users by hiding complexity and footguns. By that standard you are suggesting, how does one trust rm itself rather than writing a program with all the syscalls it uses?

Mario Carneiro (Oct 23 2023 at 20:22):

The reason I trust rm and make clean more than lake clean at a technical and social level is because they are old and well tested pieces of software used by millions of people. I have never even heard a report of these tools deleting unwanted data except by user error (i.e. calling rm with the wrong argument or in the wrong directory), let alone seen it first hand. I'm sure this used to happen more, but that was 20+ years ago. lake is a young tool under active development, built on a young language under active development. Building stable software takes time and effort, and that doesn't come easily. In fact, if I was to write a program with direct syscalls I would still trust it less than rm because I would have just created another young and buggy program.

Mario Carneiro (Oct 23 2023 at 20:28):

Obviously, if lake clean has no bugs and does exactly what I am hoping it does it is superior to rm -rf build. But the complexity of lake ties directly to both lowering the probability that it has no bugs, as well as the probability that it does what I want it to. rm -rf build has a very simple to understand spec, at least for someone like me who knows how CLIs work. I totally agree that for people who don't know their way around a terminal it is better to use lake clean via vscode commands (assuming it doesn't have critical bugs in it).

Jireh Loreaux (Oct 23 2023 at 20:30):

This I totally understand. I guess from your previous message I had interpreted "you should not eventually trust lake clean more than your fat fingers." The fact that it has bugs early on is unfortunate, but understandable.

Damiano Testa (Oct 23 2023 at 20:32):

... and more than a little unlucky, since the bug could have been something less dramatic than wiping out the current mathlib directory!

Mario Carneiro (Oct 23 2023 at 20:33):

I think we have had less dramatic ABI breakage issues many times already

Mario Carneiro (Oct 23 2023 at 20:34):

oleans make for a bad interchange format on many levels, but the siren song of mmap is hard to compete with

Mario Carneiro (Oct 23 2023 at 20:36):

Let's take a file from disk and - without even parsing it - map it directly into memory and start dereferencing pointers in it. What could possibly go wrong?

Eric Wieser (Oct 23 2023 at 20:38):

Presumably this doesn't manifest anywhere else because we always read the source file and check it is in sync with the olean before loading the olean?

Eric Wieser (Oct 23 2023 at 20:38):

But lake skipped that step for performance reasons, and we're now paying the price?

Shreyas Srinivas (Oct 23 2023 at 20:38):

Eric Wieser said:

Presumably this doesn't manifest anywhere else because we always read the source file and check it is in sync with the olean before loading the olean?

How does that work? We compile each lean file every time?

Mario Carneiro (Oct 23 2023 at 20:39):

Lake uses a hash of the source file and the toolchain and lots of other stuff to decide whether to rebuild the olean

Mario Carneiro (Oct 23 2023 at 20:40):

Unfortunately it doesn't do this for lakefile.olean, it only uses modification times

Jireh Loreaux (Oct 23 2023 at 20:41):

Is this just for performance so it doesn't have to incur startup costs?

Mario Carneiro (Oct 23 2023 at 20:41):

yes, that's the main reason, but I think the lack of hashing is just because there isn't an obvious place to put it

Mario Carneiro (Oct 23 2023 at 20:42):

oleans should probably (definitely) store the toolchain name or an ABI version number or something

Shreyas Srinivas (Oct 23 2023 at 20:42):

This hash needs to be computed once for one lake file.olean when it is asked to run a command that probably checks many oleans? Why is it costly?

Mario Carneiro (Oct 23 2023 at 20:43):

right now they are a non-self-describing binary format which means you need to have some external knowledge to even know how to parse them

Mario Carneiro (Oct 23 2023 at 20:44):

the hash is not costly, it was just not done

Mario Carneiro (Oct 23 2023 at 20:44):

the hash of all the lean files in the project is somewhat costly but that's not what we're talking about

Mario Carneiro (Oct 23 2023 at 20:45):

There is no question that hashing the lakefiles and/or adding the toolchain version is an important thing to do and I know Mac is planning on working on it soon because we don't want to enshrine ABI stability of lake

Mario Carneiro (Oct 23 2023 at 20:46):

rc4 was just an emergency measure to prevent data loss while we fix the underlying issue

Mario Carneiro (Oct 23 2023 at 20:47):

Shreyas Srinivas said:

This hash needs to be computed once for one lake file.olean when it is asked to run a command that probably checks many oleans? Why is it costly?

The thing that is costly which motivated the lakefile.olean in the first place is elaboration of the lakefiles

Mario Carneiro (Oct 23 2023 at 20:48):

It's less costly than it used to be, but still a decent fraction of lake's overall runtime

Shreyas Srinivas (Oct 23 2023 at 20:49):

Is this hash stored in the olean when it is created from the respective lean file and cross checked by recomputing the hash on the lean file afterwards?

Shreyas Srinivas (Oct 23 2023 at 20:51):

Maybe I will understand this better if I read how this hash is built, stored, and checked. Where should I look in the source/docs?

Mario Carneiro (Oct 23 2023 at 20:57):

If you look in the build directory you will notice there are .trace files everywhere

Mario Carneiro (Oct 23 2023 at 20:59):

these are recursive hashes of all the inputs to each build output, and when you run lake build it re-hashes everything, checks if the result matches the stored .trace file and only if it agrees does it trust the corresponding build output file (.olean, .c, .o etc)

Mario Carneiro (Oct 23 2023 at 21:00):

lakefile.olean is special because it is done before lake proper is set up (the workspace hasn't been built yet), so it has to go through a different code path and that one doesn't have the same hashing infrastructure

Shreyas Srinivas (Oct 23 2023 at 21:28):

Okay, now the pieces are falling in place in my head. Thanks :)

Shreyas Srinivas (Oct 23 2023 at 21:37):

Basically, before this bootstrapping of Lake is complete, the only way to be sure that lakefile.lean matches its olean is compiling it, which is expensive. So Lake has been faithfully reading lakefile.olean, which is essential to said bootstrapping.

Shreyas Srinivas (Oct 23 2023 at 21:41):

I am guessing a two stage boostrapping process is one possible solution?

Mario Carneiro (Oct 23 2023 at 21:47):

please don't propose that, I just mentioned that complexity is a major issue and that will not help

Shreyas Srinivas (Oct 23 2023 at 21:52):

Not at all. Not proposing anything. At this point, I am just fascinated by how the system design process is hashed out (pun intended) in practice. Fwiw, I suspect Lake is going to get more complex when dependency management with version constraints enters the picture.

Mac Malone (Oct 23 2023 at 23:54):

Julian Berman said:

I'll add that my personal intention is likely to use git clean -xi going forward, which both supports a --dry-run / -n option as well as -i for interactively deleting only after showing me what I'm deleting. FWIW, rm also has such a flag, it's rm -i or rm -I to only ask for confirmation if it's going to remove a lot of stuff

Both of these sound like nice features to add to Lake! :grinning_face_with_smiling_eyes: Would you be willing to make an issue for them? One design question I can think of is whether Lake should traverse the directories it intends to delete and list all files or just list the top-level directory if it intends to delete the whole thing.

Julian Berman (Oct 24 2023 at 00:34):

One design question I can think of is whether Lake should traverse the directories it intends to delete and list all files or just list the top-level directory if it intends to delete the whole thing.

This is indeed a good question, considering git clean -xd will not, but rm -i will :) -- I think either is probably reasonable behavior personally. But yes sure, done as lean4#2740 and lean4#2741!


Last updated: Dec 20 2023 at 11:08 UTC