Zulip Chat Archive

Stream: mathlib4

Topic: ltar error aften lean#4402


Kim Morrison (Jun 13 2024 at 04:24):

@Mario Carneiro, @Mac Malone, calling your attention to the build failure at https://github.com/leanprover-community/mathlib4/actions/runs/9493155801/job/26161439009, which is my attempt to test Mathlib on lean#4402.

Unfortunately, the .trace file has changed format, and ltar no longer accepts it.

Kim Morrison (Jun 13 2024 at 04:25):

Can I leave it to the two of you to sort this out? Unfortunately there is some time pressure, as we would like to have lean#4402 in v4.9.0-rc2, which I am otherwise ready to release.

If initial triage suggests this will not be easy to fix, or there isn't availability to fix it soon, that would be helpful to hear, so I can proceed with v4.9.0-rc2 without including lean#4402.

Kim Morrison (Jun 13 2024 at 04:26):

(Also just pinging @Sebastian Ullrich as they have moving pieces in the rc2 release!)

Mario Carneiro (Jun 13 2024 at 04:26):

I'll try to get to this today or tomorrow

Kim Morrison (Jun 13 2024 at 04:27):

Thanks.

Mario Carneiro (Jun 13 2024 at 04:27):

it will be helpful in the meanwhile if @Mac Malone summarizes the content of the change

Kim Morrison (Jun 13 2024 at 04:29):

Sorry about springing this on short notice: unfortunately Mathlib CI disables the "check cache" step on lean-pr-testing-NNNN branches, so I didn't notice this until a manual check at the last moment. I've modified cache already so it doesn't expect a .log.json file (which are no longer used).

Mac Malone (Jun 13 2024 at 04:32):

Mario Carneiro said:

it will be helpful in the meanwhile if Mac Malone summarizes the content of the change

.trace and .log.json have been merged, so now .trace is a JSON file with depHash and log fields.

Also, module .hash files are now reliably regenerated on rebuild (and on --rehash) addressing lean4#2751. I don't think that shoud require any cache changes, though.

Mario Carneiro (Jun 13 2024 at 12:06):

I get the following error when trying to run builds including lean#4402:

$ lake +lean4 build
 [113/172] Building Batteries.Data.Char
trace: ././.lake/build/lib/Batteries/Data/Char.trace: invalid trace file: Lake.BuildMetadata.depHash: String expected
trace: .> LEAN_PATH=././.lake/build/lib LD_LIBRARY_PATH=././.lake/build/lib /home/mario/Documents/lean/lean4/build/release/stage1/bin/lean -Dlinter.missingDocs=true ././././Batteries/Data/Char.lean -R ./././. -o ././.lake/build/lib/Batteries/Data/Char.olean -i ././.lake/build/lib/Batteries/Data/Char.ilean -c ././.lake/build/ir/Batteries/Data/Char.c --json
error: ././././Batteries/Data/Char.lean:26:58: unknown constant 'Char.utf8Size_le_four'
error: Lean exited with code 1
 [165/172] Running Batteries.Data.HashMap (+ 2 more)

The error itself is about a missing constant, this is just usual migration stuff. The point I want to highlight is the first line, which is a failure of lake to parse its own hash files

Mario Carneiro (Jun 13 2024 at 12:08):

The resulting trace files are also problematic from a caching perspective:

$ cat .lake/build/lib/Batteries/CodeAction.trace
{"log":
 [{"message":
   ".> LEAN_PATH=././.lake/build/lib LD_LIBRARY_PATH=././.lake/build/lib /home/mario/Documents/lean/lean4/build/release/stage1/bin/lean -Dlinter.missingDocs=true ././././Batteries/CodeAction.lean -R ./././. -o ././.lake/build/lib/Batteries/CodeAction.olean -i ././.lake/build/lib/Batteries/CodeAction.ilean -c ././.lake/build/ir/Batteries/CodeAction.c --json",
   "level": "trace"}],
 "depHash": "17092602937924886406"}

I would really like it if a clean build produced no log messages, especially not one containing /home/mario/Documents/

Mario Carneiro (Jun 13 2024 at 12:09):

@Mac Malone

Mac Malone (Jun 13 2024 at 15:01):

Mario Carneiro said:

I would really like it if a clean build produced no log messages, especially not one containing /home/mario/Documents/

The contents of the log have not changed with this PR. The old log.json also had this (and it was previously cached). Recall that the point of the cached log is precisely to retain the log of a clean build (so that it can potentially fail if a a future build sues a more restrictive log level (e.g., --wfail or --iofail).

Mac Malone (Jun 13 2024 at 15:02):

However, I do agree that, for caching, you probably do not want to include the log.

Mac Malone (Jun 13 2024 at 15:27):

Mario Carneiro said:

The point I want to highlight is the first line, which is a failure of lake to parse its own hash files

What was the contents of Batteries/Data/Char.trace? If it was the old numeric trace, that would be why.

Mario Carneiro (Jun 13 2024 at 18:46):

Mac Malone said:

However, I do agree that, for caching, you probably do not want to include the log.

Actually my plan is not to completely remove the log, but rather to remove only the verbose messages. The mathlib cache actually does include failing files, e.g. when someone pushes a file with errors, and it is still useful to get replayed error messages after downloading the run results. But it is surprising that these messages are logged even when verbosity level is set low. I spoke with @Rob Lewis about this, and we agreed that the most sensible behavior would be to only store the messages that were actually printed at the selected verbosity level, and possibly also the verbosity level setting itself so that you know not to replay if the verbosity level is changed. Logging all messages seems like a dangerous thing, since people could unknowingly be printing quite a lot at higher verbosity levels, not realizing that these are effectively always logged "just in case" they are replayed at higher verbosity later.

Mario Carneiro (Jun 13 2024 at 19:20):

@Kim Morrison leantar v0.1.12 is pushed and compiled releases are ready. It is fully backward compatible as usual, so all you need to do is update LEANTARVERSION in Cache.IO on nightly-testing, and it can be PR'd to master in advance of lean4#4402 landing.

Kim Morrison (Jun 13 2024 at 21:33):

#13808

Mario Carneiro (Jun 13 2024 at 21:34):

actually maybe don't auto merge, make sure the cache works first

Kim Morrison (Jun 13 2024 at 21:34):

There's a cache check step in CI: it clears the oleans, and checks that lake exe cache get; lake build --no-build works.

Kim Morrison (Jun 13 2024 at 21:34):

But yes, I'll also manually verify!

Kim Morrison (Jun 13 2024 at 21:35):

(Unfortunately we disable this step on lean-pr-testing-NNNN branches, which is why we didn't catch this problem in the first place!)

Mario Carneiro (Jun 13 2024 at 21:35):

it also needs testing on the 4402 branch itself

Mario Carneiro (Jun 13 2024 at 21:36):

I did some local testing that ltar files are modified in the expected way with both versions of lean, but I did not test on mathlib

Kim Morrison (Jun 13 2024 at 21:37):

Ah, good point. I'll cherry-pick now.

Mario Carneiro (Jun 14 2024 at 22:52):

Has anyone else been getting errors like:

$ lake exe cache get
...
/home/mario/.cache/mathlib/815b0d854b744bbf.ltar: bad .trace file
/home/mario/.cache/mathlib/13cb55e27cd34313.ltar: bad .trace file
uncaught exception: leantar failed with error code 1

since rc2? I'm worried that I broke something in leantar 0.1.12, but I'm still isolating the problem.

Mario Carneiro (Jun 14 2024 at 23:10):

Aha, I think I know the source of the issue: this is on a mathlib branch prior to #13808, meaning it uses leantar 0.1.11 which does not understand the new ltar or trace file format. There is no problem with the ltar file, it is using the old format, but the .trace file inside the .lake directory is using the new json format, which neither old leantar nor old lake can interpret. To summarize:

  • If you have old .trace files after bumping lean to rc2 or switching to an rc2 branch (backward compatibility), leantar will work and lake will fail
  • If you have new (rc2) .trace files after downgrading to rc1 or switching to an rc1 branch (forward compatibility), both leantar and lake will fail

The solution in both cases is to rm -rf .lake when switching between versions.

Mario Carneiro (Jun 14 2024 at 23:13):

I'm going to build in some forward compatibility for the next leantar version, so this doesn't happen in the future

Mario Carneiro (Jun 15 2024 at 01:09):

#13843

Mac Malone (Jun 17 2024 at 13:55):

@Mario Carneiro Is this compatibility a common problem? My working assumption was that Lake traces were only ever used by the same Lean/Lake version as otherwise the module needs rebuilding anyway (as the Lean version changed). Or are you saying the Lake errors rather than simply rebuilding the module?

Mario Carneiro (Jun 17 2024 at 20:54):

Lake seems to fail. Upgrade scenario:

$ rm -rf .lake
$ git co 14f25859  # this is 4.8.0-rc1
$ lake build Batteries.CodeAction.Misc
$ git co 47e4cc5c  # this is 4.9.0-rc2
$ lake build Batteries.CodeAction.Misc
✖ [4/4] Building Batteries.CodeAction.Misc
trace: ././.lake/build/lib/Batteries/CodeAction/Misc.trace: invalid trace file: Lake.BuildMetadata.depHash: String expected
trace: .> LEAN_PATH=././.lake/build/lib LD_LIBRARY_PATH=././.lake/build/lib /home/mario/.elan/toolchains/leanprover--lean4---v4.9.0-rc2/bin/lean -Dlinter.missingDocs=true ././././Batteries/CodeAction/Misc.lean -R ./././. -o ././.lake/build/lib/Batteries/CodeAction/Misc.olean -i ././.lake/build/lib/Batteries/CodeAction/Misc.ilean -c ././.lake/build/ir/Batteries/CodeAction/Misc.c --json
error: ././././Batteries/CodeAction/Misc.lean:6:0: failed to read file '././.lake/build/lib/Batteries/Lean/Position.olean', invalid header
error: ././././Batteries/CodeAction/Misc.lean:21:5: unknown namespace 'Lean'
error: ././././Batteries/CodeAction/Misc.lean:27:24: expected token
...

Mario Carneiro (Jun 17 2024 at 20:56):

Downgrade scenario fails in the same way:

$ rm -rf .lake
$ git co 47e4cc5c  # this is 4.9.0-rc2
$ lake build Batteries.CodeAction.Misc
$ git co 14f25859  # this is 4.8.0-rc1
$ lake build Batteries.CodeAction.Misc
[4/4] Building Batteries.CodeAction.Misc
trace: .> LEAN_PATH=././.lake/build/lib LD_LIBRARY_PATH=././.lake/build/lib /home/mario/.elan/toolchains/leanprover--lean4---v4.8.0-rc1/bin/lean -Dlinter.missingDocs=true ././././Batteries/CodeAction/Misc.lean -R ./././. -o ././.lake/build/lib/Batteries/CodeAction/Misc.olean -i ././.lake/build/lib/Batteries/CodeAction/Misc.ilean -c ././.lake/build/ir/Batteries/CodeAction/Misc.c --json
error: ././././Batteries/CodeAction/Misc.lean:6:0: failed to read file '././.lake/build/lib/Batteries/Lean/Position.olean', invalid header
error: ././././Batteries/CodeAction/Misc.lean:21:5-21:9: unknown namespace 'Lean'
error: ././././Batteries/CodeAction/Misc.lean:27:24: expected token
...

Mario Carneiro (Jun 17 2024 at 21:00):

This issue seems likely to hit everyone whenever bumping a project across the 4.8.0 -> 4.9.0 boundary or jumping between branches on opposite sides of that boundary. The fix is simple but not particularly well suggested by the error message (unless you happen to be aware that nuking the lake folder is a panacea for lake state issues)

Mario Carneiro (Jun 17 2024 at 21:01):

@Mac Malone

Mac Malone (Jun 17 2024 at 21:08):

Mario Carneiro The problem there appears to be that Lake is assuming the olean is up-to-date because the trace file is missing/invalid and (presumably) the modification time of the olean is greater than that of its inputs. However, since this is crossing a toolchain boundary, Lean gets mad.

Mac Malone (Jun 17 2024 at 21:11):

(The same thing would also happen if the trace file was removed.)

Mac Malone (Jun 17 2024 at 21:15):

Part of this is a question of what Lake should do without a trace file. Currently, it falls back to modification time. With that in mind, though, I probably should have made the tracing backwards-compatible.

Mario Carneiro (Jun 17 2024 at 21:17):

I think the damage is already done. The best we can do at this point is publicize the fix for people running into this issue this month. Best to focus on making sure that it doesn't happen in the future, and probably switching to json gives you the needed forward compatibility already.

Mario Carneiro (Jun 17 2024 at 21:18):

Actually that's not quite true, there is still time to slip in a backward compatibility fix before 4.9.0 stable goes out. That won't fix the pre-existing forward compatibility issue in 4.8.0, but downgrades are relatively less common anyway.

Mario Carneiro (Jun 17 2024 at 21:20):

the simplest backward compatibility fix here is not even to read the old format and just treat a "malformed .trace file" as out of date

Yaël Dillies (Jun 17 2024 at 21:45):

Mario Carneiro said:

[...] unless you happen to be aware that nuking the lake folder is a panacea for lake state issues

When all you have is a nuke, everything looks like a state issue, they say?

Yaël Dillies (Jun 17 2024 at 21:46):

I think it would help a lot to encourage everyone to update their PRs to v4.9.0, and to take that opportunity to publicise the fix

Mac Malone (Jun 17 2024 at 22:34):

Mario Carneiro said:

the simplest backward compatibility fix here is not even to read the old format and just treat a "malformed .trace file" as out of date

To clairfy, are you suggesting I should treat missing and invalid traces differently?

Mario Carneiro (Jun 17 2024 at 22:35):

I guess if the trace is missing then it's also out of date?

Mario Carneiro (Jun 17 2024 at 22:36):

This is what I implemented in leantar in any case, there is no date matching

Mac Malone (Jun 17 2024 at 22:37):

The date matching is valuable when other systems may also provide the file (For example, this is what prevents Lake from rebuilding core if Make has already built it.)

Mario Carneiro (Jun 17 2024 at 22:38):

that sounds like a specialized option, possibly core-only

Mac Malone (Jun 17 2024 at 22:38):

(Theoretically this would also have worked with cache were it not to also cache traces.)

Mario Carneiro (Jun 17 2024 at 22:38):

I think it is reasonable that alternative build systems (like cache to some extent) should also be responsible for producing valid .trace files

Mario Carneiro (Jun 17 2024 at 22:39):

a project that works with build systems not satisfying this invariant should be signaled in some way

Mario Carneiro (Jun 17 2024 at 22:42):

Given only the .olean files / everything except .trace files, I don't see any obvious correct method for validating that these files are not out of date and rebuilding the ones that are

Eric Wieser (Jun 17 2024 at 23:46):

Mario Carneiro said:

I think it is reasonable that alternative build systems (like cache to some extent) should also be responsible for producing valid .trace files

Is there a speciifcation for .trace files? I don't see how any alternative build system (eg nix or bazel) can reasonably assemble them today, short of invoking lake and hoping for it not to build any targets but the specified one

Mario Carneiro (Jun 17 2024 at 23:47):

the structure is specified, but not the hash itself inside depHash (although I did write an external implementation of the hash algorithm for lean-cache once)

Mario Carneiro (Jun 17 2024 at 23:48):

But lake could offer a mode where it only computes hashes and assumes everything else is up to date

Eric Wieser (Jun 18 2024 at 00:01):

Mario Carneiro said:

although I did write an external implementation of the hash algorithm for lean-cache once

Presumably the format has changed since then; if the hash isn't specified, I assume it has also changed?

Utensil Song (Oct 16 2024 at 12:17):

Mario Carneiro said:

This issue seems likely to hit everyone whenever bumping a project across the 4.8.0 -> 4.9.0 boundary or jumping between branches on opposite sides of that boundary.

The solution in both cases is to rm -rf .lake when switching between versions.

I hit this today, crossing the boundary from leanprover/lean4:v4.8.0-rc1 to leanprover/lean4:v4.11.0. Removing the cache didn't work (because the error indicates only about ~/.cache), and removing .lake works like a charm. Thanks for noticing and posting this.

Another thing I have observed is that after rm -rf .lake and re-executing lake exec cache get, there is a long silence from lake (more than 15 seconds, enough for me to think it hung, and halt lake). It turns out that lake is just busy cloning packages from GitHub. Maybe in this case there could be some sign from lake that it's already working?

Kevin Buzzard (Oct 16 2024 at 12:40):

I have also seen uncomfortably long pauses from lake which I have been unable to understand. More output from lake would be wonderful!

Markus Himmel (Oct 16 2024 at 12:42):

We have lean4#5615 to track this issue, feel free to give a thumbs up there.


Last updated: May 02 2025 at 03:31 UTC