Zulip Chat Archive

Stream: general

Topic: Lean server crashes


Sebastien Gouezel (Oct 16 2023 at 13:55):

I have updated a branch I am working on to latest master, to take advantage of the speedup. Since then, I have been experiencing many crashes, like

The Lean Server has stopped processing this file:

Server process for file:///c%3A/Users/Sebastien/Desktop/mathlib4/Mathlib/MeasureTheory/Measure/Haar/Unique.lean crashed, likely due to a stack overflow or a bug.

Is it specific to my (windows) setup, or have other people been experiencing the same lately?

Sebastien Gouezel (Oct 16 2023 at 13:57):

Output looks like:

[Error - 3:56:12 PM] Request textDocument/codeAction failed.
  Message: Server process for file:///c%3A/Users/Sebastien/Desktop/mathlib4/Mathlib/MeasureTheory/Measure/Haar/Unique.lean crashed, likely due to a stack overflow or a bug.
  Code: -32902
[Error - 3:56:12 PM] Request textDocument/semanticTokens/full failed.
  Message: Server process for file:///c%3A/Users/Sebastien/Desktop/mathlib4/Mathlib/MeasureTheory/Measure/Haar/Unique.lean crashed, likely due to a stack overflow or a bug.
  Code: -32902

Scott Morrison (Oct 16 2023 at 13:58):

Oh dear! Leo has also had the same problem. :-(

Sebastien Gouezel (Oct 16 2023 at 13:59):

(It crashes every two lines I am typing, so makes it pretty unusable)

Scott Morrison (Oct 16 2023 at 14:06):

I'm out of time here, but if someone is experiencing this knows how to compile Lean and use elan override, I suggest:

  • locally revert lean4#2648
  • rebuild lean
  • have Mathlib use it via elan override (should compile fine)
  • test!

If that solves it we will revert it globally asap.

Sebastien Gouezel (Oct 16 2023 at 14:09):

(A message I am getting all the time is libc++abi: terminating due to uncaught exception of type lean::interrupted. I don't know if this hints in some direction).

Scott Morrison (Oct 16 2023 at 14:09):

I'm 75% confident lean4#2648 will be the source of the problem.

Sebastien Gouezel (Oct 16 2023 at 14:11):

Indeed, it looks like the interrupted exception I am getting has been introduced there.

Johan Commelin (Oct 16 2023 at 14:22):

This sounds like the kind of thing that can't be caught in CI. Does that mean we need some amount of "testing in production"?

Shreyas Srinivas (Oct 16 2023 at 14:22):

Beta release?

Johan Commelin (Oct 16 2023 at 14:23):

Sure, but you need users of the beta release, otherwise it doesn't have value

Shreyas Srinivas (Oct 16 2023 at 14:26):

When I saw rc releases or "release candidates" my first instinct was that they are in fact beta releases of some sort.

Utensil Song (Oct 16 2023 at 14:35):

I assume rc is what Mathlib master will use it for a while, so would projects that depend on Mathlib's master HEAD and have to keep a consistent toolchain. But isn't that most projects? Maybe only teaching projects that deliberately fall behind a version or two would survive. EDIT: oh, only after lake update

Matthew Ballard (Oct 16 2023 at 14:35):

Johan Commelin said:

This sounds like the kind of thing that can't be caught in CI. Does that mean we need some amount of "testing in production"?

What would this look like? I guess nightly testing is to make sure things build and no one is taking for a hard core test drive

Utensil Song (Oct 16 2023 at 14:54):

After update, I crashed it too (on Mac M1). With no Mathlib deps. But it doesn't keep crashing, only intermittent.

Scott Morrison (Oct 16 2023 at 15:10):

I am not going to be able to look at this for >7 hours. If someone could revert #7703 in the meantime I can sort out the details tomorrow.

Matthew Ballard (Oct 16 2023 at 15:55):

#7710

Scott Morrison (Oct 16 2023 at 23:01):

Unfortunately I just haven't been able to reproduce this problem on my computer, so I'm not entirely sure how to proceed from here.

Scott Morrison (Oct 16 2023 at 23:02):

If there is anyone who is able to reproduce, I would really really appreciate if they could help out by following my instructions above to locally revert lean4#2648 and report whether this solves the problem.

Scott Morrison (Oct 16 2023 at 23:03):

Without someone who can reproduce and who can try this, my fallback solution is going to be to revert lean4#2648 on master, cut a v4.2.0-rc3 release, and then just see what happens. :-( But this is a lot of work with uncertain outcome.

Scott Morrison (Oct 16 2023 at 23:04):

It seems the problem occurs on at least Windows and Apple Silicon (but not on mine??).

Matthew Ballard (Oct 16 2023 at 23:04):

Do we have a set context where this happens repeatedly?

Scott Morrison (Oct 16 2023 at 23:09):

Johan Commelin said:

This sounds like the kind of thing that can't be caught in CI. Does that mean we need some amount of "testing in production"?

Yes, the basic problem here is that lean4#2648, which makes changes not testable by CI, should not have been included in a release candidate so quickly after reaching master. If I'd waited longer on it, we hopefully would have seen this either from users on Lean master (i.e. not at Mathlib), or me working on nightly-testing at Mathlib.

This is what happens when there is demand for additional release candidates to get new features. :-)

Scott Morrison (Oct 16 2023 at 23:12):

If there's anyone who was experiencing these crashes available now, I am very happy to screenshare with you to walk you through the step of testing whether it was lean4#2648.

Scott Morrison (Oct 16 2023 at 23:20):

Sebastien Gouezel said:

I have updated a branch I am working on to latest master, to take advantage of the speedup. Since then, I have been experiencing many crashes, like

The Lean Server has stopped processing this file:

Server process for file:///c%3A/Users/Sebastien/Desktop/mathlib4/Mathlib/MeasureTheory/Measure/Haar/Unique.lean crashed, likely due to a stack overflow or a bug.

Is it specific to my (windows) setup, or have other people been experiencing the same lately?

Is this branch public? I am still failing to reproduce.

Scott Morrison (Oct 16 2023 at 23:23):

Found it, SG_haar_unique.

Scott Morrison (Oct 16 2023 at 23:33):

... but still can't reproduce.

Patrick Massot (Oct 16 2023 at 23:49):

Scott Morrison said:

This is what happens when there is demand for additional release candidates to get new features. :-)

My only excuse is that all this came from too much enthusiasm about the great work coming from the FRO. I also ate my own dog food and gave a talk today including live coding with the new rc2 (without hitting any bug).

Scott Morrison (Oct 16 2023 at 23:53):

Lucky. :-)

Scott Morrison (Oct 16 2023 at 23:54):

Sorry, didn't mean to blame you on this! It should definitely have occurred to me that this was an untestable change being rolled into a release candidate soon after hitting master... I'll learn, hopefully.

Utensil Song (Oct 17 2023 at 00:16):

It seems to more likely to happen when I just open a long file, typing in the middle of it, typing triggers slow orange bar then my typing requests code completion. But when it happens, I've tried reproduce it this way in the same spot after many file restart/server restart but it just doesn't happen again. I've tried manually create a monkey test of typing sequence then play undo and redo at different pace after file restart, see orange bar come and go, copilot auto completion and LSP auto completion mixed together, it just doesn't crash. Now I feel that it's quite resilient.

Scott Morrison (Oct 17 2023 at 01:33):

Thanks @Utensil Song. I'm working on making an v4.2.0-rc3 without this PR now.

Sebastien Gouezel (Oct 17 2023 at 05:28):

For me, it happens more often when I'm working close to the end of a longish file. Apart from that, it is pretty nondeterministic.

Scott Morrison (Oct 17 2023 at 05:40):

I've now seen it a handful of times, but have never managed two in a row.

master is now on v4.2.0-rc3, so if you merge master to your branches hopefully everything is good.

Scott Morrison (Oct 17 2023 at 05:40):

(And if it is not, and you're definitely on v4.2.0-rc3, please let me know ASAP)

Sebastien Gouezel (Oct 17 2023 at 05:42):

I've just merged master. We'll see how things go.

Sebastien Gouezel (Oct 17 2023 at 05:42):

And thanks!

Mauricio Collares (Oct 17 2023 at 11:24):

Not sure if this is related, but although I couldn't get the server to crash with Emacs and v4.2.0-rc2, on rc2 almost any editing done on a pending ("orange bar") region causes completions to stop working (using the company completion framework). Take this with a big grain of salt, but it "feels like" some of the edits are being discarded and the LSP server view of the file gets out of sync. v4.2.0-rc3 is fine.

Mauricio Collares (Oct 17 2023 at 12:46):

Ok, got a worker stack trace: https://gist.github.com/collares/ae3a8d5b5f7c6976b8d46b5bfb6e945b. This is on a debug build, commit d15a0a4acb988954137eb902f0991b72d63e8d67. It seems like the kernel calls check_interrupted() directly in for_each_fn::apply.

Sebastian Ullrich (Oct 17 2023 at 13:00):

Thank you!

Yaël Dillies (Oct 17 2023 at 18:27):

I've noticed the same thing as Mauricio. It feels like the LSP server is seeing a different file from the editor.


Last updated: Dec 20 2023 at 11:08 UTC