Zulip Chat Archive

Stream: lean4

Topic: hang/segfault on import


Mac (Aug 08 2021 at 12:37):

I have discovered a Heisenbug with nightly-2021-08-05 (i.e., the zero-cost mmap import update). While working on Papyrus, I have discovered that my unit.lean hangs/segfaults on its imports on versions nightly-2021-08-05 and beyond (and does not on versions before that). Unfortunately, I suspect this issue is not easily reproducible or convertible to a #mwe. In particular, simply copying/renaming the repo to a different directory can cause the bug to not occur. Thus, I suspect the hang is caused by the particular place the imports get randomly placed in memory.

Mac (Aug 08 2021 at 12:47):

The bug also causes the oleans to be undeletable until I force terminate all active lean processes.

Sebastian Ullrich (Aug 08 2021 at 12:48):

Do you have a stacktrace?

Sebastian Ullrich (Aug 08 2021 at 12:49):

The addresses depend only on the module names, so I'm not sure how moving the directory could affect the issue. Also it sounds suspicious that only this single file that isn't part of the build fails, right?

Sebastian Ullrich (Aug 08 2021 at 12:49):

Mac said:

The bug also causes the oleans to be undeletable until I force terminate all active lean processes.

That's just how memory-mapped files work on Windows.

Mac (Aug 08 2021 at 12:51):

Sebastian Ullrich said:

The addresses depend only on the module names, so I'm not sure how moving the directory could affect the issue.

I may be wrong about the directory moving thing. The first time I thought I tried it and it did crash. But I may have been using the wrong version as when I tried to copy and rebuild again it still crashed in a different directory.

Mac (Aug 08 2021 at 12:58):

Sebastian Ullrich said:

Also it sounds suspicious that only this single file that isn't part of the build fails, right?

Not really, as it is the import of Papyrus itself that seems to be failing.

Sebastian Ullrich (Aug 08 2021 at 13:01):

But that's a pretty boring import. It sounds unlikely that an issue would manifest there but not with any of the nested imports, but who knows.

Mac (Aug 08 2021 at 13:05):

Sebastian Ullrich said:

Mac said:

The bug also causes the oleans to be undeletable until I force terminate all active lean processes.

That's just how memory-mapped files work on Windows.

I suspected as much, my current problem is that the locks seems to remain even when there are no more lean processes left (or any other process that seems connected). My current solution is having to logout/login.

Mac (Aug 08 2021 at 13:06):

Scratch that -- they were just hiding under a hanging VS Code process.

Mac (Aug 08 2021 at 13:14):

Sebastian Ullrich said:

But that's a pretty boring import. It sounds unlikely that an issue would manifest there but not with any of the nested imports, but who knows.

My suspicion as to the source of the problem could be incorrect, but it does only break starting on the nightly-2021-08-05 build so I would assume it is an mmap import problem.

Mac (Aug 08 2021 at 13:30):

I figured out what the problem was! I had not properly cleaned the build output from the previous version of Lean. I think the new Lean was somehow breaking on the old build output. Namely, it was not rebuilding TestLib and it was segfaulting because of that import (not Papyrus).

Sebastian Ullrich (Aug 08 2021 at 13:31):

Ok ok, we should really bring back CHECK_OLEAN_VERSION...

Mac (Aug 08 2021 at 13:33):

@Sebastian Ullrich I wonder if a similar build cache error could be the cause of #555's segfault.

Sebastian Ullrich (Aug 08 2021 at 13:33):

We don't cache .oleans

Sebastian Ullrich (Aug 08 2021 at 13:34):

Well, the Nix CI does, but that one's clever enough not to confuse versions

Mac (Aug 08 2021 at 13:36):

Ah. Then what exactly is it caching on Windows?

Sebastian Ullrich (Aug 08 2021 at 13:37):

Only the C/C++ compilation via ccache. At least so far we've never had a problem with ccache soundness.

Mac (Aug 08 2021 at 13:40):

Hmm, but in #555 you are now compiling and linking very differently. Would it be worth it to try clearing the ccache for it and rebuilding jus to see if that is somehow the problem?

Mac (Aug 08 2021 at 13:40):

Because it really feels like that segfault is coming from some quirk of the CI environment that user machines don't share. That cache could certainly be it (even if it may be unlikely that is the case).

Sebastian Ullrich (Aug 08 2021 at 13:43):

Maybe! You're welcome to comment out the cache action and e.g. push the branch to your own master :)

Sebastian Ullrich (Aug 08 2021 at 13:44):

Ah, I guess you might not have Actions configured. I can do it as well.

Sebastian Ullrich (Aug 08 2021 at 14:47):

Nope, same result


Last updated: Dec 20 2023 at 11:08 UTC