Zulip Chat Archive

Stream: lean4

Topic: allocation crash?


Floris van Doorn (Feb 06 2023 at 17:43):

I'm getting a weird crash in a mathlib file, that is very susceptible to any changes in the file: the error disappears or re-appears semi-randomly when removing or changing unrelated definitions or certain doc comments in the file. This means that it is hard to minimize, but the crash does consistently happen when running the same file repeatedly (both in VSCode and from the command line). I managed to semi-minimize it to the code here:
https://gist.github.com/fpvandoorn/9d21b9530e22bb49ca4adc23484f8402
It depends on mathlib, but gives an error on commit 4d0921d2 of mathlib4 (i.e. current master).

On several versions of the crash I ran gdb together with the help of @Kyle Miller and looked at the stack trace. Each time, the error is Program received signal SIGSEGV, Segmentation fault. and the top line of the backtrace is

#0  0x00007ffff7ebd7c2 in lean_alloc_small () from /home/vandoorn/.elan/toolchains/leanprover--lean4---nightly-2023-02-03/bin/../lib/lean/libleanshared.so

The rest of the lines of the backtrace vary wildly depending on small changes in the file.

Floris van Doorn (Feb 06 2023 at 17:49):

Remark: the file only executes lines 217-237, and that definition doesn't refer to any previous definitions in the same file (only some structures).

Kyle Miller (Feb 06 2023 at 17:50):

It's hard to know if the error is due to lean_alloc_small or if it's from multiple interacting errors, but two places that could cause a segmentation fault are this line in case g_heap is null or this line in case the page for the given object size for some reason hasn't been allocated yet.

One thing I noticed is that the small deallocation functions check for g_heap being null, but it's possible to avoid this check if you go through lean_alloc_small_object

Kevin Buzzard (Feb 06 2023 at 18:21):

(FWIW I can reproduce, so well done for minimising!)

Kyle Miller (Feb 07 2023 at 20:23):

Floris and I got as far as finding that this code causes interpreter::~interpreter() to do reference count decrementing for dead objects.

My best guess at the moment is that this line of interpreter::load() might be the issue. It is trusting that t is the actual type of the returned value -- if that's wrong, then if the result is an object it won't increment the reference count.

Kyle Miller (Feb 07 2023 at 20:30):

The way we found this was by adding lean_assert((uintptr_t)get_next_obj(r) % 8 == 0); right before p->m_header.m_free_list = get_next_obj(r); in lean_alloc_small, which revealed an object r in a free list that had a bad next pointer (an in particular, one whose pointer was one less than what it was supposed to be). Then we added a watch in gdb for when that next pointer got decremented to that value, and that happened to be right in interpreter::~interpreter().

lean_asserts in context

Gabriel Ebner (Feb 07 2023 at 20:34):

Kyle Miller said:

My best guess at the moment is that this line of interpreter::load() might be the issue. It is trusting that t is the actual type of the returned value -- if that's wrong, then if the result is an object it won't increment the reference count.

That should be easy to check. Can you add an assert there?

Kyle Miller (Feb 07 2023 at 20:35):

gdb stack trace part one

Kyle Miller (Feb 07 2023 at 20:36):

gdb stack trace part two

(this one has interpreter::~interpreter)

Kyle Miller (Feb 07 2023 at 20:36):

@Gabriel Ebner I'm not sure what the assert should be exactly since value is a union, but there's something I could try here to at least test the theory.

Gabriel Ebner (Feb 07 2023 at 20:40):

Ah, I see what you mean. I got a bit confused by the name type_is_scalar and thought it referred to scalars stored using pointer tagging.

Sebastian Ullrich (Feb 07 2023 at 20:52):

I don't have much time for Lean right now, but it would not be the first bug resulting from an interpreter/compiler ABI mismatch

Sebastian Ullrich (Feb 07 2023 at 20:54):

rr (reverse debugger) on a debug+sanitized build without custom allocator is my favored setup for debugging runtime bugs btw

Kyle Miller (Feb 07 2023 at 20:55):

So far it looks like my theory is wrong -- the interpreter::load() function increments the reference count just fine, so there must be something somewhere else double decrementing.

Kyle Miller (Feb 07 2023 at 21:30):

If anyone wants to pick this up, here's how to get to something that seems close to the problem.

First, using a breakpoint or an assert like the following, the lean file ends up failing the following assert:

static inline void lean_dec_ref(lean_object * o) {
     if (LEAN_LIKELY(o->m_rc > 1)) {
+        assert(o->m_rc < 100000);
         o->m_rc--;

This gives you the address of an object o that is being double freed. On my system it is reliably at 0x5555563db268 (which I'll use below).

I then set a breakpoint for src/library/compiler/ir_interpreter.cpp:832 conditioned on r.m_obj == 0x5555563db268 and ran Lean until I reached the breakpoint. Then I added a watchpoint for *0x5555563db268 conditioned on *(uint32_t*)0x5555563db268 == 1 and continued execution. This leads you into the middle of the deallocation routines. From within lean_del_core, you can see that there is a 4-argument constructor that is being deallocated, and the first two elements are 0x5555563db268.

So somehow this object appears twice without the reference count being incremented properly, leading to it being marked as free even though it's still in m_constant_cache.

Kyle Miller (Feb 07 2023 at 21:48):

Here is a little bit further. You can get the name of the m_constant_cache object that's double freed.

If you then do set_option trace.complier true, you can identify it as what gets assigned to _x_2. I think it's very likely it's the object assigned to _x_8 that ends up with _x_2 as its first two arguments where the reference count for _x_2 doesn't get incremented correctly.

[compiler.lambda_pure] >> _tmp._@.Mathlib.buggg._hyg.3840._closed_4 := let _x_1 := Option._cnstr.0.0.0;
let _x_2 := _tmp.«_@».Mathlib.buggg._hyg.3840._closed_3;
let _x_3 := 1;
let _x_4 := _x_3;
let _x_5 := 0;
let _x_6 := _x_5;
let _x_7 := Simps.ParsedProjectionData.projNrs._default._closed_1;
let _x_8 := Simps.ParsedProjectionData._cnstr.0.0.3 _x_2 _x_2 _x_1 _x_7;
let _x_9 := _sset.1.4.0 _x_8 _x_4;
let _x_10 := _sset.1.4.1 _x_9 _x_6;
let _x_11 := _sset.1.4.2 _x_10 _x_6;
let _x_12 := _x_11;
_x_12

Kyle Miller (Feb 07 2023 at 21:49):

What's responsible for processing the _x_2 arguments for _x_8 and incrementing the reference counts? I'm not very familiar with the IR interpreter.

Gabriel Ebner (Feb 07 2023 at 21:54):

What does set_option compiler.ir.result true show? λ-pure is the phase before refcounting, I believe.

Kyle Miller (Feb 07 2023 at 21:57):

def _tmp.«_@».Mathlib.buggg._hyg.3840._closed_4 : obj :=
  let x_1 : obj := ctor_0[Option.none];
  let x_2 : obj := _tmp.«_@».Mathlib.buggg._hyg.3840._closed_3;
  let x_3 : u8 := 1;
  let x_4 : u8 := 0;
  let x_5 : obj := Simps.ParsedProjectionData.projNrs._default._closed_1;
  inc x_2;
  let x_6 : obj := ctor_0.0.3[Simps.ParsedProjectionData.mk] x_2 x_2 x_1 x_5;
  sset x_6[4, 0] : u8 := x_3;
  sset x_6[4, 1] : u8 := x_4;
  sset x_6[4, 2] : u8 := x_4;
  ret x_6

Kyle Miller (Feb 07 2023 at 21:58):

That inc x_2 there should be enough right?

Sebastian Ullrich (Feb 07 2023 at 22:01):

I thought the interpreter does the other increment in load?


Last updated: Dec 20 2023 at 11:08 UTC