I'm getting a weird crash in a mathlib file, that is very susceptible to any changes in the file: the error disappears or re-appears semi-randomly when removing or changing unrelated definitions or certain doc comments in the file. This means that it is hard to minimize, but the crash does consistently happen when running the same file repeatedly (both in VSCode and from the command line). I managed to semi-minimize it to the code here: https://gist.github.com/fpvandoorn/9d21b9530e22bb49ca4adc23484f8402
It depends on mathlib, but gives an error on commit 4d0921d2 of mathlib4 (i.e. current master).
On several versions of the crash I ran gdb together with the help of @Kyle Miller and looked at the stack trace. Each time, the error is Program received signal SIGSEGV, Segmentation fault. and the top line of the backtrace is
It's hard to know if the error is due to lean_alloc_small or if it's from multiple interacting errors, but two places that could cause a segmentation fault are this line in case g_heap is null or this line in case the page for the given object size for some reason hasn't been allocated yet.
One thing I noticed is that the small deallocation functions check for g_heap being null, but it's possible to avoid this check if you go through lean_alloc_small_object
Floris and I got as far as finding that this code causes interpreter::~interpreter() to do reference count decrementing for dead objects.
My best guess at the moment is that this line of interpreter::load() might be the issue. It is trusting that t is the actual type of the returned value -- if that's wrong, then if the result is an object it won't increment the reference count.
The way we found this was by adding lean_assert((uintptr_t)get_next_obj(r) % 8 == 0); right before p->m_header.m_free_list = get_next_obj(r); in lean_alloc_small, which revealed an object r in a free list that had a bad next pointer (an in particular, one whose pointer was one less than what it was supposed to be). Then we added a watch in gdb for when that next pointer got decremented to that value, and that happened to be right in interpreter::~interpreter().
My best guess at the moment is that this line of interpreter::load() might be the issue. It is trusting that t is the actual type of the returned value -- if that's wrong, then if the result is an object it won't increment the reference count.
That should be easy to check. Can you add an assert there?
#11 0x00007ffff7a356e2 in lean::ir::interpreter::~interpreter (
this=0x7fffffffc150, __in_chrg=<optimized out>)
at /home/kmill/projects/lean4/src/library/compiler/ir_interpreter.cpp:950
#12 0x00007ffff7a3707b in lean::ir::interpreter::with_interpreter<lean_object*>(lean::environment const&, lean::options const&, lean::name const&, std::function--Type <RET> for more, q to quit, c to continue without paging--
<lean_object* (lean::ir::interpreter&)> const&) (env=..., opts=..., fn=...,
f=...)
at /home/kmill/projects/lean4/src/library/compiler/ir_interpreter.cpp:388
#13 0x00007ffff7a344a9 in lean::ir::interpreter::stub_m_aux (
args=0x7fffffffc320)
at /home/kmill/projects/lean4/src/library/compiler/ir_interpreter.cpp:899
#14 0x00007ffff7a3477a in lean::ir::interpreter::stub_6_aux (
x_1=0x5555563b7388, x_2=0x55555645b778, x_3=0x555556275da8,
x_4=0x555556679820, x_5=0x555556523c38, x_6=0x1)
at /home/kmill/projects/lean4/src/library/compiler/ir_interpreter.cpp:910
#15 0x00007ffff7a64ab0 in lean::lean_apply_3 (f=0x555556275928,
a1=0x555556679820, a2=0x555556523c38, a3=0x1)
at /home/kmill/projects/lean4/src/runtime/apply.cpp:232
#16 0x00007ffff7a64997 in lean::lean_apply_3 (f=0x5555562bc1b8,
a1=0x555556679820, a2=0x555556523c38, a3=0x1)
at /home/kmill/projects/lean4/src/runtime/apply.cpp:230
#17 0x00007ffff5b7bace in l_Lean_Elab_withInfoTreeContext___at_Lean_Elab_Command_withMacroExpansion___spec__1___rarg ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#18 0x00007ffff5b7d0b2 in l___private_Lean_Elab_Command_0__Lean_Elab_Command_elabCommandUsing ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#19 0x00007ffff7a64ab0 in lean::lean_apply_3 (f=0x5555562c4278,
a1=0x555556679820, a2=0x555556523c38, a3=0x1)
at /home/kmill/projects/lean4/src/runtime/apply.cpp:232
#20 0x00007ffff7a64a14 in lean::lean_apply_3 (f=0x555556266588,
a1=0x555556679790, a2=0x555556523c38, a3=0x1)
at /home/kmill/projects/lean4/src/runtime/apply.cpp:231
#21 0x00007ffff5b869c7 in l_Lean_Elab_withLogging___at_Lean_Elab_Command_elabCommand___spec__2 ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#22 0x00007ffff5b938b0 in l_Lean_Elab_Command_elabCommandTopLevel___lambda__2
()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#23 0x00007ffff5c58d65 in l_Lean_Elab_Frontend_elabCommandAtFrontend ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#24 0x00007ffff5c59856 in l_Lean_Elab_Frontend_elabCommandAtFrontend___boxed ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#25 0x00007ffff7a5e35f in lean::lean_apply_1 (f=0x5555562c5bf8, a1=0x1)
at /home/kmill/projects/lean4/src/runtime/apply.cpp:111
--Type <RET> for more, q to quit, c to continue without paging--
#26 0x00007ffff6086cb4 in l_Lean_profileitIOUnsafe___rarg___lambda__1 ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#27 0x00007ffff6086e01 in l_Lean_profileitIOUnsafe___rarg___lambda__1___boxed
()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#28 0x00007ffff7a5e26b in lean::lean_apply_1 (f=0x5555562bc618, a1=0x1)
at /home/kmill/projects/lean4/src/runtime/apply.cpp:109
#29 0x00007ffff7989018 in lean::apply_1 (f=0x5555562bc618, a1=0x1)
at /home/kmill/projects/lean4/src/runtime/object.h:134
#30 0x00007ffff7988823 in lean::lean_profileit (category=0x7ffff3fec788,
opts=0x55555645b778, fn=0x5555562bc618)
at /home/kmill/projects/lean4/src/library/time_task.cpp:69
#31 0x00007ffff6086e99 in l_Lean_profileitIOUnsafe___rarg ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#32 0x00007ffff5c5c2d0 in l_Lean_Elab_Frontend_processCommand ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#33 0x00007ffff5c5d07e in l_Lean_Elab_Frontend_processCommands ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
--Type <RET> for more, q to quit, c to continue without paging--
#34 0x00007ffff5c5d5a6 in l_Lean_Elab_IO_processCommands ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#35 0x00007ffff5c5e9f3 in l_Lean_Elab_runFrontend___lambda__2 ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#36 0x00007ffff5c5fe2e in lean_run_frontend ()
from /home/kmill/projects/lean4/build/stage1/bin/../lib/lean/libleanshared.so
#37 0x00007ffff79002fc in lean::run_new_frontend (
input="/-\nCopyright (c) 2022 Floris van Doorn. All rights reserved.\nReleased under Apache 2.0 license as described in the file LICENSE.\nAuthors: Floris van Doorn\n-/\n\nimport Mathlib.Init.Data.Nat.Notation\nimp"..., opts=...,
file_name="./././Mathlib/buggg.lean", main_module_name=...,
trust_level=1025, ilean_file_name=...)
at /home/kmill/projects/lean4/src/util/shell.cpp:372
#38 0x00007ffff79020bd in lean_main (argc=12, argv=0x7fffffffded8)
at /home/kmill/projects/lean4/src/util/shell.cpp:726
#39 0x000055555555516d in main (argc=12, argv=0x7fffffffded8)
at /home/kmill/projects/lean4/src/shell/lean.cpp:15
@Gabriel Ebner I'm not sure what the assert should be exactly since value is a union, but there's something I could try here to at least test the theory.
So far it looks like my theory is wrong -- the interpreter::load() function increments the reference count just fine, so there must be something somewhere else double decrementing.
This gives you the address of an object o that is being double freed. On my system it is reliably at 0x5555563db268 (which I'll use below).
I then set a breakpoint for src/library/compiler/ir_interpreter.cpp:832 conditioned on r.m_obj == 0x5555563db268 and ran Lean until I reached the breakpoint. Then I added a watchpoint for *0x5555563db268 conditioned on *(uint32_t*)0x5555563db268 == 1 and continued execution. This leads you into the middle of the deallocation routines. From within lean_del_core, you can see that there is a 4-argument constructor that is being deallocated, and the first two elements are 0x5555563db268.
So somehow this object appears twice without the reference count being incremented properly, leading to it being marked as free even though it's still in m_constant_cache.
Here is a little bit further. You can get the name of the m_constant_cache object that's double freed.
If you then do set_option trace.complier true, you can identify it as what gets assigned to _x_2. I think it's very likely it's the object assigned to _x_8 that ends up with _x_2 as its first two arguments where the reference count for _x_2 doesn't get incremented correctly.