Zulip Chat Archive

Stream: lean4

Topic: Debugging compile time memory overflow


Yicheng Qian (Mar 24 2023 at 05:56):

I'm working on a Lean project where we want to write a custom tactic, and I found it frustrating to debug compile-time memory overflow errors. Lean's trace mechanism stops working when there is memory overflow, and it's difficult to find out where things might go wrong in a large codebase. I've tried gdb --args lean -threads=1 Main.lean, but then lean produces unknown package error for the import commands, which is not surprising.

Generally, I want to know more about how to debug lean when there is compile-time memory overflow. It would be of great help if I can get more information about how the memory overflow occurred, e.g, getting the backtrace.

Reid Barton (Mar 24 2023 at 06:05):

You can try lake env gdb --args lean [...]

Yicheng Qian (Mar 24 2023 at 06:06):

Ah, it works now! Thanks!

Yicheng Qian (Mar 24 2023 at 07:21):

Now compile-time calls to IO.println are printed (even without gdb) which is already very helpful.

I also get backtrace when there is internal panic (where I can b lean_internal_panic) or when the program receives SIGINT (where I can handle SIGINT stop). However, the backtrace does not provide much information because it seems that Lean does not directly call the user-defined functions. Instead it calls lean::ir::interpreter::call and lean::ir::interpreter::eval_body. I don't know which arguments are supplied to these two functions, so I still can't directly determine in which function did the stack overflow actually occur.

Sebastian Ullrich (Mar 24 2023 at 07:29):

With a debug build of Lean you can use -Dtrace.interpreter.call=true

Yicheng Qian (Mar 24 2023 at 07:39):

I find

git clone https://github.com/leanprover/lean4 --recurse-submodules
cd lean4
mkdir -p build/release
cd build/release
cmake ../..
make

on Lean's manual https://leanprover.github.io/lean4/doc/make/index.html. How can I modify this to get a debug build of Lean? Also currently our project uses lean-toolchain to determine lean's version. Is it also possible to modify lean-toolchain so that we directly get a debug build?

Yicheng Qian (Mar 24 2023 at 07:41):

Also is it possible to use elan?

Sebastian Ullrich (Mar 24 2023 at 07:53):

Use -D CMAKE_BUILD_TYPE=DEBUG as documented below that code block. You may want to adjust the name of the build directory, but it doesn't really matter. After building, you can use elan toolchain link as described in https://leanprover.github.io/lean4/doc/dev/index.html#dev-setup-using-elan, lean-toolchain will understand the new name.

Yicheng Qian (Mar 24 2023 at 10:43):

Ah, I missed that part. I should have read the manual more carefully :-)

Yicheng Qian (Mar 24 2023 at 10:47):

I also figured out that we can bt and examine the lean::ir::interpreter::call(this, fn, args) frame using print_args. This only works in the debug build. This way we get the pointer to fn, and then get the name of the function using

print ((lean::name *)(<ptr_to_fn>))->to_string(".")

in gdb.

Yicheng Qian (Mar 24 2023 at 10:49):

The only annoying part is that compiling and building seems to be a lot slower in the debug build.

Yicheng Qian (Mar 24 2023 at 11:25):

Oh, there is also a lean_gdb.py which might be helpful.

Sebastian Ullrich (Mar 24 2023 at 12:12):

Ah yes, instead of changing lean-toolchain I should have recommended compiling only the file in question with the debug Lean as in lake env lean +your-toolchain your-file.lean -Dtrace.interpreter.call=true

Sebastian Ullrich (Mar 24 2023 at 12:13):

lean_gdb helps only with inspecting Lean objects from gdb, which hopefully you won't need to do

Yicheng Qian (Mar 24 2023 at 16:36):

Sebastian Ullrich said:

Ah yes, instead of changing lean-toolchain I should have recommended compiling only the file in question with the debug Lean as in lake env lean +your-toolchain your-file.lean -Dtrace.interpreter.call=true

This works perfectly!

Yicheng Qian (Mar 24 2023 at 17:01):

There is still a problem with -Dtrace.interpreter.call=true. When I used it to debug a real program, printing all interpreter calls makes the program extremely slow (possibly several hundreds of times slower, even though I've redirected the output to a file). Since the time of the execution for our tactic is often around 0.1~10 seconds (sometimes even minutes), we'll wait for too long before the program actually reaches the function that causes stack overflow.

Currently, it seems to me that gdb the program using lean debug build, setting breakpoint at panic (and stop on SIGINT), invoke bt when overflow occurs, go up the stack and inspect arguments of lean::ir::interpreter::call works best. The execution time of the program is about 10 times slower (than release build) using debug build. It's still kind of acceptable.

It would be nice if we can use the above approach in release build though. The gdb command info args doesn't work in release builds, but I think it won't take too long to find out where the arguments are located if we look at the assembly code. I'll try to figure this out later (if there aren't any better approaches).

Sebastian Ullrich (Mar 24 2023 at 17:07):

In the end you will want these tactics to be compiled anyway, which will also solve the stacktrace problem. Have you given precompileModules a try yet?

Yicheng Qian (Mar 24 2023 at 17:13):

The option is already set to true.

Yicheng Qian (Mar 24 2023 at 17:15):

Ah, I think I haven't described the "compile-time error" clearly enough.

Yicheng Qian (Mar 24 2023 at 17:21):

The scenario is as follows:

  1. The tactic files are already successfully compiled.
  2. We create a new file, for example foo.lean, and add these lines
    theorem foo : True := by custom_tactic

  3. We open foo.lean in vscode and let Lean check it, or compile foo.lean.

  4. When we do (3), lean will elaborate by custom_tactic (note that 3 is at compile time, so we elaborate this at compile time). If the custom_tactic is buggy, for example containing infinite recursion, then it will cause stack overflow error.
  5. If there is stack overflow error, then foo.lean fails to compile.

Sebastian Ullrich (Mar 24 2023 at 17:24):

Unless precompileModules is misbehaving, you should get native backtraces for your tactic in that case

Sebastian Ullrich (Mar 24 2023 at 17:25):

On what platform are you? Do you have an example output from lake build?

Yicheng Qian (Mar 24 2023 at 17:26):

I'm running lean on WSL.

Yicheng Qian (Mar 24 2023 at 17:32):

Here is an example output:

lake build

Building Main
error: > LEAN_PATH=./build/lib:./lake-packages/std/build/lib LD_LIBRARY_PATH=/home/indprinciple/.elan/toolchains/leanprover--lean4---nightly-2023-02-03/lib:./lake-packages/std/build/lib:./build/lib /home/indprinciple/.elan/toolchains/leanprover--lean4---nightly-2023-02-03/bin/lean ./././Main.lean -R ././. -o ./build/lib/Main.olean -i ./build/lib/Main.ilean -c ./build/ir/Main.c --load-dynlib=.........(too long)........ --load-dynlib=./build/lib/libDuper-TPTP.so
error: stderr:
INTERNAL PANIC: out of memory
error: external command `/home/indprinciple/.elan/toolchains/leanprover--lean4---nightly-2023-02-03/bin/lean` exited with code 1

Sebastian Ullrich (Mar 24 2023 at 17:33):

Oh I see, we do not automatically print the backtrace on internal panics aside from stack overflows

Sebastian Ullrich (Mar 24 2023 at 17:34):

When you use gdb lean, make sure to copy the whole line Lake printed to make precompilation work

Sebastian Ullrich (Mar 24 2023 at 17:35):

Specifically, LD_LIBRARY_PATH and --load-dynlib

Yicheng Qian (Mar 24 2023 at 17:39):

Sebastian Ullrich said:

Oh I see, we do not automatically print the backtrace on internal panics aside from stack overflows

Yes. Maybe "stack overflow" is completely wrong because it's actually heap overflow :-). Another scenario that happened during my tests is that the OS just signals SIGKILL to the program and we get nothing on the command line.

Sebastian Ullrich (Mar 24 2023 at 17:42):

Oh, I haven't seen that before

Yicheng Qian (Mar 24 2023 at 17:44):

partial def list_SIGKILL (n : List Nat) : Nat :=
  list_SIGINT (n ++ n)

#eval list_SIGKILL [2]

def main : IO Unit := IO.println "Hello world"

Yicheng Qian (Mar 24 2023 at 17:45):

partial def array_PANIC (n : Array Nat) : Nat :=
  array_PANIC (n.append n)

#eval array_PANIC #[2]

def main : IO Unit := IO.println "Hello world"

Yicheng Qian (Mar 24 2023 at 17:47):

Probably because the array checks memory overflow when allocating space for new elements, while it's not acceptable to check memory overflow for each constructor operation because that's too expensive?

Yicheng Qian (Mar 24 2023 at 17:50):

Oh sorry for the confusion. We still get some messages printed on command line (although they're not helpful).

Sebastian Ullrich (Mar 24 2023 at 17:53):

Right, when the OOM killer kicks in, there is no sensible output except for in the syslog

Yicheng Qian (Mar 24 2023 at 18:00):

Sebastian Ullrich said:

When you use gdb lean, make sure to copy the whole line Lake printed to make precompilation work

I'll check this out tomorrow.

Yicheng Qian (Mar 25 2023 at 03:29):

Great! After adding --load-dynlib, the name of the functions are directly visible in the stack backtrace, even when using release build.

Yicheng Qian (Mar 25 2023 at 03:32):

Also I can now set breakpoints in the functions of our custom tactic, but that's probably not very useful.


Last updated: Dec 20 2023 at 11:08 UTC