Zulip Chat Archive

Stream: general

Topic: Looking for a mac/windows debugging expert


Gabriel Ebner (May 14 2020 at 12:45):

lean#234 is a potentially huge improvement to simp performance. (Constructing the simp set takes 25% of mathlib build time on Scott's benchmarks.)
However, building the core library segfaults on, well, everything except linux. I would really appreciate it if somebody with a mac/windows could make a debugging build, run lean with a debugger, and post a backtrace of the segfault. I have no idea where to look for problems.

Patrick Massot (May 14 2020 at 12:56):

What about dropping support for Windows and MacOS? It would simplify a lot of things.

Rob Lewis (May 14 2020 at 13:22):

I'll make an attempt on Windows. It's a good excuse to get these things set up on the Windows side of my new laptop.

Jalex Stark (May 14 2020 at 13:38):

Patrick Massot said:

What about dropping support for Windows and MacOS? It would simplify a lot of things.

If we had a web-based editor (which would have to interact with an external linux server) good enough to replace VSCode then one could drop support without dropping a large fraction of the userbase.
(This isn't so crazy --- lots of data scientists do their machine learning work in the cloud via jupyter / google colab)

Rob Lewis (May 14 2020 at 13:40):

Are the WIndows build directions still up to date? Following the directions in the repo exactly, I get

CMake Error at CMakeLists.txt:5 (project):
  The CMAKE_CXX_COMPILER:

    cl

  is not a full path and was not found in the PATH.

at the cmake stage. Adding -G Ninja seems to work (which I remember from a while back, it's been ages since I've built Lean)

Ryan Lahfa (May 14 2020 at 13:42):

Jalex Stark said:

Patrick Massot said:

What about dropping support for Windows and MacOS? It would simplify a lot of things.

If we had a web-based editor (which would have to interact with an external linux server) good enough to replace VSCode then one could drop support without dropping a large fraction of the userbase.
(This isn't so crazy --- lots of data scientists do their machine learning work in the cloud via jupyter / google colab)

It's not that perfect to use Jupyter Notebooks for everything, but I guess CoCalc is the direct we somewhat want

Jalex Stark (May 14 2020 at 13:44):

Yeah, I agree that CoCalc is a lot of the way there. I guess there are already students whose only Lean environment is a CoCalc one, but I would guess that's not true of any of the regulars on this server

Gabriel Ebner (May 14 2020 at 13:45):

@Rob Lewis This looks like it wants to compile lean using visual studio (≠ vscode). You should be running cmake and make inside the msys2 shell.

Rob Lewis (May 14 2020 at 13:47):

As in, not the "msys2 MinGW 64-bit" shell?

Gabriel Ebner (May 14 2020 at 13:48):

No, that's the right one. Do you have gcc installed? How are you calling cmake?

Rob Lewis (May 14 2020 at 13:49):

Straight from the directions at https://github.com/leanprover-community/lean/blob/master/doc/make/msys2.md, then the generic ones.

Rob Lewis (May 14 2020 at 13:50):

gcc version 8.3.0 (Rev2, Built by MSYS2 project)

Gabriel Ebner (May 14 2020 at 13:51):

Can you paste the output of cmake here please?

Rob Lewis (May 14 2020 at 13:54):

robyl@DESKTOP-NIOLIBD MINGW64 /d/lean/lean/build/release
$ cmake ../../src
-- Building for: NMake Makefiles
-- The CXX compiler identification is unknown
-- The C compiler identification is unknown
CMake Error at CMakeLists.txt:5 (project):
  The CMAKE_CXX_COMPILER:

    cl

  is not a full path and was not found in the PATH.

  To use the NMake generator with Visual C++, cmake must be run from a shell
  that can use the compiler cl from the command line.  This environment is
  unable to invoke the cl compiler.  To fix this problem, run cmake from the
  Visual Studio Command Prompt (vcvarsall.bat).

  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.


CMake Error at CMakeLists.txt:5 (project):
  The CMAKE_C_COMPILER:

    cl

  is not a full path and was not found in the PATH.

  To use the NMake generator with Visual C++, cmake must be run from a shell
  that can use the compiler cl from the command line.  This environment is
  unable to invoke the cl compiler.  To fix this problem, run cmake from the
  Visual Studio Command Prompt (vcvarsall.bat).

  Tell CMake where to find the compiler by setting either the environment
  variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
  the compiler, or to the compiler name if it is in the PATH.


-- Configuring incomplete, errors occurred!
See also "D:/lean/lean/build/release/CMakeFiles/CMakeOutput.log".
See also "D:/lean/lean/build/release/CMakeFiles/CMakeError.log".

Rob Lewis (May 14 2020 at 13:54):

The ninja build is still going, very slowly. I guess because it's a debug build?

Gabriel Ebner (May 14 2020 at 13:55):

cmake -G Makefile

Gabriel Ebner (May 14 2020 at 13:55):

Or ninja, that should work as well.

Gabriel Ebner (May 14 2020 at 13:55):

It should take about 15 minutes or so.

Patrick Massot (May 14 2020 at 13:56):

Jalex Stark said:

Patrick Massot said:

What about dropping support for Windows and MacOS? It would simplify a lot of things.

If we had a web-based editor (which would have to interact with an external linux server) good enough to replace VSCode then one could drop support without dropping a large fraction of the userbase.
(This isn't so crazy --- lots of data scientists do their machine learning work in the cloud via jupyter / google colab)

This message wasn't meant to be answered... Of course I don't seriously mean we should drop Windows support. Students will always be students, you need to let them grow before assuming they use Linux.

Rob Lewis (May 14 2020 at 13:56):

Aha. Add that at https://github.com/leanprover-community/lean/blob/master/doc/make/index.md ? :)

Gabriel Ebner (May 14 2020 at 13:56):

Apparently the default for cmake is to use the visual studio compiler and nmake. I guess the documentation doesn't mention that because we recommended to use ninja instead.

Rob Lewis (May 14 2020 at 13:56):

15 min seems about right, it's just about at the library build.

Rob Lewis (May 14 2020 at 13:58):

Okay, here's your first traceback.

FAILED: CMakeFiles/leanpkg
cmd.exe /C "cd /D D:\lean\lean\leanpkg && D:\lean\lean\src\..\bin\lean --make"
LEAN ASSERTION VIOLATION
File: D:/lean/lean/src/kernel/expr.h
Line: 62
Task: D:\lean\lean\library\init\logic.lean:
get_rc() > 0
(C)ontinue, (A)bort/exit, (S)top/trap, Invoke (G)DB
ninja: build stopped: subcommand failed.

Rob Lewis (May 14 2020 at 13:59):

I guess you want GDB output?

Gabriel Ebner (May 14 2020 at 13:59):

Yes.

Oliver Nash (May 14 2020 at 14:06):

Patrick Massot said:

What about dropping support for Windows and MacOS? It would simplify a lot of things.

I may be missing some key context here but would this mean that I could no longer run new versions of (the community fork of) Lean on MacOS? This would make me very sad.

Oliver Nash (May 14 2020 at 14:07):

Gabriel Ebner said:

lean#234 is a potentially huge improvement to simp performance. (Constructing the simp set takes 25% of mathlib build time on Scott's benchmarks.)
However, building the core library segfaults on, well, everything except linux. I would really appreciate it if somebody with a mac/windows could make a debugging build, run lean with a debugger, and post a backtrace of the segfault. I have no idea where to look for problems.

I might just be able to help with this but time is always very tight for me so I regret that I cannot commit. If at all possible, I will make an attempt at the weekend.

Scott Morrison (May 14 2020 at 14:08):

@Oliver Nash, it was not at all a serious suggestion. Patrick likes to make provocative statements. :-)

Oliver Nash (May 14 2020 at 14:08):

PHEW!

Oliver Nash (May 14 2020 at 14:08):

Thanks :-)

Scott Morrison (May 14 2020 at 14:18):

@Rob Lewis , @Gabriel Ebner , I get the same lean assertion violation on macos.

Scott Morrison (May 14 2020 at 14:18):

I'm not sure what command to run with gdb.

Gabriel Ebner (May 14 2020 at 14:19):

gdb --args ~/lean/bin/lean --make in the library directory. Then type r to start running. And then type bt after it crashed.

Scott Morrison (May 14 2020 at 14:22):

I'm assuming you mean gdb --args ../bin/lean --make in the library directory.

Gabriel Ebner (May 14 2020 at 14:26):

Yes.

Rob Lewis (May 14 2020 at 14:29):

I'm having issues with gdb on WIndows, this is where I'm at so far.

robyl@DESKTOP-NIOLIBD MINGW64 /d/lean/lean/library
$ gdb --args ../bin/lean --make
GNU gdb (GDB) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-msys".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Traceback (most recent call last):
  File "<string>", line 3, in <module>
ModuleNotFoundError: No module named 'libstdcxx'
/etc/gdbinit:6: Error in sourced command file:
Error while executing Python code.
Reading symbols from ../bin/lean...done.
(gdb) r
Starting program: /d/lean/lean/bin/lean --make
[New Thread 8416.0x2b80]
[New Thread 8416.0xa8c]
[New Thread 8416.0x5ec]
[New Thread 8416.0x32d4]
[Thread 8416.0x32d4 exited with code 0]
[Thread 8416.0xa8c exited with code 0]
[Thread 8416.0x5ec exited with code 0]
[Inferior 1 (process 8416) exited normally]
(gdb) bt
No stack.
(gdb)

Rob Lewis (May 14 2020 at 14:29):

Trying to figure out what's up with the error...

Gabriel Ebner (May 14 2020 at 14:30):

Ok, this time lean didn't crash. Can you remove all olean files and then try the same gdb invocation again please?

Scott Morrison (May 14 2020 at 14:31):

After working out how to codesign gdb, I'm back...

Scott Morrison (May 14 2020 at 14:32):

Now I get:

(gdb) r
Starting program: /Users/scott/projects/lean/lean/bin/lean --make
[New Thread 0x1703 of process 64332]
[New Thread 0x1a03 of process 64332]
warning: `/BuildRoot/Library/Caches/com.apple.xbs/Binaries/Libc_darwin/install/TempContent/Objects/Libc.build/libsystem_darwin.dylib.build/Objects-normal/x86_64/darwin_vers.o': can't open to read symbols: No such file or directory.
warning: `/BuildRoot/Library/Caches/com.apple.xbs/Binaries/Libc_darwin/install/TempContent/Objects/Libc.build/libsystem_darwin.dylib.build/Objects-normal/x86_64/dirstat.o': can't open to read symbols: No such file or directory.
warning: `/BuildRoot/Library/Caches/com.apple.xbs/Binaries/Libc_darwin/install/TempContent/Objects/Libc.build/libsystem_darwin.dylib.build/Objects-normal/x86_64/dirstat_collection.o': can't open to read symbols: No such file or directory.
warning: `/BuildRoot/Library/Caches/com.apple.xbs/Binaries/Libc_darwin/install/TempContent/Objects/Libc.build/libsystem_darwin.dylib.build/Objects-normal/x86_64/init.o': can't open to read symbols: No such file or directory.
warning: `/BuildRoot/Library/Caches/com.apple.xbs/Binaries/Libc_darwin/install/TempContent/Objects/Libc.build/libsystem_darwin.dylib.build/Objects-normal/x86_64/variant.o': can't open to read symbols: No such file or directory.
LEAN ASSERTION VIOLATION
File: /Users/scott/projects/lean/lean/src/kernel/expr.h
Line: 62
Task: /Users/scott/projects/lean/lean/library/init/logic.lean:
get_rc() > 0
(C)ontinue, (A)bort/exit, (S)top/trap

Scott Morrison (May 14 2020 at 14:32):

I assume I should type S to stop, before I run bt?

Scott Morrison (May 14 2020 at 14:32):

Doing that reports:

(gdb) bt
#0  0x00007fff6588fb66 in ?? () from /usr/lib/system/libsystem_kernel.dylib
#1  0x00007fff65a5a080 in pthread_kill () from /usr/lib/system/libsystem_pthread.dylib
#2  0x00007fff657eb1ae in abort () from /usr/lib/system/libsystem_c.dylib
#3  0x00007fff9da5c520 in ?? ()
#4  0x00007fffffffffdf in ?? ()
#5  0xffffffffefbf6120 in ?? ()
#6  0x0000000101d81aa8 in ?? ()
#7  0x00007ffeefbf6120 in ?? ()
#8  0x000000010008d499 in lean::debuggable_exit () at /Users/scott/projects/lean/lean/src/util/debug.cpp:80
Backtrace stopped: frame did not save the PC

Gabriel Ebner (May 14 2020 at 14:33):

This looks like a release build. Could you try this again with a debug build (-DCMAKE_BUILD_TYPE=Debug)?

Scott Morrison (May 14 2020 at 14:34):

oops!

Scott Morrison (May 14 2020 at 14:35):

Hmm, no, here's what I ran:

mkdir -p build/debug
cd build/debug
cmake -DCMAKE_BUILD_TYPE=DEBUG -G Ninja ../../src
ninja

Gabriel Ebner (May 14 2020 at 14:37):

Hmm, this should actually work.

Rob Lewis (May 14 2020 at 14:38):

Let's see what happens here. I took a tea break in the hopes it would be done when I got back, but nope!

Scott Morrison (May 14 2020 at 14:44):

I ran git clean -xfd and rebuilt, with exactly the same output...

Gabriel Ebner (May 14 2020 at 14:48):

I'm not really sure what is going on on macos. I don't think you're doing anything wrong per se.

Rob Lewis (May 14 2020 at 14:50):

Is there an expected time until error? Been going ~20 min so far.

Gabriel Ebner (May 14 2020 at 14:52):

The error should appear almost immediately when compiling the standard library. The assertion violation happens in init/logic.lean.

Rob Lewis (May 14 2020 at 14:54):

Then I guess I'm killing this gdb run...

Rob Lewis (May 14 2020 at 14:54):

It seems to have built an olean for every file in core.

Sebastian Ullrich (May 14 2020 at 15:00):

@Scott Morrison You might have better luck with lldb. Or not.

Scott Morrison (May 14 2020 at 15:02):

I better sleep. I'll try tomorrow.

Gabriel Ebner (May 14 2020 at 15:03):

@Rob Lewis If you're still here, can you try to finish compiling library. And then in a separate call, compile the leanpkg directory?

Rob Lewis (May 14 2020 at 15:03):

In or outside of gdb?

Rob Lewis (May 14 2020 at 15:04):

I killed the last session and cleared the oleans. Trying again to see how long it took. Almost have them all back.

Gabriel Ebner (May 14 2020 at 15:04):

Inside gdb, just in case something interesting happens.

Rob Lewis (May 14 2020 at 15:09):

Okay, so I've had no errors in gdb, a olean file exists for every lean in core. Still two lean threads running at full throttle. gdb output isn't changing.

Rob Lewis (May 14 2020 at 15:09):

Kill gdb and start again?

Rob Lewis (May 14 2020 at 15:09):

(Keeping the current oleans?)

Gabriel Ebner (May 14 2020 at 15:12):

Let's first wait until it's finished.

Reid Barton (May 14 2020 at 15:12):

Is it possibly nondeterministic and, if so, maybe it can also fail on linux?

Rob Lewis (May 14 2020 at 15:21):

Well, I'm down to one thread at full throttle now.

Rob Lewis (May 14 2020 at 15:55):

Okay, it built the library and leanpkg with no errors.

Gabriel Ebner (May 14 2020 at 15:55):

That's unfortunate.

Nam (May 14 2020 at 16:00):

looks like 3.11.0 is built okay on Mac. a few more builds will get to the breaking change.

Gabriel Ebner (May 14 2020 at 16:00):

Also 3.12 is fine. The crash is due to the PR: lean#234

Nam (May 14 2020 at 16:02):

yeah, that PR is based on v3.11.0

Sebastian Ullrich (May 14 2020 at 16:03):

Generally speaking, setting up an ASan+UBSan build has saved me many times developing Lean 4 and debugging its runtime. I can't promise it will help in this case, of course.

Nam (May 14 2020 at 16:04):

the stack trace points to some ref counting issue

Nam (May 14 2020 at 16:05):

so maybe msan would help

Sebastian Ullrich (May 14 2020 at 16:10):

use-after-frees are detected by asan, not msan

Nam (May 14 2020 at 16:10):

yeah, you're right.

Gabriel Ebner (May 14 2020 at 16:11):

Ok, building with -fsanitize=address -fsanitize=undefined now.

Sebastian Ullrich (May 14 2020 at 16:13):

(I hope that concatenates instead of overrides the sanitizers. Lean 4 CI is using -fsanitize=address,undefined)

Nam (May 14 2020 at 16:13):

after bisecting, i think the very first commit in that PR is the cause https://github.com/leanprover-community/lean/pull/234/commits/26ae0a7fec042d80820b14b54eaa8b199888d3ba

Gabriel Ebner (May 14 2020 at 16:15):

I've already run it with -fsanitize=address and it just found pages worth of leaks, but no real problems. It's ok if only UB is sanitized now.

Gabriel Ebner (May 14 2020 at 16:26):

Oh, wow. This begins well:

/home/gebner/lean/src/frontends/lean/cmd_table.h:31:8: runtime error: load of value 30, which is not a valid value for type 'bool'
/home/gebner/lean/src/kernel/inductive/inductive.cpp:99:12: runtime error: load of value 236, which is not a valid value for type 'bool'
/home/gebner/lean/src/kernel/inductive/inductive.cpp:99:12: runtime error: load of value 15, which is not a valid value for type 'bool'
/home/gebner/lean/src/library/user_recursors.h:11:7: runtime error: load of value 88, which is not a valid value for type 'bool'
/home/gebner/lean/src/library/user_recursors.h:11:7: runtime error: load of value 18, which is not a valid value for type 'bool'
/home/gebner/lean/src/library/projection.h:22:8: runtime error: load of value 179, which is not a valid value for type 'bool'

Nam (May 14 2020 at 16:53):

https://pastebin.com/DB9eLMpG

Gabriel Ebner (May 14 2020 at 17:37):

@Nam Thanks to your backtrace I was able to fix the issue.

Nam (May 14 2020 at 17:44):

you're welcome.

Gabriel Ebner (May 14 2020 at 17:47):

This is what I believe happened. (But all I know about the error is a backtrace, so this might be wrong.) The segfault happened in the deserialization code, which looks like this:

simp_lemma sl;
deser >> sl;

Inside the >>, sl is assigned. At this point the destructor for sl is called, and then the segfault happens according to the backtrace. Note that we call the default constructor for simp_lemma here:

simp_lemma::simp_lemma():simp_lemma(g_dummy) {}

And this dummy was an "exotic" simp lemma. Usually the m_kind field indicates which subclass of simp_lemma_cell we have, e.g. Simp means regular_simp_lemma_cell. But the dummy lemma was of type simp_lemma_cell even though it had kind Simp.
Unfortunately that's all I've got. I can't explain how this leads to a double-free of an expression.


Last updated: Dec 20 2023 at 11:08 UTC