Zulip Chat Archive

Stream: general

Topic: Lean compile fails with GCC 9.1


Yury G. Kudryashov (Jul 13 2019 at 18:51):

Hi,
I failed to build lean 3 from source on Fedora 30. It seems that the reason is GCC 9.1.
More precisely, the code compiles with many warnings, then crashes every time.

Wojciech Nawrocki (Jul 13 2019 at 19:05):

That's strange, it works for me with GCC 9.1.0 on Arch Linux. Is it the community version (https://github.com/leanprover-community/lean)?

Yury G. Kudryashov (Jul 13 2019 at 19:28):

I tried both 3.4.2 and community version. I'll try git clean, then recompile.

Yury G. Kudryashov (Jul 13 2019 at 19:31):

@Wojciech Nawrocki Do you get a lot of warnings?

Wojciech Nawrocki (Jul 13 2019 at 19:33):

I get a lot of -Wdeprecated-copy, but the built binary works fine. Possibly Fedora has a different set of libraries that break something.

Yury G. Kudryashov (Jul 13 2019 at 19:35):

I also get -Wpessimizing-move

Yury G. Kudryashov (Jul 13 2019 at 19:35):

(still compiles)

Wojciech Nawrocki (Jul 13 2019 at 19:36):

Yeah, I also get a few of -Wpessimizing-move.

Yury G. Kudryashov (Jul 13 2019 at 20:03):

Here is the error:

/home/urkud/projects/lean3/library/init/meta/tactic.lean: parsing at line 1399terminate called after throwing an instance of 'lean::exception'
  what():  vm check failed: is_composite(o) (possibly due to incorrect axioms, or sorry)

Wojciech Nawrocki (Jul 13 2019 at 20:05):

I've seen that one before. I'm not sure exactly why it happens. For me removing the Lean folder, downloading from git again and rebuilding fixed it. Sorry I don't have a cleaner possible solution!

Wojciech Nawrocki (Jul 13 2019 at 20:07):

(It seemed to have something to do with the built .olean files under lean/library/ being corrupted, but don't quote me on it).

Yury G. Kudryashov (Jul 13 2019 at 20:22):

git clean -fdx removed all .olean files

Yury G. Kudryashov (Jul 13 2019 at 20:24):

Actually, I fixed this somehow a few months ago (or just got lucky after one more recompile), and use this lean binary since then. Probably I'll try to fix (most of) the warning and see if it fixes compilation.

Yury G. Kudryashov (Jul 13 2019 at 20:25):

BTW, writing outside of the build directory is supposed to be a bad idea, and lean cmake script does it (instead of, e.g., copying lean/library to the build directory).

Yury G. Kudryashov (Jul 13 2019 at 20:26):

This is a bad idea because different build directories are no longer independent. E.g., I can't try different compilers/flags in different build directories and be sure that they do not interfere with each other.

Wojciech Nawrocki (Aug 28 2019 at 10:58):

I've made an issue for this in https://github.com/leanprover-community/lean/issues/62 .

Simon Hudon (Aug 28 2019 at 11:58):

Can you guys setup a travis build that will reproduce the problem? Is it specific to Linux or will it crash also on Mac and Windows?

Wojciech Nawrocki (Aug 28 2019 at 13:36):

@Simon Hudon It already fails in the AppVeyor MinGW build, for example https://ci.appveyor.com/project/cipher1024/lean-e5aoi/builds/27002887/job/pkp6cl1q3t4wf5xk

Simon Hudon (Aug 28 2019 at 13:40):

Ok but there are changes in that branch. Without the changes, I think it's still working

Wojciech Nawrocki (Aug 28 2019 at 13:45):

There are changes, but your own PR from 24 days ago fails with the same error despite changing completely different things. In the bug report I also mention that I get this error on a fresh master. This makes me semi-confident that new GCC (which MinGW also uses) started doing something Lean doesn't like. Either it's UB in Lean, or an incorrect optimization in GCC. I will also try clang++ on my laptop and see if it works

Simon Hudon (Aug 28 2019 at 13:47):

what's UB?

Bryan Gin-ge Chen (Aug 28 2019 at 13:48):

undefined behavior

Simon Hudon (Aug 28 2019 at 13:50):

Ok, I'll see what happens when I revert my changes and then we can start looking at removing the warnings

Wojciech Nawrocki (Aug 28 2019 at 13:51):

Note this is probably a bug in Lean proper, and not the community fork, because I also reliably see it when building github.com/leanprover/lean . (See https://ci.appveyor.com/project/leodemoura/lean/build/job/jmxe5g5ywklbaofr )

Wojciech Nawrocki (Aug 28 2019 at 14:00):

lean --make just segfaults for me on library/ when built by Clang 8.0.1. I'll stick to GCC for now

Bryan Gin-ge Chen (Aug 28 2019 at 14:04):

The appveyor build history (for leanprover/lean) shows that the nightly builds started failing Aug 1. It looks like that's when GCC was changed from 8.2.0 to 9.1.0 . And that's presumably due to the update to MSYS2 here.

Mario Carneiro (Aug 28 2019 at 17:01):

If there is UB in lean, I wonder if we can tame the nasal demons to prove false in paranoid mode

Sebastian Ullrich (Aug 28 2019 at 17:04):

I don't know about GCC, but LLVM 8 is known to miscompile Lean due to https://bugs.llvm.org/show_bug.cgi?id=40044

Wojciech Nawrocki (Aug 28 2019 at 18:29):

This was a little ridiculous to debug since minute changes like adding print statements make the code compile correctly, but the bug is this:
The vm_obj_cell::dealloc code has a switch statement which calls the right function to dealloc a vm_obj_cell depending on its kind. Two of these kinds are together called "composites": inline bool is_composite(vm_obj_cell const * o) { return is_constructor(o) || is_closure(o); } and are both deallocated with to_composite(it)->dealloc(todo). The code looks correct to me, but it gets compiled to:

      case 1: // vm_obj_kind::Constructor
        if ( (unsigned __int8)o & 1 )
          lean::vm_check_failed("to_composite: is_composite(o)");
        lean::vm_composite::dealloc(o, (lean::buffer<lean::vm_obj_cell*,16> *)&todo);
        break;
      case 2: // vm_obj_kind::Closure
        lean::vm_check_failed("to_composite: is_composite(o)");
        return result;

where the Closure branch incorrectly goes straight to failure without checking the LEAN_VM_IS_PTR tag on the vm_obj_cell* o pointer. I tried adding alignas(2) to class vm_obj_cell but no dice. At this point I'm not sure if Lean is abusing the pointer-int conversions (which are sometimes UB) or if some optimization in GCC incorrectly figures that o is always odd.

Simon Hudon (Aug 28 2019 at 18:37):

Although it crashes there, the problem might appear earlier. I suggest you try addressing the warnings first and see if anything gets fixed. Because of the new features of C++11 and so on, I wouldn't be surprised if misusing them in subtle ways would result in code whose behavior varies between compilers

Wojciech Nawrocki (Aug 28 2019 at 18:39):

Yeah, I agree. I've started looking through the warnings.

Wojciech Nawrocki (Aug 28 2019 at 20:43):

Sadly nothing useful comes out of -Wall -Wextra -Wpedantic, I'll try making an MWE.

Simon Hudon (Aug 28 2019 at 20:44):

It would be nice to know which test or which file in library/ makes Lean fail and which ones do not

Wojciech Nawrocki (Aug 28 2019 at 20:50):

It fails on init/meta/hole_command.lean, but I think that's probably just the first place in the library that exercises deallocation on closures.

Simon Hudon (Aug 28 2019 at 20:54):

You might want to create a custom .lean file and add one ingredient at a time from hole_command.lean to see what does it

Mario Carneiro (Aug 28 2019 at 22:36):

I don't think the issue is with that line of code exactly, and AFAICT it's not UB, but the general type-punning scheme implemented with LEAN_VM_IS_PTR is implementation-defined. According to the standard, casting uintptr_t -> T* -> uintptr_t is not guaranteed to be the identity, and in particular GCC is allowed to assume that every element of the type T* is aligned, which may have something to do with this optimization

Wojciech Nawrocki (Aug 28 2019 at 22:58):

I just posted an update on the GitHub issue, but the gist of is that the following code, derived directly from bits of vm.cpp doesn't do what it should when compiled by GCC 9.1.0 with at least -O2:

#include <cstddef>
#include <cstdint>

enum class foo_kind { A, B, C };

struct alignas(4) foo {
    foo_kind m_kind;
};

foo_kind kind_(foo * o) {
    if ((reinterpret_cast<std::uintptr_t>(o) & 1) == 0)
        return o->m_kind;
    else
        return foo_kind::A;
}

bool is_bc(foo * o) { return kind_(o) == foo_kind::B || kind_(o) == foo_kind::C; }

void loop(foo * o);

__attribute__((always_inline)) inline void loop_if_bc(foo * o) {
    if (__builtin_expect(!is_bc(o), 0))
        throw "what"; // Should never be thrown if loop_if_bc is called only with B/C kind foos.
                      // But with GCC, it's *always* thrown.
    loop(o);
}

void loop(foo * o) {
    switch (o->m_kind) {
    case foo_kind::A: return;
    case foo_kind::B: loop_if_bc(o); break;
    case foo_kind::C: loop_if_bc(o); break;
    }
}

int main(int, char**, char**) {
    foo obj { foo_kind::C };
    loop(&obj);
    return 0;
}

Wojciech Nawrocki (Aug 28 2019 at 23:02):

Note that for the LSB check to fail, GCC would have to assume that all foos are not aligned, and have odd addresses.

Mario Carneiro (Aug 28 2019 at 23:39):

According to the compiler explorer, this bug exists in 9.1 and 9.2 and is fixed in GCC trunk

Keeley Hoek (Aug 30 2019 at 08:03):

Wow, lean found a gcc bug. It's growing up so fast. :P

Patrick Massot (Aug 30 2019 at 08:06):

Wait until Lean 6 autogenerate a certified GCC

Patrick Massot (Aug 30 2019 at 08:07):

I guess we can't really say it found a bug if it's already fixed in trunk

Sebastian Ullrich (Aug 30 2019 at 08:15):

@Keeley Hoek https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85764

Keeley Hoek (Aug 30 2019 at 08:39):

@Sebastian Ullrich I wish I had a gcc issue which had status RESOLVED FIXED :)
Isn't that the pinnacle of CS achievement?

Patrick Massot (Aug 30 2019 at 08:40):

It may be so only until it turns out Lean 4 is faster than gcc.

Sebastian Ullrich (Aug 30 2019 at 08:42):

Well, we're using clang by default, which is faster than GCC in quite a few benchmarks...

Sebastian Ullrich (Aug 30 2019 at 08:43):

(I hope you're not talking about the compile times because let's not talk about those...)

Patrick Massot (Aug 30 2019 at 08:43):

You mean Lean code gets compiled to C++ code which is then compiled by clang which is faster than GCC?

Patrick Massot (Aug 30 2019 at 08:44):

Who cares about compile time?

Keeley Hoek (Aug 30 2019 at 08:44):

I think to C, actually

Keeley Hoek (Aug 30 2019 at 08:44):

Incidentally, why the switch Sebastian?

Patrick Massot (Aug 30 2019 at 08:44):

Users don't care (unless you include interactive proof checking speed in compile time)

Sebastian Ullrich (Aug 30 2019 at 08:55):

Incidentally, why the switch Sebastian?

Because it's faster

Keeley Hoek (Aug 30 2019 at 08:58):

To compile or run?

Keeley Hoek (Aug 30 2019 at 08:58):

Which C++ features were you using

Sebastian Ullrich (Aug 30 2019 at 09:05):

Oh, you mean the switch from C++ to C. It's faster to compile, yes. We were only using C++ because the runtime header was written in it, but it's now all C.

Keeley Hoek (Aug 30 2019 at 09:08):

Cool

Wojciech Nawrocki (Aug 30 2019 at 10:48):

@Patrick Massot re it being fixed, i asked the gcc devs about it, and the fix seems to be more or less accidental. So probably a real bug :)


Last updated: Dec 20 2023 at 11:08 UTC