Zulip Chat Archive

Stream: lean4

Topic: Crashing bug, possibly related to UInt64 arithmetic


Geoffrey Irving (Aug 17 2024 at 21:15):

The following (partially minimized) code crashes on the last line with Lean v4.11.0-rc2. Happy to file an issue if that's the next step, but I thought I'd ask here first.

import Mathlib.Data.UInt

open scoped UInt64.CommRing

structure Int64 where
  n : UInt64

structure Fixed (s : Int64) where
  n : Int64

variable {s : Int64}

instance : Zero Int64 where zero := 0
instance : NatCast Int64 where natCast n := (n : UInt64)
instance : Repr (Fixed s) where reprPrec _ _ := repr 7
instance : Zero (Fixed s) where zero := 0
def Fixed.ofNat (_ : ) (_ : Bool) : Fixed s := 7
def Fixed.repoint (x : Fixed s) (t : Int64) : Fixed t := ⟨⟨x.n.n <<< (7 : UInt64)⟩⟩

-- Crash here
#eval (.ofNat 7 false : Fixed 0).repoint (⟨-60)

Geoffrey Irving (Aug 17 2024 at 21:28):

Not related to UInt64.CommRing after all. Here's more minimization:

import Mathlib.Data.UInt

structure Fixed where
  n : UInt64

def Fixed.repoint (x : Fixed) (_ : Fixed) : Fixed :=
  x.n <<< (7 : UInt64)

-- Crash here
#eval (Fixed.repoint 7 ⟨-60).n

Geoffrey Irving (Aug 17 2024 at 21:31):

Minimized further:

import Mathlib.Data.UInt

def repoint (_ : UInt64) : UInt64 :=
  (7 : UInt64) <<< (7 : UInt64)

-- Crash here
#eval repoint (-60)

Henrik Böving (Aug 17 2024 at 21:36):

Your last mwe does not crash on live.lean-lang.org

Henrik Böving (Aug 17 2024 at 21:36):

Neither does the first

Geoffrey Irving (Aug 17 2024 at 21:37):

I'm on leanprover/lean4:v4.11.0-rc2. Does that match?

Henrik Böving (Aug 17 2024 at 21:38):

Lean.versionString gives the same on live yes

Henrik Böving (Aug 17 2024 at 21:40):

can you locally mwe it further to

def repoint (_ : UInt64) : UInt64 :=
  (7 : UInt64) <<< (7 : UInt64)

-- Crash here
#eval repoint (0-60)

?

Henrik Böving (Aug 17 2024 at 21:41):

just to get mathlib out of the picture entirely

Geoffrey Irving (Aug 17 2024 at 21:41):

No, I'm using UInt64 negation and shifting. I updated mathlib again and it didn't make a difference.

Geoffrey Irving (Aug 17 2024 at 21:41):

I suppose I need to pull out the definitions of those two operators next.

Henrik Böving (Aug 17 2024 at 21:41):

shifting is a built in

Henrik Böving (Aug 17 2024 at 21:41):

negation should be equivalent to 0 - 60?

Geoffrey Irving (Aug 17 2024 at 21:42):

Changing it to 0 - 60 fixes it.

Henrik Böving (Aug 17 2024 at 21:42):

Aha, so mathlib's neg instance is up to something

Henrik Böving (Aug 17 2024 at 21:43):

      instance : Neg $typeName where
        neg a := mk (-a.val)

hmmm

Mario Carneiro (Aug 17 2024 at 21:43):

that should be fine assuming the negation on Fin n is correct

Geoffrey Irving (Aug 17 2024 at 21:44):

I inlined the two instances but then it doesn't crash:

instance neg (n : Nat) : Neg (Fin n) :=
  fun a => (n - a) % n, Nat.mod_lt _ a.pos⟩⟩

instance : Neg UInt64 where
  neg a := UInt64.mk (-a.val)

def repoint (_ : UInt64) : UInt64 :=
  (7 : UInt64) <<< (7 : UInt64)

-- Doesn't crash
#eval repoint (-60)

Geoffrey Irving (Aug 17 2024 at 21:45):

Inlining just the Neg UInt64 instance and importing Data.Fin.Basic also fixes the crash.

Henrik Böving (Aug 17 2024 at 21:46):

That sure is interesting

Henrik Böving (Aug 17 2024 at 21:46):

So something that mathlib is doing is making this misbehave

Mario Carneiro (Aug 17 2024 at 21:46):

I'm not replicating this crash

Mario Carneiro (Aug 17 2024 at 21:47):

I am suspecting some runtime issue

Henrik Böving (Aug 17 2024 at 21:47):

@Geoffrey Irving what platform are you on?

Geoffrey Irving (Aug 17 2024 at 21:47):

Mac, and I'm getting the crash in both VSCode and on the commandline.

Henrik Böving (Aug 17 2024 at 21:47):

intel or ARM mac?

Mario Carneiro (Aug 17 2024 at 21:47):

could be GMP related

Mario Carneiro (Aug 17 2024 at 21:48):

Does it replicate with UInt8 instead of UInt64?

Geoffrey Irving (Aug 17 2024 at 21:48):

I thought we didn't depend on GMP?

Mario Carneiro (Aug 17 2024 at 21:48):

mac doesn't depend on GMP

Geoffrey Irving (Aug 17 2024 at 21:49):

UInt8, 16, 32 are fine. Only 64 is bad.

Henrik Böving (Aug 17 2024 at 21:49):

It sure is confusing that it works with the instances inlined but not when the code is in mathlib

Geoffrey Irving (Aug 17 2024 at 21:50):

And that it doesn't replicate for others!

Henrik Böving (Aug 17 2024 at 21:50):

how would that change the behavior of GMP if GMP is at fault

Henrik Böving (Aug 17 2024 at 21:50):

well i assume mario is not on mac, neither am I

Mario Carneiro (Aug 17 2024 at 21:50):

Try deleting the cache files for mathlib and rebuild Mathlib.Data.UInt locally

Mario Carneiro (Aug 17 2024 at 21:51):

The cache files are not completely architecture independent, because they are stored in GMP format only when GMP is enabled

Mario Carneiro (Aug 17 2024 at 21:52):

leangz has to make the appropriate conversion based on the OS, but if lean has recently changed the settings on when GMP is enabled there could be an issue here

Mario Carneiro (Aug 17 2024 at 21:53):

because for some reason this isn't logged in the file at all so leangz has to make an educated guess

Geoffrey Irving (Aug 17 2024 at 21:54):

Building without a cache fixes it.

Geoffrey Irving (Aug 17 2024 at 21:54):

Yikes!

Mario Carneiro (Aug 17 2024 at 21:57):

I double checked the release settings, seems GMP is still enabled on linux x64

Mario Carneiro (Aug 17 2024 at 21:58):

it's really annoying that there is no easy #eval Lean.useGMP declaration I can look at

Mario Carneiro (Aug 17 2024 at 21:58):

it just silently changes the entire ABI of the application with no other visible differences

Mario Carneiro (Aug 17 2024 at 21:59):

But in that case I'm still confused, has something changed in the non-GMP ABI recently?

Geoffrey Irving (Aug 17 2024 at 22:02):

ARM Mac, by the way: M2 MacBook Air.

Mario Carneiro (Aug 17 2024 at 22:14):

@Geoffrey Irving I have a test for you: Can you check out branch#mpz-test (downloading the cache), create a new file, import Mathlib.MpzTest and see what #print foo gives?

Mario Carneiro (Aug 17 2024 at 22:15):

the correct result is

import Mathlib.MpzTest

#print foo
-- def foo : Nat :=
-- 68915718021581205938132336367

Mario Carneiro (Aug 17 2024 at 22:16):

but if something is broken you might end up seeing

def foo : Nat :=
******************************

instead

Geoffrey Irving (Aug 17 2024 at 22:16):

Will try tomorrow; off to sleep now.

Mario Carneiro (Aug 17 2024 at 22:16):

anyone else on an ARM mac is welcome to try the test

Kim Morrison (Aug 17 2024 at 23:57):

On ARM mac, I see

def foo : Nat :=
0

:-(

Mario Carneiro (Aug 18 2024 at 00:11):

@Kim Morrison Can you open .lake/build/lib/Mathlib/MpzTest.olean with the vscode hex editor and search for EFBE?

Mario Carneiro (Aug 18 2024 at 00:12):

or actually just send me the olean file

Geoffrey Irving (Aug 18 2024 at 19:35):

I get the same output as @Kim Morrison, and here's MpzTest.olean:

MpzTest.olean

Mario Carneiro (Aug 18 2024 at 21:04):

@Geoffrey Irving Could you also delete MpzTest.olean, run lake build Mathlib.MpzTest to regenerate it, then send me that olean file as well?

Geoffrey Irving (Aug 18 2024 at 21:06):

Done, but it still says foo is 0. Is rm .lake/build/lib/Mathlib/MpzTest.* the wrong command to have run before building?

MpzTest-rebuilt.olean

Mario Carneiro (Aug 18 2024 at 21:07):

oh that's weird, you could try just nuking .lake altogether and then lake build Mathlib.MpzTest, since it has no dependencies

Geoffrey Irving (Aug 18 2024 at 21:09):

Okay, now foo is 68915718021581205938132336367.

MpzTest-fresh.olean

Mario Carneiro (Aug 18 2024 at 21:11):

...this is a GMP enabled olean file, it's identical to my linux x86 olean

Eric Wieser (Aug 18 2024 at 21:11):

Mario Carneiro said:

it's really annoying that there is no easy #eval Lean.useGMP declaration I can look at

Presumably you also want this flag in the olean file itself, and for them to be checked for consistency at olean-load time?

Mario Carneiro (Aug 18 2024 at 21:12):

indeed

Geoffrey Irving (Aug 18 2024 at 21:13):

Just need some way of detecting whether GMP is used by doing certain very complicated calculations.

https://en.wikipedia.org/wiki/Dark_Integers

Mario Carneiro (Aug 18 2024 at 21:15):

it's actually pretty easy to tell by looking at an olean file whether it's using GMP or non-GMP bignums, they look quite different in the binary. The trouble is that leangz has to make a guess whether the lean it's supposed to be generating an olean file for uses GMP or not, the input file doesn't know anything about that since it's cross platform

Mario Carneiro (Aug 18 2024 at 21:16):

@Geoffrey Irving just to confirm, what's your lean-toolchain?

Geoffrey Irving (Aug 18 2024 at 21:16):

leanprover/lean4:v4.11.0-rc2

Mario Carneiro (Aug 18 2024 at 21:24):

indeed, the workflow file no longer says USE_GMP=OFF on macos (both x86 and arm)

Mario Carneiro (Aug 18 2024 at 21:29):

changed in https://github.com/leanprover/lean4/commit/d1a96f6d8f867100428a243521f2047fb9d9217e by @Sebastian Ullrich

Geoffrey Irving (Aug 18 2024 at 21:31):

That looks like it was most likely a typo. Are there any CI tests that would catch this?

Mario Carneiro (Aug 18 2024 at 21:32):

if mathlib CI had some kind of MPZ test in it then possibly lean's nightly-testing or PR testing branches of mathlib would have caught the issue

Mario Carneiro (Aug 18 2024 at 21:33):

currently, MPZ representation is the only architecture dependent behavior leangz has to deal with during decompression

Mario Carneiro (Aug 18 2024 at 21:35):

(and it's not even architecture dependent, it's cmake option dependent, which means this breaks whenever someone thinks that it's fine to change the default cmake options on release builds for an arch)

Mario Carneiro (Aug 18 2024 at 21:35):

BTW this also impacts FFI code trying to interact with lean

Geoffrey Irving (Aug 18 2024 at 21:36):

Certainly the olean file format should be platform independent, in the sense that any platform can read any platform's generated oleans.

Mario Carneiro (Aug 18 2024 at 21:36):

this is definitely not the case

Geoffrey Irving (Aug 18 2024 at 21:36):

But not for any fundamental reason, right?

Mario Carneiro (Aug 18 2024 at 21:37):

yes for a fundamental reason, it's an in memory data structure serialized directly to disk

Mario Carneiro (Aug 18 2024 at 21:37):

the only reason we don't have worse issues is because we don't care to support big endian architectures

Geoffrey Irving (Aug 18 2024 at 21:37):

Ah, and also mmapped back, which blocks flags being the solution.

Mario Carneiro (Aug 18 2024 at 21:37):

oleans would look completely different there

Mario Carneiro (Aug 18 2024 at 21:38):

oleans aren't deserialized, they are mmapped, there is no opportunity to make modifications to the bits unless you do platform independent data structure access at runtime

Mario Carneiro (Aug 18 2024 at 21:38):

that's why leangz has to deal with the issue, because it actually does have a deserialization step

Eric Wieser (Aug 18 2024 at 21:38):

oleans weren't really designed as something intended to be shared between devices, right? That's just the way mathlib has always used them?

Mario Carneiro (Aug 18 2024 at 21:39):

yes, they really should have considered this issue from the start but they didn't and now we are stuck with this design

Geoffrey Irving (Aug 18 2024 at 21:41):

It still seems like we can benefit from oleans being platform independent, in the sense of adding the flags, even if we only use the flags to yell.

Geoffrey Irving (Aug 18 2024 at 21:41):

The flags can then in theory be used to do the translation wherever.

Geoffrey Irving (Aug 18 2024 at 21:49):

It's delightful that this commit is the one that lifted Apple Silicon Macs from Tier 2 to Tier 1. :)

Mario Carneiro (Aug 19 2024 at 02:04):

@Sebastian Ullrich After talking about this issue with some others, I think the best fix is to bump the olean version from 1 to 2, since this is a data format change which should really be recorded in the olean file itself. Unfortunately we'll have some casualties for all in between versions, which is unfortunate since it's everything from v4.9.0-rc1 to v4.11.0-rc2.

Mario Carneiro (Aug 19 2024 at 03:10):

leangz patch prepared at https://github.com/digama0/leangz/compare/olean2 , but this version will need to be updated every lean release until the olean version is bumped (lean4#5089)

Mario Carneiro (Aug 19 2024 at 03:12):

the best fix would be if lean could just record the status of flags like USE_GMP in a json file or similar somewhere in the distribution, so that external tools can just read that instead of making guesses

Sebastian Ullrich (Aug 19 2024 at 14:57):

I believe we'll want to go back to using GMP on every platform, former restrictions are not applicable anymore. And then bump the olean version again.

Geoffrey Irving (Aug 19 2024 at 15:50):

What were the restrictions before?

Eric Wieser (Aug 31 2024 at 09:23):

Sebastian Ullrich said:

I believe we'll want to go back to using GMP on every platform, former restrictions are not applicable anymore. And then bump the olean version again.

Even if this is the case, if the build-from-source still permits non GMP builds, then it would be useful to encode this flag in the oleans

Geoffrey Irving (Oct 02 2024 at 17:22):

Has this bug been fixed yet? leanprover/lean4:v4.12.0 seems to still be broken.

Mario Carneiro (Oct 02 2024 at 19:02):

No

Mario Carneiro (Oct 02 2024 at 19:03):

but watch lean4#5144 for movement

Sebastian Ullrich (Oct 02 2024 at 19:13):

Oh it's actually green? incredible

Mario Carneiro (Nov 04 2024 at 12:48):

@Geoffrey Irving The fix is now on mathlib master, so LMK if things look alright now.

Geoffrey Irving (Nov 08 2024 at 18:24):

Confirmed that it works fine now, thank you! :heart:


Last updated: May 02 2025 at 03:31 UTC