Zulip Chat Archive

Stream: new members

Topic: What is going on?

Kenny Lau (Apr 19 2020 at 18:10):

theorem what_is_going_on (H : ∃ N : ℕ, N = N) : ∃ N : ℕ, N = N :=
let ⟨Ν, ΗΝ⟩ := H in ⟨N, HN⟩

have fun :)

Jannis Limperg (Apr 19 2020 at 18:38):

Unicode shenanigans?

Kevin Buzzard (Apr 19 2020 at 19:53):

The red boxes around some of the characters give it away

Shing Tak Lam (Apr 20 2020 at 02:00):

theorem what_is_going_on (H : ∃ N : ℕ, N = N) : ∃ N : ℕ, N = N :=
let ⟨ο, hο⟩ := H in ⟨o, ho⟩

No red boxes now?

A meta point, anyone aware of what is allowed as part of an identifier name? Is it UAX 31 or Latin+Greek?

Just to find the limits, I tried a few things from UTR 36, mixed script is accepted in Lean. At least Zero Width Characters aren't accepted by Lean(ahem, C++...)

Mario Carneiro (Apr 20 2020 at 02:00):

the list of valid identifier character is a mess

Mario Carneiro (Apr 20 2020 at 02:04):

bool is_greek_unicode(unsigned u) { return 0x391 <= u && u <= 0x3DD; }
bool is_letter_like_unicode(unsigned u) {
    return
            (0x3b1  <= u && u <= 0x3c9 && u != 0x3bb) || // Lower greek, but lambda
            (0x391  <= u && u <= 0x3A9 && u != 0x3A0 && u != 0x3A3) || // Upper greek, but Pi and Sigma
            (0x3ca  <= u && u <= 0x3fb) ||               // Coptic letters
            (0x1f00 <= u && u <= 0x1ffe) ||              // Polytonic Greek Extended Character Set
            (0x2100 <= u && u <= 0x214f) ||              // Letter like block
            (0x1d49c <= u && u <= 0x1d59f);              // Latin letters, Script, Double-struck, Fractur
}
bool is_sub_script_alnum_unicode(unsigned u) {
    return
            (0x207f <= u && u <= 0x2089) || // n superscript and numberic subscripts
            (0x2090 <= u && u <= 0x209c) || // letter-like subscripts
            (0x1d62 <= u && u <= 0x1d6a);   // letter-like subscripts
}

bool is_id_first(unsigned char const * begin, unsigned char const * end) {
    if (std::isalpha(*begin) || *begin == '_')
        return true;
    unsigned u = utf8_to_unicode(begin, end);
    return u == id_begin_escape || is_letter_like_unicode(u);
}

bool is_id_rest(unsigned char const * begin, unsigned char const * end) {
    if (std::isalnum(*begin) || *begin == '_' || *begin == '\'')
        return true;
    unsigned u = utf8_to_unicode(begin, end);
    return is_letter_like_unicode(u) || is_sub_script_alnum_unicode(u);
}

Bryan Gin-ge Chen (Apr 20 2020 at 02:07):

Is it the same in Lean 4?

Mario Carneiro (Apr 20 2020 at 02:08):

I don't see any reason why they would change it

Mario Carneiro (Apr 20 2020 at 02:09):

https://github.com/leanprover/lean4/blob/ba648e419af29b140eb07dae7769a6659b7e6e23/src/util/name.cpp#L69-L98

Mario Carneiro (Apr 20 2020 at 02:10):

I don't know if there is another version of this code in lean

Mario Carneiro (Apr 20 2020 at 02:11):

here it is https://github.com/leanprover/lean4/blob/ba648e419af29b140eb07dae7769a6659b7e6e23/library/Init/Lean/Parser/Identifier.lean#L11-L31

Mario Carneiro (Apr 20 2020 at 02:13):

Note however that even characters outside this range can be used outside an escape if they are declared as notations

Mario Carneiro (Apr 20 2020 at 02:19):

Shing Tak Lam (Apr 20 2020 at 03:27):

Mario Carneiro said:

Note however that even characters outside this range can be used outside an escape if they are declared as notations

Oh wow. That accepts zero width characters as well. That's interesting...

notation `(` a `` b `)` := a + b

#eval (1234) -- 235
#eval (1234) -- 46
#eval (1234) -- 127

(Zulip highlights the zero-width characters, but they're invisible in VSCode)

Shing Tak Lam (Apr 20 2020 at 03:30):

Which means if anyone feels like doing so, you can get (implicit) multiplication.

notation a `` b := a * b

def x := 2
def y := 3
def z := 4

#eval xy       -- 6
#eval x(y + z) -- 27
#eval (x + y)z -- 20

For anyone reading this thread, please do not ever do this in real code.

Johan Commelin (Apr 20 2020 at 06:48):

Hah! That's a lovely hack!

Patrick Stevens (Apr 20 2020 at 07:20):

You can get some truly beautifully incomprehensible syntax if you use the right-to-left and left-to-right characters

Gabriel Ebner (Apr 20 2020 at 08:15):

If you're motivated you can also use combining characters (think of umlauts and other diacritics) as postfix operators: https://github.com/flypitch/flypitch/blob/deb288e2f7aca1c5679c88ef3ba25351fc81fd89/src/bvm.lean#L2218

Johan Commelin (Apr 23 2020 at 06:46):

Lol, there is actually a unicode symbol for "invisible times"

U+2062  note    INVISIBLE TIMES

Shing Tak Lam (Apr 23 2020 at 07:26):

I just looked into that and the design is actually rather interesting. It's for when typeset text need to be read by a program, then an expression like $f(x)$ can have multiple meanings, either f times x (f is a variable) or f of x (f is a function). So you can put U+2062 between the f and ( if it is a function or U+2061 if it is a function. I don't know if this is actually used anywhere though...

and for when you want to break (almost?) all conventions, there is U+2064 INVISIBLE PLUS

Reid Barton (Apr 23 2020 at 11:32):

that's just to_additive of INVISIBLE TIMES I guess

Last updated: May 02 2025 at 03:31 UTC