Zulip Chat Archive

Stream: lean4

Topic: Character unicode code


Arthur Paulino (Jul 25 2022 at 00:32):

Is there a function in Lean 4 core that can help me detect whether a character is a special Unicode character and to extract its code?

So alpha would become U-03B1 or something

Arthur Paulino (Jul 25 2022 at 02:48):

Hmm, I believe I can find my way with this

Gabriel Ebner (Jul 25 2022 at 09:06):

Yes, Lean characters are Unicode codepoints. You can use Char.toNat to get the numerical value of the codepoint. Weirdly enough, we even have a function to turn the numerical value of a Char into a hex string (charToHex), but not the corresponding function for natural numbers. :shrug:

I don't know what a "special" Unicode character is. Also please be aware that some characters are encoded as a sequence of multiple codepoints.

Arthur Paulino (Jul 25 2022 at 12:24):

@Gabriel Ebner what do you mean by your last sentence?

Gabriel Ebner (Jul 25 2022 at 12:26):

#eval "😵‍💫".length -- 3

Arthur Paulino (Jul 25 2022 at 13:26):

Gabriel Ebner said:

Weirdly enough, we even have a function to turn the numerical value of a Char into a hex string (charToHex), but not the corresponding function for natural numbers. :shrug:

And weirdly enough, turns out that's precisely what I need!

Chris Lovett (Jul 28 2022 at 21:26):

I'm confused by the result of charToHex on that smiley face string. The face-with-spiral-eyes s made up of 3 unicode code points U+1F635, ‍ U+200D and U+1F4AB and indeed the Char.toNat values match this:

def dump_chars (s : String) : IO Unit :=
  for c in String.data s do
    IO.println (Char.toNat c)

#eval dump_chars "😵‍💫"
/-
128565
8205
128171
-/

but charToHex doesn't look very hex like...

def dump_chars (s : String) : IO Unit :=
  for c in String.data s do
    IO.println (charToHex c)

#eval dump_chars "😵‍💫"
/-
*5
*d
*b
-/

Is charToHex is broken?

Wojciech Nawrocki (Jul 28 2022 at 21:35):

Interesting! Looks like it assumes c.toNat <= 255. By the usage in toHexCore below I assume it's only meant to work for low ASCII values, but if so then it shouldn't be in the global namespace.

Chris Lovett (Jul 28 2022 at 21:36):

perhaps it needs to be recursive?

partial def toHex(n : Nat) : String :=
  if n < 16 then hexDigitRepr n else
    let d2 := n / 16;
    let d1 := n % 16;
    (toHex d2) ++ hexDigitRepr d1

(The largest unicode code point is 10FFFF, so I guess it could be limited to a max of 6 hex digits).


Last updated: Dec 20 2023 at 11:08 UTC