Zulip Chat Archive
Stream: lean4
Topic: Character unicode code
Arthur Paulino (Jul 25 2022 at 00:32):
Is there a function in Lean 4 core that can help me detect whether a character is a special Unicode character and to extract its code?
So alpha would become U-03B1
or something
Arthur Paulino (Jul 25 2022 at 02:48):
Hmm, I believe I can find my way with this
Gabriel Ebner (Jul 25 2022 at 09:06):
Yes, Lean characters are Unicode codepoints. You can use Char.toNat
to get the numerical value of the codepoint. Weirdly enough, we even have a function to turn the numerical value of a Char
into a hex string (charToHex
), but not the corresponding function for natural numbers. :shrug:
I don't know what a "special" Unicode character is. Also please be aware that some characters are encoded as a sequence of multiple codepoints.
Arthur Paulino (Jul 25 2022 at 12:24):
@Gabriel Ebner what do you mean by your last sentence?
Gabriel Ebner (Jul 25 2022 at 12:26):
#eval "😵💫".length -- 3
Arthur Paulino (Jul 25 2022 at 13:26):
Gabriel Ebner said:
Weirdly enough, we even have a function to turn the numerical value of a
Char
into a hex string (charToHex
), but not the corresponding function for natural numbers. :shrug:
And weirdly enough, turns out that's precisely what I need!
Chris Lovett (Jul 28 2022 at 21:26):
I'm confused by the result of charToHex
on that smiley face string. The face-with-spiral-eyes s made up of 3 unicode code points U+1F635, U+200D and U+1F4AB and indeed the Char.toNat values match this:
def dump_chars (s : String) : IO Unit :=
for c in String.data s do
IO.println (Char.toNat c)
#eval dump_chars "😵💫"
/-
128565
8205
128171
-/
but charToHex doesn't look very hex like...
def dump_chars (s : String) : IO Unit :=
for c in String.data s do
IO.println (charToHex c)
#eval dump_chars "😵💫"
/-
*5
*d
*b
-/
Is charToHex is broken?
Wojciech Nawrocki (Jul 28 2022 at 21:35):
Interesting! Looks like it assumes c.toNat <= 255
. By the usage in toHexCore
below I assume it's only meant to work for low ASCII values, but if so then it shouldn't be in the global namespace.
Chris Lovett (Jul 28 2022 at 21:36):
perhaps it needs to be recursive?
partial def toHex(n : Nat) : String :=
if n < 16 then hexDigitRepr n else
let d2 := n / 16;
let d1 := n % 16;
(toHex d2) ++ hexDigitRepr d1
(The largest unicode code point is 10FFFF, so I guess it could be limited to a max of 6 hex digits).
Last updated: Dec 20 2023 at 11:08 UTC