Zulip Chat Archive

Stream: new members

Topic: encodings


Michael Beeson (Jan 05 2021 at 04:43):

When I run grep on my *.lean files and extract some text to, say, file4.txt, and open that file in TextMate, which is set to UTF-8 encoding by default,
I see my mathbb characters converted to stuff like this: setof(λ κ, κ ∈ 픽 ∧ (κ = zero ∨ ∃ μ:M, μ ∈ 픽 ∧ κ = 핊 μ ))
On the other hand, pasted in gmail, the text looks just like it does in Lean Infoview. I conclude that the text is not in UTF-8 but some
other encoding. Does anybody know which one?

Alex J. Best (Jan 05 2021 at 04:49):

On linux and osx the command file blah.lean tries to guess the encoding, for all mathlib files it reports either UTF-8 or ASCII for me

Michael Beeson (Jan 05 2021 at 23:29):

Hm, that command reports "UTF-8 Unicode text". If I do "grep setof inf.lean" I get the correct characters on-screen. If I do
"grep setof inf.lean > file4.txt" I get Japanese characters in file4.txt. even though "file file4.txt" reports UTF-8 Unicode.
If I cut and paste from the screen to a file I get the correct output. So it seems to be the Unix piping in ">" that screws things up.
Fine, I will just cut and paste it from the screen then, never mind the supposed shortcut.

Julian Berman (Jan 05 2021 at 23:42):

What OS are you on? And what is your locale set to?

Julian Berman (Jan 05 2021 at 23:43):

> is shell syntax it shouldn't affect mojibake'ing your files but a broken locale might

Michael Beeson (Jan 06 2021 at 00:58):

iMac:nf beeson$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=


Last updated: Dec 20 2023 at 11:08 UTC