Zulip Chat Archive

Stream: general

Topic: Unicode normalization


Mario Carneiro (May 27 2020 at 05:33):

In the Bonn meeting in January I spoke with Makarius, and I asked him about why Isabelle uses a custom encoding of special characters rather than something like Unicode. He brought up the issue with Unicode normalization, and I didn't really have a good response. The basic problem is that Unicode allows for multiple encodings of characters that are nevertheless to be treated as equal in all contexts; the main example of this is a letter followed by a combining mark like e + ◌́ = e◌́ = é. Should lean be aware of this process? That is, if I make a definition étale and later use e◌́tale should these be considered the same constant? I am curious what community consensus is on issues like this.

Andrew Ashworth (May 27 2020 at 05:50):

They should be equal iff NFC(x) = NFC(y), where NFC is unicode normal form C

Andrew Ashworth (May 27 2020 at 05:54):

however for maximum laziness we could just do whatever Lean 4 does when it compares symbols for equality

Mario Carneiro (May 27 2020 at 05:55):

I assume that is the same thing as what lean 3 currently does which is nothing

Mario Carneiro (May 27 2020 at 05:56):

Unicode has a pretty unambiguous answer to what the correct thing to do is here. The question is mostly about the tradeoff between implementation complexity and accurate handling of unicode characters

Andrew Ashworth (May 27 2020 at 06:00):

I vote for inaccurate handling and simple implementation complexity because it doesn't seem like a pressing problem right now

Andrew Ashworth (May 27 2020 at 06:00):

hashtag procrastinate issue until next major version

Andrew Ashworth (May 27 2020 at 06:03):

most unicode is entered using vscode short codes so it's unlikely to be a big issue. If it can't be entered using a short code I don't think most people bother

Andrew Ashworth (May 27 2020 at 06:04):

there are lots of people with ascii keyboards and making people type graves and accents when there's perfectly good romanization / ascii schemes lying around...

Mario Carneiro (May 27 2020 at 06:08):

To be fair, I think in isabelle they also use short codes for input

Mario Carneiro (May 27 2020 at 06:10):

My personal stance is that unicode causes more problems than it solves, but it is an easy way to achieve okayish results in a variety of viewing media (vscode, html docs, github)

Mario Carneiro (May 27 2020 at 06:13):

Andrew Ashworth said:

there are lots of people with ascii keyboards and making people type graves and accents when there's perfectly good romanization / ascii schemes lying around...

This is just an argument for straight ASCII though, not the current halfsie unicode used in lean 3/4

Mario Carneiro (May 27 2020 at 06:13):

it's also kind of english-centric

Andrew Ashworth (May 27 2020 at 06:21):

爲什麽代碼都是用英文來寫的,將來會有中文寫代碼的那天嗎?

Andrew Ashworth (May 27 2020 at 06:24):

I wouldn't get hung up on preferring Latin-ascii for most code. Just by choosing the latin alphabet to represent words you're excluding a lot of the world

Mario Carneiro (May 27 2020 at 06:25):

I saw a demo of that once from some CMU student

Mario Carneiro (May 27 2020 at 06:25):

it was made up to look like a traditional chinese text

Gabriel Ebner (May 27 2020 at 07:00):

Unicode allows for multiple encodings of characters that are nevertheless to be treated as equal in all contexts

I don't remember this ever being an issue in practice (and Lean doesn't do any kind of Unicode normalization as far as I'm aware of). I think the appropriate response is "identifiers are byte strings, equality is byte string equality". This is also how file names are handled on sane operating systems (and windows).

Kenny Lau (May 27 2020 at 07:07):

Andrew Ashworth said:

爲什麽代碼都是用英文來寫的,將來會有中文寫代碼的那天嗎?

I can't tell if this is translated. the translations engines are getting scarily stronger day by day

Kenny Lau (May 27 2020 at 07:09):

but I'll go with yes just because people who speak Chinese are statistically more likely to use Simplified Chinese (because Simplified Chinese is used in China)

Mario Carneiro (May 27 2020 at 07:19):

The related problem of many visually indistinguishable but distinct definitions is definitely an issue though

Kenny Lau (May 27 2020 at 07:21):

but I as a Hongkonger appreciate the representation of Traditional Chinese

Shing Tak Lam (May 27 2020 at 07:21):

Andrew Ashworth said:

爲什麽代碼都是用英文來寫的,將來會有中文寫代碼的那天嗎?

https://en.wikipedia.org/wiki/Easy_Programming_Language

這個算嗎?

Kenny Lau (May 27 2020 at 07:22):

we need Google to translate Cantonese

Mario Carneiro (May 27 2020 at 07:23):

a proprietary programming language? Eww

Gabriel Ebner (May 27 2020 at 07:27):

In case anybody wonders how it looks like:
etx_9.jpg

Johan Commelin (May 27 2020 at 07:28):

Why do they use those foreign { } characters?

Mario Carneiro (May 27 2020 at 07:29):

Why is it a proportional font? This is doubly baffling considering that the language is already monospace

Shing Tak Lam (May 27 2020 at 07:30):

I guess "祖國你好" is like the Chinese equivalent of "Hello World"?

Mario Carneiro (May 27 2020 at 07:32):

not sure about the shades of meaning here but it sounds very nationalistic

Shing Tak Lam (May 27 2020 at 07:33):

Mario Carneiro said:

Why is it a proportional font? This is doubly baffling considering that the language is already monospace

There are two sets of punctuation used in Chinese computer text. Full width which is monospace (全角) so for example, and Half width (半角), ,. And they made the design decision to use the half width set (ie. the ASCII set of punctuation).

Dunno why they used a proportional font though

Mario Carneiro (May 27 2020 at 07:34):

The code looks like they are using ascii, not halfwidth, for the ascii characters

Gabriel Ebner (May 27 2020 at 07:35):

Ascii is half-width.

Shing Tak Lam (May 27 2020 at 07:35):

Half width doesn't mean half the width of a normal character. It just means ASCII. It's a term (I think) borrowed from Japanese, and is from typesetting prior to computers.

Mario Carneiro (May 27 2020 at 07:35):

it's a bit more than halfwidth in most fonts. Isn't there a dedicated halfwidth section of unicode?

Mario Carneiro (May 27 2020 at 07:36):

which is literally a half moji wide

Gabriel Ebner (May 27 2020 at 07:37):

But it is exactly the half width in monospace fonts.

Mario Carneiro (May 27 2020 at 07:37):

thank goodness for that, I suppose. I guess they just chose not to use such a font in the snippet

Gabriel Ebner (May 27 2020 at 07:38):

Terminals would be hell otherwise. It's already a problem that half-widthness changes with new Unicode versions...

Sebastian Ullrich (May 27 2020 at 08:18):

Mario Carneiro said:

it was made up to look like a traditional chinese text

You're probably thinking of https://github.com/wenyan-lang/wenyan/blob/master/README.md. It's beautiful.

Sebastian Ullrich (May 27 2020 at 08:21):

But as for the topic at hand, I agree with Gabriel. I'd much rather have a linter that forbids non-normalized/deceptive input than adding slow normalization to the parser itself.

Johan Commelin (May 27 2020 at 08:28):

"unicode normalization is an editor feature" — Ancient eastern wisdom

Kevin Buzzard (May 27 2020 at 10:09):

Mario Carneiro said:

My personal stance is that unicode causes more problems than it solves, but it is an easy way to achieve okayish results in a variety of viewing media (vscode, html docs, github)

I am convinced that unicode is one of the reasons that undergraduate mathematicians are interested. It has only dawned on me recently that computer scientists are not all familiar with notation like ,,,,¬\forall, \exists, \land, \lor, \lnot. In maths degrees this notation is introduced literally on day 1 (by me, at Imperial, for example, in lectures 1 and 2) and it makes Lean code more readable for us.

Mario Carneiro (May 27 2020 at 10:10):

Weren't you the one who explained to me that most mathematicians never get exposed to logic?

Kevin Buzzard (May 27 2020 at 10:11):

They get exposed to precisely the truth tables for and, or, not in classical logic, and modus ponens, and contrapositive, and that's it.

Mario Carneiro (May 27 2020 at 10:11):

but once they are out of that class they never use those symbols again

Mario Carneiro (May 27 2020 at 10:11):

except maybe the quantifiers

Kevin Buzzard (May 27 2020 at 10:12):

You're right I don't see much and, or, not, but the quantifiers are everywhere in the analysis scripts I'm marking right now

Kevin Buzzard (May 27 2020 at 10:12):

in the questions as well as the student scripts

Mario Carneiro (May 27 2020 at 10:13):

Anyway I'm not saying using symbols is bad, I'm saying that unicode is not a very good solution

Kevin Buzzard (May 27 2020 at 10:13):

Probably after a while you don't see many of the quantifiers either, but they are just used as part of the vernacular, e.g. some people use the symbols in English sentences if they're taking notes

Mario Carneiro (May 27 2020 at 10:13):

I'm not proposing ascii art as an alternative

Kevin Buzzard (May 27 2020 at 10:14):

I'm saying symbols is essential for readability by mathematicians which is one of the many problems which the area had 5 years ago

Mario Carneiro (May 27 2020 at 10:14):

unicode is a specific implementation of the idea of symbols in code, and it's not a very good one

Kevin Buzzard (May 27 2020 at 10:15):

I can't comment on other possibilities, we don't care what is going on under the hood

Mario Carneiro (May 27 2020 at 10:15):

isabelle has a different implementation of the same idea

Mario Carneiro (May 27 2020 at 10:15):

isabelle doesn't use unicode but you still see all the fancy symbols when you pull it up in the various media

Mario Carneiro (May 27 2020 at 10:16):

and more besides - things that don't exist in unicode

Mario Carneiro (May 27 2020 at 10:17):

LaTeX is yet another implementation of symbols in code (well, typeset text) that is not related to unicode

Mario Carneiro (May 27 2020 at 10:18):

and frankly I think LaTeX addresses the symbolic needs of mathematicians far better than unicode, which is primarily aimed at representing international scripts

Kevin Buzzard (May 27 2020 at 10:33):

I agree that LaTeX would be better. There have been times when writing instructional Lean code where I have wished for LaTeX over what unicode can offer me.

Jannis Limperg (May 27 2020 at 11:45):

Kevin Buzzard said:

It has only dawned on me recently that computer scientists are not all familiar with notation like ,,,,¬\forall, \exists, \land, \lor, \lnot. In maths degrees this notation is introduced literally on day 1 (by me, at Imperial, for example, in lectures 1 and 2) and it makes Lean code more readable for us.

These certainly appear very early in any CS degree. It's not like we do literally no math. ;) Unicode remains unpopular in programming languages because programmers like to use a variety of editors, not all of which can comfortably input Unicode. (Also, entrenched ASCII prejudice imo.)

Sebastian Ullrich (May 27 2020 at 12:28):

In practice, Isabelle's symbols can be a huge pain in the neck - they are pretty when you're in JEdit, but turn into an unreadable and unmanageable mess when you look at them with any other tool or on Github or...

Kevin Buzzard (May 27 2020 at 12:32):

And people post lean code on discord now

Kevin Buzzard (May 27 2020 at 12:32):

And It Just Works

Mario Carneiro (Jun 09 2020 at 05:27):

Oh wow, looks like this comes up in Python, where they use a fairly aggressive normalization: https://stackoverflow.com/questions/62256014/does-python-forbid-two-similarly-looking-unicode-identifiers


Last updated: Dec 20 2023 at 11:08 UTC