Zulip Chat Archive

Stream: lean4

Topic: Grapheme Clusters


Marcus Rossel (Oct 02 2023 at 13:09):

Does Lean have a way of accessing a String's grapheme clusters?
After reading this lovely blog post on Unicode, I tried the following example given in the post:

#eval "🤦🏼‍♂️".length -- 5

From the definition of String.length, String and Char I take that we're counting Unicode Scalar Values, i.e. codepoints which aren't surrogates. What if I want to count the number of grapheme clusters in "🤦🏼‍♂️", though? Is there a way of doing this in Lean?

Ioannis Konstantoulas (Oct 02 2023 at 13:37):

I am not an expert, but I presume this is quite difficult. Mature languages have entire libraries dedicated to Unicode subtleties like this, since what "really" counts as a grapheme / grapheme cluster is dependent on language, location etc.

I had a canonical reference discussing the issues, but I cannot find it now. The following thread on reddit is illuminating:
https://www.reddit.com/r/programming/comments/d1dhq9/its_not_wrong_that_length_7/

I think the best Lean can do now is add bindings to a mature Unicode library, if possible.

Ioannis Konstantoulas (Oct 02 2023 at 13:57):

I must add that I don't think this makes Unicode a bad design, like some people in that thread conclude. Unicode does have serious problems (Han unification) and less serious problems (some letters are subscripts, some are not); but overall, Unicode is a good design for a monumental task: digitize human languages, current and historical, and written symbols. Those are inherently messy and ill-defined; the complexities of Unicode merely reflect this fact.

François G. Dorais (Oct 02 2023 at 15:21):

Add a feature request to https://github.com/fgdorais/lean4-unicode-basic

The grapheme cluster algorithm is a bit tricky but definitely within reach. I don't know when I would get around to it but a PR would be welcome.


Last updated: Dec 20 2023 at 11:08 UTC