Zulip Chat Archive
Stream: lean4
Topic: Grapheme Clusters
Marcus Rossel (Oct 02 2023 at 13:09):
Does Lean have a way of accessing a String
's grapheme clusters?
After reading this lovely blog post on Unicode, I tried the following example given in the post:
#eval "🤦🏼♂️".length -- 5
From the definition of String.length
, String
and Char
I take that we're counting Unicode Scalar Values, i.e. codepoints which aren't surrogates. What if I want to count the number of grapheme clusters in "🤦🏼♂️"
, though? Is there a way of doing this in Lean?
Ioannis Konstantoulas (Oct 02 2023 at 13:37):
I am not an expert, but I presume this is quite difficult. Mature languages have entire libraries dedicated to Unicode subtleties like this, since what "really" counts as a grapheme / grapheme cluster is dependent on language, location etc.
I had a canonical reference discussing the issues, but I cannot find it now. The following thread on reddit is illuminating:
https://www.reddit.com/r/programming/comments/d1dhq9/its_not_wrong_that_length_7/
I think the best Lean can do now is add bindings to a mature Unicode library, if possible.
Ioannis Konstantoulas (Oct 02 2023 at 13:57):
I must add that I don't think this makes Unicode a bad design, like some people in that thread conclude. Unicode does have serious problems (Han unification) and less serious problems (some letters are subscripts, some are not); but overall, Unicode is a good design for a monumental task: digitize human languages, current and historical, and written symbols. Those are inherently messy and ill-defined; the complexities of Unicode merely reflect this fact.
François G. Dorais (Oct 02 2023 at 15:21):
Add a feature request to https://github.com/fgdorais/lean4-unicode-basic
The grapheme cluster algorithm is a bit tricky but definitely within reach. I don't know when I would get around to it but a PR would be welcome.
Last updated: Dec 20 2023 at 11:08 UTC