Zulip Chat Archive
Stream: Is there code for X?
Topic: unicode segmentation
Alok Singh (Nov 15 2024 at 04:02):
Like the Rust crate, or bindings to it. I've often used its `.graphemes()` method to get arrays of characters, which are easier to reason about when parsing than raw UTF-8 strings.
Eric Wieser (Nov 15 2024 at 11:13):
I assume you are aware that there is a middle ground between "raw UTF-8 strings" and "sequences of grapheme clusters", namely "sequences of Unicode code points" (which Lean supports natively)?
Eric Wieser (Nov 15 2024 at 11:13):
https://github.com/fgdorais/lean4-unicode-basic is probably the most sophisticated Unicode processing that currently exists in Lean.
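For illustration, a minimal Lean 4 sketch of the three levels discussed above, using only core `String` functions. The example string (`e` followed by the combining acute accent U+0301) is an assumption chosen for this sketch, not something from the thread: it is one grapheme cluster, two code points, and three UTF-8 bytes. Core Lean does not segment grapheme clusters, so only the code point and byte views are computed here.

```lean
-- 'e' + U+0301 (combining acute accent): renders as a single "é",
-- i.e. one grapheme cluster, but is stored as two code points.
def s : String := "e\u0301"

#eval s.length                  -- 2: `String.length` counts code points
#eval s.utf8ByteSize            -- 3: 'e' is 1 byte, U+0301 is 2 bytes in UTF-8
#eval s.toList.map Char.toNat   -- [101, 769]: the individual code points
```

A grapheme-aware split, like the Rust crate's `.graphemes()`, would return a single element for this string, which is the gap the original question is about.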
Last updated: May 02 2025 at 03:31 UTC