Zulip Chat Archive

Stream: Is there code for X?

Topic: unicode segmentation


Alok Singh (Nov 15 2024 at 04:02):

Like the Rust crate `unicode-segmentation`, or bindings to it. I've often used its `.graphemes()` method to get arrays of characters, which are easier to think about when parsing than raw UTF-8 strings.

Eric Wieser (Nov 15 2024 at 11:13):

I assume you are aware that there is a middle ground between "raw UTF-8 strings" and "sequences of grapheme clusters", which is "sequences of Unicode codepoints" (and supported in Lean natively)?

Eric Wieser (Nov 15 2024 at 11:13):

https://github.com/fgdorais/lean4-unicode-basic is probably the most sophisticated unicode processing that currently exists in Lean.
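
To make the three levels mentioned above concrete (UTF-8 bytes, codepoints, grapheme clusters), here is a minimal Lean 4 sketch using only core `String` functions. The name `s` is just an illustration. Grapheme segmentation itself is not shown, since as far as I know core Lean does not provide it.

```lean
-- "é" written as 'e' followed by U+0301 (combining acute accent):
-- one user-perceived character (grapheme cluster),
-- two Unicode codepoints, three UTF-8 bytes.
def s : String := "e\u0301"

#eval s.utf8ByteSize           -- 3: length in UTF-8 bytes
#eval s.length                 -- 2: length in codepoints (`Char`s)
#eval s.toList.map Char.toNat  -- [101, 769]: the individual codepoints
```

So a `List Char` obtained via `toList` already avoids byte-level parsing, but a combining sequence like this one still shows up as two elements rather than the single element `.graphemes()` would give.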


Last updated: May 02 2025 at 03:31 UTC