Zulip Chat Archive

Stream: lean4

Topic: Accept more Unicode in identifiers?


Julien Marquet-Wagner (Jan 07 2025 at 13:57):

Hi,
I noticed Lean does not accept bold face Unicode characters (𝐀𝐁𝐂...𝐙) as identifiers. Hower, the unicode-input extension that ships with vscode-lean4 does.
Turns out this function from lean4 could accept more characters.

I think it would be nice if vscode-lean4 and lean4 regained consistency with respect to Unicode characters. There are two possibilities:

  • Either extend the accepted Unicode range (my personal favorite)
  • Or deprecate these characters from unicode-input (bummer but might make sense).

Proposal: extend the accepted Unicode range.

characters current range proposed range reference
Mathematical Alphanumeric Symbols 1D49C--1D59F 1D400-1D7ff https://www.unicode.org/charts/PDF/U1D400.pdf
Greek and Coptic 391-3A9\{3A0;3A3}; 3B1-3C9\{3BB}; 3CA-3FB 370-3FF\{3BB;3A0;3A3} https://www.unicode.org/charts/PDF/U0370.pdf
Greek Extended fully accepted nothing to change https://www.unicode.org/charts/PDF/U1F00.pdf
Letterlike Symbols fully accepted nothing to change https://www.unicode.org/charts/PDF/U2100.pdf

(For _Greek and Coptic_, the excluded symbols are 3A0 Ξ , 3A3 Ξ£, and 3BB Ξ».)

Any thoughts? If (a subset of) these changes are acceptable I'll write a PR :rocket:

Eric Wieser (Jan 07 2025 at 14:00):

Why should unicode-input restrict itself to unicode characters that are legal in identifiers? Unicode characters are used for notation and docstrings too

Julien Marquet-Wagner (Jan 07 2025 at 14:02):

True, I forgot about that when I wrote.

Julien Marquet-Wagner (Jan 07 2025 at 14:03):

Also maybe there's a way to "register an identifier" that I don't know of?

Julien Marquet-Wagner (Jan 07 2025 at 14:09):

though the feeling that prompted me to write all this is that it's kind of frustrating to be able to input \bfA and to have a weird "unexpected token" error :/ hence just not allowing \bfA would _technically_ get us rid of the issue.

Kevin Buzzard (Jan 07 2025 at 14:16):

I would expect \bfA to be notation for something, rather than being used in its own right as part of an identifier. Even β„• is an unexpected token, and this is used all over the place in mathlib (edit: this last sentence is incorrect, see below)

Julien Marquet-Wagner (Jan 07 2025 at 14:25):

β„• is actually accepted (it's 2115 in Letterlike Symbols).

I'd say it makes sense to accept blackboard bold symbols, for instance you might want to be able to write about π•œ-modules, and (𝔸, 𝔹)-bimodules (though I believe this does not fit in Mathlib's style).

Julien Marquet-Wagner (Jan 07 2025 at 14:30):

I guess the reason why β„• is a notation in Mathlib is that Nat is actually defined in Lean's standard library. Though technically it would be possible to rename Nat to β„• (I wouldn't say it's a great idea to do so, too much of a breaking change).

Jireh Loreaux (Jan 07 2025 at 15:17):

I think, as noted above, you're just conflating tokens with tokens-allowed-in-identifiers. It's quite useful to be able to take a token which is not currently recognized by any parser and then turn it into notation for something. Not being about to easily type that token would just be painful, not helpful.

Kevin Buzzard (Jan 07 2025 at 15:39):

Julien Marquet said:

β„• is actually accepted (it's 2115 in Letterlike Symbols).

Aah yes, apologies. I tried def β„• := 37 and got an error about tokens, but this was because I had mathlib already imported.

Julien Marquet-Wagner (Jan 07 2025 at 17:50):

Back to my original proposal: I guess I'll create an issue about it on GitHub, then PR this if it makes sense and hopefully it'll get merged :)

Julien Marquet-Wagner (Jan 07 2025 at 17:51):

I'll let this topic live a little more before creating an issue, in case someone comes with more feedback

Jireh Loreaux (Jan 07 2025 at 18:04):

It's still not clear to me why you want to enable these extra characters in identifiers? (I'm not necessarily arguing against it, but I don't see an explicit reason for it.)

Markus Himmel (Jan 07 2025 at 19:05):

In the interest of keeping related requests together, could you add your request to lean4#3330 instead of creating a new issue? Also, as Jireh mentions, it's much more likely that your propsal is accepted if you have a real use case where your proposed change is required. Also note that we won't merge any PRs until the corresponding RFC is accepted, so it's best to not invest any time into a PR until the RFC is accepted.

Joseph Myers (Jan 07 2025 at 21:11):

My suggestion is still to start with one of the recommendations from UAX#31 for what Unicode characters should be allowed in identifiers in programming languages, then add whatever currently accepted characters are needed for compatibility with existing code and remove whatever currently rejected characters break existing code if allowed in identifiers.

Julien Marquet-Wagner (Jan 08 2025 at 10:20):

Jireh Loreaux said:

It's still not clear to me why you want to enable these extra characters in identifiers? (I'm not necessarily arguing against it, but I don't see an explicit reason for it.)

The _original_ reason is that I want to be able to use bold letters in identifiers for a project of mine. This is reason enough for me, but clearly that's not enough for a RFC.

The first thing is that the change has no negative impact:

  • It's cheap enough: it only touches 5 lines of Lean's source (that's a really small PR!), and performance-wise it even yields an improvement.
  • It won't break anything (I can't think of a case where code relies on Lean not accepting 𝐀 in identifiers)

It benefits some users:

  • As pointed out by @Markus Himmel, lean4#3330 and the accompanying Zulip thread received some positive feedback.
    Note: β±Ό wouldn't be accepted in an identifier with the changes I propose.

  • @Joseph Myers's remark also brings another argument: it would bring us closer to the Unicode standard.

So I guess the rationale is: it's a small change that has not downsides and positively affects some users. The benefit is small, but the cost is none in performance and small in terms of labor. In the end, the question to me is "why refuse this change?" rather than "why do it"?

Julien Marquet-Wagner (Jan 08 2025 at 10:20):

Joseph Myers said:

My suggestion is still to start with one of the recommendations from UAX#31 for what Unicode characters should be allowed in identifiers in programming languages, then add whatever currently accepted characters are needed for compatibility with existing code and remove whatever currently rejected characters break existing code if allowed in identifiers.

This is very relevant, thanks!

Another option would be to make Lean comply with UAX#31.
Implementing UAX#31 isn't trivial, but it would make for a better RFC. I guess it's possible to implement it without negative impact on performance, but I'd have to actually write the code to be sure. Fortunately, the changes would be small (if I'm right saying that the only place that needs changes is here). There is a slightly nontrivial issue: currently, Lean identifiers are <Start> <Continue>* but would ideally need to be <Start> <Continue>* (<Medial> <Continue>+)*. The standard would accept <Medial> = {} though.

Julien Marquet-Wagner (Jan 08 2025 at 10:24):

So, two questions:

  • Is my point convincing?
  • What sould I RFC?
    • Either the original proposal
    • Or compliance with UAX#31

David Thrane Christiansen (Jan 08 2025 at 10:44):

I don't think that just because something is written in a document, it's necessarily better, especially for somewhat unconventional parsing setups that have different tradeoffs than might otherwise be seen. I won't tell you whether to write up a UAX#31-based proposal or one not based on that, because I think part of the job of the person writing the proposal is to evaluate the alternatives and explain why they picked the one they did.

The inability to use mængde (Danish for "set") as an identifier is unfortunate, as is the lack of subscript-j. One limiting factor right now is that we don't have a representation of Unicode tables in the compiler, which would make it easy to say "all letters are allowed". I think there's clearly room to expand the set of identifiers.

Some things I'd like to see in a proposal are:

  • a set of rules that are easy to explain and don't feel too arbitrary
  • reasonable performance
  • backwards compatibility

It'd also be good to consider the issue of RTL scripts (e.g. Arabic, Hebrew), lookalike characters (Cyrillic Π‘ vs Latin C, Ossetian Σ” vs Latin Γ†) and whether any special considerations are needed. If you propose something like UAX#31, then should we also accept full-width question mark characters to accommodate identifiers written in Chinese or Japanese? And what about Unicode normalization and character comparison? It's fine to punt on an issue and say "this RFC doesn't fix this particular subproblem", but it's good to be upfront about that and consider the impact. It's also OK to say "In the idea world, we'd do thing X with Unicode - for now, we're doing approximation Y that delivers much of the benefit and can later be expanded".

As far as your point: I don't find arguments like "why refuse this change?" to be particularly convincing. And every change has negative impact, if only by taking mental bandwidth from other tasks. The point of the RFC is to put together a good case for your change, and let interested parties review it. If the case is easy, so much the better! But the highly polemic rhetoric is not convincing, just exhausting.

Joseph Myers (Jan 08 2025 at 13:09):

Normalization forms are discussed in UAX#31 (section 5). For confusables, see UTS#39 - but since we make heavy, deliberate use of confusable characters in mathlib (not in identifiers), I doubt we want Lean to exclude them.

David Thrane Christiansen (Jan 08 2025 at 13:16):

Great! But an RFC that starts with "Go read and internalize this big ol' document" is not one that's likely to be acted much upon, because now the cost of the change has gone up greatly. It's good to base the work on what others have done, and cite the sources along the way, but there's got to be some digestion and presentation of it as well.

Julien Marquet-Wagner (Jan 08 2025 at 15:54):

I found the commit that initially enabled math characters. I don't really know what the rationale was for this change (especially, why allow only Latin letters, Script, Double-struck, Fractur?). I guess extending the accepted characters to the full Mathematical Alphanumeric Symbols block could be considered as more a fix than a new feature.

So what I'm going to do is

  • Propose to fix the issue that not all alphabetic mathematical symbols are accepted.
    • This makes sense because it's in the spirit of the original commit that added mathematical alphanumeric characters. It looks to me like there was no reason to restrict the symbols to latin, script, double-struck and Fraktur. It also seems to me like it's in the spirit of this same commit to accept the whole Greek and Coptic table.
    • This seems reasonable as it changes nothing at the granularity of Unicode character sets.
    • This only adds "non-exotic" characters, which, from the looks of this discussion, I believe isn't concerning for anyone.
  • Open a RFC on whether Lean should comply with UAX#31. I honestly have no strong opinion on whether this should be done, nor how it should be done. I guess a RFC is would be the right place to discuss the technicalities that were pointed out here.
    • A baseline would be to accept recommended scripts as defined by the Unicode Standard. (By the way, Section 2.4 Specific Character Adjustments of UAX#31 already covers much ground on what characters to exclude because of security issues and/or limited use in practice).
    • The implementation may be non-trivial: depending on the complexity of the logic, this could cause a performance hit if not done cleverly enough.

In short, a bug report + a PR that fixes it, to add "reasonable" characters, and a RFC for more difficult questions.

Julien Marquet-Wagner (Jan 08 2025 at 16:06):

@Markus Himmel The issue of subscript characters is a bit complicated: these characters aren't defined in a coherent manner in Unicode, parts of the alphabet belong to different Unicode blocks. Plus, should capital subscripts be included? And superscripts?
Maybe the reasonable answer is to split in

  • add β±Ό as part of the issue+fix (along with the rest of the lowercase alphabet if need be).
  • discuss the issue of the rest of the subscripts/superscripts in the RFC.
    • This means the Unicode RFC would also cover lean4#3330

Markus Himmel (Jan 08 2025 at 16:22):

The commit you linked states in the commit message and the changelog that it adds the Script, Double-struck, and Fractur blocks, and that is exactly what it does. There is nothing to "fix". It is reasonable to want to be able to use additional characters in identifiers, but the way to get there is to add a convincing comment to lean4#3330 that makes a clear suggestion and argues why it is a good idea, and then to be patient until the FRO accepts the RFC. Opening a bug ticket and a PR will not accelerate the process.

Julien Marquet-Wagner (Jan 08 2025 at 16:39):

Yeah, true. What I have in mind is that there are really two different topics: that of coherence of the current behavior, and that of complying with the Unicode standard. I believe this separation of issues is relevant. Also, the "small" issue is easy to review (and pretty uncontroversial, I think), while the "big" one isn't at all.

That's why I think it's relevant to separate in two RFCs/issues.

Julien Marquet-Wagner (Jan 08 2025 at 16:43):

Also, the changes I propose go further than lean4#3330. Do you think I should still plug my contributions on this RFC? The upside is that it already exists and already has received some positive feedback.

Julien Marquet-Wagner (Jan 08 2025 at 16:48):

Here's what the changes I originally proposed would look like.

Eric Wieser (Jan 08 2025 at 16:52):

I don't think anyone is really questioning whether the change is simple; the challenge is "how do we end up with consensus on which set of characters are the ones lean should allow in identifiers".

Kyle Miller (Jan 08 2025 at 18:00):

Did you check the rest of the repository @Julien Marquet for these ranges? Lean.isLetterLike needs to be kept in sync too. There's also checking with existing editor extensions to see if they need any modification. You mentioned that it's just a ~five line change, but that's optimistic. There's also documentation in doc/lexical_structure.md.

Julien Marquet-Wagner (Jan 08 2025 at 18:06):

@Kyle Miller I made the mistaken assumption that is_letter_like was the only place that governed this based on the commit history. Thanks for pointing that out.

Joseph Myers (Jan 09 2025 at 02:05):

The basic idea of my proposal is: since Unicode experts have already thought about characters in identifiers and written up their conclusions, delegating as far as possible to those experts (with only those changes needed to work with existing practice in Lean and mathlib) seems like a good idea to avoid needing to try to reach consensus separately on lots of separate character groups.

Julien Marquet-Wagner (Jan 09 2025 at 10:44):

@Joseph Myers Makes sense. The discussion then becomes about what profile to implement from UAX#31.

The formulations allow for extensions, also known as profiles. That is, the particular set of code points or sequences of code points for each category used by the syntax can be customized according to the requirements of the environment.

There's even already the Mathematical Compatibility Notation Profile in the standard.


Last updated: May 02 2025 at 03:31 UTC