Zulip Chat Archive

Stream: lean4

Topic: Valid characters in identifiers

Heather Macbeth (Jun 06 2024 at 16:13):

I just tried to use the "subscript minus" character U+208B in a Lean variable name and discovered it was not valid. Is there a list somewhere of what is valid and what is not? And selfishly, would it be possible to add subscript minus/plus to that list?

variable (u₁ : Nat) -- works
variable (u₋ : Nat) -- expected token

Kyle Miller (Jun 06 2024 at 16:50):

Here's the full definition of what's accepted (minus rules about guillemets). The isIdFirst function is for the first character of an identifier, and isIdRest is for all the characters after that.

def isLetterLike (c : Char) : Bool :=
  (0x3b1  ≤ c.val && c.val ≤ 0x3c9 && c.val ≠ 0x3bb) ||                  -- Lower greek, but lambda
  (0x391  ≤ c.val && c.val ≤ 0x3A9 && c.val ≠ 0x3A0 && c.val ≠ 0x3A3) || -- Upper greek, but Pi and Sigma
  (0x3ca  ≤ c.val && c.val ≤ 0x3fb) ||                                   -- Coptic letters
  (0x1f00 ≤ c.val && c.val ≤ 0x1ffe) ||                                  -- Polytonic Greek Extended Character Set
  (0x2100 ≤ c.val && c.val ≤ 0x214f) ||                                  -- Letter like block
  (0x1d49c ≤ c.val && c.val ≤ 0x1d59f)                                   -- Latin letters, Script, Double-struck, Fractur

def isNumericSubscript (c : Char) : Bool :=
  0x2080 ≤ c.val && c.val ≤ 0x2089

def isSubScriptAlnum (c : Char) : Bool :=
  isNumericSubscript c ||
  (0x2090 ≤ c.val && c.val ≤ 0x209c) ||
  (0x1d62 ≤ c.val && c.val ≤ 0x1d6a)

def isIdFirst (c : Char) : Bool :=
  c.isAlpha || c = '_' || isLetterLike c

def isIdRest (c : Char) : Bool :=
  c.isAlphanum || c = '_' || c = '\'' || c == '!' || c == '?' || isLetterLike c || isSubScriptAlnum c

(I think this code needs to be kept in sync with the C++ functions in src/util/name.cpp)

Kyle Miller (Jun 06 2024 at 16:52):

A question regarding making subscript minus into an identifier is whether it's worth making it not be a postfix operator. People would have to write u ₋ rather than u₋, just like n ! instead of n! for factorial.

Heather Macbeth (Jun 06 2024 at 17:18):

I see. I think that mathematical practice would be much more likely to treat a subscript-minus as a name component than as an operator, but it's hard to know whether this impression is field-dependent. And maybe there are non-mathematician unicode users too ....

Kyle Miller (Jun 06 2024 at 17:39):

Here's a previous question about subscripts and superscripts: https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/add.20characters.20to.20lean.20core.3F/near/261235230

Heather Macbeth (Jun 06 2024 at 17:49):

Ah: @Floris van Doorn says he can imagine use cases of ⁺ being an operator. Floris, have you encountered this in "real life"? The name-component usage definitely occurs (case at hand: section 2 here).

Jireh Loreaux (Jun 06 2024 at 18:38):

We already have the superscripts as operators: «term_⁺» and «term_⁻», although I really wish this was either not global notation, or else a notation class existed for these.

Patrick Massot (Jun 06 2024 at 18:57):

This is clearly yet another notation that should not be global.

Floris van Doorn (Jun 06 2024 at 19:02):

I was indeed thinking about docs#posPart for that postfix notation. I'm fine with that notation being scoped (to the Order namespace?)

Floris van Doorn (Jun 06 2024 at 19:04):

Then again, ₂ could also be a postfix operator (for various things, including Prod.snd), and that is allowed in a valid identifier. But I prefer to keep ⁺ available for notation.

Jireh Loreaux (Jun 06 2024 at 19:29):

Personally, I would prefer that the posPart notation was made into a class (and if preferred, instances could be scoped), as that way I can use it for C*-algebras. (This would cause diamonds if we had both instances available simultaneously, but only in certain contexts.)

Richard Osborn (Jun 06 2024 at 19:29):

My intuition is that subscripts tend to be part of a term, but superscripts are operators. Is this a reasonable stance for lean to have?

Yaël Dillies (Jun 06 2024 at 19:46):

Jireh Loreaux said:

Personally, I would prefer that the posPart notation was made into a class

Actually there were originally a class with a single instance, so I deleted the class.

Yaël Dillies (Jun 06 2024 at 19:48):

Jireh Loreaux said:

that way I can use it for C*-algebras

Can't you already use it for C*-algebras? Like, docs#posPart is the correct notion, right?

Jireh Loreaux (Jun 06 2024 at 20:41):

It's not a lattice (not even the selfadjoint elements)

Joseph Myers (Jun 06 2024 at 23:46):

I feel the default rules for what's accepted in identifiers should be based on XID_Start and XID_Continue in UAX # 31 (as updated for each new Unicode version), with Lean-specific changes only as needed, rather than having something heavily Lean-specific. That might or might not help with these specific issues, but it's the Unicode recommendation for identifiers in programming languages in general.

Patrick Massot (Jun 06 2024 at 23:57):

Would that rule out many of the identifiers we already know and love? I am afraid the general consensus on this is probably pretty far from our traditions.

Joseph Myers (Jun 07 2024 at 00:12):

I think characters already used in identifiers such as ' count as "as needed", but it would probably add many reasonable identifier characters that no-one has got round to adding individually.

Last updated: May 02 2025 at 03:31 UTC