Zulip Chat Archive

Stream: mathlib4

Topic: Whitelist for Unicode?


Adomas Baliuka (Jul 02 2024 at 23:28):

Just for fun, I looked at the distribution of unicode characters in Mathlib's source files. For anyone interested in trivia, there are 610 individual characters and their frequency distribution is roughly exponential.

I'm one of admittedly very few people here who are slightly worried that malicious contributors (or AIs that find exploits) might make use of "cybersecurity-type vulnerabilities" to introduce inconsistencies. Given that unicode is generally a cybersecurity nightmare in most software, I wonder if Mathlib has any checks (e.g. in the linter) or guidelines on using and introducing new unicode. It does not seem so and I think it would be a good idea, e.g., to have a "whitelist of unicode characters" and a linter warning to be issued to any PR introducing new characters.

I found some interesting instances of non-printable non-ascii characters:

  • There is a U+200b (​zero width space) in docstrings here and here
  • There is U+200c (zero-width non-joiner) in docstrings here and here, which the symmetry between those two might suggest they are intentional, but I can't tell why it's used.

What do you think? Should we bother removing these, assuming they are unintentional? Should we worry in general about weird unicode characters being introduced and going unnoticed in PRs?

Bolton Bailey (Jul 02 2024 at 23:53):

Well the nice thing about working in a language which asks you to prove things is that it should be impossible to introduce an inconsistency, at least in the mathematical sense, solely by including tricky unicode characters in theorems because the theorems are eventually checked by the kernel and are either true or not.

Your concern is perhaps a little more valid in the context of definitions that may not be what they are supposed to look like. But I think the mathlib community is currently focused on research math and this attack model is not a big concern for that endeavor. Still, removing unnecessary complications can't hurt.

I do think a linter for unicode characters that don't have backslash codes makes sense though (at least outside of comments), because if any of those are introduced, it could make it harder for maintainers to edit that code after the fact.

Bolton Bailey (Jul 02 2024 at 23:56):

I'd be interested to see the unicode frequency stats

Adomas Baliuka (Jul 03 2024 at 00:04):

Individual frequencies

Julia code to recreate distribution plot:

using StatsBase, Plots
s::String = <concatenate all .lean files in Mathlib>
countmap(s) |> collect |> v->sort(v, by=pair->pair.second) |> tt->bar(tt.|>s->s.first, tt.|>s->s.second, yaxis=:log)

Distribution:
tmp.png

Kim Morrison (Jul 03 2024 at 01:04):

I would encourage removing the non-printable characters.

(And yes, I'm aware that it's possible that someone added them to tweak the docs output in some clever and undocumented way. :-)

Anthony Bordg (Jul 03 2024 at 14:00):

Can I use a curly symbol E\mathcal{E} for the name of a variable? How?

Adomas Baliuka (Jul 03 2024 at 14:07):

Anthony Bordg said:

Can I use a curly symbol E\mathcal{E} for the name of a variable? How?

Any symbol that cannot be entered using <backslash><some letter combination>, you can copy/paste from e.g. https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode

So for caligraphic E 'ℰ', VSCode tells me to use \McE. I guess 'Mc' stands for "mathcal". If you want to really confuse people, you could use something like 'ℇ'...

Once you have a symbol in e.g. VSCode, hovering over it tells you the <backslash><some letter combination>, if there is one.

Anthony Bordg (Jul 03 2024 at 14:19):

Fantastic Adomas! \Mc is what I was looking for.

Marc Huisinga (Jul 03 2024 at 14:29):

The 'Docs: Show Unicode Input Abbrevations' command also opens a full list in VS Code.

Anthony Bordg (Jul 03 2024 at 14:44):

Is this a command you input in VC Code?

Marc Huisinga (Jul 03 2024 at 14:47):

Yes! See also https://github.com/leanprover/vscode-lean4/blob/master/vscode-lean4/manual/manual.md#unicode-input.

Adomas Baliuka (Jul 07 2024 at 22:54):

I made a PR to remove those characters. @Kim Morrison You were in favor of removing them? Not sure if anyone else should be part of this decision because I don't understand where that Unicode originally comes from in the first place (Mathlib3 ?!).

Eric Wieser (Jul 07 2024 at 23:13):

Interestingly this change breaks unicode rendering on Android in one case, but I don't think that's a good reason to keep the invisible characters.

Adomas Baliuka (Aug 28 2024 at 10:34):

@Michael Rothgang thanks for the hint how to implement a linter for unicode!

I made an attempt at it (and will at some point PR something so it can be discussed).
However, I realized that I don't understand how unicode is actually treated in Lean. For example, why does this not work:

def ka := 2 -- expected token

The notation definition below works in any case, so the linter would probably be useful:

infix:50 " か " => Add.add
#eval 2  2

Mario Carneiro (Aug 28 2024 at 10:48):

because lean has a small hardcoded list of identifier characters, which is not very international-friendly

Michael Rothgang (Aug 28 2024 at 11:00):

Even surprising unicode characters in comments can be useful to check for. (There were a security announcements a while ago, about using unicode to make the rendered code look differently from what it actually was.)

Adomas Baliuka (Aug 28 2024 at 11:16):

I made a PR #16215.

Current issues/TODOs:

  • Does this have community support in the first place?
  • Need to make the whitelist more complete, order it, clean, etc.
  • I'm not sure how to add a (linter-required) docstring to my local instance

Michael Rothgang (Aug 28 2024 at 11:30):

I left a few comments.

Patrick Massot (Aug 28 2024 at 11:46):

I think that any character that appears in https://github.com/leanprover/vscode-lean4/blob/master/lean4-unicode-input/src/abbreviations.json should be in the whitelist, otherwise this is really inconsistent.

Adomas Baliuka (Aug 28 2024 at 12:26):

@Patrick Massot do you think it's OK to hard-code this or should the linter read in and parse abbreviations.json?

Patrick Massot (Aug 28 2024 at 12:26):

I hope it’s ok to go for the simple solution. That list does not change very often.

Adomas Baliuka (Aug 28 2024 at 12:40):

Should we allow "Rial" ? It has a VSCode abbreviation
but it uses right-to-left text flow, so seems a bit awkward to use and have in the code. It's not currently used in Mathlib.

Patrick Massot (Aug 28 2024 at 12:47):

I would vote for removing it from the abbreviation file then.

Adomas Baliuka (Aug 28 2024 at 12:51):

There are also still unprintable non-ascii characters.

@Patrick Massot you introduced two months ago two instances of U+202F (category Zs: Separator, space) here. Do we want to keep and allow this?

Adomas Baliuka (Aug 28 2024 at 12:56):

Do we allow abbreviation \quad (U+2001)?

Patrick Massot (Aug 28 2024 at 13:00):

This is a mistake coming from my native language habits (French uses a non-breakable space before the colon).

Patrick Massot (Aug 28 2024 at 13:01):

Sorry about that.

Eric Wieser (Aug 28 2024 at 13:16):

Patrick Massot said:

This is a mistake coming from my native language habits (French uses a non-breakable space before the colon).

Do French keyboards have a dedicated button for this?

Adomas Baliuka (Aug 28 2024 at 13:17):

Patrick Massot said:

I would vote for removing it from the abbreviation file then.

RFC https://github.com/leanprover/vscode-lean4/issues/522

Eric Wieser (Aug 28 2024 at 13:19):

Adomas Baliuka said:

Patrick Massot do you think it's OK to hard-code this or should the linter read in and parse abbreviations.json?

This sounds like it will be a bit frustrating for users defining new notation, as they then can't get their mathlib PR merged without FRO approval of the shortcut.

Patrick Massot (Aug 28 2024 at 13:22):

Eric Wieser said:

Patrick Massot said:

This is a mistake coming from my native language habits (French uses a non-breakable space before the colon).

Do French keyboards have a dedicated button for this?

You don’t need a dedicated button, this is given by the keyboard layout. I use https://qwerty-lafayette.org/ which has the thin non-breakable space on Shift-space and the large one on AltGr-Shift-Space (but I don’t use it much).

Patrick Massot (Aug 28 2024 at 13:23):

Note this layout also makes it slightly easier to type ’ than '

Patrick Massot (Aug 28 2024 at 13:24):

since ’ is deadkey-space while ' is AltGr-o. This is the same number of keypresses but AltGr must be kept pressed while the deadkey is a deadkey so you release it before typing the next key.

Adomas Baliuka (Aug 28 2024 at 13:25):

Eric Wieser said:

Adomas Baliuka said:

Patrick Massot do you think it's OK to hard-code this or should the linter read in and parse abbreviations.json?

This sounds like it will be a bit frustrating for users defining new notation, as they then can't get their mathlib PR merged without FRO approval of the shortcut.

We're not reading in the JSON file right now, but in any case we would only require that it be a subset of the whitelist.

The whitelist is currently just a string defined in the file where the linter is defined. However we define it, certainly Mathlib will be in control of it.

Patrick Massot (Aug 28 2024 at 13:26):

But note that some French people do not care about non-breaking spaces. There are even French people that don’t type French quotation marks when writing in French (those people usually also don’t use “ and ” as quotation marks, they use "). :man_shrugging:

Adomas Baliuka (Aug 28 2024 at 13:43):

Current whitelist discussion status:

I split the whitelist into parts:

for example, a superset of a reasonable subset of the Wikipedia article might be below

def WIKI_NOT_WHITELIST := "
∊∕∬∭∰∱∲∳∾∿≆≭≸≹⊌⊦⊰⊱⊶⊷⊽⊾⊿⋅⋕⋤⋥⋰⋲⋳⋴⋵⋶⋷⋸⋹⋺⋻⋼⋽⋾⋿⨇⨈⨉⨊⨋⨌⨎⨏⨐⨑⨒⨓⨔⨕⨖⨗⨘⨙⨚⨛⨜⨝⨞⨟⨠⨡⨢⨣⨤⨥⨦⨧⨨⨩⨪⨫⨬⨭⨮⨰⨱⨲⨴⨵⨶
⨷⨸⨹⨺⨻⨼⨽⨾⩀⩁⩂⩃⩄⩅⩆⩇⩈⩉⩊⩋⩌⩍⩎⩏⩐⩑⩒⩓⩔⩕⩖⩗⩘⩙⩚⩛⩜⩝⩞⩟⩠⩡⩢⩣⩤⩥⩦⩧⩨⩩⩪⩫⩬⩭⩮⩯⩰⩱⩲⩳⩴⩵⩶⩷⩸⩹⩺⩻⩼⩽⩾⪀⪁⪂⪃⪄⪅⪆⪇⪈⪉⪊⪋⪌⪍⪎⪏⪐⪑⪒
⪓⪔⪕⪖⪗⪘⪙⪚⪛⪜⪝⪞⪟⪠⪡⪢⪣⪤⪥⪦⪧⪨⪩⪪⪫⪬⪭⪮⪯⪰⪱⪲⪳⪴⪵⪶⪷⪸⪹⪺⪻⪼⪽⪾⪿⫀⫁⫂⫃⫄⫅⫆⫇⫈⫉⫊⫋⫌⫍⫎⫏⫐⫑⫒⫓⫔⫕⫖⫗⫘⫙⫚⫛⫝̸⫝⫞⫟
⫠⫡⫢⫣⫤⫥⫦⫧⫨⫩⫪⫫⫬⫭⫮⫯⫰⫱⫲⫳⫴⫵⫶⫷⫸⫹⫺⫻⫼⫽⫾⫿ℎ𝐆𝑋𝑑𝑭𝑾𝒞𝒿𝓰𝔊𝔛𝔴𝕔
𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟
𝖠𝖡𝖢𝖤𝖥𝖦𝖧𝖨𝖩𝖪𝖫𝖬𝖭𝖮𝖯𝖰𝖱𝖲𝖳𝖴𝖵𝖶𝖷𝖸𝖹𝖺𝖻𝖼𝖽𝖾𝖿𝗀𝗁𝗂𝗃𝗄𝗅𝗆𝗇𝗈𝗉𝗊𝗋𝗌𝗍𝗎𝗏𝗐𝗑𝗒𝗓
𝗔𝗕𝗖𝗗𝗘𝗙𝗚𝗛𝗜𝗝𝗞𝗟𝗠𝗡𝗢𝗣𝗤𝗥𝗦𝗧𝗨𝗩𝗪𝗫𝗬𝗭𝗮𝗯𝗰𝗱𝗲𝗳𝗴𝗵𝗶𝗷𝗸𝗹𝗺𝗻𝗼𝗽𝗾𝗿𝘀𝘁𝘂𝘃𝘄𝘅𝘆𝘇
𝘈𝘉𝘊𝘋𝘌𝘍𝘎𝘏𝘐𝘑𝘒𝘓𝘔𝘕𝘖𝘗𝘘𝘙𝘚𝘛𝘜𝘝𝘞𝘟𝘠𝘡𝘢𝘣𝘤𝘥𝘦𝘧𝘨𝘩𝘪𝘫𝘬𝘭𝘮𝘯𝘰𝘱𝘲𝘳𝘴𝘵𝘶𝘷𝘸𝘹𝘺𝘻
𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕𝙖𝙗𝙘𝙙𝙚𝙛𝙜𝙝𝙞𝙟𝙠𝙡𝙢𝙣𝙤𝙥𝙦𝙧𝙨𝙩𝙪𝙫𝙬𝙭𝙮𝙯
𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉𝚊𝚋𝚌𝚍𝚎𝚏𝚐𝚑𝚒𝚓𝚔𝚕𝚖𝚗𝚘𝚙𝚚𝚛𝚜𝚝𝚞𝚟𝚠𝚡𝚢𝚣𝚤𝚥
𝚨𝚩𝚪𝚫𝚬𝚭𝚮𝚯𝚰𝚱𝚲𝚳𝚴𝚵𝚶𝚷𝚸𝚹𝚺𝚻𝚼𝚽𝚾𝚿𝛀𝛁𝛂𝛃𝛄𝛅𝛆𝛇𝛈𝛉𝛊𝛋𝛌𝛍𝛎𝛏𝛐𝛑𝛒𝛓𝛔𝛕𝛖𝛗𝛘𝛙𝛚𝛛𝛜𝛝𝛞𝛟𝛠𝛡
𝛢𝛣𝛤𝛥𝛦𝛧𝛨𝛩𝛪𝛫𝛬𝛭𝛮𝛯𝛰𝛱𝛲𝛳𝛴𝛵𝛶𝛷𝛸𝛹𝛺𝛻
𝛼𝛽𝛾𝛿𝜀𝜁𝜂𝜃𝜄𝜅𝜆𝜇𝜈𝜉𝜊𝜋𝜌𝜍𝜎𝜏𝜐𝜑𝜒𝜓𝜔𝜕𝜖𝜗𝜘𝜙𝜚𝜛
𝜜𝜝𝜞𝜟𝜠𝜡𝜢𝜣𝜤𝜥𝜦𝜧𝜨𝜩𝜪𝜫𝜬𝜭𝜮𝜯𝜰𝜱𝜲𝜳𝜴𝜵𝜶𝜷𝜸𝜹𝜺𝜻𝜼𝜽𝜾𝜿𝝀𝝁𝝂𝝃𝝄𝝅𝝆𝝇𝝈𝝉𝝊𝝋𝝌𝝍𝝎𝝏𝝐𝝑𝝒𝝓𝝔𝝕
𝝖𝝗𝝘𝝙𝝚𝝛𝝜𝝝𝝞𝝟𝝠𝝡𝝢𝝣𝝤𝝥𝝦𝝧𝝨𝝩𝝪𝝫𝝬𝝭𝝮𝝯𝝰𝝱𝝲𝝳𝝴𝝵𝝶𝝷𝝸𝝹𝝺𝝻𝝼𝝽𝝾𝝿𝞀𝞁𝞂𝞃𝞄𝞅𝞆𝞇𝞈𝞉𝞊𝞋𝞌𝞍𝞎𝞏
𝞐𝞑𝞒𝞓𝞔𝞕𝞖𝞗𝞘𝞙𝞚𝞛𝞜𝞝𝞞𝞟𝞠𝞡𝞢𝞣𝞤𝞥𝞦𝞧𝞨𝞩𝞪𝞫𝞬𝞭𝞮𝞯𝞰𝞱𝞲𝞳𝞴𝞵𝞶𝞷𝞸𝞹𝞺𝞻𝞼𝞽𝞾𝞿𝟀𝟁𝟂𝟃𝟄𝟅𝟆𝟇𝟈𝟉𝟊𝟋
𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫𝟰𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿
℀℁℄℅℆ℇ℈℉℔℟℣℩Ⅎℹ℺ℼℽℾℿ⅁⅂⅃⅄ⅅⅆⅇⅈⅉ⅊⅍ⅎ⅏⟀⟁⟃⟄⟇⟈⟉⟊⟌⟎⟏⟐⟑⟒⟓⟔⟕⟖⟗⟘⟙⟚⟛⟜⟝⟞⟟⟠⟡⟢⟣⟤⟥⟬⟭⦀⦁⦂⦅⦆⦇⦈⦉⦊⦋⦌⦍⦎⦏⦐⦑⦒⦓⦔⦕⦖⦗⦘⦙⦚⦛⦜⦝⦞⦟⦠⦡⦢⦣⦤⦥
⦦⦧⦨⦩⦪⦫⦬⦭⦮⦯⦰⦱⦲⦳⦴⦵⦶⦷⦸⦹⦺⦻⦼⦽⦾⦿⧀⧁⧂⧃⧄⧅⧆⧇⧈⧉⧊⧋⧌⧍⧎⧐⧑⧒⧓⧔⧕⧖⧗⧘⧙⧚⧛⧜⧝⧞⧟⧠⧡⧢⧣⧤⧥⧦⧧⧨⧩⧪⧫⧬⧭⧮⧯⧰⧱⧲⧳
⧴⧵⧶⧷⧹⧺⧻⧼⧽⧾⧿⌁⌂⌃⌄⌅⌆⌇⌌⌍⌎⌏⌐⌑⌒⌓⌔⌕⌖⌗⌘⌙⌚⌛⌠⌡⌤⌥⌦⌧⌨〈〉⌫⌬⌭⌮⌯⌰⌱⌲⌳⌴⌵⌷⌸⌹⌺⌻⌼⌽⌾⌿⍀⍁⍂⍃⍄⍅⍆⍇⍈⍉⍊⍋⍌⍍⍎⍏⍐⍑⍒⍓⍔⍕⍖⍗⍘⍙⍚⍛⍜⍝⍞⍠⍡⍢
⍣⍤⍥⍦⍧⍨⍩⍪⍫⍬⍭⍮⍯⍰⍱⍲⍳⍴⍵⍶⍷⍸⍹⍺⍻⍼⍽⍾⍿⎀⎁⎂⎃⎄⎅⎆⎇⎈⎉⎊⎋⎌⎍⎎⎏⎐⎑⎒⎓⎔⎕⎖⎗⎘⎙⎚⎛⎜⎝⎞⎟⎠⎡⎢⎣⎤⎥⎦⎧⎨⎩⎪⎫⎬⎭⎮⎯⎰⎱⎲⎳⎴⎵⎶⎷⎸⎹⎺⎻⎼⎽⎾⎿⏀⏁⏂⏃⏄⏅
⏆⏇⏈⏉⏊⏋⏌⏍⏎⏏⏐⏑⏒⏓⏔⏕⏖⏗⏘⏙⏚⏛⏜⏝⏞⏟⏠⏡⏢⏣⏤⏥⏦⏧⏨⏪⏫⏬⏭⏮⏯⏰⏱⏲⏴⏵⏶⏷⏸⏹⏺⏻⏼⏽⏾⏿▤▥▦▧▨▩▫▮▯▲▶►▻◄◅◉◊◍◐◑◒◓◔
◕◖◗◘◙◚◛◜◝◞◟◠◡◣◤◧◨◩◪◬◭◮◰◱◲◳◴◵◶◷◸◺◻◼◽↲↳↴↵↸↹⇜⇞⇟⇠⇡⇢⇣⇤⇥⇦⇧⇩⇪⇫⇬⇭⇮⇯⇱⇲⇳⇴⇷⇸⇹⇺⇻⇼⇽⇾⇿⟲⟳⟴⟸⟺⟻⟼⟽⟾⟿⤀⤁⤂⤃⤄
⤅⤆⤇⤈⤉⤊⤋⤌⤍⤎⤏⤐⤑⤒⤓⤔⤕⤖⤗⤘⤙⤚⤛⤜⤝⤞⤟⤠⤡⤢⤣⤤⤥⤦⤧⤨⤩⤪⤫⤬⤭⤮⤯⤰⤱⤲⤴⤵⤶⤷⤸⤹⤺⤻⤼⤽⤾⤿⥀⥁⥂⥃⥄⥅⥆⥇⥈⥉⥊⥋⥌⥍
⥎⥏⥐⥑⥒⥓⥔⥕⥖⥗⥘⥙⥚⥛⥜⥝⥞⥟⥠⥡⥢⥣⥦⥧⥨⥩⥪⥫⥬⥭⥮⥯⥰⥱⥲⥳⥴⥵⥶⥷⥸⥹⥺⥻⥼⥽⥾⥿⬀⬁⬂⬃⬄⬅⬆⬇⬈⬉⬊⬋⬌⬍⬎⬏⬐⬑⬒⬓⬔⬕⬖⬗⬘⬙⬚⬛
⬜⬞⬟⬠⬡⬢⬣⬤⬥⬦⬧⬨⬩⬪⬫⬬⬭⬮⬯⬰⬱⬲⬳⬴⬵⬶⬷⬸⬹⬺⬻⬼⬽⬾⬿⭀⭁⭂⭃⭄⭅⭆⭇⭈⭉⭊⭋⭌⭍⭎⭏⭐⭑⭒⭓⭔⭕⭖⭗⭘⭙⭚⭛⭜⭝⭞⭟⭠⭡⭢⭣⭤⭥⭦⭧⭨⭩⭪⭫⭬⭭
⭮⭯⭰⭱⭲⭳⭶⭷⭸⭹⭺⭻⭼⭽⭾⭿⮀⮁⮂⮃⮄⮅⮆⮇⮈⮉⮊⮋⮌⮍⮎⮏⮐⮑⮒⮓⮔⮕⮗⮘⮙⮚⮛⮜⮝⮞⮟⮠⮡⮢⮣⮤⮥⮦⮧⮨⮩⮪⮫⮬⮭⮮⮯⮰⮱⮲⮳⮴⮵⮶⮷⮸⮹⮺⮻⮼⮽⮾⮿
⯀⯁⯂⯃⯄⯅⯆⯇⯈⯉⯊⯋⯌⯍⯎⯏⯐⯒⯓⯔⯕⯖⯗⯘⯙⯚⯛⯜⯝⯞⯟⯠⯡⯢⯣⯤⯥⯦⯧⯨⯩⯪⯫⯬⯭⯮⯯⯰⯱⯲⯳⯴⯵⯶⯷⯸⯹⯺⯻⯼⯽⯾
ϒϴϵ϶؆؇؈″‴☆♡﬩﹡﹢﹣﹤﹥﹦+<=>^|~←↑→↓
"

I excluded all arabic because it's really weird to change the left-to-right flow. This is the example I gave at vscode-lean4/issues/522:

infix:50 " ﷼ " => Add.add
def x :=
  2  2

This code works despite what you might thing "infix" does.

Johan Commelin (Aug 28 2024 at 13:48):

/me has RPN flashbacks

Adomas Baliuka (Aug 28 2024 at 13:49):

Johan Commelin said:

/me has RPN flashbacks

Sorry, what is RPN?

Johan Commelin (Aug 28 2024 at 13:50):

reverse polish notation. So

[prompt] 3 4 +
-- returns 7

Johan Commelin (Aug 28 2024 at 13:50):

And 3 4 2 + * will evaluate to 18

Eric Wieser (Aug 28 2024 at 14:17):

Patrick Massot said:

You don’t need a dedicated button, this is given by the keyboard layout. I use https://qwerty-lafayette.org/ which has the thin non-breakable space on Shift-space and the large one on AltGr-Shift-Space (but I don’t use it much).

Wow, the  [touche sous une voyelle] input strikes me as very unusual! Thanks for the link.

Adomas Baliuka (Aug 31 2024 at 06:36):

We have the following not-yet-whitelisted characters remaining, which I wasn't sure that they really should be there. Any opinions if they should be whitelisted or instead removed?

Mathlib/CategoryTheory/Adjunction/Triple.lean:14: unicode character ' ' (U+00a0)
Mathlib/CategoryTheory/Adjunction/Triple.lean:34: unicode character ' ' (U+00a0)
Mathlib/GroupTheory/GroupExtension/Defs.lean:25: unicode character '︎' (U+fe0e)
Mathlib/GroupTheory/GroupExtension/Defs.lean:27: unicode character '︎' (U+fe0e)
Mathlib/GroupTheory/GroupExtension/Defs.lean:27: unicode character '︎' (U+fe0e)
Mathlib/GroupTheory/GroupExtension/Defs.lean:27: unicode character '️' (U+fe0f)
Mathlib/GroupTheory/GroupExtension/Defs.lean:30: unicode character '︎' (U+fe0e)
Mathlib/GroupTheory/GroupExtension/Defs.lean:32: unicode character '︎' (U+fe0e)
Mathlib/GroupTheory/GroupExtension/Defs.lean:32: unicode character '︎' (U+fe0e)
Mathlib/GroupTheory/GroupExtension/Defs.lean:32: unicode character '️' (U+fe0f)
Mathlib/Util/Superscript.lean:60: unicode character '𐞥' (U+107a5)
Mathlib/Util/Superscript.lean:60: unicode character '' (U+a7f4)
Mathlib/Util/Superscript.lean:239: unicode character '𐞥' (U+107a5)
Mathlib/Util/Superscript.lean:239: unicode character 'ꟴ' (U+a7f4)

  • The U+00a0 (non-breaking space) will be hard to eliminate because many people use it and new PRs that get merged keep reintroducing it. Maybe that means whitelisting it?
  • U+fe0e (variation selector) also seems to be used on purpose?
  • U+107a5(Modifier Letter Small Q) and U+a7f4 (Modifier Letter Capital Q) are the only "superscript" letters that my font does not support. Does yours support it? Will anyone use these in code eventually or should we blacklist them until someone does?

Kevin Buzzard (Aug 31 2024 at 06:49):

What is non-breaking space used for in mathlib? I don't know how to search for it

Adomas Baliuka (Aug 31 2024 at 07:40):

Kevin Buzzard said:

What is non-breaking space used for in mathlib? I don't know how to search for it

Non-breaking spaces are ~ in Latex. They prevent linebreaks at that position.

It's used in comments, presumably to improve rendering on mobile phones (?). Also, apparently some French contributors use it out of habit, as discussed above.

You can use the VSCode search tool, turn on regex and search for "\u00a0".

Michael Rothgang (Aug 31 2024 at 07:43):

Well, if there's consensus that adding this lint is useful, the linter will ensure these won't get added, right? (Sometimes contributors forget about some style guide rules; that happens. Linters can be useful to enforce this all the time.)

Adomas Baliuka (Aug 31 2024 at 07:48):

What I meant was that if (as has been the case so far) removing these characters is done in separate PRs and the linter is only merged when it passes, that might take a lot of PRs removing whitespace until one gets lucky.

Jon Eugster (Aug 31 2024 at 07:59):

Didnt read the entire conv. but make sure to include \uFE0E and \uFE0F which are text-variant selector and emoji-variant selector, see lean4#5015 and #16186

or even better, only allow the correct variant of these unicode characters

Jon Eugster (Aug 31 2024 at 08:04):

and it would be amazing if the linter directly prints the right variant.I can have a go at this later if you want.

Adomas Baliuka (Aug 31 2024 at 08:11):

Jon Eugster said:

and it would be amazing if the linter directly prints the right variant.I can have a go at this later if you want.

I'm not sure what you mean here. In any case, go ahead implementing it if it's useful.

Concerning the variant selectors, they should probably only be allowed in front of emoji characters? Are the emoji characters supposed to be allowed everywhere or just in specific places in the repository that one might add to an exception list?

Michael Rothgang (Aug 31 2024 at 08:22):

Adomas Baliuka said:

What I meant was that if (as has been the case so far) removing these characters is done in separate PRs and the linter is only merged when it passes, that might take a lot of PRs removing whitespace until one gets lucky.

Sure! But I don't see a problem with combining a few selected fix-ups with landing the linter. The point is more to review 200 lines of repetitive diff elsewhere, not to be extremely strict about it.

Johan Commelin (Aug 31 2024 at 08:24):

What are the risks of allowing arbitrary unicode in docstrings?

Johan Commelin (Aug 31 2024 at 08:25):

Would it suffice to lint specifically the notation commands?

Johan Commelin (Aug 31 2024 at 08:26):

Every linter adds a potential burden to contributors, so I would like to see a bit more motivation, I think

Ruben Van de Velde (Aug 31 2024 at 08:28):

I think there's been cases before where they broke a markdown header in a confusing manner

Jon Eugster (Aug 31 2024 at 09:21):

Adomas Baliuka said:

Jon Eugster said:

and it would be amazing if the linter directly prints the right variant.I can have a go at this later if you want.

I'm not sure what you mean here. In any case, go ahead implementing it if it's useful.

Concerning the variant selectors, they should probably only be allowed in front of emoji characters? Are the emoji characters supposed to be allowed everywhere or just in specific places in the repository that one might add to an exception list?

they can be after unicode characters, but yes exactly. They are used in regular strings in core and I could imagine tactics/widgets using emojis in a similar way.

But yes, that's probably a separate list of emojis allowed

Adomas Baliuka (Sep 03 2024 at 23:57):

What do we think of "stacked" unicode characters, e.g., the normal 'e' ( U+0065) followed by '⃗ (U+20D7)? For example, the Julia programming language does allows these in variable names, even.

In Lean, on the other hand, it allows things like this (which I guess is fine to allow, people should "just not do it")

infix:50 "⃗  " => Add.add

def e := 1

#eval e 1 -- 2

Jeremy Tan (Sep 03 2024 at 23:58):

Let it be

Adomas Baliuka (Sep 04 2024 at 00:01):

Jeremy Tan said:

Let it be

I'm not sure what exactly you mean by that. I didn't propose doing anything with it, just exploring the way Lean handles the various Unicode peculiarities.

Right now, the linter would always flag the character U+20D7 because it's not in the whitelist (and that, in turn, is because it's currently not used in Mathlib and has no shortcut).

Jon Eugster (Sep 04 2024 at 00:26):

I thought I saw a topic here were there was some argument about how they were problematic, but I cant find it. All I could find is that one can apparently write é or e\u0301 (or something, untested), the latter seems troublesome.

Adomas Baliuka (Sep 04 2024 at 00:33):

Jon Eugster said:

\u0301

(Tested, yes that's the one!) This is also currently flagged by the linter (and I guess we'll keep it that way).

Jon Eugster (Sep 25 2024 at 12:18):

Coming back to this, there is a series of PRs adding unicode linters ready for review:

  • Part 1 (#17129): adds linter which checks that the invisible unicode-variant-selectors (for emojis) are only on the right unicode characters and provides automatic fixing for it.
  • Part 2 (#17131): adds linter functionality to capture bad unicode characters and automatically fix them if possible. The PR only implements this for \u00A0 (non-breaking-space) which will automatically be fixed to a regular space
  • Part 3 (#16215): Extends the linter to capture more unwanted unicode characters.

Jon Eugster (Sep 25 2024 at 12:18):

So far I understood the conversations that part 1 & 2 are uncontroversial, especially since they provide automatic fixes with lake exe lint-style --fix (or commenting bot fix style on the PR; or review dog assuming it works)

Is this true, or are there concerns regarding part 1 & 2?

Jon Eugster (Sep 25 2024 at 12:22):

Regarding part 3 it seems there is still some discussion to be had. @Adomas Baliuka's approach (based on the discussion here) is currently to have a whitelist of characters which comprises everything with a abbreviation and a few already present unicode characters. When any other unicode character is used, the warning suggests to "add this character to the whitelist".

An alternative approach might include a blacklist, although with unicode being increadible expressive, I think it might be hard to blacklist all bad (invisible, order reversing, decorators (i.e. printed on top of previous char), etc.) unicode characters.

Are there opinions on this?


Last updated: May 02 2025 at 03:31 UTC