## Stream: general

### Topic: VS Code hover and notation

#### Kevin Buzzard (Mar 15 2020 at 16:59):

I just noticed something interesting with hovering over the \gg notation in the category theory library. tensorclip.png ggclip.png compclip.png

If I hover over the tensor ⊗ symbol then I get simultaneously the keyboard shortcut for the notation and the type of the term which the notation stands for. I've noticed in the past that for some notation this doesn't work, one just gets the shortcut (which is annoying when you're trying to learn the category theory library and are trying to get on top of an infinite amount of notation). For example if you hover over the \gg ≫ symbol then you only get its keyboard shortcut, not what it means. But I noticed today that if you hover slightly to the left of the symbol then you get the type of the term! Is this intended?

#### Bryan Gin-ge Chen (Mar 15 2020 at 17:04):

My guess is that VS Code (or vscode-lean) sees the \gg symbol as multiple characters so the positions might get screwed up. Unicode strikes again!

#### Bryan Gin-ge Chen (Mar 15 2020 at 17:06):

@Gabriel Ebner any ideas on whether we could fix this?

#### Gabriel Ebner (Mar 15 2020 at 17:13):

One possible theory: the hover for the notation and the type have different ranges. Vscode picks the first range. Maybe it's also a problem that the range for the type is actually a position (I.e. length 0).

#### Kevin Buzzard (Mar 15 2020 at 17:19):

Personally I'm just really pleased that I've stumbled upon this workaround :-)

#### Gabriel Ebner (Mar 15 2020 at 17:30):

This is a bit weird. I see both hovers in functor.lean no matter which parts of the >> I'm on, but in monoidal/category.lean I have the problem for both (X) and >>.

#### Gabriel Ebner (Mar 15 2020 at 17:31):

Ah, ah, oh no, arrgh.

#### Gabriel Ebner (Mar 15 2020 at 17:32):

Of course it's a Unicode issue. The hover works as expected for the first symbol on a line, but fails on the following ones.

#### Gabriel Ebner (Mar 15 2020 at 17:34):

And this makes it obvious what the bug is: Lean counts ⊗ as one character, vscode counts it as two characters. So for the first symbol in a line the two counts coincide and hence the notation and the type hovers agree on the column. For the second symbol, they are off by one character.

#### Mario Carneiro (Mar 15 2020 at 17:37):

is there a more permanent way to fix this? like making lean respect vscode counting conventions

#### Gabriel Ebner (Mar 15 2020 at 17:37):

Bryan Gin-ge Chen said:

Gabriel Ebner any ideas on whether we could fix this?

We need to translate the UTF-16 positions used in vscode to the Unicode codepoint positions used in Lean, and back. So when we receive positions from Lean, we need to translate to UTF-16. And the other way around when we ask lean something.

#### Bryan Gin-ge Chen (Mar 15 2020 at 17:38):

Would it be better to do this in Lean itself or should we write something in lean-client-js?

#### Gabriel Ebner (Mar 15 2020 at 17:41):

like making lean respect vscode counting conventions
Would it be better to do this in Lean itself

I don't believe it is worth the effort to change Lean to internally store positions in UTF-16. (It also seems just wrong to me.)

We can put the translation code between the two "formats" on either the Lean side or the javascript side. I don't know what will happen in 4, and for now it's probably easier to do in javascript. Whether we do this in the lean-client-js library or in the vscode extension itself, I don't have a strong opinion on.

#### Patrick Massot (Mar 15 2020 at 17:43):

I think there will be more and more ways of interacting with Lean besides the VScode extension (format_lean, leancrawler, machine learning, new pedagogical tools...) so it would be nice to fix the problem close to the Lean end, unless VS code is clearly wrong and Lean is right.

#### Bryan Gin-ge Chen (Mar 15 2020 at 17:45):

OK, maybe lean-client-js is the right place then. I wasn't thinking of making any very deep changes to Lean, just adding some functions in src/shell/server.cpp to tweak the positions in the JSON output.

#### Patrick Massot (Mar 15 2020 at 17:46):

I think everything using python to talk to Lean currently bypasses lean-client-js.

#### Bryan Gin-ge Chen (Mar 15 2020 at 17:47):

Oh, that's a good argument in favor of making changes to the Lean server then.

#### Gabriel Ebner (Mar 15 2020 at 17:47):

In this case I think it's really vscode that is weird. I think only javascript (and java) count 𝟙 as two characters. Most other languages (including python for example) count it as one.

#### Bryan Gin-ge Chen (Mar 15 2020 at 17:51):

Oh, I get it. Yes, I retract my earlier statements.

#### Gabriel Ebner (Mar 15 2020 at 17:51):

I don't think that we should change this in the Lean server. This just shifts the problem, then python tools will break. I still think the easiest fix is in the vscode extension or lean-client-js.

#### Gabriel Ebner (Mar 15 2020 at 17:53):

@Sebastian Ullrich Any executive decisions on string length in Lean 4? UTF-8 byte count, UTF-16 code units, Unicode codepoints, graphemes?

#### Sebastian Ullrich (Mar 15 2020 at 17:53):

UTF-8 byte count + Unicode codepoints

#### Gabriel Ebner (Mar 15 2020 at 17:59):

Just to clarify: I made an error. The character ⊗ is represented as a single UTF-16 code unit and is hence fine. However 𝟙 requires two UTF-16 code units and therefore causes the issue at hand.

#### Mario Carneiro (Mar 15 2020 at 18:02):

why does unicode codepoint count matter?

#### Mario Carneiro (Mar 15 2020 at 18:03):

I see byte count for lean + utf-16 codepoints for vscode/js

#### Mario Carneiro (Mar 15 2020 at 18:05):

the nature of the breakage seems to suggest that lean is sending utf-8 codepoint count to vscode

#### Gabriel Ebner (Mar 15 2020 at 18:05):

I dunno. Some people probably expect example : "𝟙".length = 1 := rfl to work.

#### Mario Carneiro (Mar 15 2020 at 18:05):

It should, but that's another matter

#### Mario Carneiro (Mar 15 2020 at 18:05):

That's just the behavior of the lean string.length function

#### Gabriel Ebner (Mar 15 2020 at 18:05):

There is no such thing as "UTF-8 codepoint". There are Unicode codepoints, and there are UTF-8 bytes.

#### Mario Carneiro (Mar 15 2020 at 18:06):

I don't know what terminology to use here

#### Gabriel Ebner (Mar 15 2020 at 18:06):

Lean sends (number of) Unicode codepoints to vscode. In vscode we interpret this number as a number of UTF-16 code units.

#### Mario Carneiro (Mar 15 2020 at 18:06):

I mean the thing where 𝟙 has 1 unicode codepoint, two utf-16 code units(?) and three bytes

#### Gabriel Ebner (Mar 15 2020 at 18:07):

Yes, that's the terminology. But it's four bytes in UTF-8.

#### Mario Carneiro (Mar 15 2020 at 18:07):

why doesn't lean send byte counts via its RPC and client-js converts to utf-16

#### Gabriel Ebner (Mar 15 2020 at 18:09):

Because that would be just as awkward and Lean doesn't store the byte count anywhere? Internally, Lean 3 only stores the (line, column) pair that you see in the error messages on the command line. This column` is calculated using Unicode code points.

#### Mario Carneiro (Mar 15 2020 at 18:09):

oh, I thought it was bytes internally

#### Kevin Buzzard (Mar 15 2020 at 18:27):

I'm glad category theory has found this practical application to coding.

Last updated: May 11 2021 at 00:31 UTC