Zulip Chat Archive

Stream: general

Topic: Unicode symbols in Loogle


Jeremy Kahn (Apr 14 2024 at 10:24):

One thing I've noticed using loogle is that I often want to search for expressions using special characters, as in _ \in _ or _ \le _ or _\inv. Is there any mechanism for including such characters in loogle? Failing that, is there an easy way to look up the actual (ASCII) function for a special character?

Kevin Buzzard (Apr 14 2024 at 10:39):

I just write stuff in VS Code and then copy-paste.

Joachim Breitner (Apr 14 2024 at 10:49):

Or live.lean-lang.org if you don't have a browser.
I doubt it's easy to add that functionality to the Loogle webpage. One argument in favor of vscode-integrated Loogle search, so you might try that extension!

Yaël Dillies (Apr 14 2024 at 10:51):

You might want to try https://github.com/arthurpaulino/chrome-lean-unicode

Jeremy Kahn (Apr 14 2024 at 10:52):

Yes, @Kevin Buzzard 's suggestion is pure genius! I don't know why I didn't think of that. Maybe it's a matter of learning to look for workarounds. :laughing:

Joachim Breitner (Apr 14 2024 at 10:53):

Oh, nice. @Arthur Paulino , do you think you could make a variant of that that websites could embed and enable for specific text input fields?

Ruben Van de Velde (Apr 14 2024 at 11:05):

Also you can spell out Inv.inv, for example

Jeremy Kahn (Apr 14 2024 at 11:07):

Is there a nice way to look up the official name for a function from its unicode infix or suffix?

Jeremy Kahn (Apr 14 2024 at 11:10):

@Yaël Dillies @Arthur Paulino I installed the extension, and now I can get ∈ and α in the Zulip chat. I added "https://loogle.lean-lang.org" to the list of covered websites, and the icon appears as green, but it doesn't seem to work in the loogle entry field.

Joachim Breitner (Apr 14 2024 at 11:10):

Would it be useful if the lean elaborator simply understood \in just like corresponding symbol, so that ASCII input works? Would require the parser or elaborator to know the list of abbreviations, and is a bit of a layering violation, but would be useful in a few places (e.g. on the command line). (Not sure, just thinking out loudly)

Joachim Breitner (Apr 14 2024 at 11:11):

Maybe the extension only works in text areas, not input fields?

Mario Carneiro (Apr 14 2024 at 11:15):

the leader key isn't always \

Mario Carneiro (Apr 14 2024 at 11:16):

plus, that would cause problems for string escapes like \n (which I always have to convince the extension to not replace with ¬)

Mario Carneiro (Apr 14 2024 at 11:18):

Jeremy Kahn said:

Is there a nice way to look up the official name for a function from its unicode infix or suffix?

If you just type #check x \in y or similar in the editor you can ctrl-click on the symbol to go to the definition

Mario Carneiro (Apr 14 2024 at 11:19):

ditto if you see it in existing code in either docgen or mathlib

Shreyas Srinivas (Apr 14 2024 at 19:40):

I tried adding that feature in the extension. But it is painful to get it right

Shreyas Srinivas (Apr 14 2024 at 19:41):

On the positive side, I do the copy paste thing easily because I am already in vscode

Shreyas Srinivas (Apr 14 2024 at 19:42):

If someone's willing to contribute this feature, PRs to the loogle extension repo are welcome

Arthur Paulino (Apr 15 2024 at 01:52):

Sorry, I don't have the time to debug :sad:

Jeremy Kahn (Apr 17 2024 at 14:40):

@Joachim Breitner I do think it would be helpful for loogle to understand escape sequences in the input field. But not really necessary given the cut-and-paste option and other workarounds (including the loogle bot in Zulip).

Joachim Breitner (Apr 17 2024 at 15:42):

I fully agree! I think the best course of action is if someone can generalize https://github.com/arthurpaulino/chrome-lean-unicode to a library that a website can embed, then loogle can use it.

Kenny Lau (Jul 13 2025 at 08:56):

I would like to make a bug report, if you go to https://loogle.lean-lang.org and type \|- Fact (Nat.Prime 2), then you'll find the result, but then if you go back and delete and then type it again (\|- ), then you'll find <input>:1:1: expected token because somehow a NBSP (U+00A0) was inserted

Joachim Breitner (Jul 13 2025 at 09:02):

Hmm, couldn’t reproduce with Firefox on Linux

Alexander Bentkamp (Jul 13 2025 at 09:33):

I think you could probably use this: https://github.com/leanprover/vscode-lean4/tree/master/lean4-unicode-input-component

Joachim Breitner (Jul 13 2025 at 09:54):

Loogle is using https://www.npmjs.com/package/@leanprover/unicode-input-component (but an old version it seems)

Eric Wieser (Jul 13 2025 at 09:56):

I get this nbsp bug in about 50% of my loogle queries (chrome on windows)

Yaël Dillies (Jul 13 2025 at 09:57):

I get it always when copy-pasting from somewhere else. I've learned to delete all dots and writing them back

Eric Wieser (Jul 13 2025 at 09:57):

by dots do you mean spaces?

Yaël Dillies (Jul 13 2025 at 09:57):

Yes, sorry

Joachim Breitner (Jul 13 2025 at 09:58):

I don’t expect updating will change anything around that:
https://github.com/leanprover/vscode-lean4/commits/master/lean4-unicode-input

I know about an issue with pasting NBSP, but @Kenny Lau describes typing, so it's probably different?

Joscha Mennicken (Jul 13 2025 at 11:12):

Can't reproduce in Firefox, but can reproduce in Chromium:

  1. Type or paste Fact (Nat.Prime 2) into the text box
  2. Move cursor to the beginning
  3. Type \|-<spacebar>
  4. The space will be a &nbsp;

As far as I can tell, Chromium turns a trailing space in the contexteditable div into a &nbsp;. Typing further characters after the &nbsp; turns it into a normal space again. However, this doesn't always work properly. Firefox, on the other hand, doesn't use &nbsp; in the first place. You can observe this by inspecting the contenteditable div in the dev tools and seeing it update as you type.

Maybe the reason for the in-line &nbsp; is the following:

  1. User starts typing unicode input with \
  2. The unicode input characters are wrapped in a <ul>
  3. When the user enters a space to complete the input sequence, it's a trailing space inside the <ul> element, so chrome turns it into a &nbsp;
  4. However, the unicode input characters are now replaced with the corresponding unicode character, and the <ul> removed
  5. The &nbsp; at the end of the <ul> now finds itself in the middle of the original <div>, but is not promoted into a space because that only happens when the user inserts or removes a character

Joscha Mennicken (Jul 13 2025 at 11:14):

In short, inserting a unicode character using \ before any existing text will trigger the bug, and continuing to type immediately after will remove the NBSP again.

Aaron Liu (Jul 13 2025 at 13:24):

is this why loogle randomly rejects my input sometimes?

Jz Pan (Jul 13 2025 at 16:22):

My suggestion is, in the preprocess code of Loogle, just replace every Unicode character categorized as space (including but not limited to \r, \n, \t) to \x20.

Eric Wieser (Jul 13 2025 at 16:23):

I think a weaker approach of

doLoogle origStr <|> (do doLoogle (replaceNbsp origStr); log "warning: ignoring nbsp"

would be safer

Eric Wieser (Jul 13 2025 at 16:24):

I don't think there's any need to replace any other space characters

Eric Wieser (Jul 13 2025 at 16:25):

Joscha Mennicken said:

but can reproduce in Chromium:

This is https://issues.chromium.org/issues/41067523

Joachim Breitner (Jul 14 2025 at 14:31):

Somehow this rings a bell. I vaguely remember that @Marc Huisinga dealt with this at some point?

Joachim Breitner (Jul 14 2025 at 14:32):

Jz Pan schrieb:

My suggestion is, in the preprocess code of Loogle, just replace every Unicode character categorized as space (including but not limited to \r, \n, \t) to \x20.

This may be the easiest, maybe somewhere near
https://github.com/nomeata/loogle/blob/e086424b48662fc2473ebe697099f710ff4b737e/server.py#L302

Jireh Loreaux (Jul 14 2025 at 14:45):

Eric Wieser said:

I get this nbsp bug in about 50% of my loogle queries (chrome on windows)

and I thought this was just me doing something wrong :face_palm:

Kenny Lau (Jul 15 2025 at 13:15):

I have another similar bug report (this time might not be only on Chromium!):

  1. Copy the string "Nat.Prime,  ZMod" (with two spaces in the middle)
  2. Go to https://loogle.lean-lang.org
  3. Paste the string and search
  4. Notice that how it says <input>:1:10: expected token, and that the URL magically now contains %C2%A0 (nbsp) again

Jz Pan (Jul 15 2025 at 13:21):

Joachim Breitner said:

Jz Pan schrieb:

My suggestion is, in the preprocess code of Loogle, just replace every Unicode character categorized as space (including but not limited to \r, \n, \t) to \x20.

This may be the easiest, maybe somewhere near
https://github.com/nomeata/loogle/blob/e086424b48662fc2473ebe697099f710ff4b737e/server.py#L302

Maybe we can just do it on the client side using js

Joachim Breitner (Jul 15 2025 at 13:21):

I’ll just do it in server.py, or anything wrong with that?

Jz Pan (Jul 15 2025 at 13:25):

Joachim Breitner said:

I’ll just do it in server.py, or anything wrong with that?

That's also OK but you need to deploy it, which is a little bit more involved IMHO

Joachim Breitner (Jul 15 2025 at 13:59):

I’d have to deploy the frontend as well, but also touch JS :-)

Joachim Breitner (Jul 15 2025 at 14:28):

Pushed a fix, mostly untested. Better now?

Kenny Lau (Jul 15 2025 at 15:47):

thanks!

Jz Pan (Jul 15 2025 at 16:21):

Thank you! It works.

Marc Huisinga (Jul 21 2025 at 11:43):

Joachim Breitner said:

Somehow this rings a bell. I vaguely remember that Marc Huisinga dealt with this at some point?

That was another issue on Firefox IIRC


Last updated: Dec 20 2025 at 21:32 UTC