Zulip Chat Archive

Stream: lean4

Topic: Syntax highlighting for Lean 4 on GitHub


Marc Huisinga (Nov 22 2023 at 08:12):

@Eric Wieser created a PR to Linguist for the Lean 4 grammar so that we can have Lean 4 syntax highlighting on GitHub.
According to the maintainer of Linguist, there are two issues with a separate Lean 4 grammar:

  • Lean 4 still often sufficiently looks like Lean 3 through the lens of the TextMate grammar that Linguist has a hard time differentiating the two.
  • We (apparently?) do not meet the contribution requirements for new languages.

One way to resolve both of these issues is to replace the Lean 3 grammar with the Lean 4 one. The downside is that this will subtly break syntax highlighting for Lean 3 projects on GitHub. The alternative is that we invest a lot of time into updating the Lean 4 TextMate grammar so that it captures more Lean 4 specific syntax.

What do others think?

Kevin Buzzard (Nov 22 2023 at 08:16):

I'm all for slightly breaking old stuff FWIW.

Eric Wieser (Nov 22 2023 at 08:18):

It's not yet clear to me that we are going to be forced to make a decision here; another option might just be that I:

  • fix the technical issues
  • wait a few months until we do hit the usage threshold

Kevin Buzzard (Nov 22 2023 at 08:19):

I'm also quite happy to wait until sometime else makes it work

Marc Huisinga (Nov 22 2023 at 08:21):

Eric Wieser said:

It's not yet clear to me that we are going to be forced to make a decision here; another option might just be that I:

  • fix the technical issues
  • wait a few months until we do hit the usage threshold

My understanding is that Linguist's classification needs to be able to distinguish the two languages to provide the correct highlighting. The standard samples you provided in the PR seem to fail their classification test, so even if we provide other samples, I expect that Linguist may still have a hard time distinguishing the two languages in practice?

Eric Wieser (Nov 22 2023 at 08:37):

Given we have an almost perfect heuristic for determining that, I'd be surprised if it's the case

Eric Wieser (Nov 22 2023 at 08:37):

Regarding usage requirements, I updated the issue with a better search query that I think will meet their threshold

Marc Huisinga (Nov 22 2023 at 08:40):

Ah okay, I may just have misunderstood how Linguist works then based on the review comments :-)

Eric Wieser (Nov 22 2023 at 08:46):

It likely doesn't help that the existing lean 3 training data is actually lean 2

Mac Malone (Nov 22 2023 at 09:17):

I imagine one of the problems we have with the contribution requirements is that many of the major repositories have been moved under leanprover-community (and some under leanprover) which likely hurts the distinct :user/:repo combination metric mentioned in linguist#5756.

Eric Wieser (Nov 22 2023 at 10:29):

I think the issue was just that my original search query was bad, the new one (for import Mathlib, case sensitive) seems much more promising

Eric Wieser (Nov 22 2023 at 10:57):

@Marc Huisinga, do you know what's going on with https://github.com/github-linguist/linguist/pull/6616/files#r1401634263? Was leaving the language name as source.lean.markdown instead of source.lean4.markdown a deliberate choice? Or are the two literally identical? (i.e., just markdown with -/-detection)

Marc Huisinga (Nov 22 2023 at 11:01):

IIRC my thought process back then was that they would be identical, though they seem to have diverged slightly since then.

Marc Huisinga (Nov 22 2023 at 11:03):

I certainly didn't have Linguist in mind - I believe I thought that it wouldn't matter whether vscode-lean or vscode-lean4 contribute this grammar.

Marc Huisinga (Nov 22 2023 at 11:04):

Maybe we should just change the identifier to lean4 now, I don't think it makes a difference?

Marc Huisinga (Nov 22 2023 at 11:18):

PR at vscode-lean4#361. I'll leave it up for a bit to see if any of the other vscode-lean4 developers object.

Eric Wieser (Nov 22 2023 at 11:27):

I would imagine it will start diverging more once @David Thrane Christiansen starts shipping better documentation tools, so I think giving it a new name now is fine

Eric Wieser (Nov 22 2023 at 11:28):

(the alternative would be to tell linguist to ignore the file and use the lean3 one, or vice-versa)

David Thrane Christiansen (Nov 22 2023 at 11:30):

Once #lang becomes a thing, it can become quite easy to tell. But I don't think the doc tools are relevant here, because those will use the compiler to accurately highlight everything

Eric Wieser (Nov 22 2023 at 11:39):

For the sake of GitHub and other dumb editors (or even just low-resource systems where the lean server is deliberately disabled) it's nice to have some basic highlighting

Eric Wieser (Nov 22 2023 at 12:09):

Eric Wieser said:

I think the issue was just that my original search query was bad, the new one (for import Mathlib, case sensitive) seems much more promising

This has been confirmed, it sounds like we are over the usage threshold after all :tada:

Alex J. Best (Nov 22 2023 at 17:20):

+1 for just having Lean(4) as a single language (even if it makes Lean 3 look slightly odd). I think in a year's time we really won't be looking at lean 3 code often enough for it to be worth the trouble of ensuring github supports both flawlessly.

Eric Wieser (Nov 22 2023 at 22:17):

I don't think it is any trouble to support both, the resolution was for me to delete some files

Eric Wieser (Nov 22 2023 at 22:18):

The question is whether we want the language stats to distinguish the two languages

Eric Wieser (Nov 22 2023 at 22:33):

To be a little more clear, we have three options:

  1. Have a single language for highlighting, and a single language for github stats
  2. Have two languages for highlighting, and two languages for github stats
  3. Have two languages for highlighting, and one language for github stats

In my opinion 1 is the worst option, and we should only pick it if it turns out to be more work to support (which I think isn't true). The choice between 2 and 3 is partly up to github, but I think we have some power to sway it if we care. The effect would be seeing "Lean 4" and "Lean" separately in the language stats for a repo rather than "Lean"

Scott Morrison (Nov 23 2023 at 00:18):

We definitely do not want separate language stats if they would be listed as "Lean 4" and "Lean". (If it could be "Lean" and "Lean 3", then fine, whatever...)

Kevin Buzzard (Nov 23 2023 at 09:31):

Yes I never refer to "lean 4" when I'm giving talks (and I give a lot of talks). As far as I'm concerned the message should be that we're using lean (and there was this older version once).

Eric Wieser (Dec 13 2023 at 20:36):

Option 3 above is now live on github

Mac Malone (Dec 15 2023 at 00:06):

Does Linguist automatically regularly update the submodule it dervies the grammar from? For example, if keywords are added/removed from vscode-lean4, will the GitHub highlighting update accoridingly?

/cc @Eric Wieser / @Marc Huisinga

Eric Wieser (Dec 15 2023 at 00:18):

Part of their release process is to update all the submodules

Eric Wieser (Dec 15 2023 at 00:19):

Historically it seems they release every three months or so

Eric Wieser (Dec 15 2023 at 00:21):

If this question is about lake keywords, one option might be to have a dedicated lake grammar file (though you'd need to prove that there are 200 separate users publishing lake packages to meet the significance threshold for linguist); and of course you'd still be bound to the same release cycle.

Mac Malone (Dec 15 2023 at 01:49):

@Eric Wieser That sounds great! No, the question was not about Lake -- I kind of assumed that would be a more complex endeavour. It was more about the fact the the Lean grammar is currently still in flux, and I wanted to know whether we would have request the GitHub to update the module or wether we can just update vscode-lean4 and have that eventually trickle down. The fact the it is the later is great news!


Last updated: Dec 20 2023 at 11:08 UTC