Zulip Chat Archive

Stream: lean4

Topic: DocInk


Henrik Böving (Jun 20 2022 at 14:08):

I happily present:
image.png
image.png

my newest Lean4 rip off of an HTML generating tool in Python, this time alectryon.

I have two main questions on this (which are the actual reason im posting here):

  • In alectryon pygments does the syntax highlighting for us and it uses pygments CSS to reflect this in the end code, since we want to avoid dependencies on Python, how do I highlight this?
  • The types are merely given to me as boring old strings, linking these back to doc-gen's main page seems like quite of a challenge, I'd have to reelaborate them and what not, is there a way we can store InfoTree information instead of types in Aelctryon? (Should be possible to do as far as I can see from the code, I just wanted to make sure I'm not doing something impossible before I start)

CC: @Sebastian Ullrich @Niklas Bülow

Sebastian Ullrich (Jun 20 2022 at 14:14):

Awesome! I didn't think you would copy Alectryon's style that directly, you should ask @Gabriel Ebner about what we should improve there :big_smile: . But it's a good starting point of course. See also https://github.com/leanprover/lean4/commit/8f2f6bc17d11bfaac8fae7448dff4d047a187568 for some quick adjustments I did after Gabriel's feedback in the other thread.

Sebastian Ullrich (Jun 20 2022 at 14:19):

In alectryon pygments does the syntax highlighting for us and it uses pygments CSS to reflect this in the end code, since we want to avoid dependencies on Python, how do I highlight this?

Note that https://github.com/Kha/alectryon/tree/typeid (the branch the doc examples are built with) uses semantic highlighting reported by LeanInk on top of Pygments. There may be some missing highlighting such as strings and comments in the semantic highlighting, but we should probably fix this in the language server.

is there a way we can store InfoTree information instead of types in Aelctryon?

I assume you mean in LeanInk? But sounds reasonable, yes.

Henrik Böving (Jun 20 2022 at 15:13):

Sebastian Ullrich said:

Awesome! I didn't think you would copy Alectryon's style that directly, you should ask Gabriel Ebner about what we should improve there :big_smile: . But it's a good starting point of course. See also https://github.com/leanprover/lean4/commit/8f2f6bc17d11bfaac8fae7448dff4d047a187568 for some quick adjustments I did after Gabriel's feedback in the other thread.

image.png ^^

Sebastian Ullrich said:

In alectryon pygments does the syntax highlighting for us and it uses pygments CSS to reflect this in the end code, since we want to avoid dependencies on Python, how do I highlight this?

Note that https://github.com/Kha/alectryon/tree/typeid (the branch the doc examples are built with) uses semantic highlighting reported by LeanInk on top of Pygments. There may be some missing highlighting such as strings and comments in the semantic highlighting, but we should probably fix this in the language server.

is there a way we can store InfoTree information instead of types in Aelctryon?

I assume you mean in LeanInk? But sounds reasonable, yes.

exactly, I'll get to implementing this then.

Patrick Massot (Jun 20 2022 at 15:34):

Do I understand that you're somehow using LeanInk without Alectryon?

Henrik Böving (Jun 20 2022 at 16:15):

Yeah I rebuilt the HTML renderer that alectryon provides in Lean, hence the Python rip off.

Patrick Massot (Jun 20 2022 at 16:17):

Do you still use the intermediate json representation?

Mario Carneiro (Jun 20 2022 at 16:18):

you could use the semantic highlighter for doing syntax highlighting

Henrik Böving (Jun 20 2022 at 16:29):

Patrick Massot said:

Do you still use the intermediate json representation?

Yes, I went for a minimally invasive approach on LeanInk so I just taught it how to also read its data structures from JSON again to use that part as a library then I let it produce its output files, read them in and delete them afterwards (I spawn the LeanInk process as an external one due to limitations that Sebastian and I discovered with running all the analysis of mathlib in the same process).

There is a little performance concern here because we basically run elaboration twice, once with the Lean compiler from Lake during the normal olean build (doc-gen then consumes the produced .olean files) and once with LeanInk afterwards (where doc-gen consumes the JSON files), right now a full run for mathlib that explicitly stops the entire compiler from getting rendered with LeanInk (that would probably take ages) takes ~8 min on my laptop. Once we port the big mathlib this will most likely become a bigger concern but Sebastian already came up with the idea that we could maybe load LeanInk into the Lean compiler itself as a plugin and produce the .json files as normal build artifacts during the .olean build which would essentially eliminate the concern.

Henrik Böving (Jun 20 2022 at 16:30):

Mario Carneiro said:

you could use the semantic highlighter for doing syntax highlighting

If i understand Sebastian correctly that's basically what he is proposing as well right?

Henrik Böving (Jun 20 2022 at 21:34):

I deployed the minimal version to github pages now: https://leanprover-community.github.io/mathlib4_docs/ so people can look out for bugs or even use it normally^^. A few remarks on this:

  • the compiler (so Lean, Init and Std) is at the moment not rendered on purpose, rendering this mathilb alone takes 8 min already, the compiler would take ages but I'll get to that at some point
  • the highlighting is as good as it gets right now (though we can of course talk about colouring)
  • Linking back from ink to doc-gen is planned but not implemented right now since this requires some more work on both sides

If you find anything wrong apart from that please do tell!

Gabriel Ebner (Jun 21 2022 at 11:56):

This is nice. I'm surprised by how little code it took, this bodes well for future extensions and tweaks.

Gabriel Ebner (Jun 21 2022 at 11:56):

Do I want to know why <//small> is written with two slashes now?

Sebastian Ullrich (Jun 21 2022 at 14:24):

Note that Alectryon runs quite a few transformation steps after extraction: https://github.com/cpitclaudel/alectryon/blob/739b46da22d272e748f60f3efcd2989d696fba71/alectryon/transforms.py#L841-L849. Compare the "bubble" placement between https://leanprover.github.io/lean4/doc/examples/palindromes.lean.html and https://leanprover-community.github.io/mathlib4_docs/src/Mathlib/Algebra/Group/Basic.html, for example. However, we might as well fix that in LeanInk directly where possible.

Henrik Böving (Jun 21 2022 at 15:51):

Gabriel Ebner said:

Do I want to know why <//small> is written with two slashes now?

weeell, I decided some time ago to add a field to the Html datatype that tells the HTML formatter whether to format it or not so the generated HTML looks more pretty, this is what </ generates per default, however the output of alectryon is of course very very very sensitive to whitespaces and newlines so I decided to add another syntax that does explicitly not do this. I guess I could also run a recursive function over the generated Html that unsets this flag everywhere but at the time i was not sure whether I want to get rid off it everywhere. The other alternative would be to make the default false but I haven't had the time yet to see whether this affects generation of the doc HTML.

Henrik Böving (Jun 21 2022 at 18:06):

Hm, does seem to look fine, who needs pretty looking HTML when the browser shows stuff to us anyways!

Henrik Böving (Jun 21 2022 at 18:55):

Gabriel Ebner said:

Do I want to know why <//small> is written with two slashes now?

Now it's gone...it might be that during my inspection of how the now much less pretty printed HTML is generated i missed a few cosmetic things so if you notice something cosmetically weird that wasn't there before that is probably on this change.

Henrik Böving (Jul 23 2022 at 14:30):

I did start looking into the back linking things and as far as I can tell LeanInk would have to do a hard breaking change on its Alectryon JSON API if doc-gen wants to access this information. This is because right now I'm receiving every type as a single long string so that means if there is a popup like this: t1 : Nat → Test i receive the t1 and the Nat → Test as separate strings. I did consider to just parse and analyze the type strings but I don't think this will work out since they are (correctly) leaving away namespaces etc. so it would basically come down to me writing a mini elaborator if I wanted to do this.

Furthermore there is also things like #check Nat where i actually receive the whole answer as just a single string Nat : Type. similar things happen in the goal view and probably other places I didn't check yet.

So if we wanted backlinking to happen over this JSON API I'd have add changes to LeanInk that probably hard break the Alectryon API as it is right now, is this acceptable/desired? If not, what else should I do about this?

Patrick Massot (Jul 23 2022 at 14:33):

I thought you no longer relied on Alectryon itself, did I dream that?

Sebastian Ullrich (Jul 23 2022 at 14:37):

I think the point is that we want to keep Alectryon working with LeanInk (e.g. for LaTeX output) even if doc-gen4 doesn't need it anymore

Sebastian Ullrich (Jul 23 2022 at 14:43):

I don't think a breaking change would be a big issue; I wouldn't be surprised if all currently existing LeanInk CI setups are in fact by me (in which case they are properly version-pinned). There's always the option of simple producing a strict superset of the current JSON though, at the cost of some redundancy in implementation and output.

Henrik Böving (Jul 23 2022 at 14:46):

I don't rely on alectryon for what I do in doc-gen yes, but I use the JSON format that LeanInk originally provided for alectryon for the doc-gen work so if doc-gen shall do more than now it would either have to break this format or a new API would have to be built entirely

Henrik Böving (Jul 23 2022 at 20:22):

I'm a little uncertain on how to do the encode decode part here, maybe someone can help me out. Ideally I'd like to receive some docs4#Lean.Widget.CodeWithInfos from LeanInk because that is what doc-gen already uses, it seems to me like the relevant parts of LeanInk are basically operating on Exprs as well so generating the CodeWithInfos from that can be done just like in doc-gen as well. What I'm not quite sure about is how to ship them from LeanInk to doc-gen. CodeWithInfos is a TaggedText SubexprInfo where TaggedText derives FromJson and ToJson and Subexprinfo derives RpcEncoding but not the ordinary ToJson FromJson, from reading the docs of RpcEncoding it seems like it is pretty close to ToJson FromJson but has some additional magic involved with WithRpcRef which hides stuff behind some RPC reference (whatever that is, I'm guessing it means I won't get the full data but have to ask for it?). And this is where I'm a little lost because I'm not at all familiar with the LSP and RPC parts of Lean.

Basically I want: CodeWithInfos or a representation convertible to CodeWithInfos -> JSON via LeanInk and then in doc-gen JSON -> whatever representation -> CodeWithInfos to analyze further, how can I get that?

Sebastian Ullrich (Jul 23 2022 at 20:25):

I may have asked it before, but would it be easier to interface with LeanInk as a Lean library? Or do you want the JSON for caching?

Henrik Böving (Jul 23 2022 at 20:26):

I did try that but it failed horribly :D Remember when I bothered you about the repeating imports due to multiple LeanInk calls?

Henrik Böving (Jul 23 2022 at 20:28):

I guess once doc-gen is at the point where it get's called by Lake on a per file basis though this won't be an issue anymore so I could also wait for that instead of trying to solve the JSON encoding issue further.

Sebastian Ullrich (Jul 23 2022 at 20:35):

Ah yeah, I was thinking of per-file

Henrik Böving (Jul 23 2022 at 20:37):

Right, well in that case I'll wait for Mac to finish on his facets thingy and pull LeanInk back in as a library :thumbs_up:

Mac (Jul 23 2022 at 22:38):

@Henrik Böving oh. it is already finished, would you like a PR that does it for doc-gen4?

Henrik Böving (Jul 23 2022 at 22:39):

Oh, definitely yeah :thumbs_up:

Mac (Jul 23 2022 at 23:56):

@Henrik Böving while doing so I noticed that doc-gen currently produces both a Foo.bar link and a Foo.<<Bar.html>> link in the navbar. I imagine that is a bug?

Henrik Böving (Jul 23 2022 at 23:57):

Yeah, how can i reproduce that? I probably didn't parse a name properly somewhere

Henrik Böving (Jul 23 2022 at 23:59):

I remember that i did already observe this behaviour but I fixed it, the doc-gen that is deployed on docs4#Nat right now does not have this issue in the navbar so I'd be curious what you are doing.

Mac (Jul 24 2022 at 00:04):

For instance, I ran (through Lake):

LEAN_PATH=.\lean_packages\Cli\build\lib;.\lean_packages\leanInk\build\lib;.\build\lib;.\lean_packages\Unicode\build\lib;.\lean_packages\mathlib\build\lib;.\lean_packages\lake\build\lib;.\lean_packages\CMark\build\lib;c:\Users\Mac\.elan\toolchains\leanprover--lean4---nightly-2022-07-20\lib\lean .\build\bin\doc-gen4.exe single DocGen4.LeanInk

And got a DocGen4.LeanInk and a DocGen4.<<LeanInk.html>> entry on my Windows machine.

Henrik Böving (Jul 24 2022 at 00:08):

image.png I fear that's some Windows specific behaviour then /o\

Henrik Böving (Jul 24 2022 at 00:15):

https://github.com/leanprover/doc-gen4/blob/main/DocGen4/Process/Hierarchy.lean#L103-L117 this code here will be at fault. Its job should be to find all files that are not in the blacklist and end on html, then cut the html part out properly so I know all the names of modules that have to be clickable in the navbar tree, the rest is done by the Hierarchy.fromArray which definitely works correctly because it's been here for a while. It seems to me that for some reason I am not dropping the .html of a file name properly but I can't quite tell why.

Mac (Jul 24 2022 at 01:06):

@Henrik Böving Oh, I know what the problem is! It is catching the .trace files produced by the module facet as files to generate entries for.

Mac (Jul 24 2022 at 01:06):

To fix this, why not just catch the files with an html extension?

Mac (Jul 24 2022 at 01:10):

Eg. via fi entry.path.extension == some "html"

Mac (Jul 24 2022 at 01:14):

I will add such a check to my PR.

Mac (Jul 24 2022 at 01:19):

I also noticed it does not catch top-level module documentation at all.

Henrik Böving (Jul 24 2022 at 11:01):

Mac said:

Henrik Böving Oh, I know what the problem is! It is catching the .trace files produced by the module facet as files to generate entries for.

Ahhhhh, well I guess I couldn't reproduce that :p

Mac said:

I also noticed it does not catch top-level module documentation at all.

Yeah I did notice that as well while rereading the code but wanted to wait until you are done so there are no merge conflicts but I see you already fixed it.

So I guess I'll just merge now and start operating under the assumption that I'm only building modules with single on my new branch and can thus use LeanInk as a library for my stuff :thumbs_up:

Henrik Böving (Jul 26 2022 at 14:24):

Update! image.png everything that is italics (and the \N and the operators) in the goal view here correctly links back to the doc-gen page

Henrik Böving (Aug 11 2022 at 22:04):

https://github.com/leanprover-community/mathlib4/pull/365 if one of the people with commit permissions on mathlib4 want to merge this we can probably have this feature (at least in a partial version) online and much faster documentation build times on top.

Henrik Böving (Aug 18 2022 at 20:24):

https://leanprover-community.github.io/mathlib4_docs/Mathlib/Data/Array/Basic.html#Array.get_push published!


Last updated: Dec 20 2023 at 11:08 UTC