Zulip Chat Archive

Stream: general

Topic: Contribution to doc-gen4?

Jz Pan (Jun 05 2024 at 23:13):

I'd like to see if I can fix or add some features to doc-gen4, for example:

generate reference page according to references.bib,
make <https://...> in the docstring work; currently they are output as-is, which produces an invalid HTML tag, and renders nothing in the browser.

I'd like to ask for some advices about the codebase of doc-gen4. How to get started with? Is the codebase complicated?

Winston Yin (尹維晨) (Jun 06 2024 at 02:52):

You can check out the repo here:

https://github.com/leanprover/doc-gen4

PRs can be submitted just like in mathlib repo. @Henrik Böving

Henrik Böving (Jun 06 2024 at 06:20):

You can contribute that type of stuff if you want sure. Just note that doc-gen is not a tool that's specific to mathlib, so the solutions you come up with should be generic enough to support arbitrary Lean repositories, not only mathlib.

Jz Pan (Jun 06 2024 at 10:43):

Regarding <https://...> problem, I quickly scanned the code, seems that it uses CMark.lean to parse the markdown, which calls the C library cmark to do the actual work. According to cmark website, it can process links of the form<https://...> in markdown file. So I'm confused why doc-gen4 cannot process them. Maybe because the version of cmark used in CMark.lean is too old? It's two years ago.

Henrik Böving (Jun 06 2024 at 10:44):

I don't know, I never worked on cmark.lean itself so much, if you want you can investigate whether upgrading cmark helps.

Jz Pan (Jun 06 2024 at 10:46):

Let me setup the build environment later.

Jz Pan (Jun 06 2024 at 20:13):

Bad news: the cmark library does not support tables in markdown... But do we ever need the table support? When I was using C++, I usually use doxygen for doc generation, which is also using markdown, and it supports tables.

Jz Pan (Jun 06 2024 at 20:15):

There is an issue https://github.com/leanprover/doc-gen4/issues/59 asking for a pure lean markdown processor.

Henrik Böving (Jun 06 2024 at 20:19):

This is most likely not going to happen, within some time we are probably going to switch to David's tool for rendering doc strings anyways.

Henrik Böving (Jun 06 2024 at 20:19):

Well, I mean anyone can still make it happen of course, but I won't invest time in it and whatever comes out of it will be a temporary solution is my point.

Jz Pan (Jun 06 2024 at 20:25):

Can you provide some information on David's tool?

Henrik Böving (Jun 06 2024 at 20:26):

https://github.com/leanprover/verso

Jz Pan (Jun 06 2024 at 20:33):

Thanks. So doc-gen4 will be deprecated? And how long do you think verso will be available for the doc-generation tools for mathlib docs?

Henrik Böving (Jun 06 2024 at 20:36):

No, as I said above doc-gen is going to use to verso to render doc-strings. Verso is not a replacement, it is compelentary in that we can use it to render doc-strings and also render general purpose book or manual style documentation with it. There is no official hard timeline on it

Jz Pan (Jun 06 2024 at 20:42):

I see. Thanks for the explanation.

Jz Pan (Jun 06 2024 at 23:26):

Regarding <https://...> problem, I located the code here <https://github.com/leanprover/doc-gen4/blob/c7f4ac84b973b6efd8f24ba2b006cad1b32c9c53/DocGen4/Output/DocString.lean#L211>. Seems that cmark itself has no problem, the problem is that before calling cmark the < and >, etc. are escaped, so cmark cannot see <https://...>. After modifying the returned html, the characters are unescaped, producing invalid html <https://...>.

I haven't think of a way to fix it, though.

Utensil Song (Jun 07 2024 at 02:57):

For the <url> issue, maybe just write them in a proper [title](url) is the better approach anyway.

I'm actually more excited about the reference page, and possibly you'll be interested in fixing math formulas.

Utensil Song (Jun 07 2024 at 03:17):

I was interested to work on them, but as Verso is coming, it would be too temporary. But Verso doesn't have full support for math formulas yet (leanprover/verso#54), so sadly at the moment there is no optimal way to approach this.

Ruben Van de Velde (Jun 07 2024 at 05:14):

I think <URL> is standard enough that we should support it, especially since there's no way to flag it at compile time if you get it wrong

Utensil Song (Jun 07 2024 at 05:34):

Jz Pan said:

Regarding <https://...> problem, I located the code here <https://github.com/leanprover/doc-gen4/blob/c7f4ac84b973b6efd8f24ba2b006cad1b32c9c53/DocGen4/Output/DocString.lean#L211>. Seems that cmark itself has no problem, the problem is that before calling cmark the < and >, etc. are escaped, so cmark cannot see <https://...>. After modifying the returned html, the characters are unescaped, producing invalid html <https://...>.

I haven't think of a way to fix it, though.

What would happen if you simply remove the Html.escape call? It seems in docstring, < and > are mostly protected in code blocks.

Utensil Song (Jun 07 2024 at 05:40):

This goes back to leanprover/doc-gen4#157 .

Utensil Song (Jun 07 2024 at 05:49):

But the example it points to doesn't use any of the escaped character at all. In principle, this should not be handled in such an ad hoc way, extra escaping would cause dilemma like this (that removing extra escaping fails some case, adding it fails some other case and there is no simple solution to fix without removing the extra escaping).

CMark.lean was designed to be a simple wrapper (so simple that it literally wraps only one C function, cmark_markdown_to_html) for cmmark which has a spec.

I would suggest to figure out what kind of cases would removing the escape fail , and fix them both in an integrated solution. If we want to support standard markdown, we need to remove extra layers to resume the standard-conforming behavior.

Utensil Song (Jun 07 2024 at 05:56):

For fixing the reference page, the spec already supports collapsed reference link, so we only need to generate the proper [foo]: /url "title"-like stuff from the bib file, and append that to the markdown passed to cmark.

For fixing the math formula, the spec has no support at all, so this indeed needs an extra layer of pass-through treatment, namely replace all math formulas to placeholders in the markdown that passed to cmark, then replace them back after recieving the result from cmark.

Jz Pan (Jun 07 2024 at 09:29):

Utensil Song said:

For fixing the reference page, the spec already supports collapsed reference link, so we only need to generate the proper [foo]: /url "title"-like stuff from the bib file, and append that to the markdown passed to cmark.

Yes, I found that the reference page could potentially be easier to work with. The list of reference page can be generated by a simple converter which converts the bib file to html or markdown (we need to handle some minimal amount of LaTeX commands). We can also get the full list of [foo]: links. A simple and ad-hoc way is just append this huge list to the markdown input before each call to CMark.renderHtml. A better way is scan the markdown input, and we only append the used references from the list to it.

Jz Pan (Jun 07 2024 at 09:32):

Utensil Song said:

But Verso doesn't have full support for math formulas yet (leanprover/verso#54)

I think neither does doc-gen4. It just (pretend to?) identify $ ... $ and leave them as is in the output html. The actual render of math formulas is by MathJax at client side.

Jz Pan (Jun 07 2024 at 09:47):

Utensil Song said:

What would happen if you simply remove the Html.escape call? It seems in docstring, < and > are mostly protected in code blocks.

I think this is reasonable. Even if unprotected < and >, cmark will escape them. The potential problem is for code blocks and LaTeX formulas, since these symbols in them will also be escaped by cmark. So the unescape code in the following processing is necessary (originally I think that was not necessary). Let me try what will be happed if I revert https://github.com/leanprover/doc-gen4/pull/157.

Utensil Song (Jun 07 2024 at 09:55):

Jz Pan said:

Utensil Song said:

But Verso doesn't have full support for math formulas yet (leanprover/verso#54)

I think neither does doc-gen4. It just (pretend to?) identify $ ... $ and leave them as is in the output html. The actual render of math formulas is by MathJax at client side.

The issue is that CMark will mess with what's inside $ pairs, e.g. _, \\ etc. these syntax overlaps between Markdown and LaTeX. so it's broken when MathJax sees the HTML. The only way to prevent this to avoid CMark seeing and touching them at all.

Jz Pan (Jun 07 2024 at 10:03):

Utensil Song said:

The only way to prevent this to avoid CMark seeing and touching them at all.

I agree.

Jz Pan (Jun 07 2024 at 10:07):

The MathJax itself also have some quirks regarding html escape <https://www.mathjax.org/#demo>:

The text you enter is actually HTML, so you can include tags if you want; but this also means you have to be careful how you use less-than signs, ampersands, and other HTML special characters within your math (surrounding them by spaces should be sufficient).

This code: $1 < 2$ vs $1 < 2$ and $1 \& 2$ vs $1 \& 2$ produces the same result, for example.

Jz Pan (Jun 07 2024 at 10:11):

Jz Pan said:

Let me try what will be happed if I revert https://github.com/leanprover/doc-gen4/pull/157.

This indeed fixes the <url> parsing:

before: <li>can you see the link <https://github.com></li>
after: <li>can you see the link <a href="https://github.com">https://github.com</a></li>

but it also introduce other problems:

before: <code>x && y</code>, is the boolean "and" operation
after: <code>x && y</code>, is the boolean "and" operation

I think the final output html should not be unescape anymore, but during internal post-processing we may need to unescape it to get some information correct.

Utensil Song (Jun 07 2024 at 10:17):

hugo-goldmark-extensions has a passthrough parser that handles this quite well, and it has many test cases. It also has the advantage that it can use the extensibility of the markdown parser and unfortunately CMark.lean exposes no access to the AST and low-level functionalities of cmark, so no extension is possible.

Maybe one day, in Lean ecosystem, there would be a Markdown parser based on other extensible Markdown parsers, maybe a port of markdown-it etc.

Utensil Song (Jun 07 2024 at 10:20):

Jz Pan said:

Jz Pan said:

Let me try what will be happed if I revert https://github.com/leanprover/doc-gen4/pull/157.

I think the final output html should not be unescape anymore, but during internal post-processing we may need to unescape it to get some information correct.

Maybe move the esaping to be after the cmark call? Then the final HTML is still escaped, but cmark would not need to see and be interfered by the escaping.

Eric Wieser (Jun 07 2024 at 13:01):

Jz Pan said:

after: <code>x && y</code>, is the boolean "and" operation

This is surely a bug in cmark? It should html escape code blocks for sure.

Jz Pan (Jun 07 2024 at 13:09):

cmark itself has no bug. But in our lean code we unescape the return value of cmark afterwards. To fix it the current logic need to be changed. I will try to figure it out later.

Eric Wieser (Jun 07 2024 at 13:16):

Yes, we certainly should not be either escaping before or escaping after

Eric Wieser (Jun 07 2024 at 13:42):

I think something like this might be right?

partial def elementToHTML : Element → Html
  | .Element n attr cont => .element n false attr.toList.toArray <|
    cont.filterMap fun
      | .Element e => elementToHTML e
      | .Comment c => none
      | .Character (content : String) => some <| .text content

/-- Convert docstring to Html. -/
def docStringToHtml (s : String) : HtmlM (Array Html) := do
  let rendered := CMark.renderHtml s
  match manyDocument rendered.mkIterator with
  | Parsec.ParseResult.success _ res =>
    res.mapM fun x => do return elementToHTML (← modifyElement x)
  | _ => return #[Html.text rendered]

Utensil Song (Jun 07 2024 at 13:44):

Need to rerun agaist mathlib docs and do eye-ball examination to be sure. :joy:

Jz Pan (Jun 07 2024 at 23:25):

Utensil Song said:

do eye-ball examination

You can use some diff tool working on directories. Unfortunately, I'm not free today; I'll try it on Sunday.

Eric Wieser (Jun 07 2024 at 23:55):

My comment above is wrong; HTML.text I think is misnamed, I think it is actually indeed to represent raw html strings

Jz Pan (Jun 08 2024 at 23:29):

I'm still studying the exact function of this code:

  match manyDocument rendered.mkIterator with
  | Parsec.ParseResult.success _ res =>

Seems that manyDocument is a combination of Lean's built-in XML parser elements. I suspect that the strings in the return value res are already unescaped by the built-in parser. But I need to check if it's indeed the case.

Jz Pan (Jun 09 2024 at 00:19):

I use the following stupid code to try to understand what is happening:

/-- Convert docstring to Html. -/
def docStringToHtml (s : String) : HtmlM (Array Html) := do
  -- I don't know why `Html.escape` is necessary; it breaks `<url>` parsing
  -- let escaped := Html.escape s
  -- let rendered := CMark.renderHtml escaped
  let rendered := CMark.renderHtml s
  let debug_test := s.startsWith "debug_test"
  if debug_test then do
    dbg_trace "DEBUG: s = {s}"
    -- dbg_trace "DEBUG: escaped = {escaped}"
    dbg_trace "DEBUG: rendered = {rendered}"
  match manyDocument rendered.mkIterator with
  | Parsec.ParseResult.success _ res =>
    res.mapM fun x => do
      let before := toString x
      let after := toString (← modifyElement x)
      if debug_test then do
        dbg_trace "DEBUG: before = {before}"
        dbg_trace "DEBUG: after = {after}"
      return Html.text after
  | _ => return #[Html.text rendered]

I got the following debug output:

DEBUG: s = debug_test

# Try CommonMark

You can try CommonMark here.  This dingus is powered by
[commonmark.js](https://github.com/commonmark/commonmark.js), the
JavaScript reference implementation.

1. item one
2. item two
   - sublist
   - sublist
3. can you see the link https://github.com
5. can you see the link <https://github.com>

[Link](http://a.com)

[Link][1]

[1]: http://b.org

[GTM106]

[Link][GTM106]

[GTM106]: http://b.org/GTM106

$y^2=x^3-x$

## Main definitions

- `hello`: world
- `< > " &`

< > " &


DEBUG: rendered = <p>debug_test</p>
<h1>Try CommonMark</h1>
<p>You can try CommonMark here.  This dingus is powered by
<a href="https://github.com/commonmark/commonmark.js">commonmark.js</a>, the
JavaScript reference implementation.</p>
<ol>
<li>item one</li>
<li>item two
<ul>
<li>sublist</li>
<li>sublist</li>
</ul>
</li>
<li>can you see the link https://github.com</li>
<li>can you see the link <a href="https://github.com">https://github.com</a></li>
</ol>
<p><a href="http://a.com">Link</a></p>
<p><a href="http://b.org">Link</a></p>
<p><a href="http://b.org/GTM106">GTM106</a></p>
<p><a href="http://b.org/GTM106">Link</a></p>
<p>$y^2=x^3-x$</p>
<h2>Main definitions</h2>
<ul>
<li><code>hello</code>: world</li>
<li><code>&lt; &gt; &quot; &amp;</code></li>
</ul>
<p>&lt; &gt; &quot; &amp;</p>

DEBUG: before = <p>debug_test</p>
DEBUG: after = <p>debug_test</p>
DEBUG: before = <h1>Try CommonMark</h1>
DEBUG: after = <h1 class="markdown-heading" id="Try-CommonMark">Try CommonMark <a class="hover-link" href="#Try-CommonMark">#</a></h1>
DEBUG: before = <p>You can try CommonMark here.  This dingus is powered by
<a href="https://github.com/commonmark/commonmark.js">commonmark.js</a>, the
JavaScript reference implementation.</p>
DEBUG: after = <p>You can try CommonMark here.  This dingus is powered by
<a href="https://github.com/commonmark/commonmark.js">commonmark.js</a>, the
JavaScript reference implementation.</p>
DEBUG: before = <ol>
<li>item one</li>
<li>item two
<ul>
<li>sublist</li>
<li>sublist</li>
</ul>
</li>
<li>can you see the link https://github.com</li>
<li>can you see the link <a href="https://github.com">https://github.com</a></li>
</ol>
DEBUG: after = <ol>
<li>item one</li>
<li>item two
<ul>
<li>sublist</li>
<li>sublist</li>
</ul>
</li>
<li>can you see the link https://github.com</li>
<li>can you see the link <a href="https://github.com">https://github.com</a></li>
</ol>
DEBUG: before = <p><a href="http://a.com">Link</a></p>
DEBUG: after = <p><a href="http://a.com">Link</a></p>
DEBUG: before = <p><a href="http://b.org">Link</a></p>
DEBUG: after = <p><a href="http://b.org">Link</a></p>
DEBUG: before = <p><a href="http://b.org/GTM106">GTM106</a></p>
DEBUG: after = <p><a href="http://b.org/GTM106">GTM106</a></p>
DEBUG: before = <p><a href="http://b.org/GTM106">Link</a></p>
DEBUG: after = <p><a href="http://b.org/GTM106">Link</a></p>
DEBUG: before = <p>$y^2=x^3-x$</p>
DEBUG: after = <p>$y^2=x^3-x$</p>
DEBUG: before = <h2>Main definitions</h2>
DEBUG: after = <h2 class="markdown-heading" id="Main-definitions">Main definitions <a class="hover-link" href="#Main-definitions">#</a></h2>
DEBUG: before = <ul>
<li><code>hello</code>: world</li>
<li><code>< > " &</code></li>
</ul>
DEBUG: after = <ul>
<li><code><a href=".././Trydocgen/Basic.html#hello">hello</a></code>: world</li>
<li><code>< > " &</code></li>
</ul>
DEBUG: before = <p>< > " &</p>
DEBUG: after = <p>< > " &</p>

Jz Pan (Jun 09 2024 at 00:22):

Seems that calling toString of Lean.Xml.Element is not appropriate here; it does not do any escaping on its values.

Jz Pan (Jun 09 2024 at 00:42):

I think this code works :big_smile::

private def _root_.Lean.Xml.Attributes.toStringEscaped (as : Attributes) : String :=
  as.fold (fun s n v => s ++ s!" {n}=\"{Html.escape v}\"") ""

mutual

private partial def _root_.Lean.Xml.eToStringEscaped : Element → String
| Element.Element n a c => s!"<{n}{a.toStringEscaped}>{c.map cToStringEscaped |>.foldl (· ++ ·) ""}</{n}>"

private partial def _root_.Lean.Xml.cToStringEscaped : Content → String
| Content.Element e => eToStringEscaped e
| Content.Comment c => s!"<!--{c}-->"
| Content.Character c => Html.escape c

end

/-- Convert docstring to Html. -/
def docStringToHtml (s : String) : HtmlM (Array Html) := do
  -- I don't know why `Html.escape` is necessary; it breaks `<url>` parsing
  -- let escaped := Html.escape s
  -- let rendered := CMark.renderHtml escaped
  let rendered := CMark.renderHtml s
  let debug_test := s.startsWith "debug_test"
  if debug_test then do
    dbg_trace "DEBUG: s = {s}"
    -- dbg_trace "DEBUG: escaped = {escaped}"
    dbg_trace "DEBUG: rendered = {rendered}"
  match manyDocument rendered.mkIterator with
  | Parsec.ParseResult.success _ res =>
    res.mapM fun x => do
      let before := eToStringEscaped x
      let after := eToStringEscaped (← modifyElement x)
      if debug_test then do
        dbg_trace "DEBUG: before = {before}"
        dbg_trace "DEBUG: after = {after}"
      return Html.text after
  | _ => return #[Html.text rendered]

Jz Pan (Jun 09 2024 at 00:45):

This also changes some docs of built-in functions, for example

before: "outside in", meaning that it will first look at the expression <code>_ + _ : <a href=".././Init/Data/Int/Basic.html#Int">Int</a></code>
after: "outside in", meaning that it will first look at the expression <code>_ + _ : <a href=".././Init/Data/Int/Basic.html#Int">Int</a></code>

Strictly speaking, the old output is not correct, as ideally the " in html should also be escaped.

Ruben Van de Velde (Jun 09 2024 at 06:01):

There's no reason to escape " outside attribute values in HTML, but it's harmless

Eric Wieser (Jun 09 2024 at 06:28):

I think the missing escaping in docs#Lean.Xml.instToStringElement is a bug

Eric Wieser (Jun 09 2024 at 06:29):

We should file an issue (or make a PR) in the issue tracker (edit: https://github.com/leanprover/lean4/issues/4411)

Jz Pan (Jun 09 2024 at 10:32):

Eric Wieser said:

I think the missing escaping in docs#Lean.Xml.instToStringElement is a bug

I agree. But when I came to the definition there were no any documentation describing it. So I was not sure if it's intended print without escaping.

Jz Pan (Jun 09 2024 at 10:42):

I'm not sure if there are codes which relies on the unescaped behavior of docs#Lean.Xml.instToStringElement, though.

Ruben Van de Velde (Jun 09 2024 at 10:44):

I'm a bit nervous about generating HTML with functions that have XML in the name though

Jz Pan (Jun 09 2024 at 10:47):

Ruben Van de Velde said:

I'm a bit nervous about generating HTML with functions that have XML in the name though

It's OK for me; XHTML is a form of HTML which must be also a valid XML document.

Eric Wieser (Jun 09 2024 at 10:59):

Jz Pan said:

So I was not sure if it's intended print without escaping.

If this was originally intended, then I would argue this was a design mistake that should still be fixed

Ruben Van de Velde (Jun 09 2024 at 11:06):

Yeah, but you're not serving XHTML

Eric Wieser (Jun 09 2024 at 11:08):

I think XML/XHTML/HTML concerns are much less relevant than handing escaped html to a markdown parser

Eric Wieser (Jun 09 2024 at 11:08):

In practice I think browsers are pretty forgiving about serving one as the other

Jz Pan (Jun 09 2024 at 17:33):

PR for fixing <url> processing is created as <https://github.com/leanprover/doc-gen4/pull/189>.

Eric Wieser (Jun 09 2024 at 21:19):

I guess we'll know if that worked in some number of hours?

Eric Wieser (Jun 09 2024 at 21:24):

Ah, looks like @Henrik Böving kicked off a manual build, so optimistically we'll know in 10 minutes or so

Henrik Böving (Jun 09 2024 at 21:25):

It's almost done

Eric Wieser (Jun 09 2024 at 21:29):

Did we have an example page that used the <http...> syntax to check?

Eric Wieser (Jun 09 2024 at 21:30):

Ah yes, https://leanprover-community.github.io/mathlib4_docs/Init/Conv.html#Lean.Parser.Tactic.Conv.conv which works correctly :tada:!

Henrik Böving (Jun 09 2024 at 21:31):

Great :thumbs_up:

Eric Wieser (Jun 09 2024 at 21:32):

It looks like https://github.com/leanprover/doc-gen4/issues/126 is (still?) fixed, which is also good news

Henrik Böving (Jun 09 2024 at 21:32):

Great, I'll close it

Eric Wieser (Jun 09 2024 at 21:34):

I'd argue that docgen should probably add an HTML.raw constructor alongside HTML.text, since the latter implies an HTML text node, but is actually used to carry raw HTML payloads; and there might be an avenue for HTML injection (perhaps in filenames or decl names?) here as a result

Eric Wieser (Jun 09 2024 at 21:35):

But maybe that type of cleanup is not worth it as the verso horizon approaches

Jz Pan (Jun 09 2024 at 21:51):

In my opinion, just add a docstring to HTML.text, warn the user that this means HTML text node but potentially can contain arbitrary HTML tags inside it, and the DOM model cannot access these tags.

Eric Wieser (Jun 09 2024 at 23:32):

this means HTML text node but potentially can contain arbitrary HTML tags inside it

"text node" has a prescribed meaning by the DOM standard, and it is specifically not this

Eric Wieser (Jun 09 2024 at 23:32):

Fixed in #190, which prevents:

def foo («<script>alert("hello world")</script>» : Nat) : Nat := 0

doing anything silly

Jz Pan (Jun 10 2024 at 00:37):

As for the cmark, the javascript version https://spec.commonmark.org/dingus/ happily accepts any HTML tags, while the C version, in default setting, disallows any HTML tags, including <b>, etc.

David Ang (Jun 10 2024 at 18:06):

There's a small bug with rendering LaTeX in a list, see e.g. here - any ideas?

David Ang (Jun 10 2024 at 18:07):

image.png

Furthermore, define the associated sequences $\Psi_n^{[2]}, \Phi_n \in R[X]$ by
 * $\Psi_n^{[2]} := \tilde{\Psi}_n^2\Psi_2^{[2]}$ if $n$ is even,
 * $\Psi_n^{[2]} := \tilde{\Psi}_n^2$ if $n$ is odd,
 * $\Phi_n := X\Psi_n^{[2]} - \tilde{\Psi}_{n + 1}\tilde{\Psi}_{n - 1}$ if $n$ is even, and
 * $\Phi_n := X\Psi_n^{[2]} - \tilde{\Psi}_{n + 1}\tilde{\Psi}_{n - 1}\Psi_2^{[2]}$ if $n$ is odd.

Utensil Song (Jun 10 2024 at 18:08):

It's not because they are in a list, only due to underscores.

Utensil Song (Jun 10 2024 at 18:09):

Markdown interprets underscores as emphasis, thus break the LaTeX seen by MathJax and it's discussed above.

Eric Wieser (Jun 10 2024 at 19:39):

This is https://github.com/leanprover/doc-gen4/issues/178, and the solution is either "patch cmark to handle latex", or "switch to a markdown parser that already handles LaTeX"

Eric Wieser (Jun 10 2024 at 19:40):

A simple solution would be to have doc-gen invoke the same python library that we used in lean3, via a subprocess

Eric Wieser (Jun 10 2024 at 19:40):

But this of course makes installing doc-gen harder. I think for mathlib users, the tradeoff is massively tilted, and no one cares about having local builds work; but doc-gen is an FRO product that is trying to work for everyone.

Eric Wieser (Jun 10 2024 at 19:41):

@Henrik Böving, would being able to toggle the parser (via a flag) to a subprocess-based one be acceptable to you, assuming the flag-less approach is not viable?

Jz Pan (Jun 10 2024 at 19:42):

Maybe we can consider this:

hugo-goldmark-extensions has a passthrough parser that handles this quite well, and it has many test cases. It also has the advantage that it can use the extensibility of the markdown parser and unfortunately CMark.lean exposes no access to the AST and low-level functionalities of cmark, so no extension is possible.

Eric Wieser (Jun 10 2024 at 19:43):

Switching to a go-based markdown parser probably has the same installation pains as python ones

Jz Pan (Jun 10 2024 at 19:47):

I'm not meaning switching parser; I mean maybe we can copy a little bit of code from it.

Henrik Böving (Jun 10 2024 at 19:47):

Having a flag for something as crucial like this sounds like an easy way to miss bugs to me. We would end up with two impelmentations and probably eventually forget about checking one of the two when adding a feature or something like that.

Eric Wieser (Jun 10 2024 at 19:50):

Do we expect to add any markdown features before the markdown component is replaced with verso?

Henrik Böving (Jun 10 2024 at 19:51):

I'm personally not planning on doing it, but e.g. changes like escaping etc. might end up yielding different results on different parsers no?

Eric Wieser (Jun 10 2024 at 19:51):

If the parser is correct then there is no escaping to do.

Eric Wieser (Jun 10 2024 at 19:52):

(a markdown parser's job is to output an HTML document, which is pre-escaped)

Eric Wieser (Jun 10 2024 at 19:54):

Maybe the lowest-maintainance form of the option is to let the lakefile specify an arbitrary renderMD (md : String) : IO String function that mathlib can replace with a subprocess call (which saves docgen having to deal with subprocesses)

Henrik Böving (Jun 10 2024 at 19:55):

That's not how doc-gen integrates with lake files at all. Either way, are we sure that CMark is actually incapeable of doing what we want and the issue is not just that we are massaging it in the wrong way? Their README claims they can at least output LaTeX, I would hope they can parse markdown with latex if that's the case?

Jz Pan (Jun 10 2024 at 20:04):

I'm googling for markdown parsers written in C now. The first result is <https://github.com/mity/md4c>, and the second result is <https://github.com/commonmark/cmark>, which is the one we currently uses.

Jz Pan (Jun 10 2024 at 20:05):

The md4c looks more promising than cmark, as it only contains one c file and one h file, and it claims that it supports LaTeX formulas, table, etc.

Henrik Böving (Jun 10 2024 at 20:07):

If someone wants to make bindings for that and check out of it can be integrated with doc-gen to solve our issues, sure go for it

Jz Pan (Jun 10 2024 at 20:09):

Henrik Böving said:

are we sure that CMark is actually incapeable of doing what we want

I think probably yes. CMark is a reference implementation of a minimal markdown specification. I think they are not like to add new features outside their specification...

Jz Pan (Jun 10 2024 at 20:09):

Henrik Böving said:

If someone wants to make bindings for that and check out of it can be integrated with doc-gen to solve our issues, sure go for it

I can try later.

Henrik Böving (Jun 10 2024 at 20:10):

Jz Pan said:

Henrik Böving said:

are we sure that CMark is actually incapeable of doing what we want

I think probably yes. CMark is a reference implementation of a minimal markdown specification. I think they are not like to add new features outside their specification...

I'm not saying to add new features, I'm asking if maybe we are doing something wrong with their code given that they claim to be able to at least generate LaTeX, so one might think they thought this through and actually tried a few examples

Jz Pan (Jun 10 2024 at 20:15):

Oh, I need some information on compiling C code with Lake build file. I've checked CMark.lean but is a little bit confused about it.

Henrik Böving (Jun 10 2024 at 20:18):

Well then you'll have to say what you are confused about, otherwise it's a bit hard to help :P

Jz Pan (Jun 10 2024 at 20:22):

OK let me try first.

Eric Wieser (Jun 10 2024 at 20:28):

Henrik Böving said:

I'm asking if maybe we are doing something wrong with their code given that they claim to be able to at least generate LaTeX

CMark is an implementation of CommonMark, and does not implement $math$ notation because the spec does not allow it.

Eric Wieser (Jun 10 2024 at 20:29):

So it can translate _italic_ input to \emph{italic} output in LaTeX, but there is no legal input to produce $\TeX$ as output.

Henrik Böving (Jun 10 2024 at 20:29):

Right, in that case investigating the other seems the way to go

Henrik Böving (Jun 10 2024 at 20:30):

image.png

they explicitly claim to be able

Jz Pan (Jun 10 2024 at 20:30):

Henrik Böving said:

they explicitly claim to be able

Yes, and tables, etc.

Jz Pan (Jun 10 2024 at 23:34):

I think it basically works. But there is a decision need to be made: now I have freedom to choose $ $$ or  \[ \] for formulas to be fed to MathJax. Now I use  \[ \] because I think by using this the beginning and ending marks can be distinguished. But this will introduce larger diffs compared to old library.

Jz Pan (Jun 10 2024 at 23:36):

The auto-link feature is enabled, for example:

before:

<li>can you see the link https://github.com</li>
<li>can you see the link <a href="https://github.com">https://github.com</a></li>

after:

<li>can you see the link <a href="https://github.com">https://github.com</a></li>
<li>can you see the link <a href="https://github.com">https://github.com</a></li>

Jz Pan (Jun 10 2024 at 23:41):

The LaTeX formula works correctly, for example:

before:

<p>$y^2=x^3-x$</p>
<p>${f \in \operatorname{Hom}(M, M_2) \mid f(p) \subseteq q }$</p>
<p><code>Hom(M, M₂)</code></p>
<p>Furthermore, define the associated sequences $\Psi_n^{[2]}, \Phi_n \in R[X]$ by</p><ul>
<li>$\Psi_n^{[2]} := \tilde{\Psi}_n^2\Psi_2^{[2]}$ if $n$ is even,</li>
<li>$\Psi_n^{[2]} := \tilde{\Psi}_n^2$ if $n$ is odd,</li>
<li>$\Phi_n := X\Psi_n^{[2]} - \tilde{\Psi}<em>{n + 1}\tilde{\Psi}</em>{n - 1}$ if $n$ is even, and</li>
<li>$\Phi_n := X\Psi_n^{[2]} - \tilde{\Psi}<em>{n + 1}\tilde{\Psi}</em>{n - 1}\Psi_2^{[2]}$ if $n$ is odd.</li>
</ul>

after (notice that now it uses  for delimiter, but if you think $ is better I can change them back):

<p>\(y^2=x^3-x\)</p>
<p>\(\{f \in \operatorname{Hom}(M, M_2) \mid f(p) \subseteq q \}\)</p>
<p><code>Hom(M, M₂)</code></p>
<p>Furthermore, define the associated sequences \(\Psi_n^{[2]}, \Phi_n \in R[X]\) by</p><ul>
<li>\(\Psi_n^{[2]} := \tilde{\Psi}_n^2\Psi_2^{[2]}\) if \(n\) is even,</li>
<li>\(\Psi_n^{[2]} := \tilde{\Psi}_n^2\) if \(n\) is odd,</li>
<li>\(\Phi_n := X\Psi_n^{[2]} - \tilde{\Psi}_{n + 1}\tilde{\Psi}_{n - 1}\) if \(n\) is even, and</li>
<li>\(\Phi_n := X\Psi_n^{[2]} - \tilde{\Psi}_{n + 1}\tilde{\Psi}_{n - 1}\Psi_2^{[2]}\) if \(n\) is odd.</li>
</ul>

Jz Pan (Jun 10 2024 at 23:48):

autolink also affects existing files, for example Init/Data/Int/DivMod.html

before:

<code>a % b + b * (a / b) = a</code>, unconditionally.</p><p>[t-rounding]: https://dl.acm.org/doi/pdf/10.1145/128861.128862 [theo
mod_add_div]:
https://leanprover-community.github.io/mathlib4_docs/find/?pattern=Int.mod_add_div#doc</p><p>Examples:</p><pre><code>#eval (7 : Int) / (0 : Int) -- 0

after:

<code>a % b + b * (a / b) = a</code>, unconditionally.</p><p>[t-rounding]: <a href="https://dl.acm.org/doi/pdf/10.1145/128861.128862">https://dl.acm.org/doi/pdf/10.1145/128861.128862</a> [theo
mod_add_div]:
<a href="https://leanprover-community.github.io/mathlib4_docs/find/?pattern=Int.mod_add_div#doc">https://leanprover-community.github.io/mathlib4_docs/find/?pattern=Int.mod_add_div#doc</a></p><p>Examples:</p><pre><code>#eval (7 : Int) / (0 : Int) -- 0

(by the way, I think the docstring in that file is just broken)

Jz Pan (Jun 10 2024 at 23:49):

PS: I think it's useful to add autolink for docs#XXX, just link it to find is OK.

Jz Pan (Jun 10 2024 at 23:55):

PR created as https://github.com/leanprover/doc-gen4/pull/193.

Eric Wieser (Jun 11 2024 at 00:05):

I think we should use $$ because:

Mathlib already does this (mostly)
This is more readable in vscode

Jz Pan (Jun 11 2024 at 00:11):

OK let me change it (and add an option to it, then hardcode it to enabled)

Jz Pan (Jun 11 2024 at 00:21):

There are still some minor TODOs in <https://github.com/acmepjz/md4lean/blob/main/wrapper/wrapper.c>. Perhaps the most important (but not that important, because the CMark.lean has no this feature either) thing is return Option String instead of String. Currently if the parsing failed, it returns an empty string instead.

Eric Wieser (Jun 11 2024 at 00:27):

I can address that for you now...

Eric Wieser (Jun 11 2024 at 00:42):

https://github.com/acmepjz/md4lean/pull/1

Eric Wieser (Jun 11 2024 at 01:15):

I think I misunderstood your comment about delimiters; are you talking about the input, output, or both?

Jz Pan (Jun 11 2024 at 01:46):

Eric Wieser said:

https://github.com/acmepjz/md4lean/pull/1

Thank you very much; will look at it later today.

Jz Pan (Jun 11 2024 at 01:47):

Eric Wieser said:

I think I misunderstood your comment about delimiters; are you talking about the input, output, or both?

I mean output. Ideally, for input, both $ and  should be supported, but for now it seems that md4c only support $ for input; it does not recognize .

Henrik Böving (Jun 11 2024 at 06:32):

Is there a particular reason you are shipping your own buffer instead of using a lean string directly?

Jz Pan (Jun 11 2024 at 06:55):

No, I just not familiar with the Lean string API. Will check later.

Eric Wieser (Jun 11 2024 at 08:45):

Jz Pan said:

Eric Wieser said:

I think I misunderstood your comment about delimiters; are you talking about the input, output, or both?

I mean output. Ideally, for input, both and  should be supported, but for now it seems that md4c only support for input; it does not recognize .

If it's only the output, then I don't think we care at all about the format, and you can remove your patch to md4c. Maybe the x-equation option is best, since that can be post-processed on the Lean side.

Henrik Böving (Jun 11 2024 at 19:43):

New LaTeX stuff is up

Jz Pan (Jun 11 2024 at 19:49):

There is something broken in <https://leanprover-community.github.io/mathlib4_docs/foundational_types.html>:

(a : A) -&gt; B a

I'm not sure which PR introduces this bug.

Jz Pan (Jun 11 2024 at 19:58):

It's in https://github.com/leanprover/doc-gen4/blob/main/DocGen4/Output/FoundationalTypes.lean

Henrik Böving (Jun 11 2024 at 19:59):

Would not be surprised if it's the one by Eric (that we already reverted and then fixed)?

Jz Pan (Jun 11 2024 at 20:00):

I think so. The fix is also straightforward.

Jz Pan (Jun 11 2024 at 20:04):

Just change it to <li><code>{"(a : A) -> B a"}</code>... ?

Henrik Böving (Jun 11 2024 at 20:05):

Yeah I'll do that really quick

Eric Wieser (Jun 11 2024 at 20:14):

I think -> alone might also be legal there?

Eric Wieser (Jun 11 2024 at 20:14):

Yeah, that's the easy fix right now. The right fix is teaching the parser to translate > as >, which is what JSX does.

Eric Wieser (Jun 11 2024 at 20:20):

Thanks again @Jz Pan and @Henrik Böving for pushing this through!

Jz Pan (Jun 11 2024 at 20:38):

Eric Wieser said:

I think -> alone might also be legal there?

It's legal in HTML yes. But I don't know if the custom syntax parser in JSX like this. It's possible that it panics on any < and >.

Jz Pan (Jun 12 2024 at 21:06):

I'm experimenting with autolink for docs#, see <https://github.com/acmepjz/doc-gen4/commit/577dfdc42c1ed9cd90e3b47ca7c5bc5706583bd7>. So far it only works for [link](docs#Lean.Xml.Element), but not for docs#Lean.Xml.Element in code block, which I think is because current logic treats # as a separator in code block.

Jz Pan (Jun 12 2024 at 21:09):

In fact I think it's not needed to be supported in code blocks. What about the syntax [docs#Lean.Xml.Element] just like referencing bibliography? Then this can be unified with bibliography processing code.

Eric Wieser (Jun 12 2024 at 21:27):

I think sometimes we've been using Lean/Xml/Element.lean in places where we intend to link to files

Eric Wieser (Jun 12 2024 at 21:27):

(this has the advantage that you can copy paste it into the "find file" dialog in vscode)

Jz Pan (Jun 12 2024 at 21:50):

I know that link to file is working. But usually we want to link to a definition/theorem which is not imported in this file yet, for example in <https://leanprover-community.github.io/mathlib4_docs/Mathlib/RingTheory/OrzechProperty.html> I want to say "see definition XXX" but these definitions are in downstream files, so currently doc-gen cannot see them. For now I can only link to the file where these definition lives. It's best if we can support docs# link scheme in this case.

Jz Pan (Jun 12 2024 at 21:52):

Oh, I'd like to ask that currently we support ## which just like docs# but only for the symbols already imported; is this feature actually used in mathlib docs? As I said before, seems that this only works in [link](##definition) syntax.

Jz Pan (Jun 12 2024 at 22:00):

Seems that I cannot find (## and "## in mathlib. The ## is too many, since it also serves as heading.

Jz Pan (Jun 12 2024 at 22:02):

I suggest that maybe it's better to use docs# syntax which unifies with Zulip convention.

Eric Wieser (Jun 12 2024 at 22:33):

Can doc-gen really not auto-link to things in downstream files?

Eric Wieser (Jun 12 2024 at 22:37):

Huh you're right: docs#Function.End.applyMulAction is a good example, vs docs3#function.End.apply_mul_action

Eric Wieser (Jun 12 2024 at 22:37):

I think the solution here is to fix the auto-linker, not to add a new link syntax

Henrik Böving (Jun 12 2024 at 22:39):

There is no fix for the auto linker in this. doc-gen4 analyzes all files in parallel and does so by only importing the file that it is analyzing. It has no knowledge of anything that happens downstream. All of the global information like instances etc. that you see is dynamically injected by JS code from a JSON index.

Jz Pan (Jun 12 2024 at 22:41):

Eric Wieser said:

I think the solution here is to fix the auto-linker

Yes, that's the best, just like other document generator, e.g. doxygen. It requires two-pass compilation, though, the first pass is compile whole mathlib to get a symbol table, the second pass is process all the docstrings and do auto-linking.

Eric Wieser (Jun 12 2024 at 22:54):

Yes, I was imagining a second pass. How is the sidebar assembled without one?

Henrik Böving (Jun 12 2024 at 23:01):

Its an iframe of an html file that's generated after the fact.

Jz Pan (Jun 12 2024 at 23:02):

I have an idea. Since there is find function, it must read some data file. We may ask doc-gen4 to load that data file produced by previous compilation, to serve as the global symbol table if the symbol in question is not found. This is very similar to LaTeX multi-pass compilation.

Jz Pan (Jun 12 2024 at 23:03):

Moreover, in the first pass we may ask doc-gen4 to only generate that file, suppress the generation of all other files.

Henrik Böving (Jun 12 2024 at 23:04):

If you want to spend time on that go ahead, I don't have cycles free to spend on developing stuff like this for doc-gen currently.

Eric Wieser (Jun 12 2024 at 23:05):

Are you happy to accept patches with largish design changes like this? (that obviously takes cycles too!)

Jz Pan (Jun 12 2024 at 23:05):

I think this works without huge change to current code.

Eric Wieser (Jun 12 2024 at 23:06):

I think you'll have difficulty fitting that design into lake's build graph (but am happy to be proven wrong)

Jz Pan (Jun 12 2024 at 23:08):

OK. Let's postpone this. Working on bibliography first. :joy:

Henrik Böving (Jun 12 2024 at 23:09):

You can do n step builds in lake no problem. I do see one modeling issue with lake that I wouldn't instantly know q solution to but maybe its doable.

The current code would have to be pulled apart a little bit into a Pre and post processing step. If you really want to make this clean you will also have the pleasure of deleting the iframe stuff as well as the dynamically HTML generating JS stuff so this is certainly not a trivial change.

Eric Wieser (Jun 12 2024 at 23:10):

The approach I was thinking of is something like:

A :doc_data facet that dumps some json file for every lean file, with no cross-references resolved (but maybe markdown and signatures preprocessed, to speed up the build)
A lean_docs target type alongside lean_exe, which pulls in the :doc_data files transitively and glues together all the final HTML at once

Henrik Böving (Jun 12 2024 at 23:11):

But this does not suffice because if you wish to have a global index it needs to be global for all packages that are in scope, not just Mathlib. I wouldn't know from the top of my head how to do that properly as the current command to interact with lake allows people to selectively build docs only for certain modules.

Henrik Böving (Jun 12 2024 at 23:13):

Jz Pan said:

OK. Let's postpone this. Working on bibliography first. :joy:

On that one, I was thinking a little about how to make this perform reasonably, I'm guessing you are just planning to parse the bibtex file in every doc-gen process again and use it right? That sounds rather wasteful to me, there should be a better wat

Jz Pan (Jun 12 2024 at 23:15):

Henrik Böving said:

I'm guessing you are just planning to parse the bibtex file in every doc-gen process again and use it right?

I haven't realized that the doc-gen is run in parallel. But maybe it's not a big deal.

Henrik Böving (Jun 12 2024 at 23:17):

As mentioned above it runs per file in parallel, not doing multi core would be rather crazy, generating mathlib docs on a beefy machine with a bunch of cores is 15 min currently

Jz Pan (Jun 12 2024 at 23:20):

We have FoundationalTypes.lean and that file must only run once. For bibliography it's the same, we have a Bibliography.lean which generates the bibliography HTML file and a simple data file listing all the keys of references and the link they link to. Then when processing each input source file, we load that data file. In this way the .bib file only process once.

Jz Pan (Jun 13 2024 at 23:39):

Unrelated note: I tried to make MD4Lean compile under Windows. I found that I can specify the clang shipped with lean (docs#Lake.getLeanCc) to docs#Lake.compileO, but then it complains that missing stdio.h. Looks like that standard include files are missing for the clang installation.

Utensil Song (Jun 14 2024 at 00:01):

This issue applies to all platforms, here is how you can work around it, by extracting the search path from an actual clang installation and pass the flags.

Utensil Song (Jun 14 2024 at 00:08):

Oh, by all platforms, I means this applies to leanc on all platforms (because lean doesn't ship all C/C++ headers). But if you specified an actual clang, there should be no issue on all three platforms.

Utensil Song (Jun 14 2024 at 00:19):

And you probably need to install a msys2 clang toolchain as demonstrated by this workflow.

Jz Pan (Jun 14 2024 at 00:28):

I have MSVC installed. Obviously the command line of it is incompatible with gcc/clang.

Jz Pan (Jun 14 2024 at 00:28):

I have an extremely hacky solution:

import Lake
open System Lake DSL

package MD4Lean

def md4cDir : FilePath := "md4c"
def wrapperDir := "wrapper"
def srcNames := #["entity", "md4c", "md4c-html"]
def wrapperName := "wrapper"
def buildDir := defaultBuildDir

def MSVC.compileO (oFile srcFile : FilePath) (moreArgs : Array String := #[]) : LogIO Unit := do
  createParentDirs oFile
  proc {
    cmd := "cl"
    args := #["/nologo", "/c", "/Fo:", oFile.toString, srcFile.toString] ++ moreArgs
  }

def md4cOTarget (pkg : Package) (srcName : String) : FetchM (BuildJob FilePath) := do
  let oFile := pkg.dir / buildDir / md4cDir / ⟨ srcName ++ ".o" ⟩
  let srcTarget ← inputFile <| pkg.dir / md4cDir / ⟨ srcName ++ ".c" ⟩
  buildFileAfterDep oFile srcTarget fun srcFile => do
    if Platform.isWindows then
      let flags := #["/I", (pkg.dir / md4cDir).toString]
      MSVC.compileO oFile srcFile flags
    else
      let flags := #["-I", (pkg.dir / md4cDir).toString, "-fPIC"]
      compileO oFile srcFile flags

def wrapperOTarget (pkg : Package) : FetchM (BuildJob FilePath) := do
  let oFile := pkg.dir / buildDir / wrapperDir / ⟨ wrapperName ++ ".o" ⟩
  let srcTarget ← inputFile <| pkg.dir / wrapperDir / ⟨ wrapperName ++ ".c" ⟩
  buildFileAfterDep oFile srcTarget fun srcFile => do
    if Platform.isWindows then
      let flags := #["-I", (← getLeanIncludeDir).toString,
        "-I", ((← getLeanIncludeDir) / "clang").toString,
        "-I", (pkg.dir / md4cDir).toString, "-fPIC"]
      compileO oFile srcFile flags (← getLeanCc)
    else
      let flags := #["-I", (← getLeanIncludeDir).toString,
        "-I", (pkg.dir / md4cDir).toString, "-fPIC"]
      compileO oFile srcFile flags

@[default_target]
lean_lib MD4Lean

extern_lib md4c (pkg) := do
  let libFile := pkg.dir / buildDir / md4cDir / "libleanmd4c.a"
  let oTargets := (←srcNames.mapM (md4cOTarget pkg)) ++ #[←wrapperOTarget pkg]
  buildStaticLib libFile oTargets

Jz Pan (Jun 14 2024 at 00:30):

clang can't find stdio.h, while MSVC does not like lean.h. So the library and the wrapper are compiled using different compilers. It said that build successful. But I need to actually run the test program (Main.lean) to see if it crashes or not.

Jz Pan (Jun 14 2024 at 00:40):

OK I need this:

lean_exe «example» where
  root := `Main

Looks like that it runs correctly, not crashing:

C:\Users\XXX\Documents\projects\md4lean>lake exe example ..\md4c-release-0.5.2\build\src\Release\test.md
[1/9] Replayed MD4Lean/md4c.static
info: stdout:
entity.c
info: stdout:
md4c.c
info: stdout:
md4c-html.c
<h1>Test</h1>
<p>Hello <a href="https://github.com">https://github.com</a></p>
<p><code>&lt; &gt; &amp; &quot;</code></p>
<hr />
<p>&lt; &gt; &amp; &quot;</p>
<p>$\{f \in \operatorname{Hom}(M, M_2) \mid f(p) \subseteq q \}$ $$y^2&lt;x^3-x _1x_ 2$$</p>
<p>&lt;script&gt;alert(&quot;hello world&quot;)&lt;/script&gt;</p>
<p>&lt;c&gt;alert(&quot;hello world&quot;)</p>
<p>An apple is worth $2.0. A pear is worth $3.0.</p>
<p>An apple is worth $2.0.</p>
<p>A pear is worth $3.0.</p>
<p>An apple is worth $2.0$. A pear is worth $3.0$.</p>
<p>An apple is worth $2.0. A pear is worth $ 3.0.</p>

Utensil Song (Jun 14 2024 at 01:59):

That's a great data point, I had trouble with linking gcc compiled C++ library with Lean (which uses clang and its libc++), so I never expected VC could link, but I guess the situation may vary.

Utensil Song (Jun 14 2024 at 02:01):

But maybe the reason is just that it's a C library, while C++ has other complications.

Jz Pan (Jun 14 2024 at 07:08):

You are right, even using MSVC, linking C++ program requires that all the objects are compiled by the same version of the compiler, otherwise there will be symbol missing link errors, or random crashes.

Jz Pan (Jun 14 2024 at 19:35):

Utensil Song said:

lean doesn't ship all C/C++ headers

I checked again and found that lean ships all C/C++ libs (e.g. libc++.a). So I don't know a particular reason that why lean doesn't ship headers.

Jz Pan (Jun 14 2024 at 20:27):

OK now compiling MD4Lean under Windows does not require MSVC anymore. It only uses Lean's built-in clang compiler.

Utensil Song (Jun 15 2024 at 10:45):

Jz Pan said:

Utensil Song said:

lean doesn't ship all C/C++ headers

I checked again and found that lean ships all C/C++ libs (e.g. libc++.a). So I don't know a particular reason that why lean doesn't ship headers.

Lean usually only needs to compile IR in C. So it's natural that it misses some C++ headers.

Utensil Song (Jun 15 2024 at 10:47):

Oh, I see you ship some headers with MD4Lean :joy:

Jz Pan (Jun 16 2024 at 20:39):

Please have a look at this https://github.com/acmepjz/doc-gen4/commit/7c545c41db9fd0b181d1ceec8310f69be8be833f which removes comments from bibtex file. It uses a finite state machine. Basically it removes #xxx until a line break, outputs a line break, then ignores all spaces until a non-space character. \# is not removed, though.

import Lean.Data.Parsec
import Lean.Data.HashMap

open Lean

private inductive RemoveCommentState
| normal : RemoveCommentState
| backslash : RemoveCommentState
| comment : RemoveCommentState
| commentAfterLinebreak : RemoveCommentState

private partial def removeCommentAux (s : String) (i : String.Pos)
    (state : RemoveCommentState) (r : String) : String :=
  if s.atEnd i then
    r
  else
    let c := s.get i
    let j := s.next i
    match state with
    | .normal =>
      if c == '#' then removeCommentAux s j .comment r
      else if c == '\\' then removeCommentAux s j .backslash (r.push c)
      else removeCommentAux s j .normal (r.push c)
    | .backslash => removeCommentAux s j .normal (r.push c)
    | .comment =>
      if c == '\n' then removeCommentAux s j .commentAfterLinebreak (r.push c)
      else removeCommentAux s j .comment r
    | .commentAfterLinebreak =>
      if c == ' ' || c == '\t' then removeCommentAux s j .commentAfterLinebreak r
      else removeCommentAux s i .normal r

def removeComment (s : String) : String := removeCommentAux s 0 .normal ""

I have tried (#eval) that it works. Do you think it's better to rewrite it using docs#Lean.Parsec since it is more maintainable?

Jz Pan (Jun 17 2024 at 21:28):

I have implemented bibPrepass subcommand in the test branch <https://github.com/acmepjz/doc-gen4/tree/test>. If calling command line directly, it will process the bib file, copy it to the doc root and save the cite keys to a txt file.

Now I'm having trouble writing build file. Seems that now it does not work. I want to run bibPrepass before all single and coreDocs subcommand. What should I do?

Jz Pan (Jun 17 2024 at 21:35):

Also, I want it to detect changes/existent or not to docs/references.bib file. Currently the bibPrepass subcommand will delete the cite key file if the input reference file is set to - or inexistent.

Henrik Böving (Jun 17 2024 at 21:40):

You probably want to define a custom lake target for the bibPrepass that is dependend on by the single/coreDocs commands and marks whatever input files you take as its input

Jz Pan (Jun 18 2024 at 00:41):

I tried this https://github.com/acmepjz/doc-gen4/blob/33f67bccde8d1e80f18c46c4a26f04b6b075a887/lakefile.lean but seems that it is still not working. This is what I modified:

target bibPrepass : FilePath := do
  let exeJob ← «doc-gen4».fetch
  let basePath := (←getWorkspace).root.buildDir / "doc"
  let inputFile := (←getWorkspace).root.srcDir / "docs" / "references.bib"
  let inputFiles := #[inputFile]
  let outputFile := basePath / "declarations" / "citekey.txt"
  exeJob.bindSync fun exeFile exeTrace => do
    let trace ← buildFileUnlessUpToDate outputFile exeTrace do
      proc {
        cmd := exeFile.toString
        args := #["bibPrepass", inputFile.toString]
        env := ← getAugmentedEnv
      }
    let traces ← inputFiles.mapM computeTrace
    let indexTrace := mixTraceArray traces
    return (outputFile, trace.mix indexTrace)

module_facet docs (mod) : FilePath := do
  let exeJob ← «doc-gen4».fetch
  let bibPrepassJob ← bibPrepass.fetch
  let modJob ← mod.leanArts.fetch
  -- Build all documentation imported modules
  let imports ← mod.imports.fetch
  let depDocJobs ← BuildJob.mixArray <| ← imports.mapM fun mod => fetch <| mod.facet `docs
  let srcUri ← getSrcUri mod
  let buildDir := (←getWorkspace).root.buildDir
  let docFile := mod.filePath (buildDir / "doc") "html"
  depDocJobs.bindAsync fun _ depDocTrace => do
    bibPrepassJob.bindAsync fun _ bibPrepassTrace => do
      exeJob.bindAsync fun exeFile exeTrace => do
        modJob.bindSync fun _ modTrace => do
          let depTrace := mixTraceArray #[exeTrace, modTrace, bibPrepassTrace, depDocTrace]
          let trace ← buildFileUnlessUpToDate docFile depTrace do
            proc {
              cmd := exeFile.toString
              args := #["single", mod.name.toString, srcUri]
              env := ← getAugmentedEnv
            }
          return (docFile, trace)

-- TODO: technically speaking this target does not show all file dependencies
target coreDocs : FilePath := do
  let exeJob ← «doc-gen4».fetch
  let bibPrepassJob ← bibPrepass.fetch
  let basePath := (←getWorkspace).root.buildDir / "doc"
  let dataFile := basePath / "declarations" / "declaration-data-Lean.bmp"
  bibPrepassJob.bindAsync fun _ bibPrepassTrace => do
    exeJob.bindSync fun exeFile exeTrace => do
      let depTrace := mixTraceArray #[exeTrace, bibPrepassTrace]
      let trace ← buildFileUnlessUpToDate dataFile depTrace do
        proc {
          cmd := exeFile.toString
          args := #["genCore"]
          env := ← getAugmentedEnv
        }
      return (dataFile, trace)

Jz Pan (Jun 18 2024 at 00:42):

It only works when I build doc for the first time. In the subsequent builds it just stuck at building bibPrepass step forever.

Utensil Song (Jun 18 2024 at 00:51):

I don't know why, but sometimes I need .await after .fetch to ensure the depended target is generated.

Utensil Song (Jun 18 2024 at 00:54):

But your issue seems to be the opposite, once the upstream target is generated, the downstream target is stuck? I can't tell why from the code.

Mac Malone (Jun 18 2024 at 15:31):

I am not sure why your bibPrepass job is hanging. However, I do have a suggestion for a modification of its tracing:

target bibPrepass : FilePath := do
  let exeJob ← «doc-gen4».fetch
  let basePath := (←getWorkspace).root.buildDir / "doc"
  let inputFile := (←getWorkspace).root.srcDir / "docs" / "references.bib"
  let outputFile := basePath / "declarations" / "citekey.txt"
  exeJob.bindSync fun exeFile exeTrace => do
    let inputTrace ← computeTextFileTrace inputFile
    let depTrace := exeTrace.mix inputTrace
    let trace ← buildFileUnlessUpToDate outputFile depTrace do
      proc {
        cmd := exeFile.toString
        args := #["bibPrepass", inputFile.toString]
        env := ← getAugmentedEnv
      }
    return (outputFile, trace)

Mac Malone (Jun 18 2024 at 15:34):

You should include input files as part of the trace passed to buildFileUnlessUpToDate and returned trace should generally just be the trace of the outputted file. One notable exception to this output rule is Lean oleans, because they include by reference other files (their imports).

Jz Pan (Jun 18 2024 at 18:07):

Thanks.

It turns out that the stuck is not related to the build file changes. It's simply because building coreDocs takes too much time.

Jz Pan (Jun 18 2024 at 20:08):

Now the reference page works (very ugly)

2024-06-19-040708.png

Eric Wieser (Jun 18 2024 at 20:40):

It looks like your bibtex parser is non-conforming with regards to { and }; the specification is at https://www.bibtex.org/Format

Eric Wieser (Jun 18 2024 at 20:43):

I think it would make sense to either write a standalone bibtex parser module for Lean, or to wrap some existing Python / C software that can do this parsing (perhaps into json)

Ruben Van de Velde (Jun 18 2024 at 20:51):

I wonder how much of this is implemented in https://github.com/dupuisf/BibtexQuery

Eric Wieser (Jun 18 2024 at 20:52):

(cc @Frédéric Dupuis)

Frédéric Dupuis (Jun 18 2024 at 21:07):

Indeed it would be nice to turn bibtexquery into a proper standalone bibtex parser module! At the moment it only implements a subset of the full bibtex format (in particular, it doesn't handle string concatenation with #, which in my defense I've never actually seen anyone use).

Jz Pan (Jun 18 2024 at 21:30):

Eric Wieser said:

It looks like your bibtex parser is non-conforming with regards to { and }

I know; currently it is only a proof of concept.

Jz Pan (Jun 18 2024 at 21:34):

According to https://github.com/leanprover/doc-gen4/issues/147 the doc-gen3 uses an external python tool pybtex to process the bibtex file. According to https://pybtex.org/ it can output HTML or markdown. So now the cheapest solution is to consume the HTML/markdown produced by pybtex, while our parser only parses all cite keys and do reverse references.

Jz Pan (Jun 18 2024 at 21:43):

I've checked https://github.com/dupuisf/BibtexQuery/blob/master/BibtexQuery/Parser.lean and it can't process \{ correctly, otherwise it looks better than my own https://github.com/acmepjz/doc-gen4/blob/test/DocGen4/Output/References.lean

Jz Pan (Jun 18 2024 at 21:47):

Do you think it's a good idea to just use pybtex?

Jz Pan (Jun 18 2024 at 22:58):

pybtex-format -f bibtex -b plaintext references.bib out.txt
ERROR: missing publisher in calugareanu

Oops.

Jz Pan (Jun 18 2024 at 23:03):

@Article{         Guitart1980,
  author        = {Guitart, Ren\'{e}},
  tilte         = {Relations et carr\'{e}s exacts},

tilte :joy: This should be fixed.

Jz Pan (Jun 18 2024 at 23:32):

My parser is not working on thiis:

@InBook{          Graham1983,
  author        = "Graham, R. L.",
  editor        = "Bachem, Achim and Korte, Bernhard and Gr{\"o}tschel,
                  Martin",

because there is a \" in quoted string.

Jz Pan (Jun 18 2024 at 23:37):

I think pybtex is good:

pybtex-format -f bibtex -b html --label-style=alpha references.bib out.htm

2024-06-19-073643.png

Jz Pan (Jun 18 2024 at 23:40):

It has some weirdness, though: to make the alpha label formatting working properly, the authors should be in LastName, FirstName and LastName, FirstName, and the name should use UTF-8 characters directly, instead of LaTeX commands. Otherwise it looks good. We only need an HTML parser, which we already have an example in markdown postprocessing.

Jz Pan (Jun 18 2024 at 23:48):

For getting all cite keys, we can also use

pybtex-convert -f bibtex -t bibtexml --preserve-case references.bib out.xml

and then parsing XML using Lean's built-in library.

Jz Pan (Jun 19 2024 at 18:21):

I'm stuck at String.findSubstr? which is in batteries. Are there similar functions in the core library? I can only find docs#String.replace in the core library.

Jz Pan (Jun 19 2024 at 18:22):

docs#RegularExpression is in the mathlib!

Henrik Böving (Jun 19 2024 at 18:23):

This is not really the kind of regex you are looking for, it's the mathematical notion, not a well optimized software one

Henrik Böving (Jun 19 2024 at 18:27):

Jz Pan said:

pybtex-format -f bibtex -b plaintext references.bib out.txt
ERROR: missing publisher in calugareanu

Oops.

Note that, just like with NodeJS/NPM I would also prefer to avoid having a dependency on Python/The python packaging eco system. If it's really not otherwise possible (which doesn't seem to be the case given the partial success so far?) it's fine but generally a core Lean tool like doc-gen should be able to provide all of its functionality by just using Lean itself.

Henrik Böving (Jun 19 2024 at 18:34):

This was generally fine with doc-gen3 because it a) was a python tool through and through and b) Lean 3 was not nearly as much of a programming language as we are now. I also believe that this is generally a good thing for the eco system, people actually build libraries like your MD renderer or Francois' Unicode library (or potentially bibtex stuff) that can be helpful for other people. I do of course understand that not going the easy path is a little frustrating sometimes, but with features such as the Bibtex renderer it's not like this is a prio A thing that we desperately need to work correctly now otherwise we loose users or something.

Jz Pan (Jun 19 2024 at 19:10):

I see. But currently I'm working on a version using pybtex command line tools. I think this is the easiest way to work with. I also have a pure Lean work-in-progress, but it takes time.

Jz Pan (Jun 19 2024 at 19:52):

In this branch https://github.com/acmepjz/doc-gen4/tree/test2 I use pybtex to render reference pages. It works:

2024-06-20-034859.png

Eric Wieser (Jun 19 2024 at 19:53):

If it's really not otherwise possible (which doesn't seem to be the case given the partial success so far?) it's fine but generally a core Lean tool like doc-gen should be able to provide all of its functionality by just using Lean itself.

Maybe the way out here is to have an extension hook for downstream packages to provide their own references page? That way the call out to python happens in mathlib, rather than in doc-gen

Eric Wieser (Jun 19 2024 at 19:54):

(I think a mechanism for downstream projects to extend the pages generated is something we want one way or another anyway)

Jz Pan (Jun 19 2024 at 19:55):

Reference page feature is optional; currently the logic of my code is that if it can't find pybtex executable, then the reference page is disabled.

Jz Pan (Jun 19 2024 at 19:56):

I think we can always swapped this out later with a pure Lean implementation.

Jz Pan (Jun 19 2024 at 19:56):

The next thing is make the referencing works.

Henrik Böving (Jun 19 2024 at 19:59):

Eric Wieser said:

(I think a mechanism for downstream projects to extend the pages generated is something we want one way or another anyway)

I don't see why this should be the case. doc-gen is supposed to be used to generate API references, the general documentation tool is supposed to be Verso once it comes around.

Eric Wieser said:

If it's really not otherwise possible (which doesn't seem to be the case given the partial success so far?) it's fine but generally a core Lean tool like doc-gen should be able to provide all of its functionality by just using Lean itself.

Maybe the way out here is to have an extension hook for downstream packages to provide their own references page? That way the call out to python happens in mathlib, rather than in doc-gen

I also don't see how to easily do this, the lake targets of doc-gen cannot call into lake targets that are not yet declared in doc-gens files. And making doc-gen call out to arbitrary code is going to lead to caching issues sooner or later because the inputs of a doc-gen build step will no longer be known so it may happen that upon rerunning doc-gen it might not pick up on changes of whatever custom input you have.

Jz Pan said:

I see. But currently I'm working on a version using pybtex command line tools. I think this is the easiest way to work with. I also have a pure Lean work-in-progress, but it takes time.

Yes, that's why I explained that taking time is fine, there is no high pressure behind these features so there is no need to rush it.

Jz Pan (Jun 21 2024 at 20:46):

Now forward link from docstring to reference page works.

2024-06-22-044049.png

Jz Pan (Jun 21 2024 at 20:48):

The next is add reverse link support from reference page to where it is referenced. See here: https://github.com/acmepjz/doc-gen4/blob/eb65d878971b33e55c27c6a4843502fd73f35978/DocGen4/Output/DocString.lean#L157

I think It's best if I can have a read/write monad support here. The HtmlM seems to be read-only.

Jz Pan (Jun 22 2024 at 14:22):

Now back reference also works

screenshot.png

Jz Pan (Jun 22 2024 at 15:00):

PR created as https://github.com/leanprover/doc-gen4/pull/200.

Henrik Böving (Jun 22 2024 at 15:55):

@Jz Pan I left a review

Henrik Böving (Jun 22 2024 at 19:25):

I propose the following approach to the issue of parsing bibtex: Running https://github.com/dupuisf/BibtexQuery against mathlib4's references.bib the first thing it chokes on is:

  editor        = "Bachem, Achim and Korte, Bernhard and Gr{\"o}tschel,

Would @Frédéric Dupuis be willing to fix his parser to get it to the point of supporting all of docs/references.bib in mathlib4? Additionally it would be great if you could provide your library as a standalone thing since, afaict, you only depend on batteries for your executable? I would like to avoid pulling in a library like batteries into the dependency closure of doc-gen if its not even needed for execution.

If we can arrange that then once @Jz Pan has cleaned up their contribution code we should be able to just swap out the pybtex backend for the lean one as a large part of the code base contributed was written in a generic enough fashion to support this.

Henrik Böving (Jun 22 2024 at 22:06):

Okay there are only two parser failures of @Frédéric Dupuis tool in mathlib, one is caused by the fact that there is a name that has an ö and this name parser only accepts ASCII names:

/-- The name of the bibtex entry (i.e. what goes in the cite command). -/
def name : Parsec String := do
  let firstChar ← Parsec.asciiLetter
  let reste ← manyChars $ (alphaNum <|> pchar ':' <|> pchar '-' <|> pchar '_')
  return firstChar.toString ++ reste

I suggest to make this unicode compatible, potentially using @François G. Dorais UnicodeBasic: https://github.com/fgdorais/lean4-unicode-basic/blob/87791b59c53be80a9a000eb2bcbf61f60a27b334/UnicodeBasic.lean#L887 or just by consuming any characters until a , is hit.

The other is caused by a failure to parse this line:

  editor        = "Bachem, Achim and Korte, Bernhard and Gr{\"o}tschel,

There might be a simple parser fix for this but alternatively we could just change the name to contain an ö (a bit ironic that that just works here) in the mathlib references.bib and this is gone. In general the references.bib does of course contain a lot of these LaTeX-isms that are necessary to do basic unicode. Is this type of stuff still necessary? Can LaTeX maybe just cope with us putting proper unicode instead of funny escape sequences?

Either way after both of these are fixed and the bibtex parser is a standalone lib we can just use it to process mathlib'sreferences.bib. So I don't see a reason to engineer the hacky JSON approach suggested in the PR. On a technical level this feature is perfectly doable and integrateable with doc-gen by only using Lean, in a short amount of time.

Frédéric Dupuis (Jun 22 2024 at 22:29):

Yes, I could try to clean up the parser to make it work. The batteries dependency should be very easy to get rid of, from what I recall I only use it for the hashmap, so I should be able to swap it out for the hashmap in core. I can also try to turn it into a standalone library.

Henrik Böving said:

There might be a simple parser fix for this but alternatively we could just change the name to contain an ö (a bit ironic that that just works here) in the mathlib references.bib and this is gone.

Well, I did make sure it works with my own name :-D Jokes aside, for my personal use I basically do what you propose: just stick with Unicode in the bibtex file instead of the LaTeX workarounds. I'm happy to stick with this if this is an option, I suspect implementing this properly (especially with LaTeX math in the bibtex fields) will be a lot of work.

Jz Pan (Jun 22 2024 at 22:30):

Henrik Böving said:

here is a name that has an ö and this name parser only accepts ASCII names

I think the name (which is called citekey in my PR) can only contain ASCII characters according to the specification. Note that this is also used as the tag attribute of <a>, so it can't contain #, &, >, '<' etc.

Jz Pan (Jun 22 2024 at 22:33):

Frédéric Dupuis said:

LaTeX math in the bibtex fields

Don't worry, LaTeX math is handled by MathJax.

François G. Dorais (Jun 23 2024 at 00:48):

Regarding Unicode and TeX-style {\c c} <-> ç translation. Each accent character has a Unicode combining character. For example:

\' is U+0301 COMBINING ACUTE ACCENT as in Frédéric
\" is U+0308 COMBINING DIAERESIS as in Böving
\c is U+0327 COMBINING CEDILLA as in François

You probably can't visually tell the difference in the names above and these three names: Frédéric, Böving, François. There is one:

#eval "François".utf8ByteSize -- 10
#eval "François".utf8ByteSize -- 9

Using the combining characters, the translation from TeX-style to Unicode is straightforward. (I hope Zulip doesn't automatically normalize UTF-8...)

UnicodeBasic currently implements canonical decomposition which could be used to translate from Unicode to Tex-style. Unfortunately, I haven't yet gotten around to implementing canonical composition. Feel free to raise an issue on UnicodeBasic!

Frédéric Dupuis (Jun 23 2024 at 18:57):

Thanks to the dreadful weather in Montreal today, I actually got this done. I've fixed the escaped quotation mark bug (the Gr{\"o}tschel issue above), and for the Unicode in the citation key issue, I just PRed a fix to mathlib (#14056) since this apparently causes problems in some other bibtex implementations as well. Also, it seems to work fine when used as a library (you just have to require it in lakefile.lean as usual), and I removed the dependency on batteries, now it only depends on Lean core.

Henrik Böving (Jun 23 2024 at 19:39):

Excellent, thank you so much!

Frédéric Dupuis (Jun 23 2024 at 20:50):

No problem!

Jz Pan (Jun 24 2024 at 19:16):

I think there is something wrong when building documentation for multiple libraries. For example, I use

DOCGEN_SRC="file" lake build DocGen4:docs Cli:docs

to build the docs (Cli:docs is needed since it is not imported by DocGen4), it runs documentation indexing pass two times. The Cli:docs is built, but the navbar is incorrect, which has no Cli entry.

Henrik Böving (Jun 24 2024 at 19:52):

That the pass runs two times is expected because its a library facet. The indexing does run for alll of the libraries, so given the fact that one library should be done last there should be a clear winner to this. I've been aware of this race condition for a long time but I don't really see a nice way to fix it in lake apart from telling users to explicitly run another target afterwards which is definitely suboptimal. The process itself is certainly not fundamentally broken as this is exactly how mathlib's documentation is built.

Henrik Böving (Jun 24 2024 at 19:55):

I also cannot reproduce what you are describing locally. The entry is there and working as expected by this design.

Jz Pan (Jun 24 2024 at 20:02):

I got this:

2024-06-25-040140.png

It is still the same even if I rerun lake exe doc-gen4 index manually.

Jz Pan (Jun 24 2024 at 20:03):

Note that there is no Cli in the left hand side.

Henrik Böving (Jun 24 2024 at 20:14):

Well I can't say much more than no that's not what I get:
image.png

Jz Pan (Jun 24 2024 at 20:23):

Oops. Let me investigate it later if I have time; currently I'm redesigning bibPrepass command line interface.

Jz Pan (Jun 24 2024 at 21:04):

Problem solved by rerun the build process again.

Jz Pan (Jun 26 2024 at 22:20):

@Frédéric Dupuis I think your https://github.com/dupuisf/BibtexQuery does not contain any code for translating \'e into é. Is that correct?

Jz Pan (Jun 26 2024 at 22:21):

If that is the case, I think the easiest solution would be to make a PR to mathlib to remove all LaTeX commands outside math environment in references.bib. And we should mention this limitation in that file.

Do you think it's a good idea?

Jz Pan (Jun 26 2024 at 22:31):

Also, your last name process code seems incomplete; it can't get last name from FirstName LastName syntax.

PS: Even worse, it's not always the last word is the last name, for example, the last name of Leonardo de Moura is de Moura but not Moura... So it should be [dM00] but not [Mou00]

Jz Pan (Jun 26 2024 at 22:40):

I think my plan is only import BibtexQuery.Parser and reinventing the wheel for name processing.

Frédéric Dupuis (Jun 26 2024 at 23:01):

Indeed the parser makes no attempt at processing the strings in the various fields of a bibtex entry, so it doesn't try to convert TeX commands into Unicode, or to split names into last name/first name or anything like that. (The actual bibtex-query executable does some of that, but I wouldn't rely on that part of the code, that was never really meant to work on any bibtex file except my own...)

For the names, if we really need to figure out where the last name/first name split is, I think we should enforce the Lastname, Firstname format in our bibtex file instead of Firstname Lastname. With the latter format, I don't think there's an algorithm that will always work properly, since conventions vary between languages, countries, etc. And the same goes for the diacritics, we could just make our lives easier and enforce a UTF-8-only policy in references.bib instead of trying to remplement TeX.

Eric Wieser (Jun 26 2024 at 23:04):

https://github.com/plk/biblatex/issues/236 may be relevant here

Frédéric Dupuis (Jun 26 2024 at 23:05):

Yeah, that's exactly the kind of issue I think we should spare ourselves.

Eric Wieser (Jun 26 2024 at 23:07):

instead of trying to remplement TeX.

I think in general we should be be wary of "we can't use X unless we reimplement it in Lean" arguments. They're worth exploring when the reimplementation is trivial, but here calling out to TeX itself or pybtex would save us a lot of trouble.

Eric Wieser (Jun 26 2024 at 23:08):

From the docs:

Pybtex aims to be 100% compatible with BibTeX. It [...] produces byte-identical output

This is a very high bar that I doubt we can easily replicate with a cleanroom implementation

Frédéric Dupuis (Jun 26 2024 at 23:11):

My point is that by using just a slightly more restricted version of the bibtex format that anyway looks nicer, we can save ourselves a lot of trouble.

François G. Dorais (Jun 26 2024 at 23:12):

To throw a grain of salt here: there is no BibTeX standard. If there ever was one, it was reconstructed from what Oren Patashnik's implementation does. Note that this implementation has had only one update since 1988, which unfortunately didn't fix anything immediately relevant to this discussion. All reimplentations of BibTeX are significant improvements over the original.

Eric Wieser (Jun 26 2024 at 23:17):

Frédéric Dupuis said:

My point is that by using just a slightly more restricted version of the bibtex format that anyway looks nicer, we can save ourselves a lot of trouble.

One thing to consider is users who want to use their .bib file both with real bibtex/pdflatex and doc-gen, and not have to maintain two formats. I don't have a feeling for whether the suggested restrictions come at the expense of LaTeX compatibility with certain compilers.

Jz Pan (Jun 26 2024 at 23:17):

but here calling out to TeX itself or pybtex would save us a lot of trouble.

I agree. Currently my doc-gen4 PR provides both options: built-in Lean implementation (to be implemented) and calling pybtex. They can be switched by command line options.

Pybtex aims to be 100% compatible with BibTeX.

Fun fact: it clearly not quite compatible with bibtex, for example the last name abbreviation does not work if there are LaTeX commands in last names.

Eric Wieser (Jun 26 2024 at 23:20):

I don't have a feeling for whether the suggested restrictions come at the expense of LaTeX compatibility with certain compilers.

Looking further, I think they're fine for biber-compatible .bib files, and probably those are all we need to care about

Frédéric Dupuis (Jun 26 2024 at 23:49):

Eric Wieser said:

One thing to consider is users who want to use their .bib file both with real bibtex/pdflatex and doc-gen, and not have to maintain two formats. I don't have a feeling for whether the suggested restrictions come at the expense of LaTeX compatibility with certain compilers.

For what it's worth, I've been sticking to this restricted format for a long time now and I've never had a problem with standard latex compilers (or received complaints of that nature from my coauthors :-) ).

Jz Pan (Jun 29 2024 at 20:55):

I have added TeX diacritics processing code in https://github.com/acmepjz/doc-gen4/blob/test2/DocGen4/Output/Bibtex/TexDiacritics.lean. @Frédéric Dupuis do you think this part should be moved to BibtexQuery?

Frédéric Dupuis (Jun 29 2024 at 21:40):

I guess it could go into BibtexQuery. Do you want to open a PR?

Jz Pan (Jun 29 2024 at 22:13):

Sure, let me do it later.

Jz Pan (Jun 30 2024 at 00:18):

Sorry, but my current code has problems. The Unicode combining character should be put after the character, but not before.

Jz Pan (Jul 01 2024 at 19:25):

PR created as https://github.com/dupuisf/BibtexQuery/pull/2

Jz Pan (Jul 01 2024 at 19:26):

Feel free to give comments.

PS: May I ask the design decision to use List instead of Array in BibtexQuery?

Jz Pan (Jul 03 2024 at 19:43):

I have added bibtex name processing code in https://github.com/acmepjz/doc-gen4/blob/test2/DocGen4/Output/Bibtex/Name.lean. It uses UnicodeBasic though. So it can't be easily put into BibtexQuery.

François G. Dorais (Jul 04 2024 at 17:57):

UnicodeBasic is safe to require by BibtexQuery since doc-gen4 already depends on it.

Jz Pan (Jul 04 2024 at 19:14):

What's your opinion @Frédéric Dupuis

François G. Dorais (Jul 05 2024 at 01:47):

PS: There will be a major update to UnicodeBasic really soon. This use case should not need any serious updates (except the lakefile) but there will be major performance improvements!

Frédéric Dupuis (Jul 05 2024 at 01:59):

Sure, I'm OK with having BibtexQuery depend on UnicodeBasic if it's not a problem for doc-gen.

Frédéric Dupuis (Jul 05 2024 at 01:59):

@Jz Pan You can feel free to PR the name processing code to BibtexQuery if you want.

Jz Pan (Jul 05 2024 at 16:47):

PR created as https://github.com/dupuisf/BibtexQuery/pull/3.

Meanwhile a lot of functions in BibtexQuery/String.lean can be removed. Could you have a look on it?

Jz Pan (Jul 07 2024 at 18:57):

@Henrik Böving I try to split some general changes in my PR into small PRs. Hopefully they will make the reviewing easier.

Jz Pan (Jul 09 2024 at 18:58):

@Frédéric Dupuis I plan to move the bibtex output code to BibtexQuery since it only depends on Lean's built-in XML nodes. I'll make the PR in the following few days. Meanwhile, I already made 2 small PRs fixing something. Could you look at them?

Frédéric Dupuis (Jul 09 2024 at 19:15):

Sure, I'll have a look in the next few days!

Jz Pan (Jul 10 2024 at 10:45):

The TeX diacritics code written by me has potential bugs: if commands \$ and \backslash are processed (currently they are rejected directly), then they will breaking the following-up process. I think the proper solution is to store TeX AST. But this is of low priority I think. I'll come back to this later.

Jz Pan (Jul 11 2024 at 19:27):

The bibitem format feature is done, PR created as https://github.com/dupuisf/BibtexQuery/pull/6. After the PRs in BibtexQuery is merged, it's trivial to let doc-gen4 call this function.

Jz Pan (Jul 11 2024 at 19:28):

This is the first few lines of the test output:

[AGM94]{abramsky_gabbay_maibaum_1994} S. Abramsky, D. M. Gabbay, T. S. E. Maibaum, editors.
Handbook of logic in computer science. Vol. 3: Semantic structures.
Oxford: Clarendon Press, 1994.
ISBN 0-19-853762-X.

[AL19]{ahrens2017} Benedikt Ahrens, Peter LeFanu Lumsdaine.
Displayed Categories.
Logical Methods in Computer Science, 2019.
doi:10.23638/LMCS-15(1:20)2019.

[AZ99]{aigner1999proofs} Martin Aigner, Günter M Ziegler.
Proofs from THE BOOK.
Berlin. Germany, 1999.

[AE72]{alfseneffros1972} Erik M. Alfsen, Edward G. Effros.
Structure in real Banach spaces. I and II.
Ann. Math. (2), 96:98--173, 1972.
doi:10.2307/1970895.

[AS03]{alfsenshultz2003} Erik M. Alfsen, Frederic W. Shultz.
Geometry of state spaces of operator algebras.
Boston, MA: Birkhäuser, 2003.
ISBN 0-8176-4319-2.

[AK21]{altman2021term} A. Altman, S. Kleiman.
A Term of Commutative Algebra.
Worldwide Center of Mathematics, 2021.
ISBN 9780988557215.

[Alu16]{aluffi2016} Paolo Aluffi.
Algebra: Chapter 0.
Volume 104 of Graduate Studies in Mathematics.
American Mathematical Society, Reprinted with corrections by the American Mathematical Society edition, 2016.

Jz Pan (Jul 12 2024 at 19:35):

I spotted some other low priority bugs:

In LaTeX, some commands, e.g. \& and \#, preserves the trailing spaces, while \", etc. are not. But my current implementation removes all trailing spaces. This makes the space in John Wiley &Sons missing.
The edition shows in a suboptimal place, and it will show some funny thing e.g. "3rd ed. edition", "1 edition", "Third Edition edition", "Reprint of the 1999 hardback ed. edition".

Henrik Böving (Jul 12 2024 at 19:42):

That sounds acceptible for now. I suggest to first write the connection to doc-gen, get that merged, then incrementally improve the Bibtex processing. The likelihood of a user having actual troubles because of something like this is very low, not having the feature is more inconvenient.

Jz Pan (Jul 12 2024 at 20:01):

PR created as https://github.com/leanprover/doc-gen4/pull/209.

Henrik Böving (Jul 12 2024 at 20:02):

Great, will take a look.

Henrik Böving (Jul 12 2024 at 21:00):

@Jz Pan left a review

Herman Chau (Jul 12 2024 at 23:57):

I was looking at the module doc for Mathlib.Data.Finset.Basic and the line "More information can be found in Mathlib.Algebra.BigOperators.Group.Finset" has an erroneous link to https://leanprover-community.github.io/mathlib4_docs/Mathlib/Data/Finset/Basic.html#Finset.

Does it sound like a good idea to modify autolink to attempt to handle intra-module links? I would be interested in taking a stab at implementing this.

Jz Pan (Jul 13 2024 at 01:04):

"More information can be found in Mathlib.Algebra.BigOperators.Group.Finset" has an erroneous link to https://leanprover-community.github.io/mathlib4_docs/Mathlib/Data/Finset/Basic.html#Finset.

Seems a highlighting bug; it should only link the last Finset to the link you mentioned.

By the way, the usage of the docstring in that file is wrong, the correct one should be Mathlib/Algebra/BigOperators/Group/Finset.lean. In this case it can link automatically to that file.

Does it sound like a good idea to modify autolink to attempt to handle intra-module links? I would be interested in taking a stab at implementing this.

Currently it can only recognize symbols which are already imported by current module. To make intra-module links working correctly, IMHO two pass compilation is needed. I planned to looking at this later. But you're welcome to working on this, too!

Ruben Van de Velde (Jul 13 2024 at 06:25):

The module syntax should also work

Yaël Dillies (Jul 13 2024 at 08:28):

The module syntax used to work in doc-gen3

Ruben Van de Velde (Jul 13 2024 at 08:43):

Huh, I swear I've seen it work, but now I can't find an example. Maybe it also doesn't support forward refs?

Jz Pan (Jul 13 2024 at 09:52):

See discussion here https://leanprover.zulipchat.com/#narrow/stream/113488-general/topic/Contribution.20to.20doc-gen4.3F/near/444344003, it was working in doc-gen3 but not implemented yet in doc-gen4

Henrik Böving (Jul 13 2024 at 12:53):

@Jz Pan fixed some bug in your code and integrated it with the mathlib4_docs CI, I think the next mathlib4_docs run should be correct

Henrik Böving (Jul 13 2024 at 13:11):

https://leanprover-community.github.io/mathlib4_docs/ has bibtex references now thanks to work by @Jz Pan and @Frédéric Dupuis !

Jz Pan (Jul 13 2024 at 13:21):

Oops, I spot another bug in the code written by me: currently the \ and ~ are replaced by   in all fields, but this shouldn't be done in the fields url, doi, etc. Will fix it later.

Herman Chau (Jul 13 2024 at 17:48):

Jz Pan said:

By the way, the usage of the docstring in that file is wrong, the correct one should be Mathlib/Algebra/BigOperators/Group/Finset.lean. In this case it can link automatically to that file.

Ah gotcha, I see the code in nameToLink? that links to files if they have a / character. And I see that you already merged in a fix for the links in this file, thanks!

Jz Pan (Jul 14 2024 at 19:28):

https://github.com/dupuisf/BibtexQuery/pull/8 should fix the above reference rendering bugs I mentioned.

Jz Pan (Jul 21 2024 at 18:43):

@Frédéric Dupuis Would you like to have a review on the above PR? It fixes some bibtex rendering.

Frédéric Dupuis (Jul 22 2024 at 01:18):

Yes, I'll get to it very soon!

Jz Pan (Jul 23 2024 at 17:28):

Should the cite key be case insensitive? If I remember correctly, in the \cite{citekey} command in LaTeX, citekey is case sensitive. However, there are pages in mathlib4 docs which got the wrong case: in https://leanprover-community.github.io/mathlib4_docs/Mathlib/SetTheory/Game/PGame.html it has the correct case (conway2001), but in https://leanprover-community.github.io/mathlib4_docs/Mathlib/SetTheory/Surreal/Basic.html it is not correct (Conway2001). Should we allow case insensitive cite key match?

Jz Pan (Jul 26 2024 at 18:59):

Is it a good idea to exclude auto-generated proof_<number> from the docs?

Jz Pan (Nov 15 2024 at 12:14):

@Frédéric Dupuis Sorry to bump you again, I've fixed merge conflicts in PR https://github.com/dupuisf/BibtexQuery/pull/8 and it's ready again. I also attached a test file in the PR discussion.

Frédéric Dupuis (Nov 16 2024 at 00:43):

Sorry for holding up this PR so long. I just have one requested addition (I would do it myself if I were 100% sure there's nothing else to adapt) -- otherwise I'm happy with it and I'll just merge.

Last updated: May 02 2025 at 03:31 UTC