Zulip Chat Archive

Stream: general

Topic: Hosting doc on github and large files


Ian Jauslin (Dec 07 2023 at 01:47):

Hi there! I am looking into having a blueprint for one of the projects that we are working on, and would like it to be hosted on GitHub itself. I figured I would keep things simple, and just upload the documentation generated by doc-gen4 and the blueprint generated by Patrick Massot's plasTeX plugin to the GitHub repository, and browse it using GitHub Pages. The issue I have is that doc-gen4 generates some pretty big '.bmp' files, which would require git-lfs. Is there a simple way around this? Should I be doing things differently? Or should I just go ahead and use git-lfs?

Yaël Dillies (Dec 07 2023 at 08:13):

Yes. Basically, don't dump the files in the first place.

Eric Wieser (Dec 07 2023 at 09:50):

GitHub pages has two modes:

  • Serve a website from the contents of a git repo
  • Serve a website from the output of a build step that takes a git repo as input

The second one means the bmp file is never committed, which avoids the size issues.
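
For reference, here is a minimal sketch of the second mode as a GitHub Actions workflow. The build command and output path are assumptions about a typical doc-gen4 setup, not the exact PFR configuration, and toolchain installation (elan/Lean) is omitted for brevity:

```yaml
# .github/workflows/docs.yml -- illustrative sketch, not the exact PFR workflow
name: docs
on:
  push:
    branches: [main]
permissions:
  contents: read
  pages: write      # needed to deploy to GitHub Pages
  id-token: write   # needed by actions/deploy-pages
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (install elan / the Lean toolchain here)
      - name: Build docs   # assumed doc-gen4 invocation; adjust to your lakefile
        run: lake -R -Kenv=dev build ProjectName:docs
      - uses: actions/upload-pages-artifact@v3
        with:
          path: .lake/build/doc   # doc-gen4 output directory (assumption)
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
    steps:
      - uses: actions/deploy-pages@v4
```

The big generated files only ever exist inside the workflow run and the uploaded Pages artifact, so nothing large is committed to the repository and git-lfs is unnecessary.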

Utensil Song (Dec 07 2023 at 14:18):

The second option is what's used in the PFR project, and it's the same approach I explained in another thread.

Ian Jauslin (Dec 07 2023 at 15:32):

Thank you! I didn't expect GitHub would do all of this for free. I'm impressed! I'll set it up.

Yury G. Kudryashov (Jan 03 2024 at 23:57):

Why does doc-gen4 need bmp files?

Alex J. Best (Jan 04 2024 at 00:11):

It's a trick to encourage GitHub to compress them better, IIRC; they aren't really bmps, but they contain most of the data needed for search etc.

Eric Wieser (Jan 04 2024 at 00:23):

Yep, the reason is that the GitHub Pages server will gzip any file with a bmp extension (serving it with Content-Encoding: gzip), but won't do so for json.

Eric Rodriguez (Jan 04 2024 at 00:24):

How long is the decompression time?

Eric Wieser (Jan 04 2024 at 00:24):

It's handled by the browser natively, so who knows

Julian Berman (Jan 04 2024 at 00:35):

I think GitHub Pages will do so for .json as well? At least that's what we found recently for the project I work on.

Julian Berman (Jan 04 2024 at 00:36):

(Try http get https://bowtie.report/draft2019-09.json; it gets Content-Encoding: gzip.)
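
For anyone who wants to reproduce the check, a minimal Node sketch (Node 18+ for built-in fetch; the .json URL is the one above, while the .bmp URL is a hypothetical doc-gen4 path, not a real page):

```javascript
// checkEncoding.mjs: probe which files GitHub Pages serves gzipped.
// Run with `node checkEncoding.mjs`; fetch advertises gzip support by default.
const urls = [
  "https://bowtie.report/draft2019-09.json", // example from this thread
  "https://your-user.github.io/your-project/declarations/declaration-data.bmp", // hypothetical doc-gen4 path
];

for (const url of urls) {
  const res = await fetch(url);
  // The Content-Encoding header survives even though fetch decodes the body.
  console.log(url, "->", res.headers.get("content-encoding") ?? "not compressed");
}
```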

Eric Wieser (Jan 04 2024 at 00:39):

Maybe this changed recently; certainly it used to not be gzipped

Julian Berman (Jan 04 2024 at 00:40):

Maybe. It's certainly quite poorly documented, and there's some random set of things it gzips; we started with .jsonl, which it will indeed not gzip.

Julian Berman (Jan 04 2024 at 00:40):

But yeah .json I think works fine now.

Eric Rodriguez (Jan 04 2024 at 00:42):

Eric Wieser said:

It's handled by the browser natively, so who knows

I just wonder if this is one of the reasons why the docs are slow, although downloading something big won't help much either

Eric Wieser (Jan 04 2024 at 00:43):

Chrome refuses to reveal how much time is spent downloading vs gzipping

Eric Wieser (Jan 04 2024 at 00:43):

I think the idea is that the gzipping happens in real time while the download is happening
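
The browser-native path is transparent, but the same streaming behaviour can be made explicit with the Streams API; a sketch, with a placeholder URL for a pre-gzipped file served without Content-Encoding:

```javascript
// Browsers undo Content-Encoding: gzip transparently, decompressing chunks
// as they arrive. DecompressionStream shows the same pattern by hand:
const res = await fetch("https://example.com/data.json.gz"); // placeholder URL
const decoded = res.body.pipeThrough(new DecompressionStream("gzip"));
const text = await new Response(decoded).text(); // decoded while downloading
console.log(text.length);
```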

Henrik Böving (Jan 04 2024 at 09:01):

Eric Rodriguez said:

Eric Wieser said:

It's handled by the browser natively, so who knows

I just wonder if this is one of the reasons why the docs are slow, although downloading something big won't help much either

The downloading and gzipping is a non-issue. You only download the data once; afterwards it remains in your cache. If something is slow, it's the JavaScript that processes the several MB of JSON, though I spent quite a bit of time making it, eh... less slow. Do you have a particular feature that you think is too slow?

Henrik Böving (Jan 04 2024 at 09:06):

Eric Wieser said:

Chrome refuses to reveal how much time is spent downloading vs gzipping

And yeah, even if it was downloading and gzipping all the time, I would bet that on a modern computer, with the internet speeds available to the average person, the gzipping is much faster than the download if it is indeed implemented natively.

Eric Rodriguez (Jan 04 2024 at 09:14):

Search autocomplete and the /find page are, to me, not the fastest.

Henrik Böving (Jan 04 2024 at 10:13):

With the autocomplete there is an intentional debounce delay, so that it doesn't perform autocompletion unnecessarily on every keystroke and end up being uninteractive because of that. But the debounce delay is an arbitrary number that I intentionally picked a little high. I can try to fine-tune it to get closer to a real-time experience.

Eric Rodriguez (Jan 04 2024 at 10:15):

Is it easy to toy with this number? I understand that uninteractiveness is to be avoided, but real-time autocomplete is always nice, especially with fzf-like bolding; I'm likely asking far too much though!

Henrik Böving (Jan 04 2024 at 10:16):

It's the 300 in this line: https://github.com/leanprover/doc-gen4/blob/eab9173b56295c3dadf46e104ab342060cbe2af8/static/search.js#L149
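
For context, the mechanism in question is a standard debounce; a minimal sketch of the idea, not the exact doc-gen4 code:

```javascript
// Minimal debounce sketch: run `fn` only after `delayMs` ms have passed
// without a new call. Each keystroke resets the timer, so autocompletion
// fires once typing pauses instead of on every key press.
function debounce(fn, delayMs) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Hypothetical usage: `runSearch` stands in for the actual autocomplete handler.
const onInput = debounce((query) => runSearch(query), 300);
```

Lowering the delay makes completion feel closer to real time, at the cost of running the matcher on more keystrokes.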

Henrik Böving (Jan 04 2024 at 10:32):

Okay, I tuned it way down to a delay that corresponds to 60 Hz (about 16 ms), so effectively instant to the human eye. That seems to work fine on the std4 dataset; I'll push the commit and we can check how it feels on the real mathlib dataset.

Henrik Böving (Jan 04 2024 at 12:43):

@Eric Rodriguez better?

Eric Rodriguez (Jan 04 2024 at 12:50):

it's incredible!!

Henrik Böving (Jan 04 2024 at 14:48):

Doing some more measurements, I'm sorry for the Chrome people: on Firefox the string matcher in the search runs in approximately 80-120 ms; on Chrome the same JS takes 210 to 230 ms :(
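
A rough way to take comparable measurements in a browser console; `matchAgainstIndex` and `searchIndex` are stand-ins for the actual matcher and data, not doc-gen4 names:

```javascript
// Timing sketch using performance.now(); averages over several runs to
// smooth out JIT warm-up and scheduling noise.
function timeIt(label, fn, runs = 10) {
  const t0 = performance.now();
  for (let i = 0; i < runs; i++) fn();
  const perRun = (performance.now() - t0) / runs;
  console.log(`${label}: ${perRun.toFixed(1)} ms per run`);
}

// Hypothetical usage against a stand-in matcher:
timeIt("string matcher", () => matchAgainstIndex("continuous", searchIndex));
```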

Eric Rodriguez (Jan 04 2024 at 14:50):

And V8 said it was a good engine :b

Henrik Böving (Jan 04 2024 at 14:57):

I'm quite surprised myself, yeah.


Last updated: May 02 2025 at 03:31 UTC