Zulip Chat Archive

Stream: general

Topic: Is "header-data.bmp" useful in mathlib4 docs?


Jz Pan (Mar 02 2025 at 14:35):

I want to package an offline version of mathlib4 docs. The file "header-data.bmp" is about 700MB which is huge. I've checked the html and js files, seems that none of them references "header-data.bmp" file. Is it safe to remove this file from packaging?

Mario Carneiro (Mar 02 2025 at 14:36):

that file is not a bitmap btw, it's the search index (calling it .bmp makes browsers cache it better)

Mario Carneiro (Mar 02 2025 at 14:37):

it's what makes the search bar suggest results immediately as you type

Mario Carneiro (Mar 02 2025 at 14:38):

it's possible that things have changed in docgen since this file was introduced and now it is doing something else? I'm surprised it is not used

Jz Pan (Mar 02 2025 at 14:39):

Mario Carneiro said:

that file is not a bitmap btw, it's the search index (calling it .bmp makes browsers cache it better)

Seems that it only uses "declaration-data.bmp".

Jz Pan (Mar 02 2025 at 14:42):

screenshot.png

Mario Carneiro (Mar 02 2025 at 14:45):

It was added in this commit by @Henrik Böving , perhaps he knows more about the purpose of this file

Mario Carneiro (Mar 02 2025 at 14:46):

but it seems the immediate answer to your question is yes, it's safe to remove, it's likely that the purpose of the file is either for nonstandard uses of docgen or otherwise unimplemented things

Jz Pan (Mar 02 2025 at 14:48):

It's said that "Produce header-data.bmp, this allows embedding of doc-gen declarations into other pages and more.".

Jz Pan (Mar 02 2025 at 14:49):

But I think it's not practical if a webpage needs to download 700MB of data...

Mario Carneiro (Mar 02 2025 at 14:49):

yeah, that's a ridiculously large payload

Mario Carneiro (Mar 02 2025 at 14:49):

how big is declaration-data.bmp?

Jz Pan (Mar 02 2025 at 14:51):

Mario Carneiro said:

how big is declaration-data.bmp?

40MB which is not that small.

Mario Carneiro (Mar 02 2025 at 14:52):

It boggles the mind that just the names and links to declarations can take up this much space

Mario Carneiro (Mar 02 2025 at 14:55):

Oh, I think the reason the header-data is so large is because it has the pretty printed types of all the declarations too

Mario Carneiro (Mar 02 2025 at 14:55):

(pretty printed as html too)

Henrik Böving (Mar 02 2025 at 14:57):

It was added so the top 100 list of leanprover-community.github.io could be rendered at the request of Eric

Henrik Böving (Mar 02 2025 at 14:58):

So there is basically some js or other kind of script somewhere in leanprover-community that requests this file from mathlib4_docs.

Henrik Böving (Mar 02 2025 at 15:00):

Note furthermore that for this reason header-data.bmp is only generated on demand with the docsHeader facet and not for every doc-gen build that exists out there. One optimization that you could do is within the mathlib4_docs CI pre-filter it such that only the fields that are actually needed by the top 100 list are preserved, that should probably boil it down to a couple of kilobytes

Mario Carneiro (Mar 02 2025 at 15:00):

how does it actually get used?

Jz Pan (Mar 02 2025 at 15:01):

Do you mean this page https://leanprover-community.github.io/100.html ? I assume that the file "header-data.bmp" is only used during the generation of that page (the build script finds corresponding contents in "header-data.bmp" and inserts them to that file).

Henrik Böving (Mar 02 2025 at 15:01):

Jz Pan said:

Do you mean this page https://leanprover-community.github.io/100.html ? I assume that the file "header-data.bmp" is only used during the generation of that page (the build script finds corresponding contents in "header-data.bmp" and inserts them to that file).

Yes I mean that page and yes that is likely correct.

Mario Carneiro (Mar 02 2025 at 15:01):

is the code that uses it available in doc-gen itself or is it only in the leanprover-community website build?

Henrik Böving (Mar 02 2025 at 15:01):

The latter

Mario Carneiro (Mar 02 2025 at 15:03):

I wonder whether doc-gen should instead provide this facility as some kind of HTML postprocessor to replace some embeds with nice looking html

Mario Carneiro (Mar 02 2025 at 15:03):

(assuming I understand the purpose of the feature correctly)

Jz Pan (Mar 02 2025 at 15:05):

Yeah I read the lake build Mathlib:docsHeader in the build script https://github.com/leanprover-community/mathlib4_docs/blob/main/.github/workflows/docs.yaml . Since my offline version of mathlib4 docs will not contain 100 and 1000 pages, I'll remove it from my build script.

Henrik Böving (Mar 02 2025 at 15:06):

Mario Carneiro said:

I wonder whether doc-gen should instead provide this facility as some kind of HTML postprocessor to replace some embeds with nice looking html

I would like to avoid over engineering a feature that basically only has one consumer in this fashion if possible.

Mario Carneiro (Mar 02 2025 at 15:07):

The relevant code seems to be here

Mario Carneiro (Mar 02 2025 at 15:08):

the data flow seems to be a bit weird here, it's downloading a big pre-generated file from github, I guess because the website build is running separately from the docgen build

Jz Pan (Mar 02 2025 at 15:11):

This is a slow operation × 2

if DOWNLOAD:
    print('Downloading header-data.json...') #  This is a slow operation, so let's inform readers.
    # header-data.json contains information for every single declaration in mathlib
    # which has a generated documentation entry (e.g., skipping private declarations by default).
    # The resulting file is huge (several hundred MB), hence we use a small HACK:
    # doc-gen4 uploads this file as a .bmp file, so GitHub's servers will serve it in
    # compressed form. When downloading it locally, we save it (decompressed) with the correct extension.
    download(
        'https://leanprover-community.github.io/mathlib4_docs/declarations/header-data.bmp',
        DATA/'header-data.json'
    )
    print('Downloaded.')
if (DATA/'header-data.json').exists():
    print('Parsing header-data.json...') #  This is a slow operation, so let's inform readers.
    with (DATA/'header-data.json').open('r', encoding='utf-8') as h_file:
        header_data = json.load(h_file)
    print('Parsed.')
else:
    header_data = dict()

Mario Carneiro (Mar 02 2025 at 15:12):

Henrik Böving said:

Note furthermore that for this reason header-data.bmp is only generated on demand with the docsHeader facet and not for every doc-gen build that exists out there. One optimization that you could do is within the mathlib4_docs CI pre-filter it such that only the fields that are actually needed by the top 100 list are preserved, that should probably boil it down to a couple of kilobytes

can docsHeader be directed to only build this for a specified list of declarations?

Jz Pan (Mar 02 2025 at 15:12):

Mario Carneiro said:

the data flow seems to be a bit weird here, it's downloading a big pre-generated file from github, I guess because the website build is running separately from the docgen build

And the data of 100 and 1000 themselves are also downloaded from docs page...

Mario Carneiro (Mar 02 2025 at 15:12):

that seems better than generating everything and then filtering after

Henrik Böving (Mar 02 2025 at 15:12):

Yes that could also be PRed to doc-gen

Jz Pan (Mar 02 2025 at 15:14):

Mario Carneiro said:

can docsHeader be directed to only build this for a specified list of declarations?

That should be done in this build script https://github.com/leanprover-community/mathlib4_docs/blob/main/.github/workflows/docs.yaml but not in doc-gen

Namely, after running lake build Mathlib:docsHeader another script should be run to filter this file according to contents in 100 and 1000.

Mario Carneiro (Mar 02 2025 at 15:15):

I think that was Henrik's suggestion. My suggestion was to build the filter list first and tell docgen to do the filtering rather than postprocessing the 700MB json to throw away almost all of it

Mario Carneiro (Mar 02 2025 at 15:16):

but that requires a docgen PR as Henrik points out

Mario Carneiro (Mar 02 2025 at 15:17):

both approaches would accomplish the task of not sending a gigabyte from the left hand to the right

Jz Pan (Mar 02 2025 at 15:19):

From that py script it seems that a full "declaration-data.bmp" is still needed since it contains some code calculation num_thms and num_defns and link things to mathlib4 docs pages. But seems that the full theorem statement is only used in 100 and 1000.

Eric Wieser (Mar 02 2025 at 15:43):

Mario Carneiro said:

both approaches would accomplish the task of not sending a gigabyte from the left hand to the right

Given we do this (transfer a gigabyte) most of the time when a cache is invalidated for oleans, I'm not sure this is that valuable a place to spend time on


Last updated: May 02 2025 at 03:31 UTC