Zulip Chat Archive

Stream: lean4

Topic: doc-gen4 search slowness


Sebastian Ullrich (Jul 20 2023 at 14:42):

It will suggest docs#Lean.Meta.MetaM.toIO if you wait long enough, which perhaps could be accelerated by not downloading (and more importantly parsing?) the same big file four times :) /cc @Henrik Böving
image.png

Henrik Böving (Jul 20 2023 at 14:48):

Well the download is cached after the first time (as the timing diagram on the rhs shows you dont actually wait), I guess one could try to clean up the HTML to fix that tho. Regarding the parsing iirc the DeclarationDataCenter in the js is a singleton so it should also not be parsed several times. I've tried to take a look at the Firefox flame graph before (they have an integrated thing) but I dont usually frontend dev so it was rather unclear to me what the thing was actually spending time on.

Henrik Böving (Jul 20 2023 at 14:49):

I'll check whether the declaration data gets parsed multiple times later just to be sure. That said if someone can tell me how professionals :tm: figure out what is slow in such static pages I'd be very grateful^^, its been bothering me for a while as well.

Sebastian Ullrich (Jul 20 2023 at 15:01):

No professionals in this room, but it seems to be very busy with parsing JS, perhaps the response result (four times, maybe?)? The solid blue part later on is cacheDeclarationData
image.png

Henrik Böving (Jul 20 2023 at 15:25):

Oh well thats pretty cool then, I'll check out the parsing issue when I'm home. That said it will at best be a linear speedup and since I expect mathlib to keep growing quite a bit we probably need a better solution sooner or later. I'd be very interested in how rust does this type of stuff if someone knows. The only thing I know about rustdoc is that unlike doc-gen it does not do things like rendering the instance lists on demand but instead its basically a big rust to HTML compiler. Doc-gen used to be like this but I decided to split it up into a per file tool later on so that it could profit from the lake dependency graph and inherent parallelism more. And then in the end a final doc-gen command will accumulate all of the json files into a single big one. I guess an optimization one could do here is to only have the decl name -> module mapping in the accumulated file and load only the module specific full declaration data for the module that is getting vievwed right now. That should severly cut the size of the accumulated file

Henrik Böving (Jul 20 2023 at 19:35):

Sebastian Ullrich said:

No professionals in this room, but it seems to be very busy with parsing JS, perhaps the response result (four times, maybe?)? The solid blue part later on is cacheDeclarationData
image.png

Okay I have an update! It seems that in fact the singleton implementation is not working for some reason, so now to figure out how to properly :tm: do a singleton in JS

Henrik Böving (Jul 20 2023 at 20:45):

I have updates! The declaration singleton does in fact work, however all of the files start their null check in parallel and will notice that it is in fact null and then start their HTTP request and will finally all set the singleton in parallel once their request is done and all of them have parsed the file. So if you do this: https://github.com/leanprover/doc-gen4/commit/5ce54e8e10ffe4c2c91766275f4a0d93aabd65dc it actually works

Sebastian Ullrich (Jul 20 2023 at 20:47):

Nice. Awaiting the finished request is probably cheap enough that the first singleton isn't necessary anymore? And yes, I'm just guessing here from exactly zero experience with async JS...

Henrik Böving (Jul 20 2023 at 21:09):

fixed that and another thing that should store the database in local storage. So once both of these deploy it should be instant :tm:

Henrik Böving (Jul 20 2023 at 21:12):

And eh, thanks to @Siddharth Bhat for spotting the race condition for me :D it was quite fun because of course once you enable the debugger it doesn't occur anymore, a true Heisenbug.

Henrik Böving (Jul 20 2023 at 21:36):

@Eric Wieser should be better now

Notification Bot (Jul 20 2023 at 22:39):

11 messages were moved here from #lean4 > Lean 4 monads by Eric Wieser.

Alex Keizer (Jul 21 2023 at 19:01):

Is the goal for this file to be downloaded only once and then cached for all future page visits? I don't know how doc-gen4 is currently implemented, but that sounds like something a service-worker could be used for.

Eric Wieser (Jul 21 2023 at 23:08):

That's what mathlib3 did, but the worker needed you to keep a second tab open to keep it alive while you navigate in the first tab

Eric Wieser (Jul 21 2023 at 23:09):

Otherwise it was killed in the instant where neither the old nor new page was loaded

Henrik Böving (Jul 22 2023 at 09:33):

We had an idea to basically build a hash based cache as an indexed db? so you can basically request to get the hash of one of the declaration data files and if it did not update you can use stuff from your local storage. But I do not know if that is a reasonable approach

Alex Keizer (Jul 22 2023 at 18:45):

I used to do webdev, so I think I might be able to help here, but I don't really have enough context about doc-gen4 to understand what you're suggesting. This is about speeding up the search, right? Can we look at how, say, rustdoc works and steal their ideas, or is Lean sufficiently different that won't work?

Henrik Böving (Jul 22 2023 at 18:47):

No no we can do that. I already have an idea for how we can reduce the loading time of declaration information drastically outlined above. But I would also be curious how rustdoc does as mentioned above.

Alex Keizer (Jul 22 2023 at 19:05):

Let me know if there is anything I can help with!
One thing I noticed is that currently there is a very noticeable latency between pressing a key and the letter appearing in the search box. I suspect the search suggestion is either blocking the UI thread, or at least blocking some event handler. Making this more asyncronous can already help a lot with search feeling more responsive.

Henrik Böving (Jul 22 2023 at 19:09):

Well I am pretty much blank regarding frontend things so if you want to do any javascript changes go for it. The JS part is rather self contained. The search itself lives here: https://github.com/leanprover/doc-gen4/blob/main/static/search.js

Alex J. Best (Jul 22 2023 at 20:27):

The wait is for it to download all the data to search on iirc, the search is client side on a lot of data. But downloading this data itself is asynchronous with the actual page load.
The difficulty with local storage is that the docs update at least daily so you'd need some more clever way of organising things so that you don't throw out all of the previous data each day. Perhaps chunking the declaration data per namespace would be ok. The logic being that any individual commit might change a few related declarations in one namespace but leave most others untouched so only a few would be invalidated each day.

Henrik Böving (Jul 22 2023 at 20:29):

Well as I explain above technically speaking the data is already chunked per module, it is just that right now I accumulate it in a big json that gets used as a central data store. There is a refactoring possible here where the big json merely contains the mapping from decl names to module names and each web page only loads the module data that it needs.

Alex Keizer (Jul 22 2023 at 20:40):

Alex Keizer said:

One thing I noticed is that currently there is a very noticeable latency between pressing a key and the letter appearing in the search box. I suspect the search suggestion is either blocking the UI thread, or at least blocking some event handler. Making this more asyncronous can already help a lot with search feeling more responsive.

So I had a look at rustdoc, and they circumvent this by only starting a search 500ms after the last keypress, whereas doc-gen immediately starts a search after each input event. If you think it's sensible to adopt this, I could do a PR with the 500ms delay approach.

Alex Keizer (Jul 22 2023 at 20:41):

Henrik Böving said:

Well as I explain above technically speaking the data is already chunked per module, it is just that right now I accumulate it in a big json that gets used as a central data store. There is a refactoring possible here where the big json merely contains the mapping from decl names to module names and each web page only loads the module data that it needs.

Even if each page loads every module, there might still be some benefit to keeping these files separate, if the server runs HTTP/2. Each request should get multiplexed on the same connection, meaning there should be little to no overhead vs. one big file, but with smaller files there's better caching and more chance of parralelism

Henrik Böving (Jul 22 2023 at 20:48):

Well the large declaration file does already get cached on reload as well if you are doing many searches after one another right. The actual issue here is (i believe) not the loading of the file per se which should just be one RTT to the server with 0 data returned but instead the fact that it has to parse the entire JSON blob because right now all of the data is accumulated instead of the "name -> module" map for all names in one file and the info about name in other files.

Yury G. Kudryashov (Jul 22 2023 at 20:48):

Will it be faster to do the actual search on the server side?

Henrik Böving (Jul 22 2023 at 20:48):

Yury G. Kudryashov said:

Will it be faster to do the actual search on the server side?

Most likely yes but then it wouldn't be a static web page anymore and that's not really what we want

Yury G. Kudryashov (Jul 22 2023 at 20:50):

Why?

Henrik Böving (Jul 22 2023 at 20:50):

Alex Keizer said:

Alex Keizer said:

One thing I noticed is that currently there is a very noticeable latency between pressing a key and the letter appearing in the search box. I suspect the search suggestion is either blocking the UI thread, or at least blocking some event handler. Making this more asyncronous can already help a lot with search feeling more responsive.

So I had a look at rustdoc, and they circumvent this by only starting a search 500ms after the last keypress, whereas doc-gen immediately starts a search after each input event. If you think it's sensible to adopt this, I could do a PR with the 500ms delay approach.

With my above comment in mind I think this could still very well be a reasonable thing to try so if you want to do the PR go for it :thumbs_up:

Yury G. Kudryashov (Jul 22 2023 at 20:50):

I mean, what's wrong with non-static pages?

Yury G. Kudryashov (Jul 22 2023 at 20:51):

Besides that you can't use Github pages anymore and have to move to, e.g., a shared hosting.

Ruben Van de Velde (Jul 22 2023 at 20:52):

One risk that comes to mind is DDoS, but also generally the cost of having a server that run code

Alex Keizer (Jul 22 2023 at 20:53):

Because with a static page we can still browse the docs without an internet connection.

Henrik Böving (Jul 22 2023 at 20:53):

On top of that people that want to look at documentation locally need to download additional depednencies outside of the lean world as well

Henrik Böving (Jul 22 2023 at 20:53):

(unless someone here comes up with a lean webserver and database adapter :D)

Henrik Böving (Sep 09 2023 at 22:41):

I pushed the proposed 500ms debounce now. It did make the search experience smoother on a local build of std4. We'll see in a few hours whether it does make things nicer on mathlib4 too. In addition I'm planning on properly splitting up the JSON files tomorrow which should reduce load times and improve performance as well.

Henrik Böving (Sep 10 2023 at 00:26):

https://github.com/leanprover-community/mathlib4_docs/actions/runs/6133947195/job/16646316341 should be up once this is through, I shall hit the fast forward button :sleeping:

Jireh Loreaux (Sep 10 2023 at 02:02):

Seems better. Thanks!

Henrik Böving (Sep 10 2023 at 12:02):

Okay I worked a bit on reducing the JSON data size now. I started at 8.3M on std4 and now ended up at 3.0M. I decided to sacrifice the src# feature along the way as I can count the amount of times I saw it being used on one hand and the link is available with just one additional click anyways, that saved ~ 1M of the 5 that I got. On mathlib4 we started at 77M and are now down to 32M so the page should load much faster and the find function should also be quicker.

Mario Carneiro (Sep 10 2023 at 19:08):

I decided to sacrifice the src# feature along the way as I can count the amount of times I saw it being used on one hand and the link is available with just one additional click anyways

Couldn't the src# link continue with an additional indirection, if it links to the same place as docs# but with an additional query parameter in the URL, which is processed by the doc-gen JS to turn the page into a redirect

Bolton Bailey (Sep 23 2023 at 17:32):

I am getting a page unresponsive error searching for "sin" https://leanprover-community.github.io/mathlib4_docs/search.html?sitesearch=https%3A%2F%2Fleanprover-community.github.io%2Fmathlib4_docs&q=sin

Henrik Böving (Sep 23 2023 at 20:18):

Bolton Bailey said:

I am getting a page unresponsive error searching for "sin" https://leanprover-community.github.io/mathlib4_docs/search.html?sitesearch=https%3A%2F%2Fleanprover-community.github.io%2Fmathlib4_docs&q=sin

Can't reproduce on Linux w/ Firefox, can you give more info about your setup?

Jireh Loreaux (Sep 23 2023 at 23:52):

On Android Chrome I got no error, but it was very slow 15+ seconds.

Bhavik Mehta (Sep 24 2023 at 00:03):

On MacOS Chrome I get no error but also very slow

Scott Morrison (Sep 24 2023 at 01:41):

About 4s for me on macos chrome.

Henrik Böving (Sep 24 2023 at 06:51):

Jireh Loreaux said:

On Android Chrome I got no error, but it was very slow 15+ seconds.

I mean its searching through a gigantic amount of strings just with a three character pattern, some slowness is expected here I would say.

Bolton Bailey (Sep 24 2023 at 11:29):

I'm on MacOS chrome, but trying it again, It's much faster, not sure what's up with that.

Jireh Loreaux (Sep 24 2023 at 13:24):

Maybe you already have the file cached?

Bolton Bailey (Sep 25 2023 at 17:05):

Again getting unresponsive, this time for https://leanprover-community.github.io/mathlib4_docs/search.html?sitesearch=https%3A%2F%2Fleanprover-community.github.io%2Fmathlib4_docs&q=list

Henrik Böving said:

Jireh Loreaux said:

On Android Chrome I got no error, but it was very slow 15+ seconds.

I mean its searching through a gigantic amount of strings just with a three character pattern, some slowness is expected here I would say.

Perhaps slowness is expected, but ideally the page would not hang.

Henrik Böving (Sep 25 2023 at 17:29):

Still can't reproduce :( even on my mobile the pages keeps responsive to scrolling etc all the time

Mac Malone (Sep 25 2023 at 21:26):

@Henrik Böving Do you have a version of Chrome to test it on? When it comes to memory, Chrome and Firefox handle things very differently. From past experience, there a lot of things that cause Chrome to hang that do not cause Firefox to do so. One old significant example I recall was Fandom wikis. They would often hang for me on chrome (and even sometimes crash the page) due to memory use but work fine on Firefox.

Mac Malone (Sep 25 2023 at 21:27):

(If you do not wish to use Google Chrome, there a lot of neat Chromium distributions available.)

Henrik Böving (Sep 25 2023 at 21:27):

I have Chromium 116.0.5845.96 Fedora Project but there it seems fine to me as well? Maybe my machine just does the search too fast for me to notice?

Henrik Böving (Sep 25 2023 at 21:28):

Can someone that has this issue maybe run a profiler in the browser on it and export that profile here? I guess that should be possible right?

Henrik Böving (Sep 25 2023 at 21:32):

When I run a profiler on it in chrome it looks to me like it is spending the vast majority of the time on rendering and layouting the page, not in the JS itself. So I guess if we wanted to address that we would have to do some pagination thingy? Or maybe just straight up limit the amount of results?

Bolton Bailey (Sep 25 2023 at 21:58):

Yeah sorry, I realize I am reporting a heisenbug. I am on Chrome Version 116.0.5845.187 (Official Build) (arm64) on an M1 Max MacOS 13.3.1 if that helps.

Bolton Bailey (Sep 25 2023 at 22:05):

I am trying to produce a "recording" but when I stop the recording, it's taking forever to stop.


Last updated: Dec 20 2023 at 11:08 UTC