Zulip Chat Archive

Stream: general

Topic: edit distance for ranking search results


Arthur Paulino (Jan 28 2022 at 16:29):

This is an attempt to check whether the edit distance works well for mathlib search results or not. What do you guys think?
https://github.com/leanprover-community/doc-gen/pull/154

Arthur Paulino (Jan 28 2022 at 16:30):

@Henrik Böving do you think it makes sense for https://github.com/leanprover/doc-gen4/issues/11?

Henrik Böving (Jan 28 2022 at 16:32):

I've actually never implemented any text search algorithm myself so I wouldnt know, but if it turns out to work nicely in doc-gen I dont see why we shouldnt port it to doc-gen4 as well.

Gabriel Ebner (Jan 28 2022 at 17:06):

At least for the way I search, (unweighted) edit distance works very poorly.

  • Often I don't know or don't want to write the correct namespace. Say I know the lemma is called add_is_o but forgot the namespace, I want the search to return asymptotics.is_O.add_is_o (sixth hit atm). Edit distance penalizes the missing namespace enormously, making rel_iso a better match.
  • Similarly, I'm lazy and don't want to write the whole lemma name. The search should give me reasonable results even if I just write a prefix. If I want to know something about the union of dynkin systems, then dynkunion should help me out. (four hits at the moment, all relevant)
  • I really don't want the search to ignore my query. If I type eq then I want something about equalities! However e.g. or is only a distance of 2 away. Similarly mul is closer to add than addition in edit distance.

Arthur Paulino (Jan 28 2022 at 17:12):

Thanks for testing and for the insights, Gabriel!
I hadn't thought about testing locally. I thought we'd need to deploy the demo before being able to test it.

Rob Lewis (Jan 28 2022 at 17:13):

If we're interested in improving search results, note that doc-gen#131 is still open. There are some outstanding issues -- apparently it's very slow to load, and the 404 page needs to be fixed

Gabriel Ebner (Jan 28 2022 at 17:17):

it's very slow to load

It was already very slow compared to the old version of doc-gen, but the search (start up) has gotten an order of magnitude or two faster last week. I'm not sure how much we can or want to salvage from 131 anymore.

Rob Lewis (Jan 28 2022 at 17:19):

At the same time, it's a much more complete search. Maybe it could be an opt-in ("full search" or something) but I still think it's valuable.

Rob Lewis (Jan 28 2022 at 17:22):

For instance, searching for "pythagorean" doesn't turn up anything about the Pythagorean theorem right now. I don't have a build of 131 to test whether pythagorean_triple outweighs the real answer on that search, but searching for "pythagorean theorem" should certainly turn up what you'd want.

Arthur Paulino (Jan 28 2022 at 17:25):

I'm trying to achieve improvements without big code shake ups. More specifically, I'm more interested in the similarity metric/heuristic per se. I'm very happy to know that what we have on master is fast and provides good results (that could clearly beat a common metric like edit distance)

Arthur Paulino (Jan 28 2022 at 18:07):

Rob Lewis said:

For instance, searching for "pythagorean" doesn't turn up anything about the Pythagorean theorem right now. I don't have a build of 131 to test whether pythagorean_triple outweighs the real answer on that search, but searching for "pythagorean theorem" should certainly turn up what you'd want.

Right now, it seems that using whitespace causes a bug in the search and it doesn't return any result

Gabriel Ebner (Jan 28 2022 at 18:55):

At the same time, it's a much more complete search. Maybe it could be an opt-in ("full search" or something) but I still think it's valuable.

I also like that it searches the docstrings as well. The goals of the PR are certainly great!

But I'm very concerned about the actual implementation. As Eric noted, the performance is really problematic and I can confirm that it takes more than half a minute to start up on my laptop at 100% CPU. The google search is more realistic than that. (Then there's also some minor stuff like the javascript creating html by string concatenation, and further polishing like enabling line wrapping, making the search auto-update again, making the filter menu non-modal, etc. But the performance is really the main issue here.)

I don't think merging the PR in its current state will make the documentation more useful. One realistic option might be to just take the changes to print_docs.py and searchWorker.js and hook them up to a new "full search" button.

There are useful insights from the PR though. For example, it is surprisingly cheap to ship more search information in addition to the declaration name. The declaration list is 1 megabyte gzipped, if we add module names & docstrings that's only 2 megabytes. It might also be possible to add declaration types.

Rob Lewis (Jan 28 2022 at 19:57):

The thing I care most about in the PR is searching doc strings! (And tactic docs/library notes/etc?) Secondly, the UI and filtering.

Henrik Böving (Feb 13 2022 at 11:05):

So now where this has been out for a while what do people think about the new search stuff? Should I try to adapt this 1:1 to doc-gen4 or is there still stuff that we'd like to have fixed?

Kevin Buzzard (Feb 13 2022 at 11:56):

I'm really enjoying the new search

Kyle Miller (Feb 23 2022 at 18:45):

I'm enjoying the search too. Sometimes the results are a bit oblique, though. I was just searching shadow, and the first result is complex_shape.down, but finset.shadow (which is what I was looking for) doesn't make the list.

Eric Wieser (Feb 23 2022 at 18:46):

Also the fact it now works on mobile is great!


Last updated: Dec 20 2023 at 11:08 UTC