Zulip Chat Archive

Stream: general

Topic: Loogle database size


Shreyas Srinivas (Oct 24 2023 at 20:18):

I wanted to ask you one thing about loogle. How big is the entire database?

Joachim Breitner (Oct 24 2023 at 20:49):

200MB, which includes both indices (constants to lemmas, and the lemma name suffix trie; the latter is actually larger).
But note that loogle also needs the oleans of Mathlib to work properly.

Shreyas Srinivas (Oct 24 2023 at 21:01):

Wait so loogle is pointless for non mathlib projects?

Shreyas Srinivas (Oct 24 2023 at 21:03):

Adding a search feature that gives live feedback as you type it is naturally a drag on the extension, especially since the initial search results can be large.

Shreyas Srinivas (Oct 24 2023 at 21:03):

Is there a way to ask the API to not return more than the first 20 matches in the request itself?

Joachim Breitner (Oct 24 2023 at 21:05):

Ah, no: In the context of non-mathlib you need the oleans of whatever else you want to search.

Joachim Breitner (Oct 24 2023 at 21:05):

Sorry, late here, I didn’t make the connection to the VS code work right away :-) :sleepy:

Joachim Breitner (Oct 24 2023 at 21:06):

I can add a parameter for the result size, but I doubt it saves much time: Most of the time is spend producing and sorting the matches.

Shreyas Srinivas (Oct 24 2023 at 21:08):

No it can get bad. Not everyone has a superfast connection dedicated to downloading a few dozen MBs especially if fast response is needed, especially when the requests are made in rapid succession. I am not even thinking of the burden on the server when thousands of lean users are doing this.

Joachim Breitner (Oct 24 2023 at 21:10):

Are the query results already in the MBs? I'd expect a few kb for the bit of JSON (or maybe I am not following).
But yes, eventually we want people to just have loogle running locally, and nothing would have to be downloaded (for mathlib, the cache comes with the oleans you already download, or is created on demand).

Joachim Breitner (Oct 24 2023 at 21:11):

I suggested using https://loogle.lean-fro.org/ primarily so that we focus on the UI part first, and get a prototype out; and later make it use a local installation. At least the current server (cheapest Hetzner vserver around) won’t handle a big load, I expect :-)

Shreyas Srinivas (Oct 24 2023 at 21:12):

The other issue is even with the current set up I see no guarantee that calling the api on a string produces a strict subset of the results produced by it's prefix. I also cannot say I will only start searching after 3-4 characters have been typed since what happens if the search is simply 'pi'

Joachim Breitner (Oct 24 2023 at 21:13):

Naively I would expect that it will search-as-you-type from a certain length on only, unless I press Enter.

Joachim Breitner (Oct 24 2023 at 21:14):

But if serach-as-you-type is causing headaches, then a search-when-you-press-enter UI is also a good MVP!

Shreyas Srinivas (Oct 24 2023 at 21:15):

Joachim Breitner said:

Naively I would expect that it will search-as-you-type from a certain length on only, unless I press Enter.

That's my plan. But in the past I have had bad experiences where users used to getting a dropdown for longer queries complain that nothing is happening when they just type two letters.

Joachim Breitner (Oct 24 2023 at 21:15):

Also, searching for p or pi is really fast, because that’s an unknown identifier. And searching for "pi is illegal syntax :-D

Joachim Breitner (Oct 24 2023 at 21:16):

There is also the option of waiting for x ms, so that the pi search only fires when they stop typing (or a typing rather slowly)

Shreyas Srinivas (Oct 24 2023 at 21:16):

I'm trying out all this stuff already.

Joachim Breitner (Oct 24 2023 at 21:17):

Cool!

Shreyas Srinivas (Oct 24 2023 at 21:17):

I would say that matching how google behaves is the ideal. But I also expect that when people use the current online version, they are going to complain a lot about "loogle hanging"

Joachim Breitner (Oct 24 2023 at 21:19):

Yes, I am surprised I didn’t get many complaints yet with this small, single-threaded service. But hopefully by then #find is part of mathlib proper (and later maybe std4 or its own library used by whoever wants to, including mathlib) and people can start using it locally.

Joachim Breitner (Oct 24 2023 at 21:21):

OTOH, it shouldn’t be hard to scale it out a lot. It’s not very memory hungry, I believe (most data structures are mmap’ed, the requests themselves hopefully don't need much), so a mult-threaded implementation can scale easily, and it has no state, so if we really need an online service that can sustain a high load, I don’t see big obstacles.

Shreyas Srinivas (Oct 24 2023 at 21:23):

With a local version, it should not be a big deal to load the entire trie into memory if required. We are in the era of chrome and vscode after all ;)

Shreyas Srinivas (Oct 24 2023 at 21:23):

Realistically it won't happen of course

Joachim Breitner (Oct 24 2023 at 21:24):

It is loaded into memory, via mmap (so the pages are read on demand), just like oleans by lean.

Notification Bot (Oct 24 2023 at 21:24):

25 messages were moved here from #general > Loogle is live! by Joachim Breitner.

Joachim Breitner (Oct 24 2023 at 21:25):

Oh, I can move messages after all! That’s good to know. (I guess just not within streams?)

Shreyas Srinivas (Oct 24 2023 at 21:26):

Joachim Breitner said:

It is loaded into memory, via mmap (so the pages are read on demand), just like oleans by lean.

Is this really the right level of abstraction? Wouldn't a small DB server do better?

Joachim Breitner (Oct 24 2023 at 21:27):

A DB server doesn't do anything different internally, doesn’t it? And it’s how library_search and rw? and other tactics works, so it seemed reasonable to me so that #find works the same

Shreyas Srinivas (Oct 24 2023 at 21:29):

I just trust that they have implemented proper cache optimal data structures for indexing and querying.

Joachim Breitner (Oct 24 2023 at 21:44):

We can certainly swap out data structures and/or storage later if needed. So far I was positively surprised by the performance we get out of Lean's data structures (search trees and Tries). Remind me to look again once someone wrote SQLite bindings for Lean :-D

Shreyas Srinivas (Oct 24 2023 at 21:51):

I know we can do that. I just think that the people who wrote redis or dragonfly know a lot more about real world cache optimization than me. Algorithms engineering lives on an entirely different planet from algorithms theory.

Shreyas Srinivas (Oct 24 2023 at 21:53):

Especially for rapid query management

Shreyas Srinivas (Oct 24 2023 at 21:54):

The db can be extracted from the data structures or alternatively act as an intermediary for caching queries.

Joachim Breitner (Oct 24 2023 at 22:01):

I'm not denying that! It won't help with elaborating and pattern matching expressions though, which (I believe) is the current bottleneck for many slow queries.

Joachim Breitner (Oct 24 2023 at 22:02):

Anyway, let's first build it, and then make it usable, and then make it fast :-)

Shreyas Srinivas (Oct 24 2023 at 22:02):

But why does loogle need to elaborate?

Shreyas Srinivas (Oct 24 2023 at 22:02):

I'll try to post a prototype this weekend

Shreyas Srinivas (Oct 24 2023 at 22:02):

No firm guarantees though

Jireh Loreaux (Oct 24 2023 at 22:03):

Because what you enter into the search query is syntax, not raw expressions, so it has to elaborate them.

Shreyas Srinivas (Oct 24 2023 at 22:05):

oh okay, I think I see your point

Joachim Breitner (Oct 24 2023 at 22:05):

Exciting!

Shreyas Srinivas (Oct 24 2023 at 22:06):

I don't want to raise any expectations. It is just basically "type in your query" and pick your sin. The only thing is that I want to find out what people think about where they best like accessing it from and how.

Shreyas Srinivas (Oct 24 2023 at 22:06):

I wish I could query the mathlib docs and give a short description (without any web scraping of course).

Yaël Dillies (Oct 25 2023 at 05:50):

Joachim Breitner said:

Oh, I can move messages after all! That’s good to know. (I guess just not within streams?)

You can within streams but not across.

Joachim Breitner (Oct 25 2023 at 08:21):

Loogle has access to the environment, and could return type and docstring along with search results. I anyways wanted to include the type on loogle, just haven't had a good idea for a nice UI there yet.

Shreyas Srinivas (Oct 25 2023 at 16:39):

Joachim Breitner said:

Loogle has access to the environment, and could return type and docstring along with search results. I anyways wanted to include the type on loogle, just haven't had a good idea for a nice UI there yet.

Could it return this info in the json for now?

Joachim Breitner (Oct 25 2023 at 16:40):

Let me give it a try.

Joachim Breitner (Oct 25 2023 at 17:10):

That was easier than expected. Most of the time was spent finding findDocString?. Deploy running…

Jireh Loreaux (Oct 25 2023 at 17:12):

Why not just display it upon hovering the cursor over the link to the decl?

Joachim Breitner (Oct 25 2023 at 17:20):

That’s surely one option. But isn’t it even nicer if you don’t even have to move the mouse, i.e. if it's always visible? Or is that too cluttered? Maybe <detail>?
I’ve just pushed a change (still deploying) that simply shows the type in a line below the hit, let’s see what feedback we get.

For the VS extension, one eventually wants to have a rich view like in the InfoView, where you hover on subexpressions of the type, go to defintion etc. Probably that needs integration with the LSP server somehow? Anyway, that’s future work :-)

Joachim Breitner (Oct 25 2023 at 17:23):

Now with types: https://loogle.lean-lang.org/?q=%28List.replicate+%28_+%2B+_%29+_+%3D+_%29
And JSON with types and docstring: https://loogle.lean-lang.org/json?q=%22List.replicate%22

Jireh Loreaux (Oct 25 2023 at 17:48):

have a look at:
@loogle (Complex.re ?z : ℂ)

loogle (Oct 25 2023 at 17:48):

:search: Complex.re_add_im, Complex.conj_eq_iff_re, and 22 more

Jireh Loreaux (Oct 25 2023 at 17:49):

The issue is that when you get into the more complicated statements, there are too many hypotheses. Maybe you could avoid displaying type classes and implicit arguments unless we're hovering? That would be similar to doc-gen, which greys out the implicits and type classes.

Jireh Loreaux (Oct 25 2023 at 17:49):

Sorry, I don't mean to ask for the moon, it's already great.

Jireh Loreaux (Oct 25 2023 at 17:51):

(just fyi, in the example I gave above, the results I was thinking of were the last two. I'm not arguing they should be first, but they would be much more readable without that extra info.)

Joachim Breitner (Oct 25 2023 at 20:10):

I am using (ppExpr ci.type); if there are any options I should set to make it a bit smaller, I’m happy to do so (like hiding unused parameter names, hiding universes, etc.)
I don’t think there should be much more smartness in this part of the code yet; maybe once we do really nice semantic types, like in the info view (with hover and clickable).
Another cheap fix might be some CSS that disables line wraps and sets overflow-x:scroll, so if types get large you get to scrol that type, but at least it’s not a wall of text.

Kyle Miller (Oct 25 2023 at 20:13):

You could look at how #check works to trigger pretty printing like a declaration. The extract_goal tactic also does this, if you want another example. (Look around PrettyPrinter.ppSignature.)

Joachim Breitner (Oct 25 2023 at 20:24):

I see, delabConstWithSignatures looks good…
annoyingly, I'd like to get just <params> : <type> and not c.{<levels>} <params> : <type>, but all the interesting logic about parameters is bundled. Maybe I can PR a refactor that exposes that. Or simply blindly copy that code…

Shreyas Srinivas (Nov 04 2023 at 16:56):

I have a question: How should loogle-lean (what I am calling my extension for now) insert the definitions into the editor

Shreyas Srinivas (Nov 04 2023 at 16:56):

Btw, I think if I get this and a couple of other snafus fixed, I can share a prototype later today @Joachim Breitner. I still need to configure the correct places to start off the extension.

Shreyas Srinivas (Nov 04 2023 at 16:58):

Do we just want a text insert into the cursor position? Another option is to copy the definition into the clipboard (for now I have both options, but insert is the default).

Joachim Breitner (Nov 04 2023 at 22:15):

Up to you :-)


Last updated: Dec 20 2023 at 11:08 UTC