Zulip Chat Archive
Stream: general
Topic: search-mathlib
Deming Xu (Jul 30 2024 at 15:22):
Hi. I also made a webpage that can search for Mathlib theorems.
https://huggingface.co/spaces/dx2102/search-mathlib
You can find the code and data I used here. I think the code is quite short. There is only one Lean file (about 100 lines) and two Python files (about 100 lines each).
https://huggingface.co/spaces/dx2102/search-mathlib/tree/main
Lean is a difficult language, and I used this project as an exercise while learning it.
Originally I only knew that Lean had a symbol-matching search engine, Loogle (https://loogle.lean-lang.org). It was not until I finished making this webpage that I realized there were similar natural-language search engines, like https://moogle.ai and https://leansearch.net/, which was also announced today ...
I think the principle behind my webpage is similar to Moogle and LeanSearch. I used a text embedding model from natural language processing. These neural networks convert any piece of text into a vector (e.g. a 1024-dimensional vector, represented by 1024 floating-point numbers). This high-dimensional vector is the "embedding" of the text in R^1024. One of the training objectives of the embedding model is that any two pieces of text that are semantically close should correspond to similar vectors; in other words, the angle or distance between the vectors should be small, where the angle is computed from the dot product. This technique can therefore be used for search (and other tasks, such as text classification, entity recognition, and helping chatbots).
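The angle-via-dot-product idea above can be sketched in a few lines of NumPy (a generic illustration of cosine similarity, not code from the repo):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors.

    The dot product of the normalized vectors equals cos(angle):
    1.0 means the same direction, 0.0 means orthogonal (unrelated).
    """
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two toy 2-dimensional "embeddings" (real models use e.g. 1024 dimensions).
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, b))  # ~0.707, i.e. a 45-degree angle
```

If the stored embeddings are L2-normalized once up front, ranking by plain dot product gives the same order as ranking by cosine similarity, which is why search engines often precompute normalized vectors.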
I found the text embedding API service provided by VoyageAI. They are ranked high on the MTEB leaderboard, and I can use their server GPU instead of preparing one myself. They gave everyone 50 million tokens of free usage. (1 token is about 4 English characters.) After embedding Mathlib's theorems, I still have 14 million tokens of free credits left. I think this is enough to run this web page for now.
For each definition and theorem in Mathlib, I took its name, type, and doc comment, concatenated them, and converted the result into an embedding vector. At search time, the user's query is also converted into a vector, and its dot product with the vectors of the roughly 300,000 declarations is computed to find the most similar definitions and theorems.
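That retrieval step can be sketched like this (the concatenation format and function names here are hypothetical stand-ins; the real format is in embed.ipynb):

```python
import numpy as np

def build_document(name: str, type_str: str, docstring: str) -> str:
    # Hypothetical concatenation format; the actual one lives in embed.ipynb.
    return f"{name} : {type_str}\n{docstring}".strip()

def top_k(query_vec: np.ndarray, embed_arr: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k rows of embed_arr most similar to query_vec.

    Assumes the rows of embed_arr and query_vec are L2-normalized, so the
    dot product equals cosine similarity. argpartition finds the k best
    candidates in O(n); only those k are then fully sorted.
    """
    scores = embed_arr @ query_vec
    idx = np.argpartition(-scores, min(k, len(scores) - 1))[:k]
    return idx[np.argsort(-scores[idx])]
```

With ~300,000 stored vectors, a single matrix-vector product per query is fast enough that no approximate nearest-neighbor index is needed.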
Finally, I will briefly explain my three code files. I hope this can help people who want to do similar things.
extract.lean: This is the code I modified from lean-training-data (https://github.com/semorrison/lean-training-data). I recommend lean-training-data to others who want to extract information from Lean; lean-dojo seems to be another option. The main part of the code imports Mathlib via CoreM.withImportModules #[Mathlib] do ..., then uses let env ← getEnv to inspect and print the information in the Lean.Environment data structure, including each theorem's name, type, doc string, etc. This information is stored in theorems.txt. (The export takes about 15 minutes, and progress is displayed.)
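As a rough illustration of that loop (names and signatures are from memory and vary across Lean toolchain versions, so treat this as a sketch rather than a drop-in replacement for extract.lean):

```lean
import Mathlib
import Lean

open Lean Meta

/-- Sketch: walk the `Environment` and print each theorem's name and
pretty-printed type, roughly what extract.lean dumps into theorems.txt. -/
def printTheorems : MetaM Unit := do
  let env ← getEnv
  for (name, info) in env.constants.toList do
    -- Skip compiler-generated auxiliary declarations.
    if !name.isInternal then
      if let .thmInfo _ := info then
        IO.println s!"{name} : {← ppExpr info.type}"
```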
embed.ipynb: This code calls the VoyageAI API, converts each item in theorems.txt into a 1024-dimensional vector, and stores the result in embed_arr.npy. You can see the format of the name, type, and docstring concatenation here.
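One practical detail in a step like this is batching: embedding APIs cap how much text one request may contain, so the theorems have to be grouped under a token budget. A minimal sketch, using the post's rule of thumb of roughly 4 characters per token and a caller-supplied embedding function standing in for the real API client:

```python
import numpy as np

# Rough heuristic from the post: 1 token is about 4 English characters.
CHARS_PER_TOKEN = 4

def batch_texts(texts: list[str], max_tokens_per_batch: int = 100_000) -> list[list[str]]:
    """Group texts into batches that stay under an approximate token budget.

    The chars-per-token estimate is a rule of thumb, not an exact tokenizer,
    so the budget should be set conservatively below the API's real limit.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for text in texts:
        tokens = max(1, len(text) // CHARS_PER_TOKEN)
        if current and current_tokens + tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches

def embed_all(texts: list[str], embed_batch) -> np.ndarray:
    """Embed every text via a caller-supplied batch-embedding function.

    `embed_batch` is a placeholder for the real API call; it must map a
    list of strings to a list of equal-length vectors.
    """
    vectors = []
    for batch in batch_texts(texts):
        vectors.extend(embed_batch(batch))
    return np.array(vectors)  # then e.g. np.save("embed_arr.npy", ...)
```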
server.py: This code will mount a simple web page on the huggingface space server. It will also forward the user's search terms to VoyageAI, convert the search terms into vectors, and then match similar ones from the vectors of existing theorem definitions.
I hope this page can help others who are learning Lean understand Mathlib faster.
Jireh Loreaux (Jul 30 2024 at 16:07):
(deleted)
Jason Rute (Jul 31 2024 at 06:02):
I like that your search engine supports searching definitions, unlike leansearch.net. I also think this interface is better than leansearch.net's, but not as good as Moogle.ai's. (Interfaces are important to get users.)
Jason Rute (Jul 31 2024 at 06:02):
I haven’t used yours or leansearch.net enough to really stress test it, but it would be fun to compare all three engines to each other.
Jason Rute (Jul 31 2024 at 06:02):
I think it would be nice to have one (or maybe all three of) these engines available to use in the search bar for Mathlib #docs.
Johan Commelin (Jul 31 2024 at 06:28):
We could have a "one search-bar to rule them all, one query to bind them". Which farms out searches to these (and loogle) and aggregates the results... :thinking:
Yaël Dillies (Jul 31 2024 at 06:34):
What I really care about with all these search engines is that the usual stuff we have in #docs stays available. This includes, but is not restricted to:
- Hover to see associativity
- Links to declarations mentioned
- The "Instances" and "Instances for" lists
- A link to the source code
- A non-eye bleeding color scheme
Deming Xu (Jul 31 2024 at 08:47):
There are definitely a lot of improvements that can be made. My code is really just a small prototype...
First, like leansearch, you could have a large language model write a natural language description for each Lean definition or theorem. This would definitely make the results better.
Regarding the style and theme, I partly followed Loogle's design. Like Loogle, the theorem names you see are actually links to mathlib4_docs; you can follow them if you want more information. Currently, the theorem types are just pretty-printed into plain text by calling ppExpr; I need to look at the code of doc-gen4 to turn them into interactive HTML.
Deming Xu (Jul 31 2024 at 08:53):
Perhaps it would be better to integrate Loogle's symbolic matching, text embedding, and LLM-generated descriptions directly into the mathlib docs.
Last updated: May 02 2025 at 03:31 UTC