Zulip Chat Archive

Stream: general

Topic: Light Semantic Search for mathlib4, Looking for feedbacks!

Isaac Li (Aug 09 2025 at 04:08):

Hi all!

I’ve been working on a small project: Light-weight and LLM-Free Semantic Search for mathlib4.
It uses a LoRA-adapted Qwen/Qwen3-Embedding-0.6B model + FAISS HNSW index to enable fast semantic search over ~180k Lean 4.20.1 theorems (extracted via LeanDojo, there may be minor gaps compared to full mathlib).

Here is the GitHub repo:
https://github.com/IsaacLi74/Lightweight-and-LLM-Free-Semantic-Search-for-mathlib4

You can have a quick start with Google Colab:

Open Google Colab
Upload the notebook: Light-Weight Semantic Mathlib Search DEMO.ipynb
Click Run all.
On Colab Free (CPU-only, low RAM), it uses ~7 GB RAM and delivers ~0.5 s/query interactive search latency.

For now, it supports natural language and formula queries (can be mixed), but LaTeX is not supported.

I’d really appreciate any feedbacks! Thanks!

Isaac Li (Aug 09 2025 at 18:02):

:face_holding_back_tears:

Derek Rhodes (Aug 09 2025 at 19:06):

Hi, getting this running on Colab was a piece of cake. The query runs very quickly, but I think it would be helpful if you could include a couple query examples in the README that demonstrate an effective search query and what one might look like.

I'm learning abstract algebra at the moment, so my first search was regarding the transitivity of subgroups. (for the curious, following this paragraph is my search, and the results generated.) I admit that I don't know what Commensurable is, but, that would have been easy to discover with "goto definition". Obviously not available in Colab, but maybe someone knows if there is a standard interface for integrating search engines? I have a feeling that one of the first feature requests you'll get is: editor integration. (this has been done already for other engines, not sure if the plumbing has been abstracted)

(search :magnifying_glass_tilted_right: ) : subgroup transitivity

Commensurable.trans
theorem trans {H K L : Subgroup G} (hhk : Commensurable H K) (hkl : Commensurable K L) :     Commensurable H L :=

CongruenceSubgroup.isCongruenceSubgroup_trans
theorem isCongruenceSubgroup_trans (H K : Subgroup SL(2, ℤ)) (h : H ≤ K)     (h2 : IsCongruenceSubgroup H) : IsCongruenceSubgroup K :=

Set.Subset.trans
theorem Subset.trans {a b c : Set α} (ab : a ⊆ b) (bc : b ⊆ c) : a ⊆ c :=

Subgroup.map_subtype_le
theorem map_subtype_le {H : Subgroup G} (K : Subgroup H) : K.map H.subtype ≤ H :=

CategoryTheory.Subgroupoid.inclusion_trans
theorem inclusion_trans {R S T : Subgroupoid C} (k : R ≤ S) (h : S ≤ T) :     inclusion (k.trans h) = inclusion k ⋙ inclusion h :=

Relation.TransGen.trans
theorem Relation.TransGen.trans {α : Sort u} {r : α → α → Prop} {a b c} :     TransGen r a b → TransGen r b c → TransGen r a c :=

trans
lemma trans [IsTrans α r] : a ≺ b → b ≺ c → a ≺ c :=

subset_trans
@[trans] lemma subset_trans [IsTrans α (· ⊆ ·)] {a b c : α} : a ⊆ b → b ⊆ c → a ⊆ c :=

Lists'.Subset.trans
theorem Subset.trans {l₁ l₂ l₃ : Lists' α true} (h₁ : l₁ ⊆ l₂) (h₂ : l₂ ⊆ l₃) : l₁ ⊆ l₃ :=

Multiset.Subset.trans
theorem Subset.trans {s t u : Multiset α} : s ⊆ t → t ⊆ u → s ⊆ u :=

suhr (Aug 09 2025 at 20:08):

Any comparsions with LeanExplore?

Isaac Li (Aug 09 2025 at 20:32):

@Derek Rhodes
Thanks for trying it on Colab and for the thoughtful feedback! I’m not a very experienced Lean user (I only started last semester), but I think the “transitivity of subgroups” you had in mind might just be: le_trans

example {H K L : Subgroup G} (hHK : H ≤ K) (hKL : K ≤ L) : H ≤ L := le_trans hHK hKL

Also, a bit of context on the search engine: I pre-generate multiple possible queries from the formal theorems and use those to drive retrieval. Tactics or very general thms with many uses are hard to hit directly by a vague natural-language query, so the results may skew toward named theorems such as Commensurable.trans.

As a comparison point, you can try the same query on LeanSearch:
https://leansearch.net/?q=transitivity+of+subgroups

Isaac Li (Aug 09 2025 at 20:32):

@suhr
I haven’t tried LeanExplore. I’ll look it up!

Isaac Li (Aug 09 2025 at 20:47):

@Derek Rhodes

On the “couple query examples”: there isn’t a fixed query format. Free-form natural language, formula-style text all works.

You can search for things like:
:magnifying_glass_tilted_right: m+-i=m-i
Int.add_neg_eq_sub
theorem add_neg_eq_sub {a b : Int} : a + -b = a - b :=

:magnifying_glass_tilted_right: the least factor of a number is a prime
Nat.minFac_prime
theorem minFac_prime {n : ℕ} (n1 : n ≠ 1) : Prime (minFac n) :=

:magnifying_glass_tilted_right: law of cosine
EuclideanGeometry.dist_sq_eq_dist_sq_add_dist_sq_sub_two_mul_dist_mul_dist_mul_cos_angle
theorem dist_sq_eq_dist_sq_add_dist_sq_sub_two_mul_dist_mul_dist_mul_cos_angle (p₁ p₂ p₃ : P) : dist p₁ p₃ * dist p₁ p₃ = dist p₁ p₂ * dist p₁ p₂ + dist p₃ p₂ * dist p₃ p₂ - 2 * dist p₁ p₂ * dist p₃ p₂ * Real.cos (∠ p₁ p₂ p₃) :=

Isaac Li (Aug 09 2025 at 21:08):

@Derek Rhodes

Also, thanks for the thoughtful note on editor integration! This project is still in its early stages. The demo is mainly to show that a local semantic search can run smoothly on a Colab-level CPU (no GPU needed). For now, I’m keeping the scope focused on identifying issues in the search-only phase.

Derek Rhodes (Aug 09 2025 at 21:11):

@Isaac Li, thank you, le_trans works.

Pre-generating queries is an interesting strategy, that brings up an idea along those lines. In case you were not aware, @Kyle Miller and @Patrick Massot built an informalizer (PDF) that is described by @Floris van Doorn (maybe he helped build it too?) as something that: "turns a formal proof into a [natural language] readable proof, where the reader decides how much detail they want." I was thinking that maybe the semantic analysis would work well with English, I know that some engines that use vector databases and text similarity use models that are trained on English. Just a thought :)

Isaac Li (Aug 09 2025 at 21:38):

@Derek Rhodes
Thanks for the information. I’m still pretty new to Zulip, so I wasn’t aware of that. My search engine is also based on vector databases and text similarity, but I’ve done some fine-tuning and variable adjustments to help the rather small model get a better understand on mathematical statements. I need to find more queries that the model can’t retrieve, so I can get a more complete picture of its performance. If you come across any other theorems it fails to find, please feel free to share them with me!

Frank Yu (Aug 11 2025 at 01:49):

This search engine handles pure equations better than LeanExplore and Leansearch. The engine has high recall and medium precision, I expect it to be more focused.
demo.png
Using query "m+i=m-i", the result turns out to be this, it is exactly what I am looking for even the query is mistaken.
explore.png
With the same query I cannot obtain the desired result.
search_1.png
Using query "a+b=a-b", the result is fine.
search_2.png
Once change back to query "m+i=m-i", I failed to find the target result. That is the effect of naming bias as variable "a""b" is more likely to be recognized.
@Isaac Li I am curious about how to eliminate the naming bias in variables?

Isaac Li (Aug 11 2025 at 03:21):

@Frank Yu

Thanks for trying out the demo!

I think LeanExplore uses a relatively small encoder (BAAI bge-base-en-v1.5), which does not perform so well at symbol-heavy equations. LeanSearch does perform better on equations, but as you said, it still shows variable-naming bias.

Also, this demo is part of my AITP 2025 submission, “Towards Lightweight and LLM-Free Semantic Search for mathlib4.” I’m finishing the paper now and will attach it to the GitHub repo together with a minimal reproduction guide.

If you are interested and have other test examples, please share them!

Last updated: Dec 20 2025 at 21:32 UTC