Zulip Chat Archive

Stream: general

Topic: How does Lean's search of its library work?


Daniel Donnelly (Sep 22 2021 at 06:03):

I'm experimenting with a visual language of nodes and arrows, so "search" naturally becomes a subgraph isomorphism search that is variable-substitution aware. However, this made me wonder: how does Lean match a user's text input to a library rule's input parameters? If it even does that, or needs to do that. My idea is that the library should always be searching as the user's graph input changes, so that "Apply rule" buttons automatically show up and the user just clicks one to logically transform their diagram. I'm pretty certain Lean has some tools that do a search for applicable rules, from what I've read here in the past. If Lean itself doesn't do this, how do the community plugins perform their searching?

Scott Morrison (Sep 22 2021 at 09:18):

I'm not entirely sure what you're asking, but perhaps https://en.wikipedia.org/wiki/Unification_(computer_science) is a reasonable starting point?

Daniel Donnelly (Sep 22 2021 at 09:22):

@Scott Morrison yes, that is a good starting point! I didn't know about this notion before now.

Daniel Donnelly (Sep 22 2021 at 10:45):

Scott Morrison said:

I'm not entirely sure what you're asking, but perhaps https://en.wikipedia.org/wiki/Unification_(computer_science) is a reasonable starting point?

So which algorithm does Lean go with? Is it the Paterson-Wegman linear-time algorithm that uses DAGs?

Eric Taucher (Sep 22 2021 at 11:05):

Thanks Scott.
I use that same link to the Wikipedia article on Unification often with work related to Prolog.
The article notes a few types of unification, but for Prolog it is specifically syntactic unification that matters.

For Lean, can someone identify which kind of unification it is using? It would also be helpful if someone could identify where in the GitHub repository the unification code is. I find references to unification hints, but nothing that jumps out at me and says "I do unification".

Daniel Donnelly (Sep 22 2021 at 11:17):

A side question, maybe easy for someone here to answer: how does one mix category theory with unification? The Wikipedia article talks about solving systems over some algebraic structure, possibly with some functions; what if your terms consist of diagrams? Would this be called graph unification?

Kyle Miller (Sep 22 2021 at 15:21):

This might be where unification happens: https://github.com/leanprover-community/lean/blob/master/src/library/type_context.cpp (is_def_eq_core it seems)

It's fairly complicated, but approximately speaking it compares two terms piece-by-piece (terms are represented as DAGs I believe), and whenever a metavariable is compared with something, that metavariable is replaced with what it is being compared to, assuming the occurs check succeeds. By "replaced" I mean the metavariable is assigned the term in the metavariable context, but it's as if every instance of the metavariable is replaced with the term.

There are additional complications from (1) the fact that you might need to reduce terms a bit to see that they can be successfully unified, (2) the fact that unification hints might need to be consulted to help with unification, and (3) other heuristics to help the unifier along.
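
A toy sketch of that metavariable assignment, seen from the Lean 3 tactic framework rather than from the C++ code above (mk_meta_var, unify and instantiate_mvars are the relevant wrappers; this is only an illustration, not how is_def_eq_core is implemented):

open tactic

run_cmd do
  ty ← to_expr ``(ℕ),
  m ← mk_meta_var ty,           -- a fresh metavariable ?m : ℕ
  lhs ← to_expr ``(0 + %%m),    -- the term 0 + ?m
  rhs ← to_expr ``(0 + 3),      -- the term 0 + 3
  unify lhs rhs,                -- comparing ?m with 3 assigns ?m := 3 in the metavariable context
  m' ← instantiate_mvars m,
  trace m'                      -- prints 3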

Daniel Donnelly (Sep 22 2021 at 19:59):

@Kyle Miller How is the library represented in memory in order to determine what a variable is replaceable with? And then how does Lean search this representation?

Kyle Miller (Sep 22 2021 at 20:21):

@Daniel Donnelly I don't know too much about these internal details, and someone who actually knows something can correct me -- I'm just reading source code. The library_search tactic uses tactic.get_env to get the current "environment," which contains everything that's been declared (lemmas, definitions, and notation) and then uses tactic.suggest.match_head_symbol to figure out which lemmas from the environment might apply to the current goal (there are many many lemmas, so this is an optimization). For each of these potential lemmas, tactic.suggest.apply_and_solve uses tactic.apply (which is what tactic#apply uses) to see if it can be applied. This uses a Lean-internal function, written in C++ for speed, to try to unify the lemma with the goal, while instantiating arguments with metavariables.
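
Very roughly, that flow has this shape in Lean 3 terms (a made-up sketch, not the actual tactic.suggest source; it skips the head-symbol filtering and the solving of remaining goals):

open tactic

meta def naive_library_search : tactic unit :=
do env ← get_env,
   -- linear pass collecting every declaration name in the environment
   let names := env.fold ([] : list name) (λ d l, d.to_name :: l),
   -- try `apply` with each declaration until one succeeds
   first (names.map (λ n, mk_const n >>= λ e, apply e >> skip))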

Kyle Miller (Sep 22 2021 at 20:29):

For a simple example of how apply works:

lemma foo : ∀ (n : ℕ), 0 ≤ n := zero_le

example (n : ℕ) : 0 ≤ n :=
begin
  apply foo,
end

I think it roughly works like this:

  1. Lean tries and quickly fails to unify the type of foo with 0 ≤ n.
  2. Lean creates a metavariable ?m_1 and constructs the expression foo ?m_1, which has the type 0 ≤ ?m_1.
  3. Lean tries to unify this with 0 ≤ n, and it succeeds with the solution ?m_1 =?= n.

(I suspect though that it actually tries step 2 first, and it would fall back to step 1 if it fails.)
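
To make step 2 visible, you can also write the placeholder by hand (a small variation on the example above, not output that apply itself produces):

example (n : ℕ) : 0 ≤ n :=
begin
  refine foo _,  -- the `_` becomes the metavariable ?m_1, and unification then solves ?m_1 =?= n
end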

Daniel Donnelly (Sep 22 2021 at 20:38):

@Kyle Miller So my question now becomes: how is the environment represented as a data structure? Surely it isn't simply a list, and it could be searched a lot more quickly if it were a more appropriate data structure for searching.

Kyle Miller (Sep 22 2021 at 20:40):

The environment seems to be written in C++, but the interface on the Lean side is essentially just a list. (It has a slight optimization in that it exposes a fold operation, so it doesn't have to convert the declarations to a Lean list when you want to access them.)
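
For example, you can count declarations with that fold directly (a toy illustration, not code from mathlib):

open tactic

run_cmd do
  env ← get_env,
  -- fold over the environment without materialising a Lean list of declarations
  let n := env.fold (0 : ℕ) (λ _ k, k + 1),
  trace ("declarations in the environment: " ++ to_string n)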

Daniel Donnelly (Sep 22 2021 at 20:42):

@Kyle Miller
So what data structure do they use in C++? This becomes important at the level I'm coding at, which would be the Lean kernel (C++) level. I want to make sure that the algorithm I choose is not a bad idea w.r.t. what Lean does.

Kyle Miller (Sep 22 2021 at 20:44):

If you're asking because you're wondering about speed, I don't think it matters. library_search is ultimately iterating over the entire list of declarations.

Daniel Donnelly (Sep 22 2021 at 20:45):

@Kyle Miller that's not good, and maybe that's why Lean is slow at ray tracing (1/30th the speed of C++). But it's good news for me, because it means I don't have to be super competitive in the library search feature.

Kyle Miller (Sep 22 2021 at 20:47):

This library_search process doesn't have anything to do with runtime properties -- this is during compilation, so to speak. Also, I take it you're talking about Lean 4 because of the raytracer? Everything I said is about Lean 3.

Kyle Miller (Sep 22 2021 at 20:48):

The raytracer (if it had any proofs) doesn't need to reprove everything every frame.

Daniel Donnelly (Sep 22 2021 at 20:50):

@Kyle Miller yes, I mean whenever library search happens, whether at compile time or otherwise. My search will happen while the app is running, but the language itself doesn't "run"; though it might be interesting to see if that's possible.

Jason Rute (Sep 22 2021 at 21:26):

@Daniel Donnelly I want to be clear. Are you specifically talking about the Lean tactic library_search? Or are you asking about a more general notion of search (e.g. when a user writes measurable f, Lean has to parse and elaborate what measurable and f mean from the environment and local context)?

Daniel Donnelly (Sep 22 2021 at 21:29):

@Jason Rute I want to know how both work; shouldn't they be the same thing? Ideally library_search should be implemented in C++, not Lean itself, because you need ultimate speed to search known math :)

Scott Morrison (Sep 22 2021 at 21:35):

No, they are not the same thing at all, I'm not sure what you could mean.

Scott Morrison (Sep 22 2021 at 21:36):

library_search will presumably be much faster in Lean4!

Jason Rute (Sep 22 2021 at 21:37):

I caution that, since you are asking very technical questions, it is good to be specific. How the parser works is fairly complicated. I don't think library search works on the same idea as the parser and elaborator, since in library search you have to match an expression to a theorem expression via unification. I don't think (but am not sure) that there is any helpful indexing of theorems in Lean 3 that might significantly speed up library search. I believe it is a linear search. (I could be wrong here.) I think that, more than the language of implementation, is the speed issue.

Jason Rute (Sep 22 2021 at 21:37):

I do think the names of constants in the environment are indexed, which makes their lookup in the parser faster.

Scott Morrison (Sep 22 2021 at 21:39):

Yes, library_search is just a linear search over the entire environment. We do a tiny bit of filtering (only selecting declarations in the environment whose "head symbol" matches the goal), but in the end apply is so incredibly fast (in particular, it fails fast) that mostly it is better to just hand everything over to it.

Jason Rute (Sep 22 2021 at 21:41):

Is that filter based on an index (like a hash map) or do you just skip apply if the head symbol is wrong?

Scott Morrison (Sep 22 2021 at 21:42):

No indexing. In Lean 3 there's no real way to keep a map from head symbols to declarations available and up to date.

Scott Morrison (Sep 22 2021 at 21:42):

So we just do a filter on the environment, checking the head symbol, and then a map trying to actually apply the declaration against the current goal.
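
For illustration, the head-symbol check could look something like this in Lean 3 (the helper name here is made up; the real check is tactic.suggest.match_head_symbol):

-- hypothetical helper: the head constant of a declaration's statement,
-- found by skipping ∀-binders and peeling off applications
meta def head_const : expr → option name
| (expr.pi _ _ _ b) := head_const b
| (expr.app f _)    := head_const f
| (expr.const n _)  := some n
| _                 := none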

Scott Morrison (Sep 22 2021 at 21:43):

(I was shocked when I first tried it out that an approach as dumb as that was actually viable. Unfortunately mathlib is large enough now that deep in the import hierarchy library_search becomes too slow to be usable.)

Daniel Donnelly (Sep 22 2021 at 21:51):

Scott Morrison said:

No indexing. In Lean 3 there's no real way to keep a map from head symbols to declarations available and up to date.

Yes, so it's O(n f(n)) but needs to be O((log n) f(n)), where n is the library size and f(n) is the running time of whatever needs to be done at each item in the library. I'm wondering if my code will be any faster, since it has to do a subgraph isomorphism search: the brute-force approach would be, for each subgraph of the user's drawn graph, to try to match a graphical rule's input graph (in the library) to that subgraph, and then do any additional checks that are not handled by the subgraph search.

The reason for graphs is that that's the natural format of my visual language. There are Node & Arrow classes, and they're connected by pointers/references. So the language naturally has a graph structure rather than a purely textual format.

Jason Rute (Sep 22 2021 at 21:59):

Again, to be specific, is what you are doing the same thing as library search? Matching a goal, e.g. x y : nat |- x + y = y + x, to a theorem in the library? If you have a fixed library of theorems, I think there are known ways to do this efficiently. The problem is that the Lean environment is constantly changing, and so is the list of theorems in it. Again, I think this is being addressed in Lean 4, but I don't know the details.
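
For a concrete instance of that kind of match (a standard toy example, assuming mathlib's tactic library is imported):

import tactic

example (x y : ℕ) : x + y = y + x :=
by library_search  -- should suggest something like `exact add_comm x y`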

Daniel Donnelly (Sep 22 2021 at 22:08):

@Jason Rute What I'm doing is matching the user's input math (they can draw whatever graph they want) to a rule's input in the library, so instead of the user entering that whole line x y : nat |- x + y = ..., they would just enter, say, x and y as two nodes within a parent node nat, and the library would suggest the theorem by its input arguments. But the user could also enter the whole theorem, and that would match the whole theorem too. Essentially any labeled math graph that the user enters is a valid query, but it may not be mathematically valid under the library's current collection of definitions. Say the user enters f : A -> B but also f : C -> D with A != C. Entering that should highlight parts of their graph in an error color.

Eric Wieser (Sep 22 2021 at 23:14):

This is totally valid lean code though:

example {A B C D : Type*} (f : A → B) (f : C → D) (h : A ≠ C) : sorry := sorry

So that example isn't very insightful.

Sebastian Ullrich (Sep 23 2021 at 07:09):

Scott Morrison said:

library_search will presumably be much faster in Lean4!

It already is https://leanprover.zulipchat.com/#narrow/stream/270676-lean4/topic/.23find/near/243616013


Last updated: Dec 20 2023 at 11:08 UTC