Zulip Chat Archive

Stream: mathlib4

Topic: potential improvements to library_search


Scott Morrison (Mar 10 2023 at 09:21):

library_search in mathlib4 is smarter than it was in mathlib3 --- we store all the declarations in a discrimination tree, rather than a single big bucket. This means that we can use the goal to quickly identify only the lemmas that could possibly match.

The basic idea here is that for each lemma, we look at the conclusion (i.e. the type, after all binders), and write down a linearised version of the type, with various things (arguments of the lemma, and typeclass instances) replaced with wildcards.

However, there are many lemmas that the discrimination tree claims are relevant for any possible goal (or, relatedly, any possible equational goal). Usually these lemmas are pretty unlikely --- they will in fact successfully apply against the goal, but there's no chance we will be able to fill in the arguments from the local hypotheses (this step is handled by solve_by_elim).

Moreover, these "bad" lemmas really clutter up the output of library_search when it fails to close the goal (i.e. what the output that we used to get from suggest in mathlib3).

A few possible proposals:

  1. just discard these universal lemmas, so library_search doesn't even try them (anything with discrimination key #[*] or #[Eq, *, *, *]).
  2. try the universal lemmas, but only after the more specific ones?
  3. try the universal lemmas, but don't report them in the output when we fail to close the goal.

Proposal 1. does speed up library_search measurably, but at a slight cost: I can post some examples where we'd now fail. Proposal 2. probably doesn't gain us much --- maybe a slight speed up on successful runs, because we're trying out fewer of the unlikely lemmas before we get to the right one. Proposal 3. would additionally give much better output when library_search fails.

I've implemented 1. on a branch, but thought I'd check for feedback or suggestions before trying out 2. / 3.

Scott Morrison (Mar 10 2023 at 09:22):

If anyone had ideas for how to better filter the "matches everything" lemmas according to the hypotheses, that would be cool, but I don't see anything practicable yet.

Kevin Buzzard (Mar 10 2023 at 14:27):

One fix for "these lemmas clutter up the output when library_search fails to close the goal" is to go the Lean 3 way and just report that library_search failed, without showing the 100 near-misses.

Jireh Loreaux (Mar 10 2023 at 14:56):

How hard would it be to implement 1 and then implement 2 when we pass ! or just a config? It seems like both options would be useful.

Gabriel Ebner (Mar 10 2023 at 21:01):

I think (1) is a reasonable choice.

Gabriel Ebner (Mar 10 2023 at 21:01):

@Scott Morrison Can you post examples of where we get a #[*] key for a useful lemma?

Gabriel Ebner (Mar 10 2023 at 21:03):

Also library_search doesn't see through semireducible definitions at the moment. How problematic is this in practice? I have wanted to implement this by inserting reduced versions of the theorem statement into the discrimination tree, but I've never got around to it.

Mario Carneiro (Mar 10 2023 at 22:44):

Uh... you want to unfold semireducible definitions?

Mario Carneiro (Mar 10 2023 at 22:44):

That's nearly everything

Gabriel Ebner (Mar 10 2023 at 22:51):

I was thinking about only the root function at first, e.g. apply lemmas about HasDerivAtFilter to a HasDerivAt goal. But yeah, that could be super expensive if done recursively.

James Gallicchio (Mar 12 2023 at 03:31):

what did mathlib3 library_search use for its heuristic?

Scott Morrison (Mar 13 2023 at 02:44):

@Gabriel Ebner, here are some examples that rely on allowing lemmas with a #[*] key:

example (p : Prop) : (¬¬p)  p := by library_search -- says: `exact not_not.mp`
example (a b : Prop) (h : a  b) : a := by library_search -- says: `exact h.left`
example (P Q : Prop) : (¬ Q  ¬ P)  (P  Q) := by library_search -- says:`exact Function.mtr`
example {α : Sort u} (h : Empty) : α := by library_search -- says: `exact Empty.elim h`
example (f : A  C) (g : B  C) : (A  B)  C := by library_search -- says: `exact Sum.elim f g`
example (n : ) (r : ) :  := by library_search using n, r -- says: `exact nsmulRec n r`

I think I would be happy to "lose" these ones. With the exception of the last one (which is not the most useful answer anyway), these all feel like "basic logic" things, an perhaps out of scope for library_search.

Scott Morrison (Mar 13 2023 at 02:46):

Here's the timing data:


With all lemmas:
lake env lean test/librarySearch.lean 212.08s user 2.17s system 98% cpu 3:36.44 total
lake env lean test/librarySearch.lean 194.68s user 1.49s system 99% cpu 3:18.03 total

With all lemmas, just building the cache:
lake env lean test/librarySearch.lean 59.59s user 1.07s system 99% cpu 1:00.69 total

Without #[*] lemmas:
lake env lean test/librarySearch.lean 96.73s user 1.33s system 99% cpu 1:38.33 total
lake env lean test/librarySearch.lean 96.44s user 1.31s system 99% cpu 1:37.93 total

Without #[*] lemmas, just building the cache:
lake env lean test/librarySearch.lean 55.84s user 1.10s system 99% cpu 57.008 total


That is, just starting out the library_search cache takes almost a minute now (with all of current mathlib4 imported). After that, the library_search test file takes approximately 140s to run, or approximately 36s to run if we drop all the "non-specific" lemmas.

I think that's a pretty strong argument in favour of doing this filtering!

Scott Morrison (Mar 13 2023 at 02:46):

@James Gallicchio, in mathlib3 we did not use a discrimination tree, we just tried everything every time, so these non-specific lemmas were always considered.

James Gallicchio (Mar 13 2023 at 02:52):

like a defeq check against every lemma?

Michael Stoll (Mar 13 2023 at 02:56):

In mathlib3, there is also library_search!, but I have no idea what the difference is (the documentation is silent on this).
Maybe we can use library_search! for a variant that includes the non-specific lemmas?
Then library_serach would be reasonably fast and good for most cases, but if one wants to find something fairly general, but does not remember the name, library_search! would do it.

Scott Morrison (Mar 13 2023 at 02:58):

@James Gallicchio, library_search uses apply, so it's not exactly a defeq check, but pretty close. It is trying to unify the conclusion of the lemma with the current goal, and then use solve_by_elim to fill in any arguments.

Scott Morrison (Mar 13 2023 at 03:02):

To enable library_search! (or something similar) to use the non-specific lemmas, we'd have to actually keep track of them (my proposal at the moment is just to drop them during indexing). I can do this, but it's slightly more work so I'm happy to hear other opinions. :-)

Jireh Loreaux (Mar 13 2023 at 03:14):

Is there a way to cache the discrimination tree?

Scott Morrison (Mar 13 2023 at 03:42):

At present there isn't a way to cache it across files. (That is, the first time you use library_search in a given file, it takes a long time to "boot up", but then is fast for repeated uses in that file.)

Scott Morrison (Mar 13 2023 at 03:43):

I would like to see this, and in fact to have the libary_search caches built into oleans, so that the work can be done in CI.

Scott Morrison (Mar 13 2023 at 03:43):

I haven't thought about how to do this. If someone has an outline of how this could be achieved, I'd be happy to try to implement it.

Michael Stoll (Mar 13 2023 at 04:19):

Having the cache built automatically by CI sounds like the right thing to do.

Patrick Massot (Mar 13 2023 at 12:08):

I think that having a fast library_search is really useful. When this tactic first came to Lean 3 I was using it everyday, multiple times per day, even for lemma I knew I could find rather easily (say I would have known precisely in which file it would be). Then mathlib3 grew and it became slower and slower and I mostly stopped using it.

Kevin Buzzard (Mar 13 2023 at 13:01):

...perhaps because your knowledge of the library had got greater. I am using it all the time in my teaching, because I'm supposed to be demonstrating most of the undergraduate-level contents of the library and there are parts of it which I don't know at all well. However another advantage of experience is knowing about the naming convention (and also knowing how to search, e.g. I realised that writing rw continuouscomp<ctrl-space> is a better idea than rw continuous_comp<ctrl-space> or rw continuous.comp<ctrl-space> because it will find whichever one is the right one.

Floris van Doorn (Mar 13 2023 at 13:50):

A 4x improvement in applying the lemmas sounds like an excellent reason to remove the #[*] lemmas by default. I think it would be good to have a configuration option to still apply these lemmas, but I think the default should be quick.
It is unfortunate that building the cache takes a full minute.
Am I correct that currently, if I'm in the middle of writing a proof where I'm using library_search, and I am not leaving any existing occurrences of library_search around (so I'll have 0 or 1 occurrences at all times), the server will recompute the cache whenever I write a new occurrence of library_search? That sounds not very nice to use, and it suggests that we also want a command compute_library_search_cache (or something) that we can put at the start of the file.

Or, to build the caches into the oleans, you can probably use an environment extension and update the extension each time a declaration is added.

Gabriel Ebner (Mar 13 2023 at 17:16):

not leaving any existing occurrences of library_search around (so I'll have 0 or 1 occurrences at all times), the server will recompute the cache whenever I write a new occurrence

The cache is stored in a global variable. It will only be recomputed if you restart the server (for the file).

Floris van Doorn (Mar 13 2023 at 18:09):

Ah, I see. That is nice!

Scott Morrison (Mar 15 2023 at 00:07):

mathlib4#2888


Last updated: Dec 20 2023 at 11:08 UTC