Zulip Chat Archive

Stream: general

Topic: Implementing find_theorems


Joachim Breitner (Jul 24 2023 at 16:01):

As an Isabelle-exile, I dearly miss Isabelle’s find_theorems command. It can search for lemmas in various ways (name substrings, type patterns), but maybe the most important is “Give me all lemmas that mention both foo and bar in the type”. It’s similar to library_search, but doesn’t require the needle to be at the head of the type.

I made a quick experiment using ad-hoc code:
A file importing import Mathlib.Topology.Algebra.InfiniteSum.Basic, but nothing else from Mathlib, finds all lemmas mentioning HasSum in 6 seconds; this is a linear scan through the whole environment. Not amazing, but already somewhat practical.
With import Mathlib it now runs for multiple minutes, so certainly needs optimization to be practical – probably some kind of reverse index mapping constants to the set of definitions they appear in. The DeclCache infrastructure can probably help here…

Kalle Kytölä (Jul 24 2023 at 16:11):

Joachim Breitner said:

As an Isabelle-exile, I dearly miss Isabelle’s fine_theorems command. ...

I like Lean/Mathlib theorems so much that my naming suggestion would be: not_just_fine_but_beautiful_theorems. :upside_down:

Joachim Breitner (Jul 24 2023 at 16:18):

Ok, the one on Mathlib still runs… but presumably if I run it like that, my command is interpreted, not compiled? Will try to see if it makes a difference (tomorrow or so)

Jannis Limperg (Jul 24 2023 at 16:26):

Try docs#Lean.Expr.foldConsts for more efficient expression traversal. If you use that, compiling probably won't make much of a difference any more because all the work will happen in already-compiled core functions.

Joachim Breitner (Jul 24 2023 at 16:37):

Ah, again shows how bad I am at finding stuff, as I was looking for Expr traversals there…

Jannis Limperg (Jul 24 2023 at 16:43):

Yeah, there's a bunch of these specialised traversals (collectMVars and collectFVars come to mind) and they're not easy to find.

Joachim Breitner (Jul 24 2023 at 20:03):

With the smaller set of imports, down from 6s to 4s. Good, but not a game changer.

Joachim Breitner (Jul 24 2023 at 20:16):

Ah, if I put the code into a

@[command_elab find_theorems] def findTheoremsElab : CommandElab := λ _stx => do

in file A.lean, and import that from B.lean to use the command, so that both are compiled, searching throuh all of impot Mathlib finishes in 11s. That’s already usable!
(Fun fact: there are 296 definitions mentioning HasSum.)

Yury G. Kudryashov (Jul 24 2023 at 20:37):

So far, git grep is faster.

Yury G. Kudryashov (Jul 24 2023 at 20:37):

Though typing the query may be the bottleneck anyway.

Damiano Testa (Jul 24 2023 at 20:42):

I have the feeling that competing on speed with grep is a losing battle.

However, grep will not find automatically generated declarations

Joachim Breitner (Jul 24 2023 at 21:10):

Or lemmas where the two constants are not on the same line…

Joachim Breitner (Jul 24 2023 at 21:11):

find_theorems.gif
This looks rather promising…

Code at <https://github.com/nomeata/rerolling-sixes/blob/master/RerollingSixes/FindTheorems.lean> for now.

Scott Morrison (Jul 24 2023 at 22:16):

We might be able to build a global cache, that is stored in an olean generated by CI, like rw? and exact? do.

Scott Morrison (Jul 24 2023 at 22:17):

This could then tell you about theorems that haven't been imported yet.

Scott Morrison (Jul 24 2023 at 22:20):

But maybe this cache would be too big. I'm not exactly sure what we would have to store: a c : NameMap NameSet so that (c.get n).contains m tells you whether n appears in the type of m?

Scott Morrison (Jul 24 2023 at 22:21):

We'd possibly want to separately cache a lookup table of declaration-name to module-of-definition (as of the last CI), so that we can tell users what they would need to import for results that are not yet imported.

Adam Topaz (Jul 24 2023 at 22:22):

(deleted)

Adam Topaz (Jul 24 2023 at 22:25):

deleted as that's what Scott essentially said in the last message :)

Joachim Breitner (Jul 24 2023 at 22:31):

You could also have a tab open with a scratch file that just imports Mathlib, where you #search for stuff. This way you find it all, and the slow commands don't interfere with the processing of the files you look at.

I also wonder how big the cache will be. It'll still be a fraction of the Environment, so maybe doable. Or maybe I'll have to nerd out with bloom filters and whatnot :-)

I'll look into caching next, maybe using DeclCache.

Adam Topaz (Jul 24 2023 at 22:32):

BTW, you probably want to exclude internal stuff with docs#Lean.Name.isInternal'

Scott Morrison (Jul 24 2023 at 22:33):

Yes, using DeclCache is the right idea, (in part) because then you can just copy and paste existing code to get CI to generate a global version for you.

Joachim Breitner (Jul 25 2023 at 08:52):

I am reading through the code at <https://github.com/leanprover-community/mathlib4/blob/c652e603884371ddd609b7bee3c2229991d9d6fa/Mathlib/Tactic/Cache.lean>, and I wonder how the Cache gets preserved when the inputs are unchanged, or rather, how it gets invalidated when the imports of a module change.

Sebastian Ullrich (Jul 25 2023 at 09:04):

Imports changing is the simple case because the whole process gets restarted :)

Scott Morrison (Jul 25 2023 at 09:33):

@Joachim Breitner, the Cache there is rebuilt at the beginning of each file. It is not a global cache, and so whenever you change the imports that is going to trigger a rebuild.

exact? and rw? on top of that implement a global cache (i.e. in CI a file that has imported Mathlib writes out its Cache in an olean, and then exact? and rw? load that as needed).

Joachim Breitner (Jul 25 2023 at 09:35):

Thanks. Why exactl does changing the imports trigger a rebuild? Does changing the imports actually restart the complete per-file lean process?

Joachim Breitner (Jul 25 2023 at 09:36):

Nice. With the cache the first invocation takes ~20s, but then searching is essentially instantaneous

Scott Morrison (Jul 25 2023 at 09:36):

Exactly. Processing the imports is completely separate from processing the rest of the file. I'd have to dig to find the code, but if you grep for Frontend you'll find examples calling processImports, which is this initial step.

Scott Morrison (Jul 25 2023 at 09:37):

And with a global cache, you can hand off that ~20s to CI. It turns out loading oleans is really fast (hooray memory mapping).

Joachim Breitner (Jul 25 2023 at 09:37):

Thanks, no digging needed, that’s the clarification I needed :-)

Joachim Breitner (Jul 25 2023 at 09:42):

find_theorems2.gif

Sebastian Ullrich (Jul 25 2023 at 09:49):

You shouldn't need the brackets fwiw, the sequence of idents will be terminated by the keyword of the next statement

Joachim Breitner (Jul 25 2023 at 10:08):

I tried that briefly, and it would make the command pick up partial keywords as I type them in the next line, causing supurious messages, which I found unergonomic. But I guess that's a general problem with the lean syntax, and for consistency no brackets are better?

Scott Morrison (Jul 25 2023 at 10:24):

Can a colGt ensure that it only picks up things on a subsequent line if they are indented?

Sebastian Ullrich (Jul 25 2023 at 10:28):

Yes, (colGt ident)+ is the standard approach for this

Joachim Breitner (Jul 25 2023 at 11:57):

Ah, perfect!

Joachim Breitner (Jul 25 2023 at 12:11):

Hmm, doesn’t seem t work work. Neither

syntax (name := find_theorems) "#find_theorems" (colGt ident)+ : command

nor

syntax (name := find_theorems) "#find_theorems" (colGt (ident+)) : command

prevents

#find_theorems HasSum tsum
xy

from interpreting xy as part of the list of ident

Sebastian Ullrich (Jul 25 2023 at 12:21):

Ah, it should be withPosition("#find_theorems" (colGt ident)+)

Joachim Breitner (Jul 25 2023 at 12:27):

Thanks, that works!

Joachim Breitner (Jul 25 2023 at 12:46):

Is there a way to “flush” the log messages? The first time this command is used (with all of mathlib in scope) it builds the cache and that takes a while, but just putting logInfo in the command will not show this info until the command is done. Is there a way to make it display “Building cache, please hang one” while the command is still running?

Sebastian Ullrich (Jul 25 2023 at 13:13):

Not right now, there is no asynchronous communication between elaboration and language server

Joachim Breitner (Jul 25 2023 at 13:24):

Ok, in that case I declare #find_theorems has reached MVP status.
Should I put it into a mathlib PR, as its own thing?
Or should I instead kick off a discussion if we can expand #find to support a richer algebra of queries – in addition to the patterns it supports also constants and maybe names?
Or both – introduce #find_theorems (or better #find_by_consts) soon to get user feedback, and then work towards unifying with #find.

Scott Morrison (Jul 27 2023 at 00:04):

Let's get this into Mathlib first, and then worry about unification with other commands.

Scott Morrison (Jul 27 2023 at 00:04):

how about #mentioning as the name?

Joachim Breitner (Jul 27 2023 at 06:11):

I had a chat with Sebastian yesterday, who wrote #find in the first place, and he encouraged me to extend #find instead (if you can search for consts, and for patterns, then it will be weird if you can't search for the conjunction). So my current plan would be to make a PR that subsumes #find right awat, if that's reasonable.

Joachim Breitner (Jul 28 2023 at 22:23):

In my local prototype I have now added #finds ability to search by patterns to my #find_theorem. It seems enough, or even better, to have a const-to-def cache, so if you do

 #find_theorems (Lean.Syntax  Lean.Name)

it can use the cache to quickly find the set of definitions mentioning Lean.Syntax, the set of definitions mentioning Lean.Name, take the intersection, and on only that (hopefully smaller) list select those that match this pattern.

It also seems that this cache is cheaper to construct than the HeadIndex-indexed cache used by #find: For a file importing Mathlib the first

 #find_theorems (Lean.Syntax  Lean.Name)

takes 22s while the first

 #find  (Lean.Syntax  Lean.Name)

takes 101s.

And it can be useful to combine the two, for example

#find_theorems (∑' _, _ = _) HMul.hMul HPow.hPow

finds all lemmas that give the value of a series where multiplication and power occurs.

I am inclined to also add a way to search for patterns _anywhere_ in the type (e.g. _ * _ ^ _), not necessarily only rooted at the conclusion. But for that I need a nice idea for a syntax that distinguishes the two variants.

Joachim Breitner (Aug 03 2023 at 14:01):

I am now working on the variant where I am finding for patterns anywhere in the type of constants. This is my approach so far:

structure PreparedPattern extends  AbstractMVarsResult where
  headIndex : HeadIndex

def preparePat (t : Expr) : MetaM PreparedPattern := do
  let amr <- withReducible $ do abstractMVars ( instantiateMVars t)
  pure { toAbstractMVarsResult := amr, headIndex := t.toHeadIndex }

def matchPatAnywhere (pat : PreparedPattern) (c : ConstantInfo) : MetaM Bool := do
  let found   IO.mkRef false
  withReducible do
    let cTy := c.instantiateTypeLevelParams ( mkFreshLevelMVars c.numLevelParams)
    Lean.Meta.forEachExpr' cTy fun sub_e => do
      if pat.headIndex == sub_e.toHeadIndex then do
        withNewMCtxDepth $ do
          let pat := pat.expr.instantiateLevelParamsArray pat.paramNames
            ( mkFreshLevelMVars pat.numMVars).toArray
          let (_, _, pat)  lambdaMetaTelescope pat
          if  isDefEq pat sub_e
          then found.set true
      -- keep searching if we haven't found it yet
      not <$> found.get
  found.get

It seems to work, but it is quite slow, for example with

#find_theorems (_ * (_ + _))

(and note that it only looks at lemmas that mention both HMul.hMul and HAdd.hAdd ), and I run out of heartbeats sometimes.

It seems that there some definitions with very large types, e.g.

#check MonoidAlgebra.liftMagma.proof_5

which you very likely don’t want to #find anyways. Can I recognize such definitions easily? I am already using

-- from Lean.Server.Completion
-- from Mathlib.Tactic.Find
-- (someone should put it somewhere public)
private def isBlackListed (declName : Name) : MetaM Bool := do
  let env  getEnv
  pure $ declName.isInternal
   || isAuxRecursor env declName
   || isNoConfusion env declName
  <||> isRec declName
  <||> Meta.isMatcher declName

Adam Topaz (Aug 03 2023 at 14:11):

I think most will be caught by docs#Lean.Name.isInternal'

Adam Topaz (Aug 03 2023 at 14:12):

That at least catches the proof_37 names

Joachim Breitner (Aug 03 2023 at 14:13):

Thanks! I’m surprised this wasn’t already in the blacklist filter that I stole from #find

Adam Topaz (Aug 03 2023 at 14:14):

It’s part of docs#Lean.Name.isBlackListed

Joachim Breitner (Aug 03 2023 at 14:14):

Great, now

#find_theorems (_ * (_ + _))

takes 3s (with a warm cache) and finds plausible 161 theorems.

Joachim Breitner (Aug 03 2023 at 15:18):

I also added filtering by name substring (this helps with debuggging :-)), so

#find_theorems (_ * (_ + _)) "distr"

will print (instantaneously, with the additional name filter)

Found 2113 definitions mentioning HMul.hMul and HAdd.hAdd.
Of these, 31 have a name containing all of "distr".
Of these, 14 have mention your pattern(s).
 left_distrib
 CanonicallyOrderedCommSemiring.left_distrib
 Distrib.left_distrib
 IsLprojection.distrib_lattice_lemma
 LeftDistribClass.left_distrib
 Nat.left_distrib
 NonUnitalNonAssocRing.left_distrib
 NonUnitalNonAssocSemiring.left_distrib
 OreLocalization.left_distrib
 PGame.left_distrib_equiv
 PGame.quot_left_distrib
 FreeAlgebra.Rel.left_distrib
 LucasLehmer.X.left_distrib
 FreeAlgebra.Rel.below.left_distrib

and the lemma names will show their type when hovering.

Joachim Breitner (Aug 03 2023 at 16:33):

And because the matching of patterns against subexpressions uses defEq and works with metavariables, it seems to support non-linear patterns just fine:

#find_theorems (Real.sqrt ?a * Real.sqrt ?a)

Found 8 definitions mentioning HMul.hMul and Real.sqrt.
Of these, 1 match your patterns.
 Real.mul_self_sqrt

Jireh Loreaux (Aug 03 2023 at 17:49):

This looks amazing. :heart:

Kevin Buzzard (Aug 03 2023 at 20:05):

Yeah that ability to do that kind of matching was exactly what was missing from Lean 3 #find. I couldn't say List X -> List X, I had to say List X -> List Y and then get a bunch of spurious results.

Joachim Breitner (Aug 04 2023 at 08:21):

I think #find (List ?a → List ?a) already works as intended with the present #find implementation, but maybe it’s not documented like that. I’ll make sure to point it out in the docs for the new version.

Joachim Breitner (Aug 04 2023 at 10:08):

The new #find command is ready for reviewing: https://github.com/leanprover-community/mathlib4/pull/6363


Last updated: Dec 20 2023 at 11:08 UTC