Zulip Chat Archive
Stream: general
Topic: Implementing find_theorems
Joachim Breitner (Jul 24 2023 at 16:01):
As an Isabelle-exile, I dearly miss Isabelle’s find_theorems
command. It can search for lemmas in various ways (name substrings, type patterns), but maybe the most important is “Give me all lemmas that mention both foo
and bar
in the type”. It’s similar to library_search
, but doesn’t require the needle to be at the head of the type.
I made a quick experiment using ad-hoc code:
A file importing import Mathlib.Topology.Algebra.InfiniteSum.Basic
, but nothing else from Mathlib, finds all lemmas mentioning HasSum
in 6 seconds; this is a linear scan through the whole environment. Not amazing, but already somewhat practical.
With import Mathlib
it now runs for multiple minutes, so certainly needs optimization to be practical – probably some kind of reverse index mapping constants to the set of definitions they appear in. The DeclCache
infrastructure can probably help here…
Kalle Kytölä (Jul 24 2023 at 16:11):
Joachim Breitner said:
As an Isabelle-exile, I dearly miss Isabelle’s
fine_theorems
command. ...
I like Lean/Mathlib theorems so much that my naming suggestion would be: not_just_fine_but_beautiful_theorems
. :upside_down:
Joachim Breitner (Jul 24 2023 at 16:18):
Ok, the one on Mathlib
still runs… but presumably if I run it like that, my command is interpreted, not compiled? Will try to see if it makes a difference (tomorrow or so)
Jannis Limperg (Jul 24 2023 at 16:26):
Try docs#Lean.Expr.foldConsts for more efficient expression traversal. If you use that, compiling probably won't make much of a difference any more because all the work will happen in already-compiled core functions.
Joachim Breitner (Jul 24 2023 at 16:37):
Ah, again shows how bad I am at finding stuff, as I was looking for Expr traversals there…
Jannis Limperg (Jul 24 2023 at 16:43):
Yeah, there's a bunch of these specialised traversals (collectMVars
and collectFVars
come to mind) and they're not easy to find.
Joachim Breitner (Jul 24 2023 at 20:03):
With the smaller set of imports, down from 6s to 4s. Good, but not a game changer.
Joachim Breitner (Jul 24 2023 at 20:16):
Ah, if I put the code into a
@[command_elab find_theorems] def findTheoremsElab : CommandElab := λ _stx => do
in file A.lean
, and import that from B.lean
to use the command, so that both are compiled, searching throuh all of impot Mathlib
finishes in 11s. That’s already usable!
(Fun fact: there are 296 definitions mentioning HasSum
.)
Yury G. Kudryashov (Jul 24 2023 at 20:37):
So far, git grep
is faster.
Yury G. Kudryashov (Jul 24 2023 at 20:37):
Though typing the query may be the bottleneck anyway.
Damiano Testa (Jul 24 2023 at 20:42):
I have the feeling that competing on speed with grep is a losing battle.
However, grep will not find automatically generated declarations
Joachim Breitner (Jul 24 2023 at 21:10):
Or lemmas where the two constants are not on the same line…
Joachim Breitner (Jul 24 2023 at 21:11):
find_theorems.gif
This looks rather promising…
Code at <https://github.com/nomeata/rerolling-sixes/blob/master/RerollingSixes/FindTheorems.lean> for now.
Scott Morrison (Jul 24 2023 at 22:16):
We might be able to build a global cache, that is stored in an olean
generated by CI, like rw?
and exact?
do.
Scott Morrison (Jul 24 2023 at 22:17):
This could then tell you about theorems that haven't been imported yet.
Scott Morrison (Jul 24 2023 at 22:20):
But maybe this cache would be too big. I'm not exactly sure what we would have to store: a c : NameMap NameSet
so that (c.get n).contains m
tells you whether n
appears in the type of m
?
Scott Morrison (Jul 24 2023 at 22:21):
We'd possibly want to separately cache a lookup table of declaration-name to module-of-definition (as of the last CI), so that we can tell users what they would need to import for results that are not yet imported.
Adam Topaz (Jul 24 2023 at 22:22):
(deleted)
Adam Topaz (Jul 24 2023 at 22:25):
deleted as that's what Scott essentially said in the last message :)
Joachim Breitner (Jul 24 2023 at 22:31):
You could also have a tab open with a scratch file that just imports Mathlib
, where you #search for stuff. This way you find it all, and the slow commands don't interfere with the processing of the files you look at.
I also wonder how big the cache will be. It'll still be a fraction of the Environment, so maybe doable. Or maybe I'll have to nerd out with bloom filters and whatnot :-)
I'll look into caching next, maybe using DeclCache
.
Adam Topaz (Jul 24 2023 at 22:32):
BTW, you probably want to exclude internal stuff with docs#Lean.Name.isInternal'
Scott Morrison (Jul 24 2023 at 22:33):
Yes, using DeclCache
is the right idea, (in part) because then you can just copy and paste existing code to get CI to generate a global version for you.
Joachim Breitner (Jul 25 2023 at 08:52):
I am reading through the code at <https://github.com/leanprover-community/mathlib4/blob/c652e603884371ddd609b7bee3c2229991d9d6fa/Mathlib/Tactic/Cache.lean>, and I wonder how the Cache
gets preserved when the inputs are unchanged, or rather, how it gets invalidated when the imports of a module change.
Sebastian Ullrich (Jul 25 2023 at 09:04):
Imports changing is the simple case because the whole process gets restarted :)
Scott Morrison (Jul 25 2023 at 09:33):
@Joachim Breitner, the Cache
there is rebuilt at the beginning of each file. It is not a global cache, and so whenever you change the imports that is going to trigger a rebuild.
exact?
and rw?
on top of that implement a global cache (i.e. in CI a file that has imported Mathlib
writes out its Cache
in an olean, and then exact?
and rw?
load that as needed).
Joachim Breitner (Jul 25 2023 at 09:35):
Thanks. Why exactl does changing the imports trigger a rebuild? Does changing the imports actually restart the complete per-file lean process?
Joachim Breitner (Jul 25 2023 at 09:36):
Nice. With the cache the first invocation takes ~20s, but then searching is essentially instantaneous
Scott Morrison (Jul 25 2023 at 09:36):
Exactly. Processing the imports is completely separate from processing the rest of the file. I'd have to dig to find the code, but if you grep for Frontend
you'll find examples calling processImports
, which is this initial step.
Scott Morrison (Jul 25 2023 at 09:37):
And with a global cache, you can hand off that ~20s to CI. It turns out loading oleans is really fast (hooray memory mapping).
Joachim Breitner (Jul 25 2023 at 09:37):
Thanks, no digging needed, that’s the clarification I needed :-)
Joachim Breitner (Jul 25 2023 at 09:42):
Sebastian Ullrich (Jul 25 2023 at 09:49):
You shouldn't need the brackets fwiw, the sequence of idents will be terminated by the keyword of the next statement
Joachim Breitner (Jul 25 2023 at 10:08):
I tried that briefly, and it would make the command pick up partial keywords as I type them in the next line, causing supurious messages, which I found unergonomic. But I guess that's a general problem with the lean syntax, and for consistency no brackets are better?
Scott Morrison (Jul 25 2023 at 10:24):
Can a colGt
ensure that it only picks up things on a subsequent line if they are indented?
Sebastian Ullrich (Jul 25 2023 at 10:28):
Yes, (colGt ident)+
is the standard approach for this
Joachim Breitner (Jul 25 2023 at 11:57):
Ah, perfect!
Joachim Breitner (Jul 25 2023 at 12:11):
Hmm, doesn’t seem t work work. Neither
syntax (name := find_theorems) "#find_theorems" (colGt ident)+ : command
nor
syntax (name := find_theorems) "#find_theorems" (colGt (ident+)) : command
prevents
#find_theorems HasSum tsum
xy
from interpreting xy
as part of the list of ident
Sebastian Ullrich (Jul 25 2023 at 12:21):
Ah, it should be withPosition("#find_theorems" (colGt ident)+)
Joachim Breitner (Jul 25 2023 at 12:27):
Thanks, that works!
Joachim Breitner (Jul 25 2023 at 12:46):
Is there a way to “flush” the log messages? The first time this command is used (with all of mathlib in scope) it builds the cache and that takes a while, but just putting logInfo
in the command will not show this info until the command is done. Is there a way to make it display “Building cache, please hang one” while the command is still running?
Sebastian Ullrich (Jul 25 2023 at 13:13):
Not right now, there is no asynchronous communication between elaboration and language server
Joachim Breitner (Jul 25 2023 at 13:24):
Ok, in that case I declare #find_theorems
has reached MVP status.
Should I put it into a mathlib PR, as its own thing?
Or should I instead kick off a discussion if we can expand #find
to support a richer algebra of queries – in addition to the patterns it supports also constants and maybe names?
Or both – introduce #find_theorems
(or better #find_by_consts
) soon to get user feedback, and then work towards unifying with #find
.
Scott Morrison (Jul 27 2023 at 00:04):
Let's get this into Mathlib first, and then worry about unification with other commands.
Scott Morrison (Jul 27 2023 at 00:04):
how about #mentioning
as the name?
Joachim Breitner (Jul 27 2023 at 06:11):
I had a chat with Sebastian yesterday, who wrote #find in the first place, and he encouraged me to extend #find instead (if you can search for consts, and for patterns, then it will be weird if you can't search for the conjunction). So my current plan would be to make a PR that subsumes #find right awat, if that's reasonable.
Joachim Breitner (Jul 28 2023 at 22:23):
In my local prototype I have now added #find
s ability to search by patterns to my #find_theorem
. It seems enough, or even better, to have a const-to-def cache, so if you do
#find_theorems (Lean.Syntax → Lean.Name)
it can use the cache to quickly find the set of definitions mentioning Lean.Syntax
, the set of definitions mentioning Lean.Name
, take the intersection, and on only that (hopefully smaller) list select those that match this pattern.
It also seems that this cache is cheaper to construct than the HeadIndex-indexed cache used by #find
: For a file importing Mathlib
the first
#find_theorems (Lean.Syntax → Lean.Name)
takes 22s while the first
#find (Lean.Syntax → Lean.Name)
takes 101s.
And it can be useful to combine the two, for example
#find_theorems (∑' _, _ = _) HMul.hMul HPow.hPow
finds all lemmas that give the value of a series where multiplication and power occurs.
I am inclined to also add a way to search for patterns _anywhere_ in the type (e.g. _ * _ ^ _
), not necessarily only rooted at the conclusion. But for that I need a nice idea for a syntax that distinguishes the two variants.
Joachim Breitner (Aug 03 2023 at 14:01):
I am now working on the variant where I am finding for patterns anywhere in the type of constants. This is my approach so far:
structure PreparedPattern extends AbstractMVarsResult where
headIndex : HeadIndex
def preparePat (t : Expr) : MetaM PreparedPattern := do
let amr <- withReducible $ do abstractMVars (← instantiateMVars t)
pure { toAbstractMVarsResult := amr, headIndex := t.toHeadIndex }
def matchPatAnywhere (pat : PreparedPattern) (c : ConstantInfo) : MetaM Bool := do
let found ← IO.mkRef false
withReducible do
let cTy := c.instantiateTypeLevelParams (← mkFreshLevelMVars c.numLevelParams)
Lean.Meta.forEachExpr' cTy fun sub_e => do
if pat.headIndex == sub_e.toHeadIndex then do
withNewMCtxDepth $ do
let pat := pat.expr.instantiateLevelParamsArray pat.paramNames
(← mkFreshLevelMVars pat.numMVars).toArray
let (_, _, pat) ← lambdaMetaTelescope pat
if ← isDefEq pat sub_e
then found.set true
-- keep searching if we haven't found it yet
not <$> found.get
found.get
It seems to work, but it is quite slow, for example with
#find_theorems (_ * (_ + _))
(and note that it only looks at lemmas that mention both HMul.hMul
and HAdd.hAdd
), and I run out of heartbeats sometimes.
It seems that there some definitions with very large types, e.g.
#check MonoidAlgebra.liftMagma.proof_5
which you very likely don’t want to #find
anyways. Can I recognize such definitions easily? I am already using
-- from Lean.Server.Completion
-- from Mathlib.Tactic.Find
-- (someone should put it somewhere public)
private def isBlackListed (declName : Name) : MetaM Bool := do
let env ← getEnv
pure $ declName.isInternal
|| isAuxRecursor env declName
|| isNoConfusion env declName
<||> isRec declName
<||> Meta.isMatcher declName
Adam Topaz (Aug 03 2023 at 14:11):
I think most will be caught by docs#Lean.Name.isInternal'
Adam Topaz (Aug 03 2023 at 14:12):
That at least catches the proof_37
names
Joachim Breitner (Aug 03 2023 at 14:13):
Thanks! I’m surprised this wasn’t already in the blacklist filter that I stole from #find
…
Adam Topaz (Aug 03 2023 at 14:14):
It’s part of docs#Lean.Name.isBlackListed
Joachim Breitner (Aug 03 2023 at 14:14):
Great, now
#find_theorems (_ * (_ + _))
takes 3s (with a warm cache) and finds plausible 161 theorems.
Joachim Breitner (Aug 03 2023 at 15:18):
I also added filtering by name substring (this helps with debuggging :-)), so
#find_theorems (_ * (_ + _)) "distr"
will print (instantaneously, with the additional name filter)
Found 2113 definitions mentioning HMul.hMul and HAdd.hAdd.
Of these, 31 have a name containing all of "distr".
Of these, 14 have mention your pattern(s).
• left_distrib
• CanonicallyOrderedCommSemiring.left_distrib
• Distrib.left_distrib
• IsLprojection.distrib_lattice_lemma
• LeftDistribClass.left_distrib
• Nat.left_distrib
• NonUnitalNonAssocRing.left_distrib
• NonUnitalNonAssocSemiring.left_distrib
• OreLocalization.left_distrib
• PGame.left_distrib_equiv
• PGame.quot_left_distrib
• FreeAlgebra.Rel.left_distrib
• LucasLehmer.X.left_distrib
• FreeAlgebra.Rel.below.left_distrib
and the lemma names will show their type when hovering.
Joachim Breitner (Aug 03 2023 at 16:33):
And because the matching of patterns against subexpressions uses defEq
and works with metavariables, it seems to support non-linear patterns just fine:
#find_theorems (Real.sqrt ?a * Real.sqrt ?a)
Found 8 definitions mentioning HMul.hMul and Real.sqrt.
Of these, 1 match your patterns.
• Real.mul_self_sqrt
Jireh Loreaux (Aug 03 2023 at 17:49):
This looks amazing. :heart:
Kevin Buzzard (Aug 03 2023 at 20:05):
Yeah that ability to do that kind of matching was exactly what was missing from Lean 3 #find
. I couldn't say List X -> List X
, I had to say List X -> List Y
and then get a bunch of spurious results.
Joachim Breitner (Aug 04 2023 at 08:21):
I think #find (List ?a → List ?a)
already works as intended with the present #find
implementation, but maybe it’s not documented like that. I’ll make sure to point it out in the docs for the new version.
Joachim Breitner (Aug 04 2023 at 10:08):
The new #find
command is ready for reviewing: https://github.com/leanprover-community/mathlib4/pull/6363
Last updated: Dec 20 2023 at 11:08 UTC