Zulip Chat Archive
Stream: mathlib4
Topic: instant library_search
Scott Morrison (Apr 12 2023 at 10:07):
!4#3404 provides a substantial speed-up to library_search, by caching the discrimination tree it builds (and indeed having CI do this work, so the cache is automatically obtained via lake exe cache get
). However it involved redesigning how we use Mathlib.lean
, so please read on!
Scott Morrison (Apr 12 2023 at 10:07):
From the PR description:
This works by creating an extra file in the build
directory containing a cache of the library_search
discrimination tree.
- If you modify a file, then open another file,
library_search
will not see the modified declarations. - However after a full
lake build
orlake exe cache get
, you will see the modifications. - In particular CI will automatically do the work of building this cache.
- This will increase the size of a
lake exe cache get
download. (by 26mb, compressed)
In order to achieve this, I needed to redesign how Mathlib.lean
works. In the new model, Mathlib.lean
just says:
import Mathlib.All
import Mathlib.Extras
and now Mathlib.All
imports (almost) everything (and this is still enforced by CI), while files in Mathlib/Extras/
are imported in Mathlib.Extras
. This enables a "post-processing" file such as Mathlib/Extras/LibrarySearchCache.lean
to itself import Mathlib.All
, and know that it has everything.
To accommodate the cache file, we create a new build/extra/
directory, that can store .olean
format cache files. Our cache
executable now manages files in this directory too.
Yaël Dillies (Apr 12 2023 at 10:10):
The diff is huge but only because git isn't detecting that you renamed Mathlib
to Mathlib.All
. But I guess you can't rename it in a preliminary PR?
Scott Morrison (Apr 12 2023 at 10:13):
Yes. I tried various things here to persuade github to show it as a rename, but failed. You can safely ignore the differences there; the long chunk is still automatically generated.
Scott Morrison (Apr 12 2023 at 10:14):
This PR will perpetually be in merge-conflict
state, because of what it does to Mathlib.lean
. I won't attempt to keep this up-to-date.
Reid Barton (Apr 12 2023 at 10:14):
The display is a github limitation (git shows the rename just fine), though we shouldn't hold our breath hoping for a fix...
Eric Wieser (Apr 12 2023 at 10:15):
Why not split Mathlib.lean
in a spin-off PR (along with the necessary CI scripts), if that bit keeps getting conflicts?
Ruben Van de Velde (Apr 12 2023 at 10:16):
Let's just review this one quickly :)
Eric Wieser (Apr 12 2023 at 10:16):
Git will show it as a rename if we rename to All
first, _then_ add the new Mathlib.lean
Eric Wieser (Apr 12 2023 at 10:16):
git shows the rename just fine
not once bors gets to it it won't
Ruben Van de Velde (Apr 12 2023 at 10:17):
The issue is that the git data model does not support renames; any rename detection happens after the fact by the tool showing you the diff
Eric Wieser (Apr 12 2023 at 10:17):
Yes, and that rename detection relies on having atomic commits
Eric Wieser (Apr 12 2023 at 10:17):
So if we create an atomic commit (aka PR) that does Mathlib.lean
-> mathlib/All.lean
, everything will behave
Eric Wieser (Apr 12 2023 at 10:18):
This isn't just about making the review easier, it should also prevent conflicts happening in any open PRs
Eric Wieser (Apr 12 2023 at 10:18):
Because git merge
will know that the rename happened
Scott Morrison (Apr 12 2023 at 10:19):
I'd prefer to simply get delegation, do the last merge conflict fix, and merge. I think everyone can ignore the spurious parts of the diff...
Eric Wieser (Apr 12 2023 at 10:20):
I'd strongly prefer we do the rename separately, it makes a handful of things easier and is not hard to do
Reid Barton (Apr 12 2023 at 10:20):
I was going to point out that moving Mathlib.lean like this will affect all other open PRs, and it seems that Eric is right that this could be avoided.
Eric Wieser (Apr 12 2023 at 10:21):
The key thing is to not create the new Mathlib.lean
in the same PR as the one that creates Mathlib/All.lean
Reid Barton (Apr 12 2023 at 10:22):
(or not in the same commit, which given our setup is equivalent)
Eric Wieser (Apr 12 2023 at 10:23):
I think that's misleading Reid; putting things in separate commits in the same PR doesn't help because bors squashes them all back into a single commit. I assume you know that, but I can see someone misreading your message.
Alex J. Best (Apr 12 2023 at 10:24):
If you tell bors "bors single on" it seems it will disable batching for that pr (https://bors.tech/documentation/) perhaps you could just do that for this
Eric Wieser (Apr 12 2023 at 10:24):
Batching is unrelated to squash merging
Alex J. Best (Apr 12 2023 at 10:24):
Eric Wieser said:
Batching is unrelated to squash merging
Urgh yes of course..
Scott Morrison (Apr 12 2023 at 10:25):
Something to note about this PR: it adds a 26mb download to every invocation of lake exe cache get
. I think this is worth it, but I think this is the main decision required to review this PR.
Ruben Van de Velde (Apr 12 2023 at 10:26):
Eric Wieser said:
The key thing is to not create the new
Mathlib.lean
in the same PR as the one that createsMathlib/All.lean
Oh, I misunderstood you. But if we land the commit without creating a new Mathlib.lean, does that leave us in a buildable state?
Reid Barton (Apr 12 2023 at 10:26):
I mean that it must be in its own commit in the eventual master branch.
Scott Morrison (Apr 12 2023 at 10:26):
But I can't just make a separate PR, because without Mathlib.lean
it won't build.
Eric Wieser (Apr 12 2023 at 10:26):
What consumes Mathlib.lean?
Scott Morrison (Apr 12 2023 at 10:26):
lake
Scott Morrison (Apr 12 2023 at 10:27):
Maybe there is a lake configuration setting we can temporarily tweak for an intermediate PR?
Scott Morrison (Apr 12 2023 at 10:27):
as far as I know it is just hardcoded to use the project name, however.
Eric Rodriguez (Apr 12 2023 at 10:28):
Scott Morrison said:
Something to note about this PR: it adds a 26mb download to every invocation of
lake exe cache get
. I think this is worth it, but I think this is the main decision required to review this PR.
could this be configurable in future? like maybe lake exe cache get
will usually get the full cache with the l_s
cache but you could have lake exe cache get minimal
not include this cache
Eric Rodriguez (Apr 12 2023 at 10:29):
this could be useful for leaf projects or such like
Floris van Doorn (Apr 12 2023 at 10:30):
Scott Morrison said:
But I can't just make a separate PR, because without
Mathlib.lean
it won't build.
We can theoretically have a non-building commit/PR in the master branch, if another commit/PR in the same branch fixes the build. That would require some very careful borsing though...
Ruben Van de Velde (Apr 12 2023 at 10:30):
Oh, if the file doesn't actually need to be called Mathlib.lean, we don't really need to rename the existing file
Reid Barton (Apr 12 2023 at 10:31):
I would assume the top level file in lake must be configurable
Ruben Van de Velde (Apr 12 2023 at 10:34):
But also, why does it matter for the rename detection to work?
Reid Barton (Apr 12 2023 at 10:35):
Otherwise every PR adding a file opened before the rename will show a merge conflict, and need to have its entry manually added to the new location instead of the old one
Reid Barton (Apr 12 2023 at 10:35):
It's not the end of the world, but seems worth avoiding
Eric Wieser (Apr 12 2023 at 10:39):
That would require some very careful borsing though...
Or we just push directly. I think avoiding conflicts with merges is a totally valid reason to skip bors one time to merge a rename
Reid Barton (Apr 12 2023 at 10:41):
It would still be nice if whatever we push to master actually builds. So I think it is worth trying to get lake to use a different main library file.
Reid Barton (Apr 12 2023 at 10:44):
In the branch there is
lean_lib Cache where
roots := #[`Cache]
so maybe we just need to add a similar roots
line to the lean_lib Mathlib where
stanza?
Sebastian Ullrich (Apr 12 2023 at 11:05):
Why does the pickling code have to be part of Mathlib
at all? If it is a separate library, like Cache
itself, there is no need to touch Mathlib.lean
and users have control over when it is rebuilt.
Scott Morrison (Apr 12 2023 at 11:18):
This is a good question, and maybe I have the wrong design here. I did think about this question, however, I decided that ease of use was important, and in particular I wanted to minimize the work users would have to do to get a cache. Since they are already compelled to use lake exe cache get
most of the time, I really wanted to piggyback on that.
Scott Morrison (Apr 12 2023 at 11:20):
That said, could we incorporate this functionality into cache
itself? Obviously cache
can't import Mathlib
, or we won't be able to build it. But whatever tool builds the cache needs to do exactly that.
Scott Morrison (Apr 12 2023 at 11:21):
I guess cache
could download a library_search
cache olean, and some entirely separate utility could be responsible for creating it, and we just run this in CI.
Scott Morrison (Apr 12 2023 at 11:23):
This would require cache
to be cleverer, and have custom logic for handling these extra files. At present the cache file just gets bundled along with the other build artifacts for Extras/LibrarySearchCache.lean
.
Sebastian Ullrich (Apr 12 2023 at 11:34):
I'm not sure I understand the problem, cache
doesn't care about how mathlib4 is built and will just upload/download anything reachable from Mathlib.lean
, which we could easily change to "from Mathlib.lean
or Mathlib/Extras/LibrarySearchCache.lean
". And even make that conditional to implement @Eric Rodriguez's request.
Kyle Miller (Apr 12 2023 at 11:34):
I'm not sure how these things work exactly, but could MathlibExtras be a separate lean_lib that itself depends on Mathlib, all using the same lakefile?
Sebastian Ullrich (Apr 12 2023 at 11:36):
The only restriction I can see is that it has to be a file in Mathlib/
, otherwise it will be extracted to the wrong location when using mahtlib4 as a dependency (but that could be adjusted of course): https://github.com/leanprover-community/mathlib4/blob/3b6fdff16311f67b01a2a5abc218d7b96adecb2f/Cache/IO.lean#L185-L188
Jason Rute (Apr 12 2023 at 11:37):
On a related note, many AI tools like the recent suggest_premises
in Lean (@Bartosz Piotrowski, etc) and MagnusHammer for Isabelle (#Machine Learning for Theorem Proving > New paper: Magnushammer ) associate with each theorem (or each declaration) a vector embedding. If there was a common protocol by which projects could associate extra metadata to declarations to speed up retrieval not just for library search but other future AI assisted search tools, I could see a lot of advantage there, especially if it was easy to use.
Jason Rute (Apr 12 2023 at 11:39):
I think suggest_premises
might be fast enough that it doesn’t need to be cached (although I think it would speed it up), but the data for a MagnusHammer-style tactic would definitely need to be cached.
Scott Morrison (Apr 12 2023 at 11:45):
@Sebastian Ullrich, so under this suggestion, we would modify the CI for Mathlib.lean
to check that it imports everything except some subset (e.g. the Mathlib/Extras/
directory in the current PR?
Sebastian Ullrich (Apr 12 2023 at 11:46):
Either that or extend Cache's isPathFromMathlib
function, yes. The latter might be a bit cleaner.
Eric Wieser (Apr 12 2023 at 11:51):
Is there any mechanism here for library_search
to operate (efficiently) on downstream projects too?
Scott Morrison (Apr 12 2023 at 11:51):
I don't think that just modifying Cache is going to cut it. The LibrarySearchCache.lean
needs to be able to import everything (except itself), so it can't be imported by Mathlib.lean
Scott Morrison (Apr 12 2023 at 11:51):
Ugh, this is sounding miserable enough to get working in mathlib, let's not think about downstream projects yet. :-(
Sebastian Ullrich (Apr 12 2023 at 11:55):
So to be clear, I meant extending isPathFromMathlib
so that LibrarySearchCache.lean
would not have to be placed in Mathlib/
but in its own folder and Lake library that depends on Mathlib
.
Scott Morrison (Apr 12 2023 at 11:58):
I guess I don't understand why modifying isPathFromMathlib
is relevant. It is behind a true || isPathFromMathlib ...
as far as I'm concerned. :-)
Sebastian Ullrich (Apr 12 2023 at 11:58):
Yes, we can ignore that for the first version :)
Scott Morrison (Apr 12 2023 at 12:01):
Okay, sometime in the next few days I will have another go at this PR, just making a new lean_lib
entry point.
Sebastian Ullrich (Apr 12 2023 at 12:03):
Scott Morrison said:
Something to note about this PR: it adds a 26mb download to every invocation of
lake exe cache get
. I think this is worth it, but I think this is the main decision required to review this PR.
Might be interesting to throw https://github.com/gebner/oleanparser/pull/4 at it btw. Just from that number I can't tell if it is a reasonable amount or not.
Scott Morrison (Apr 19 2023 at 01:32):
Okay, I have revamped this PR! It no longer requires changing how Mathlib.lean
works, happily. It seems to result in correctly uploading and downloading the cache file.
Johan Commelin (Apr 19 2023 at 07:22):
I would love to get !4#3404 in my mathlib! This looks like a marvellous improvement to library_search
. Could some of the meta reviewers please take a look at it?
Mario Carneiro (Apr 20 2023 at 06:35):
Left some comments. In particular there is some not great handling of unsafe functions: reminder that unsafe
is not just a syntactic burden that should be hidden away by using TermUnsafe
at the first opportunity!
Scott Morrison (Apr 20 2023 at 06:54):
@Mario Carneiro, I'm actually not sure how to use withUnpickle
successfully here. I want to hold onto the value from unpickle
forever: I'm handing it to initialize
.
Mario Carneiro (Apr 20 2023 at 06:54):
in that case you need to hold on to the CompactedRegion
forever too
Mario Carneiro (Apr 20 2023 at 06:54):
put it in a struct in the type of the initialize
Scott Morrison (Apr 20 2023 at 06:54):
Okay!
Mario Carneiro (Apr 20 2023 at 06:55):
(you won't be able to use withUnpickle
in that case)
Scott Morrison (Apr 20 2023 at 07:07):
Fixed, thanks. I'll do another round of CI to make sure it is still putting the file in a findable place.
Sebastian Ullrich (Apr 20 2023 at 08:35):
Mario Carneiro said:
in that case you need to hold on to the
CompactedRegion
forever too
This is not actually true, see docs4#Lean.CompactedRegion.free. readModuleData
could not be safe otherwise.
Mario Carneiro (Apr 20 2023 at 09:26):
I'm not sure how that contradicts my point. I am aware of CompactedRegion.free
, although the part I'm not sure about is whether CompactedRegion
automatically calls free
when the refcount goes to zero
Mario Carneiro (Apr 20 2023 at 09:27):
I am assuming that it does, which means that you need to have a live reference to a CompactedRegion
while you play with any lean objects which might possibly hold pointers to this region
Mario Carneiro (Apr 20 2023 at 09:29):
The alternative design would be that dropping a CompactedRegion
causes the region to be leaked, in which case the original code would be safe
Mario Carneiro (Apr 20 2023 at 09:31):
Ah, I see now def CompactedRegion := USize
implies that it can't have a destructor
Scott Morrison (Apr 24 2023 at 05:09):
Could someone take another look again? Hopefully the issues have been successfully dealt with now.
Johan Commelin (Apr 24 2023 at 05:53):
cc @Mario Carneiro @Gabriel Ebner
Patrick Massot (Apr 27 2023 at 15:00):
Is th warning
Warning: ./MathlibExtras.lean not found. Skipping all files that depend on it
related to the new library_search cache? This is in a project depending on mathlib.
Scott Morrison (Apr 27 2023 at 22:34):
Yes, the library search cache was not tested for downstream projects. I can take a look.
Patrick Massot (Apr 27 2023 at 22:34):
Thanks!
Scott Morrison (Apr 28 2023 at 23:36):
Fixed and tested in !4#3721.
Patrick Massot (Apr 28 2023 at 23:41):
Thanks for those fixes Scott!
Last updated: Dec 20 2023 at 11:08 UTC