Zulip Chat Archive

Stream: mathlib4

Topic: instant library_search


Scott Morrison (Apr 12 2023 at 10:07):

!4#3404 provides a substantial speed-up to library_search, by caching the discrimination tree it builds (and indeed having CI do this work, so the cache is automatically obtained via lake exe cache get). However it involved redesigning how we use Mathlib.lean, so please read on!

Scott Morrison (Apr 12 2023 at 10:07):

From the PR description:


This works by creating an extra file in the build directory containing a cache of the library_search discrimination tree.

  • If you modify a file, then open another file, library_search will not see the modified declarations.
  • However after a full lake build or lake exe cache get, you will see the modifications.
  • In particular CI will automatically do the work of building this cache.
  • This will increase the size of a lake exe cache get download. (by 26mb, compressed)

In order to achieve this, I needed to redesign how Mathlib.lean works. In the new model, Mathlib.lean just says:

import Mathlib.All
import Mathlib.Extras

and now Mathlib.All imports (almost) everything (and this is still enforced by CI), while files in Mathlib/Extras/ are imported in Mathlib.Extras. This enables a "post-processing" file such as Mathlib/Extras/LibrarySearchCache.lean to itself import Mathlib.All, and know that it has everything.

To accommodate the cache file, we create a new build/extra/ directory, that can store .olean format cache files. Our cache executable now manages files in this directory too.

Yaël Dillies (Apr 12 2023 at 10:10):

The diff is huge but only because git isn't detecting that you renamed Mathlib to Mathlib.All. But I guess you can't rename it in a preliminary PR?

Scott Morrison (Apr 12 2023 at 10:13):

Yes. I tried various things here to persuade github to show it as a rename, but failed. You can safely ignore the differences there; the long chunk is still automatically generated.

Scott Morrison (Apr 12 2023 at 10:14):

This PR will perpetually be in merge-conflict state, because of what it does to Mathlib.lean. I won't attempt to keep this up-to-date.

Reid Barton (Apr 12 2023 at 10:14):

The display is a github limitation (git shows the rename just fine), though we shouldn't hold our breath hoping for a fix...

Eric Wieser (Apr 12 2023 at 10:15):

Why not split Mathlib.lean in a spin-off PR (along with the necessary CI scripts), if that bit keeps getting conflicts?

Ruben Van de Velde (Apr 12 2023 at 10:16):

Let's just review this one quickly :)

Eric Wieser (Apr 12 2023 at 10:16):

Git will show it as a rename if we rename to All first, _then_ add the new Mathlib.lean

Eric Wieser (Apr 12 2023 at 10:16):

git shows the rename just fine

not once bors gets to it it won't

Ruben Van de Velde (Apr 12 2023 at 10:17):

The issue is that the git data model does not support renames; any rename detection happens after the fact by the tool showing you the diff

Eric Wieser (Apr 12 2023 at 10:17):

Yes, and that rename detection relies on having atomic commits

Eric Wieser (Apr 12 2023 at 10:17):

So if we create an atomic commit (aka PR) that does Mathlib.lean -> mathlib/All.lean, everything will behave

Eric Wieser (Apr 12 2023 at 10:18):

This isn't just about making the review easier, it should also prevent conflicts happening in any open PRs

Eric Wieser (Apr 12 2023 at 10:18):

Because git merge will know that the rename happened

Scott Morrison (Apr 12 2023 at 10:19):

I'd prefer to simply get delegation, do the last merge conflict fix, and merge. I think everyone can ignore the spurious parts of the diff...

Eric Wieser (Apr 12 2023 at 10:20):

I'd strongly prefer we do the rename separately, it makes a handful of things easier and is not hard to do

Reid Barton (Apr 12 2023 at 10:20):

I was going to point out that moving Mathlib.lean like this will affect all other open PRs, and it seems that Eric is right that this could be avoided.

Eric Wieser (Apr 12 2023 at 10:21):

The key thing is to not create the new Mathlib.lean in the same PR as the one that creates Mathlib/All.lean

Reid Barton (Apr 12 2023 at 10:22):

(or not in the same commit, which given our setup is equivalent)

Eric Wieser (Apr 12 2023 at 10:23):

I think that's misleading Reid; putting things in separate commits in the same PR doesn't help because bors squashes them all back into a single commit. I assume you know that, but I can see someone misreading your message.

Alex J. Best (Apr 12 2023 at 10:24):

If you tell bors "bors single on" it seems it will disable batching for that pr (https://bors.tech/documentation/) perhaps you could just do that for this

Eric Wieser (Apr 12 2023 at 10:24):

Batching is unrelated to squash merging

Alex J. Best (Apr 12 2023 at 10:24):

Eric Wieser said:

Batching is unrelated to squash merging

Urgh yes of course..

Scott Morrison (Apr 12 2023 at 10:25):

Something to note about this PR: it adds a 26mb download to every invocation of lake exe cache get. I think this is worth it, but I think this is the main decision required to review this PR.

Ruben Van de Velde (Apr 12 2023 at 10:26):

Eric Wieser said:

The key thing is to not create the new Mathlib.lean in the same PR as the one that creates Mathlib/All.lean

Oh, I misunderstood you. But if we land the commit without creating a new Mathlib.lean, does that leave us in a buildable state?

Reid Barton (Apr 12 2023 at 10:26):

I mean that it must be in its own commit in the eventual master branch.

Scott Morrison (Apr 12 2023 at 10:26):

But I can't just make a separate PR, because without Mathlib.lean it won't build.

Eric Wieser (Apr 12 2023 at 10:26):

What consumes Mathlib.lean?

Scott Morrison (Apr 12 2023 at 10:26):

lake

Scott Morrison (Apr 12 2023 at 10:27):

Maybe there is a lake configuration setting we can temporarily tweak for an intermediate PR?

Scott Morrison (Apr 12 2023 at 10:27):

as far as I know it is just hardcoded to use the project name, however.

Eric Rodriguez (Apr 12 2023 at 10:28):

Scott Morrison said:

Something to note about this PR: it adds a 26mb download to every invocation of lake exe cache get. I think this is worth it, but I think this is the main decision required to review this PR.

could this be configurable in future? like maybe lake exe cache get will usually get the full cache with the l_s cache but you could have lake exe cache get minimal not include this cache

Eric Rodriguez (Apr 12 2023 at 10:29):

this could be useful for leaf projects or such like

Floris van Doorn (Apr 12 2023 at 10:30):

Scott Morrison said:

But I can't just make a separate PR, because without Mathlib.lean it won't build.

We can theoretically have a non-building commit/PR in the master branch, if another commit/PR in the same branch fixes the build. That would require some very careful borsing though...

Ruben Van de Velde (Apr 12 2023 at 10:30):

Oh, if the file doesn't actually need to be called Mathlib.lean, we don't really need to rename the existing file

Reid Barton (Apr 12 2023 at 10:31):

I would assume the top level file in lake must be configurable

Ruben Van de Velde (Apr 12 2023 at 10:34):

But also, why does it matter for the rename detection to work?

Reid Barton (Apr 12 2023 at 10:35):

Otherwise every PR adding a file opened before the rename will show a merge conflict, and need to have its entry manually added to the new location instead of the old one

Reid Barton (Apr 12 2023 at 10:35):

It's not the end of the world, but seems worth avoiding

Eric Wieser (Apr 12 2023 at 10:39):

That would require some very careful borsing though...

Or we just push directly. I think avoiding conflicts with merges is a totally valid reason to skip bors one time to merge a rename

Reid Barton (Apr 12 2023 at 10:41):

It would still be nice if whatever we push to master actually builds. So I think it is worth trying to get lake to use a different main library file.

Reid Barton (Apr 12 2023 at 10:44):

In the branch there is

lean_lib Cache where
  roots := #[`Cache]

so maybe we just need to add a similar roots line to the lean_lib Mathlib where stanza?

Sebastian Ullrich (Apr 12 2023 at 11:05):

Why does the pickling code have to be part of Mathlib at all? If it is a separate library, like Cache itself, there is no need to touch Mathlib.lean and users have control over when it is rebuilt.

Scott Morrison (Apr 12 2023 at 11:18):

This is a good question, and maybe I have the wrong design here. I did think about this question, however, I decided that ease of use was important, and in particular I wanted to minimize the work users would have to do to get a cache. Since they are already compelled to use lake exe cache get most of the time, I really wanted to piggyback on that.

Scott Morrison (Apr 12 2023 at 11:20):

That said, could we incorporate this functionality into cache itself? Obviously cache can't import Mathlib, or we won't be able to build it. But whatever tool builds the cache needs to do exactly that.

Scott Morrison (Apr 12 2023 at 11:21):

I guess cache could download a library_search cache olean, and some entirely separate utility could be responsible for creating it, and we just run this in CI.

Scott Morrison (Apr 12 2023 at 11:23):

This would require cache to be cleverer, and have custom logic for handling these extra files. At present the cache file just gets bundled along with the other build artifacts for Extras/LibrarySearchCache.lean.

Sebastian Ullrich (Apr 12 2023 at 11:34):

I'm not sure I understand the problem, cache doesn't care about how mathlib4 is built and will just upload/download anything reachable from Mathlib.lean, which we could easily change to "from Mathlib.lean or Mathlib/Extras/LibrarySearchCache.lean". And even make that conditional to implement @Eric Rodriguez's request.

Kyle Miller (Apr 12 2023 at 11:34):

I'm not sure how these things work exactly, but could MathlibExtras be a separate lean_lib that itself depends on Mathlib, all using the same lakefile?

Sebastian Ullrich (Apr 12 2023 at 11:36):

The only restriction I can see is that it has to be a file in Mathlib/, otherwise it will be extracted to the wrong location when using mahtlib4 as a dependency (but that could be adjusted of course): https://github.com/leanprover-community/mathlib4/blob/3b6fdff16311f67b01a2a5abc218d7b96adecb2f/Cache/IO.lean#L185-L188

Jason Rute (Apr 12 2023 at 11:37):

On a related note, many AI tools like the recent suggest_premises in Lean (@Bartosz Piotrowski, etc) and MagnusHammer for Isabelle (#Machine Learning for Theorem Proving > New paper: Magnushammer ) associate with each theorem (or each declaration) a vector embedding. If there was a common protocol by which projects could associate extra metadata to declarations to speed up retrieval not just for library search but other future AI assisted search tools, I could see a lot of advantage there, especially if it was easy to use.

Jason Rute (Apr 12 2023 at 11:39):

I think suggest_premises might be fast enough that it doesn’t need to be cached (although I think it would speed it up), but the data for a MagnusHammer-style tactic would definitely need to be cached.

Scott Morrison (Apr 12 2023 at 11:45):

@Sebastian Ullrich, so under this suggestion, we would modify the CI for Mathlib.lean to check that it imports everything except some subset (e.g. the Mathlib/Extras/ directory in the current PR?

Sebastian Ullrich (Apr 12 2023 at 11:46):

Either that or extend Cache's isPathFromMathlib function, yes. The latter might be a bit cleaner.

Eric Wieser (Apr 12 2023 at 11:51):

Is there any mechanism here for library_search to operate (efficiently) on downstream projects too?

Scott Morrison (Apr 12 2023 at 11:51):

I don't think that just modifying Cache is going to cut it. The LibrarySearchCache.lean needs to be able to import everything (except itself), so it can't be imported by Mathlib.lean

Scott Morrison (Apr 12 2023 at 11:51):

Ugh, this is sounding miserable enough to get working in mathlib, let's not think about downstream projects yet. :-(

Sebastian Ullrich (Apr 12 2023 at 11:55):

So to be clear, I meant extending isPathFromMathlib so that LibrarySearchCache.lean would not have to be placed in Mathlib/ but in its own folder and Lake library that depends on Mathlib.

Scott Morrison (Apr 12 2023 at 11:58):

I guess I don't understand why modifying isPathFromMathlib is relevant. It is behind a true || isPathFromMathlib ... as far as I'm concerned. :-)

Sebastian Ullrich (Apr 12 2023 at 11:58):

Yes, we can ignore that for the first version :)

Scott Morrison (Apr 12 2023 at 12:01):

Okay, sometime in the next few days I will have another go at this PR, just making a new lean_lib entry point.

Sebastian Ullrich (Apr 12 2023 at 12:03):

Scott Morrison said:

Something to note about this PR: it adds a 26mb download to every invocation of lake exe cache get. I think this is worth it, but I think this is the main decision required to review this PR.

Might be interesting to throw https://github.com/gebner/oleanparser/pull/4 at it btw. Just from that number I can't tell if it is a reasonable amount or not.

Scott Morrison (Apr 19 2023 at 01:32):

Okay, I have revamped this PR! It no longer requires changing how Mathlib.lean works, happily. It seems to result in correctly uploading and downloading the cache file.

Johan Commelin (Apr 19 2023 at 07:22):

I would love to get !4#3404 in my mathlib! This looks like a marvellous improvement to library_search. Could some of the meta reviewers please take a look at it?

Mario Carneiro (Apr 20 2023 at 06:35):

Left some comments. In particular there is some not great handling of unsafe functions: reminder that unsafe is not just a syntactic burden that should be hidden away by using TermUnsafe at the first opportunity!

Scott Morrison (Apr 20 2023 at 06:54):

@Mario Carneiro, I'm actually not sure how to use withUnpickle successfully here. I want to hold onto the value from unpickle forever: I'm handing it to initialize.

Mario Carneiro (Apr 20 2023 at 06:54):

in that case you need to hold on to the CompactedRegion forever too

Mario Carneiro (Apr 20 2023 at 06:54):

put it in a struct in the type of the initialize

Scott Morrison (Apr 20 2023 at 06:54):

Okay!

Mario Carneiro (Apr 20 2023 at 06:55):

(you won't be able to use withUnpickle in that case)

Scott Morrison (Apr 20 2023 at 07:07):

Fixed, thanks. I'll do another round of CI to make sure it is still putting the file in a findable place.

Sebastian Ullrich (Apr 20 2023 at 08:35):

Mario Carneiro said:

in that case you need to hold on to the CompactedRegion forever too

This is not actually true, see docs4#Lean.CompactedRegion.free. readModuleData could not be safe otherwise.

Mario Carneiro (Apr 20 2023 at 09:26):

I'm not sure how that contradicts my point. I am aware of CompactedRegion.free, although the part I'm not sure about is whether CompactedRegion automatically calls free when the refcount goes to zero

Mario Carneiro (Apr 20 2023 at 09:27):

I am assuming that it does, which means that you need to have a live reference to a CompactedRegion while you play with any lean objects which might possibly hold pointers to this region

Mario Carneiro (Apr 20 2023 at 09:29):

The alternative design would be that dropping a CompactedRegion causes the region to be leaked, in which case the original code would be safe

Mario Carneiro (Apr 20 2023 at 09:31):

Ah, I see now def CompactedRegion := USize implies that it can't have a destructor

Scott Morrison (Apr 24 2023 at 05:09):

Could someone take another look again? Hopefully the issues have been successfully dealt with now.

Johan Commelin (Apr 24 2023 at 05:53):

cc @Mario Carneiro @Gabriel Ebner

Patrick Massot (Apr 27 2023 at 15:00):

Is th warning

Warning: ./MathlibExtras.lean not found. Skipping all files that depend on it

related to the new library_search cache? This is in a project depending on mathlib.

Scott Morrison (Apr 27 2023 at 22:34):

Yes, the library search cache was not tested for downstream projects. I can take a look.

Patrick Massot (Apr 27 2023 at 22:34):

Thanks!

Scott Morrison (Apr 28 2023 at 23:36):

Fixed and tested in !4#3721.

Patrick Massot (Apr 28 2023 at 23:41):

Thanks for those fixes Scott!


Last updated: Dec 20 2023 at 11:08 UTC