Zulip Chat Archive

Stream: lean4

Topic: Centrally Storing and Referencing Mathlib


Shreyas Srinivas (Aug 16 2023 at 12:28):

This is a topic that came up when the port was in full swing and cache was still new. Mathlib + deps consumes 3.9 GB of space. The result is that disk space fills up very quickly when one is working with many lean projects. At least for me, working with the latest version of Mathlib either doesn't break things or breaks things in a way that are best fixed w.r.t the latest Mathlib. So, even with many projects, I am usually just storing dozens of copies of the same version of Mathlib + deps. The question is, can we have common dependencies like Mathlib stored centrally on the disk, along with the toolchains for example, and linked to projects?

The last time I raised this question, the following objections came up:

  1. Symlinks not working properly on windows.
  2. Mathlib was changing so fast due to the port that it made sense to download a fresh version for each project.
  3. Dev time.

I was wondering if, now that the port is complete, this question might be worth revisiting.

Eric Wieser (Aug 16 2023 at 12:30):

2 is true despite the port being over, but at least if all the lean projects are your own it's reasonable to keep them on matching mathlib versions

Shreyas Srinivas (Aug 16 2023 at 12:33):

True, but even then, if the dependencies are stored centrally and users just have to choose the version of mathlib when they start a project (along with a list of installed versions), users might decide to go with a version they already have.

Shreyas Srinivas (Aug 16 2023 at 12:43):

Back during the port, one might miss nice basic theorems or tactics that were ported in the latest version. I am guessing that is not the case to the same extent these days.

Kevin Buzzard (Aug 16 2023 at 12:57):

Whenever I want to start a new project I definitely want today's mathlib; versions of mathlib I have on other projects are irrelevant

Shreyas Srinivas (Aug 16 2023 at 13:22):

@Kevin Buzzard : In my case, it usually doesn't matter whether I am using today's version or last week's. So, given the option, I would opt to save space, since currently ~30 GB out of my 180 (EDIT: typo) odd GB of free space on my linux partition consists of just copies of mathlib.

Siddhartha Gadgil (Aug 16 2023 at 13:34):

Would it work if one simply created symbolic links to a common place from within lake-packages? I assume that mathlib will not be downloaded again?

Siddhartha Gadgil (Aug 16 2023 at 13:37):

Maybe some linux expert can create a script to do this, with git to make sure versions match.

Shreyas Srinivas (Aug 16 2023 at 13:40):

It would certainly be a quick fix. A proper implementation in lake would be aware of the directories(global/user-level/project-level) that need to be looked into. It would use some default per OS config, or one which can be configured by a CLI like elan. It would compare versions, determine the need to download and cache when necessary, and link them.

Shreyas Srinivas (Aug 16 2023 at 14:08):

Semi-ideally, the user-facing changes would be the following: When you enter lake new <project_name> math, you will get a question about Mathlib versions, with a series of numbered options (1. Latest, 2. Version from <date> 3 Version from <date>. .....). The user will be prompted to enter one of these numbers. Everything else will happen internally (including caching). Cabal already does this for haskell. Users get a series of questions when they call cabal init.

More ideally, all this happens inside the vscode extension with a single lake command, but that is a discussion for another thread.

Jon Eugster (Aug 16 2023 at 16:45):

one thing you can do is having mathlib checked out and built anywhere on your system and then use a local path in your projects lakefile.

something like

require mathlib from
/ "path" / "to" / "package"

If they are all your own projects that would be an option. In one of our projects we even use an env-variable to decide whether to search the lake-package locally or from git, so that if anybody else downloads it, it would by default download it from git

Shreyas Srinivas (Aug 16 2023 at 18:46):

The hope with a good fraction of these projects is that they will eventually be shared with other people.

Thomas Murrills (Aug 16 2023 at 19:12):

As someone with low disk space, I would also appreciate the ability to use a local build of mathlib (without modifying the lakefile)! Just wanted to voice support for it.

Jon Eugster (Aug 16 2023 at 20:48):

Im just saying there is an option to implement this by saying in the projects lakefile "if $HOME/.lean/mathlib exists, use it, else require from github" but of course that imposes other problems like that you have no control about the mathlib version and in particular the lean-toolchain needs to stay in sync.

I think the fact that we still have no control over a project's toolchain would make it particularly hard for any mechanism working with a centralised mathlib:thinking:

Shreyas Srinivas (Aug 16 2023 at 21:08):

This should not be an issue if we have centralised mathlib. It won't be unlike having multiple lean toolchains. You would have different versions identified by commit IDs (or/and even better, date of the lean-toolchain). The difference from status quo would be that you would have the option of reusing the same mathlib version for multiple projects

Scott Morrison (Aug 16 2023 at 22:48):

Of course centrally storing multiple copies comes with its own danger: keeping around old copies that aren't being used anymore, or providing all the necessary tooling to somehow know when old copies can be deleted. You might end up with less disk space if it's not done well!

Shreyas Srinivas (Aug 16 2023 at 22:51):

Scott Morrison said:

Of course centrally storing multiple copies comes with its own danger: keeping around old copies that aren't being used anymore, or providing all the necessary tooling to somehow know when old copies can be deleted. You might end up with less disk space if it's not done well!

Keeping old versions can be an annoyance of course, but we already have this issue with lean toolchains, which we manage with elan. Further if a deleted toolchain is needed later, it gets installed anyway. Essentially this is a design problem with existing solutions.

Shreyas Srinivas (Aug 16 2023 at 23:11):

I am happy to volunteer time for this. I am not certain how long it will take since I will be contributing in my spare time which varies quite a bit these days.

Scott Morrison (Aug 16 2023 at 23:15):

You'd need to ask Mario and Mac about what would be helpful.

Mario Carneiro (Aug 16 2023 at 23:15):

I think the first thing is to make sure that lake is okay with a setup that involves moving mathlib out of the lake-packages directory, assuming for now that the maintenance of this global directory is done manually

Shreyas Srinivas (Aug 16 2023 at 23:41):

I need to get some precious sleep now, before responding to this

Mac Malone (Aug 17 2023 at 02:50):

@Mario Carneiro I think having a per-toolchain lake-packages would be a great idea! However, I personally haven't prioritize it for a few reasons:

  • Toolchains versions update frequently enough that the reuse is not as common as it would be for a normal package manager.
  • It my understanding that the kind of symlink this needs doesn't really exist on Windows, so it linking it into the directory is likely a no-go (but Lake could instead point Lean to the shared copy rather than the workspace's).
  • For the reasons above, there did not appear to be much demand for this rather complex feature.

Mario Carneiro (Aug 17 2023 at 02:52):

I don't think this needs a symlink at all?

Mario Carneiro (Aug 17 2023 at 02:52):

it just needs lake to (be directed to) look in the right place

Mario Carneiro (Aug 17 2023 at 02:53):

I already have an example of this kind of thing in mathport

Mac Malone (Aug 17 2023 at 02:53):

True, this does not need a symlink, but the alternative does have the disadvantage of not all remote packages being in lake-packages which may or may not be a concern (i.e., tools cannot do a directory walk to find repositories).

Mario Carneiro (Aug 17 2023 at 02:54):

the oneshot directory contains a lake project which shares the lake-packages folder with the main project, because having two copies of mathlib in the project would be both a lot of space and also an additional piece of state that can get out of sync

Mario Carneiro (Aug 17 2023 at 02:54):

however it means you have to be careful with lake update because it will make a mess

Mario Carneiro (Aug 17 2023 at 02:55):

I don't think we should be assuming that all dependencies are in ROOT/lake-packages, that's exactly the problem we're trying to fix

James Gallicchio (Aug 17 2023 at 02:57):

(and is already untrue, right? given that you can provide packages locally with lake)

Mario Carneiro (Aug 17 2023 at 02:58):

if you make that assumption you are already wrong with the current lake architecture, packages and libs have individual directory paths

Shreyas Srinivas (Aug 17 2023 at 12:14):

Mario Carneiro said:

the oneshot directory contains a lake project which shares the lake-packages folder with the main project, because having two copies of mathlib in the project would be both a lot of space and also an additional piece of state that can get out of sync

why would lake update make a mess?


Last updated: Dec 20 2023 at 11:08 UTC