Zulip Chat Archive

Stream: lean4

Topic: Size issue


Patrick Massot (Jan 04 2025 at 10:26):

Just checking that I didn’t miss anything before my new course starts: was there any progress on the issue of the size of Lean4 + Mathlib? Is there any way to distribute smaller oleans? Any way to automatically distribute only the part of Mathlib that a given lib depends on? My current plan is to use a Mathlib fork where I will manually delete most of the library, but any tool automating this would be welcome.

Eric Wieser (Jan 04 2025 at 11:59):

You can distribute smaller oleans with lake exe cache get Mathlib.Data.Matrix.Basic or similar

Eric Wieser (Jan 04 2025 at 12:00):

I don't know if that works for downstream libraries or if you have to name a mathlib module (or modules?)

Patrick Massot (Jan 07 2025 at 10:28):

I tried using lake exe cache get Mathlib.Tactic. lake says Warning: .lake/packages/mathlib/Mathlib.Tactic not found. Skipping all files that depend on it but it still downloads only 2.7Gb instead of 5.1Gb.

Patrick Massot (Jan 07 2025 at 10:29):

I’m still not 100% sure I’ll be able to teach using Lean 4 this year, because of this issue. I need to talk to the IT people here.

Patrick Massot (Jan 07 2025 at 10:30):

Right now I’ll try to see if I can find a combination of lighter tactic imports. I’m really surprised that import Mathlib.Tactic imports more than half of Mathlib (if this is indeed what cache is telling me).

Ruben Van de Velde (Jan 07 2025 at 11:47):

And Tactic.Basic?

Damiano Testa (Jan 07 2025 at 12:01):

import Mathlib.Tactic

/-
3589 transitive imports

2291 starting with "Mathlib":
[Mathlib.Algebra.AddTorsor,
 Mathlib.Algebra.Algebra.Basic,
 Mathlib.Algebra.Algebra.Bilinear,
 and so on...
-/
#trans_imports "Mathlib"

while with basic

import Mathlib.Tactic.Basic
import Mathlib.Util.TransImports

/-
1102 transitive imports

18 starting with "Mathlib":
[Mathlib.Init,
 Mathlib.Tactic.Basic,
 Mathlib.Tactic.DeclarationNames,
 Mathlib.Tactic.ExtendDoc,
 Mathlib.Tactic.Lemma,
 Mathlib.Tactic.Linter.DocPrime,
 Mathlib.Tactic.Linter.GlobalAttributeIn,
 Mathlib.Tactic.Linter.HashCommandLinter,
 Mathlib.Tactic.Linter.Header,
 Mathlib.Tactic.Linter.Lint,
 Mathlib.Tactic.Linter.Multigoal,
 Mathlib.Tactic.Linter.OldObtain,
 Mathlib.Tactic.Linter.RefineLinter,
 Mathlib.Tactic.Linter.Style,
 Mathlib.Tactic.Linter.UnusedTactic,
 Mathlib.Tactic.PPWithUniv,
 Mathlib.Tactic.TypeStar,
 Mathlib.Util.TransImports]
-/
#trans_imports "Mathlib"

Patrick Massot (Jan 07 2025 at 12:08):

I need linarith of course. But I manage to get down to 1.5Gb. I’ll continue trying to work on it.

Eric Wieser (Jan 07 2025 at 12:09):

How big is too big for you?

Damiano Testa (Jan 07 2025 at 12:09):

linarith implies 449 mathlib files.

Patrick Massot (Jan 07 2025 at 12:09):

Of course this 1.5Gb does not include the Lean toolchain which is already more than 1Gb.

Patrick Massot (Jan 07 2025 at 12:10):

In the Lean days I was shipping less than 150Mb.

Patrick Massot (Jan 07 2025 at 12:10):

including lean itself.

Eric Wieser (Jan 07 2025 at 12:14):

That was surely the compressed size, right?

Eric Wieser (Jan 07 2025 at 12:15):

I guess my question is whether the problem is shipped size or installed size

Patrick Massot (Jan 07 2025 at 12:33):

Ah yes, you’re right, that was the compressed size. Let me try to uncompress it.

Patrick Massot (Jan 07 2025 at 12:35):

The uncompressed size was 410Mb, including VSCode.

Eric Wieser (Jan 07 2025 at 12:41):

If you delete all the .c files from the uncompressed cache in Lean 4, how big is it?

Eric Wieser (Jan 07 2025 at 12:41):

(this isn't something that lake will actually let you do right now, but is conceivably something it could permit in future)

Patrick Massot (Jan 07 2025 at 12:43):

I can’t test right now because I’m trying to tweak imports and rebuild things.

Patrick Massot (Jan 07 2025 at 12:46):

One issue I have is I import Mathlib.Topology.Instances.Real which has 2030 transitive imports (1040 in Mathlib).

Eric Wieser (Jan 07 2025 at 12:47):

Looks like the c files are less than 10%:

gitpod /workspace/mathlib4/.lake/build (master) $ du -h -d2
3.2M    ./lib/Cache
4.0G    ./lib/Mathlib
4.0G    ./lib
1.8M    ./ir/Cache
372M    ./ir/Mathlib
376M    ./ir
104M    ./bin
4.5G    .

Patrick Massot (Jan 07 2025 at 12:48):

I’ll try to live with import Mathlib.Data.Real.Basic and sorries.

Eric Wieser (Jan 07 2025 at 12:48):

Remember that the 1131 transitive imports in Lean/Init/Std are basically free (or at least, not something you can avoid distributing without a custom lean installation)

Patrick Massot (Jan 07 2025 at 13:25):

Seeing all the linter warnings from Mathlib while building a downstream lib without getting cache first is really painful.

Patrick Massot (Jan 07 2025 at 13:26):

And having

$ lake exe cache get  Mathlib.Tactic.Ring
uncaught exception: Unknown package directory for Mathlib.Tactic

is also painful.

Damiano Testa (Jan 07 2025 at 13:34):

Patrick Massot said:

Seeing all the linter warnings from Mathlib while building a downstream lib without getting cache first is really painful.

Are there more warnings than the docPrime linter?

Kevin Buzzard (Jan 07 2025 at 13:56):

The docPrime linter already produces many many warnings!

Riccardo Brasca (Jan 07 2025 at 15:35):

@Patrick Massot I'm having the same problem with my students. Have you tried gitpod (or codespace), I am thinking about just using it.

Patrick Massot (Jan 07 2025 at 17:50):

Damiano Testa said:

Patrick Massot said:

Seeing all the linter warnings from Mathlib while building a downstream lib without getting cache first is really painful.

Are there more warnings than the docPrime linter?

It would be hard to notice another warning among the tsunami of docPrime warning so I cannot answer for sure.

Patrick Massot (Jan 07 2025 at 17:54):

Riccardo Brasca said:

Patrick Massot I'm having the same problem with my students. Have you tried gitpod (or codespace), I am thinking about just using it.

I tried gitpod or codespace for a GlimpseOfLean afternoon a couple of months ago and it was a disaster so I really want this to be only a backup solution.

Jireh Loreaux (Jan 07 2025 at 17:58):

How many students do you have?

Jireh Loreaux (Jan 07 2025 at 17:59):

(I ask because a slightly insane solution is just to give them all flash drives with the required space available; if there aren't too many this wouldn't be terribly expensive.)

Jireh Loreaux (Jan 07 2025 at 18:00):

I think the painful part would be changing the cache directory.

Patrick Massot (Jan 07 2025 at 18:03):

I will have about 60 students.

Patrick Massot (Jan 07 2025 at 18:04):

So flash drives are not great.

Rob Lewis (Jan 07 2025 at 20:27):

Riccardo Brasca said:

Patrick Massot I'm having the same problem with my students. Have you tried gitpod (or codespace), I am thinking about just using it.

I was just writing about this in a private thread, so let me move it here too: I've taught with Lean on Codespaces with little trouble, except that you can't follow the standard GitHub Classrooms assignment model of having one repository per homework assignment. Even with GitHub Pro, there's a monthly limit on total codespace size that you'll hit if students have more than a couple active codepaces with mathlib caches. You'd need to tell students to be very aggressive about deleting old codespaces, and leave time between new assignments, and hope that students aren't using Codespaces for anything else.

Rob Lewis (Jan 07 2025 at 20:29):

That said, if you follow the model where you have one course repo and each student forks/clones that one time, Codespaces works great.

Kim Morrison (Jan 08 2025 at 01:45):

Patrick Massot said:

And having

$ lake exe cache get  Mathlib.Tactic.Ring
uncaught exception: Unknown package directory for Mathlib.Tactic

is also painful.

Can someone explain to me what @Patrick Massot means here?

lake exe cache get Mathlib/Tactic/Ring.lean

seems to work for me just fine and is what is documented.

Is the message a request for the module name to work as well?

Kim Morrison (Jan 08 2025 at 01:57):

Here's a useful but hacky script:

find . -path ./.lake -prune -o -name "*.lean" -exec grep -h "^import Mathlib\." {} \; | sed 's/import \(Mathlib\.[^[:space:]]*\)/\1/' | sed 's/ .*$//' | tr '.' '/' | sed 's/^//' | sed 's/$/\.lean/' | tr '\n' ' ' | xargs lake exe cache get

This looks for all .lean files in subdirectories (but excluding .lake), and find all import Mathlib.X.Y.Z lines, and then runs lake exe cache get Mathlib/X/Y/Z.lean for all of these.

Kim Morrison (Jan 08 2025 at 01:58):

i.e. it only downloads the subset of Mathlib your current project needs.

Kim Morrison (Jan 08 2025 at 01:58):

Any suggestions as to how we can make this more robust and/or deploy it in an easy-to-use way?

Kim Morrison (Jan 08 2025 at 02:03):

I guess we could just expose it as lake exe cache get^ or something?

Daniel Weber (Jan 08 2025 at 07:01):

If we get proofAsSorry to work with Mathlib perhaps the cache could use that, which could potentially make it much smaller

Patrick Massot (Jan 08 2025 at 08:42):

Thanks a lot Kim. I was misled by #lean4 > Size issue @ 💬 . I agree it would be great to have a robust easy thing to avoid your hack.

Patrick Massot (Jan 08 2025 at 09:14):

(deleted)

Patrick Massot (Jan 08 2025 at 09:15):

(deleted)

Michael Rothgang (Jan 08 2025 at 13:21):

Kim Morrison said:

I guess we could just expose it as lake exe cache get^ or something?

#20567 implements this as miniget; comments (e.g. naming suggestions) welcome

Eric Wieser (Jan 08 2025 at 13:44):

Kim Morrison said:

Is the message a request for the module name to work as well?

Or just for a better error message to indicate that it expects a filename

Kim Morrison (Jan 08 2025 at 22:45):

#20589

Kim Morrison (Jan 08 2025 at 22:46):

It seems everyone, including CI, has been confused about the arguments of lake exe cache get.

Kim Morrison (Jan 08 2025 at 22:47):

If anyone would like to modify this further, so that if cache can't find the path it tries again re-interpreting the argument as a module name, that would be great.

Damiano Testa (Jan 08 2025 at 22:55):

Oh, that would explain why this CI step was always causing problems and has been enabled and disabled a few times... :man_facepalming:

Riccardo Brasca (Jan 13 2025 at 10:39):

Speaking of codespace, is it working for you? If I try mathlib's one it is stuck at the "setting up remote connection" and then it says "oh no it looks you are offline".

Riccardo Brasca (Jan 13 2025 at 10:53):

it now works, never mind

Jz Pan (Jan 13 2025 at 11:14):

I managed to get the compressed Lean4 Windows packages slightly less than 1.0GB. When decompressed (without unpackaging Mathlib caches) it is 3GB. After unpackaging Mathlib caches, it's 8GB.

Jz Pan (Jan 13 2025 at 11:15):

I used a modified version of scripts here #general > trylean bundle for lean4 @ 💬 . The key point is don't let lake update unpacks Mathlib cache. This could be done via an environment variable (found in build script of mathlib).


Last updated: May 02 2025 at 03:31 UTC