Zulip Chat Archive

Stream: general

Topic: Lean's I/O usage while using Mathlib


Vadim Kantorov (Apr 23 2025 at 12:57):

I'm using a networked SSD on a cloud machine and import Mathlib takes 40 seconds for me (when I'm running 128 processes simultaneously on a big CPU machine, this timing goes up to 20 minutes). Is it expected?

How can I profile import Mathlib?

Is import Mathlib indeed I/O heavy?

(in strace logs I can see that it always does a dozen of stat-calls before finding the correct olean and reading a few kilobytes from it)

Is it possible to somehow place all of Mathlib into a zipball? (similar to Python zipimport / egg file) and make sure that Lean indexes its contents to avoid useless stat-calls?

Thank you!

Henrik Böving (Apr 23 2025 at 13:22):

How can I profile import Mathlib?

Probably most easily with a standard C/C++ profiler like samply or perf or intel vtune depending on what platform you're on and what tools you like

Is import Mathlib indeed I/O heavy?

It has to touch a couple of thousand files so certainly not cheap IO wise.

(in strace logs I can see that it always does a dozen of stat-calls before finding the correct olean and reading a few kilobytes from it)

Can you show one of those stat strides?

Is it possible to somehow place all of Mathlib into a zipball? (similar to Python zipimport / egg file) and make sure that Lean indexes its contents to avoid useless stat-calls?

not right now no

Sebastian Ullrich (Apr 23 2025 at 13:41):

"reading a few kbytes" is probably misleading because most reading is done invisibly through mmap - assuming your network FS supports it. I would highly doubt the stat calls are significant compared to the actual reading of data.

Vadim Kantorov (Apr 23 2025 at 13:46):

Looks as so:

24664 stat("././.lake/packages/batteries/.lake/build/lib/Lake", 0x7ffc739e8408) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/batteries/.lake/build/lib/Lake.olean", 0x7ffc739e8418) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/Qq/.lake/build/lib/Lake", 0x7ffc739e8408) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/Qq/.lake/build/lib/Lake.olean", 0x7ffc739e8418) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/aesop/.lake/build/lib/Lake", 0x7ffc739e8408) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/aesop/.lake/build/lib/Lake.olean", 0x7ffc739e8418) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/proofwidgets/.lake/build/lib/Lake", 0x7ffc739e8408) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/proofwidgets/.lake/build/lib/Lake.olean", 0x7ffc739e8418) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/Cli/.lake/build/lib/Lake", 0x7ffc739e8408) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/Cli/.lake/build/lib/Lake.olean", 0x7ffc739e8418) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/importGraph/.lake/build/lib/Lake", 0x7ffc739e8408) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/importGraph/.lake/build/lib/Lake.olean", 0x7ffc739e8418) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/REPL/.lake/build/lib/Lake", 0x7ffc739e8408) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/packages/REPL/.lake/build/lib/Lake.olean", 0x7ffc739e8418) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/build/lib/Lake", 0x7ffc739e8408) = -1 ENOENT (No such file or directory)
24664 stat("././.lake/build/lib/Lake.olean", 0x7ffc739e8418) = -1 ENOENT (No such file or directory)
24664 stat("/home/verifier/.elan/toolchains/leanprover--lean4---v4.9.0-rc1/lib/lean/Lake", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
24664 stat("/home/verifier/.elan/toolchains/leanprover--lean4---v4.9.0-rc1/lib/lean/Lake/Toml/Grammar.olean", {st_mode=S_IFREG|0644, st_size=853832, ...}) = 0
24664 openat(AT_FDCWD, "/home/verifier/.elan/toolchains/leanprover--lean4---v4.9.0-rc1/lib/lean/Lake/Toml/Grammar.olean", O_RDONLY) = 3
24664 newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=853832, ...}, AT_EMPTY_PATH) = 0
24664 newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=853832, ...}, AT_EMPTY_PATH) = 0
24664 lseek(3, 851968, SEEK_SET)        = 851968
24664 read(3, "\5\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0down\0\0\0\0"..., 1864) = 1864
24664 lseek(3, 851968, SEEK_SET)        = 851968
24664 read(3, "\5\0\0\0\0\0\0\0\5\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0down\0\0\0\0"..., 4096) = 1864
24664 lseek(3, 0, SEEK_SET)             = 0
24664 read(3, "olean\1be6c4894e0a6c542d56a6f4bb1"..., 4096) = 4096
24664 openat(AT_FDCWD, "/home/verifier/.elan/toolchains/leanprover--lean4---v4.9.0-rc1/lib/lean/Lake/Toml/Grammar.olean", O_RDONLY) = 4
24664 mmap(0x3d3b5ae20000, 853832, PROT_READ, MAP_PRIVATE, 4, 0) = 0x3d3b5ae20000
24664 close(4)                          = 0
24664 lseek(3, 0, SEEK_SET)             = 0
24664 read(3, "olean\1be6c4894e0a6c542d56a6f4bb1"..., 56) = 56
24664 close(3)

Sebastian Ullrich (Apr 23 2025 at 13:57):

So can you confirm whether most time is spent before or after the last mmap?

Vadim Kantorov (Apr 23 2025 at 13:59):

import Mathlib strace contains 87974 stat calls and 20179 read calls - quite a lot

moving mathlib4 to /tmp cuts time of import Mathlib from 42-seconds to 2 seconds

Vadim Kantorov (Apr 23 2025 at 14:01):

I haven't done any perf profiling yet - will see how to do this. But I confirm that lean imports mathlib very inefficiently/excessively in terms of number of I/O syscalls

If you have any suggestions on how to measure with perf times of stat calls and read calls separately, I'll try it

Sebastian Ullrich (Apr 23 2025 at 14:04):

You can ask strace to print timestamps

Vadim Kantorov (Apr 23 2025 at 14:53):

logs.zip

I did strace -f -ttt -o log.txt ... and strace -f -r -o log.txt ... to get the syscalls timestamps, attaching the outputs. Not sure if it's sufficient, but clearly I see 20x speedup when moving to /tmp

Vadim Kantorov (Apr 23 2025 at 15:19):

logs_ttt_T.zip

here are the logs (bad and good - for comparison) of strace -f -ttt -T -o log.txt ...

Sebastian Ullrich (Apr 23 2025 at 15:38):

Thanks, that does suggest the stats are significant. Fortunately lean#8024 should eliminate most of them, would you be interested in testing the linked mathlib branch?

Vadim Kantorov (Apr 23 2025 at 16:12):

Unfortunately, I am still not very well versed in Lean toolchain. If you can advise how I can update the lean version in my mathlib4 install, I can try it

I've followed the instructions in https://github.com/Goedel-LM/Goedel-Prover?tab=readme-ov-file#installation: clone the fork in https://github.com/xinhjBrant/mathlib4 ( https://github.com/xinhjBrant/mathlib4/tree/2f65ba7f1a9144b20c8e7358513548e317d26de1 ) and then did cd mathlib4 && lake build. If you could advise me how to upgrade the lean version to this linked lean4 PR (you did not link any mathlib4 branch, right?), I can try it and report.

Thank you!

Sebastian Ullrich (Apr 23 2025 at 16:16):

The mathlib branch is linked in a comment there: lean-pr-testing-8024. If you check that one out, you shouldn't need to do more than lake build

Vadim Kantorov (Apr 23 2025 at 16:20):

Hmm

git clone --recursive https://github.com/leanprover-community/mathlib4 --branch pr-testing-8024 --single-branch --depth 1
Cloning into 'mathlib4'...
warning: Could not find remote branch pr-testing-8024 to clone

Michael Rothgang (Apr 23 2025 at 16:23):

Your ocmmand has a typo, I think: it should be lean-pr-testing-8024

Vadim Kantorov (Apr 23 2025 at 16:37):

Oopsie :) It now worked. It will now take me a few hours to build mathlib4. Will report later if there is still a large difference between running of ramdisk/local ssd and networked ssd

Henrik Böving (Apr 23 2025 at 16:39):

You should be able to just run lake exe cache get I would hope

Vadim Kantorov (Apr 23 2025 at 16:43):

Instead of lake build?

Henrik Böving (Apr 23 2025 at 16:43):

Yes, it downloads the olean cache for your specific commit of mathlib if available (which it should be)

Michael Rothgang (Apr 23 2025 at 16:47):

(First run lake exe cache get, then lake build should be much faster --- in this case, almost instant since you didn't make any changes causing rebuilds.)

Vadim Kantorov (Apr 23 2025 at 17:10):

Tried it out. This PR made things worse: 65 seconds on networked SSD vs 3 seconds on /tmp

Vadim Kantorov (Apr 23 2025 at 17:20):

It now gives 150K stat calls:

$ grep 'stat(' /tmp/log_bad.txt | wc -l
150912

log_bad.txt.zip

Vadim Kantorov (Apr 23 2025 at 17:21):

Should I open a GitHub issue on this?

Sebastian Ullrich (Apr 23 2025 at 18:36):

Could you leave a comment on the PR? That is the most likely place to fix this issue

Mac Malone (Apr 23 2025 at 19:36):

Sebastian Ullrich said:

Fortunately lean#8024 should eliminate most of them, would you be interested in testing the linked mathlib branch?

lean4#8024 does not yet have Lake integration.

Eric Wieser (Apr 23 2025 at 20:41):

Is there a tracking issue I can follow to see what that integration will look like?

Vadim Kantorov (Apr 23 2025 at 21:46):

@Sebastian Ullrich I just added a comment on that PR

Sebastian Ullrich (Apr 24 2025 at 06:11):

Thanks. Your comment that even 2s would be too costly for training purposes did strike me as odd because common wisdom in that space is to do all training (per machine) in a single process that loads the desired imports once.

Vadim Kantorov (Apr 24 2025 at 11:46):

Maybe you're right - for a single REPL process verifying multiple statements. But speeding up basic Lean imports (if possible) can speed-up basic verification of produced whole-proofs using just Lean (e.g. for Goedel-Pset, this is 1.8 million statements * e.g. 32 generations (or 256 generations))

But it's not a blocker indeed (as can probably be sidetracked via REPL)


Last updated: May 02 2025 at 03:31 UTC