Zulip Chat Archive

Stream: lean4

Topic: Precompile specific modules


Mario Carneiro (Nov 02 2022 at 08:58):

Right now, none of mathlib is using the precompileModules flag. How do I enable it? More importantly, we don't want all modules to be precompiled, only those with performance sensitive tactics in them. Can I make a list of modules that I want to have precompiled in the lakefile?

Sebastian Ullrich (Nov 02 2022 at 09:08):

Selective precompilation gets pretty complicated when you skip some modules but then downstream modules (e.g. from someone depending on mathlib) want to get precompiled after all; http://leanprover.github.io/talks/WITS2022.pdf#page=23. It would be great to first quantify the overhead of precompiling everything, especially for modules that don't contain all that much computable code anyway. Though the fact that on Windows we have to pass the entire closure of DLLs in the command line does make scaling a bit more difficult.

Mario Carneiro (Nov 02 2022 at 09:09):

In that case, let's assume that the set of precompiled modules is downward-closed. It shouldn't be too hard to automate that in lake given some set of desired modules

Sebastian Ullrich (Nov 02 2022 at 09:10):

Isn't basically all of mathlib in the downstream of important tactics?

Mario Carneiro (Nov 02 2022 at 09:10):

er, upward-closed

Mario Carneiro (Nov 02 2022 at 09:11):

that is, only the tactic files and their dependencies would get precompiled, not dependents

Sebastian Ullrich (Nov 02 2022 at 09:12):

Yeah, it shouldn't be too hard to implement that

Gabriel Ebner (Nov 02 2022 at 19:02):

Note that the current precompilation logic in Lake has some big performance and functionality issues (https://github.com/leanprover/lake/pull/132#issuecomment-1286372892), and I'm currently looking into fixing that.

Mario Carneiro (Nov 16 2022 at 01:00):

I have a much simpler version of this problem. How can I precompile mathlib completely in a test file? I want to profile compiled tactics but I have no idea how to compile any tactics in std or mathlib4.

Wojciech Nawrocki (Nov 16 2022 at 01:10):

I'm not quite sure if this is what you're asking but here is something to load in the compiled version of a dependency when running tests. You may need to run lake build DepLib:sharedLib first.

Wojciech Nawrocki (Nov 16 2022 at 01:11):

I guess the tldr almost-lake-less variant is lake env lean --load-dynlib=<somelib.so>

Mario Carneiro (Nov 16 2022 at 01:12):

Aha, I also found something like that. It runs, hard to say if it's working

Mario Carneiro (Nov 16 2022 at 01:52):

Help! I'm not having any success with perf-profiling tactics even after all these steps, there are no functions from mathlib showing up in the trace. I tried modifying the mathlib lakefile to this:

import Lake

open Lake DSL

package mathlib where
  precompileModules := true

@[default_target]
lean_lib Mathlib where
  moreLeanArgs := #["-DwarningAsError=true"]
  moreLeancArgs := #["-UNDEBUG", "-Og", "-ggdb", "-g3", "-fno-omit-frame-pointer"]
  moreLinkArgs := #[
    "-fuse-ld=ld",
    "-L/home/mario/.elan/toolchains/leanprover--lean4---nightly-2022-11-14/lib",
    "-Lbuild/lib"
  ]

@[default_target]
lean_exe runLinter where
  root := `scripts.runLinter
  supportInterpreter := true

meta if get_config? doc = some "on" then -- do not download and build doc-gen4 by default
require «doc-gen4» from git "https://github.com/leanprover/doc-gen4" @ "main"

require std from git "https://github.com/leanprover/std4" @ "main"
require Qq from git "https://github.com/gebner/quote4" @ "master"

and then

lake env lean --load-dynlib ./build/lib/libMathlib-Tactic-Ring.so test/ring.lean

takes 1.7 s while

lake env lean test/ring.lean

takes 1.9 s, which is a performance difference small enough that it might be noise; and perf record --call-graph dwarf lake ... records only stack frames from leanlibshared.so.

Gabriel Ebner (Nov 16 2022 at 02:39):

I couldn't get perf report to work either... hotspot works just fine though. I only had to add precompiledModules := true to the lakefile and then ran perf on your example command.

Gabriel Ebner (Nov 16 2022 at 02:46):

The perf difference is so small because around 60% of the runtime is spent inside lean_is_expr_def_eq (a large part of that seems to be in the code generated by ~q(...)). Also really surprising: 23% of the time is spent in elaborating the statements.

Gabriel Ebner (Nov 16 2022 at 02:46):

(The 60% and 23% are probably not disjoint.)

Gabriel Ebner (Nov 16 2022 at 02:47):

Another cost center: 10% of the time is spent processing the imports.

Gabriel Ebner (Nov 16 2022 at 02:55):

Another exciting discovery: Simp.tryTheoremWithExtraArgs? takes 25% of the runtime.

Mario Carneiro (Nov 16 2022 at 03:03):

What does the performance look like for a more extreme example like (1 + x) ^ 50 = (x + 1) ^ 50 for a sufficiently large value of 50? Ideally it should not spend much time in ~q and isDefEq since that's only used during reification

Mario Carneiro (Nov 16 2022 at 03:04):

ran perf on your example command.

ran perf how? I used perf record --call-graph dwarf lake env lean ... and then perf script > test.perf, is that what you mean?

Wojciech Nawrocki (Nov 16 2022 at 03:08):

@Mario Carneiro it works for me with lake env perf record --call-graph dwarf,60000 lean --load-dynlib=build/lib/libMathlib-Tactic-Ring.so test/ring.lean which avoids profiling lake itself.

Gabriel Ebner (Nov 16 2022 at 03:08):

lake env perf record --call-graph=dwarf,60000 lean --load-dynlib=./build/lib/libMathlib-Tactic-Ring.so test/ring2.lean
hotspot

Gabriel Ebner (Nov 16 2022 at 03:09):

The n=100 result is interesting. We now spend 17% of the time in Lean.getMaxHeight.

Mario Carneiro (Nov 16 2022 at 03:10):

is compiling making a bigger difference?

Gabriel Ebner (Nov 16 2022 at 03:12):

6.5s vs 6.1s

Gabriel Ebner (Nov 16 2022 at 03:12):

But we're breaking the FoldConsts cache here.

Mario Carneiro (Nov 16 2022 at 03:13):

oh, is it because the cache size is only 8192?

Gabriel Ebner (Nov 16 2022 at 03:13):

We're also breaking the Expr.find? cache.

Gabriel Ebner (Nov 16 2022 at 03:14):

oh, is it because the cache size is only 8192?

Effectively it's even less because we take the pointer address modulo 8192 as the key. I think you can see the issue here.

Mario Carneiro (Nov 16 2022 at 03:15):

oof

Gabriel Ebner (Nov 16 2022 at 03:15):

For Expr.replace I added a >>> 4 and it's a much smaller issue.

Mario Carneiro (Nov 16 2022 at 03:15):

Maybe pick a prime number cache size? Or is the mod too much

Gabriel Ebner (Nov 16 2022 at 03:17):

Picking an odd cache size would already be an improvement I think (because then the modulo doesn't align perfectly with the pointer alignment..).

Gabriel Ebner (Nov 16 2022 at 03:17):

Or maybe just hash the pointer.

Gabriel Ebner (Nov 16 2022 at 03:18):

It would also be interesting to see whether different proof terms for ring would be helpful here. (I always like nested lets.)

Mario Carneiro (Nov 16 2022 at 03:19):

I don't think ring has a lot of pointer duplication in its proofs unless it is 'discovered' by ShareCommon

Mario Carneiro (Nov 16 2022 at 03:20):

except in the expression terms

Mario Carneiro (Nov 16 2022 at 03:21):

Can the typechecker check a let without doing a full expression traversal to replace the value everywhere? It seems like that would defeat the optimization

Mario Carneiro (Nov 16 2022 at 03:22):

BTW if you are in the mood I also would like a performance gut check on the LinearCombination PR. It was noticeably slow which got me on the topic

Gabriel Ebner (Nov 16 2022 at 03:23):

Ah I see where you're coming from. In Lean 3 the lets were substituted in the kernel iirc. But in Lean 4 the type checker adds let-bindings to the lctx now.

Gabriel Ebner (Nov 16 2022 at 03:23):

Mario Carneiro said:

BTW if you are in the mood I also would like a performance gut check on the LinearCombination PR. It was noticeably slow which got me on the topic

I think you should try installing hotspot.

Mario Carneiro (Nov 16 2022 at 03:24):

I did, trying to replicate your result now

Mario Carneiro (Nov 16 2022 at 03:27):

Gabriel Ebner said:

Ah I see where you're coming from. In Lean 3 the lets were substituted in the kernel iirc. But in Lean 4 the type checker adds let-bindings to the lctx now.

Doesn't it still have to traverse the expression to replace the bvars with fvars?

Mario Carneiro (Nov 16 2022 at 03:30):

Oh hey, it works :tada: , mathlib functions are showing up now

Gabriel Ebner (Nov 16 2022 at 04:10):

Oh yeah the bvar replacement overhead is still there ofc. But this should be only linear.

Mario Carneiro (Nov 16 2022 at 04:24):

only if all the lets are in a block, right? If there was some arbitrary pattern it would be quadratic


Last updated: Dec 20 2023 at 11:08 UTC