Zulip Chat Archive
Stream: lean4
Topic: Performance drop
Arthur Paulino (Nov 15 2022 at 20:50):
Hi! Was there some change that could impact the performance of the execution compiled binaries?
We have a codebase whose execution performance has dropped by half. On leanprover/lean4:nightly-2022-09-11
our CI runs lasted around 7 minutes and then after updating to 10-27
it went up to 12 minutes and beyond. And in the most recent toolchain it's reaching 15 minutes.
This is the codebase: https://github.com/yatima-inc/yatima-lang
Arthur Paulino (Nov 15 2022 at 20:54):
I'm asking in the spirit of providing help if needed :pray:
Mario Carneiro (Nov 15 2022 at 20:58):
Could you run perf
on both versions and see whether the hot functions have changed?
Mario Carneiro (Nov 15 2022 at 20:59):
if you can minimize the issue that would be super helpful for the benchmark suite
Arthur Paulino (Nov 15 2022 at 21:01):
Any particular version I should get?
λ perf
Comando 'perf' não encontrado, mas poder ser instalado com:
sudo apt install linux-intel-iotg-tools-common # version 5.15.0-1017.22, or
sudo apt install linux-nvidia-tools-common # version 5.15.0-1007.7
sudo apt install linux-tools-common # version 5.15.0-52.58
Mario Carneiro (Nov 15 2022 at 21:02):
I'm not an expert but I think sudo apt install linux-tools-common
will work?
Mario Carneiro (Nov 15 2022 at 21:03):
Here's the relevant graph for changes in execution time on a bunch of benchmarks, I don't see any that got much worse on those days
Arthur Paulino (Nov 15 2022 at 21:06):
How do I run perf
in a meaningful way for us?
Mario Carneiro (Nov 15 2022 at 21:12):
I'm hoping someone who knows more will swoop in here, but this page has some example commands, and I think one that is relevant is:
perf record --call-graph dwarf my_command
perf report --stdio
Wojciech Nawrocki (Nov 15 2022 at 21:13):
@Arthur Paulino this thread helped me getting perf
to work.
Wojciech Nawrocki (Nov 15 2022 at 21:14):
Also https://github.com/leanprover/lean4/blob/master/doc/perf.md
Wojciech Nawrocki (Nov 15 2022 at 21:14):
And the FF Profiler is a great web UI for perf
traces
Arthur Paulino (Nov 15 2022 at 21:42):
These are the top lines from perf diff old.data new.data
:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... .................... ..........................................
#
0.19% +66.82% yatima [.] 0x00000000035d69a0
+5.28% yatima [.] lean::for_each_fn::apply
+3.75% yatima [.] l_Yatima_Typechecker_printExpr
+3.34% yatima [.] l_List_mapM___at_Yatima_Transpiler_mkIndLiteral___spec__1
4.57% -2.44% [unknown] [k] 0xffffffffa76063a5
+2.13% yatima [.] l_IO_eprintln___at_compileRun___spec__3
+1.26% yatima [.] lean::add_inductive_fn::mk_rec_infos
+0.95% yatima [.] l_Yatima_Typechecker_toCtorIfLit
2.13% -0.84% libleanshared.so [.] lean_uint64_mix_hash
0.92% -0.80% libleanshared.so [.] lean_mark_mt
+0.66% yatima [.] l_Yatima_Typechecker_eval
+0.57% yatima [.] lean::replace
0.00% +0.51% yatima [.] lean::elim_dead_let_fn::visit_lambda
Patrick Massot (Nov 15 2022 at 21:43):
0x00000000035d69a0
is clearly guilty. Any other question?
Arthur Paulino (Nov 15 2022 at 21:44):
Is that the output we want from perf
?
Kevin Lacker (Nov 15 2022 at 21:45):
i find it easier to look at perf data in a different format, like a flamegraph, for example with this tool: https://github.com/brendangregg/FlameGraph
Gabriel Ebner (Nov 15 2022 at 21:56):
Try dwarf,60000 that should get us a better suspect.
Arthur Paulino (Nov 15 2022 at 21:58):
I downloaded hotspot
and it seems to show better data. Anyone up for a quick jitsi call?
Arthur Paulino (Nov 15 2022 at 21:59):
I can share my screen
Arthur Paulino (Nov 15 2022 at 22:03):
Gabriel Ebner said:
Try dwarf,60000 that should get us a better suspect.
Running that
Arthur Paulino (Nov 15 2022 at 22:14):
32.30% -32.07% yatima [.] 0x0000000000a183d1
+22.17% yatima [.] lean_dec_ref_cold
+9.72% yatima [.] lean_alloc_small
+8.14% yatima [.] lean_free_small
+5.16% yatima [.] l_Lean_AssocList_foldlM___at_Lean_ConstMap_childrenOfWith___spec__2___rarg
+4.86% yatima [.] lean::alloc
+3.56% yatima [.] lean_byte_array_push
+3.43% yatima [.] l_List_compareAux___at_ByteArray_instOrdByteArray___spec__1
+3.19% yatima [.] lean_array_to_list
+3.18% yatima [.] lean_name_eq
4.77% -3.08% [unknown] [k] 0xffffffffa76060d0
+2.65% yatima [.] lean_byte_array_copy_slice
+2.26% yatima [.] lean::dealloc
+2.18% yatima [.] l_Array_zipWithAux___at_Keccak_rho___spec__1
+2.07% yatima [.] l_UnsignedVarInt_toVarIntCore
+1.73% yatima [.] l_List_toByteArray_loop
+1.36% yatima [.] lean_byte_array_data
+1.28% yatima [.] l_ByteArray_append
+1.12% yatima [.] l_Multihash_toBytes
+1.08% yatima [.] l_Array_mapIdxM_map___at_Keccak_chi___spec__1
2.00% -1.00% libleanshared.so [.] lean_uint64_mix_hash
0.97% -0.90% libleanshared.so [.] lean_mark_mt
+0.87% yatima [.] lean_array_push
0.00% +0.80% yatima [.] l_Std_RBNode_ins___at_Yatima_Compiler_addToStore___spec__19
+0.79% yatima [.] l_Std_RBNode_find_x3f___at_Yatima_Compiler_getExprIpld___spec__1
+0.65% yatima [.] l_Std_RBNode_insert___at_Yatima_Compiler_getUnivIpld___spec__2___at_Yatima_Compiler_getUnivIpld___spec__6___lambda__1___boxed
+0.65% yatima [.] l_Cid_toBytes
+0.60% yatima [.] l_Array_foldlMUnsafe_fold___at_Keccak_theta___spec__3
+0.60% yatima [.] l_Std_RBNode_ins___at_Yatima_Compiler_addToStore___spec__14
+0.50% yatima [.] l_Array_foldlMUnsafe_fold___at_Lean_ConstMap_childrenOfWith___spec__5___rarg
0.63% +0.46% libc.so.6 [.] __memmove_avx_unaligned_erms
0.54% -0.45% libleanshared.so [.] lean_name_eq
Arthur Paulino (Nov 15 2022 at 22:19):
lean_dec_ref_cold
, lean_alloc_small
, lean_free_small
, lean::alloc
and lean_byte_array_push
seem to be the biggest aggressors
Arthur Paulino (Nov 15 2022 at 22:28):
Was there some big change on the logic behind memory allocation after 2022-9-11
?
Arthur Paulino (Nov 15 2022 at 22:30):
Because another issue I had involved a stack overflow when running a piece of code that worked just fine before
Mario Carneiro (Nov 15 2022 at 22:51):
Arthur Paulino said:
lean_dec_ref_cold
,lean_alloc_small
,lean_free_small
,lean::alloc
andlean_byte_array_push
seem to be the biggest aggressors
Those are always likely to be the hottest bits of code. The question is more whether something moved to the front after 2022-9-11
Wojciech Nawrocki (Nov 15 2022 at 22:51):
@Arthur Paulino you might or might not be missing some symbol names. What does the first page of perf report | less
output look like?
Arthur Paulino (Nov 15 2022 at 22:59):
Wojciech Nawrocki said:
Arthur Paulino you might or might not be missing some symbol names. What does the first page of
perf report | less
output look like?
It just shows a black terminal screen to me
Wojciech Nawrocki (Nov 15 2022 at 23:02):
Uh, perf report --stdio
?
Arthur Paulino (Nov 15 2022 at 23:07):
The first lines of perf report
on a call that results on a stack overflow:
Samples: 20K of event 'cycles', Event count (approx.): 20390088657
Children Self Command Shared Object Symbol
+ 9,44% 9,41% lake libleanshared.so [.] lean_uint64_mix_hash ◆
+ 7,01% 6,69% yatima yatima [.] lean_byte_array_push ▒
+ 6,31% 6,04% yatima yatima [.] lean_dec_ref_cold ▒
+ 4,48% 4,38% yatima yatima [.] lean_alloc_small ▒
+ 4,30% 4,22% yatima yatima [.] lean_free_small ▒
+ 3,97% 0,03% lake libleanshared.so [.] lean_byte_array_copy_slice ▒
+ 3,94% 0,00% lake libc.so.6 [.] __memcpy_avx_unaligned_erms (inlined) ▒
+ 3,92% 0,00% lake [unknown] [.] 0xffffffffa8400099 ▒
+ 3,91% 3,32% lake libleanshared.so [.] lean_dec_ref_cold ▒
+ 3,67% 0,00% lake [unknown] [.] 0xffffffffa839821c ▒
+ 3,47% 3,42% yatima yatima [.] l_List_toByteArray_loop ▒
+ 2,78% 0,29% yatima yatima [.] lean_byte_array_copy_slice ▒
+ 2,75% 0,00% lake [unknown] [.] 0xffffffffa8400b66 ▒
+ 2,74% 0,00% yatima libc.so.6 [.] __GI___libc_free (inlined) ▒
+ 2,67% 0,00% lake [unknown] [.] 0xffffffffa839c0d7 ▒
+ 2,53% 0,00% yatima libc.so.6 [.] __memcpy_avx_unaligned_erms (inlined) ▒
+ 2,50% 0,00% lake [unknown] [.] 0xffffffffa769ea69 ▒
+ 2,42% 0,01% lake [unknown] [.] 0xffffffffa78fbd88 ▒
+ 2,39% 2,36% yatima yatima [.] l_List_reverseAux___rarg ▒
+ 2,37% 0,00% yatima [unknown] [.] 0xffffffffa8400b66 ▒
+ 2,31% 0,00% lake [unknown] [.] 0xffffffffa78fb977 ▒
+ 2,25% 0,00% yatima [unknown] [.] 0xffffffffa839c0d7 ▒
+ 2,11% 2,04% yatima yatima [.] l_ByteArray_toList_loop ▒
+ 2,08% 0,00% yatima [unknown] [.] 0xffffffffa769ea69 ▒
+ 2,03% 0,00% lake [unknown] [.] 0xffffffffa78f9a1a ▒
+ 1,98% 0,38% yatima yatima [.] __gmp_default_allocate ▒
+ 1,98% 1,95% lake lake [.] l_Lake_computeFileHash ▒
+ 1,96% 0,00% yatima [unknown] [.] 0xffffffffa78fbd88 ▒
+ 1,90% 0,18% yatima yatima [.] __gmp_default_reallocate ▒
+ 1,84% 0,00% lake libc.so.6 [.] __GI__IO_file_xsgetn (inlined) ▒
+ 1,84% 0,00% yatima [unknown] [.] 0xffffffffa78fb977 ▒
+ 1,81% 0,02% lake libleanshared.so [.] lean_io_prim_handle_read ▒
+ 1,80% 0,00% lake libc.so.6 [.] __GI__IO_fread (inlined) ▒
+ 1,79% 0,00% yatima [unknown] [.] 0xffffffffa78f9a1a ▒
+ 1,78% 1,75% yatima yatima [.] lean::alloc ▒
+ 1,74% 1,74% lake libc.so.6 [.] __memmove_avx_unaligned_erms ▒
+ 1,74% 0,00% lake libc.so.6 [.] _IO_new_file_underflow (inlined) ▒
+ 1,74% 0,07% yatima yatima [.] lean::mpz::~mpz ▒
+ 1,72% 1,69% yatima libc.so.6 [.] _int_free ▒
+ 1,71% 0,00% yatima libc.so.6 [.] __GI___libc_realloc (inlined) ▒
+ 1,71% 0,00% lake libc.so.6 [.] __GI___libc_read (inlined) ▒
+ 1,58% 0,00% yatima libc.so.6 [.] __GI___libc_malloc (inlined) ▒
+ 1,57% 1,57% lake libleanshared.so [.] lean_alloc_small ▒
+ 1,55% 1,55% yatima libc.so.6 [.] malloc
Mario Carneiro (Nov 15 2022 at 23:12):
why are there so many things from lake
in this list? Is lake calling yatima or vice versa?
Wojciech Nawrocki (Nov 15 2022 at 23:13):
It's also definitely missing stack traces, so we don't know what function calls what. Did you build the application you are tracing with moreLeancArgs := #["-UNDEBUG", "-Og", "-ggdb", "-g3", "-fno-omit-frame-pointer"]
?
Arthur Paulino (Nov 15 2022 at 23:19):
Mario Carneiro said:
why are there so many things from
lake
in this list? Is lake calling yatima or vice versa?
I run with lake exe yatima ...
Arthur Paulino (Nov 15 2022 at 23:19):
Wojciech Nawrocki said:
It's also definitely missing stack traces, so we don't know what function calls what. Did you build the application you are tracing with
moreLeancArgs := #["-UNDEBUG", "-Og", "-ggdb", "-g3", "-fno-omit-frame-pointer"]
?
Nope. Let me do that
Wojciech Nawrocki (Nov 15 2022 at 23:21):
Arthur Paulino said:
Mario Carneiro said:
why are there so many things from
lake
in this list? Is lake calling yatima or vice versa?I run with
lake exe yatima ...
I think running perf record (...) build/bin/yatima
will give you a cleaner perf report.
Arthur Paulino (Nov 15 2022 at 23:25):
+ 13,21% 12,91% yatima yatima [.] lean_dec_ref_cold
+ 11,42% 11,14% yatima yatima [.] lean_byte_array_push
+ 8,79% 8,78% yatima yatima [.] lean_alloc_small
+ 7,91% 7,87% yatima yatima [.] lean_free_small
+ 6,92% 6,91% yatima yatima [.] l_List_toByteArray_loop
+ 4,72% 0,44% yatima yatima [.] lean_byte_array_copy_slice
+ 4,68% 0,00% yatima libc.so.6 [.] __GI___libc_free (inlined)
+ 4,35% 4,34% yatima yatima [.] l_List_reverseAux___rarg
+ 4,34% 4,32% yatima yatima [.] l_ByteArray_toList_loop
+ 4,31% 0,00% yatima libc.so.6 [.] __memcpy_avx_unaligned_erms (inlined)
+ 3,88% 0,00% yatima [unknown] [.] 0xffffffffa8400b66
+ 3,75% 0,00% yatima [unknown] [.] 0xffffffffa839c0d7
+ 3,57% 0,73% yatima yatima [.] __gmp_default_allocate
+ 3,52% 0,00% yatima [unknown] [.] 0xffffffffa769ea69
+ 3,45% 3,43% yatima yatima [.] lean::alloc
+ 3,39% 0,27% yatima yatima [.] lean::mpz::~mpz
+ 3,30% 0,00% yatima [unknown] [.] 0xffffffffa78fbd88
+ 3,11% 0,00% yatima [unknown] [.] 0xffffffffa78fb977
+ 3,04% 0,00% yatima [unknown] [.] 0xffffffffa8400099
+ 3,04% 0,32% yatima yatima [.] __gmp_default_reallocate
+ 2,98% 0,00% yatima [unknown] [.] 0xffffffffa78f9a1a
+ 2,86% 0,00% yatima libc.so.6 [.] __GI___libc_malloc (inlined)
+ 2,84% 2,84% yatima libc.so.6 [.] malloc
+ 2,73% 0,00% yatima libc.so.6 [.] __GI___libc_realloc (inlined)
+ 2,55% 0,00% yatima libc.so.6 [.] __GI__IO_file_xsgetn (inlined)
+ 2,53% 0,00% yatima [unknown] [.] 0xffffffffa839821c
+ 2,51% 0,00% yatima libc.so.6 [.] _IO_new_file_underflow (inlined)
+ 2,49% 0,02% yatima yatima [.] lean_io_prim_handle_read
+ 2,48% 0,00% yatima libc.so.6 [.] __GI___libc_read (inlined)
+ 2,48% 2,45% yatima libc.so.6 [.] _int_free
+ 2,47% 0,00% yatima libc.so.6 [.] __GI__IO_fread (inlined)
+ 2,33% 0,00% yatima [unknown] [.] 0xffffffffa791f27e
+ 2,23% 0,00% yatima [unknown] [.] 0xffffffffa798d439
+ 2,22% 0,00% yatima [unknown] [.] 0xffffffffa798d397
+ 2,11% 0,00% yatima [unknown] [.] 0xffffffffa78f5632
+ 2,06% 0,00% yatima [unknown] [.] 0xffffffffa798a913
+ 2,03% 0,00% yatima [unknown] [.] 0xffffffffa7989f60
+ 2,01% 0,00% yatima [unknown] [.] 0xffffffffa7a732cb
+ 2,01% 0,00% yatima [unknown] [.] 0xffffffffa78b3575
+ 1,99% 0,00% yatima [unknown] [.] 0xffffffffa7940fad
+ 1,96% 0,00% yatima [unknown] [.] 0xffffffffa791e02d
+ 1,95% 1,95% yatima [unknown] [k] 0xffffffffa7c9e5a7
+ 1,89% 1,89% yatima libc.so.6 [.] cfree@GLIBC_2.2.5
+ 1,78% 0,51% yatima yatima [.] lean::mpz::mpz
+ 1,74% 0,08% yatima libc.so.6 [.] __memset_avx2_unaligned_erms
+ 1,73% 1,73% yatima yatima [.] l_UnsignedVarInt_fromVarIntCore
+ 1,70% 0,08% yatima yatima [.] lean::lean_alloc_small_cold
+ 1,52% 0,78% yatima yatima [.] l_DagCbor_repeatFor___rarg
+ 1,50% 1,50% yatima libc.so.6 [.] realloc
+ 1,42% 1,42% yatima libc.so.6 [.] __memmove_avx_unaligned_erms
Wojciech Nawrocki (Nov 15 2022 at 23:29):
Hm that still doesn't look right. I can give it a go, is it just the version that's in the repo? What command are you benchmarking?
Gabriel Ebner (Nov 15 2022 at 23:31):
All these ByteArray<->List conversions look super fishy.
Gabriel Ebner (Nov 15 2022 at 23:31):
Gabriel Ebner (Nov 15 2022 at 23:32):
This should construct a ByteArray
directly.
def fromVarIntCore : Nat → Nat → List UInt8 → Nat → Option (Nat × List UInt8)
Gabriel Ebner (Nov 15 2022 at 23:35):
I'm kind of surprised that we don't have a DecidableEq ByteArray
instance yet. This one allocates two new arrays, which are 8x as large as the byte arrays btw:
instance : BEq ByteArray where
beq a b := a.data == b.data
Hanting Zhang (Nov 15 2022 at 23:37):
Yes, this instance looks fishy to me as well -- shouldn't this be exported to some C function?
Mario Carneiro (Nov 15 2022 at 23:38):
yeah, this looks like a good place to drop in memcmp
Wojciech Nawrocki (Nov 15 2022 at 23:43):
@Arthur Paulino here is roughly what a perf recording with correct symbol information should look like:
l_typecheckRun
|
|--32.44%--l_Yatima_Typechecker_typecheck
| l_Yatima_Converter_extractPureStore
| l_Yatima_Converter_ConvertM_run___rarg
| lean_apply_2
| lean_apply_3
| l_Yatima_Converter_convertStore___lambda__1
| |
| --31.92%--l_Std_RBNode_forM___at_Yatima_Converter_convertStore___spec__10
| |
| |--31.30%--l_Std_RBNode_forM___at_Yatima_Converter_convertStore___spec__10
| | |
| | |--26.74%--l_Std_RBNode_forM___at_Yatima_Converter_convertStore___spec__10
| | | |
| | | |--18.33%--l_Std_RBNode_forM___at_Yatima_Converter_convertStore___spec__10
| | | | |
| | | | |--13.97%--l_Std_RBNode_forM___at_Yatima_Converter_convertStore___spec__10
| | | | | |
| | | | | |--7.57%--l_Std_RBNode_forM___at_Yatima_Converter_convertStore___spec__10
I got this with
moreLeancArgs := #["-UNDEBUG", "-Og", "-ggdb", "-g3", "-fno-omit-frame-pointer"]
moreLinkArgs := #[
"-fuse-ld=ld",
"-L/home/$THEUSER/.elan/toolchains/leanprover--lean4---nightly-2022-10-27/lib"
]
(some of these flags are likely unnecessary but I don't remember which, so I just include all :) )
Hanting Zhang (Nov 15 2022 at 23:44):
(arthur has to go so I'll be looking at this now. but also I'm on WSL2 so perf
doesn't really work)
Hanting Zhang (Nov 15 2022 at 23:49):
Hi @Wojciech Nawrocki, how should I read this output? It looks like we're spending a lot of time in l_Std_RBNode_forM___at_Yatima_Converter_convertStore___spec__10
?
Gabriel Ebner (Nov 15 2022 at 23:53):
but also I'm on WSL2 so perf doesn't really work
perf
should work well enough on WSL2 (only the performance counters are broken). You just need to install the tools for the wrong linux version:
sudo apt install linux-tools-5.15.0-53-generic
/usr/lib/linux-tools/5.15.0-53-generic/perf ...
Arthur Paulino (Nov 16 2022 at 00:19):
There are some clear improvements we can make on our side, but I still don't understand the performance drop.
@Wojciech Nawrocki to reproduce the issue, you can lake exe yatima compile Fixtures/Transpilation/Demo.lean
(but you need to open it on VSCode first so it generates the olean file). This command will generate an output.ir
. This call, alone, used to take 20s on my machine and now it takes 50s.
To reproduce the overflow (which might not happen on your machine if you have too much memory available), do lake exe yatima typecheck output.ir
Hanting Zhang (Nov 16 2022 at 00:23):
(deleted)
Wojciech Nawrocki (Nov 16 2022 at 00:29):
Hanting Zhang said:
Hi Wojciech Nawrocki, how should I read this output? It looks like we're spending a lot of time in
l_Std_RBNode_forM___at_Yatima_Converter_convertStore___spec__10
?
I didn't make it very clear, sorry. The only point of including that example output was to demonstrate that it has indents where typecheck
calls convertStore
calls .. , whereas the output Arthur seems to have been getting was entirely flat. If you get the flatness, it usually means the recording is missing symbol information so you don't know which function in your code is calling lean_dec_ref_cold
a lot and (perhaps!) resulting in the slowdown.
Arthur Paulino (Nov 16 2022 at 07:42):
Thanks everyone for the pointers!
Apparently the root cause for the slowdown was the expansion for the RBMap API, which contains less partial functions and whose compilation will trigger the compilation of many other constants.
I will work on other improvements mentioned above :pray:🏼
Mario Carneiro (Nov 16 2022 at 08:03):
I'm unclear on the issue here. Is it a regression in the Std RBMap API relative to before / to lean?
Mario Carneiro (Nov 16 2022 at 08:04):
I don't see why fewer partial
functions would affect anything other than compile time
Hanting Zhang (Nov 16 2022 at 08:13):
We are compiling the lean environment into a Yatima one, so since RBMap depended on more constants our process was expectedly slower
Mario Carneiro (Nov 16 2022 at 08:18):
oh I see, it's like the lean self-test benchmark where compile time is the test
Arthur Paulino (Nov 16 2022 at 08:31):
Sorry, I wasn't clear enough when I said "compile"
Last updated: Dec 20 2023 at 11:08 UTC