> -----Original Message-----
> From: Thomas Neumann <thomas.neum...@in.tum.de>
> Sent: Monday, November 21, 2022 11:23 AM
> To: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org;
> Jason Merrill <ja...@redhat.com>
> Cc: Florian Weimer <fwei...@redhat.com>; Jakub Jelinek
> <ja...@redhat.com>; Jonathan Wakely <jwakely....@gmail.com>
> Subject: Re: [PATCH v4] eliminate mutex in fast path of __register_frame
>
> Hi,
>
> > When dynamically linking, a fast enough machine hides the latency, but
> > when statically linking or on slower devices this change caused a 5x
> > increase in instruction count and a 2x increase in cycle count before
> > getting to main.
> >
> > This has been quite noticeable on smaller devices. Is there a reason
> > the btree can't be initialized lazily? It seems a bit harsh to pay the
> > cost of unwinding at startup even when you don't throw exceptions.
>
> we cannot easily do that lazily because otherwise we need a mutex for lazy
> initialization, which is exactly what we wanted to get rid of.
>
> Having said that, I am surprised that you saw a noticeable difference.
> On most platforms there should not be dynamic frame registration at all, as
> the regular frames are directly read from the ELF data.
>
> Can you please send me a precise description of how to reproduce the
> issue? (Platform, tools, a VM if you have one would be great). I will then
> debug this to improve the startup time.

It's easy to reproduce on x86 as well. As a testcase:

    #include <cstdio>
    int main(int argc, char** argv) {
      return 0;
    }

And just compile with:

    g++ -O1 hello.cpp -static -o hello.exe

Before this change on x86 I got:

> perf stat -r 200 ./hello.exe

 Performance counter stats for './hello.exe' (200 runs):

          0.32 msec task-clock        #    0.326 CPUs utilized    ( +-  0.34% )
             0      context-switches  #    0.000 K/sec
             0      cpu-migrations    #    0.000 K/sec
            22      page-faults       #    0.070 M/sec            ( +-  0.13% )
       310,194      cycles            #    0.984 GHz              ( +-  0.33% )
       317,310      instructions      #    1.02  insn per cycle   ( +-  0.18% )
        58,885      branches          #  186.710 M/sec            ( +-  0.12% )
           931      branch-misses     #    1.58% of all branches  ( +-  2.57% )

    0.00096799 +- 0.00000374 seconds time elapsed  ( +-  0.39% )

And after this change:

> perf stat -r 200 ./hello.exe

 Performance counter stats for './hello.exe' (200 runs):

          1.03 msec task-clock        #    0.580 CPUs utilized    ( +-  0.23% )
             0      context-switches  #    0.000 K/sec
             0      cpu-migrations    #    0.000 K/sec
            27      page-faults       #    0.026 M/sec            ( +-  0.10% )
     1,034,038      cycles            #    1.002 GHz              ( +-  0.11% )
     2,485,983      instructions      #    2.40  insn per cycle   ( +-  0.02% )
       557,567      branches          #  540.215 M/sec            ( +-  0.01% )
         4,843      branch-misses     #    0.87% of all branches  ( +-  0.53% )

    0.00178093 +- 0.00000456 seconds time elapsed  ( +-  0.26% )

Regards,
Tamar

>
> Best
>
> Thomas