> -----Original Message-----
> From: Thomas Neumann <thomas.neum...@in.tum.de>
> Sent: Monday, November 21, 2022 11:23 AM
> To: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org;
> Jason Merrill <ja...@redhat.com>
> Cc: Florian Weimer <fwei...@redhat.com>; Jakub Jelinek
> <ja...@redhat.com>; Jonathan Wakely <jwakely....@gmail.com>
> Subject: Re: [PATCH v4] eliminate mutex in fast path of __register_frame
> 
> Hi,
> 
> > When dynamically linking, a fast enough machine hides the latency, but
> > when statically linking or on slower devices this change caused a 5x
> > increase in instruction count and a 2x increase in cycle count before
> > getting to main.
> >
> > This has been quite noticeable on smaller devices.  Is there a reason
> > the btree can't be initialized lazily? It seems a bit harsh to pay the
> > cost of unwinding at startup even when you don't throw exceptions.
> 
> We cannot easily do that lazily, because lazy initialization would need a
> mutex, which is exactly what we wanted to get rid of.
> 
> Having said that, I am surprised that you saw a noticeable difference.
> On most platforms there should not be dynamic frame registration at all, as
> the regular frames are directly read from the ELF data.
> 
> Can you please send me a precise description of how to reproduce the
> issue? (Platform, tools, a VM if you have one would be great). I will
> then debug this to improve the startup time.
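
To make the lazy-initialization tradeoff above concrete, a rough sketch of
creating the lookup structure on first use (purely illustrative, invented
names, not the actual libgcc code) would be something like:

#include <mutex>

struct frame_index { /* stand-in for the real lookup structure */ };

static std::mutex index_mutex;
static frame_index *index_instance = nullptr;

frame_index *get_index() {
    // Classic guarded lazy construction: every registration/lookup has to
    // take (or at least check) this lock, which is exactly the kind of
    // synchronization the patch removes from the fast path.
    std::lock_guard<std::mutex> lock(index_mutex);
    if (index_instance == nullptr)
        index_instance = new frame_index();
    return index_instance;
}

int main() {
    return get_index() != nullptr ? 0 : 1;
}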

It's easy to reproduce on x86 as well.

As a testcase:

#include <cstdio>

int main(int argc, char** argv) {
    return 0;
}

And just compile with: g++ -O1 hello.cpp -static -o hello.exe.

Before this change on x86 I got:

> perf stat -r 200 ./hello.exe

 Performance counter stats for './hello.exe' (200 runs):

              0.32 msec task-clock                #    0.326 CPUs utilized            ( +-  0.34% )
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                22      page-faults               #    0.070 M/sec                    ( +-  0.13% )
           310,194      cycles                    #    0.984 GHz                      ( +-  0.33% )
           317,310      instructions              #    1.02  insn per cycle           ( +-  0.18% )
            58,885      branches                  #  186.710 M/sec                    ( +-  0.12% )
               931      branch-misses             #    1.58% of all branches          ( +-  2.57% )

        0.00096799 +- 0.00000374 seconds time elapsed  ( +-  0.39% )

And after this change:

> perf stat -r 200 ./hello.exe

 Performance counter stats for './hello.exe' (200 runs):

              1.03 msec task-clock                #    0.580 CPUs utilized            ( +-  0.23% )
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                27      page-faults               #    0.026 M/sec                    ( +-  0.10% )
         1,034,038      cycles                    #    1.002 GHz                      ( +-  0.11% )
         2,485,983      instructions              #    2.40  insn per cycle           ( +-  0.02% )
           557,567      branches                  #  540.215 M/sec                    ( +-  0.01% )
             4,843      branch-misses             #    0.87% of all branches          ( +-  0.53% )

        0.00178093 +- 0.00000456 seconds time elapsed  ( +-  0.26% )
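
In other words, for this empty statically linked program that is roughly
2,485,983 / 317,310 ~ 7.8x the instructions and 1,034,038 / 310,194 ~ 3.3x
the cycles (and about 1.8x the elapsed time) spent before main even runs.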

Regards,
Tamar
> 
> Best
> 
> Thomas
