Re: [PATCH] Introduce 4-stages profiledbootstrap to get a better profile.

Martin Liška Thu, 25 May 2017 08:50:30 -0700

On 05/25/2017 01:22 PM, Markus Trippelsdorf wrote:
> On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
>> Hi.
>>
>> As I spoke about the PGO with Honza and Richi, the current 3-stage
>> profiledbootstrap is not ideal for the following 2 reasons:
>>
>> 1) the stageprofile compiler is trained just on the libraries that are
>> built during stage2
>> 2) apart from that, as the compiler is also used to build the final
>> compiler, the profile is being updated during the build. So the stage2
>> compiler is making different decisions.
>>
>> Both problems can be resolved by adding another step in between the current
>> stage2 and stage3, where we train the stage2 compiler by building the
>> compiler with default options.
>>
>> I'm going to do some measurements.
> 
> I did some measurements on gcc67 (trunk with --enable-checking=release).
> The apparent speedup is in the noise.


Hello.

Thanks for the measurements.

I can see a difference for GCC 7.1:

g++-7 tramp3d-v4.ii -O2 && time for i in `seq 1 10` ; do g++-7 tramp3d-v4.ii -O2 ; done

before: 2m25.133s
after: real     2m25.133s

which is 99.09124426480228%. That is probably within the noise level.

And apparently the file size of the binary is bigger:

before (using bloaty):

     VM SIZE                         FILE SIZE
 --------------                   --------------
  59.0%  15.1Mi .text              15.1Mi  62.3%
  21.3%  5.45Mi .rodata            5.45Mi  22.5%
   6.6%  1.69Mi .eh_frame          1.69Mi   6.9%
   5.4%  1.38Mi .bss                    0   0.0%
   3.3%   874Ki .dynstr             874Ki   3.5%
   1.8%   480Ki .dynsym             480Ki   1.9%
   1.1%   285Ki .eh_frame_hdr       285Ki   1.1%
   0.6%   158Ki .gnu.hash           158Ki   0.6%
   0.5%   144Ki .hash               144Ki   0.6%
   0.2%  44.4Ki .data              44.4Ki   0.2%
   0.2%  40.0Ki .gnu.version       40.0Ki   0.2%
   0.0%  11.1Ki .rela.plt          11.1Ki   0.0%
   0.0%  7.44Ki .plt               7.44Ki   0.0%
   0.0%  4.56Ki .data.rel.ro       4.56Ki   0.0%
   0.0%  3.73Ki .got.plt           3.73Ki   0.0%
   0.0%      38 [Unmapped]         2.75Ki   0.0%
   0.0%     624 [ELF Headers]      2.55Ki   0.0%
   0.0%     848 [Other]            1.13Ki   0.0%
   0.0%     917 .gcc_except_table     917   0.0%
   0.0%     608 .dynamic              608   0.0%
   0.0%      16 [None]                  0   0.0%
 100.0%  25.7Mi TOTAL              24.3Mi 100.0%

after:

     VM SIZE                     FILE SIZE
 --------------               --------------
  58.3%  14.6Mi .text          14.6Mi  54.2%
  21.6%  5.41Mi .rodata        5.41Mi  20.1%
   0.0%       0 .strtab        2.13Mi   7.9%
   6.7%  1.67Mi .eh_frame      1.67Mi   6.2%
   5.5%  1.38Mi .bss                0   0.0%
   0.0%       0 .symtab        1.11Mi   4.1%
   3.4%   876Ki .dynstr         876Ki   3.2%
   1.9%   480Ki .dynsym         480Ki   1.7%
   1.1%   280Ki .eh_frame_hdr   280Ki   1.0%
   0.6%   158Ki .gnu.hash       158Ki   0.6%
   0.6%   144Ki .hash           144Ki   0.5%
   0.2%  44.4Ki .data          44.4Ki   0.2%
   0.2%  40.1Ki .gnu.version   40.1Ki   0.1%
   0.0%  11.1Ki .rela.plt      11.1Ki   0.0%
   0.0%  7.44Ki .plt           7.44Ki   0.0%
   0.0%  4.56Ki .data.rel.ro   4.56Ki   0.0%
   0.0%  3.73Ki .got.plt       3.73Ki   0.0%
   0.0%      58 [Unmapped]     3.11Ki   0.0%
   0.0%     624 [ELF Headers]  2.61Ki   0.0%
   0.0%  2.32Ki [Other]        2.60Ki   0.0%
   0.0%      16 [None]              0   0.0%
 100.0%  25.1Mi TOTAL          26.9Mi 100.0%

As I discussed with Honza, we still have a problem in GCC: with the current
working sets, get_hot_bb_threshold () is very close to the number of runs,
which is effectively 1 for a single run. That's a mistake and it should be
fixed.

Martin



> 
> Without your patch:
> 
>  Performance counter stats for 'g++ -w -Ofast tramp3d-v4.cpp' (10 runs):
> 
>       15749.058451      task-clock (msec)         #    0.997 CPUs utilized            ( +-  0.13% )
>              1,352      context-switches          #    0.086 K/sec                    ( +-  0.16% )
>                  7      cpu-migrations            #    0.000 K/sec                    ( +-  5.73% )
>            269,142      page-faults               #    0.017 M/sec                    ( +-  0.01% )
>     60,676,581,181      cycles                    #    3.853 GHz                      ( +-  0.09% )  (83.35%)
>     13,401,784,189      stalled-cycles-frontend   #   22.09% frontend cycles idle     ( +-  0.20% )  (83.33%)
>     12,926,843,370      stalled-cycles-backend    #   21.30% backend cycles idle      ( +-  0.04% )  (83.31%)
>     73,074,099,356      instructions              #    1.20  insn per cycle
>                                                   #    0.18  stalled cycles per insn  ( +-  0.02% )  (83.34%)
>     16,607,220,814      branches                  # 1054.490 M/sec                    ( +-  0.03% )  (83.36%)
>        616,673,310      branch-misses             #    3.71% of all branches          ( +-  0.08% )  (83.36%)
> 
>       15.803602619 seconds time elapsed                                          ( +-  0.14% )
> 
> With your patch:
> 
>  Performance counter stats for 'g++ -w -Ofast tramp3d-v4.cpp' (10 runs):
> 
>       15735.220610      task-clock (msec)         #    0.997 CPUs utilized            ( +-  0.11% )
>              1,354      context-switches          #    0.086 K/sec                    ( +-  0.22% )
>                  6      cpu-migrations            #    0.000 K/sec                    ( +-  6.67% )
>            269,164      page-faults               #    0.017 M/sec                    ( +-  0.01% )
>     60,723,862,242      cycles                    #    3.859 GHz                      ( +-  0.08% )  (83.35%)
>     13,382,554,421      stalled-cycles-frontend   #   22.04% frontend cycles idle     ( +-  0.14% )  (83.31%)
>     12,912,171,664      stalled-cycles-backend    #   21.26% backend cycles idle      ( +-  0.03% )  (83.34%)
>     73,109,081,227      instructions              #    1.20  insn per cycle
>                                                   #    0.18  stalled cycles per insn  ( +-  0.03% )  (83.34%)
>     16,590,421,798      branches                  # 1054.349 M/sec                    ( +-  0.02% )  (83.35%)
>        616,669,135      branch-misses             #    3.72% of all branches          ( +-  0.08% )  (83.36%)
> 
>       15.788772466 seconds time elapsed                                          ( +-  0.12% )
> 
> 
> 
> --
> Markus
>
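
For reference, the "in the noise" reading of the two perf stat reports quoted
above can be checked with a few lines of arithmetic. This is only a minimal
sketch: the helper name relative_change_percent is illustrative, and taking the
larger of the two reported +- percentages as the noise bound is an assumption,
not a rigorous statistical test.

```python
# Elapsed times and +- percentages copied from the two `perf stat` reports
# quoted above (10 runs each).
without_patch_s = 15.803602619   # +- 0.14%
with_patch_s = 15.788772466      # +- 0.12%


def relative_change_percent(before, after):
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


change = relative_change_percent(without_patch_s, with_patch_s)

# Assumption: use the larger of the two reported variations as the noise bound.
noise = max(0.14, 0.12)

print(f"change: {change:+.3f}%  (reported noise: +-{noise:.2f}%)")
print("within noise" if abs(change) <= noise else "outside noise")
```

The change in elapsed time comes out at roughly a tenth of a percent, which is
smaller than the run-to-run variation perf reports for either build.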
