On 05/25/2017 01:22 PM, Markus Trippelsdorf wrote:
> On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
>> Hi.
>>
>> As I spoke about the PGO with Honza and Richi, current 3-stage is not ideal
>> for following 2 reasons:
>>
>> 1) stageprofile compiler is trained just on libraries that are built during
>> stage2
>> 2) apart from that, as the compiler is also used to build the final
>> compiler, profile is being updated during the build. So the stage2 compiler
>> is making different decisions.
>>
>> Both problems can be resolved by adding another step in between current
>> stage2 and stage3 where we train stage2 compiler by building compiler with
>> default options.
>>
>> I'm going to do some measurements.
>
> I did some measurements on gcc67 (trunk with --enable-checking=release).
> The apparent speedup is in the noise.

Hello.

Thanks for the measurements. I can see a difference for GCC 7.1:

  g++-7 tramp3d-v4.ii -O2 && time for i in `seq 1 10` ; do g++-7 tramp3d-v4.ii -O2 ; done

before: 2m25.133s
after: real 2m25.133s

which is 99.09124426480228%. It's probably within the noise level.

And apparently the file size of the binary is bigger:

before (using bloaty):

     VM SIZE                          FILE SIZE
 --------------                    --------------
  59.0%  15.1Mi .text               15.1Mi  62.3%
  21.3%  5.45Mi .rodata             5.45Mi  22.5%
   6.6%  1.69Mi .eh_frame           1.69Mi   6.9%
   5.4%  1.38Mi .bss                     0   0.0%
   3.3%   874Ki .dynstr              874Ki   3.5%
   1.8%   480Ki .dynsym              480Ki   1.9%
   1.1%   285Ki .eh_frame_hdr        285Ki   1.1%
   0.6%   158Ki .gnu.hash            158Ki   0.6%
   0.5%   144Ki .hash                144Ki   0.6%
   0.2%  44.4Ki .data               44.4Ki   0.2%
   0.2%  40.0Ki .gnu.version        40.0Ki   0.2%
   0.0%  11.1Ki .rela.plt           11.1Ki   0.0%
   0.0%  7.44Ki .plt                7.44Ki   0.0%
   0.0%  4.56Ki .data.rel.ro        4.56Ki   0.0%
   0.0%  3.73Ki .got.plt            3.73Ki   0.0%
   0.0%      38 [Unmapped]          2.75Ki   0.0%
   0.0%     624 [ELF Headers]       2.55Ki   0.0%
   0.0%     848 [Other]             1.13Ki   0.0%
   0.0%     917 .gcc_except_table      917   0.0%
   0.0%     608 .dynamic               608   0.0%
   0.0%      16 [None]                   0   0.0%
 100.0%  25.7Mi TOTAL               24.3Mi 100.0%

after:

     VM SIZE                          FILE SIZE
 --------------                    --------------
  58.3%  14.6Mi .text               14.6Mi  54.2%
  21.6%  5.41Mi .rodata             5.41Mi  20.1%
   0.0%       0 .strtab             2.13Mi   7.9%
   6.7%  1.67Mi .eh_frame           1.67Mi   6.2%
   5.5%  1.38Mi .bss                     0   0.0%
   0.0%       0 .symtab             1.11Mi   4.1%
   3.4%   876Ki .dynstr              876Ki   3.2%
   1.9%   480Ki .dynsym              480Ki   1.7%
   1.1%   280Ki .eh_frame_hdr        280Ki   1.0%
   0.6%   158Ki .gnu.hash            158Ki   0.6%
   0.6%   144Ki .hash                144Ki   0.5%
   0.2%  44.4Ki .data               44.4Ki   0.2%
   0.2%  40.1Ki .gnu.version        40.1Ki   0.1%
   0.0%  11.1Ki .rela.plt           11.1Ki   0.0%
   0.0%  7.44Ki .plt                7.44Ki   0.0%
   0.0%  4.56Ki .data.rel.ro        4.56Ki   0.0%
   0.0%  3.73Ki .got.plt            3.73Ki   0.0%
   0.0%      58 [Unmapped]          3.11Ki   0.0%
   0.0%     624 [ELF Headers]       2.61Ki   0.0%
   0.0%  2.32Ki [Other]             2.60Ki   0.0%
   0.0%      16 [None]                   0   0.0%
 100.0%  25.1Mi TOTAL               26.9Mi 100.0%

As I discussed with Honza, we still have a problem in GCC: with the current
working sets, get_hot_bb_threshold () is very close to the number of runs,
which is effectively 1 for a single run.
That's a mistake and it should be fixed.

Martin

>
> Without your patch:
>
>  Performance counter stats for 'g++ -w -Ofast tramp3d-v4.cpp' (10 runs):
>
>       15749.058451      task-clock (msec)         #    0.997 CPUs utilized            ( +-  0.13% )
>              1,352      context-switches          #    0.086 K/sec                    ( +-  0.16% )
>                  7      cpu-migrations            #    0.000 K/sec                    ( +-  5.73% )
>            269,142      page-faults               #    0.017 M/sec                    ( +-  0.01% )
>     60,676,581,181      cycles                    #    3.853 GHz                      ( +-  0.09% ) (83.35%)
>     13,401,784,189      stalled-cycles-frontend   #   22.09% frontend cycles idle     ( +-  0.20% ) (83.33%)
>     12,926,843,370      stalled-cycles-backend    #   21.30% backend cycles idle      ( +-  0.04% ) (83.31%)
>     73,074,099,356      instructions              #    1.20  insn per cycle
>                                                   #    0.18  stalled cycles per insn  ( +-  0.02% ) (83.34%)
>     16,607,220,814      branches                  # 1054.490 M/sec                    ( +-  0.03% ) (83.36%)
>        616,673,310      branch-misses             #    3.71% of all branches          ( +-  0.08% ) (83.36%)
>
>       15.803602619 seconds time elapsed                                          ( +-  0.14% )
>
> With your patch:
>
>  Performance counter stats for 'g++ -w -Ofast tramp3d-v4.cpp' (10 runs):
>
>       15735.220610      task-clock (msec)         #    0.997 CPUs utilized            ( +-  0.11% )
>              1,354      context-switches          #    0.086 K/sec                    ( +-  0.22% )
>                  6      cpu-migrations            #    0.000 K/sec                    ( +-  6.67% )
>            269,164      page-faults               #    0.017 M/sec                    ( +-  0.01% )
>     60,723,862,242      cycles                    #    3.859 GHz                      ( +-  0.08% ) (83.35%)
>     13,382,554,421      stalled-cycles-frontend   #   22.04% frontend cycles idle     ( +-  0.14% ) (83.31%)
>     12,912,171,664      stalled-cycles-backend    #   21.26% backend cycles idle      ( +-  0.03% ) (83.34%)
>     73,109,081,227      instructions              #    1.20  insn per cycle
>                                                   #    0.18  stalled cycles per insn  ( +-  0.03% ) (83.34%)
>     16,590,421,798      branches                  # 1054.349 M/sec                    ( +-  0.02% ) (83.35%)
>        616,669,135      branch-misses             #    3.72% of all branches          ( +-  0.08% ) (83.36%)
>
>       15.788772466 seconds time elapsed                                          ( +-  0.12% )
>
> --
> Markus
>
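
As a rough illustration of why the perf numbers above count as noise, here is a
minimal Python sketch that compares the two elapsed times against the +-
variation perf reports. The helper names and the "smaller than the larger +-
figure" rule are assumptions made for this example only; they are not part of
perf or of the original measurements.

```python
# Elapsed time (seconds) and reported run-to-run variation (percent),
# copied from the two perf stat summaries quoted above (10 runs each).
without_patch = (15.803602619, 0.14)
with_patch = (15.788772466, 0.12)


def relative_change_percent(before, after):
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


def within_noise(before, after):
    """Hypothetical rule of thumb: the change is treated as noise when it is
    smaller than the larger of the two reported +- percentages."""
    change = abs(relative_change_percent(before[0], after[0]))
    return change < max(before[1], after[1])


change = relative_change_percent(without_patch[0], with_patch[0])
print(f"elapsed time change: {change:+.3f}%")
print("within noise:", within_noise(without_patch, with_patch))
```

With these inputs the change is roughly a tenth of a percent, below both
reported +- figures, which matches the "in the noise" reading.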