https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120614
--- Comment #7 from kugan at gcc dot gnu.org ---
(In reply to Jan Hubicka from comment #6)
> Also BTW, I think it is useful to do the dumps with -details-blocks since
> that also dumps BB count inconsistencies caused by AutoFDO that are
> otherwise hard to spot.
>
> In ipa-cp dump it should be visible if constant propagation of stride
> happened. This may be useful for vectorization on aarch64. On x86-64 it is
> not that important for some reason.

(In reply to Jan Hubicka from comment #5)
> Note that on x86-64 I get OK scores on x264. This compares no-FDO -Ofast
> -flto -march=native to autoFDO. I hacked the scripts to use ref run for
> training so it is longer:
>
> 500.perlbench_r      1   158    10.1  *    1   144    11.0  *
> 502.gcc_r                             NR                    NR
> 505.mcf_r            1   185    8.75  *    1   196    8.25  *
> 520.omnetpp_r        1   201    6.52  *    1   200    6.57  *
> 523.xalancbmk_r                       NR                    NR
> 525.x264_r           1   85.3   20.5  *    1   89.5   19.6  *
> 531.deepsjeng_r      1   163    7.03  *    1   178    6.45  *
> 541.leela_r          1   273    6.07  *    1   296    5.60  *
> 548.exchange2_r      1   86.1   30.4  *    1   186    14.1  *
> 557.xz_r             1   224    4.83  *    1   222    4.87  *
>  Est. SPECrate2017_int_base     9.63
>  Est. SPECrate2017_int_peak                           8.56
>
> This is with default train run
> 525.x264_r           1   86.9   20.1  *    1   95.9   18.3  *
>
> so I get 9% difference, to 30%. What is your config file setup for running
> perf and merging profile? I do:
>
> fdo_pre0 = rm -rf ${benchmark}.data ${benchmark}.gcov; \\
>
> fdo_run1 = perf record -e ex_ret_brn_tkn:Pu -c 10000000 -b -o ${benchmark}.data -- ${command}; \\
>            create_gcov --binary=${baseexe} --profile=${benchmark}.data --gcov=current.gcov -gcov_version=2; \\
>            if test -e ${benchmark}.gcov ; then profile_merger current.gcov ${benchmark}.gcov --output_file ${benchmark}.gcov ; else mv current.gcov ${benchmark}.gcov ; fi \\
>
> PASS1_OPTIMIZE = -g -fno-reorder-blocks-and-partition -fno-ipa-icf -fno-lto
> PASS2_OPTIMIZE = -fauto-profile=${benchmark}.gcov
>
> Base profile (nofdo) is
>    5.51%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_satd_8x4.lto_pr
>    3.75%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] get_ref.lto_priv.0
>    2.71%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] mc_chroma.lto_priv.0
>    1.34%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_satd_4x4.lto_pr
>    1.13%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_me_search_ref
>    0.79%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] sub4x4_dct.lto_priv.0
>    0.58%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] refine_subpel.lto_priv.0
>    0.57%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] quant_4x4.lto_priv.0
>    0.35%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_x4_8x8.lto_
>    0.34%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] frame_init_lowres_core.lto
>    0.33%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_16x16.lto_p
>    0.31%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_8x8.lto_pri
>    0.29%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_x4_16x16.lt
>    0.27%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] pixel_var2_8x8.lto_priv.0
>    0.23%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] mc_luma.lto_priv.0
>    0.22%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_macroblock_cache_load
>    0.20%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_macroblock_encode
>    0.20%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] add4x4_idct.lto_priv.0
>    0.17%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_slicetype_mb_cost
> while peak
>    3.76%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] get_ref.lto_priv.0
>    2.83%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_16x16.lto_
>    2.51%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] mc_chroma.lto_priv.0
>    2.31%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_8x8.lto_pr
>    1.22%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_4x4.lto_pr
>    1.09%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] hpel_filter.lto_priv.0
>    1.03%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_me_search_ref
>    1.02%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] sub16x16_dct.lto_priv.0
>    0.77%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] sub8x8_dct.lto_priv.0
>    0.74%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] refine_subpel.lto_priv.0
>    0.49%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] quant_4x4.lto_priv.0
>    0.45%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] pixel_avg_16x16.lto_priv.0
>    0.34%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_sad_16x16.lto_p
>
>
> as mentioned by Andrew, it is important to clone and also resolve indirect
> calls. Those auto-FDO 0 may prevent it from happening.
> It is easy to see in perf profile if the functions are cloned.
>
> My overall plan is to combine autofdo with guessed profile, when autofdo
> samples are missing (i.e. we have 0 at input). There is no 100% correct way
> to do so, that is why I am trying to first get benchmarking set up and kind
> of working only then start tampering with the profile generation.

Thanks for the information. I tried re-creating the same configuration and the
results are unfortunately the same. I will look at the dumps further.
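
For reference, the base-versus-peak gap for 525.x264_r can be recomputed from the SPEC ratios quoted from comment #5. This is only a minimal sketch: the helper name `peak_vs_base_change` is made up for illustration, and it assumes base is the no-FDO build and peak is the AutoFDO build, as described in the quoted text.

```python
def peak_vs_base_change(base_ratio: float, peak_ratio: float) -> float:
    """Return the relative change of the peak ratio against base, in percent.

    Negative values mean the AutoFDO (peak) build scores lower than the
    no-FDO (base) build. The function name is hypothetical, not part of SPEC.
    """
    return (peak_ratio - base_ratio) / base_ratio * 100.0


# 525.x264_r (base ratio, peak ratio) pairs copied from the quoted SPEC output.
x264_ratios = {
    "ref run used for training": (20.5, 19.6),
    "default train run": (20.1, 18.3),
}

for label, (base, peak) in x264_ratios.items():
    print(f"{label}: {peak_vs_base_change(base, peak):+.1f}%")
```

With the default train run this works out to roughly -9%, which matches the 9% difference mentioned in comment #5; with the ref run used for training the gap is closer to -4%.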