https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120614

--- Comment #5 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Note that on x86-64 I get OK scores on x264. This compares no-FDO -Ofast -flto
-march=native to autoFDO.  I hacked the scripts to use ref run for training so
it is longer:

500.perlbench_r       1    158          10.1   *       1    144          11.0   *
502.gcc_r                                     NR                               NR
505.mcf_r             1    185           8.75  *       1    196           8.25  *
520.omnetpp_r         1    201           6.52  *       1    200           6.57  *
523.xalancbmk_r                               NR                               NR
525.x264_r            1     85.3        20.5   *       1     89.5        19.6   *
531.deepsjeng_r       1    163           7.03  *       1    178           6.45  *
541.leela_r           1    273           6.07  *       1    296           5.60  *
548.exchange2_r       1     86.1        30.4   *       1    186          14.1   *
557.xz_r              1    224           4.83  *       1    222           4.87  *
 Est. SPECrate2017_int_base              9.63
 Est. SPECrate2017_int_peak                                               8.56

This is with default train run
525.x264_r            1       86.9       20.1  *       1       95.9       18.3  *
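
The relative difference on this row can be checked with a few lines of
Python.  This is only a minimal sketch: the helper name relative_drop is
made up for illustration and is not part of the SPEC tooling, and the inputs
are the base (no-FDO) and peak (autoFDO) rates copied from the rows above.

```python
def relative_drop(base_rate: float, peak_rate: float) -> float:
    """Return how much lower peak_rate is than base_rate, as a fraction of base."""
    return (base_rate - peak_rate) / base_rate


# 525.x264_r with the default train run: base (no-FDO) vs. peak (autoFDO).
default_train = relative_drop(20.1, 18.3)

# 525.x264_r with the ref run used for training, from the full table.
ref_train = relative_drop(20.5, 19.6)

print(f"default train: {default_train:.1%}")
print(f"ref train:     {ref_train:.1%}")
```

The drop is measured on the SPEC rate, not on the run time, so it is slightly
smaller than the corresponding increase in seconds.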

So I get a 9% difference, as opposed to 30%.  What is your config file setup
for running perf and merging the profile?  I do:

fdo_pre0 = rm -rf ${benchmark}.data ${benchmark}.gcov; \\

fdo_run1 = perf record -e ex_ret_brn_tkn:Pu -c 10000000 -b -o ${benchmark}.data -- ${command}; \\
           create_gcov --binary=${baseexe} --profile=${benchmark}.data --gcov=current.gcov -gcov_version=2;  \\
           if test -e ${benchmark}.gcov ; then profile_merger current.gcov ${benchmark}.gcov --output_file ${benchmark}.gcov ; else mv current.gcov ${benchmark}.gcov ; fi \\

PASS1_OPTIMIZE = -g -fno-reorder-blocks-and-partition  -fno-ipa-icf -fno-lto
PASS2_OPTIMIZE = -fauto-profile=${benchmark}.gcov  

Base profile (nofdo) is
   5.51%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_satd_8x4.lto_pr
   3.75%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] get_ref.lto_priv.0
   2.71%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] mc_chroma.lto_priv.0
   1.34%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_satd_4x4.lto_pr
   1.13%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_me_search_ref
   0.79%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] sub4x4_dct.lto_priv.0
   0.58%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] refine_subpel.lto_priv.0
   0.57%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] quant_4x4.lto_priv.0
   0.35%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_x4_8x8.lto_
   0.34%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] frame_init_lowres_core.lto
   0.33%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_16x16.lto_p
   0.31%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_8x8.lto_pri
   0.29%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_x4_16x16.lt
   0.27%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] pixel_var2_8x8.lto_priv.0
   0.23%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] mc_luma.lto_priv.0
   0.22%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_macroblock_cache_load
   0.20%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_macroblock_encode
   0.20%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] add4x4_idct.lto_priv.0
   0.17%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_slicetype_mb_cost
while peak
   3.76%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] get_ref.lto_priv.0
   2.83%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_16x16.lto_
   2.51%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] mc_chroma.lto_priv.0
   2.31%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_8x8.lto_pr
   1.22%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_4x4.lto_pr
   1.09%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] hpel_filter.lto_priv.0
   1.03%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_me_search_ref
   1.02%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] sub16x16_dct.lto_priv.0
   0.77%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] sub8x8_dct.lto_priv.0
   0.74%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] refine_subpel.lto_priv.0
   0.49%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] quant_4x4.lto_priv.0
   0.45%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] pixel_avg_16x16.lto_priv.0
   0.34%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_sad_16x16.lto_p


As mentioned by Andrew, it is important to clone and also to resolve indirect
calls.  Those auto-FDO zero counts may prevent that from happening.
It is easy to see in the perf profile whether the functions are cloned.
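
One way to check this mechanically is to scan the symbol column of a perf
report for the suffixes GCC adds to cloned or split functions.  The sketch
below is only an illustration: the function name cloned_symbols is
hypothetical, the sample lines in the usage are made up, and it assumes the
symbol names are not truncated the way they are in the listings above.

```python
import re

# GCC names IPA clones foo.constprop.N and foo.isra.N; foo.part.N comes from
# function splitting.  The .lto_priv.N suffix only marks LTO privatization
# and is deliberately not matched here.
CLONE_SUFFIX = re.compile(r"\.(constprop|isra|part)\.\d+")


def cloned_symbols(report_lines):
    """Return the symbols in perf-report text rows that carry a clone suffix."""
    found = []
    for line in report_lines:
        # In each row the symbol follows the "[.]" marker.
        _, marker, symbol = line.partition("[.]")
        symbol = symbol.strip()
        if marker and CLONE_SUFFIX.search(symbol):
            found.append(symbol)
    return found


# Made-up rows in the same layout as the listings above.
sample = [
    "   3.76%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] get_ref.lto_priv.0",
    "   1.00%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] example_fn.constprop.0",
]
print(cloned_symbols(sample))
```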

My overall plan is to combine the autofdo profile with the guessed profile
where autofdo samples are missing (i.e. we have 0 at input).  There is no 100%
correct way to do so, which is why I am first trying to get benchmarking set
up and more or less working, and only then start tampering with the profile
generation.
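
To make the idea concrete, here is a minimal sketch of one possible way to
fill zero autofdo counts from a guessed profile.  It is not what GCC does:
the function name combine_profiles, the dictionary representation, and the
choice of a single per-function scale factor are all assumptions made for
illustration only.

```python
def combine_profiles(afdo_counts, guessed_freqs):
    """Fill zero AutoFDO counts with scaled guessed frequencies.

    afdo_counts maps a basic-block id to its sampled count (0 if no samples);
    guessed_freqs maps the same ids to a statically guessed relative frequency.
    """
    # Blocks with both a sample and a guess are used to bring the guessed
    # frequencies onto the same scale as the sampled counts.
    sampled = [b for b, c in afdo_counts.items()
               if c > 0 and guessed_freqs.get(b, 0) > 0]
    if not sampled:
        # Nothing to calibrate against, so keep the sampled counts unchanged.
        return {b: float(c) for b, c in afdo_counts.items()}

    scale = (sum(afdo_counts[b] for b in sampled)
             / sum(guessed_freqs[b] for b in sampled))

    combined = {}
    for block, count in afdo_counts.items():
        if count > 0:
            combined[block] = float(count)  # trust real samples
        else:
            combined[block] = guessed_freqs.get(block, 0.0) * scale
    return combined


# Made-up example: the exit block has no samples but a nonzero guess.
afdo = {"entry": 100, "loop": 900, "exit": 0}
guess = {"entry": 1.0, "loop": 9.0, "exit": 1.0}
combined = combine_profiles(afdo, guess)
print(combined)
```

A single scale factor keeps sampled counts untouched but can leave the result
inconsistent across the control-flow graph, which is one reason there is no
fully correct way to do this.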
