[Bug middle-end/120614] 525.x264_r is ~30% slower with AutoFDO

kugan at gcc dot gnu.org via Gcc-bugs Wed, 11 Jun 2025 02:05:02 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120614


--- Comment #7 from kugan at gcc dot gnu.org ---
(In reply to Jan Hubicka from comment #6)
> Also BTW, I think it is useful to do the dumps with -details-blocks since
> that also dumps BB count inconsistencies caused by AutoFDO that are
> otherwise hard to spot.
> 
> In ipa-cp dump it should be visible if constant propagation of stride
> happened. This may be useful for vectorization on aarch64. On x86-64 it is
> not that important for some reason.
(In reply to Jan Hubicka from comment #5)
> Note that on x86-64 I get OK scores on x264. This compares no-FDO -Ofast
> -flto -march=native to autoFDO.  I hacked the scripts to use ref run for
> training so it is longer:
> 
> 500.perlbench_r       1    158          10.1   *       1    144          11.0   *
> 502.gcc_r                                     NR                             NR
> 505.mcf_r             1    185           8.75  *       1    196           8.25  *
> 520.omnetpp_r         1    201           6.52  *       1    200           6.57  *
> 523.xalancbmk_r                               NR                             NR
> 525.x264_r            1     85.3        20.5   *       1     89.5        19.6   *
> 531.deepsjeng_r       1    163           7.03  *       1    178           6.45  *
> 541.leela_r           1    273           6.07  *       1    296           5.60  *
> 548.exchange2_r       1     86.1        30.4   *       1    186          14.1   *
> 557.xz_r              1    224           4.83  *       1    222           4.87  *
>  Est. SPECrate2017_int_base              9.63
>  Est. SPECrate2017_int_peak                                               8.56
> 
> This is with default train run
> 525.x264_r            1       86.9       20.1  *       1       95.9       18.3  *
> 
> so I get a 9% difference, rather than 30%.  What is your config file setup for
> running perf and merging the profile?  I do:
> 
> fdo_pre0 = rm -rf ${benchmark}.data ${benchmark}.gcov; \\
> 
> fdo_run1 = perf record -e ex_ret_brn_tkn:Pu -c 10000000 -b -o ${benchmark}.data -- ${command}; \\
>            create_gcov --binary=${baseexe} --profile=${benchmark}.data --gcov=current.gcov -gcov_version=2;  \\
>            if test -e ${benchmark}.gcov ; then profile_merger current.gcov ${benchmark}.gcov --output_file ${benchmark}.gcov ; else mv current.gcov ${benchmark}.gcov ; fi \\
> 
> PASS1_OPTIMIZE = -g -fno-reorder-blocks-and-partition  -fno-ipa-icf -fno-lto
> PASS2_OPTIMIZE = -fauto-profile=${benchmark}.gcov  
> 
> Base profile (nofdo) is
>    5.51%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_satd_8x4.lto_pr
>    3.75%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] get_ref.lto_priv.0
>    2.71%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] mc_chroma.lto_priv.0
>    1.34%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_satd_4x4.lto_pr
>    1.13%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_me_search_ref
>    0.79%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] sub4x4_dct.lto_priv.0
>    0.58%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] refine_subpel.lto_priv.0
>    0.57%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] quant_4x4.lto_priv.0
>    0.35%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_x4_8x8.lto_
>    0.34%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] frame_init_lowres_core.lto
>    0.33%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_16x16.lto_p
>    0.31%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_8x8.lto_pri
>    0.29%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_pixel_sad_x4_16x16.lt
>    0.27%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] pixel_var2_8x8.lto_priv.0
>    0.23%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] mc_luma.lto_priv.0
>    0.22%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_macroblock_cache_load
>    0.20%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_macroblock_encode
>    0.20%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] add4x4_idct.lto_priv.0
>    0.17%  x264_r_base.aut  x264_r_base.autofdo-m64  [.] x264_slicetype_mb_cost
> while the peak profile is
>    3.76%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] get_ref.lto_priv.0
>    2.83%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_16x16.lto_
>    2.51%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] mc_chroma.lto_priv.0
>    2.31%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_8x8.lto_pr
>    1.22%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_satd_4x4.lto_pr
>    1.09%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] hpel_filter.lto_priv.0
>    1.03%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_me_search_ref
>    1.02%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] sub16x16_dct.lto_priv.0
>    0.77%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] sub8x8_dct.lto_priv.0
>    0.74%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] refine_subpel.lto_priv.0
>    0.49%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] quant_4x4.lto_priv.0
>    0.45%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] pixel_avg_16x16.lto_priv.0
>    0.34%  x264_r_peak.aut  x264_r_peak.autofdo-m64  [.] x264_pixel_sad_16x16.lto_p
> 
> 
> As mentioned by Andrew, it is important to clone and also resolve indirect
> calls. Those AutoFDO zero counts may prevent that from happening.
> It is easy to see in the perf profile whether the functions are cloned.
> 
> My overall plan is to combine AutoFDO with the guessed profile when AutoFDO
> samples are missing (i.e. we have 0 at input).  There is no 100% correct way
> to do so, which is why I am trying to first get benchmarking set up and kind
> of working, and only then start tampering with the profile generation.

Thanks for the information. I tried re-creating the same configuration and the
results are unfortunately the same. I will look at the dumps further.
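
For reference, the size of the gap in the quoted x86-64 numbers can be
recomputed with a small sketch like the one below. It only uses the two
525.x264_r rows quoted above (base = no-FDO, peak = AutoFDO); the dictionary
keys and helper names are made up for illustration. It suggests the quoted
"9%" corresponds to the drop in rate for the default train run, while the
same run measured by wall time comes out slightly higher.

```python
# Run times (seconds) and rates copied from the quoted 525.x264_r rows.
# "base" is the no-FDO build, "peak" is the AutoFDO build.
runs = {
    "ref run used for training": {
        "base_s": 85.3, "peak_s": 89.5, "base_rate": 20.5, "peak_rate": 19.6,
    },
    "default train run": {
        "base_s": 86.9, "peak_s": 95.9, "base_rate": 20.1, "peak_rate": 18.3,
    },
}


def slowdown(base_s, peak_s):
    """Fractional increase in run time of peak relative to base."""
    return peak_s / base_s - 1.0


def rate_drop(base_rate, peak_rate):
    """Fractional decrease in SPEC rate of peak relative to base."""
    return 1.0 - peak_rate / base_rate


for name, r in runs.items():
    print(
        f"{name}: "
        f"time +{slowdown(r['base_s'], r['peak_s']) * 100:.1f}%, "
        f"rate -{rate_drop(r['base_rate'], r['peak_rate']) * 100:.1f}%"
    )
```

The two measures differ because a rate is inversely proportional to run time,
so it helps to state which one a given percentage refers to when comparing
against the ~30% reported in this bug.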
