[Bug ipa/114531] Feature proposal for an `-finline-functions-aggressive` compiler option

hubicka at ucw dot cz via Gcc-bugs Tue, 25 Jun 2024 09:49:30 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114531


--- Comment #14 from Jan Hubicka <hubicka at ucw dot cz> ---
A bit of history on this.  I introduced the split -O2 and -O3
limits in order to be able to enable -finline-small-functions at -O2,
which we found to be really important for C++ codebases that no longer
care much about explicit use of the inline keyword.

To do that it was necessary to find settings that do not grow -O2
binaries significantly (or that even shrink them) and that yield measurably
better performance.  Without LTO, on SPECCPU the differences were quite small.
With LTO they were more noticeable, and on firefox/clang and similar
codebases built with LTO they were significant (often double-digit).

Pushing up the -O2 limits can make sense, but it needs to be done carefully:
in the longer term, IMO, we do not want -O2 binaries to grow faster
than their performance improves.  Sadly this figure is not that great.

https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch
loads slowly but has some data.

SPEC2k17 with -O2 -flto on 2nd generation zen performs as follows:
         gcc-7   gcc-8   gcc-9   gcc-10  gcc-11  gcc-12  gcc-13  gcc-14  gcc-trunk
SPECint  2.55%   2.90%   ~       4.55%   4.47%   11.29%  12.60%  14.13%  13.42%
SPECfp   ~       ~       ~       ~       ~       4.15%   4.98%   5.30%   5.18%

Those are scores (bigger is better) relative to gcc-6, in percent. ~ is noise.

The large improvement in gcc-12 is due to enabling the vectorizer at -O2;
for SPECint it comes primarily from x264.

Text section size:
         gcc-7   gcc-8   gcc-9   gcc-10  gcc-11  gcc-12  gcc-13  gcc-14  gcc-trunk
int      ~       ~       ~       9.77%   9.57%   8.72%   8.26%   10.68%  10.59%
fp       ~       2.40%   ~       18.30%  18.24%  18.92%  18.66%  22.23%  22.27%
Those are sizes (smaller is better).  So we do get considerable bloat.
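
The comparison of the two summary tables above can be sketched as a small
script.  This is only an illustration: the numbers are copied from the tables,
noise ("~") score entries are assumed to count as 0% gain, and comparing the
two percentages directly is a simplification of "growing faster than
performance", not a metric used by GCC or by the LNT tracker.  The names
SCORE_GAIN, TEXT_GROWTH and size_outgrows_speed are made up for this sketch.

```python
# Percent change vs. gcc-6 for SPEC2k17 at -O2 -flto, copied from the two
# summary tables in this comment. "~" (noise) entries are left out.
SCORE_GAIN = {
    "SPECint": {"gcc-10": 4.55, "gcc-12": 11.29, "gcc-14": 14.13},
    "SPECfp": {"gcc-12": 4.15, "gcc-14": 5.30},
}
TEXT_GROWTH = {
    "SPECint": {"gcc-10": 9.77, "gcc-12": 8.72, "gcc-14": 10.68},
    "SPECfp": {"gcc-10": 18.30, "gcc-12": 18.92, "gcc-14": 22.23},
}


def size_outgrows_speed(suite, release):
    """Return True if text size grew by a larger percentage than the score.

    Assumption: a noise ("~") score entry is counted as 0% gain.
    """
    gain = SCORE_GAIN[suite].get(release, 0.0)
    return TEXT_GROWTH[suite][release] > gain


if __name__ == "__main__":
    for suite, sizes in TEXT_GROWTH.items():
        for release in sizes:
            if size_outgrows_speed(suite, release):
                verdict = "size grew more than score"
            else:
                verdict = "score grew at least as much as size"
            print(f"{suite:8} {release:7} {verdict}")
```

Under these assumptions SPECint stays ahead of its size growth from gcc-12 on,
while SPECfp text size outgrows the score in every listed release.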

In GCC 10 the Fortran ABI changed, and an important part of the 18% FP bloat
is caused by it.  Here are the individual changes:


runtime (only benchmarks with off-noise changes):

Test Name       gcc-7   gcc-8   gcc-9   gcc-10  gcc-11  gcc-12  gcc-13  gcc-14  gcc-trunk
FP/538.imagick  25.01%  25.64%  27.57%  21.51%  21.75%  19.46%  19.88%  23.20%  22.91%
INT/525.x264_r  7.25%   6.20%   6.58%   7.48%   ~       -37.7%  -40.4%  -41.6%  -39.90%
INT/548.exchan  -17.9%  -17.8%  -14.9%  -14.1%  -5.88%  -13.9%  -21.6%  -25.0%  -26.48%
INT/531.deepsj  -2.46%  ~       ~       -15.0%  -16.1%  -17.9%  -18.8%  -19.3%  -19.62%
FP/503.bwaves_  -6.30%  ~       -2.71%  16.95%  16.71%  16.65%  16.94%  16.94%  16.70%
FP/527.cam4_r   -2.99%  -2.33%  -10.7%  -11.3%  -10.9%  -11.8%  -11.9%  -12.5%  -11.37%
FP/521.wrf_r    ~       -2.40%  -5.99%  -6.10%  -5.66%  -9.45%  -9.28%  -9.82%  -9.95%
FP/554.roms_r   ~       5.79%   2.51%   ~       5.24%   7.95%   9.35%   9.11%   9.68%
INT/520.omnetp  -3.26%  -3.45%  ~       -3.82%  -6.71%  -7.37%  -6.57%  -6.83%  -5.62%
FP/549.fotonik  ~       ~       -5.60%  -8.26%  -8.61%  -3.80%  -4.82%  -3.26%  -5.48%
INT/541.leela_  -2.47%  -2.19%  ~       -4.57%  -6.32%  -4.76%  -5.69%  -6.72%  -5.88%
INT/500.perlbe  ~       -2.11%  -2.34%  -6.03%  -4.51%  ~       ~       -5.01%  -4.52%
INT/523.xalanc  -2.42%  -3.18%  -2.26%  -3.75%  -2.31%  -5.95%  -2.02%  -3.52%  ~
FP/511.povray_  ~       ~       5.21%   -6.54%  ~       ~       ~       ~       ~
INT/505.mcf_r   ~       ~       ~       ~       ~       -2.82%  -3.32%  -3.71%  -4.14%
FP/510.parest_  ~       ~       ~       ~       -3.31%  ~       -2.28%  -3.03%  -3.39%
FP/519.lbm_r    3.33%   ~       ~       -4.72%  ~       ~       ~       ~       ~
FP/544.nab_r    ~       ~       ~       ~       ~       ~       -2.43%  ~       -3.15%
FP/508.namd_r   ~       ~       ~       ~       4.20%   ~       ~       -2.35%  -2.02%
Those are times (smaller is better)

- The ImageMagick regression since GCC 7 is a store-to-load forwarding
  issue: we vectorize a load in one function of a value that is stored
  piecewise in another.
- The x264 improvement in GCC 12 is vectorization at -O2 (which may be
  argued to help primarily code that should be built with -Ofast/-O3
  anyway).
- The exchange improvement in GCC 7 is special handling of self-recursive
  functions with nested loops (quite specific to the benchmark).
- I forgot what caused the changes in deepsjeng in GCC 10 and cam4 in GCC 9.

size

                GCC 6 size  gcc-7   gcc-8   gcc-9   gcc-10  gcc-11  gcc-12  gcc-13  gcc-14  gcc-trunk
FP/521.wrf_rg   11.85 MB    ~       5.78%   4.43%   33.11%  33.11%  34.41%  34.41%  38.42%  38.41%
INT/557.xz_rg   75.53 KB    ~       ~       ~       30.10%  29.47%  29.18%  30.30%  33.28%  33.57%
FP/totalg       28.08 MB    ~       2.40%   ~       18.30%  18.24%  18.92%  18.66%  22.23%  22.27%
INT/523.xalanc  1.98 MB     ~       ~       15.05%  14.85%  14.54%  13.62%  13.80%  17.31%  17.07%
FP/526.blender  6.21 MB     ~       ~       -2.50%  15.93%  15.97%  15.70%  14.08%  18.47%  18.40%
INT/541.leela   74.37 KB    ~       ~       13.36%  -8.84%  -8.54%  -15.7%  -15.3%  -9.58%  -10.34%
INT/500.perlb   1.50 MB     ~       ~       ~       9.20%   9.08%   10.08%  9.69%   12.44%  12.38%
INT/502.gcc_r   6.16 MB     ~       ~       -2.18%  10.40%  10.59%  8.50%   8.07%   10.14%  10.10%
FP/549.fotoni   325.23 KB   ~       ~       ~       4.39%   4.33%   8.35%   9.28%   11.76%  10.82%
FP/519.lbm_rg   10.53 KB    -3.90%  -5.54%  -4.72%  -3.83%  -3.60%  -6.58%  -6.43%  -5.28%  -5.28%
FP/538.imagic   1.03 MB     ~       2.32%   ~       7.36%   7.47%   6.49%   6.22%   5.17%   4.74%
FP/544.nab_rg   83.99 KB    ~       -2.37%  -3.63%  -5.02%  -5.43%  -5.33%  -7.50%  -5.35%  -5.49%
INT/531.deeps   60.41 KB    ~       ~       ~       2.54%   2.81%   7.06%   6.65%   9.91%   10.01%
FP/511.povray   771.09 KB   ~       ~       ~       7.83%   9.25%   6.44%   2.73%   5.84%   5.68%
FP/507.cactuB   2.54 MB     6.59%   ~       -3.85%  ~       ~       2.81%   5.32%   7.56%   9.22%
FP/527.cam4_r   2.60 MB     ~       2.25%   ~       4.21%   3.92%   3.96%   5.02%   6.37%   6.11%
INT/548.excha   65.35 KB    -7.62%  2.14%   ~       -3.24%  -3.58%  ~       ~       5.92%   6.14%
FP/510.parest   1.29 MB     -2.11%  ~       9.79%   ~       -2.17%  -3.89%  -4.43%  3.44%   3.22%
INT/520.omnet   1.07 MB     ~       ~       -3.96%  4.31%   ~       4.71%   2.48%   3.83%   3.52%
FP/508.namd_r   829.33 KB   ~       ~       13.51%  ~       ~       ~       ~       4.11%   3.25%
INT/505.mcf_r   12.59 KB    ~       -3.25%  -5.23%  -2.24%  -4.12%  -2.88%  -2.26%  2.71%   ~
INT/525.x264_   404.39 KB   ~       ~       ~       -5.23%  -5.15%  -3.84%  -3.88%  ~       ~
FP/554.roms_r   563.96 KB   -2.55%  ~       ~       ~       ~       ~       -4.60%  -4.17%  -4.59%
FP/503.bwaves   30.62 KB    2.74%   ~       -2.24%  -2.52%  -2.43%  ~       ~       ~       ~

So the GCC binary (502.gcc_r), for example, got 10% bigger.
