Enabling vectorization at -O2 for x86 generic, core and zen tuning

Jan Hubicka Sun, 06 Jan 2019 07:42:22 -0800

Hello,
while running benchmarks for inliner tuning I also ran benchmarks
comparing -O2 and -O2 -ftree-vectorize -ftree-slp-vectorize using Martin
Liska's LNT setup (https://lnt.opensuse.org/).  The results are
summarized below, but you can also see the colorful tables produced
by Martin's LNT magic:


https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?num_runs=3&min_percentage_change=0.02&revisions=746f%2C55f&fbclid=IwAR1EhvEnavV5Fg5g404cTrguOXG2cW7b3mRZZvtYn1qy93zihyAanZ7AiWQ
https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?num_runs=10&min_percentage_change=0.02&revisions=746f%2C55f

Overall we got the following SPECrate improvements:

 SPECfp2k6   kabylake generic  +7.15%
 SPECfp2k6   kabylake native   +9.36%
 SPECfp2k17  kabylake generic  +5.36%
 SPECfp2k17  kabylake native   +6.03%
 SPECint2k17 kabylake generic  +4.13%

 SPECfp2k6   zen      generic  +9.98%
 SPECfp2k6   zen      native   +7.04%
 SPECfp2k17  zen      generic  +6.11%
 SPECfp2k17  zen      native   +5.46%
 SPECint2k17 zen      generic  +3.61%
 SPECint2k17 zen      native   +5.18%

The performance results are surprisingly strongly in favor of
vectorization.  Martin's setup also checks code size, which goes up
by as much as 26% on leslie3d, but since many of the benchmarks are small,
this is not very representative of the overall code size/compile time costs
of vectorization.

I measured compile time/size on larger programs I have available, with
notable changes on DealII but otherwise sub-1% increases.  I also
benchmarked Firefox, but there are no significant differences because
the build system already uses -O3 in places where it matters (graphics
library etc.).

                   Compile time    Code segment size
Firefox mainline      in noise     0.8%
gcc from spec2k6        0.5%       0.6%
gdb                     0.8%       0.3%
crafty                  0%         0%
DealII                  3.2%       4%

Note that I benchmarked -ftree-slp-vectorize separately before and the
results were hit-and-miss, so perhaps enabling only -ftree-vectorize would
give better compile time tradeoffs. I was worried about partial memory
stalls, but I will benchmark it and also benchmark the difference between
cost models.
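
To make the distinction between the two flags concrete, here is a minimal
sketch of the kind of code each one targets.  The function names are made up
for illustration, and whether GCC actually vectorizes either function depends
on the compiler version, the target and the cost model in use.

```c
#include <assert.h>
#include <stddef.h>

/* Loop vectorization candidate (-ftree-vectorize): iterations are
   independent, and restrict promises that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* SLP vectorization candidate (-ftree-slp-vectorize): four isomorphic
   scalar statements on adjacent elements in straight-line code, no loop. */
void add4(const float *restrict a, const float *restrict b,
          float *restrict out)
{
    assert(a != NULL && b != NULL && out != NULL);
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}
```

Compiling this with gcc -O2 -ftree-vectorize -ftree-slp-vectorize
-fopt-info-vec -c should report which of the two, if any, were vectorized.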

There are some performance regressions, most notably in SPEC
 - exchange (all settings),
 - gamess (all settings),
 - calculix (Zen native only),
 - bwaves (Zen native)
and, from Polyhedron, induct2 on all settings and tfft2 on Zen only. Botan
seems very noisy, but it is rather special code.

Exchange can be fixed by adding a heuristic that it is a bad idea to
vectorize within a loop nest of depth 10 containing a recursive call. I believe
gamess and calculix are understood, and I can look into the remaining
cases.
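
As a rough illustration of the code shape such a heuristic would look for,
here is a hypothetical backtracking search written in C.  It is not the
exchange2 source (which is Fortran) and the nest is much shallower, but it
shows short loops wrapped around a recursive call, where any vector setup
cost would be paid again on every invocation.

```c
#include <assert.h>

enum { N = 4 };

/* Hypothetical example: counts the permutations of 0..N-1 by
   backtracking.  Each level runs a short loop and recurses from
   inside it. */
int count_perms(int depth, const int used[N])
{
    assert(depth >= 0 && depth <= N);
    if (depth == N)
        return 1;

    int total = 0;
    for (int v = 0; v < N; v++) {
        if (used[v])
            continue;
        int next[N];
        /* Short fixed-trip-count copy: a vectorization candidate that
           runs only N iterations each time it is entered. */
        for (int i = 0; i < N; i++)
            next[i] = used[i];
        next[v] = 1;
        total += count_perms(depth + 1, next);
    }
    return total;
}
```

Calling count_perms(0, used) with an all-zero used array visits every
permutation of the N values, so the inner copy loop is entered many times
but never runs for more than N iterations.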

Overall I am surprised how many improvements vectorization at -O2 can
bring - clearly more parallel CPUs depend on it.  In my experience
from analyzing regressions of gcc -O2 compared to clang -O2 builds,
vectorization is one of the most common reasons. Having gcc -O2 produce
lower SPEC scores and comparably large binaries to clang -O2 does not
feel OK, and I think the problem is not limited to artificial
benchmarks.

Even though it is late in the release cycle, I wonder if we can do that for
GCC 9?  Performance of vectorization is very architecture specific, so I
would propose enabling vectorization for Zen, Core-based chips and
generic on x86-64. I can also run benchmarks on Bulldozer. I can then
tune down the cheap model to avoid some of the more expensive
transformations.

Honza


Kabylake Spec2k6, generic tuning

  improvements:
    SPEC2006/FP/481.wrf                 -31.33%         
    SPEC2006/FP/436.cactusADM           -28.17%         
    SPEC2006/FP/437.leslie3d            -17.21%         
    SPEC2006/FP/434.zeusmp              -12.90%         
    SPEC2006/FP/454.calculix            -6.44%  
    SPEC2006/FP/433.milc                -6.03%  
    SPEC2006/FP/459.GemsFDTD            -4.65%  
    SPEC2006/FP/450.soplex              -2.11%  
    SPEC2006/INT/403.gcc                -6.54%  
    SPEC2006/INT/456.hmmer              -5.45%  
    SPEC2006/INT/464.h264ref            -2.23%  
  regressions:
    SPEC2006/FP/416.gamess              8.51%   
    SPEC2006/FP/447.dealII              2.73%   
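
A note on reading the SPEC and Polyhedron tables: assuming the percentages
are relative changes in run time (so a negative value means the benchmark
got faster), a change can be converted into a speedup factor as sketched
below.  The helper name is made up for illustration, and the run-time
interpretation is an assumption, not something the tables state.

```c
#include <assert.h>

/* Convert a relative run-time change (e.g. -31.33 for "-31.33%") into
   a speedup factor, using new_time = old_time * (1 + change / 100). */
double speedup_from_runtime_change(double percent_change)
{
    /* Run time cannot drop to zero or below. */
    assert(percent_change > -100.0);
    return 1.0 / (1.0 + percent_change / 100.0);
}
```

Under that assumption, the -31.33% for 481.wrf corresponds to roughly a
1.46x speedup, and the 8.51% for 416.gamess to roughly 0.92x.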

Kabylake spec2k6 -march=native

  improvements:
    SPEC2006/FP/436.cactusADM           -45.52%         
    SPEC2006/FP/481.wrf                 -34.13%         
    SPEC2006/FP/434.zeusmp              -20.25%         
    SPEC2006/FP/437.leslie3d            -19.44%         
    SPEC2006/FP/459.GemsFDTD            -6.85%  
    SPEC2006/FP/433.milc                -2.15%  
    SPEC2006/INT/456.hmmer              -8.97%  
    SPEC2006/INT/403.gcc                -7.07%  
    SPEC2006/INT/464.h264ref            -3.00%  
  regressions:
    SPEC2006/FP/416.gamess              7.97%   
    SPEC2006/INT/483.xalancbmk          3.55%   
    SPEC2006/INT/400.perlbench          2.61%   

Kabylake spec2k17 generic tuning

  improvements:
    SPEC2017/INT/525.x264_r             -33.24%         
    SPEC2017/FP/521.wrf_r               -30.63%         
    SPEC2017/FP/538.imagick_r           -9.16%  
    SPEC2017/FP/554.roms_r              -6.29%  
    SPEC2017/INT/523.xalancbmk          -5.69%  
    SPEC2017/FP/527.cam4_r              -5.19%  
    SPEC2017/INT/557.xz_r               -4.58%  
    SPEC2017/FP/510.parest_r            -4.28%  
    SPEC2017/FP/549.fotonik3d           -2.62%  
  regressions:
    SPEC2017/INT/548.exchange2          12.54%  

Kabylake spec2k17 -march=native:

  improvements:
    SPEC2017/FP/521.wrf_r               -37.25%         
    SPEC2017/INT/525.x264_r             -30.31%         
    SPEC2017/FP/554.roms_r              -10.43%         
    SPEC2017/FP/527.cam4_r              -10.05%         
    SPEC2017/FP/549.fotonik3d           -7.82%  
    SPEC2017/FP/510.parest_r            -4.48%  
  regressions:
    SPEC2017/INT/548.exchange2          14.51%  
    SPEC2017/INT/557.xz_r               3.17%   
    SPEC2017/FP/519.lbm_r               2.22%   

Zen spec2k6 generic tuning

  improvements:
    SPEC2006/FP/436.cactusADM           -39.94%         
    SPEC2006/FP/481.wrf                 -33.44%         
    SPEC2006/FP/437.leslie3d            -16.35%         
    SPEC2006/FP/434.zeusmp              -15.83%         
    SPEC2006/FP/433.milc                -13.53%         
    SPEC2006/FP/454.calculix            -9.18%  
    SPEC2006/INT/456.hmmer              -8.22%  
    SPEC2006/FP/459.GemsFDTD            -7.53%  
    SPEC2006/FP/447.dealII              -6.12%  
    SPEC2006/INT/403.gcc                -3.67%  
    SPEC2006/INT/464.h264ref            -2.92%  
    SPEC2006/INT/401.bzip2              -2.07%  
  regressions:
    SPEC2006/FP/416.gamess              8.06%   
    SPEC2006/INT/400.perlbench          6.52%   
    SPEC2006/INT/483.xalancbmk          3.84%   

Zen SPEC2k6 -march=native

  improvements:
    SPEC2006/FP/481.wrf                 -31.55%         
    SPEC2006/FP/436.cactusADM           -29.20%         
    SPEC2006/FP/437.leslie3d            -16.91%         
    SPEC2006/FP/433.milc                -14.39%         
    SPEC2006/FP/434.zeusmp              -10.18%         
    SPEC2006/INT/456.hmmer              -8.95%  
    SPEC2006/FP/459.GemsFDTD            -7.23%  
    SPEC2006/FP/447.dealII              -3.31%  
    SPEC2006/INT/464.h264ref            -3.29%  
    SPEC2006/FP/470.lbm                 -2.83%  
    SPEC2006/INT/403.gcc                -2.56%  
  regressions:
    SPEC2006/FP/416.gamess              8.45%   
    SPEC2006/FP/454.calculix            10.07%  

Zen SPEC2k17 generic tuning
  improvements:
    SPEC2017/INT/525.x264_r             -34.06%         
    SPEC2017/FP/521.wrf_r               -29.71%         
    SPEC2017/FP/538.imagick_r           -7.01%  
    SPEC2017/FP/549.fotonik3d           -6.00%  
    SPEC2017/FP/527.cam4_r              -5.95%  
    SPEC2017/FP/510.parest_r            -5.93%  
    SPEC2017/FP/554.roms_r              -5.42%  
    SPEC2017/FP/503.bwaves_r            -4.46%  
    SPEC2017/FP/511.povray_r            -3.76%  
    SPEC2017/INT/523.xalancbmk          -3.10%  
    SPEC2017/FP/507.cactuBSSN           -2.22%  
  regressions:
    SPEC2017/INT/548.exchange2          8.41%   
    SPEC2017/INT/505.mcf_r              2.05%   

Zen SPEC2k17 -march=native
  improvements:
    SPEC2017/INT/525.x264_r             -37.00%         
    SPEC2017/FP/521.wrf_r               -28.70%         
    SPEC2017/FP/538.imagick_r           -17.91%         
    SPEC2017/FP/510.parest_r            -7.25%  
    SPEC2017/FP/527.cam4_r              -5.52%  
    SPEC2017/FP/554.roms_r              -5.10%  
    SPEC2017/INT/523.xalancbmk          -3.82%  
    SPEC2017/FP/549.fotonik3d           -2.52%  
    SPEC2017/FP/507.cactuBSSN           -2.16%  
    SPEC2017/INT/502.gcc                -2.12%  
  regressions:
    SPEC2017/INT/548.exchange2          9.80%   
    SPEC2017/FP/503.bwaves_r            7.81%   
    SPEC2017/INT/531.deepsjeng          2.16%   


Kabylake Polyhedron generic

  improvements:
    tfft2       -23.05%         
    test_fpu2   -18.89%         
    gas_dyn2    -13.55%         
    linpk       -7.77%  
    rnflow      -2.52%  
    nf          -2.24%  
  regressions:
    air         3.76% 
    induct2     216.41%

Zen Polyhedron generic

  improvements:
    gas_dyn2            -36.10%         
    test_fpu2           -20.97%         
    linpk               -6.29%  
    channel2            -5.04%  
    fatigue2            -3.43%  
    nf                  -3.07%  
    capacita            -2.30%  
  regressions:
    induct2             231.04%         
    tfft2               34.25%  
    protein             4.81%   

Kabylake C++ benchmarks generic

  improvements:
    nbench/NEURAL NET                   34.01%  
    botan/CMAC(AES-128) mac             21.62%  
    botan/AES-128/CBC/PKCS7 enc         21.25%  
    botan/AES-128/CBC/PKCS7 dec         18.43%  
    nbench/LU DECOMPOSITION             13.42%  
    botan/AES-128/EAX encrypt           10.93%  
    botan/AES-128/EAX decrypt           10.50%  
    botan/AES-128/OCB encrypt           9.84%   
    botan/AES-128/OCB decrypt           9.29%   
    nbench/ASSIGNMENT                   6.15%   
    botan/AES-128/XTS decrypt           3.74%   
    botan/AES-128/XTS encrypt           3.64%   
    botan/CTR-BE(AES-128) encr          2.61%   
    botan/CTR-BE(AES-128) decr          2.56%   
    botan/AES-128/GCM(16) encr          2.52%   
    botan/AES-128/GCM(16) decr          2.01%   
  regressions:
    botan/Whirlpool hash                -11.35%         
    nbench/HUFFMAN                      -2.31%  
    botan/Keccak-1600(512) hash         -3.61%  
    botan/Tiger(24,3) hash              -2.94%  

Zen C++ benchmarks generic

  improvements:
    nbench/NEURAL NET                  47.78%   
    botan/AES-128/CBC/PKCS7 encr       21.07%   
    botan/CMAC(AES-128) mac            19.97%   
    botan/CTR-BE(AES-128) encr          15.21%  
    botan/CTR-BE(AES-128) decr          14.24%  
    botan/AES-128/EAX encrypt          13.46%   
    botan/AES-128/EAX decrypt          12.84%   
    nbench/LU DECOMPOSITION             9.12%   
    botan/AES-128/GCM(16) encr          5.66%   
    botan/AES-128/GCM(16) decr          4.40%   
    botan/AES-128/CBC/PKCS7 decr        2.96%   
    botan/ChaCha20Poly1305 decr        2.67%    
    botan/AES-128/XTS encrypt           2.53%   
    botan/Salsa20 encrypt              2.33%    
    botan/Skein-512(512) hash          2.22%    
    botan/ChaCha20Poly1305 encr        2.14%    
  regressions:
    nbench/HUFFMAN                      -12.51%         
    botan/Whirlpool hash               -8.26%   
    botan/Camellia-192 encrypt         -7.12%   
    botan/Camellia-256 decrypt         -7.07%   
    botan/Camellia-192 decrypt         -6.82%   
    botan/Camellia-128 decrypt         -6.73%   
    botan/Camellia-256 encrypt         -6.59%   
    botan/AES-128/XTS decrypt           -6.31%  
    botan/Camellia-128 encrypt         -6.30%   
    botan/XTEA decrypt                 -4.87%   
    nbench/ASSIGNMENT                  -4.85%   
    botan/AES-128/OCB encrypt          -3.36%   
    botan/Keccak-1600(512) hash        -3.08%   
    botan/AES-128 decrypt               -2.52%  
    botan/SHA-160 hash                  -2.31%  

Binary sizes and other stats are in the aforementioned links.
