Re: Enabling vectorization at -O2 for x86 generic, core and zen tuning

Richard Biener Mon, 07 Jan 2019 00:29:22 -0800

On Sun, 6 Jan 2019, Jan Hubicka wrote:

> Hello,
> while running benchmarks for inliner tuning I also ran benchmarks
> comparing -O2 and -O2 -ftree-vectorize -ftree-slp-vectorize using Martin
> Liska's LNT setup (https://lnt.opensuse.org/).  The results are
> summarized below, but you can also see the colorful tables produced
> by Martin's LNT magic:
> 
> https://lnt.opensuse.org/db_default/v4/SPEC/latest_runs_report?num_runs=3&min_percentage_change=0.02&revisions=746f%2C55f&fbclid=IwAR1EhvEnavV5Fg5g404cTrguOXG2cW7b3mRZZvtYn1qy93zihyAanZ7AiWQ
> https://lnt.opensuse.org/db_default/v4/CPP/latest_runs_report?num_runs=10&min_percentage_change=0.02&revisions=746f%2C55f
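As background for the flag combination above: per the GCC manual, -ftree-vectorize enables both -ftree-loop-vectorize (vectorizing loops) and -ftree-slp-vectorize (packing independent, isomorphic straight-line statements into vector operations, "SLP"). The following is a minimal illustrative sketch of the kind of code each vectorizer targets; the function names are made up for illustration, and whether GCC actually vectorizes them depends on the target, -march and the cost model (compiling with -fopt-info-vec reports what was vectorized).

```c
/* Loop vectorizer candidate (-ftree-loop-vectorize): iterations are
   independent and restrict rules out aliasing, so several elements can
   be processed per SIMD instruction. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* SLP (basic-block) candidate (-ftree-slp-vectorize): four isomorphic
   scalar statements on adjacent memory, with no loop involved, that
   can be packed into one or two vector additions. */
void add4(double *restrict out, const double *restrict a,
          const double *restrict b)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}
```

Both functions compute the same results with or without vectorization; the flags only change the generated instructions.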
> 
> Overall we got the following SPECrate improvements:
> 
>  SPECfp2k6   kabylake generic  +7.15%
>  SPECfp2k6   kabylake native   +9.36%
>  SPECfp2k17  kabylake generic  +5.36%
>  SPECfp2k17  kabylake native   +6.03%
>  SPECint2k17 kabylake generic  +4.13%
> 
>  SPECfp2k6   zen      generic  +9.98%
>  SPECfp2k6   zen      native   +7.04%
>  SPECfp2k17  zen      generic  +6.11%
>  SPECfp2k17  zen      native   +5.46%
>  SPECint2k17 zen      generic  +3.61%
>  SPECint2k17 zen      native   +5.18%
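SPEC's overall score is the geometric mean of the per-benchmark ratios, so summary figures like those above combine the per-benchmark changes listed further down multiplicatively rather than additively. A minimal sketch of that aggregation, assuming the per-benchmark percentages are run-time changes where negative means faster (which is how the "improvements" sections read); the function names are hypothetical, and the sketch cannot reproduce the figures above because not every benchmark appears in the lists.

```c
#include <stddef.h>

/* Speedup (old run time / new run time) for a run-time change in
   percent; negative changes mean the benchmark got faster. */
double speedup_from_change(double pct)
{
    return 1.0 / (1.0 + pct / 100.0);
}

/* n-th root of p > 0 via Newton's method on x^n - p, starting from 1,
   so that the sketch does not need libm. */
static double nth_root(double p, size_t n)
{
    double x = 1.0;
    for (int iter = 0; iter < 100; iter++) {
        double x_pow = 1.0; /* x^(n-1) */
        for (size_t k = 1; k < n; k++)
            x_pow *= x;
        x = ((double)(n - 1) * x + p / x_pow) / (double)n;
    }
    return x;
}

/* Overall speedup: geometric mean of the per-benchmark speedups
   (n must be at least 1). */
double geomean_speedup(const double *change_pct, size_t n)
{
    double product = 1.0;
    for (size_t i = 0; i < n; i++)
        product *= speedup_from_change(change_pct[i]);
    return nth_root(product, n);
}
```

For example, one benchmark whose run time halves (-50%) and one whose run time doubles (+100%) cancel out to an overall speedup of exactly 1.0, even though the plain average of the two percentages is +25%.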
> 
> The performance results are surprisingly strongly in favor of
> vectorization.  Martin's setup also checks code size, which goes up
> by as much as 26% on leslie3d, but since many of the benchmarks are small,
> this is not very representative of the overall code size/compile time cost
> of vectorization.
> 
> I measured compile time/size on larger programs I have available: there
> are notable changes on DealII, but otherwise the increases are below 1%.
> I also benchmarked Firefox, but there are no significant differences
> because its build system already uses -O3 where it matters (graphics
> libraries etc.).


Well, as much as the compile-time/size of SPEC is not representative,
the performance improvements are.

>                       Compile time    Code segment size
> Firefox (mainline)    in noise        0.8%
> gcc from spec2k6      0.5%            0.6%
> gdb                   0.8%            0.3%
> crafty                0%              0%
> DealII                3.2%            4%
> 
> Note that I benchmarked -ftree-slp-vectorize separately before and the
> results were hit and miss, so perhaps enabling only -ftree-vectorize would
> give a better compile time tradeoff. I was worried about partial memory
> stalls, but I will benchmark it and also benchmark the difference between
> the cost models.
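The cost models are selected with -fvect-cost-model=unlimited|dynamic|cheap. Roughly, per the GCC manual: "unlimited" assumes the vector code is always profitable, "dynamic" guards the vector code with a run-time check on the iteration count, and "cheap" behaves like "dynamic" but gives up on loops that would need run-time checks for data dependence (aliasing) or alignment. The sketch below models only that decision; the struct, enum and functions are hypothetical and far simpler than GCC's real per-statement cost computation (which, for example, also rejects loops whose vector version never beats the scalar one).

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified description of a candidate loop. */
struct loop_info {
    bool needs_alias_check;     /* run-time data-dependence (alias) check */
    bool needs_alignment_check; /* run-time alignment check */
    long min_profitable_iters;  /* estimated break-even iteration count */
};

enum cost_model { COST_UNLIMITED, COST_DYNAMIC, COST_CHEAP };

/* Compile time: should a vector version of the loop be emitted? */
bool emit_vector_version(const struct loop_info *l, enum cost_model m)
{
    if (m == COST_CHEAP && (l->needs_alias_check || l->needs_alignment_check))
        return false; /* "cheap" avoids loops needing run-time checks */
    return true;      /* "unlimited" and "dynamic" always emit one here */
}

/* Run time: take the vector path for an iteration count of n? */
bool take_vector_path(const struct loop_info *l, enum cost_model m, long n)
{
    if (m == COST_UNLIMITED)
        return true; /* assumed profitable, no iteration-count guard */
    return n >= l->min_profitable_iters; /* dynamic/cheap guard */
}
```

Tuning down the cheap model, as proposed later in the mail, would correspond to adding more rejection conditions to the compile-time predicate.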
>
> There are some performance regressions, most notably in SPEC:
>  - exchange (all settings),
>  - gamess (all settings),
>  - calculix (Zen native only),
>  - bwaves (Zen native),
> and, from Polyhedron, induct2 (all settings) and tfft2 (Zen only). Botan
> seems very noisy, but it is rather special code.
> 
> Exchange can be fixed by adding a heuristic that it is a bad idea to
> vectorize within a loop nest of 10 containing a recursive call. I believe
> gamess and calculix are understood, and I can look into the remaining
> cases.
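The heuristic proposed for exchange can be written as a simple predicate. A minimal sketch, assuming "loop nest of 10" means a nesting depth of at least 10; the struct, constant and function are hypothetical and not part of GCC, and the sketch only illustrates the condition described above, not how it would be wired into the vectorizer.

```c
#include <stdbool.h>

/* Hypothetical summary of the loop nest a candidate loop sits in. */
struct loop_nest {
    int depth;               /* number of enclosing loops, incl. itself */
    bool has_recursive_call; /* nest contains a call to its own function */
};

/* Depth threshold taken from the mail ("loop nest of 10"). */
#define DEEP_NEST_THRESHOLD 10

/* Refuse vectorization inside deep loop nests that recurse. */
bool vectorization_allowed(const struct loop_nest *nest)
{
    return !(nest->depth >= DEEP_NEST_THRESHOLD && nest->has_recursive_call);
}
```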
> 
> Overall I am surprised how much vectorization at -O2 can improve;
> clearly, more parallel CPUs depend on it.  In my experience
> from analyzing regressions of gcc -O2 compared to clang -O2 builds,
> vectorization is one of the most common reasons. Having gcc -O2 produce
> lower SPEC scores and binaries as large as clang -O2 does not
> feel OK, and I think the problem is not limited to artificial
> benchmarks.
> 
> Even though it is late in the release cycle, I wonder if we can do that
> for GCC 9?  Since the performance of vectorization is very architecture
> specific, I would propose enabling vectorization for Zen, Core-based chips
> and generic in x86-64. I can also run benchmarks on Bulldozer. I can then
> tune down the cheap model to avoid some of the more expensive
> transformations.

I'd rather not do this now, it's _way_ too late (also considering
you are again doing inliner tuning so late).

See our last attempts at this btw.

Richard.
 
> Honza
> 
> 
> Kabylake SPEC2k6 generic tuning
>
>   improvements:
>     SPEC2006/FP/481.wrf           -31.33%
>     SPEC2006/FP/436.cactusADM     -28.17%
>     SPEC2006/FP/437.leslie3d      -17.21%
>     SPEC2006/FP/434.zeusmp        -12.90%
>     SPEC2006/FP/454.calculix       -6.44%
>     SPEC2006/FP/433.milc           -6.03%
>     SPEC2006/FP/459.GemsFDTD       -4.65%
>     SPEC2006/FP/450.soplex         -2.11%
>     SPEC2006/INT/403.gcc           -6.54%
>     SPEC2006/INT/456.hmmer         -5.45%
>     SPEC2006/INT/464.h264ref       -2.23%
>   regressions:
>     SPEC2006/FP/416.gamess          8.51%
>     SPEC2006/FP/447.dealII          2.73%
> 
> Kabylake SPEC2k6 -march=native
>
>   improvements:
>     SPEC2006/FP/436.cactusADM     -45.52%
>     SPEC2006/FP/481.wrf           -34.13%
>     SPEC2006/FP/434.zeusmp        -20.25%
>     SPEC2006/FP/437.leslie3d      -19.44%
>     SPEC2006/FP/459.GemsFDTD       -6.85%
>     SPEC2006/FP/433.milc           -2.15%
>     SPEC2006/INT/456.hmmer         -8.97%
>     SPEC2006/INT/403.gcc           -7.07%
>     SPEC2006/INT/464.h264ref       -3.00%
>   regressions:
>     SPEC2006/FP/416.gamess          7.97%
>     SPEC2006/INT/483.xalancbmk      3.55%
>     SPEC2006/INT/400.perlbench      2.61%
> 
> Kabylake SPEC2k17 generic tuning
>
>   improvements:
>     SPEC2017/INT/525.x264_r       -33.24%
>     SPEC2017/FP/521.wrf_r         -30.63%
>     SPEC2017/FP/538.imagick_r      -9.16%
>     SPEC2017/FP/554.roms_r         -6.29%
>     SPEC2017/INT/523.xalancbmk     -5.69%
>     SPEC2017/FP/527.cam4_r         -5.19%
>     SPEC2017/INT/557.xz_r          -4.58%
>     SPEC2017/FP/510.parest_r       -4.28%
>     SPEC2017/FP/549.fotonik3d      -2.62%
>   regressions:
>     SPEC2017/INT/548.exchange2     12.54%
> 
> Kabylake SPEC2k17 -march=native
>
>   improvements:
>     SPEC2017/FP/521.wrf_r         -37.25%
>     SPEC2017/INT/525.x264_r       -30.31%
>     SPEC2017/FP/554.roms_r        -10.43%
>     SPEC2017/FP/527.cam4_r        -10.05%
>     SPEC2017/FP/549.fotonik3d      -7.82%
>     SPEC2017/FP/510.parest_r       -4.48%
>   regressions:
>     SPEC2017/INT/548.exchange2     14.51%
>     SPEC2017/INT/557.xz_r           3.17%
>     SPEC2017/FP/519.lbm_r           2.22%
> 
> Zen SPEC2k6 generic tuning
>
>   improvements:
>     SPEC2006/FP/436.cactusADM     -39.94%
>     SPEC2006/FP/481.wrf           -33.44%
>     SPEC2006/FP/437.leslie3d      -16.35%
>     SPEC2006/FP/434.zeusmp        -15.83%
>     SPEC2006/FP/433.milc          -13.53%
>     SPEC2006/FP/454.calculix       -9.18%
>     SPEC2006/INT/456.hmmer         -8.22%
>     SPEC2006/FP/459.GemsFDTD       -7.53%
>     SPEC2006/FP/447.dealII         -6.12%
>     SPEC2006/INT/403.gcc           -3.67%
>     SPEC2006/INT/464.h264ref       -2.92%
>     SPEC2006/INT/401.bzip2         -2.07%
>   regressions:
>     SPEC2006/FP/416.gamess          8.06%
>     SPEC2006/INT/400.perlbench      6.52%
>     SPEC2006/INT/483.xalancbmk      3.84%
> 
> Zen SPEC2k6 -march=native
>
>   improvements:
>     SPEC2006/FP/481.wrf           -31.55%
>     SPEC2006/FP/436.cactusADM     -29.20%
>     SPEC2006/FP/437.leslie3d      -16.91%
>     SPEC2006/FP/433.milc          -14.39%
>     SPEC2006/FP/434.zeusmp        -10.18%
>     SPEC2006/INT/456.hmmer         -8.95%
>     SPEC2006/FP/459.GemsFDTD       -7.23%
>     SPEC2006/FP/447.dealII         -3.31%
>     SPEC2006/INT/464.h264ref       -3.29%
>     SPEC2006/FP/470.lbm            -2.83%
>     SPEC2006/INT/403.gcc           -2.56%
>   regressions:
>     SPEC2006/FP/416.gamess          8.45%
>     SPEC2006/FP/454.calculix       10.07%
> 
> Zen SPEC2k17 generic tuning
>
>   improvements:
>     SPEC2017/INT/525.x264_r       -34.06%
>     SPEC2017/FP/521.wrf_r         -29.71%
>     SPEC2017/FP/538.imagick_r      -7.01%
>     SPEC2017/FP/549.fotonik3d      -6.00%
>     SPEC2017/FP/527.cam4_r         -5.95%
>     SPEC2017/FP/510.parest_r       -5.93%
>     SPEC2017/FP/554.roms_r         -5.42%
>     SPEC2017/FP/503.bwaves_r       -4.46%
>     SPEC2017/FP/511.povray_r       -3.76%
>     SPEC2017/INT/523.xalancbmk     -3.10%
>     SPEC2017/FP/507.cactuBSSN      -2.22%
>   regressions:
>     SPEC2017/INT/548.exchange2      8.41%
>     SPEC2017/INT/505.mcf_r          2.05%
> 
> Zen SPEC2k17 -march=native
>
>   improvements:
>     SPEC2017/INT/525.x264_r       -37.00%
>     SPEC2017/FP/521.wrf_r         -28.70%
>     SPEC2017/FP/538.imagick_r     -17.91%
>     SPEC2017/FP/510.parest_r       -7.25%
>     SPEC2017/FP/527.cam4_r         -5.52%
>     SPEC2017/FP/554.roms_r         -5.10%
>     SPEC2017/INT/523.xalancbmk     -3.82%
>     SPEC2017/FP/549.fotonik3d      -2.52%
>     SPEC2017/FP/507.cactuBSSN      -2.16%
>     SPEC2017/INT/502.gcc           -2.12%
>   regressions:
>     SPEC2017/INT/548.exchange2      9.80%
>     SPEC2017/FP/503.bwaves_r        7.81%
>     SPEC2017/INT/531.deepsjeng      2.16%
> 
> 
> Kabylake Polyhedron generic tuning
>
>   improvements:
>     tfft2           -23.05%
>     test_fpu2       -18.89%
>     gas_dyn2        -13.55%
>     linpk            -7.77%
>     rnflow           -2.52%
>     nf               -2.24%
>   regressions:
>     air               3.76%
>     induct2         216.41%
> 
> Zen Polyhedron generic tuning
>
>   improvements:
>     gas_dyn2        -36.10%
>     test_fpu2       -20.97%
>     linpk            -6.29%
>     channel2         -5.04%
>     fatigue2         -3.43%
>     nf               -3.07%
>     capacita         -2.30%
>   regressions:
>     induct2         231.04%
>     tfft2            34.25%
>     protein           4.81%
> 
> Kabylake C++ benchmarks generic tuning
>
>   improvements:
>     nbench/NEURAL NET              34.01%
>     botan/CMAC(AES-128) mac        21.62%
>     botan/AES-128/CBC/PKCS7 enc    21.25%
>     botan/AES-128/CBC/PKCS7 dec    18.43%
>     nbench/LU DECOMPOSITION        13.42%
>     botan/AES-128/EAX encrypt      10.93%
>     botan/AES-128/EAX decrypt      10.50%
>     botan/AES-128/OCB encrypt       9.84%
>     botan/AES-128/OCB decrypt       9.29%
>     nbench/ASSIGNMENT               6.15%
>     botan/AES-128/XTS decrypt       3.74%
>     botan/AES-128/XTS encrypt       3.64%
>     botan/CTR-BE(AES-128) encr      2.61%
>     botan/CTR-BE(AES-128) decr      2.56%
>     botan/AES-128/GCM(16) encr      2.52%
>     botan/AES-128/GCM(16) decr      2.01%
>   regressions:
>     botan/Whirlpool hash          -11.35%
>     nbench/HUFFMAN                 -2.31%
>     botan/Keccak-1600(512) hash    -3.61%
>     botan/Tiger(24,3) hash         -2.94%
> 
> Zen C++ benchmarks generic tuning
>
>   improvements:
>     nbench/NEURAL NET              47.78%
>     botan/AES-128/CBC/PKCS7 encr   21.07%
>     botan/CMAC(AES-128) mac        19.97%
>     botan/CTR-BE(AES-128) encr     15.21%
>     botan/CTR-BE(AES-128) decr     14.24%
>     botan/AES-128/EAX encrypt      13.46%
>     botan/AES-128/EAX decrypt      12.84%
>     nbench/LU DECOMPOSITION         9.12%
>     botan/AES-128/GCM(16) encr      5.66%
>     botan/AES-128/GCM(16) decr      4.40%
>     botan/AES-128/CBC/PKCS7 decr    2.96%
>     botan/ChaCha20Poly1305 decr     2.67%
>     botan/AES-128/XTS encrypt       2.53%
>     botan/Salsa20 encrypt           2.33%
>     botan/Skein-512(512) hash       2.22%
>     botan/ChaCha20Poly1305 encr     2.14%
>   regressions:
>     nbench/HUFFMAN                -12.51%
>     botan/Whirlpool hash           -8.26%
>     botan/Camellia-192 encrypt     -7.12%
>     botan/Camellia-256 decrypt     -7.07%
>     botan/Camellia-192 decrypt     -6.82%
>     botan/Camellia-128 decrypt     -6.73%
>     botan/Camellia-256 encrypt     -6.59%
>     botan/AES-128/XTS decrypt      -6.31%
>     botan/Camellia-128 encrypt     -6.30%
>     botan/XTEA decrypt             -4.87%
>     nbench/ASSIGNMENT              -4.85%
>     botan/AES-128/OCB encrypt      -3.36%
>     botan/Keccak-1600(512) hash    -3.08%
>     botan/AES-128 decrypt          -2.52%
>     botan/SHA-160 hash             -2.31%
> 
> Binary sizes and other stats are in the aforementioned links.
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)
