This series adds 2 tunes for SRF/CWF, according to the Intel SOE for the
Crestmont microarchitecture.

1) Generate vpandn + vpand + vpor instead of the vblendvps/vblendvpd/vpblendvb
instructions, since 4-operand VEX instructions come from MSROM on Crestmont
and are slower than the 3-instruction sequence.
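
As a rough illustration of 1), the select that vblendv* performs can be
written as and/andn/or when every mask lane is all-ones or all-zeros (as
a vector comparison produces). The sketch below uses GCC vector
extensions; the helper names are hypothetical and this is not the code in
i386-expand.cc.

```c
#include <stdint.h>

/* 128-bit vector of four 32-bit integers (GCC/Clang vector extension).  */
typedef int32_t v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper: per-lane select written as and + andn + or.
   Assumes each lane of MASK is either all-ones or all-zeros; a blendv
   instruction would only look at the sign bit of each mask element.  */
static inline v4si
select_and_andn_or (v4si mask, v4si if_true, v4si if_false)
{
  return (mask & if_true) | (~mask & if_false);
}

/* Per-lane maximum.  The comparison yields -1 (all-ones) or 0 per lane,
   so the mask satisfies the assumption above.  */
v4si
lane_max (v4si a, v4si b)
{
  v4si mask = a > b;
  return select_and_andn_or (mask, a, b);
}
```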

2) Don't do 256-bit auto-vectorization when there's a cross-lane permutation;
use 128-bit vectorization instead.
Instead of setting the tune avx128_optimal for SRF, the patch adds a new
tune avx256_avoid_vec_perm for it. So by default the vectorizer still
uses a 256-bit VF if the cost is profitable, but lowers to 128-bit whenever
a 256-bit vec_perm is needed for auto-vectorization. Without vec_perm, the
performance of 256-bit vectorization should be similar to that of 128-bit
vectorization (some benchmark results show it's even better, since it
enables more parallelism for convert cases).
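
As a rough illustration of 2), a reverse copy is one kind of loop whose
vectorized form needs a permutation: reversing eight ints held in one
256-bit vector moves elements across the two 128-bit lanes, while
reversing four ints in a 128-bit vector stays within a single lane. The
function below is a hypothetical example and is not the testcase added by
the patch.

```c
/* Hypothetical example: copy SRC into DST in reverse order.
   When vectorized, the negative-stride load is typically handled by a
   contiguous vector load followed by an element-reversing permutation.  */
void
reverse_copy (int *restrict dst, const int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[n - 1 - i];
}
```

Compiling such a loop with -O3 -march=sierraforest and inspecting the
assembly is one way to check which vector width was chosen.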


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Overall, the series improves the SPEC2017 allrate geomean by 1% with
-march=sierraforest -Ofast on SRF.

Ready to push to trunk.

liuhongt (2):
  [x86] Add new microarchitecture tune for SRF/GRR/CWF.
  [x86] Add a new tune avx256_avoid_vec_perm for SRF.

 gcc/config/i386/i386-expand.cc                | 24 +++++++++----------
 gcc/config/i386/i386.cc                       | 14 ++++++++++-
 gcc/config/i386/i386.h                        |  4 ++++
 gcc/config/i386/x86-tune.def                  | 15 +++++++++++-
 .../gcc.target/i386/avx256_avoid_vec_perm.c   | 22 +++++++++++++++++
 .../gcc.target/i386/sse_movcc_use_blendv.c    | 12 ++++++++++
 6 files changed, 77 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse_movcc_use_blendv.c

-- 
2.31.1
