The series adds 2 tunes for SRF/CWF according to Intel SOE Crestmont microarchitecture.

1) Generate vpandn + vpand + vpor instead of the vblendvps/vblendvpd/vpblendvb
instructions, since the 4-operand VEX instructions come from MSROM on
Crestmont and are slower than the 3-instruction sequence.

2) Don't do 256-bit auto-vectorization when a cross-lane permutation is
needed; use 128-bit vectorization instead.

Instead of setting the tune avx128_optimal for SRF, the patch adds a new tune
avx256_avoid_vec_perm for it. By default, the vectorizer still uses a 256-bit
VF if the cost is profitable, but lowers to 128-bit whenever a 256-bit
vec_perm is needed for auto-vectorization. Without vec_perm, performance of
256-bit vectorization should be similar to 128-bit (some benchmark results
show it is even better than 128-bit vectorization, since it enables more
parallelism for convert cases).

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
The patch generally improves SPEC2017 allrate geomean by 1% with
-march=sierraforest -Ofast on SRF.

Ready to push to trunk.

liuhongt (2):
  [x86] Add new microarchitecture tune for SRF/GRR/CWF.
  [x86] Add a new tune avx256_avoid_vec_perm for SRF.

 gcc/config/i386/i386-expand.cc              | 24 +++++++++----------
 gcc/config/i386/i386.cc                     | 14 ++++++++++-
 gcc/config/i386/i386.h                      |  4 ++++
 gcc/config/i386/x86-tune.def                | 15 +++++++++++-
 .../gcc.target/i386/avx256_avoid_vec_perm.c | 22 +++++++++++++++++
 .../gcc.target/i386/sse_movcc_use_blendv.c  | 12 ++++++++++
 6 files changed, 77 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
 create mode 100644 gcc/testsuite/gcc.target/i386/sse_movcc_use_blendv.c

-- 
2.31.1
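
For illustration only, here is a minimal C sketch of the two code shapes the tunes are concerned with. It is not taken from the new testcases. The names select_bits, blend_gt and reverse_copy are hypothetical. The first loop is a conditional select written in and/andnot/or form, which matches a blendv only when each mask element is all-ones or all-zeros (as a compare result is). The second is a reversed copy, the kind of loop that would be expected to need a cross-lane permutation if vectorized with 256-bit vectors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pick a where mask is all-ones and b where mask is
   zero, using the and / andnot / or form instead of a blend.  */
static inline uint32_t
select_bits (uint32_t mask, uint32_t a, uint32_t b)
{
  return (mask & a) | (~mask & b);
}

/* Hypothetical example loop: out[i] = x[i] > y[i] ? a[i] : b[i].
   The mask is built from a comparison, so every element is either
   all-ones or all-zeros.  */
void
blend_gt (const int32_t *x, const int32_t *y,
          const uint32_t *a, const uint32_t *b,
          uint32_t *out, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      uint32_t mask = (x[i] > y[i]) ? UINT32_MAX : 0u;
      out[i] = select_bits (mask, a[i], b[i]);
    }
}

/* Hypothetical example loop: copy src into dst in reverse order.
   Reversing the elements of a whole 256-bit vector moves data across
   its two 128-bit lanes.  */
void
reverse_copy (const int32_t *src, int32_t *dst, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[n - 1 - i];
}
```

Whether the compiler actually emits the 3-instruction sequence for the first loop, or picks 128-bit vectors for the second, depends on the tuning flags and the cost model. Compiling with -O3 -march=sierraforest and inspecting the assembly is one way to check.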