> The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
> vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
> vectorized loops when set.  The tuning is enabled by default for Zen4
> and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
> in performance and overall code size.  In particular it speeds up
> 525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
> at the moment.
> 
> Re-bootstrap and regtest running on x86_64-unknown-linux-gnu
> (I've added znver4 to the defaults after benchmarking there and have
> to double-check no -mtune=znver4 testcase is affected).  Note that
> znver4|znver5 is all AMD CPUs with AVX512.
> 
> I did not do any benchmarking on Intel CPUs with AVX512 but I do
> expect 525.x264_r to improve there as well.
> 
> OK for trunk if testing succeeds?
> 
> Thanks,
> Richard.
> 
>       * config/i386/i386.cc (ix86_vector_costs::finish_cost): Set
>       m_suggested_epilogue_mode according to X86_TUNE_AVX512_TWO_EPILOGUES.
>       * config/i386/x86-tune.def (X86_TUNE_AVX512_TWO_EPILOGUES): Add.
>       Enable for znver4 and znver5.

OK,
I wonder - are there Intel CPUs for which we do not enable
AVX256_OPTIMAL?

Honza
