> The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
> vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
> vectorized loops when set. The tuning is enabled by default for Zen4
> and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
> in performance and overall code size. In particular it speeds up
> 525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
> at the moment.
>
> Re-bootstrap and regtest running on x86_64-unknown-linux-gnu
> (I've added znver4 to the defaults after benchmarking there and have
> to double-check no -mtune=znver4 testcase is affected). Note that
> znver4|znver5 is all AMD CPUs with AVX512.
>
> I did not do any benchmarking on Intel CPUs with AVX512 but I do
> expect 525.x264_r to improve there as well.
>
> OK for trunk if testing succeeds?
>
> Thanks,
> Richard.
>
> * config/i386/i386.cc (ix86_vector_costs::finish_cost): Set
> m_suggested_epilogue_mode according to X86_TUNE_AVX512_TWO_EPILOGUES.
> * config/i386/x86-tune.def (X86_TUNE_AVX512_TWO_EPILOGUES): Add.
> Enable for znver4 and znver5.

OK, I wonder - are there Intel CPUs for which we do not enable AVX256_OPTIMAL?

Honza