On Mon, 11 Nov 2024, Richard Biener wrote:

> The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
> vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
> vectorized loops when set. The tuning is enabled by default for Zen4
> and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
> in performance and overall code size. In particular it speeds up
> 525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
> at the moment.
>
> Re-bootstrap and regtest running on x86_64-unknown-linux-gnu
> (I've added znver4 to the defaults after benchmarking there and have
> to double-check no -mtune=znver4 testcase is affected). Note that
> znver4|znver5 is all AMD CPUs with AVX512.

I of course failed to git add that change, so re-tested after fixing
and now pushed. See below for the pushed patch.

Richard.

From 0460a6f669c3ce3646df2b767c33259b4f5fa8fd Mon Sep 17 00:00:00 2001
From: Richard Biener <rguent...@suse.de>
Date: Fri, 8 Nov 2024 11:17:22 +0100
Subject: [PATCH] Add X86_TUNE_AVX512_TWO_EPILOGUES, enable for Zen4 and Zen5
To: gcc-patches@gcc.gnu.org

The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
vectorized loops when set. The tuning is enabled by default for Zen4
and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
in performance and overall code size. In particular it speeds up
525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
at the moment.

	* config/i386/i386.cc (ix86_vector_costs::finish_cost): Set
	m_suggested_epilogue_mode according to X86_TUNE_AVX512_TWO_EPILOGUES.
	* config/i386/x86-tune.def (X86_TUNE_AVX512_TWO_EPILOGUES): Add.
	Enable for znver4 and znver5.
---
 gcc/config/i386/i386.cc      | 12 ++++++++++++
 gcc/config/i386/x86-tune.def |  5 +++++
 2 files changed, 17 insertions(+)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 6ac3a5d55f2..526c9df7618 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25353,6 +25353,18 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
 	&& TARGET_AVX256_AVOID_VEC_PERM)
       m_costs[i] = INT_MAX;
 
+  /* When X86_TUNE_AVX512_TWO_EPILOGUES is enabled arrange for both
+     a AVX2 and a SSE epilogue for AVX512 vectorized loops.  */
+  if (loop_vinfo
+      && ix86_tune_features[X86_TUNE_AVX512_TWO_EPILOGUES])
+    {
+      if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
+	m_suggested_epilogue_mode = V32QImode;
+      else if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+	       && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
+	m_suggested_epilogue_mode = V16QImode;
+    }
+
   vector_costs::finish_cost (scalar_costs);
 }
 
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 6ebb2fd3414..81dd895ac81 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -597,6 +597,11 @@ DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
 DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
 	  m_SAPPHIRERAPIDS | m_ZNVER4 | m_ZNVER5)
 
+/* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues for 512-bit
+   vectorized loops.  */
+DEF_TUNE (X86_TUNE_AVX512_TWO_EPILOGUES, "avx512_two_epilogues",
+	  m_ZNVER4 | m_ZNVER5)
+
 /*****************************************************************************/
 /*****************************************************************************/
 /* Historical relics: tuning flags that helps a specific old CPU designs     */
-- 
2.43.0