On Mon, Nov 11, 2024 at 8:20 PM Richard Biener <rguent...@suse.de> wrote:
>
> The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
> vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
> vectorized loops when set.  The tuning is enabled by default for Zen4
> and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
> in performance and overall code size.  In particular it speeds up
> 525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
> at the moment.

In my experience, even AVX256_TWO_EPILOGUES could help, since for some
benchmarks vectorizing the 64-bit epilogue loop does matter.
I'll benchmark it on SRF and other AVX256_OPTIMAL AVX512 machines.

>
> Re-bootstrap and regtest running on x86_64-unknown-linux-gnu
> (I've added znver4 to the defaults after benchmarking there and have
> to double-check no -mtune=znver4 testcase is affected).  Note that
> znver4|znver5 is all AMD CPUs with AVX512.
>
> I did not do any benchmarking on Intel CPUs with AVX512 but I do
> expect 525.x264_r to improve there as well.
>
> OK for trunk if testing succeeds?
>
> Thanks,
> Richard.
>
>         * config/i386/i386.cc (ix86_vector_costs::finish_cost): Set
>         m_suggested_epilogue_mode according to X86_TUNE_AVX512_TWO_EPILOGUES.
>         * config/i386/x86-tune.def (X86_TUNE_AVX512_TWO_EPILOGUES): Add.
>         Enable for znver4 and znver5.
> ---
>  gcc/config/i386/i386.cc      | 12 ++++++++++++
>  gcc/config/i386/x86-tune.def |  5 +++++
>  2 files changed, 17 insertions(+)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 6ac3a5d55f2..526c9df7618 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -25353,6 +25353,18 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
>           && TARGET_AVX256_AVOID_VEC_PERM)
>         m_costs[i] = INT_MAX;
>
> +  /* When X86_TUNE_AVX512_TWO_EPILOGUES is enabled arrange for both
> +     a AVX2 and a SSE epilogue for AVX512 vectorized loops.  */
> +  if (loop_vinfo
> +      && ix86_tune_features[X86_TUNE_AVX512_TWO_EPILOGUES])
> +    {
> +      if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> +       m_suggested_epilogue_mode = V32QImode;
> +      else if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> +              && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
> +       m_suggested_epilogue_mode = V16QImode;
> +    }

If m_suggest_unrolled_vector > 1, the epilogue VF was originally the same
as the main loop VF, since the main loop is unrolled.
I assume the epilogue VF will now always be half of the main loop VF
when the tune is enabled?

> +
>    vector_costs::finish_cost (scalar_costs);
>  }
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 6ebb2fd3414..6bbfc3c3b90 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -597,6 +597,11 @@ DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
>  DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
>           m_SAPPHIRERAPIDS | m_ZNVER4 | m_ZNVER5)
>
> +/* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues for 512-bit
> +   vectorized loops.  */
> +DEF_TUNE (X86_TUNE_AVX512_TWO_EPILOGUES, "avx512_two_epilogues",
> +         m_ZNVER5)
> +
>
>  /*****************************************************************************/
>
>  /*****************************************************************************/
>  /* Historical relics: tuning flags that helps a specific old CPU designs */
> --
> 2.43.0
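
For reference, a minimal standalone sketch of the mode selection in the
i386.cc hunk above, working on vector sizes in bytes. This is only a model
under stated assumptions, not GCC code: suggested_epilogue_bytes and
epilogue_chain are hypothetical names, a return value of 0 stands for "no
suggestion from the hook", and unrolling is not modelled at all.

```cpp
#include <vector>

// Standalone model of the selection in the hunk above; not GCC code.
// Returns the suggested epilogue vector size in bytes, or 0 when the
// hook makes no suggestion (hypothetical convention for this sketch).
static int suggested_epilogue_bytes(int vector_bytes, bool is_epilogue,
                                    bool tune_enabled)
{
  if (!tune_enabled)
    return 0;
  if (vector_bytes == 64)
    return 32;                      // corresponds to V32QImode
  if (is_epilogue && vector_bytes == 32)
    return 16;                      // corresponds to V16QImode
  return 0;
}

// Hypothetical helper: follow the suggestions from the main loop down
// and collect the vector sizes of the resulting epilogues.
static std::vector<int> epilogue_chain(int main_bytes, bool tune_enabled)
{
  std::vector<int> chain;
  int cur = suggested_epilogue_bytes(main_bytes, false, tune_enabled);
  while (cur != 0)
    {
      chain.push_back(cur);
      cur = suggested_epilogue_bytes(cur, true, tune_enabled);
    }
  return chain;
}
```

In this model a 64-byte main loop yields a 32-byte and then a 16-byte
epilogue, while a 32-byte main loop gets no suggestion, because the second
branch only applies when the loop being costed is itself an epilogue.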

--
BR,
Hongtao