On Tue, 12 Nov 2024, Hongtao Liu wrote:

> On Mon, Nov 11, 2024 at 8:20 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
> > vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
> > vectorized loops when set.  The tuning is enabled by default for Zen4
> > and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
> > in performance and overall code size.  In particular it speeds up
> > 525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
> > at the moment.
> In my experience, even AVX256_TWO_EPILOGUES could help since for some
> benchmark 64-bit epilogue loop vectorization does matter.
> I'll benchmark it for SRF and other AVX256_OPTIMAL AVX512 machine.

Sure - since the target can decide individually we can apply more complex
heuristics here as well - I just wanted to start simple, specifically to
address the 525.x264_r case when doing 512-bit vectorization.  I observed
code size improvements for the case where the scalar epilogue would have
been fully peeled by the later cunroll pass, for example - anticipating
that would help.  There might be some heuristics that are better
implemented in the vectorizer itself, of course.

As for the usefulness of extra vector epilogues, this probably depends
on the VF (but the costing of the epilogue vectorization itself might
already deal with unprofitable cases).

Richard.

> >
> > Re-bootstrap and regtest running on x86_64-unknown-linux-gnu
> > (I've added znver4 to the defaults after benchmarking there and have
> > to double-check no -mtune=znver4 testcase is affected).  Note that
> > znver4|znver5 is all AMD CPUs with AVX512.
> >
> > I did not do any benchmarking on Intel CPUs with AVX512 but I do
> > expect 525.x264_r to improve there as well.
> >
> > OK for trunk if testing succeeds?
> >
> > Thanks,
> > Richard.
> >
> >         * config/i386/i386.cc (ix86_vector_costs::finish_cost): Set
> >         m_suggested_epilogue_mode according to
> >         X86_TUNE_AVX512_TWO_EPILOGUES.
> >         * config/i386/x86-tune.def (X86_TUNE_AVX512_TWO_EPILOGUES): Add.
> >         Enable for znver4 and znver5.
> > ---
> >  gcc/config/i386/i386.cc      | 12 ++++++++++++
> >  gcc/config/i386/x86-tune.def |  5 +++++
> >  2 files changed, 17 insertions(+)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 6ac3a5d55f2..526c9df7618 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -25353,6 +25353,18 @@ ix86_vector_costs::finish_cost (const vector_costs
> > *scalar_costs)
> >             && TARGET_AVX256_AVOID_VEC_PERM)
> >      m_costs[i] = INT_MAX;
> >
> > +  /* When X86_TUNE_AVX512_TWO_EPILOGUES is enabled arrange for both
> > +     a AVX2 and a SSE epilogue for AVX512 vectorized loops.  */
> > +  if (loop_vinfo
> > +      && ix86_tune_features[X86_TUNE_AVX512_TWO_EPILOGUES])
> > +    {
> > +      if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
> > +        m_suggested_epilogue_mode = V32QImode;
> > +      else if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > +               && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
> > +        m_suggested_epilogue_mode = V16QImode;
> > +    }
> If m_suggest_unrolled_vector > 1, originally epilogue VF will be the
> same as the main loop since it's unrolled.
> I assume now epilogue VF will always be half of the main loop VF if
> the tune is enabled?
> > +
> >    vector_costs::finish_cost (scalar_costs);
> >  }
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 6ebb2fd3414..6bbfc3c3b90 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -597,6 +597,11 @@ DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES,
> > "avx512_move_by_pieces",
> >  DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> >           m_SAPPHIRERAPIDS | m_ZNVER4 | m_ZNVER5)
> >
> > +/* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues for 512-bit
> > +   vectorized loops.  */
> > +DEF_TUNE (X86_TUNE_AVX512_TWO_EPILOGUES, "avx512_two_epilogues",
> > +         m_ZNVER5)
> > +
> >
> > /*****************************************************************************/
> >
> > /*****************************************************************************/
> >  /* Historical relics: tuning flags that helps a specific old CPU designs
> > */
> > --
> > 2.43.0
> >
> >

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
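
For reference, the mode suggestion made by the quoted i386.cc hunk can be modelled outside of GCC.  The following is a minimal standalone C++17 sketch, not GCC code: LoopInfo, suggested_epilogue_bytes and epilogue_chain are hypothetical names, vector modes are reduced to their size in bytes, and it assumes the vectorizer simply follows each suggestion, ignoring costing and any fallback mode selection the vectorizer does on its own.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-in for the two loop_vinfo properties the hunk reads.
struct LoopInfo {
  unsigned vector_bytes;  // models GET_MODE_SIZE (loop_vinfo->vector_mode)
  bool is_epilogue;       // models LOOP_VINFO_EPILOGUE_P (loop_vinfo)
};

// Mirrors the branch structure of the quoted hunk.  Returns the suggested
// epilogue vector size in bytes, or nullopt when no suggestion is made.
std::optional<unsigned> suggested_epilogue_bytes(const LoopInfo &loop,
                                                 bool tune_two_epilogues) {
  if (!tune_two_epilogues)
    return std::nullopt;
  if (loop.vector_bytes == 64)
    return 32u;  // V32QImode: AVX2 epilogue for a 512-bit loop
  if (loop.is_epilogue && loop.vector_bytes == 32)
    return 16u;  // V16QImode: SSE epilogue for an AVX2 epilogue
  return std::nullopt;
}

// Assumption: every suggestion is accepted, so the chain is obtained by
// feeding each suggested epilogue back in as the next loop to analyze.
std::vector<unsigned> epilogue_chain(unsigned main_loop_bytes,
                                     bool tune_two_epilogues) {
  std::vector<unsigned> chain;
  LoopInfo loop{main_loop_bytes, false};
  while (auto next = suggested_epilogue_bytes(loop, tune_two_epilogues)) {
    chain.push_back(*next);
    loop = LoopInfo{*next, true};
  }
  return chain;
}
```

Under these assumptions, epilogue_chain (64, true) yields 32 then 16, i.e. an AVX2 and an SSE epilogue, while epilogue_chain (32, true) is empty because the second branch only applies to loops that are themselves epilogues.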