On Mon, 11 Nov 2024, Richard Biener wrote:

> The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
> vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
> vectorized loops when set.  The tuning is enabled by default for Zen4
> and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
> in performance and overall code size.  In particular it speeds up
> 525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
> at the moment.
> 
> Re-bootstrap and regtest running on x86_64-unknown-linux-gnu
> (I've added znver4 to the defaults after benchmarking there and have
> to double-check no -mtune=znver4 testcase is affected).  Note that
> znver4|znver5 is all AMD CPUs with AVX512.

I of course failed to git add that change, so I re-tested after fixing
that and have now pushed.  See below for the pushed patch.

Richard.

From 0460a6f669c3ce3646df2b767c33259b4f5fa8fd Mon Sep 17 00:00:00 2001
From: Richard Biener <rguent...@suse.de>
Date: Fri, 8 Nov 2024 11:17:22 +0100
Subject: [PATCH] Add X86_TUNE_AVX512_TWO_EPILOGUES, enable for Zen4 and Zen5
To: gcc-patches@gcc.gnu.org

The following adds X86_TUNE_AVX512_TWO_EPILOGUES tuning and directs the
vectorizer to produce both a vector AVX2 and SSE epilogue for AVX512
vectorized loops when set.  The tuning is enabled by default for Zen4
and Zen5 where I benchmarked it to be overall positive on SPEC CPU 2017 both
in performance and overall code size.  In particular it speeds up
525.x264_r which with only an AVX2 epilogue ends up in unvectorized code
at the moment.

        * config/i386/i386.cc (ix86_vector_costs::finish_cost): Set
        m_suggested_epilogue_mode according to X86_TUNE_AVX512_TWO_EPILOGUES.
        * config/i386/x86-tune.def (X86_TUNE_AVX512_TWO_EPILOGUES): Add.
        Enable for znver4 and znver5.
---
 gcc/config/i386/i386.cc      | 12 ++++++++++++
 gcc/config/i386/x86-tune.def |  5 +++++
 2 files changed, 17 insertions(+)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 6ac3a5d55f2..526c9df7618 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25353,6 +25353,18 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
        && TARGET_AVX256_AVOID_VEC_PERM)
       m_costs[i] = INT_MAX;
 
+  /* When X86_TUNE_AVX512_TWO_EPILOGUES is enabled arrange for both
+     a AVX2 and a SSE epilogue for AVX512 vectorized loops.  */
+  if (loop_vinfo
+      && ix86_tune_features[X86_TUNE_AVX512_TWO_EPILOGUES])
+    {
+      if (GET_MODE_SIZE (loop_vinfo->vector_mode) == 64)
+       m_suggested_epilogue_mode = V32QImode;
+      else if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+              && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
+       m_suggested_epilogue_mode = V16QImode;
+    }
+
   vector_costs::finish_cost (scalar_costs);
 }
 
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 6ebb2fd3414..81dd895ac81 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -597,6 +597,11 @@ DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, "avx512_move_by_pieces",
 DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
          m_SAPPHIRERAPIDS | m_ZNVER4 | m_ZNVER5)
 
+/* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues for 512-bit
+   vectorized loops.  */
+DEF_TUNE (X86_TUNE_AVX512_TWO_EPILOGUES, "avx512_two_epilogues",
+         m_ZNVER4 | m_ZNVER5)
+
 /*****************************************************************************/
 /*****************************************************************************/
 /* Historical relics: tuning flags that helps a specific old CPU designs     */
-- 
2.43.0
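
As a reading aid, the mode selection in the finish_cost hunk can be
modelled outside of GCC.  The sketch below is not GCC code: LoopInfo,
suggested_epilogue_bytes and scalar_tail are hypothetical names, vector
modes are reduced to their size in bytes, elements are assumed to be one
byte wide, and each vector epilogue is assumed to run at most once.
Without the tuning the model assumes a single AVX2 epilogue, as the
commit message describes.

```cpp
#include <cstddef>

// Hypothetical stand-in for the two loop_vinfo properties the hunk reads.
struct LoopInfo {
  std::size_t vector_bytes;  // models GET_MODE_SIZE (loop_vinfo->vector_mode)
  bool is_epilogue;          // models LOOP_VINFO_EPILOGUE_P (loop_vinfo)
};

// Suggested epilogue vector size in bytes; 0 means "no suggestion".
// Mirrors the two branches of the patch: 64 -> 32 (V32QImode), and
// 32 -> 16 (V16QImode) only when the loop is itself an epilogue.
constexpr std::size_t suggested_epilogue_bytes(LoopInfo li, bool two_epilogues) {
  if (!two_epilogues)
    return 0;
  if (li.vector_bytes == 64)
    return 32;
  if (li.is_epilogue && li.vector_bytes == 32)
    return 16;
  return 0;
}

// Simplified model of how many byte elements fall through to scalar code
// for a loop of n elements (assumption: each epilogue runs at most once).
constexpr std::size_t scalar_tail(std::size_t n, bool two_epilogues) {
  std::size_t rem = n % 64;  // what the 64-byte main loop leaves over
  if (rem >= 32)
    rem -= 32;               // one pass of the 32-byte (AVX2) epilogue
  if (two_epilogues && rem >= 16)
    rem -= 16;               // one pass of the 16-byte (SSE) epilogue
  return rem;
}

// An illustrative 16-element loop: too short for the 64-byte and 32-byte
// loops, so it stays scalar unless the 16-byte epilogue exists.
static_assert(scalar_tail(16, false) == 16, "scalar without the SSE epilogue");
static_assert(scalar_tail(16, true) == 0, "fully vectorized with it");
```

The 16-element trip count is only an illustration of the kind of loop
the commit message alludes to for 525.x264_r; it is not taken from the
benchmark.  To compare code generation with and without the tuning,
GCC's developer option -mtune-ctrl= takes feature names from
x86-tune.def, so -mtune-ctrl=avx512_two_epilogues should set it and
-mtune-ctrl=^avx512_two_epilogues should clear it.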
