Similar to the X86_TUNE_AVX512_TWO_EPILOGUES tuning, which enables
an extra 128bit SSE vector epilogue when doing 512bit AVX512
vectorization in the main loop, the following allows a 64bit SSE
vector epilogue to be generated when the previous vector epilogue
still had a vectorization factor of 16 or larger (which usually
means we are operating on char data).

This effectively applies to 256bit and 512bit AVX2/AVX512 main loops;
a 128bit SSE main loop would already get a 64bit SSE vector epilogue.

Together with X86_TUNE_AVX512_TWO_EPILOGUES this means three
vector epilogues for 512bit and two vector epilogues when enabling
256bit vectorization.  I have not added another tunable for this
RFC - suggestions on how to avoid tunable inflation are welcome.
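
To make the epilogue counts above concrete, here is a small standalone
toy model of the resulting chain.  It is not GCC code: the names
toy_vf and toy_epilogue_chain are made up for illustration, the VF is
assumed to be simply vector size / element size (no unrolling, no SLP),
and a 512bit main loop without the two-epilogues tuning is assumed to
get only a single 256bit epilogue.

```cpp
#include <vector>

// Toy model only, not GCC code.  All sizes are in bytes.
// Assumption: VF == vector size / element size (no unrolling, no SLP).
static unsigned
toy_vf (unsigned vector_bytes, unsigned elem_bytes)
{
  return vector_bytes / elem_bytes;
}

// Return the vector sizes of the epilogues that follow a main loop of
// MAIN_BYTES operating on ELEM_BYTES wide elements.  TWO_EPILOGUES
// stands in for X86_TUNE_AVX512_TWO_EPILOGUES being enabled.
std::vector<unsigned>
toy_epilogue_chain (unsigned main_bytes, unsigned elem_bytes,
                    bool two_epilogues)
{
  std::vector<unsigned> chain;
  unsigned cur = main_bytes;
  bool is_epilogue = false;
  for (;;)
    {
      unsigned next = 0;
      if (cur == 64 && !is_epilogue)
        /* 512bit main loop: 256bit epilogue.  */
        next = 32;
      else if (cur == 32 && (!is_epilogue || two_epilogues))
        /* 256bit main loop, or 256bit epilogue with the tuning.  */
        next = 16;
      else if (cur == 16
               && (!is_epilogue || toy_vf (cur, elem_bytes) >= 16))
        /* 128bit main loop, or (new) 128bit epilogue with VF >= 16.  */
        next = 8;
      if (next == 0)
        break;
      chain.push_back (next);
      cur = next;
      is_epilogue = true;
    }
  return chain;
}
```

Under these assumptions toy_epilogue_chain (64, 1, true) yields three
epilogues (32, 16 and 8 bytes) and toy_epilogue_chain (32, 1, false)
yields two (16 and 8 bytes), while for short data the 128bit epilogue
only has a VF of 8 and no 64bit epilogue is added.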

This speeds up 525.x264_r to within 5% of the -mprefer-vector-size=128
speed with -mprefer-vector-size=256 or -mprefer-vector-size=512
(the latter only when -mtune-ctrl=avx512_two_epilogues is in effect).

I have not done any further benchmarking; this merely shows the
possibility, and I am looking for guidance on how to expose this to
the uarch tunings or to the user (if at all) when not gating on any
uarch specific tuning.

Note 64bit SSE isn't a native vector size, so we rely on emulation
being "complete" (if it is not, epilogue vectorization will simply
fail, so it's "safe" in this regard).  With the AVX512 ISA available an
alternative is a predicated epilogue, but due to possible
store-to-load forwarding (STLF) issues user control would be required
here.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress
(I expect some fallout in scans due to some extra epilogues, let's see).

        * config/i386/i386.cc (ix86_vector_costs::finish_cost): For a
        128bit SSE epilogue request a 64bit SSE epilogue if the 128bit
        SSE epilogue VF was 16 or higher.
---
 gcc/config/i386/i386.cc | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index c7e70c21999..f2e8de3aafc 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25495,6 +25495,13 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
               && GET_MODE_SIZE (loop_vinfo->vector_mode) == 32)
        m_suggested_epilogue_mode = V16QImode;
     }
+  /* When a 128bit SSE vectorized epilogue still has a VF of 16 or larger
+     enable a 64bit SSE epilogue.  */
+  if (loop_vinfo
+      && LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+      && GET_MODE_SIZE (loop_vinfo->vector_mode) == 16
+      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
+    m_suggested_epilogue_mode = V8QImode;
 
   vector_costs::finish_cost (scalar_costs);
 }
-- 
2.43.0
