This tries to apply the same trick to the smin/smax reduction
patterns as is used for the reduc_plus_scal ones, namely reducing
%zmm -> %ymm -> %xmm first.  On a microbenchmark this improves
performance by ~30% on Zen with AVX2 and by ~10% on Skylake-SP with
AVX512 (with AVX2 there is no measurable difference on Skylake-SP).
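
For illustration, here is a hand-written intrinsics sketch (not part
of the patch; the function name is made up and the exact SSE tail is
only roughly what ix86_expand_reduc emits) of the sequence the new
expanders aim for with a V16SI smax reduction: halve the width with a
max until we are at 128 bits, then finish at SSE width:

  #include <immintrin.h>

  /* Requires -mavx512f (which implies AVX2/SSE4.1 in GCC).  */
  static int
  smax_v16si (__m512i x)
  {
    /* 512 -> 256: max of high and low halves.  */
    __m256i h256 = _mm512_extracti64x4_epi64 (x, 1);
    __m256i m256 = _mm256_max_epi32 (h256, _mm512_castsi512_si256 (x));
    /* 256 -> 128: same again.  */
    __m128i h128 = _mm256_extracti128_si256 (m256, 1);
    __m128i m = _mm_max_epi32 (h128, _mm256_castsi256_si128 (m256));
    /* SSE tail: shuffle and max until element 0 holds the result.  */
    m = _mm_max_epi32 (m, _mm_shuffle_epi32 (m, _MM_SHUFFLE (1, 0, 3, 2)));
    m = _mm_max_epi32 (m, _mm_shuffle_epi32 (m, _MM_SHUFFLE (2, 3, 0, 1)));
    return _mm_cvtsi128_si32 (m);
  }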

I guess I'm mostly looking for feedback on the approach I took:
instead of rewriting ix86_expand_reduc, I "recurse" on the expanders
themselves, which also requires defining recursion stops for the SSE
modes that were previously not covered.
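
To make the recursion structure concrete, here is a scalar C model
(mine, not the patch's RTL): each expander invocation folds the high
half into the low half and hands off to the half-width expander, and
the new SSE-only variant is the recursion stop, so e.g. V16SI goes
v16si -> v8si -> v4si -> ix86_expand_reduc:

  #include <stddef.h>

  /* Scalar model of the expander recursion: max the high half into
     the low half, halving n, until the fixed "SSE-width" stop where
     the final reduction happens (ix86_expand_reduc's job).  */
  static int
  reduc_smax (int *v, size_t n)
  {
    if (n == 4)                               /* recursion stop */
      {
        int m = v[0];
        for (size_t i = 1; i < 4; i++)
          m = v[i] > m ? v[i] : m;
        return m;
      }
    for (size_t i = 0; i < n / 2; i++)        /* hi half into lo half */
      v[i] = v[i + n / 2] > v[i] ? v[i + n / 2] : v[i];
    return reduc_smax (v, n / 2);             /* recurse at half width */
  }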

I'll throw this on a bootstrap & regtest on x86_64-unknown-linux-gnu
later.

Any comments so far?  Writing .md patterns is new to me ;)

Thanks,
Richard.

2018-10-04  Richard Biener  <rguent...@suse.de>

        * config/i386/sse.md (reduc_<code>_scal_<mode>): Split into
        a part reducing to half width and recursing, and an SSE2
        vector variant doing the final reduction with
        ix86_expand_reduc.

Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md      (revision 264837)
+++ gcc/config/i386/sse.md      (working copy)
@@ -2544,11 +2544,29 @@ (define_expand "reduc_plus_scal_v4sf"
 })
 
 ;; Modes handled by reduc_sm{in,ax}* patterns.
+(define_mode_iterator REDUC_SSE_SMINMAX_MODE
+  [(V4SF "TARGET_SSE") (V2DF "TARGET_SSE")
+   (V2DI "TARGET_SSE") (V4SI "TARGET_SSE") (V8HI "TARGET_SSE")
+   (V16QI "TARGET_SSE")])
+
+(define_expand "reduc_<code>_scal_<mode>"
+  [(smaxmin:REDUC_SSE_SMINMAX_MODE
+     (match_operand:<ssescalarmode> 0 "register_operand")
+     (match_operand:REDUC_SSE_SMINMAX_MODE 1 "register_operand"))]
+  ""
+{
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
+  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
+                                                       const0_rtx));
+  DONE;
+})
+
 (define_mode_iterator REDUC_SMINMAX_MODE
   [(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
    (V8SI "TARGET_AVX2") (V4DI "TARGET_AVX2")
    (V8SF "TARGET_AVX") (V4DF "TARGET_AVX")
-   (V4SF "TARGET_SSE") (V64QI "TARGET_AVX512BW")
+   (V64QI "TARGET_AVX512BW")
    (V32HI "TARGET_AVX512BW") (V16SI "TARGET_AVX512F")
    (V8DI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
    (V8DF "TARGET_AVX512F")])
@@ -2559,10 +2577,12 @@ (define_expand "reduc_<code>_scal_<mode>
      (match_operand:REDUC_SMINMAX_MODE 1 "register_operand"))]
   ""
 {
-  rtx tmp = gen_reg_rtx (<MODE>mode);
-  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
-  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
-                                                       const0_rtx));
+  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
+  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
+  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
+  emit_insn (gen_<code><ssehalfvecmodelower>3
+    (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
+  emit_insn (gen_reduc_<code>_scal_<ssehalfvecmodelower> (operands[0], tmp2));
   DONE;
 })
 
