This tries to apply the same trick to the smin/smax reduction patterns as was done for the reduc_plus_scal ones, namely reducing %zmm -> %ymm -> %xmm first. On a microbenchmark this improves performance on Zen by ~30% for AVX2 and on Skylake-SP by ~10% for AVX512 (for AVX2 on Skylake-SP there's no measurable difference).
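
To illustrate the halving idea, here is a minimal scalar C model of it. This is a sketch, not GCC code: reduce_smax_halving and fold_upper_half_smax are made-up names, and plain int arrays stand in for vector registers. Each fold corresponds to one "extract the high half, smax it with the low half" step, and the final loop only stands in for the in-register reduction that the patch leaves to ix86_expand_reduc on the 128-bit vector.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_LANES = 16, XMM_LANES = 4 };

/* Hypothetical helper: combine the upper half of v[0..lanes) into the
   lower half with an elementwise signed max.  This models one
   half-width step (%zmm -> %ymm or %ymm -> %xmm).  */
static void fold_upper_half_smax(int *v, size_t lanes)
{
    size_t half = lanes / 2;
    for (size_t i = 0; i < half; i++)
        if (v[half + i] > v[i])
            v[i] = v[half + i];
}

/* Hypothetical model of the recursion: halve until only an
   XMM-sized vector (4 x int) is left, then finish the reduction there.  */
static int reduce_smax_halving(const int *src, size_t lanes)
{
    /* Only the lane counts of V4SI, V8SI and V16SI are modelled.  */
    assert(lanes == 4 || lanes == 8 || lanes == 16);

    int v[MAX_LANES];
    memcpy(v, src, lanes * sizeof v[0]);

    /* 16 -> 8 -> 4 lanes, one elementwise max per step.  */
    for (size_t n = lanes; n > XMM_LANES; n /= 2)
        fold_upper_half_smax(v, n);

    /* Recursion stop: reduce the remaining 4 lanes to a scalar.  */
    int result = v[0];
    for (size_t i = 1; i < XMM_LANES; i++)
        if (v[i] > result)
            result = v[i];
    return result;
}
```

Calling reduce_smax_halving (a, 16) on a 16-element int array models the V16SI case: two half-width steps followed by the SSE-sized final reduction. The point of the ordering is that the wide operations are done only once each, and everything after that works on narrower registers.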

I guess I'm mostly looking for feedback on the approach I took in not rewriting ix86_expand_reduc but instead "recurse" on the expanders as well as the need to define recursion stops for SSE modes previously not covered.

I'll throw this on a bootstrap & regtest on x86_64-unknown-linux-gnu later.

Any comments so far?  Writing .md patterns is new for me ;)

Thanks,
Richard.

2018-10-04  Richard Biener  <rguent...@suse.de>

	* config/i386/sse.md (reduc_<code>_scal_<mode>): Split
	into part reducing to half width and recursing and
	SSE2 vector variant doing the final reduction with
	ix86_expand_reduc.

Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md (revision 264837)
+++ gcc/config/i386/sse.md (working copy)
@@ -2544,11 +2544,29 @@ (define_expand "reduc_plus_scal_v4sf"
 })
 
 ;; Modes handled by reduc_sm{in,ax}* patterns.
+(define_mode_iterator REDUC_SSE_SMINMAX_MODE
+  [(V4SF "TARGET_SSE") (V2DF "TARGET_SSE")
+   (V2DI "TARGET_SSE") (V4SI "TARGET_SSE") (V8HI "TARGET_SSE")
+   (V16QI "TARGET_SSE")])
+
+(define_expand "reduc_<code>_scal_<mode>"
+  [(smaxmin:REDUC_SSE_SMINMAX_MODE
+     (match_operand:<ssescalarmode> 0 "register_operand")
+     (match_operand:REDUC_SSE_SMINMAX_MODE 1 "register_operand"))]
+  ""
+{
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
+  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
+                                                        const0_rtx));
+  DONE;
+})
+
 (define_mode_iterator REDUC_SMINMAX_MODE
   [(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
    (V8SI "TARGET_AVX2") (V4DI "TARGET_AVX2")
    (V8SF "TARGET_AVX") (V4DF "TARGET_AVX")
-   (V4SF "TARGET_SSE") (V64QI "TARGET_AVX512BW")
+   (V64QI "TARGET_AVX512BW")
    (V32HI "TARGET_AVX512BW") (V16SI "TARGET_AVX512F")
    (V8DI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
    (V8DF "TARGET_AVX512F")])
@@ -2559,10 +2577,12 @@ (define_expand "reduc_<code>_scal_<mode>
    (match_operand:REDUC_SMINMAX_MODE 1 "register_operand"))]
   ""
 {
-  rtx tmp = gen_reg_rtx (<MODE>mode);
-  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
-  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
-                                                        const0_rtx));
+  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
+  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
+  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
+  emit_insn (gen_<code><ssehalfvecmodelower>3
+    (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
+  emit_insn (gen_reduc_<code>_scal_<ssehalfvecmodelower> (operands[0], tmp2));
   DONE;
 })