On Thu, 4 Oct 2018, Richard Biener wrote:

> This tries to apply the same trick to sminmax reduction patterns
> as for the reduc_plus_scal ones, namely to reduce %zmm -> %ymm
> -> %xmm first.  On a microbenchmark this improves performance on
> Zen by ~30% for AVX2 and on Skylake-SP by ~10% for AVX512 (on
> Skylake-SP there's no measurable difference for AVX2).
>
> I guess I'm mostly looking for feedback on the approach I took in
> not rewriting ix86_expand_reduc but instead "recursing" through the
> expanders, as well as on the need to define recursion stops for
> SSE modes previously not covered.
>
> I'll throw this on a bootstrap & regtest on x86_64-unknown-linux-gnu
> later.
>
> Any comments so far?  Writing .md patterns is new for me ;)
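For reference, with the patch a V8DF smax reduction should expand to a
halving ladder roughly like the following (a sketch only: register
names are illustrative and the exact final shuffle depends on what
ix86_expand_reduc picks for V2DF):

        vextractf64x4 $1, %zmm0, %ymm1
        vmaxpd        %ymm1, %ymm0, %ymm0       # 8 -> 4 doubles
        vextractf128  $1, %ymm0, %xmm1
        vmaxpd        %xmm1, %xmm0, %xmm0       # 4 -> 2 doubles
        vunpckhpd     %xmm0, %xmm0, %xmm1       # move high half down
        vmaxpd        %xmm1, %xmm0, %xmm0       # 2 -> 1 double

Each extract/max pair depends on the previous one, so this chain is
fully serial.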
Btw, ICC does the same for AVX2, but for KNL (the only target I could
get it to use AVX512 for) it emits

        vextractf64x4 $1, %zmm4, %ymm0                          #5.15 c1
        vmaxpd        %ymm4, %ymm0, %ymm1                       #5.15 c3
        valignq       $3, %zmm1, %zmm1, %zmm16                  #5.15 c7
        valignq       $2, %zmm1, %zmm1, %zmm17                  #5.15 c7
        valignq       $1, %zmm1, %zmm1, %zmm18                  #5.15 c9
        vmaxsd        %xmm18, %xmm17, %xmm2                     #5.15 c13
        vmaxsd        %xmm1, %xmm16, %xmm3                      #5.15 c15
        vmaxsd        %xmm3, %xmm2, %xmm4                       #5.15 c15

I suppose that pipelines a bit better at the expense of using more
registers: the three valignq shuffles are independent of each other,
so the final four-to-one step is a max tree of depth two rather than
a serial shuffle/max chain (I wonder why it doesn't use the VL %ymm
encodings for the valignq instructions, though KNL doesn't implement
AVX512VL, which would explain it).  Similar tricks would be possible
for AVX2, but ICC doesn't use them, possibly because the required
shuffles have higher latency or lower throughput there.

Richard.

> Thanks,
> Richard.
>
> 2018-10-04  Richard Biener  <rguent...@suse.de>
>
> 	* config/i386/sse.md (reduc_<code>_scal_<mode>): Split into a
> 	part reducing to half width and recursing and an SSE2 vector
> 	variant doing the final reduction with ix86_expand_reduc.
>
> Index: gcc/config/i386/sse.md
> ===================================================================
> --- gcc/config/i386/sse.md	(revision 264837)
> +++ gcc/config/i386/sse.md	(working copy)
> @@ -2544,11 +2544,29 @@ (define_expand "reduc_plus_scal_v4sf"
>  })
>  
>  ;; Modes handled by reduc_sm{in,ax}* patterns.
> +(define_mode_iterator REDUC_SSE_SMINMAX_MODE
> +  [(V4SF "TARGET_SSE") (V2DF "TARGET_SSE")
> +   (V2DI "TARGET_SSE") (V4SI "TARGET_SSE") (V8HI "TARGET_SSE")
> +   (V16QI "TARGET_SSE")])
> +
> +(define_expand "reduc_<code>_scal_<mode>"
> +  [(smaxmin:REDUC_SSE_SMINMAX_MODE
> +     (match_operand:<ssescalarmode> 0 "register_operand")
> +     (match_operand:REDUC_SSE_SMINMAX_MODE 1 "register_operand"))]
> +  ""
> +{
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> +  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> +							const0_rtx));
> +  DONE;
> +})
> +
>  (define_mode_iterator REDUC_SMINMAX_MODE
>    [(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
>     (V8SI "TARGET_AVX2") (V4DI "TARGET_AVX2")
>     (V8SF "TARGET_AVX") (V4DF "TARGET_AVX")
> -   (V4SF "TARGET_SSE") (V64QI "TARGET_AVX512BW")
> +   (V64QI "TARGET_AVX512BW")
>     (V32HI "TARGET_AVX512BW") (V16SI "TARGET_AVX512F")
>     (V8DI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
>     (V8DF "TARGET_AVX512F")])
> @@ -2559,10 +2577,12 @@ (define_expand "reduc_<code>_scal_<mode>
>        (match_operand:REDUC_SMINMAX_MODE 1 "register_operand"))]
>    ""
>  {
> -  rtx tmp = gen_reg_rtx (<MODE>mode);
> -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> -							const0_rtx));
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_<code><ssehalfvecmodelower>3
> +	       (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  emit_insn (gen_reduc_<code>_scal_<ssehalfvecmodelower> (operands[0],
> +							   tmp2));
>    DONE;
>  })

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)
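P.S. For concreteness, a loop of the kind these reduc_<code>_scal_*
patterns fire on (an illustrative testcase, not the microbenchmark
measured above; the function name is made up, and something like
-Ofast is assumed so the conditional is recognized as a smax
reduction rather than kept as a NaN-honoring select):

        /* Compile with e.g. -Ofast -mavx512f; the vectorized
           reduction epilogue then goes through
           reduc_smax_scal_v8df.  */
        double
        dmax (const double *a, int n)
        {
          double m = a[0];
          for (int i = 1; i < n; ++i)
            m = m < a[i] ? a[i] : m;  /* recognized as MAX_EXPR */
          return m;
        }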