On Thu, 4 Oct 2018, Richard Biener wrote:

> 
> This tries to apply the same trick to the sminmax reduction patterns
> as is used for the reduc_plus_scal ones, namely reducing
> %zmm -> %ymm -> %xmm first.  On a microbenchmark this improves
> performance on Zen by ~30% for AVX2 and on Skylake-SP by ~10% for
> AVX512 (with AVX2 on Skylake-SP there's no measurable difference).
> 
> I guess I'm mostly looking for feedback on the approach I took of
> not rewriting ix86_expand_reduc but instead "recursing" on the
> expanders, and on the need to define recursion stops for the SSE
> modes previously not covered.
> 
> I'll throw this on a bootstrap & regtest on x86_64-unknown-linux-gnu
> later.
> 
> Any comments so far?  Writing .md patterns is new for me ;)

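For reference, the kind of loop this targets is a plain max reduction,
something like the following (just a sketch, not the exact
microbenchmark source; for doubles it needs -Ofast or -ffast-math on
top of -mavx2/-mavx512f so the FP max reduction can be vectorized via
the reduc_smax_scal_* expanders):

double
maxred (const double *a, int n)
{
  double m = a[0];
  for (int i = 1; i < n; ++i)
    m = m > a[i] ? m : a[i];
  return m;
}
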
Btw, ICC does the same for AVX2, but for KNL (the only target I could
force it to use AVX512 for) it does

        vextractf64x4 $1, %zmm4, %ymm0                          #5.15 c1
        vmaxpd    %ymm4, %ymm0, %ymm1                           #5.15 c3
        valignq   $3, %zmm1, %zmm1, %zmm16                      #5.15 c7
        valignq   $2, %zmm1, %zmm1, %zmm17                      #5.15 c7
        valignq   $1, %zmm1, %zmm1, %zmm18                      #5.15 c9
        vmaxsd    %xmm18, %xmm17, %xmm2                         #5.15 c13
        vmaxsd    %xmm1, %xmm16, %xmm3                          #5.15 c15
        vmaxsd    %xmm3, %xmm2, %xmm4                           #5.15 c15

I suppose that pipelines a bit better at the expense of using more
registers (though I wonder why it doesn't use the VL %ymm encodings
for the valignq instructions).  Similar tricks would be possible for
AVX2, but ICC doesn't use them, possibly because the required
shuffles have higher latency or lower throughput.
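
For comparison, in intrinsics terms the halving scheme the patch
expands for a V8DF smax reduction would be roughly the following
(an illustrative sketch only, not code taken from the patch, ignoring
NaN subtleties; the function name is made up, compile with -mavx512f):

#include <immintrin.h>

static double
smax_reduce_v8df (__m512d v)
{
  /* %zmm -> %ymm: max of the two 256-bit halves.  */
  __m256d hi4 = _mm512_extractf64x4_pd (v, 1);
  __m256d lo4 = _mm512_castpd512_pd256 (v);
  __m256d m4 = _mm256_max_pd (hi4, lo4);
  /* %ymm -> %xmm: max of the two 128-bit halves.  */
  __m128d hi2 = _mm256_extractf128_pd (m4, 1);
  __m128d lo2 = _mm256_castpd256_pd128 (m4);
  __m128d m2 = _mm_max_pd (hi2, lo2);
  /* Final step within %xmm; the patch leaves this to the SSE-level
     pattern, i.e. ix86_expand_reduc.  */
  __m128d sw = _mm_unpackhi_pd (m2, m2);
  __m128d m1 = _mm_max_sd (m2, sw);
  return _mm_cvtsd_f64 (m1);
}

The AVX2 analogue simply starts at the %ymm -> %xmm step.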

Richard.

> Thanks,
> Richard.
> 
> 2018-10-04  Richard Biener  <rguent...@suse.de>
> 
> 	* config/i386/sse.md (reduc_<code>_scal_<mode>): Split
> 	into a part reducing to half width and recursing, and an
> 	SSE2 vector variant doing the final reduction with
> 	ix86_expand_reduc.
> 
> Index: gcc/config/i386/sse.md
> ===================================================================
> --- gcc/config/i386/sse.md    (revision 264837)
> +++ gcc/config/i386/sse.md    (working copy)
> @@ -2544,11 +2544,29 @@ (define_expand "reduc_plus_scal_v4sf"
>  })
>  
>  ;; Modes handled by reduc_sm{in,ax}* patterns.
> +(define_mode_iterator REDUC_SSE_SMINMAX_MODE
> +  [(V4SF "TARGET_SSE") (V2DF "TARGET_SSE")
> +   (V2DI "TARGET_SSE") (V4SI "TARGET_SSE") (V8HI "TARGET_SSE")
> +   (V16QI "TARGET_SSE")])
> +
> +(define_expand "reduc_<code>_scal_<mode>"
> +  [(smaxmin:REDUC_SSE_SMINMAX_MODE
> +     (match_operand:<ssescalarmode> 0 "register_operand")
> +     (match_operand:REDUC_SSE_SMINMAX_MODE 1 "register_operand"))]
> +  ""
> +{
> +  rtx tmp = gen_reg_rtx (<MODE>mode);
> +  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> +  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> +                                                     const0_rtx));
> +  DONE;
> +})
> +
>  (define_mode_iterator REDUC_SMINMAX_MODE
>    [(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
>     (V8SI "TARGET_AVX2") (V4DI "TARGET_AVX2")
>     (V8SF "TARGET_AVX") (V4DF "TARGET_AVX")
> -   (V4SF "TARGET_SSE") (V64QI "TARGET_AVX512BW")
> +   (V64QI "TARGET_AVX512BW")
>     (V32HI "TARGET_AVX512BW") (V16SI "TARGET_AVX512F")
>     (V8DI "TARGET_AVX512F") (V16SF "TARGET_AVX512F")
>     (V8DF "TARGET_AVX512F")])
> @@ -2559,10 +2577,12 @@ (define_expand "reduc_<code>_scal_<mode>
>       (match_operand:REDUC_SMINMAX_MODE 1 "register_operand"))]
>    ""
>  {
> -  rtx tmp = gen_reg_rtx (<MODE>mode);
> -  ix86_expand_reduc (gen_<code><mode>3, tmp, operands[1]);
> -  emit_insn (gen_vec_extract<mode><ssescalarmodelower> (operands[0], tmp,
> -                                                     const0_rtx));
> +  rtx tmp = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_vec_extract_hi_<mode> (tmp, operands[1]));
> +  rtx tmp2 = gen_reg_rtx (<ssehalfvecmode>mode);
> +  emit_insn (gen_<code><ssehalfvecmodelower>3
> +    (tmp2, tmp, gen_lowpart (<ssehalfvecmode>mode, operands[1])));
> +  emit_insn (gen_reduc_<code>_scal_<ssehalfvecmodelower> (operands[0], tmp2));
>    DONE;
>  })
>  
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)
