> On Mon, 1 Oct 2018, Richard Biener wrote:
>
> >
> > I notice that for Zen we create
> >
> > 0.00 │ vhaddp %ymm3,%ymm3,%ymm3
> > 1.41 │ vperm2 $0x1,%ymm3,%ymm3,%ymm1
> > 1.45 │ vaddpd %ymm1,%ymm2,%ymm2
> >
> > from reduc_plus_scal_v4df, which uses the cross-lane permute
> > vperm2f128 even though the upper half of the result is unused in
> > the end (only element zero of the result is used). Much better
> > would be to use vextractf128, which is well-pipelined and has
> > good throughput. (Using vhaddpd is in itself quite bad for Zen,
> > though I haven't yet benchmarked it against open-coding the
> > reduction, i.e. disabling the expander.)
>
> So here's an actual patch that I benchmarked and compared against
> the sequence ICC generates when tuning for core-avx. ICC seems to
> avoid haddpd for both V4DF and V2DF in favor of vunpckhpd plus an
> add, so that is what I do now. Zen is also happy with that, on top
> of the lower overall latency and higher throughput on paper
> (consulting Agner Fog's instruction tables).
>
> The disadvantage for the V2DF case is that it now needs two
> registers instead of one; the advantage is that we can enable it
> for SSE2, where we previously got psrldq + addpd and now get
> unpckhpd + addpd (psrldq has similar latency but possibly the
> wrong, non-FP "domain"?).
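>
> In instructions, the V2DF difference is roughly (illustrative
> register choice; the final add is the scalar gen_adddf3 form):
>
>   # old SSE2 fallback
>   movdqa   %xmm0, %xmm1
>   psrldq   $8, %xmm1        # shifts in the integer domain
>   addpd    %xmm1, %xmm0
>
>   # new
>   movapd   %xmm0, %xmm1
>   unpckhpd %xmm1, %xmm1     # duplicate the high element, FP domain
>   addsd    %xmm1, %xmm0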
>
> It speeds up 482.sphinx3 on Zen with -mprefer-avx256 by ~7% (with
> -mprefer-avx128 the difference is in the noise). The major issue
> with the old sequence seems to be the throughput of hadd and
> vperm2f128 under the two parallel reductions 482.sphinx3 performs,
> which makes the effective latency even worse, while with the new
> sequence the two reductions can be carried out fully in parallel
> (on Zen).
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> OK for trunk?
>
> The v8sf/v4sf reduction sequences and ix86_expand_reduc would
> probably benefit from similar treatment (ix86_expand_reduc by
> first reducing %zmm -> %ymm -> %xmm).
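>
> For v8sf such a sequence could look like the following (an untested
> sketch, not part of this patch):
>
>   vextractf128 $0x1, %ymm0, %xmm1   # reduce 256 -> 128 bits
>   vaddps       %xmm0, %xmm1, %xmm0
>   vmovhlps     %xmm0, %xmm0, %xmm1  # high 64 bits -> low
>   vaddps       %xmm1, %xmm0, %xmm0
>   vmovshdup    %xmm0, %xmm1         # odd elements -> even
>   vaddss       %xmm1, %xmm0, %xmm0  # element zero holds the sum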
>
> Thanks,
> Richard.
>
> 2018-10-02 Richard Biener <[email protected]>
>
> * config/i386/sse.md (reduc_plus_scal_v4df): Avoid the use
> of haddv4df, first reduce to SSE width and exploit the fact
> that we only need element zero with the reduction result.
> (reduc_plus_scal_v2df): Likewise.
OK,
thanks!
Honza
>
> Index: gcc/config/i386/sse.md
> ===================================================================
> --- gcc/config/i386/sse.md (revision 264758)
> +++ gcc/config/i386/sse.md (working copy)
> @@ -2473,24 +2473,28 @@ (define_expand "reduc_plus_scal_v4df"
> (match_operand:V4DF 1 "register_operand")]
> "TARGET_AVX"
> {
> - rtx tmp = gen_reg_rtx (V4DFmode);
> - rtx tmp2 = gen_reg_rtx (V4DFmode);
> - rtx vec_res = gen_reg_rtx (V4DFmode);
> - emit_insn (gen_avx_haddv4df3 (tmp, operands[1], operands[1]));
> - emit_insn (gen_avx_vperm2f128v4df3 (tmp2, tmp, tmp, GEN_INT (1)));
> - emit_insn (gen_addv4df3 (vec_res, tmp, tmp2));
> - emit_insn (gen_vec_extractv4dfdf (operands[0], vec_res, const0_rtx));
> + rtx tmp = gen_reg_rtx (V2DFmode);
> + emit_insn (gen_vec_extract_hi_v4df (tmp, operands[1]));
> + rtx tmp2 = gen_reg_rtx (V2DFmode);
> + emit_insn (gen_addv2df3 (tmp2, tmp, gen_lowpart (V2DFmode, operands[1])));
> + rtx tmp3 = gen_reg_rtx (V2DFmode);
> + emit_insn (gen_vec_interleave_highv2df (tmp3, tmp2, tmp2));
> + emit_insn (gen_adddf3 (operands[0],
> + gen_lowpart (DFmode, tmp2),
> + gen_lowpart (DFmode, tmp3)));
> DONE;
> })
>
> (define_expand "reduc_plus_scal_v2df"
> [(match_operand:DF 0 "register_operand")
> (match_operand:V2DF 1 "register_operand")]
> - "TARGET_SSE3"
> + "TARGET_SSE2"
> {
> rtx tmp = gen_reg_rtx (V2DFmode);
> - emit_insn (gen_sse3_haddv2df3 (tmp, operands[1], operands[1]));
> - emit_insn (gen_vec_extractv2dfdf (operands[0], tmp, const0_rtx));
> + emit_insn (gen_vec_interleave_highv2df (tmp, operands[1], operands[1]));
> + emit_insn (gen_adddf3 (operands[0],
> + gen_lowpart (DFmode, tmp),
> + gen_lowpart (DFmode, operands[1])));
> DONE;
> })
>