On Fri, Jun 10, 2016 at 11:20:22AM +0200, Richard Biener wrote:
> With the proposed cost change for vector construction we will end up
> vectorizing the testcase in PR68961 again (on x86_64 and likely
> on ppc64le as well after that target gets adjustments). Currently
> we can't optimize that away again noticing the direct overlap of
> argument and return registers. The obstacle is
>
> (insn 7 4 8 2 (set (reg:V2DF 93)
> (vec_concat:V2DF (reg/v:DF 91 [ a ])
> (reg/v:DF 92 [ aa ])))
> ...
> (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>
> which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>
> First of all simplify_subreg doesn't handle the subregs of a vec_concat
> (easy fix below).
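
The vec_concat case can be pictured with a small model. This is a sketch under stated assumptions, not Richard's actual patch: a DFmode subreg at byte 0 of (vec_concat:V2DF a aa) is just a, and one at byte 8 is aa. The names operand, vec_concat and subreg_of_vec_concat are hypothetical. The model only handles a subreg that lines up with exactly one whole operand, and it assumes op0 occupies the lower-addressed bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an RTL operand: a name and a size in bytes.
   This is not GCC's rtx data structure.  */
struct operand
{
  const char *name;
  unsigned size;
};

/* Model of (vec_concat:M OP0 OP1), assuming OP0 occupies bytes
   [0, OP0.size) and OP1 occupies [OP0.size, OP0.size + OP1.size).  */
struct vec_concat
{
  struct operand op0;
  struct operand op1;
};

/* Return the operand that (subreg:OUTER (vec_concat OP0 OP1) BYTE) names
   exactly, where OUTER is OUTER_SIZE bytes wide, or NULL when the subreg
   does not cover exactly one whole operand (the model then leaves the
   subreg alone).  */
static const struct operand *
subreg_of_vec_concat (const struct vec_concat *vc, unsigned outer_size,
                      unsigned byte)
{
  assert (vc != NULL);
  if (byte == 0 && outer_size == vc->op0.size)
    return &vc->op0;
  if (byte == vc->op0.size && outer_size == vc->op1.size)
    return &vc->op1;
  return NULL;
}
```

With DFmode 8 bytes wide, offsets 0 and 8 select the first and second operand (reg 91 and reg 92 in the testcase), while an offset such as 4 or a 16-byte outer mode is left unsimplified.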
>
> Then combine doesn't like to simplify the multi-use (it tries some
> parallel it seems).

Combine currently will not do a 2->2 combination. Suppose it is combining
A with a later insn B into C, and the result of A is used again later;
then it tries a parallel of A with C. That usually does not match any
insn pattern for the target.
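
To make the shape of that attempt concrete, here is a small sketch that models RTL as strings. Everything in it (struct set, build_combined_parallel, the register names) is hypothetical; it only shows which pattern gets tried, not how combine performs the substitution, and listing C before A is a choice of this sketch. With A = (set r1 (plus r2 r3)) and B = (set r4 (neg r1)), where r1 is used again later, the attempted pattern is C alongside a copy of A:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical textual stand-in for a single SET: DEST := SRC.  */
struct set
{
  const char *dest;
  const char *src;
};

/* Write into BUF the pattern tried when insn A (set A.dest A.src) is
   combined into a later insn B computing (set B_DEST (B_OP A.dest)),
   and A.dest is still used after B: the new insn C, i.e. B with A.src
   substituted, alongside an unchanged copy of A.  Returns what
   snprintf returns.  */
static int
build_combined_parallel (char *buf, size_t len, struct set a,
                         const char *b_dest, const char *b_op)
{
  assert (buf != NULL);
  return snprintf (buf, len, "(parallel [(set %s (%s %s)) (set %s %s)])",
                   b_dest, b_op, a.src, a.dest, a.src);
}
```

For the example above this writes (parallel [(set r4 (neg (plus r2 r3))) (set r1 (plus r2 r3))]). Few targets have an insn matching two independent SETs like that; note also that (plus r2 r3) is now computed in both SETs.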

For a 3->2 (or 4->2) combination, or when A or C is a no-op move (so
that it will disappear later in combine), combine breaks the parallel
into two and checks whether both halves match. We can in fact do that for
2->2 combinations as well: this removes a log_link (from A to B), so
combine cannot get into an infinite loop, even though it does not reduce
the number of RTL insns.
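
In terms of the diff, the change drops one clause from the condition that guards the split. A minimal C model of just that clause follows. The struct and function names are hypothetical, and the other tests in try_combine (ZERO_EXTRACT, STRICT_LOW_PART and so on) are left out, as is the log_link bookkeeping the termination argument relies on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of a PARALLEL of exactly two SETs left over from
   a failed combination.  Only what the changed clause looks at is
   modelled.  */
struct two_set_parallel
{
  bool have_i1;      /* true for 3->2 or 4->2, false for a 2->2 attempt */
  bool set0_is_noop; /* set_noop_p (XVECEXP (newpat, 0, 0)) */
  bool set1_is_noop; /* set_noop_p (XVECEXP (newpat, 0, 1)) */
};

/* The clause removed by the patch:
   (i1 || set_noop_p (set 0) || set_noop_p (set 1)).  */
static bool
split_allowed_before (const struct two_set_parallel *p)
{
  assert (p);
  return p->have_i1 || p->set0_is_noop || p->set1_is_noop;
}

/* With the clause removed, neither the number of insns nor the no-op
   status gates the split any more; the remaining checks in try_combine
   are not modelled here and still apply.  */
static bool
split_allowed_after (const struct two_set_parallel *p)
{
  assert (p);
  (void) p;
  return true;
}
```

A plain 2->2 attempt with no no-op SET, { false, false, false }, is rejected by split_allowed_before but accepted by split_allowed_after; 3->2 attempts and attempts with a no-op SET were already accepted.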

So I tried out the patch below. It decreases code size on most targets
(mostly fixed-length insn targets), and increases it slightly on some
variable-length insn targets (doing an op twice instead of doing it once
plus a move). The changes look good there too, but there are so many of
them that it is almost impossible to check them all properly.

So: can people try this out with their favourite benchmarks, please?


Segher

diff --git a/gcc/combine.c b/gcc/combine.c
index 6b5d000..2c99b4e 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -3933,8 +3933,6 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
&& XVECLEN (newpat, 0) == 2
&& GET_CODE (XVECEXP (newpat, 0, 0)) == SET
&& GET_CODE (XVECEXP (newpat, 0, 1)) == SET
- && (i1 || set_noop_p (XVECEXP (newpat, 0, 0))
- || set_noop_p (XVECEXP (newpat, 0, 1)))
&& GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != ZERO_EXTRACT
&& GET_CODE (SET_DEST (XVECEXP (newpat, 0, 0))) != STRICT_LOW_PART
&& GET_CODE (SET_DEST (XVECEXP (newpat, 0, 1))) != ZERO_EXTRACT