On Tue, 5 Jul 2016, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Sun, 3 Jul 2016, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > On Wed, 15 Jun 2016, Richard Sandiford wrote:
> >> >
> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> > With the proposed cost change for vector construction we will end up
> >> >> > vectorizing the testcase in PR68961 again (on x86_64 and likely
> >> >> > on ppc64le as well after that target gets adjustments).  Currently
> >> >> > we can't optimize that away again noticing the direct overlap of
> >> >> > argument and return registers.  The obstackle is
> >> >> >
> >> >> > (insn 7 4 8 2 (set (reg:V2DF 93)
> >> >> >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
> >> >> >             (reg/v:DF 92 [ aa ]))) 
> >> >> > ...
> >> >> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> >> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> >> >> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> >> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> >> >> >
> >> >> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> >> >> >
> >> >> > First of all simplify_subreg doesn't handle the subregs of a 
> >> >> > vec_concat
> >> >> > (easy fix below).
> >> >> >
> >> >> > Then combine doesn't like to simplify the multi-use (it tries some
> >> >> > parallel it seems).  So I went to forwprop which eventually manages
> >> >> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> >> >> > because it is not a constant.  Thus I allow arbitrary simplification
> >> >> > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
> >> >> > to be a magic flag to tell it to restrict to the case where all
> >> >> > uses can be simplified or so, nor to restrict simplifications to a 
> >> >> > REG.
> >> >> > But I don't see any undesirable simplifications of (subreg 
> >> >> > ([vec_]concat)).
> >> >> 
> >> >> Adding that as a special case to propgate_rtx feels like a hack though 
> >> >> :-)
> >> >> I think:
> >> >> 
> >> >> > Index: gcc/fwprop.c
> >> >> > ===================================================================
> >> >> > *** gcc/fwprop.c      (revision 237286)
> >> >> > --- gcc/fwprop.c      (working copy)
> >> >> > *************** propagate_rtx (rtx x, machine_mode mode,
> >> >> > *** 664,670 ****
> >> >> >         || (GET_CODE (new_rtx) == SUBREG
> >> >> >         && REG_P (SUBREG_REG (new_rtx))
> >> >> >         && (GET_MODE_SIZE (mode)
> >> >> > !           <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
> >> >> >       flags |= PR_CAN_APPEAR;
> >> >> >     if (!varying_mem_p (new_rtx))
> >> >> >       flags |= PR_HANDLE_MEM;
> >> >> > --- 664,673 ----
> >> >> >         || (GET_CODE (new_rtx) == SUBREG
> >> >> >         && REG_P (SUBREG_REG (new_rtx))
> >> >> >         && (GET_MODE_SIZE (mode)
> >> >> > !           <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
> >> >> > !       || ((GET_CODE (new_rtx) == VEC_CONCAT
> >> >> > !        || GET_CODE (new_rtx) == CONCAT)
> >> >> > !       && GET_CODE (x) == SUBREG))
> >> >> >       flags |= PR_CAN_APPEAR;
> >> >> >     if (!varying_mem_p (new_rtx))
> >> >> >       flags |= PR_HANDLE_MEM;
> >> >> 
> >> >> ...this if statement should fundamentally only test new_rtx.
> >> >> E.g. we'd want the same thing for any SUBREG inside X.
> >> >> 
> >> >> How about changing:
> >> >> 
> >> >>   /* The replacement we made so far is valid, if all of the recursive
> >> >>      replacements were valid, or we could simplify everything to
> >> >>      a constant.  */
> >> >>   return valid_ops || can_appear || CONSTANT_P (tem);
> >> >> 
> >> >> so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
> >> >> I suppose that's likely to increase register pressure though,
> >> >> if only some uses of new_rtx simplify.  (There again, requiring all
> >> >> uses to be replacable could make hot code the hostage of cold code.)
> >> >
> >> > Yes, my fear was about register presure increase for the case not all
> >> > uses can be replaced (fwprop doesn't seem to have code to verify or
> >> > require that).
> >> >
> >> > I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG
> >> > case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the
> >> > new_rtx == [VEC_]CONCAT case for example.
> >> 
> >> I don't think that helps though.  There might be other uses of a
> >> VEC_CONCAT that aren't SUBREGs, in which case we'd have the same
> >> problem of keeping both values live at once.
> >> 
> >> How about restricting the REG_P (tem) && !HARD_REGISTER_P (tem)
> >> to cases where new_rtx has more words than tem?
> >
> > So would you really make a simple mode-size check here?
> 
> I thought a mode check would be worth trying since (for better or worse)
> words are special for subregs.  But...
> 
> > I wonder which cases are there other than the subreg of [vec_concat]
> > that would end up with this case.  That is,
> >
> >   if (REG_P (tem) && !HARD_REGISTER_P (tem)
> >       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))
> >       && (VECTOR_MODE_P (GET_MODE (new_rtx))
> >           || COMPLEX_MODE_P (GET_MODE (new_rtx))))
> >     return true;
> 
> this looks good to me too FWIW.  I think it reads more naturally if
> you do the GET_MODE_INNER check last.
> 
> > works as would constraining new_rtx to [VEC_]CONCAT instead of
> > vector/complex modes.  I'm worried about relaxing it further
> > as partial int modes also have a GET_MODE_INNER - relaxing to
> >
> >       && GET_MODE_INNER (GET_MODE (new_rtx)) != GET_MODE (new_rtx)
> >
> > I mean.  If we restrict it to [VEC_]CONCAT we can even allow
> > any REG_P && !HARD_REGISTER_P as I can't see any outer (non-noop)
> > operation that will simplify into a REG_P besides a subreg.
> 
> I think the simplification could also trigger for VEC_DUPLICATE and
> VEC_MERGE, so the patch seems better than a check for specific codes.
> 
> > But the above form would capture the intent to allow simplifying
> > operations on vector / complex to a scalar component.
> >
> > Which means I can test the following if that looks better
> > than what I had originally.  There's Seghers 2->2 combine
> > patch but that has wider impact as well.
> >
> > Thanks,
> > Richard.
> >
> > 2016-07-04  Richard Biener  <rguent...@suse.de>
> >
> >     PR rtl-optimization/68961
> >     * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> >     to simplify to a non-constant.
> >
> >     * gcc.target/i386/pr68961.c: New testcase.
> 
> Thanks, LGTM.

Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes

FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2

as the peephole created for that testcase no longer applies as fwprop
does

In insn 10, replacing
 (vec_concat:V2DF (vec_select:DF (reg:V2DF 91)
            (parallel [
                    (const_int 0 [0])
                ]))
        (mem:DF (reg/f:DI 95) [0  S8 A128]))
 with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ])
        (mem:DF (reg/f:DI 95) [0  S8 A128]))
Changed insn 10

resulting in

        movsd   a+8(%rip), %xmm0
        movhpd  a+16(%rip), %xmm0

again rather than movupd.

Uros, there is probably a missing peephole for the new form - can you
fix this as a followup or should I hold on this patch for a bit longer?

Thanks,
Richard.

2016-07-06  Richard Biener  <rguent...@suse.de>

        PR rtl-optimization/68961
        * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
        to simplify to a non-constant.

        * gcc.target/i386/pr68961.c: New testcase.

Index: gcc/testsuite/gcc.target/i386/pr68961.c
===================================================================
*** gcc/testsuite/gcc.target/i386/pr68961.c     (revision 0)
--- gcc/testsuite/gcc.target/i386/pr68961.c     (working copy)
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile { target lp64 } } */
+ /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
+ 
+ struct x { double d[2]; };
+ 
+ struct x
+ pack (double a, double aa)
+ {
+   struct x u;
+   u.d[0] = a;
+   u.d[1] = aa;
+   return u;
+ }
+ 
+ /* The function should be optimized to just return as arguments and
+    result exactly overlap even when previously vectorized.  */
+ 
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
+ /* { dg-final { scan-assembler-not "mov" } } */
Index: gcc/fwprop.c
===================================================================
*** gcc/fwprop.c        (revision 238005)
--- gcc/fwprop.c        (working copy)
*************** propagate_rtx_1 (rtx *px, rtx old_rtx, r
*** 619,624 ****
--- 619,633 ----
  
    *px = tem;
  
+   /* Allow replacements that simplify operations on a vector or complex
+      value to a component.  The most prominent case is
+      (subreg ([vec_]concat ...)).   */
+   if (REG_P (tem) && !HARD_REGISTER_P (tem)
+       && (VECTOR_MODE_P (GET_MODE (new_rtx))
+         || COMPLEX_MODE_P (GET_MODE (new_rtx)))
+       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx)))
+     return true;
+ 
    /* The replacement we made so far is valid, if all of the recursive
       replacements were valid, or we could simplify everything to
       a constant.  */

Reply via email to