On Tue, 5 Jul 2016, Richard Sandiford wrote: > Richard Biener <rguent...@suse.de> writes: > > On Sun, 3 Jul 2016, Richard Sandiford wrote: > > > >> Richard Biener <rguent...@suse.de> writes: > >> > On Wed, 15 Jun 2016, Richard Sandiford wrote: > >> > > >> >> Richard Biener <rguent...@suse.de> writes: > >> >> > With the proposed cost change for vector construction we will end up > >> >> > vectorizing the testcase in PR68961 again (on x86_64 and likely > >> >> > on ppc64le as well after that target gets adjustments). Currently > >> >> > we can't optimize that away again noticing the direct overlap of > >> >> > argument and return registers. The obstackle is > >> >> > > >> >> > (insn 7 4 8 2 (set (reg:V2DF 93) > >> >> > (vec_concat:V2DF (reg/v:DF 91 [ a ]) > >> >> > (reg/v:DF 92 [ aa ]))) > >> >> > ... > >> >> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ]) > >> >> > (subreg:DI (reg:TI 88 [ D.1756 ]) 0)) > >> >> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ]) > >> >> > (subreg:DI (reg:TI 88 [ D.1756 ]) 8)) > >> >> > > >> >> > which we eventually optimize to DFmode subregs of (reg:V2DF 93). > >> >> > > >> >> > First of all simplify_subreg doesn't handle the subregs of a > >> >> > vec_concat > >> >> > (easy fix below). > >> >> > > >> >> > Then combine doesn't like to simplify the multi-use (it tries some > >> >> > parallel it seems). So I went to forwprop which eventually manages > >> >> > to do this but throws away the result (reg:DF 91) or (reg:DF 92) > >> >> > because it is not a constant. Thus I allow arbitrary simplification > >> >> > results for SUBREGs of [VEC_]CONCAT operations. There doesn't seem > >> >> > to be a magic flag to tell it to restrict to the case where all > >> >> > uses can be simplified or so, nor to restrict simplifications to a > >> >> > REG. > >> >> > But I don't see any undesirable simplifications of (subreg > >> >> > ([vec_]concat)). > >> >> > >> >> Adding that as a special case to propgate_rtx feels like a hack though > >> >> :-) > >> >> I think: > >> >> > >> >> > Index: gcc/fwprop.c > >> >> > =================================================================== > >> >> > *** gcc/fwprop.c (revision 237286) > >> >> > --- gcc/fwprop.c (working copy) > >> >> > *************** propagate_rtx (rtx x, machine_mode mode, > >> >> > *** 664,670 **** > >> >> > || (GET_CODE (new_rtx) == SUBREG > >> >> > && REG_P (SUBREG_REG (new_rtx)) > >> >> > && (GET_MODE_SIZE (mode) > >> >> > ! <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))) > >> >> > flags |= PR_CAN_APPEAR; > >> >> > if (!varying_mem_p (new_rtx)) > >> >> > flags |= PR_HANDLE_MEM; > >> >> > --- 664,673 ---- > >> >> > || (GET_CODE (new_rtx) == SUBREG > >> >> > && REG_P (SUBREG_REG (new_rtx)) > >> >> > && (GET_MODE_SIZE (mode) > >> >> > ! <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))) > >> >> > ! || ((GET_CODE (new_rtx) == VEC_CONCAT > >> >> > ! || GET_CODE (new_rtx) == CONCAT) > >> >> > ! && GET_CODE (x) == SUBREG)) > >> >> > flags |= PR_CAN_APPEAR; > >> >> > if (!varying_mem_p (new_rtx)) > >> >> > flags |= PR_HANDLE_MEM; > >> >> > >> >> ...this if statement should fundamentally only test new_rtx. > >> >> E.g. we'd want the same thing for any SUBREG inside X. > >> >> > >> >> How about changing: > >> >> > >> >> /* The replacement we made so far is valid, if all of the recursive > >> >> replacements were valid, or we could simplify everything to > >> >> a constant. */ > >> >> return valid_ops || can_appear || CONSTANT_P (tem); > >> >> > >> >> so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid? > >> >> I suppose that's likely to increase register pressure though, > >> >> if only some uses of new_rtx simplify. (There again, requiring all > >> >> uses to be replacable could make hot code the hostage of cold code.) > >> > > >> > Yes, my fear was about register presure increase for the case not all > >> > uses can be replaced (fwprop doesn't seem to have code to verify or > >> > require that). > >> > > >> > I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG > >> > case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the > >> > new_rtx == [VEC_]CONCAT case for example. > >> > >> I don't think that helps though. There might be other uses of a > >> VEC_CONCAT that aren't SUBREGs, in which case we'd have the same > >> problem of keeping both values live at once. > >> > >> How about restricting the REG_P (tem) && !HARD_REGISTER_P (tem) > >> to cases where new_rtx has more words than tem? > > > > So would you really make a simple mode-size check here? > > I thought a mode check would be worth trying since (for better or worse) > words are special for subregs. But... > > > I wonder which cases are there other than the subreg of [vec_concat] > > that would end up with this case. That is, > > > > if (REG_P (tem) && !HARD_REGISTER_P (tem) > > && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx)) > > && (VECTOR_MODE_P (GET_MODE (new_rtx)) > > || COMPLEX_MODE_P (GET_MODE (new_rtx)))) > > return true; > > this looks good to me too FWIW. I think it reads more naturally if > you do the GET_MODE_INNER check last. > > > works as would constraining new_rtx to [VEC_]CONCAT instead of > > vector/complex modes. I'm worried about relaxing it further > > as partial int modes also have a GET_MODE_INNER - relaxing to > > > > && GET_MODE_INNER (GET_MODE (new_rtx)) != GET_MODE (new_rtx) > > > > I mean. If we restrict it to [VEC_]CONCAT we can even allow > > any REG_P && !HARD_REGISTER_P as I can't see any outer (non-noop) > > operation that will simplify into a REG_P besides a subreg. > > I think the simplification could also trigger for VEC_DUPLICATE and > VEC_MERGE, so the patch seems better than a check for specific codes. > > > But the above form would capture the intent to allow simplifying > > operations on vector / complex to a scalar component. > > > > Which means I can test the following if that looks better > > than what I had originally. There's Seghers 2->2 combine > > patch but that has wider impact as well. > > > > Thanks, > > Richard. > > > > 2016-07-04 Richard Biener <rguent...@suse.de> > > > > PR rtl-optimization/68961 > > * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT > > to simplify to a non-constant. > > > > * gcc.target/i386/pr68961.c: New testcase. > > Thanks, LGTM.
Bootstrapped and tested on x86_64-unknown-linux-gnu, it causes FAIL: gcc.target/i386/sse2-load-multi.c scan-assembler-times movup 2 as the peephole created for that testcase no longer applies as fwprop does In insn 10, replacing (vec_concat:V2DF (vec_select:DF (reg:V2DF 91) (parallel [ (const_int 0 [0]) ])) (mem:DF (reg/f:DI 95) [0 S8 A128])) with (vec_concat:V2DF (reg:DF 93 [ MEM[(const double *)&a + 8B] ]) (mem:DF (reg/f:DI 95) [0 S8 A128])) Changed insn 10 resulting in movsd a+8(%rip), %xmm0 movhpd a+16(%rip), %xmm0 again rather than movupd. Uros, there is probably a missing peephole for the new form - can you fix this as a followup or should I hold on this patch for a bit longer? Thanks, Richard. 2016-07-06 Richard Biener <rguent...@suse.de> PR rtl-optimization/68961 * fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT to simplify to a non-constant. * gcc.target/i386/pr68961.c: New testcase. Index: gcc/testsuite/gcc.target/i386/pr68961.c =================================================================== *** gcc/testsuite/gcc.target/i386/pr68961.c (revision 0) --- gcc/testsuite/gcc.target/i386/pr68961.c (working copy) *************** *** 0 **** --- 1,19 ---- + /* { dg-do compile { target lp64 } } */ + /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */ + + struct x { double d[2]; }; + + struct x + pack (double a, double aa) + { + struct x u; + u.d[0] = a; + u.d[1] = aa; + return u; + } + + /* The function should be optimized to just return as arguments and + result exactly overlap even when previously vectorized. */ + + /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */ + /* { dg-final { scan-assembler-not "mov" } } */ Index: gcc/fwprop.c =================================================================== *** gcc/fwprop.c (revision 238005) --- gcc/fwprop.c (working copy) *************** propagate_rtx_1 (rtx *px, rtx old_rtx, r *** 619,624 **** --- 619,633 ---- *px = tem; + /* Allow replacements that simplify operations on a vector or complex + value to a component. The most prominent case is + (subreg ([vec_]concat ...)). */ + if (REG_P (tem) && !HARD_REGISTER_P (tem) + && (VECTOR_MODE_P (GET_MODE (new_rtx)) + || COMPLEX_MODE_P (GET_MODE (new_rtx))) + && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))) + return true; + /* The replacement we made so far is valid, if all of the recursive replacements were valid, or we could simplify everything to a constant. */