On 16/05/18 09:37, Kyrill Tkachov wrote:
>
> On 15/05/18 10:58, Richard Biener wrote:
>> On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov
>> <kyrylo.tkac...@foss.arm.com> wrote:
>>
>>> Hi all,
>>>
>>> This is a respin of James's patch from:
>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>>
>>> The original patch was approved and committed but was later reverted
>>> because of failures on big-endian.
>>>
>>> This tweaked version fixes the big-endian failures in
>>> aarch64_expand_vector_init by picking the right element of VALS to move
>>> into the low part of the vector register depending on endianness.  The
>>> rest of the patch stays the same.  I'm looking for approval on the
>>> aarch64 parts, as they are the ones that have changed since the last
>>> approved version of the patch.
>>>
>>> -----------------------------------------------------------------------
>>>
>>> In the testcase in this patch we create an SLP vector with only two
>>> elements.  Our current vector initialisation code will first duplicate
>>> the first element to both lanes, then overwrite the top lane with a new
>>> value.
>>>
>>> This duplication can be clunky and wasteful.
>>>
>>> Better would be to use the fact that we will always be overwriting
>>> the remaining bits, and simply move the first element to the correct
>>> place (implicitly zeroing all other bits).
>>>
>>> This reduces the code generated for this case, and can allow more
>>> efficient addressing modes, and other second-order benefits for AArch64
>>> code which has been vectorized to V2DI mode.
>>>
>>> Note that the change is generic enough to catch the case for any vector
>>> mode, but is expected to be most useful for 2x64-bit vectorization.
>>>
>>> Unfortunately, on its own, this would cause failures in
>>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>>> gcc.target/aarch64/store_v2vec_lanes.c, which expect to see many more
>>> vec_merge and vec_duplicate for their simplifications to apply.  To fix
>>> this, add a special case to the AArch64 code if we are loading from two
>>> memory addresses, and use the load_pair_lanes patterns directly.
>>>
>>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation,
>>> to catch:
>>>
>>>   (vec_merge:OUTER
>>>      (vec_duplicate:OUTER x:INNER)
>>>      (subreg:OUTER y:INNER 0)
>>>      (const_int N))
>>>
>>> and simplify it to:
>>>
>>>   (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>>>
>>> This is similar to the existing patterns which are tested in this
>>> function, without requiring the second operand to also be a
>>> vec_duplicate.
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>>> aarch64-none-elf.
>>>
>>> Note that this requires
>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>> if we don't want to ICE creating broken vector zero extends.
>>>
>>> Are the non-AArch64 parts OK?
>>
>> Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
>> you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
>> also doesn't handle the swapped operand case.
>>
>> Otherwise the middle-end parts look ok.
>
> I don't see any explicit canonicalisation code for it.
> I've updated the simplify-rtx part to handle the swapped operand case.
> Is the attached patch better in this regard?  I couldn't think of a clean
> way to avoid duplicating some logic (beyond creating a new function away
> from the callsite).
>
> Thanks,
> Kyrill
>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> James
>>>
>>> ---
>>>
>>> 2018-05-15  James Greenhalgh  <james.greenha...@arm.com>
>>>             Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>>>
>>>         * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
>>>         code generation for cases where splatting a value is not useful.
>>>         * simplify-rtx.c (simplify_ternary_operation): Simplify
>>>         vec_merge across a vec_duplicate and a paradoxical subreg forming
>>>         a vector mode to a vec_concat.
>>>
>>> 2018-05-15  James Greenhalgh  <james.greenha...@arm.com>
>>>
>>>         * gcc.target/aarch64/vect-slp-dup.c: New.
>
I'm surprised we don't seem to have a function in the compiler that
performs this check:

+	      && rtx_equal_p (XEXP (x1, 0),
+			      plus_constant (Pmode,
+					     XEXP (x0, 0),
+					     GET_MODE_SIZE (inner_mode))))

without generating dead RTL (plus_constant will rarely be able to return a
subexpression of the original pattern).  I would have thought this sort of
test was not that uncommon.  (A rough sketch of what such a helper might
look like is appended after the quoted patch below, for illustration only.)

However, I don't think that needs to hold up this patch.

OK.

R.

>
> vec-splat.patch
>
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index a2003fe52875f1653d644347bafd7773d1f01e91..6bf6c05535b61eef1021d46bcd8448fb3a0b25f4 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -13916,9 +13916,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>  	  maxv = matches[i][1];
>  	}
>  
> -  /* Create a duplicate of the most common element.  */
> -  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> -  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> +  /* Create a duplicate of the most common element, unless all elements
> +     are equally useless to us, in which case just immediately set the
> +     vector register using the first element.  */
> +
> +  if (maxv == 1)
> +    {
> +      /* For vectors of two 64-bit elements, we can do even better.  */
> +      if (n_elts == 2
> +	  && (inner_mode == E_DImode
> +	      || inner_mode == E_DFmode))
>
> +	{
> +	  rtx x0 = XVECEXP (vals, 0, 0);
> +	  rtx x1 = XVECEXP (vals, 0, 1);
> +	  /* Combine can pick up this case, but handling it directly
> +	     here leaves clearer RTL.
> +
> +	     This is load_pair_lanes<mode>, and also gives us a clean-up
> +	     for store_pair_lanes<mode>.  */
> +	  if (memory_operand (x0, inner_mode)
> +	      && memory_operand (x1, inner_mode)
> +	      && !STRICT_ALIGNMENT
> +	      && rtx_equal_p (XEXP (x1, 0),
> +			      plus_constant (Pmode,
> +					     XEXP (x0, 0),
> +					     GET_MODE_SIZE (inner_mode))))
> +	    {
> +	      rtx t;
> +	      if (inner_mode == DFmode)
> +		t = gen_load_pair_lanesdf (target, x0, x1);
> +	      else
> +		t = gen_load_pair_lanesdi (target, x0, x1);
> +	      emit_insn (t);
> +	      return;
> +	    }
> +	}
> +      /* The subreg-move sequence below will move into lane zero of the
> +	 vector register.  For big-endian we want that position to hold
> +	 the last element of VALS.  */
> +      maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
> +      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
> +    }
> +  else
> +    {
> +      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
> +      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
> +    }
>  
>    /* Insert the rest.  */
>    for (int i = 0; i < n_elts; i++)
> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index e96c9d1b441fcfcbddcbffb0c1c7c0e2a871a2a3..d2714db7ae8ef946a6ce035bea59ddbec890e905 100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -5891,6 +5891,60 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>  	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>  	}
>  
> +      /* Replace:
> +
> +	  (vec_merge:outer (vec_duplicate:outer x:inner)
> +			   (subreg:outer y:inner 0)
> +			   (const_int N))
> +
> +	 with (vec_concat:outer x:inner y:inner) if N == 1,
> +	 or (vec_concat:outer y:inner x:inner) if N == 2.
> +
> +	 Implicitly, this means we have a paradoxical subreg, but such
> +	 a check is cheap, so make it anyway.
> +
> +	 Only applies for vectors of two elements.  */
> +      if (GET_CODE (op0) == VEC_DUPLICATE
> +	  && GET_CODE (op1) == SUBREG
> +	  && GET_MODE (op1) == GET_MODE (op0)
> +	  && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
> +	  && paradoxical_subreg_p (op1)
> +	  && subreg_lowpart_p (op1)
> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
> +	  && IN_RANGE (sel, 1, 2))
> +	{
> +	  rtx newop0 = XEXP (op0, 0);
> +	  rtx newop1 = SUBREG_REG (op1);
> +	  if (sel == 2)
> +	    std::swap (newop0, newop1);
> +	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
> +	}
> +
> +      /* Same as above but with switched operands:
> +	 Replace (vec_merge:outer (subreg:outer x:inner 0)
> +				  (vec_duplicate:outer y:inner)
> +				  (const_int N))
> +
> +	 with (vec_concat:outer x:inner y:inner) if N == 1,
> +	 or (vec_concat:outer y:inner x:inner) if N == 2.  */
> +      if (GET_CODE (op1) == VEC_DUPLICATE
> +	  && GET_CODE (op0) == SUBREG
> +	  && GET_MODE (op0) == GET_MODE (op1)
> +	  && GET_MODE (SUBREG_REG (op0)) == GET_MODE (XEXP (op1, 0))
> +	  && paradoxical_subreg_p (op0)
> +	  && subreg_lowpart_p (op0)
> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
> +	  && IN_RANGE (sel, 1, 2))
> +	{
> +	  rtx newop0 = SUBREG_REG (op0);
> +	  rtx newop1 = XEXP (op1, 0);
> +	  if (sel == 2)
> +	    std::swap (newop0, newop1);
> +	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
> +	}
> +
>        /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
>  			     (const_int n))
>  	 with (vec_concat x y) or (vec_concat y x) depending on value
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +
> +/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
> +
> +void bar (double);
> +
> +void
> +foo (double *restrict in, double *restrict in2,
> +     double *restrict out1, double *restrict out2)
> +{
> +  for (int i = 0; i < 1024; i++)
> +    {
> +      out1[i] = in[i] + 2.0 * in[i+128];
> +      out1[i+1] = in[i+1] + 2.0 * in2[i];
> +      bar (in[i]);
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
> +
>
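
For illustration only, here is a minimal sketch of the kind of helper meant
above.  The name addr_offset_from_p and its signature are invented (nothing
of that name is assumed to exist in GCC), and it only recognises the simple
reg and reg-plus-const_int address shapes, which is all the check in
aarch64_expand_vector_init needs:

/* Hypothetical helper (name and signature invented for illustration):
   return true if ADDR1 is ADDR0 plus the constant byte offset SIZE,
   without building the throw-away RTL that the plus_constant +
   rtx_equal_p combination requires.  */

static bool
addr_offset_from_p (rtx addr1, rtx addr0, poly_int64 size)
{
  poly_int64 off0 = 0, off1 = 0;

  /* Split each address into a base and a constant offset.  */
  if (GET_CODE (addr0) == PLUS && CONST_INT_P (XEXP (addr0, 1)))
    {
      off0 = INTVAL (XEXP (addr0, 1));
      addr0 = XEXP (addr0, 0);
    }
  if (GET_CODE (addr1) == PLUS && CONST_INT_P (XEXP (addr1, 1)))
    {
      off1 = INTVAL (XEXP (addr1, 1));
      addr1 = XEXP (addr1, 0);
    }

  /* The bases must be structurally identical and the constant offsets
     must differ by exactly SIZE bytes.  */
  return known_eq (off1, off0 + size) && rtx_equal_p (addr1, addr0);
}

With something along those lines, the condition in the aarch64.c hunk above
could presumably be spelled
addr_offset_from_p (XEXP (x1, 0), XEXP (x0, 0), GET_MODE_SIZE (inner_mode))
without creating any dead RTL; whether an equivalent helper already exists
somewhere in the tree would of course need checking.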