On 12/11/2017 08:44 AM, James Greenhalgh wrote:
> Hi,
>
> In the testcase in this patch we create an SLP vector with only two
> elements.  Our current vector initialisation code will first duplicate
> the first element to both lanes, then overwrite the top lane with a new
> value.
>
> This duplication can be clunky and wasteful.
>
> Better would be to simply use the fact that we will always be overwriting
> the remaining bits, and simply move the first element to the correct place
> (implicitly zeroing all other bits).
>
> This reduces the code generation for this case, and can allow more
> efficient addressing modes, and other second order benefits for AArch64
> code which has been vectorized to V2DI mode.
>
> Note that the change is generic enough to catch the case for any vector
> mode, but is expected to be most useful for 2x64-bit vectorization.
>
> Unfortunately, on its own, this would cause failures in
> gcc.target/aarch64/load_v2vec_lanes_1.c and
> gcc.target/aarch64/store_v2vec_lanes.c, which expect to see many more
> vec_merge and vec_duplicate for their simplifications to apply.  To fix
> this, add a special case to the AArch64 code if we are loading from two
> memory addresses, and use the load_pair_lanes patterns directly.
>
> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
> to catch:
>
>   (vec_merge:OUTER
>      (vec_duplicate:OUTER x:INNER)
>      (subreg:OUTER y:INNER 0)
>      (const_int N))
>
> And simplify it to:
>
>   (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>
> This is similar to the existing patterns which are tested in this
> function, without requiring the second operand to also be a
> vec_duplicate.
>
> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
> aarch64-none-elf.
>
> Note that this requires
> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
> if we don't want to ICE creating broken vector zero extends.
>
> Are the non-AArch64 parts OK?
>
> Thanks,
> James
>
> ---
> 2017-12-11  James Greenhalgh  <james.greenha...@arm.com>
>
> 	* config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
> 	code generation for cases where splatting a value is not useful.
> 	* simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
> 	across a vec_duplicate and a paradoxical subreg forming a vector
> 	mode to a vec_concat.
>
> 2017-12-11  James Greenhalgh  <james.greenha...@arm.com>
>
> 	* gcc.target/aarch64/vect-slp-dup.c: New.
>
>
> 0001-patch-AArch64-Do-not-perform-a-vector-splat-for-vect.patch
>
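To make the two-element SLP case above concrete for anyone reading along,
it has roughly the following shape.  This is a hypothetical illustration,
not the new testcase from the patch (that one is
gcc.target/aarch64/vect-slp-dup.c):

  /* Hypothetical two-element SLP group: the vectorizer can build a
     V2DI-style vector { a, b } for the pair of stores.  Before the
     patch, that vector was expanded by duplicating A to both lanes and
     then overwriting the top lane with B; with the patch, A is simply
     moved into lane 0 (implicitly zeroing the other bits) before B is
     inserted.  */
  void
  foo (long long *out, long long a, long long b, int n)
  {
    for (int i = 0; i < n; i += 2)
      {
        out[i] = a;
        out[i + 1] = b;
      }
  }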
> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index 806c309..ed16f70 100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -5785,6 +5785,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>        return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>      }
>
> +  /* Replace:
> +
> +      (vec_merge:outer (vec_duplicate:outer x:inner)
> +                       (subreg:outer y:inner 0)
> +                       (const_int N))
> +
> +     with (vec_concat:outer x:inner y:inner) if N == 1,
> +     or (vec_concat:outer y:inner x:inner) if N == 2.
> +
> +     Implicitly, this means we have a paradoxical subreg, but such
> +     a check is cheap, so make it anyway.

I'm going to assume that N == 0 and N == 3 are handled elsewhere and do
not show up here in practice.  So is it advisable to handle the case
where the VEC_DUPLICATE and SUBREG show up in the opposite order?  Or is
there some canonicalization that prevents that?

simplify-rtx bits are OK as-is if we're certain we're not going to get
the alternate ordering of the VEC_MERGE operands.  Also OK if you either
generalize this chunk of code or duplicate & twiddle it to handle the
alternate order.

I didn't look at the aarch64 specific bits.

jeff
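P.S.  For concreteness, the "generalize this chunk of code" option could
look something like the untested sketch below.  It assumes the
surrounding context of simplify_ternary_operation (in particular a local
SEL holding the value of the CONST_INT op2) and reuses the existing
helpers paradoxical_subreg_p and subreg_lowpart_p; any checks from the
posted hunk that were trimmed from the quote above are elided here too:

  /* Untested sketch: accept the VEC_DUPLICATE in either operand of the
     VEC_MERGE by canonicalizing locally.  Swapping the operands of a
     VEC_MERGE complements the selector, which for a two-element vector
     maps 1 <-> 2.  */
  if ((GET_CODE (op0) == VEC_DUPLICATE
       || GET_CODE (op1) == VEC_DUPLICATE)
      && GET_MODE (op0) == GET_MODE (op1)
      && GET_MODE_NUNITS (GET_MODE (op0)) == 2
      && IN_RANGE (sel, 1, 2))
    {
      rtx newop0 = op0, newop1 = op1;

      /* Put the VEC_DUPLICATE first, flipping the selector to match.  */
      if (GET_CODE (newop1) == VEC_DUPLICATE)
        {
          std::swap (newop0, newop1);
          sel = 3 - sel;
        }

      if (GET_CODE (newop0) == VEC_DUPLICATE
          && GET_CODE (newop1) == SUBREG
          && paradoxical_subreg_p (newop1)
          && subreg_lowpart_p (newop1)
          && GET_MODE (SUBREG_REG (newop1)) == GET_MODE (XEXP (newop0, 0)))
        {
          rtx x = XEXP (newop0, 0);
          rtx y = SUBREG_REG (newop1);
          /* N == 1 -> (vec_concat x y); N == 2 -> (vec_concat y x).  */
          if (sel == 2)
            std::swap (x, y);
          return simplify_gen_binary (VEC_CONCAT, mode, x, y);
        }
    }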