Hi, Sorry for the slow review.
Przemyslaw Wirkus <przemyslaw.wir...@arm.com> writes: > Hi, > > Introduce simple peephole2 optimization which substitutes a sequence of > four consecutive load or store (LDR, STR) instructions with two load or > store pair (LDP, STP) instructions for 2 element supported vector modes > (V2SI, V2SF, V2DI, and V2DF). > Generated load / store pair instruction offset is adjusted accordingly. > > Bootstrapped and tested on aarch64-none-linux-gnu. > > Example: > $ cat stp_vec_v2sf.c > typedef float __attribute__((vector_size(8))) vec; > > void > store_adjusted(vec *out, vec x, vec y) > { > out[400] = x; > out[401] = y; > out[402] = y; > out[403] = x; > } > > Example compiled with: > $ ./aarch64-none-linux-gnu-gcc -S -O2 stp_vec_v2sf.c -dp > > Before the patch: > > store_adjusted: > str d0, [x0, 3200] // 9 [c=4 l=4] *aarch64_simd_movv2si/2 > str d1, [x0, 3208] // 11 [c=4 l=4] *aarch64_simd_movv2si/2 > str d1, [x0, 3216] // 13 [c=4 l=4] *aarch64_simd_movv2si/2 > str d0, [x0, 3224] // 15 [c=4 l=4] *aarch64_simd_movv2si/2 > ret // 26 [c=0 l=4] *do_return > > After the patch: > > store_adjusted: > add x1, x0, 3200 // 27 [c=4 l=4] *adddi3_aarch64/0 > stp d0, d1, [x1] // 28 [c=0 l=4] vec_store_pairv2siv2si > stp d1, d0, [x1, 16] // 29 [c=0 l=4] vec_store_pairv2siv2si > ret // 22 [c=0 l=4] *do_return > > > OK for master ? > > kind regards, > Przemyslaw Thanks for doing this, looks good. My only real comment is that I wonder if we really need the nunits parameter, or if we should instead change the scalar_mode to a general machine_mode and just pass the vector mode to that. Passing nunits means that we don't need a to_constant here: > @@ -21970,7 +21973,7 @@ aarch64_operands_adjust_ok_for_ldpstp (rtx *operands, > bool load, > for (int i = 0; i < num_insns; i++) > offvals[i] = INTVAL (offset[i]); > > - msize = GET_MODE_SIZE (mode); > + msize = GET_MODE_SIZE (mode) * nunits; > > /* Check if the offsets can be put in the right order to do a ldp/stp. */ > qsort (offvals, num_insns, sizeof (HOST_WIDE_INT), but I think adding to_constant is fine in this context, since it's inherently non-SVE code. That would also avoid having to recalculate the mode: > @@ -22114,8 +22118,11 @@ aarch64_gen_adjusted_ldpstp (rtx *operands, bool > load, > replace_equiv_address_nv (mem_4, plus_constant (Pmode, operands[8], > new_off_3 + msize), true); > > - if (!aarch64_mem_pair_operand (mem_1, mode) > - || !aarch64_mem_pair_operand (mem_3, mode)) > + /* If nunits > 1 we are adjusting for vector mode. In this case we should > + generate mode for vector built from nunits and scalar_mode provided. */ > + mem_mode = (nunits == 1) ? mode : mode_for_vector(mode, > nunits).else_void(); > + if (!aarch64_mem_pair_operand (mem_1, mem_mode) > + || !aarch64_mem_pair_operand (mem_3, mem_mode)) > return false; > > if (code == ZERO_EXTEND) …here. One other (very) minor thing is that some of the lines were over the 80 character limit, but removing the nunits parameter might fix that :-) > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_vec_v2sf.c > b/gcc/testsuite/gcc.target/aarch64/ldp_vec_v2sf.c > new file mode 100644 > index > 0000000000000000000000000000000000000000..f46dea1f748a094509ecfa0292a7c54e94164c9a > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_vec_v2sf.c > @@ -0,0 +1,14 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2" } */ > + > +typedef float __attribute__((vector_size(8))) vec; > + > +vec > +load_long(vec *v) { > + return v[110] + v[111] + v[112] + v[113]; > +} > + > +/* { dg-final { scan-assembler "add\tx\[0-9\]+, x\[0-9\]+, 880" } } */ > +/* { dg-final { scan-assembler "ldp\td\[0-9\]+, d\[0-9\]+, > \\\[x\[0-9\]+\\\]" } } */ > +/* { dg-final { scan-assembler "ldp\td\[0-9\]+, d\[0-9\]+, \\\[x\[0-9\]+, > 16\\\]" } } */ FWIW, it's possible to avoid many of these backslashes by quoting the regexp with {…} rather than "…". E.g.: /* { dg-final { scan-assembler {ldp\td[0-9]+, d[0-9]+, \[x[0-9]+, 16\]} } } */ The above is fine too though. Thanks, Richard