Hi all,

Our movmem expansion currently emits TImode loads and stores when copying 128-bit chunks. This generates X-register LDP/STP sequences, as those are the preferred registers for that mode.
For the purpose of copying memory, however, we want to prefer Q-registers. This uses one fewer register per 128-bit chunk, thus helping with register pressure. It also allows merging of 256-bit and larger copies into Q-register LDP/STP, further helping code size. The implementation is simple: we just use a 128-bit vector mode (V4SImode in this patch) rather than TImode.

With this patch the testcase:

#define N 8
int src[N], dst[N];

void
foo (void)
{
  __builtin_memcpy (dst, src, N * sizeof (int));
}

generates:

foo:
        adrp    x1, src
        add     x1, x1, :lo12:src
        adrp    x0, dst
        add     x0, x0, :lo12:dst
        ldp     q1, q0, [x1]
        stp     q1, q0, [x0]
        ret

instead of:

foo:
        adrp    x1, src
        add     x1, x1, :lo12:src
        adrp    x0, dst
        add     x0, x0, :lo12:dst
        ldp     x2, x3, [x1]
        stp     x2, x3, [x0]
        ldp     x2, x3, [x1, 16]
        stp     x2, x3, [x0, 16]
        ret

Bootstrapped and tested on aarch64-none-linux-gnu.

I hope this is a small enough change for GCC 9. One could argue that it finishes up the work done this cycle to support Q-register LDP/STP.

I've seen this give about 1.8% on 541.leela_r on Cortex-A57, with the other changes in SPEC2017 in the noise, but there is a reduction in code size everywhere (due to more LDP/STP-Q pairs being formed).

Ok for trunk?

Thanks,
Kyrill

2018-12-21  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

    * config/aarch64/aarch64.c (aarch64_expand_movmem): Use V4SImode
    for 128-bit moves.

2018-12-21  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

    * gcc.target/aarch64/movmem-q-reg_1.c: New test.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 88b14179a4cbc5357dfabe21227ff9c8a111804c..a8dcdd4c9e22a7583a197372e500c787c91fe459 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16448,6 +16448,16 @@ aarch64_expand_movmem (rtx *operands)
     if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
       cur_mode = mode_iter.require ();
 
+  /* If we want to use 128-bit chunks use a vector mode to prefer the use
+     of Q registers.  This is preferable to using load/store-pairs of X
+     registers as we need 1 Q-register vs 2 X-registers.
+     Also, for targets that prefer it, further passes can create
+     LDP/STP of Q-regs to further reduce the code size.  */
+  if (TARGET_SIMD
+      && known_eq (GET_MODE_SIZE (cur_mode), GET_MODE_SIZE (TImode)))
+    cur_mode = V4SImode;
+
+
   gcc_assert (cur_mode != BLKmode);
 
   mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
diff --git a/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c b/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..09afad59712b939e25519f02153b5156ddacbf5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define N 8
+int src[N], dst[N];
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, N * sizeof (int));
+}
+
+/* { dg-final { scan-assembler {ld[rp]\tq[0-9]*} } } */
+/* { dg-final { scan-assembler-not {ld[rp]\tx[0-9]*} } } */
+/* { dg-final { scan-assembler {st[rp]\tq[0-9]*} } } */
+/* { dg-final { scan-assembler-not {st[rp]\tx[0-9]*} } } */
\ No newline at end of file
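
P.S. For anyone who wants to exercise just the new test, the usual DejaGnu machinery should work; something along the lines of the following, run from the build tree's gcc/ directory (the exact invocation is a suggestion, not part of the patch):

  make check-gcc RUNTESTFLAGS="aarch64.exp=movmem-q-reg_1.c"

The scan-assembler-not patterns are there to confirm that no X-register loads or stores remain in the code generated for foo.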