Hi James,
On 09/01/19 17:50, James Greenhalgh wrote:
On Fri, Dec 21, 2018 at 06:30:49AM -0600, Kyrill Tkachov wrote:
Hi all,
Our movmem expansion currently emits TImode loads and stores when copying
128-bit chunks.
This generates X-register LDP/STP sequences as these are the most preferred
registers for that mode.
For the purpose of copying memory, however, we want to prefer Q-registers.
This uses one fewer register, helping with register pressure.
It also allows merging of 256-bit and larger copies into Q-reg LDP/STP, further
helping code size.
The implementation is simple: we just use a 128-bit vector mode (V4SImode
in this patch) rather than TImode.
With this patch the testcase:
#define N 8
int src[N], dst[N];
void
foo (void)
{
__builtin_memcpy (dst, src, N * sizeof (int));
}
generates:
foo:
adrp x1, src
add x1, x1, :lo12:src
adrp x0, dst
add x0, x0, :lo12:dst
ldp q1, q0, [x1]
stp q1, q0, [x0]
ret
instead of:
foo:
adrp x1, src
add x1, x1, :lo12:src
adrp x0, dst
add x0, x0, :lo12:dst
ldp x2, x3, [x1]
stp x2, x3, [x0]
ldp x2, x3, [x1, 16]
stp x2, x3, [x0, 16]
ret
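The same expansion extends to larger copies, where consecutive 128-bit chunks
get merged into Q-register LDP/STP pairs. A variant of the testcase above with
N doubled (my own illustration, not part of the patch) exercises that path,
and we can at least check the copy is correct at runtime:

```c
#include <string.h>

#define N 16
int src[N], dst[N];

/* 64-byte copy: with the patch this can expand to two Q-register
   LDP/STP pairs instead of four X-register LDP/STP pairs.  */
void
foo (void)
{
  __builtin_memcpy (dst, src, N * sizeof (int));
}
```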
Bootstrapped and tested on aarch64-none-linux-gnu.
I hope this is a small enough change for GCC 9.
One could argue that it is finishing up the work done this cycle to support
Q-register LDP/STPs.
I've seen this give about 1.8% on 541.leela_r on Cortex-A57, with the other
changes in SPEC2017 in the noise, but there is a reduction in code size
everywhere (due to more LDP/STP-Q pairs being formed).
Ok for trunk?
I'm surprised by the logic. If we want to use 256-bit copies, shouldn't we
be explicit about that in the movmem code, rather than using 128-bit copies
that get merged?
To emit the Q-reg pairs here we'd need to:
1) Adjust the copy_limit to 256 bits after checking
AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS in the tuning.
2) Adjust aarch64_copy_one_block_and_progress_pointers to handle 256-bit moves.
There are a couple of ways to do it:
i) Emit OImode moves. For that we'd need add patterns/alternatives that use
LDP/STP-Q for OImode moves (currently we only use OImode as a container mode
for LD1/ST1 operations).
ii) Emit explicit load/store pairs of TImode values. For that we'd need to
generate two MEMs and two registers, which would complicate
aarch64_copy_one_block_and_progress_pointers a bit more. Furthermore, we'd need
to add {load,store}_pairtiti patterns in aarch64-simd.md that actually
handle TImode pairs. These patterns wouldn't be very useful in other contexts:
for the compiler to form them, register allocation would have to have chosen Q
regs for the individual TImode operations (for the peepholing to match them),
which is unlikely. So these patterns would largely exist only for this bit of
code.
iii) Emit explicit V4SI (or any other 128-bit vector mode) pair LDPs/STPs.
This wouldn't need any adjustments to MD patterns, but would make
aarch64_copy_one_block_and_progress_pointers more complex, as it would now have
two paths, one of which handles two adjacent memory addresses in one call.
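To make option iii) concrete, here is a rough C model of the control flow
(hypothetical names, my own sketch; the real code would emit RTL rather than
performing the copy itself): consume two adjacent 128-bit blocks per step,
which on AArch64 would map to a single LDP/STP-Q pair, then fall back to a
single block and a byte tail.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of option iii): copy 32 bytes (two adjacent 128-bit blocks)
   per iteration.  Each paired step models what would become one
   LDP-Q/STP-Q pair in the expanded movmem sequence.  */
static void
copy_blocks (char *dst, const char *src, size_t n)
{
  while (n >= 32)
    {
      /* Two adjacent 128-bit chunks handled in one step.  */
      memcpy (dst, src, 16);
      memcpy (dst + 16, src + 16, 16);
      src += 32;
      dst += 32;
      n -= 32;
    }
  if (n >= 16)
    {
      /* Single 128-bit block for a 16..31 byte remainder.  */
      memcpy (dst, src, 16);
      src += 16;
      dst += 16;
      n -= 16;
    }
  /* Remaining tail handled with a smaller copy.  */
  if (n > 0)
    memcpy (dst, src, n);
}
```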
Why do TImode loads require two X registers? Shouldn't we
just fix TImode loads to use Q registers if that is preferable?
We do have both alternatives in our movti pattern. The Q-reg alternative comes
later in the list, so I guess it's slightly less preferred.
I believe the most common use of TImode is for wide integer arithmetic (our
add-with-overflow patterns), which is usually done on X registers. So
preferring X registers makes sense in that scenario.
We only care about using Q-regs for TImode when moving memory.
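As an illustration of that more common TImode use (a generic C example of my
own, not from the patch), 128-bit add-with-overflow keeps the value live in an
X-register pair and feeds the flags-setting ADDS/ADCS sequence:

```c
#include <stdbool.h>

/* Typical TImode use: 128-bit addition with overflow detection.
   GCC handles this with X-register add/carry sequences, not
   Q-register operations.  */
bool
add_u128 (unsigned __int128 a, unsigned __int128 b,
          unsigned __int128 *res)
{
  return __builtin_add_overflow (a, b, res);
}
```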
I'm not opposed to the principle of using LDP-Q in our movmem, but is this
the best way to make that happen?
It looked like the minimal patch to achieve this. If you'd like to see explicit
pairs created straight away during expand,
I think approach iii) above is the least bad option.
Thanks,
Kyrill
Thanks,
James
2018-12-21 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* config/aarch64/aarch64.c (aarch64_expand_movmem): Use V4SImode for
128-bit moves.
2018-12-21 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* gcc.target/aarch64/movmem-q-reg_1.c: New test.