Hi James,
On 09/01/19 17:50, James Greenhalgh wrote:
On Fri, Dec 21, 2018 at 06:30:49AM -0600, Kyrill Tkachov wrote:
Hi all,
Our movmem expansion currently emits TImode loads and stores when copying
128-bit chunks.
This generates X-register LDP/STP sequences as these are the most preferred
registers for that mode.
For the purpose of copying memory, however, we want to prefer Q-registers.
This uses one fewer register, helping with register pressure.
It also allows merging of 256-bit and larger copies into Q-reg LDP/STP, further
helping code size.
The implementation is simple: we just use a 128-bit vector mode (V4SImode
in this patch) rather than TImode.
With this patch the testcase:
#define N 8
int src[N], dst[N];
void
foo (void)
{
__builtin_memcpy (dst, src, N * sizeof (int));
}
generates:
foo:
adrp x1, src
add x1, x1, :lo12:src
adrp x0, dst
add x0, x0, :lo12:dst
ldp q1, q0, [x1]
stp q1, q0, [x0]
ret
instead of:
foo:
adrp x1, src
add x1, x1, :lo12:src
adrp x0, dst
add x0, x0, :lo12:dst
ldp x2, x3, [x1]
stp x2, x3, [x0]
ldp x2, x3, [x1, 16]
stp x2, x3, [x0, 16]
ret
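The same expansion extends to larger copies, where consecutive 128-bit chunks
get merged into Q-register LDP/STP pairs. A variant of the testcase above with
N doubled (my own illustration, not part of the patch) exercises that path,
and we can at least check the copy is correct at runtime:

```c
#include <string.h>

#define N 16
int src[N], dst[N];

/* 64-byte copy: with the patch this can expand to two Q-register
   LDP/STP pairs instead of four X-register LDP/STP pairs.  */
void
foo (void)
{
  __builtin_memcpy (dst, src, N * sizeof (int));
}
```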
Bootstrapped and tested on aarch64-none-linux-gnu.
I hope this is a small enough change for GCC 9.
One could argue that it is finishing up the work done this cycle to support
Q-register LDP/STPs.
I've seen this give about 1.8% on 541.leela_r on Cortex-A57, with the other
changes in SPEC2017 in the noise, but there is a reduction in code size
everywhere (due to more LDP/STP-Q pairs being formed).
Ok for trunk?
I'm surprised by the logic. If we want to use 256-bit copies, shouldn't we
be explicit about that in the movmem code, rather than using 128-bit copies
that get merged?
To emit the Q-reg pairs here we'd need to:
1) Adjust the copy_limit to 256 bits after checking
AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS in the tuning.
2) Adjust aarch64_copy_one_block_and_progress_pointers to handle 256-bit moves.
There are a couple of ways to do it:
i) Emit OImode moves. For that we'd need add patterns/alternatives that use
LDP/STP-Q for OImode moves (currently we only use OImode as a container mode
for LD1/ST1 operations).
ii) Emit explicit load/store pairs of TImode values. For that we'd need to
generate two MEMs and two registers, which would complicate
aarch64_copy_one_block_and_progress_pointers a bit more. Furthermore, we'd need
to add {load,store}_pairtiti patterns in aarch64-simd.md that actually
handle TImode pairs. These patterns wouldn't be very useful in other contexts:
for the compiler to form them, register allocation would have to have chosen Q
regs for the individual TImode operations (for the peepholing to match them),
which is unlikely. So these patterns would largely exist only for this bit of
code.
iii) Emit explicit V4SI (or any other 128-bit vector mode) pair LDPs/STPs.
This wouldn't need any adjustments to MD patterns, but would make
aarch64_copy_one_block_and_progress_pointers more complex, as it would now have
two paths, one of which handles two adjacent memory addresses in one call.
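To make option iii) concrete, here is a rough C model of the control flow
(hypothetical names, my own sketch; the real code would emit RTL rather than
performing the copy itself): consume two adjacent 128-bit blocks per step,
which on AArch64 would map to a single LDP/STP-Q pair, then fall back to a
single block and a byte tail.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of option iii): copy 32 bytes (two adjacent 128-bit blocks)
   per iteration.  Each paired step models what would become one
   LDP-Q/STP-Q pair in the expanded movmem sequence.  */
static void
copy_blocks (char *dst, const char *src, size_t n)
{
  while (n >= 32)
    {
      /* Two adjacent 128-bit chunks handled in one step.  */
      memcpy (dst, src, 16);
      memcpy (dst + 16, src + 16, 16);
      src += 32;
      dst += 32;
      n -= 32;
    }
  if (n >= 16)
    {
      /* Single 128-bit block for a 16..31 byte remainder.  */
      memcpy (dst, src, 16);
      src += 16;
      dst += 16;
      n -= 16;
    }
  /* Remaining tail handled with a smaller copy.  */
  if (n > 0)
    memcpy (dst, src, n);
}
```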
Why do TImode loads require two X registers? Shouldn't we
just fix TImode loads to use Q registers if that is preferable?
We do have both alternatives in our movti pattern. The Q-reg alternative comes
later in the list, so I guess it's slightly less preferred.
I believe the most common use of TImode is for wide integer arithmetic (our
add-with-overflow patterns), which is usually done on X registers. So
preferring X registers makes sense in that scenario.
We only care about using Q-regs for TImode when moving memory.
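As an illustration of that more common TImode use (a generic C example of my
own, not from the patch), 128-bit add-with-overflow keeps the value live in an
X-register pair and feeds the flags-setting ADDS/ADCS sequence:

```c
#include <stdbool.h>

/* Typical TImode use: 128-bit addition with overflow detection.
   GCC handles this with X-register add/carry sequences, not
   Q-register operations.  */
bool
add_u128 (unsigned __int128 a, unsigned __int128 b,
          unsigned __int128 *res)
{
  return __builtin_add_overflow (a, b, res);
}
```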
I'm not opposed to the principle of using LDP-Q in our movmem, but is this
the best way to make that happen?
It looked like the minimal patch to achieve this. If you'd like to see explicit
pairs created straight away during expand,
I think approach iii) above is the least bad option.
Thanks,
Kyrill
Thanks,
James
2018-12-21 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* config/aarch64/aarch64.c (aarch64_expand_movmem): Use V4SImode for
128-bit moves.
2018-12-21 Kyrylo Tkachov <kyrylo.tkac...@arm.com>
* gcc.target/aarch64/movmem-q-reg_1.c: New test.