Sudakshina Das <sudi....@arm.com> writes: > Hi > > This is my attempt at reviving the old patch > https://gcc.gnu.org/pipermail/gcc-patches/2019-January/514632.html > > I have followed on Kyrill's comment upstream on the link above and I am using > the recommended option iii that he mentioned. > "1) Adjust the copy_limit to 256 bits after checking > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS in the tuning. > 2) Adjust aarch64_copy_one_block_and_progress_pointers to handle 256-bit > moves. by iii: > iii) Emit explicit V4SI (or any other 128-bit vector mode) pairs ldp/stps. > This wouldn't need any adjustments to > MD patterns, but would make > aarch64_copy_one_block_and_progress_pointers more complex as it would now have > two paths, where one handles two adjacent memory addresses in one > calls." > > With this patch the following test > > #define N 8 > extern int src[N], dst[N]; > > void > foo (void) > { > __builtin_memcpy (dst, src, N * sizeof (int)); > } > > which was originally giving > foo: > adrp x1, src > add x1, x1, :lo12:src > ldp x4, x5, [x1] > adrp x0, dst > add x0, x0, :lo12:dst > ldp x2, x3, [x1, 16] > stp x4, x5, [x0] > stp x2, x3, [x0, 16] > ret > > > changes to the following > foo: > adrp x1, src > add x1, x1, :lo12:src > adrp x0, dst > add x0, x0, :lo12:dst > ldp q1, q0, [x1] > stp q1, q0, [x0] > ret > > This gives about 1.3% improvement on 523.xalancbmk_r in SPEC2017 and an > overall code size reduction on most > SPEC2017 Int benchmarks on Neoverse N1 due to more LDP/STP Q pair registers.
Sorry for the slow review. LGTM with a very minor nit (sorry)… > @@ -21150,9 +21177,12 @@ aarch64_expand_cpymem (rtx *operands) > /* Convert n to bits to make the rest of the code simpler. */ > n = n * BITS_PER_UNIT; > > - /* Maximum amount to copy in one go. The AArch64 back-end has integer > modes > - larger than TImode, but we should not use them for loads/stores here. > */ > - const int copy_limit = GET_MODE_BITSIZE (TImode); > + /* Maximum amount to copy in one go. We allow 256-bit chunks based on the > + AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter and TARGET_SIMD. > */ > + const int copy_limit = ((aarch64_tune_params.extra_tuning_flags > + & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS) > + || !TARGET_SIMD) > + ? GET_MODE_BITSIZE (TImode) : 256; Should only be one space before “256”. I guess at some point we should consider handling fixed-length SVE too, but that's only worth it for -msve-vector-bits=512 and higher. Thanks, Richard