Sudakshina Das <sudi....@arm.com> writes:
> Hi
>
> This is my attempt at reviving the old patch 
> https://gcc.gnu.org/pipermail/gcc-patches/2019-January/514632.html
>
> I have followed on Kyrill's comment upstream on the link above and I am using 
> the recommended option iii that he mentioned.
> "1) Adjust the copy_limit to 256 bits after checking 
> AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS in the tuning.
>  2) Adjust aarch64_copy_one_block_and_progress_pointers to handle 256-bit 
> moves. by iii:
>    iii) Emit explicit V4SI (or any other 128-bit vector mode) pairs ldp/stps. 
> This wouldn't need any adjustments to
>         MD patterns, but would make 
> aarch64_copy_one_block_and_progress_pointers more complex as it would now have
>         two paths, where one handles two adjacent memory addresses in one 
> calls."
>
> With this patch the following test
>
> #define N 8
> extern int src[N], dst[N];
>
> void
> foo (void)
> {
>   __builtin_memcpy (dst, src, N * sizeof (int));
> }
>
> which was originally giving
> foo:
>         adrp    x1, src
>         add     x1, x1, :lo12:src
>         ldp     x4, x5, [x1]
>         adrp    x0, dst
>         add     x0, x0, :lo12:dst
>         ldp     x2, x3, [x1, 16]
>         stp     x4, x5, [x0]
>         stp     x2, x3, [x0, 16]
>         ret
>
>
> changes to the following
> foo:
>         adrp    x1, src
>         add     x1, x1, :lo12:src
>         adrp    x0, dst
>         add     x0, x0, :lo12:dst
>         ldp     q1, q0, [x1]
>         stp     q1, q0, [x0]
>         ret
>
> This gives about 1.3% improvement on 523.xalancbmk_r in SPEC2017 and an 
> overall code size reduction on most
> SPEC2017 Int benchmarks on Neoverse N1 due to more LDP/STP Q pair registers.

Sorry for the slow review.  LGTM with a very minor nit (sorry)…

> @@ -21150,9 +21177,12 @@ aarch64_expand_cpymem (rtx *operands)
>    /* Convert n to bits to make the rest of the code simpler.  */
>    n = n * BITS_PER_UNIT;
>  
> -  /* Maximum amount to copy in one go.  The AArch64 back-end has integer 
> modes
> -     larger than TImode, but we should not use them for loads/stores here.  
> */
> -  const int copy_limit = GET_MODE_BITSIZE (TImode);
> +  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
> +     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter and TARGET_SIMD.  
> */
> +  const int copy_limit = ((aarch64_tune_params.extra_tuning_flags
> +                        & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS)
> +                       || !TARGET_SIMD)
> +                      ? GET_MODE_BITSIZE (TImode) :  256;

Should only be one space before “256”.

I guess at some point we should consider handling fixed-length SVE too,
but that's only worth it for -msve-vector-bits=512 and higher.

Thanks,
Richard

Reply via email to