Tamar Christina <tamar.christ...@arm.com> writes:
> Hi All,
>
> In this patch series I'm adding support for zero extending using permutes
> instead of requiring multi-step decomposition.
>
> This codegen has the benefit of needing fewer instructions and having much
> higher throughput than uxtl.  We previously replaced pairs of uxtl/uxtl2s with
> ZIPs to increase throughput, but latency was still an issue due to the
> dependency chain created.
>
> To fix this we can now use TBLs.  The index vectors are loaded outside of the
> loop and can be shared amongst zero extends of the same type.  The additional
> benefit of this is that if the values are permuted again after the extensions,
> the permutes will be simplified and merged, leading to better overall code.
>
> e.g. an LD4 can be replaced with an LDR since the permutes being performed for
> the extensions can be merged with the load permutes.  The way LOAD_LANES is
> currently implemented means this can't be done in GCC yet, but I'm aiming for
> this in GCC 15.
>
> I've additionally only added support for non-VLA.  The problem with VLA is
> that the index registers are hard or impossible to construct.
>
> On Adv. SIMD we use -1 to indicate an out-of-range index so that we can
> transform the two-register TBL into a one-register one.  However, with e.g. a
> byte element type, 255 would be a valid index on VLA at VL > 2048 bits, which
> means that's already not a transformation we can make.
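(As a concrete fixed-length illustration of that trick, here is a minimal
sketch in ACLE intrinsics rather than the patch itself; the function name and
mask layout are mine.  A single TBL does the zero extension because every
out-of-range index, here 0xff, selects zero:)

#include <arm_neon.h>

/* Illustrative sketch only: zero-extend the low two bytes of V to two
   64-bit lanes with one TBL.  Indexes >= 16 are out of range for a
   one-register TBL and therefore read as zero, which is what provides
   the zero extension.  Little-endian lane order assumed.  */
static inline uint64x2_t
zext_b_to_d_lo (uint8x16_t v)
{
  const uint8_t idx[16] = { 0, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
                            1, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
  return vreinterpretq_u64_u8 (vqtbl1q_u8 (v, vld1q_u8 (idx)));
}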
>
> Secondly, the actual mask looks something like {x,x,x,n,n+1, x,x,x, n+2, n+3},
> and while I think I can represent this in vect_perm_builder, I couldn't think
> of any really efficient VLA way to build such masks; it would require a lot of
> setup code.
>
> Lastly, I don't think this transformation would make much sense for SVE, as
> SVE has loads and converts that can handle multi-step type conversions.  For
> instance the loop below is already pretty good for SVE (though it misses that
> the load could do more than one step, presumably because a single extend is
> only merged in RTL).
>
> While I tried hard, for these reasons I don't support VLA, which I hope is
> OK.

Yeah, I agree we can skip VLA.  For .B->.D, we could in principle use
a TBL .B with a mask created by:

        INDEX   Zmask.D, #0, #1

and use INCD to generate the subsequent index vectors.
We could then mask the result with:

        AND     Zresult.D, Zresult.D, #0xff

which still saves one level of unpacks.  But like you say, it doesn't
scale to VL>2048 bits, and setting up each index, although relatively
simple individually, is going to mount up.
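In intrinsics terms that would be roughly the following (an untested sketch;
the function name is made up and little-endian lane order is assumed):

#include <arm_sve.h>

/* Sketch of the INDEX/TBL/AND idea: zero-extend the first svcntd() bytes
   of DATA to 64-bit lanes.  */
static svuint64_t
zext_b_to_d (svuint8_t data)
{
  /* { 0, 1, 2, ... } as .D elements; viewed as bytes this picks source
     byte N for the low byte of lane N (and byte 0 for the other byte
     positions, which the AND below clears).  */
  svuint8_t idx = svreinterpret_u8_u64 (svindex_u64 (0, 1));
  svuint8_t picked = svtbl_u8 (data, idx);
  return svand_n_u64_x (svptrue_b64 (),
                        svreinterpret_u64_u8 (picked), 0xff);
}

with INCD (or further INDEX instructions) producing the index vectors for the
later lanes.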
    
> Concretely on AArch64 this changes:
>
> void test4(unsigned char *x, long long *y, int n) {
>     for(int i = 0; i < n; i++) {
>         y[i] = x[i];
>     }
> }
>
> from generating:
>
> .L4:
>         ldr     q30, [x4], 16
>         add     x3, x3, 128
>         zip1    v1.16b, v30.16b, v31.16b
>         zip2    v30.16b, v30.16b, v31.16b
>         zip1    v2.8h, v1.8h, v31.8h
>         zip1    v0.8h, v30.8h, v31.8h
>         zip2    v1.8h, v1.8h, v31.8h
>         zip2    v30.8h, v30.8h, v31.8h
>         zip1    v26.4s, v2.4s, v31.4s
>         zip1    v29.4s, v0.4s, v31.4s
>         zip1    v28.4s, v1.4s, v31.4s
>         zip1    v27.4s, v30.4s, v31.4s
>         zip2    v2.4s, v2.4s, v31.4s
>         zip2    v0.4s, v0.4s, v31.4s
>         zip2    v1.4s, v1.4s, v31.4s
>         zip2    v30.4s, v30.4s, v31.4s
>         stp     q26, q2, [x3, -128]
>         stp     q28, q1, [x3, -96]
>         stp     q29, q0, [x3, -64]
>         stp     q27, q30, [x3, -32]
>         cmp     x4, x5
>         bne     .L4
>
> and instead we get:
>
> .L4:
>         add     x3, x3, 128
>         ldr     q23, [x4], 16
>         tbl     v5.16b, {v23.16b}, v31.16b
>         tbl     v4.16b, {v23.16b}, v30.16b
>         tbl     v3.16b, {v23.16b}, v29.16b
>         tbl     v2.16b, {v23.16b}, v28.16b
>         tbl     v1.16b, {v23.16b}, v27.16b
>         tbl     v0.16b, {v23.16b}, v26.16b
>         tbl     v22.16b, {v23.16b}, v25.16b
>         tbl     v23.16b, {v23.16b}, v24.16b
>         stp     q5, q4, [x3, -128]
>         stp     q3, q2, [x3, -96]
>         stp     q1, q0, [x3, -64]
>         stp     q22, q23, [x3, -32]
>         cmp     x4, x5
>         bne     .L4
>
> This results in up to a 40% performance uplift on certain workloads.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64.cc (aarch64_use_permute_for_promotion): New.
>       (TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION): Use it.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/vect-tbl-zero-extend_1.c: New test.

LGTM, but...

> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 102680a0efca1ce928e6945033c01cfb68a65152..b90577f4fc8157b3e02936256c8af8b2b7fac144 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -28404,6 +28404,29 @@ aarch64_empty_mask_is_expensive (unsigned)
>    return false;
>  }
>  
> +/* Implement TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION.  Assume that
> +   using permutes rather than multiple unpack steps is beneficial when
> +   doing more than a single-step conversion.  */
> +
> +static bool
> +aarch64_use_permute_for_promotion (const_tree in_type, const_tree out_type)
> +{
> +  /* AArch64's vect_perm_constant doesn't currently support two 64 bit shuffles
> +     into a 128 bit vector type.  So for now reject it.  */
> +  if (maybe_ne (GET_MODE_BITSIZE (TYPE_MODE (in_type)),
> +             GET_MODE_BITSIZE (TYPE_MODE (out_type))))
> +    return false;

...I think we should explicitly reject VLA vectors here, since it isn't
a documented part of the interface that the vectors must be constant sized
(and IMO that's a good thing).
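E.g. something like this (completely untested sketch), reusing the poly-int
predicates the port already uses elsewhere:

  /* Sketch only: the permute trick is only meant for fixed-length vectors,
     so punt on VLA modes explicitly.  */
  if (!GET_MODE_BITSIZE (TYPE_MODE (in_type)).is_constant ()
      || !GET_MODE_BITSIZE (TYPE_MODE (out_type)).is_constant ())
    return false;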

Also, it would be good to have SVE tests with -msve-vector-bits=512 or
something.
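i.e. a fixed-length SVE copy of the new test, along the lines of (exact
options and scan patterns to taste):

/* { dg-do compile } */
/* { dg-additional-options "-O3 -std=c99 -march=armv8.2-a+sve -msve-vector-bits=512" } */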

Thanks,
Richard

> +
> +  auto bitsize_in = element_precision (in_type);
> +  auto bitsize_out = element_precision (out_type);
> +
> +  /* We don't want to use the permutes for a single widening step because
> +     there we're choosing between a zip and a tbl sequence with the same
> +     throughput and latency.  However the zip doesn't require a mask and
> +     uses fewer registers, so we prefer it.  */
> +  return (bitsize_out / bitsize_in) > 2;
> +}
> +
>  /* Return 1 if pseudo register should be created and used to hold
>     GOT address for PIC code.  */
>  
> @@ -31113,6 +31136,9 @@ aarch64_libgcc_floating_mode_supported_p
>  #undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE
>  #define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \
>    aarch64_conditional_operation_is_expensive
> +#undef TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> +#define TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION \
> +  aarch64_use_permute_for_promotion
>  #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
>  #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \
>    aarch64_empty_mask_is_expensive
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..3c088ced63543c203d1cc020de5d67807b48b3fb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> @@ -0,0 +1,52 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> +
> +void test1(unsigned short *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned short a = x[i*4+0];
> +        unsigned short b = x[i*4+1];
> +        unsigned short c = x[i*4+2];
> +        unsigned short d = x[i*4+3];
> +        y[i] = (double)a + (double)b + (double)c + (double)d;
> +    }
> +}
> +
> +void test2(unsigned char *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned short a = x[i*4+0];
> +        unsigned short b = x[i*4+1];
> +        unsigned short c = x[i*4+2];
> +        unsigned short d = x[i*4+3];
> +        y[i] = (double)a + (double)b + (double)c + (double)d;
> +    }
> +}
> +
> +void test3(unsigned short *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned int a = x[i];
> +        y[i] = (double)a;
> +    }
> +}
> +
> +void test4(unsigned short *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +void test5(unsigned int *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +void test6(unsigned char *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times {\tzip1} 1 } } */
> +/* { dg-final { scan-assembler-times {\tzip2} 1 } } */
> +/* { dg-final { scan-assembler-times {\ttbl} 64 } } */
> +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
