Tamar Christina <tamar.christ...@arm.com> writes:
> Hi All,
>
> In this patch series I'm adding support for zero extending using permutes
> instead of requiring multi-step decomposition.
>
> This codegen has the benefit of needing fewer instructions and having much
> higher throughput than uxtl.  We previously replaced pairs of uxtl/uxtl2s
> with ZIPs to increase throughput, but latency was still an issue due to
> the dependency chain created.
>
> To fix this we can now use TBLs.  The indexes are lifted outside of the
> loop as well and can be shared amongst zero extends of the same type.  The
> additional benefit of this is that if the values are being permuted after
> the extensions, the permutes will be simplified and merged, leading to
> better overall code.
>
> e.g. an LD4 can be replaced with an LDR, since the permutes being performed
> for the extensions can be merged with the load permutes.  The way
> LOAD_LANES is currently implemented means this can't be done in GCC yet,
> but I'm aiming for this in GCC 15.
>
> I've additionally only added support for non-VLA.  The problem with VLA is
> that the index registers are hard or impossible to make.
>
> On Adv. SIMD we use -1 to indicate an out-of-range index, so we can
> transform the two-register TBL into a one-register one.  However on e.g. a
> byte array, with VLA 255 would be a valid entry, e.g. at VL > 2048, which
> means that's already not a transformation we can make.

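For concreteness, the Adv. SIMD scheme relies on a one-register TBL writing
zero for every out-of-range index, so a single shuffle both places the
source bytes and zero-fills the remaining lanes.  A minimal sketch of one
such step in ACLE intrinsics, with an index constant that is purely
illustrative rather than taken from the patch:

  #include <arm_neon.h>

  /* Zero-extend source bytes 0 and 1 into the two 64-bit lanes of the
     result.  Index 255 is out of range for a one-register TBL, so those
     destination bytes are written as zero; only bytes 0 and 8 (the low
     byte of each 64-bit lane on little-endian) take data from SRC.  */
  uint64x2_t
  widen_u8_to_u64_lanes01 (uint8x16_t src)
  {
    const uint8x16_t idx = { 0, 255, 255, 255, 255, 255, 255, 255,
                             1, 255, 255, 255, 255, 255, 255, 255 };
    return vreinterpretq_u64_u8 (vqtbl1q_u8 (src, idx));
  }

Seven more such index vectors against the same source register cover the
whole 16-byte input, which is essentially the tbl/stp loop quoted further
down.
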
> Secondly the actual mask looks something like {x,x,x,n,n+1, x,x,x, n+2, n+3}
> and while I think I can represent this in vect_perm_builder, I couldn't
> think of any really efficient VLA way to build such masks.  It would
> require a lot of setup code.
>
> Lastly I don't think this transformation would make much sense for SVE, as
> SVE has loads and converts that can do more than one widening step.  For
> instance the loop below is already pretty good for SVE (though it misses
> that the load can do more than one step, presumably because a single
> extend is only merged in RTL).
>
> While I tried hard, for these reasons I don't support VLA, which I hope is
> ok.

Yeah, I agree we can skip VLA.  For .B->.D, we could in principle use a
TBL .B with a mask created by:

        INDEX   Zmask.D, #0, #1

and use INCD to generate subsequent vectors.  We could then mask the
result with:

        AND     Zresult.D, Zresult.D, #0xff

which still saves one level of unpacks.  But like you say, it doesn't
scale to VL > 2048 bits, and setting up each index, although relatively
simple individually, is going to mount up.

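Roughly, and only to spell out the shape of that alternative (register
numbers are arbitrary):

        // Outside the loop: one index vector per group of .D results.
        index   z4.d, #0, #1          // element j of the group picks source byte j
        mov     z5.d, z4.d
        incd    z5.d                  // next group: indices advanced by the
                                      // number of .D elements in a vector
        ...

        // Inside the loop, for each group:
        tbl     z0.b, { z1.b }, z4.b  // byte 0 of each .D element gets its
                                      // source byte; the other bytes read
                                      // source byte 0
        and     z0.d, z0.d, #0xff     // keep only that low byte, i.e. the
                                      // zero-extend

Each extra group needs its own mov/incd on top of the tbl/and, which is the
setup cost that mounts up.
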
> Concretely on AArch64 this changes:
>
> void test4(unsigned char *x, long long *y, int n) {
>     for(int i = 0; i < n; i++) {
>         y[i] = x[i];
>     }
> }
>
> from generating:
>
> .L4:
>         ldr     q30, [x4], 16
>         add     x3, x3, 128
>         zip1    v1.16b, v30.16b, v31.16b
>         zip2    v30.16b, v30.16b, v31.16b
>         zip1    v2.8h, v1.8h, v31.8h
>         zip1    v0.8h, v30.8h, v31.8h
>         zip2    v1.8h, v1.8h, v31.8h
>         zip2    v30.8h, v30.8h, v31.8h
>         zip1    v26.4s, v2.4s, v31.4s
>         zip1    v29.4s, v0.4s, v31.4s
>         zip1    v28.4s, v1.4s, v31.4s
>         zip1    v27.4s, v30.4s, v31.4s
>         zip2    v2.4s, v2.4s, v31.4s
>         zip2    v0.4s, v0.4s, v31.4s
>         zip2    v1.4s, v1.4s, v31.4s
>         zip2    v30.4s, v30.4s, v31.4s
>         stp     q26, q2, [x3, -128]
>         stp     q28, q1, [x3, -96]
>         stp     q29, q0, [x3, -64]
>         stp     q27, q30, [x3, -32]
>         cmp     x4, x5
>         bne     .L4
>
> and instead we get:
>
> .L4:
>         add     x3, x3, 128
>         ldr     q23, [x4], 16
>         tbl     v5.16b, {v23.16b}, v31.16b
>         tbl     v4.16b, {v23.16b}, v30.16b
>         tbl     v3.16b, {v23.16b}, v29.16b
>         tbl     v2.16b, {v23.16b}, v28.16b
>         tbl     v1.16b, {v23.16b}, v27.16b
>         tbl     v0.16b, {v23.16b}, v26.16b
>         tbl     v22.16b, {v23.16b}, v25.16b
>         tbl     v23.16b, {v23.16b}, v24.16b
>         stp     q5, q4, [x3, -128]
>         stp     q3, q2, [x3, -96]
>         stp     q1, q0, [x3, -64]
>         stp     q22, q23, [x3, -32]
>         cmp     x4, x5
>         bne     .L4
>
> This results in up to a 40% performance uplift on certain workloads.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64.cc (aarch64_use_permute_for_promotion): New.
>         (TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION): Use it.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/vect-tbl-zero-extend_1.c: New test.

LGTM, but...

> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 102680a0efca1ce928e6945033c01cfb68a65152..b90577f4fc8157b3e02936256c8af8b2b7fac144 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -28404,6 +28404,29 @@ aarch64_empty_mask_is_expensive (unsigned)
>    return false;
>  }
>
> +/* Implement TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION.  Assume that
> +   predicated operations when available are beneficial when doing more than
> +   one step conversion.  */
> +
> +static bool
> +aarch64_use_permute_for_promotion (const_tree in_type, const_tree out_type)
> +{
> +  /* AArch64's vect_perm_constant doesn't currently support two 64 bit shuffles
> +     into a 128 bit vector type.  So for now reject it.  */
> +  if (maybe_ne (GET_MODE_BITSIZE (TYPE_MODE (in_type)),
> +                GET_MODE_BITSIZE (TYPE_MODE (out_type))))
> +    return false;

...I think we should explicitly reject VLA vectors here, since it isn't
a documented part of the interface that the vectors must be constant
sized (and IMO that's a good thing).

Also, it would be good to have SVE tests with -msve-vector-bits=512
or something.

Thanks,
Richard

> +
> +  auto bitsize_in = element_precision (in_type);
> +  auto bitsize_out = element_precision (out_type);
> +
> +  /* We don't want to use the permutes for a single widening step because we're
> +     picking there between two zip and tbl sequences with the same throughput
> +     and latencies.  However the zip doesn't require a mask and uses less
> +     registers so we prefer that.  */
> +  return (bitsize_out / bitsize_in) > 2;
> +}
> +
>  /* Return 1 if pseudo register should be created and used to hold
>     GOT address for PIC code.  */
>
> @@ -31113,6 +31136,9 @@ aarch64_libgcc_floating_mode_supported_p
>  #undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE
>  #define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \
>    aarch64_conditional_operation_is_expensive
> +#undef TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> +#define TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION \
> +  aarch64_use_permute_for_promotion
>  #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
>  #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \
>    aarch64_empty_mask_is_expensive
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..3c088ced63543c203d1cc020de5d67807b48b3fb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> @@ -0,0 +1,52 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> +
> +void test1(unsigned short *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +      unsigned short a = x[i*4+0];
> +      unsigned short b = x[i*4+1];
> +      unsigned short c = x[i*4+2];
> +      unsigned short d = x[i*4+3];
> +      y[i] = (double)a + (double)b + (double)c + (double)d;
> +    }
> +}
> +
> +void test2(unsigned char *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +      unsigned short a = x[i*4+0];
> +      unsigned short b = x[i*4+1];
> +      unsigned short c = x[i*4+2];
> +      unsigned short d = x[i*4+3];
> +      y[i] = (double)a + (double)b + (double)c + (double)d;
> +    }
> +}
> +
> +void test3(unsigned short *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +      unsigned int a = x[i];
> +      y[i] = (double)a;
> +    }
> +}
> +
> +void test4(unsigned short *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +      y[i] = x[i];
> +    }
> +}
> +
> +void test5(unsigned int *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +      y[i] = x[i];
> +    }
> +}
> +
> +void test6(unsigned char *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +      y[i] = x[i];
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times {\tzip1} 1 } } */
> +/* { dg-final { scan-assembler-times {\tzip2} 1 } } */
> +/* { dg-final { scan-assembler-times {\ttbl} 64 } } */
> +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */

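As a possible starting point for the SVE coverage asked for above (purely a
sketch, not part of the posted patch; the file name and options here are
illustrative):

  /* Hypothetical gcc.target/aarch64/sve/vect-tbl-zero-extend_1.c.  */
  /* { dg-do compile } */
  /* { dg-additional-options "-O3 -std=c99 -march=armv8.2-a+sve -msve-vector-bits=512" } */

  void test6(unsigned char *x, long long *y, int n) {
      for(int i = 0; i < (n & -8); i++) {
        y[i] = x[i];
      }
  }

  /* Scan patterns deliberately left out; they would need to be filled in
     from the codegen the fixed-length SVE path actually produces.  */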