Soumya AR <soum...@nvidia.com> writes:
> Currently, we vectorize CTZ for SVE by using the following operation:
> .CTZ (X) = (PREC - 1) - .CLZ (X & -X)
>
> Instead, this patch expands CTZ to RBIT + CLZ for SVE, as suggested in 
> PR109498.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Soumya AR <soum...@nvidia.com>
>
> gcc/ChangeLog:
>       PR target/109498
>       * config/aarch64/aarch64-sve.md (ctz<mode>2): New pattern to expand
>         CTZ to RBIT + CLZ for SVE.
>
> gcc/testsuite/ChangeLog:
>       PR target/109498
>       * gcc.target/aarch64/sve/ctz.c: New test.

Generally looks good, but a couple of comments:

> ---
>  gcc/config/aarch64/aarch64-sve.md          | 16 +++++++
>  gcc/testsuite/gcc.target/aarch64/sve/ctz.c | 49 ++++++++++++++++++++++
>  2 files changed, 65 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ctz.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> index bfa28849adf..10094f156b3 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -3088,6 +3088,22 @@
>  ;; - NOT
>  ;; -------------------------------------------------------------------------
>  
> +(define_expand "ctz<mode>2"
> +  [(set (match_operand:SVE_I 0 "register_operand")
> +     (unspec:SVE_I
> +       [(match_dup 2)
> +        (ctz:SVE_I
> +          (match_operand:SVE_I 1 "register_operand"))]
> +       UNSPEC_PRED_X))]
> +  "TARGET_SVE"
> +  {
> +     operands[2] = aarch64_ptrue_reg (<VPRED>mode);

There's no real need to use operands[...] here.  It can just be
a local variable.

> +     emit_insn (gen_aarch64_pred_rbit<mode> (operands[0], operands[2],operands[1]));
> +     emit_insn (gen_aarch64_pred_clz<mode> (operands[0], operands[2], operands[0]));

Formatting nit: C++ lines should be 80 characters or fewer.

More importantly, I think we should use a fresh register for the
temporary (RBIT) result, since that tends to permit more optimisation.
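I.e. something like the following for the preparation statements
(completely untested):

```
  "TARGET_SVE"
  {
    /* No need to go through operands[2]: a local predicate register
       and a fresh temporary for the RBIT result are enough.  */
    rtx pred = aarch64_ptrue_reg (<VPRED>mode);
    rtx tmp = gen_reg_rtx (<MODE>mode);
    emit_insn (gen_aarch64_pred_rbit<mode> (tmp, pred, operands[1]));
    emit_insn (gen_aarch64_pred_clz<mode> (operands[0], pred, tmp));
    DONE;
  }
```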

Thanks,
Richard



> +     DONE;
> +  }
> +)
> +
>  ;; Unpredicated integer unary arithmetic.
>  (define_expand "<optab><mode>2"
>    [(set (match_operand:SVE_I 0 "register_operand")
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/ctz.c b/gcc/testsuite/gcc.target/aarch64/sve/ctz.c
> new file mode 100644
> index 00000000000..433a9174f48
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/ctz.c
> @@ -0,0 +1,49 @@
> +/* { dg-final { check-function-bodies "**" "" } } */
> +/* { dg-options "-O3 --param aarch64-autovec-preference=sve-only" } */
> +
> +#include <stdint.h>
> +
> +#define FUNC(FUNCTION, NAME, DTYPE)                 \
> +void                                                     \
> +NAME (DTYPE *__restrict x, DTYPE *__restrict y, int n) { \
> +  for (int i = 0; i < n; i++)                            \
> +    x[i] = FUNCTION (y[i]);                         \
> +}                                                        \
> +
> +
> +/*
> +** ctz_uint8:
> +**   ...
> +**   rbit    z[0-9]+\.b, p[0-7]/m, z[0-9]+\.b
> +**   clz     z[0-9]+\.b, p[0-7]/m, z[0-9]+\.b
> +**   ...
> +*/
> +FUNC (__builtin_ctzg, ctz_uint8, uint8_t)
> +
> +/*
> +** ctz_uint16:
> +**   ...
> +**   rbit    z[0-9]+\.h, p[0-7]/m, z[0-9]+\.h
> +**   clz     z[0-9]+\.h, p[0-7]/m, z[0-9]+\.h
> +**   ...
> +*/
> +FUNC (__builtin_ctzg, ctz_uint16, uint16_t)
> +
> +/*
> +** ctz_uint32:
> +**   ...
> +**   rbit    z[0-9]+\.s, p[0-7]/m, z[0-9]+\.s
> +**   clz     z[0-9]+\.s, p[0-7]/m, z[0-9]+\.s
> +**   ...
> +*/
> +FUNC (__builtin_ctz, ctz_uint32, uint32_t)
> +
> +/*
> +** ctz_uint64:
> +**   ...
> +**   rbit    z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d
> +**   clz     z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d
> +**   ...
> +*/
> +FUNC (__builtin_ctzll, ctz_uint64, uint64_t)
> +
> -- 
> 2.43.2
