Akram Ahmad <akram.ah...@arm.com> writes:
> This renames the existing {s,u}q{add,sub} instructions to use the
> standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and
> IFN_SAT_SUB.
>
> The NEON intrinsics for saturating arithmetic and their corresponding
> builtins are changed to use these standard names too.
>
> Using the standard names for the instructions causes 32- and 64-bit
> unsigned scalar saturating arithmetic to use the NEON instructions,
> resulting in an additional (and inefficient) FMOV being generated when
> the original operands are in GP registers. This patch therefore also
> restores the original behaviour of using the adds/subs instructions
> in this circumstance.
>
> Additional tests are written for the scalar and Adv. SIMD cases to
> ensure that the correct instructions are used. The NEON intrinsics are
> already tested elsewhere.

Thanks for doing this.  The approach looks good.  My main question is:
are we sure that we want to use the Advanced SIMD instructions for
signed saturating SI and DI arithmetic on GPRs?  E.g. for addition,
we only saturate at the negative limit if both operands are negative,
and only saturate at the positive limit if both operands are positive.
So for 32-bit values we can use:

        asr     tmp, x or y, #31
        eor     tmp, tmp, #0x7fffffff

to calculate the saturation value and:

        adds    res, x, y
        csel    res, tmp, res, vs

to calculate the full result.  That's the same number of instructions
as two fmovs for the inputs, the sqadd, and the fmov for the result,
but it should be more efficient.
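
Putting those together, the full 32-bit sequence would be something like
this (just a sketch, register allocation and scheduling left aside):

        asr     tmp, y, #31             // 0 if y >= 0, -1 if y < 0 (either input works)
        eor     tmp, tmp, #0x7fffffff   // INT_MAX or INT_MIN
        adds    res, x, y               // sets V on signed overflow
        csel    res, tmp, res, vs       // pick the saturation value on overflow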

The reason for asking now, rather than treating it as a potential
future improvement, is that it would also avoid splitting the patterns
for signed and unsigned ops.  (The length of the split alternative can be
conservatively set to 16 even for the unsigned version, since nothing
should care in practice.  The split will have happened before
shorten_branches.)
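
FWIW, the kind of scalar signed input I'd expect to exercise this is the
plain C saturating add below (an illustration only, not taken from the
patch's testcases; whether this exact idiom is recognised as IFN_SAT_ADD
is a separate question):

#include <stdint.h>

/* Signed 32-bit saturating addition, spelled out in plain C.  */
int32_t
sadd32 (int32_t x, int32_t y)
{
  int32_t res;
  if (__builtin_add_overflow (x, y, &res))
    /* Signed overflow only happens when x and y have the same sign,
       so either operand's sign picks the right limit.  */
    return x < 0 ? INT32_MIN : INT32_MAX;
  return res;
}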

> gcc/ChangeLog:
>
>       * config/aarch64/aarch64-builtins.cc: Expand iterators.
>       * config/aarch64/aarch64-simd-builtins.def: Use standard names.
>       * config/aarch64/aarch64-simd.md: Use standard names, split insn
>       definitions on signedness of operator and type of operands.
>       * config/aarch64/arm_neon.h: Use standard builtin names.
>       * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
>       simplify splitting of insn for unsigned scalar arithmetic.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
>       Template file for unsigned vector saturating arithmetic tests.
>       * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
>       8-bit vector type tests.
>       * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
>       16-bit vector type tests.
>       * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
>       32-bit vector type tests.
>       * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
>       64-bit vector type tests.
>       * gcc.target/aarch64/saturating_arithmetic.inc: Template file
>       for scalar saturating arithmetic tests.
>       * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
>       * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
>       * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
>       * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> new file mode 100644
> index 00000000000..63eb21e438b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> @@ -0,0 +1,79 @@
> +/* { dg-do assemble { target { aarch64*-*-* } } } */
> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
> +
> +/*
> +** uadd_lane: { xfail *-*-* }

Just curious: why does this fail?  Is it a vector costing issue?

> +**   dup\tv([0-9]+).8b, w0
> +**   uqadd\tb([0-9]+), b\1, b0
> +**   umov\tw0, v\2.b\[0]
> +**   ret
> +*/
> +/*
> +** uaddq:
> +** ...
> +**   ldr\tq([0-9]+), .*
> +**   ldr\tq([0-9]+), .*
> +**   uqadd\tv\2.16b, v\1.16b, v\2.16b

Since the operands are commutative, and since there's no restriction
on the choice of destination register, it's probably safer to use:

**	uqadd\tv[0-9]+.16b, (?:v\1.16b, v\2.16b|v\2.16b, v\1.16b)

Similarly for the other qadds.  The qsubs do of course have a fixed
operand order, but the destination is similarly not restricted, so they
should use [0-9]+ rather than a \n back-reference.
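
E.g. for the qsub checks, something along the lines of (capture group
numbers depending on the particular function, which I haven't checked):

**	uqsub\tv[0-9]+.16b, v\1.16b, v\2.16b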

Thanks,
Richard
