Hi Akram

> On 8 Jan 2025, at 16:23, Akram Ahmad <akram.ah...@arm.com> wrote:
> 
> Hi Kyrill,
> 
> Thanks for the feedback on V2. I found a pattern which works for
> the open-coded signed arithmetic, and I've implemented the other
> feedback you provided as well.
> 
> I've send the modified patch in this thread as the SVE patch [2/2]
> hasn't been changed, but I'm happy to send the entire V3 patch
> series as a new thread if that's easier. Patch continues below.
> 
> If this is OK, please could you commit on my behalf?
> 

Thanks for the rework!
This looks almost ready. It would be good to give Richard a change to have a 
look, but in the absence of further feedback I can commit it for you next week 
with the following changes...

> Many thanks,
> 
> Akram
> 
> ---
> 
> This renames the existing {s,u}q{add,sub} instructions to use the
> standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and
> IFN_SAT_SUB.
> 
> The NEON intrinsics for saturating arithmetic and their corresponding
> builtins are changed to use these standard names too.
> 
> Using the standard names for the instructions causes 32 and 64-bit
> unsigned scalar saturating arithmetic to use the NEON instructions,
> resulting in an additional (and inefficient) FMOV to be generated when
> the original operands are in GP registers. This patch therefore also
> restores the original behaviour of using the adds/subs instructions
> in this circumstance.
> 
> Furthermore, this patch introduces a new optimisation for signed 32
> and 64-bit scalar saturating arithmetic which uses adds/subs in place
> of the NEON instruction.
> 
> Addition, before:
> fmov d0, x0
> fmov d1, x1
> sqadd d0, d0, d1
> fmov x0, d0
> 
> Addition, after:
> asr x2, x1, 63
> adds x0, x0, x1
> eor x2, x2, 0x8000000000000000
> csinv x0, x0, x2, vc
> 
> In the above example, subtraction replaces the adds with subs and the
> csinv with csel. The 32-bit case follows the same approach. Arithmetic
> with a constant operand is simplified further by directly storing the
> saturating limit in the temporary register, resulting in only three
> instructions being used. It is important to note that this only works
> when early-ra is disabled due to an early-ra bug which erroneously
> assigns FP registers to the operands; if early-ra is enabled, then the
> original behaviour (NEON instruction) occurs.
> 
> Additional tests are written for the scalar and Adv. SIMD cases to
> ensure that the correct instructions are used. The NEON intrinsics are
> already tested elsewhere. The signed scalar case is also tested with
> an execution test to check the results.
> 
> gcc/ChangeLog:
> 
> * config/aarch64/aarch64-builtins.cc: Expand iterators.
> * config/aarch64/aarch64-simd-builtins.def: Use standard names
> * config/aarch64/aarch64-simd.md: Use standard names, split insn
> definitions on signedness of operator and type of operands.
> * config/aarch64/arm_neon.h: Use standard builtin names.
> * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
> simplify splitting of insn for scalar arithmetic.
> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
> Template file for unsigned vector saturating arithmetic tests.
> * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
> 8-bit vector type tests.
> * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
> 16-bit vector type tests.
> * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
> 32-bit vector type tests.
> * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
> 64-bit vector type tests.
> * gcc.target/aarch64/saturating_arithmetic.inc: Template file
> for scalar saturating arithmetic tests.
> * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
> * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
> * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
> * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
> * gcc.target/aarch64/saturating_arithmetic_signed.c: Signed tests.
> ---
> gcc/config/aarch64/aarch64-builtins.cc        |  13 +
> gcc/config/aarch64/aarch64-simd-builtins.def  |   8 +-
> gcc/config/aarch64/aarch64-simd.md            | 218 +++++++++++++-
> gcc/config/aarch64/arm_neon.h                 |  96 +++----
> gcc/config/aarch64/iterators.md               |   4 +
> .../saturating_arithmetic_autovect.inc        |  58 ++++
> .../saturating_arithmetic_autovect_1.c        |  79 +++++
> .../saturating_arithmetic_autovect_2.c        |  79 +++++
> .../saturating_arithmetic_autovect_3.c        |  75 +++++
> .../saturating_arithmetic_autovect_4.c        |  77 +++++
> .../aarch64/saturating-arithmetic-signed.c    | 270 ++++++++++++++++++
> .../aarch64/saturating_arithmetic.inc         |  39 +++
> .../aarch64/saturating_arithmetic_1.c         |  36 +++
> .../aarch64/saturating_arithmetic_2.c         |  36 +++
> .../aarch64/saturating_arithmetic_3.c         |  30 ++
> .../aarch64/saturating_arithmetic_4.c         |  30 ++
> 16 files changed, 1092 insertions(+), 56 deletions(-)
> create mode 100644 
> gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc
> create mode 100644 
> gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> create mode 100644 
> gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c
> create mode 100644 
> gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c
> create mode 100644 
> gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c
> create mode 100644 
> gcc/testsuite/gcc.target/aarch64/saturating-arithmetic-signed.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic.inc
> create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_1.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_2.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_3.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/saturating_arithmetic_4.c
> 
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 86d96e47f01..79e43d0c0b3 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -3863,6 +3863,19 @@ aarch64_general_gimple_fold_builtin (unsigned int 
> fcode, gcall *stmt,
>  new_stmt = gimple_build_assign (gimple_call_lhs (stmt),
>  LSHIFT_EXPR, args[0], args[1]);
> break;
> +
> +      /* lower saturating add/sub neon builtins to gimple.  */
> +      BUILTIN_VSDQ_I (BINOP, ssadd, 3, NONE)
> +      BUILTIN_VSDQ_I (BINOPU, usadd, 3, NONE)
> + new_stmt = gimple_build_call_internal (IFN_SAT_ADD, 2, args[0], args[1]);
> + gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt));
> + break;
> +      BUILTIN_VSDQ_I (BINOP, sssub, 3, NONE)
> +      BUILTIN_VSDQ_I (BINOPU, ussub, 3, NONE)
> + new_stmt = gimple_build_call_internal (IFN_SAT_SUB, 2, args[0], args[1]);
> + gimple_call_set_lhs (new_stmt, gimple_call_lhs (stmt));
> + break;
> +
>       BUILTIN_VSDQ_I_DI (BINOP, sshl, 0, NONE)
>       BUILTIN_VSDQ_I_DI (BINOP_UUS, ushl, 0, NONE)
> {
> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
> b/gcc/config/aarch64/aarch64-simd-builtins.def
> index 0814f8ba14f..43a0a62caee 100644
> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
> @@ -71,10 +71,10 @@
>   BUILTIN_VSDQ_I (BINOP, sqrshl, 0, NONE)
>   BUILTIN_VSDQ_I (BINOP_UUS, uqrshl, 0, NONE)
>   /* Implemented by aarch64_<su_optab><optab><mode>.  */
> -  BUILTIN_VSDQ_I (BINOP, sqadd, 0, NONE)
> -  BUILTIN_VSDQ_I (BINOPU, uqadd, 0, NONE)
> -  BUILTIN_VSDQ_I (BINOP, sqsub, 0, NONE)
> -  BUILTIN_VSDQ_I (BINOPU, uqsub, 0, NONE)
> +  BUILTIN_VSDQ_I (BINOP, ssadd, 3, NONE)
> +  BUILTIN_VSDQ_I (BINOPU, usadd, 3, NONE)
> +  BUILTIN_VSDQ_I (BINOP, sssub, 3, NONE)
> +  BUILTIN_VSDQ_I (BINOPU, ussub, 3, NONE)
>   /* Implemented by aarch64_<sur>qadd<mode>.  */
>   BUILTIN_VSDQ_I (BINOP_SSU, suqadd, 0, NONE)
>   BUILTIN_VSDQ_I (BINOP_UUS, usqadd, 0, NONE)
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index e456f693d2f..ef5e2823673 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -5230,15 +5230,225 @@
> )
> ;; <su>q<addsub>
> 
> -(define_insn "aarch64_<su_optab>q<addsub><mode><vczle><vczbe>"
> -  [(set (match_operand:VSDQ_I 0 "register_operand" "=w")
> - (BINQOPS:VSDQ_I (match_operand:VSDQ_I 1 "register_operand" "w")
> - (match_operand:VSDQ_I 2 "register_operand" "w")))]
> +(define_insn "<su_optab>s<addsub><mode>3<vczle><vczbe>"
> +  [(set (match_operand:VSDQ_I_QI_HI 0 "register_operand" "=w")
> + (BINQOPS:VSDQ_I_QI_HI (match_operand:VSDQ_I_QI_HI 1 "register_operand" "w")
> +      (match_operand:VSDQ_I_QI_HI 2 "register_operand" "w")))]
>   "TARGET_SIMD"
>   "<su_optab>q<addsub>\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>"
>   [(set_attr "type" "neon_q<addsub><q>")]
> )
> 
> +(define_expand "<su_optab>s<addsub><mode>3"
> +  [(parallel [(set (match_operand:GPI 0 "register_operand")
> + (SBINQOPS:GPI (match_operand:GPI 1 "register_operand")
> +      (match_operand:GPI 2 "aarch64_plus_operand")))
> +    (clobber (scratch:GPI))
> +    (clobber (reg:CC CC_REGNUM))])]
> +)
> +
> +;; Introducing a temporary GP reg allows signed saturating arithmetic with 
> GPR
> +;; operands to be calculated without the use of costly transfers to and from 
> FP
> +;; registers.  For example, saturating addition usually uses three FMOVs:
> +;;
> +;;   fmov d0, x0
> +;;   fmov d1, x1
> +;;   sqadd d0, d0, d1
> +;;   fmov x0, d0
> +;;
> +;; Using a temporary register results in three cheaper instructions being 
> used
> +;; in place of the three FMOVs, which calculate the saturating limit 
> accounting
> +;; for the signedness of operand2:
> +;;
> +;;   asr x2, x1, 63
> +;;   adds x0, x0, x1
> +;;   eor x2, x2, 0x8000000000000000
> +;;   csinv x0, x0, x2, vc
> +;;
> +;; If operand2 is a constant value, the temporary register can be used to 
> store
> +;; the saturating limit without the need for asr, xor to calculate said 
> limit.
> +
> +(define_insn_and_split "aarch64_<su_optab>s<addsub><mode>3<vczle><vczbe>"
> +  [(set (match_operand:GPI 0 "register_operand")
> + (SBINQOPS:GPI (match_operand:GPI 1 "register_operand")
> +      (match_operand:GPI 2 "aarch64_plus_operand")))
> +    (clobber (match_scratch:GPI 3))
> +    (clobber (reg:CC CC_REGNUM))]
> +  ""
> +  {@ [ cons: =0, 1 , 2   , =3 ; attrs: type       , arch , length ]
> +     [ w       , w , w   , X  ; neon_q<addsub><q> , simd , 4      ] 
> <su_optab>q<addsub>\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>
> +     [ r       , r , JIr , &r ; *  , *    , 8      ] #
> +  }
> +  "&& reload_completed && GP_REGNUM_P (REGNO (operands[0]))"
> +  [(set (match_dup 0)
> + (if_then_else:GPI
> +  (match_operator 4 "comparison_operator" [(reg:CC_V CC_REGNUM) (const_int 
> 0)])
> +  (match_dup 5)
> +  (match_dup 6)))]
> +  {
> +    if (REG_P (operands[2]))
> +      {
> +      switch (<MODE>mode)
> + {
> + case SImode:
> +  emit_insn (gen_ashr<mode>3 (operands[3], operands[2],
> +      gen_int_mode (31, <MODE>mode)));
> +  emit_insn (gen_xor<mode>3 (operands[3], operands[3],
> +     gen_int_mode (0x80000000, <MODE>mode)));
> +  break;
> + case DImode:
> +  emit_insn (gen_ashr<mode>3 (operands[3], operands[2],
> +      gen_int_mode (63, <MODE>mode)));
> +  emit_insn (gen_xor<mode>3 (operands[3], operands[3],
> +     gen_int_mode (0x8000000000000000,
> +   <MODE>mode)));
> +  break;
> + default:
> +  break;
> + }

… This switch statement can be collapsed as the two cases look very similar.
You can use GET_MODE_BITSIZE (<MODE>mode) - 1 for the 31 and 63 constants and
HOST_WIDE_INT_1U << (GET_MODE_BITSIZE (<MODE>mode) - 1) for the two hex 
constants.


> + switch (<CODE>)
> +  {
> +  case SS_MINUS:
> +    emit_insn (gen_sub<mode>3_compare1 (operands[0], operands[1],
> + operands[2]));
> +    break;
> +  case SS_PLUS:
> +    emit_insn (gen_add<mode>3_compare0 (operands[0], operands[1],
> + operands[2]));
> +    break;
> +  default:
> +    break;
> +  }
> +
> +      rtx ccin = gen_rtx_REG (E_CC_Vmode, CC_REGNUM);
> +      switch (<CODE>)
> + {
> + case SS_PLUS:
> +  operands[4] = gen_rtx_NE (<MODE>mode, ccin, const0_rtx);
> +  operands[5] = gen_rtx_NOT (<MODE>mode, operands[3]);
> +  operands[6] = operands[0];
> +  break;
> + case SS_MINUS:
> +  operands[4] = gen_rtx_EQ (<MODE>mode, ccin, const0_rtx);
> +  operands[5] = operands[0];
> +  operands[6] = operands[3];
> +  break;
> + default:
> +  break;
> + }

It is considered good practice in GCC to have a gcc_unreachable () in the 
default case to ensure that if the code iterator is widened in the future that 
the compiler reliably crashes if the new codes are not handled properly.
Same with the other switches on <CODE> in this patch.

Thanks again for your work on this.
Kyrill

Reply via email to