Akram Ahmad <akram.ah...@arm.com> writes:
> Hi Kyrill,
>
> Thanks for the feedback on V2. I found a pattern which works for
> the open-coded signed arithmetic, and I've implemented the other
> feedback you provided as well.
>
> I've sent the modified patch in this thread, as the SVE patch [2/2]
> hasn't changed, but I'm happy to send the entire V3 patch
> series as a new thread if that's easier. The patch continues below.
>
> If this is OK, please could you commit on my behalf?
>
> Many thanks,
>
> Akram
>
> ---
>
> This renames the existing {s,u}q{add,sub} instructions to use the
> standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and
> IFN_SAT_SUB.
>
> The NEON intrinsics for saturating arithmetic and their corresponding
> builtins are changed to use these standard names too.
>
> Using the standard names for the instructions causes 32 and 64-bit
> unsigned scalar saturating arithmetic to use the NEON instructions,
> resulting in an additional (and inefficient) FMOV being generated when
> the original operands are in GP registers. This patch therefore also
> restores the original behaviour of using the adds/subs instructions
> in this circumstance.
>
> Furthermore, this patch introduces a new optimisation for signed 32
> and 64-bit scalar saturating arithmetic which uses adds/subs in place
> of the NEON instruction.
>
> Addition, before:
>       fmov    d0, x0
>       fmov    d1, x1
>       sqadd   d0, d0, d1
>       fmov    x0, d0
>
> Addition, after:
>       asr     x2, x1, 63
>       adds    x0, x0, x1
>       eor     x2, x2, 0x8000000000000000
>       csinv   x0, x0, x2, vc
>
> In the above example, subtraction replaces the adds with subs and the
> csinv with csel. The 32-bit case follows the same approach. Arithmetic
> with a constant operand is simplified further by directly storing the
> saturating limit in the temporary register, resulting in only three
> instructions being used; both variants are sketched below. Note that
> this only works when early-ra is disabled, due to an early-ra bug
> which erroneously assigns FP registers to the operands; if early-ra is
> enabled, the original behaviour (NEON instruction) occurs.
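>
> Subtraction, after (a sketch; register allocation may differ):
>       asr     x2, x1, 63
>       subs    x0, x0, x1
>       eor     x2, x2, 0x8000000000000000
>       csel    x0, x0, x2, vc
>
> Constant operand, e.g. x + 1 (likewise a sketch):
>       mov     x1, 9223372036854775807
>       adds    x0, x0, 1
>       csel    x0, x0, x1, vc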

This can be fixed by changing:

        case CT_REGISTER:
          if (REG_P (op) || SUBREG_P (op))
            return true;
          break;

to:

        case CT_REGISTER:
          if (REG_P (op) || SUBREG_P (op) || GET_CODE (op) == SCRATCH)
            return true;
          break;

But I can test & post that as a follow-up if you prefer.

> Additional tests are written for the scalar and Adv. SIMD cases to
> ensure that the correct instructions are used. The NEON intrinsics are
> already tested elsewhere. The signed scalar case is also tested with
> an execution test to check the results.

It looks like this is based on a relatively old version of trunk.
(Probably from the same time as v1?  Sorry that this has been in review
for a while.)

Otherwise it mostly LGTM too, but some additional comments on top of Kyrill's:

> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index e456f693d2f..ef5e2823673 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -5230,15 +5230,225 @@
>  )
>  ;; <su>q<addsub>
>  
> -(define_insn "aarch64_<su_optab>q<addsub><mode><vczle><vczbe>"
> -  [(set (match_operand:VSDQ_I 0 "register_operand" "=w")
> -     (BINQOPS:VSDQ_I (match_operand:VSDQ_I 1 "register_operand" "w")
> -                     (match_operand:VSDQ_I 2 "register_operand" "w")))]
> +(define_insn "<su_optab>s<addsub><mode>3<vczle><vczbe>"
> +  [(set (match_operand:VSDQ_I_QI_HI 0 "register_operand" "=w")
> +     (BINQOPS:VSDQ_I_QI_HI (match_operand:VSDQ_I_QI_HI 1 "register_operand" "w")
> +                           (match_operand:VSDQ_I_QI_HI 2 "register_operand" "w")))]

Very minor, wouldn't have raised it if it weren't for the comments
below, but: it'd be good to put the input match_operands on their own
lines now that this overflows 80 chars.
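Something like:

  (define_insn "<su_optab>s<addsub><mode>3<vczle><vczbe>"
    [(set (match_operand:VSDQ_I_QI_HI 0 "register_operand" "=w")
          (BINQOPS:VSDQ_I_QI_HI
            (match_operand:VSDQ_I_QI_HI 1 "register_operand" "w")
            (match_operand:VSDQ_I_QI_HI 2 "register_operand" "w")))]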

>    "TARGET_SIMD"
>    "<su_optab>q<addsub>\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>"
>    [(set_attr "type" "neon_q<addsub><q>")]
>  )
>  
> +(define_expand "<su_optab>s<addsub><mode>3"
> +  [(parallel [(set (match_operand:GPI 0 "register_operand")
> +     (SBINQOPS:GPI (match_operand:GPI 1 "register_operand")
> +                   (match_operand:GPI 2 "aarch64_plus_operand")))
> +    (clobber (scratch:GPI))
> +    (clobber (reg:CC CC_REGNUM))])]

Likewise very minor, but I think a more usual formatting would be:

  [(parallel
     [(set (match_operand:GPI 0 "register_operand")
           (SBINQOPS:GPI (match_operand:GPI 1 "register_operand")
                         (match_operand:GPI 2 "aarch64_plus_operand")))
     (clobber (scratch:GPI))
     (clobber (reg:CC CC_REGNUM))])]

> +)
> +
> +;; Introducing a temporary GP reg allows signed saturating arithmetic with GPR
> +;; operands to be calculated without the use of costly transfers to and from FP
> +;; registers.  For example, saturating addition usually uses three FMOVs:
> +;;
> +;;   fmov    d0, x0
> +;;   fmov    d1, x1
> +;;   sqadd   d0, d0, d1
> +;;   fmov    x0, d0
> +;;
> +;; Using a temporary register results in three cheaper instructions being used
> +;; in place of the three FMOVs, which calculate the saturating limit accounting
> +;; for the signedness of operand2:
> +;;
> +;;   asr     x2, x1, 63
> +;;   adds    x0, x0, x1
> +;;   eor     x2, x2, 0x8000000000000000
> +;;   csinv   x0, x0, x2, vc
> +;;
> +;; If operand2 is a constant value, the temporary register can be used to store
> +;; the saturating limit without the need for asr, xor to calculate said limit.
> +
> +(define_insn_and_split "aarch64_<su_optab>s<addsub><mode>3<vczle><vczbe>"
> +  [(set (match_operand:GPI 0 "register_operand")
> +     (SBINQOPS:GPI (match_operand:GPI 1 "register_operand")
> +                   (match_operand:GPI 2 "aarch64_plus_operand")))
> +    (clobber (match_scratch:GPI 3))
> +    (clobber (reg:CC CC_REGNUM))]
> +  ""
> +  {@ [ cons: =0, 1 , 2   , =3 ; attrs: type       , arch , length ]
> +     [ w       , w , w   , X  ; neon_q<addsub><q> , simd , 4      ] <su_optab>q<addsub>\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>
> +     [ r       , r , JIr , &r ; *              , *    , 8      ] #
> +  }
> +  "&& reload_completed && GP_REGNUM_P (REGNO (operands[0]))"
> +  [(set (match_dup 0)
> +     (if_then_else:GPI
> +       (match_operator 4 "comparison_operator" [(reg:CC_V CC_REGNUM) (const_int 0)])

The comparison_operator predicate isn't applied, since operand 4 is
always provided by the split pattern.  (Seems like a missing diagnostic
from the gen* machinery.)  The way to write it as a dup is:

          (match_op_dup 4 [(reg:CC_V CC_REGNUM) (const_int 0)])

but it's simpler as just:

          (match_dup 4)

since the split code generates the full expression itself.

> +       (match_dup 5)
> +       (match_dup 6)))]
> +  {
> +    if (REG_P (operands[2]))
> +      {
> +      switch (<MODE>mode)
> +     {
> +     case SImode:
> +       emit_insn (gen_ashr<mode>3 (operands[3], operands[2],
> +                                   gen_int_mode (31, <MODE>mode)));
> +       emit_insn (gen_xor<mode>3 (operands[3], operands[3],
> +                                  gen_int_mode (0x80000000, <MODE>mode)));
> +       break;
> +     case DImode:
> +       emit_insn (gen_ashr<mode>3 (operands[3], operands[2],
> +                                   gen_int_mode (63, <MODE>mode)));
> +       emit_insn (gen_xor<mode>3 (operands[3], operands[3],
> +                                  gen_int_mode (0x8000000000000000,
> +                                                <MODE>mode)));
> +       break;
> +     default:
> +       break;
> +     }
> +     switch (<CODE>)
> +       {
> +       case SS_MINUS:
> +         emit_insn (gen_sub<mode>3_compare1 (operands[0], operands[1],
> +                                             operands[2]));
> +         break;
> +       case SS_PLUS:
> +         emit_insn (gen_add<mode>3_compare0 (operands[0], operands[1],
> +                                             operands[2]));
> +         break;
> +       default:
> +         break;
> +       }
> +
> +      rtx ccin = gen_rtx_REG (E_CC_Vmode, CC_REGNUM);
> +      switch (<CODE>)
> +     {
> +     case SS_PLUS:
> +       operands[4] = gen_rtx_NE (<MODE>mode, ccin, const0_rtx);
> +       operands[5] = gen_rtx_NOT (<MODE>mode, operands[3]);
> +       operands[6] = operands[0];
> +       break;
> +     case SS_MINUS:
> +       operands[4] = gen_rtx_EQ (<MODE>mode, ccin, const0_rtx);
> +       operands[5] = operands[0];
> +       operands[6] = operands[3];
> +       break;
> +     default:
> +       break;
> +     }
> +      }
> +    else
> +      {
> +     long imm = INTVAL (operands[2]);

If this uses a named type, it should be HOST_WIDE_INT rather than "long".
But "auto imm" is probably easier. 

> +     gcc_assert (imm != 0);

The constraints do allow 0, so I'm not sure this assert is safe.
Certainly we shouldn't usually get unfolded instructions, but strange
things can happen with fuzzed options.

Does the code mishandle that case?  It looked like it should be ok.
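(AFAICT imm == 0 itself would be handled correctly here: subtracting
zero can't set V, so the csel would always keep the computed result.
It would just be the assert that trips.)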

> +     rtx neg_imm = gen_int_mode (-imm, <MODE>mode);
> +     wide_int limit;
> +
> +     switch (<CODE>)
> +       {
> +       case SS_MINUS:
> +         emit_insn (gen_sub<mode>3_compare1_imm (operands[0], operands[1],
> +                                                 operands[2], neg_imm));
> +         limit = (imm >> 63) + 1 ? wi::min_value (<MODE>mode, SIGNED)
> +                                 : wi::max_value (<MODE>mode, SIGNED);

INTVALs are always stored in sign-extended form, and we're not supposed
to rely on >> being an arithmetic shift, so this is IMO more obvious as
imm >= 0 ? ... : ...

Same for SS_PLUS.
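That is, for the SS_MINUS case:

  limit = imm >= 0 ? wi::min_value (<MODE>mode, SIGNED)
                   : wi::max_value (<MODE>mode, SIGNED);

and with min/max swapped for SS_PLUS.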

> +         break;
> +       case SS_PLUS:
> +         emit_insn (gen_sub<mode>3_compare1_imm (operands[0], operands[1],
> +                                                 neg_imm, operands[2]));
> +         limit = (imm >> 63) + 1 ? wi::max_value (<MODE>mode, SIGNED)
> +                                 : wi::min_value (<MODE>mode, SIGNED);
> +         break;
> +       default:
> +         break;
> +       }
> +
> +      rtx sat_limit = immed_wide_int_const (limit, <MODE>mode);
> +      emit_insn (gen_rtx_SET (operands[3], sat_limit));
> +
> +      rtx ccin = gen_rtx_REG (E_CC_Vmode, CC_REGNUM);
> +      operands[4] = gen_rtx_EQ (<MODE>mode, ccin, const0_rtx);
> +      operands[5] = operands[0];
> +      operands[6] = operands[3];
> +      }
> +  }
> +)
> +
> +;; Unsigned saturating arithmetic with GPR operands can be optimised similarly
> +;; to the signed case, albeit without the need for a temporary register as the
> +;; saturating limit can be inferred from the <addsub> code.  This applies only
> +;; to SImode and DImode.
> +
> +(define_insn_and_split "<su_optab>s<addsub><mode>3<vczle><vczbe>"
> +  [(set (match_operand:GPI 0 "register_operand")
> +     (UBINQOPS:GPI (match_operand:GPI 1 "register_operand")
> +                   (match_operand:GPI 2 "aarch64_plus_operand")))
> +    (clobber (reg:CC CC_REGNUM))]
> +  ""
> +  {@ [ cons: =0, 1 , 2   ; attrs: type       , arch , length ]
> +     [ w       , w , w   ; neon_q<addsub><q> , simd , 4      ] <su_optab>q<addsub>\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>
> +     [ r       , r , JIr ; *              , *    , 8      ] #
> +  }
> +  "&& reload_completed && GP_REGNUM_P (REGNO (operands[0]))"
> +  [(set (match_dup 0)
> +     (if_then_else:GPI
> +       (match_operator 3 "comparison_operator" [(reg:CC CC_REGNUM) (const_int 0)])

Similarly to the above, this could be (match_dup 3).  (It matters a little
more here, since the code produces (reg:CC_C CC_REGNUM) rather than
(reg:CC ...).)

> +       (match_dup 0)
> +       (match_operand:GPI 4 "immediate_operand" "i")))]

And this could be (match_dup 4).
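i.e. the template could then be simply:

  [(set (match_dup 0)
        (if_then_else:GPI (match_dup 3)
                          (match_dup 0)
                          (match_dup 4)))]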

> +  {
> +
> +    if (REG_P (operands[2]))
> +      {
> +     switch (<CODE>)
> +       {
> +       case US_MINUS:
> +         emit_insn (gen_sub<mode>3_compare1 (operands[0], operands[1],
> +                                             operands[2]));
> +         break;
> +       case US_PLUS:
> +         emit_insn (gen_add<mode>3_compare0 (operands[0], operands[1],
> +                                             operands[2]));
> +         break;
> +       default:
> +         break;
> +       }
> +      }
> +    else
> +      {
> +     unsigned long imm = UINTVAL (operands[2]);
> +     gcc_assert (imm != 0);
> +     rtx neg_imm = gen_int_mode (-imm, <MODE>mode);
> +     switch (<CODE>)
> +       {
> +       case US_MINUS:
> +         emit_insn (gen_sub<mode>3_compare1_imm (operands[0], operands[1],
> +                                                 operands[2], neg_imm));
> +         break;
> +       case US_PLUS:
> +         emit_insn (gen_sub<mode>3_compare1_imm (operands[0], operands[1],
> +                                                 neg_imm, operands[2]));
> +         break;
> +       default:
> +         break;
> +       }
> +      }
> +
> +    rtx ccin = gen_rtx_REG (CC_Cmode, CC_REGNUM);
> +    switch (<CODE>)
> +      {
> +      case US_PLUS:
> +     operands[3] = gen_rtx_LTU (<MODE>mode, ccin, const0_rtx);
> +     operands[4] = gen_int_mode (-1, <MODE>mode);
> +     break;
> +      case US_MINUS:
> +     operands[3] = gen_rtx_GEU (<MODE>mode, ccin, const0_rtx);
> +     operands[4] = const0_rtx;
> +     break;
> +      default:
> +     break;
> +      }
> +  }
> +)
> +
>  ;; suqadd and usqadd
>  
>  (define_insn "aarch64_<sur>qadd<mode><vczle><vczbe>"
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 0bc98315bb6..10720b5e66b 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -93,6 +93,10 @@
>  ;; integer modes; 64-bit scalar integer mode.
>  (define_mode_iterator VSDQ_I_DI [V8QI V16QI V4HI V8HI V2SI V4SI V2DI DI])
>  
> +;; Advanced SIMD and scalar, 64 & 128-bit container; 8 and 16-bit scalar
> +;; integer modes.
> +(define_mode_iterator VSDQ_I_QI_HI [V8QI V16QI V4HI V8HI V2SI V4SI V2DI HI QI])

I think this is one case where inheriting from VSDQ_I would be worthwhile:

(define_mode_iterator VSDQ_I_QI_HI [VSDQ_I QI HI])

> +
>  ;; Double vector modes.
>  (define_mode_iterator VD [V8QI V4HI V4HF V2SI V2SF V4BF])
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> new file mode 100644
> index 00000000000..2b72be7b0d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> @@ -0,0 +1,79 @@
> +/* { dg-do assemble { target { aarch64*-*-* } } } */
> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
> +
> +/*
> +** uadd_lane: { xfail *-*-* }
> +**   dup\tv([0-9]+).8b, w0
> +**   uqadd\tb([0-9]+), (?:b\1, b0|b0, b\1)
> +**   umov\tw0, v\2.b\[0\]
> +**   ret
> +*/

What's the reason behind the xfail?  Is it the early-ra thing, or
something else?  (You might already have covered this, sorry.)

xfailing is fine if it needs further optimisation, was just curious :)

> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/saturating-arithmetic-signed.c b/gcc/testsuite/gcc.target/aarch64/saturating-arithmetic-signed.c
> new file mode 100644
> index 00000000000..0fc6804683a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/saturating-arithmetic-signed.c
> @@ -0,0 +1,270 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 --save-temps -mearly-ra=none" } */

It'd be worth adding -fno-schedule-insns2 here.  Same for
saturating_arithmetic_1.c and saturating_arithmetic_2.c.  The reason
is that:

> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <limits.h>
> +#include <stdbool.h>
> +#include <stdint.h>
> +
> +/*
> +** sadd32:
> +**   asr     w([0-9]+), w1, 31
> +**   adds    w([0-9]+), (?:w0, w1|w1, w0)
> +**   eor     w\1, w\1, -2147483648
> +**   csinv   w0, w\2, w\1, vc
> +**   ret
> +*/

...the first two instructions can be in either order, and similarly
for the second and third.

Really nice tests though :)

Like Kyrill says, thanks a lot for doing this.

Richard

> +int32_t __attribute__((noipa))
> +sadd32 (int32_t __a, int32_t __b)
> +{
> +  int32_t sum;
> +  bool overflow = __builtin_add_overflow (__a, __b, &sum);
> +  return !overflow ? sum : __a < 0 ? INT_MIN : INT_MAX;
> +}
> [...]
