Akram Ahmad <akram.ah...@arm.com> writes:
> Hi Kyrill,
>
> Thanks for the feedback on V2. I found a pattern which works for
> the open-coded signed arithmetic, and I've implemented the other
> feedback you provided as well.
>
> I've sent the modified patch in this thread as the SVE patch [2/2]
> hasn't been changed, but I'm happy to send the entire V3 patch
> series as a new thread if that's easier. Patch continues below.
>
> If this is OK, please could you commit on my behalf?
>
> Many thanks,
>
> Akram
>
> ---
>
> This renames the existing {s,u}q{add,sub} instructions to use the
> standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and
> IFN_SAT_SUB.
>
> The NEON intrinsics for saturating arithmetic and their corresponding
> builtins are changed to use these standard names too.
>
> Using the standard names for the instructions causes 32 and 64-bit
> unsigned scalar saturating arithmetic to use the NEON instructions,
> resulting in an additional (and inefficient) FMOV being generated when
> the original operands are in GP registers. This patch therefore also
> restores the original behaviour of using the adds/subs instructions
> in this circumstance.
>
> Furthermore, this patch introduces a new optimisation for signed 32
> and 64-bit scalar saturating arithmetic which uses adds/subs in place
> of the NEON instruction.
>
> Addition, before:
>     fmov    d0, x0
>     fmov    d1, x1
>     sqadd   d0, d0, d1
>     fmov    x0, d0
>
> Addition, after:
>     asr     x2, x1, 63
>     adds    x0, x0, x1
>     eor     x2, x2, 0x8000000000000000
>     csinv   x0, x0, x2, vc
>
> In the above example, subtraction replaces the adds with subs and the
> csinv with csel. The 32-bit case follows the same approach. Arithmetic
> with a constant operand is simplified further by directly storing the
> saturating limit in the temporary register, resulting in only three
> instructions being used.
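(For reference, the constant-operand case mentioned above should then come
out as something like

    mov     x2, 9223372036854775807
    adds    x0, x0, 16
    csel    x0, x0, x2, vc

for a saturating addition of a positive constant.  This sequence is
inferred from the constant arm of the split below rather than taken from
the patch or a compiler run, so the exact instructions may differ.)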
> It is important to note that this only works when early-ra is disabled
> due to an early-ra bug which erroneously assigns FP registers to the
> operands; if early-ra is enabled, then the original behaviour (NEON
> instruction) occurs.

This can be fixed by changing:

    case CT_REGISTER:
      if (REG_P (op) || SUBREG_P (op))
	return true;
      break;

to:

    case CT_REGISTER:
      if (REG_P (op) || SUBREG_P (op) || GET_CODE (op) == SCRATCH)
	return true;
      break;

But I can test & post that as a follow-up if you prefer.

> Additional tests are written for the scalar and Adv. SIMD cases to
> ensure that the correct instructions are used. The NEON intrinsics are
> already tested elsewhere. The signed scalar case is also tested with
> an execution test to check the results.

It looks like this is based on a relatively old version of trunk.
(Probably from the same time as v1?  Sorry that this has been in
review for a while.)

Otherwise it mostly LGTM too, but some additional comments on top
of Kyrill's:

> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index e456f693d2f..ef5e2823673 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -5230,15 +5230,225 @@
>  )
>  ;; <su>q<addsub>
>
> -(define_insn "aarch64_<su_optab>q<addsub><mode><vczle><vczbe>"
> -  [(set (match_operand:VSDQ_I 0 "register_operand" "=w")
> -	(BINQOPS:VSDQ_I (match_operand:VSDQ_I 1 "register_operand" "w")
> -			(match_operand:VSDQ_I 2 "register_operand" "w")))]
> +(define_insn "<su_optab>s<addsub><mode>3<vczle><vczbe>"
> +  [(set (match_operand:VSDQ_I_QI_HI 0 "register_operand" "=w")
> +	(BINQOPS:VSDQ_I_QI_HI (match_operand:VSDQ_I_QI_HI 1 "register_operand" "w")
> +			      (match_operand:VSDQ_I_QI_HI 2 "register_operand" "w")))]

Very minor, wouldn't have raised it if it wasn't for the comments below,
but: it'd be good to put the input match_operands on their own line now
that this overflows 80 chars.

>    "TARGET_SIMD"
>    "<su_optab>q<addsub>\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>"
>    [(set_attr "type" "neon_q<addsub><q>")]
>  )
>
> +(define_expand "<su_optab>s<addsub><mode>3"
> +  [(parallel [(set (match_operand:GPI 0 "register_operand")
> +      (SBINQOPS:GPI (match_operand:GPI 1 "register_operand")
> +		    (match_operand:GPI 2 "aarch64_plus_operand")))
> +    (clobber (scratch:GPI))
> +    (clobber (reg:CC CC_REGNUM))])]

Likewise very minor, but I think a more usual formatting would be:

  [(parallel [(set (match_operand:GPI 0 "register_operand")
		   (SBINQOPS:GPI (match_operand:GPI 1 "register_operand")
				 (match_operand:GPI 2 "aarch64_plus_operand")))
	      (clobber (scratch:GPI))
	      (clobber (reg:CC CC_REGNUM))])]

> +)
> +
> +;; Introducing a temporary GP reg allows signed saturating arithmetic with GPR
> +;; operands to be calculated without the use of costly transfers to and from FP
> +;; registers. For example, saturating addition usually uses three FMOVs:
> +;;
> +;;     fmov    d0, x0
> +;;     fmov    d1, x1
> +;;     sqadd   d0, d0, d1
> +;;     fmov    x0, d0
> +;;
> +;; Using a temporary register results in three cheaper instructions being used
> +;; in place of the three FMOVs, which calculate the saturating limit accounting
> +;; for the signedness of operand2:
> +;;
> +;;     asr     x2, x1, 63
> +;;     adds    x0, x0, x1
> +;;     eor     x2, x2, 0x8000000000000000
> +;;     csinv   x0, x0, x2, vc
> +;;
> +;; If operand2 is a constant value, the temporary register can be used to store
> +;; the saturating limit without the need for asr, xor to calculate said limit.
> +
> +(define_insn_and_split "aarch64_<su_optab>s<addsub><mode>3<vczle><vczbe>"
> +  [(set (match_operand:GPI 0 "register_operand")
> +	(SBINQOPS:GPI (match_operand:GPI 1 "register_operand")
> +		      (match_operand:GPI 2 "aarch64_plus_operand")))
> +   (clobber (match_scratch:GPI 3))
> +   (clobber (reg:CC CC_REGNUM))]
> +  ""
> +  {@ [ cons: =0, 1 , 2   , =3 ; attrs: type       , arch , length ]
> +     [ w       , w , w   , X  ; neon_q<addsub><q> , simd , 4      ] <su_optab>q<addsub>\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>
> +     [ r       , r , JIr , &r ; *                 , *    , 8      ] #
> +  }
> +  "&& reload_completed && GP_REGNUM_P (REGNO (operands[0]))"
> +  [(set (match_dup 0)
> +	(if_then_else:GPI
> +	  (match_operator 4 "comparison_operator" [(reg:CC_V CC_REGNUM) (const_int 0)])

The comparison_operator predicate isn't applied, since operand 4 is
always provided by the split pattern.  (Seems like a missing diagnostic
from the gen* machinery.)  The way to write it as a dup is:

  (match_op_dup 4 [(reg:CC_V CC_REGNUM) (const_int 0)])

but it's simpler as just:

  (match_dup 4)

since the split code generates the full expression itself.

> +	  (match_dup 5)
> +	  (match_dup 6)))]
> +  {
> +    if (REG_P (operands[2]))
> +      {
> +	switch (<MODE>mode)
> +	  {
> +	  case SImode:
> +	    emit_insn (gen_ashr<mode>3 (operands[3], operands[2],
> +					gen_int_mode (31, <MODE>mode)));
> +	    emit_insn (gen_xor<mode>3 (operands[3], operands[3],
> +				       gen_int_mode (0x80000000, <MODE>mode)));
> +	    break;
> +	  case DImode:
> +	    emit_insn (gen_ashr<mode>3 (operands[3], operands[2],
> +					gen_int_mode (63, <MODE>mode)));
> +	    emit_insn (gen_xor<mode>3 (operands[3], operands[3],
> +				       gen_int_mode (0x8000000000000000,
> +						     <MODE>mode)));
> +	    break;
> +	  default:
> +	    break;
> +	  }
> +	switch (<CODE>)
> +	  {
> +	  case SS_MINUS:
> +	    emit_insn (gen_sub<mode>3_compare1 (operands[0], operands[1],
> +						operands[2]));
> +	    break;
> +	  case SS_PLUS:
> +	    emit_insn (gen_add<mode>3_compare0 (operands[0], operands[1],
> +						operands[2]));
> +	    break;
> +	  default:
> +	    break;
> +	  }
> +
> +	rtx ccin = gen_rtx_REG (E_CC_Vmode, CC_REGNUM);
> +	switch (<CODE>)
> +	  {
> +	  case SS_PLUS:
> +	    operands[4] = gen_rtx_NE (<MODE>mode, ccin, const0_rtx);
> +	    operands[5] = gen_rtx_NOT (<MODE>mode, operands[3]);
> +	    operands[6] = operands[0];
> +	    break;
> +	  case SS_MINUS:
> +	    operands[4] = gen_rtx_EQ (<MODE>mode, ccin, const0_rtx);
> +	    operands[5] = operands[0];
> +	    operands[6] = operands[3];
> +	    break;
> +	  default:
> +	    break;
> +	  }
> +      }
> +    else
> +      {
> +	long imm = INTVAL (operands[2]);

If this uses a named type, it should be HOST_WIDE_INT rather than
"long".  But "auto imm" is probably easier.

> +	gcc_assert (imm != 0);

The constraints do allow 0, so I'm not sure this assert is safe.
Certainly we shouldn't usually get unfolded instructions, but strange
things can happen with fuzzed options.  Does the code mishandle that
case?  It looked like it should be ok.

> +	rtx neg_imm = gen_int_mode (-imm, <MODE>mode);
> +	wide_int limit;
> +
> +	switch (<CODE>)
> +	  {
> +	  case SS_MINUS:
> +	    emit_insn (gen_sub<mode>3_compare1_imm (operands[0], operands[1],
> +						    operands[2], neg_imm));
> +	    limit = (imm >> 63) + 1 ? wi::min_value (<MODE>mode, SIGNED)
> +				    : wi::max_value (<MODE>mode, SIGNED);

INTVALs are always stored in sign-extended form, and we're not supposed
to rely on >> being an arithmetic shift, so this is IMO more obvious as

  imm >= 0 ? ... : ...

Same for SS_PLUS.

> +	    break;
> +	  case SS_PLUS:
> +	    emit_insn (gen_sub<mode>3_compare1_imm (operands[0], operands[1],
> +						    neg_imm, operands[2]));
> +	    limit = (imm >> 63) + 1 ? wi::max_value (<MODE>mode, SIGNED)
> +				    : wi::min_value (<MODE>mode, SIGNED);
> +	    break;
> +	  default:
> +	    break;
> +	  }
> +
> +	rtx sat_limit = immed_wide_int_const (limit, <MODE>mode);
> +	emit_insn (gen_rtx_SET (operands[3], sat_limit));
> +
> +	rtx ccin = gen_rtx_REG (E_CC_Vmode, CC_REGNUM);
> +	operands[4] = gen_rtx_EQ (<MODE>mode, ccin, const0_rtx);
> +	operands[5] = operands[0];
> +	operands[6] = operands[3];
> +      }
> +  }
> +)
> +
> +;; Unsigned saturating arithmetic with GPR operands can be optimised similarly
> +;; to the signed case, albeit without the need for a temporary register as the
> +;; saturating limit can be inferred from the <addsub> code. This applies only
> +;; to SImode and DImode.
> +
> +(define_insn_and_split "<su_optab>s<addsub><mode>3<vczle><vczbe>"
> +  [(set (match_operand:GPI 0 "register_operand")
> +	(UBINQOPS:GPI (match_operand:GPI 1 "register_operand")
> +		      (match_operand:GPI 2 "aarch64_plus_operand")))
> +   (clobber (reg:CC CC_REGNUM))]
> +  ""
> +  {@ [ cons: =0, 1 , 2   ; attrs: type       , arch , length ]
> +     [ w       , w , w   ; neon_q<addsub><q> , simd , 4      ] <su_optab>q<addsub>\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>
> +     [ r       , r , JIr ; *                 , *    , 8      ] #
> +  }
> +  "&& reload_completed && GP_REGNUM_P (REGNO (operands[0]))"
> +  [(set (match_dup 0)
> +	(if_then_else:GPI
> +	  (match_operator 3 "comparison_operator" [(reg:CC CC_REGNUM) (const_int 0)])

Similarly to the above, this could be (match_dup 3).  (It matters
a little more here, since the code produces (reg:CC_C CC_REGNUM)
rather than (reg:CC ...).)

> +	  (match_dup 0)
> +	  (match_operand:GPI 4 "immediate_operand" "i")))]

And this could be (match_dup 4).

> +  {
> +
> +    if (REG_P (operands[2]))
> +      {
> +	switch (<CODE>)
> +	  {
> +	  case US_MINUS:
> +	    emit_insn (gen_sub<mode>3_compare1 (operands[0], operands[1],
> +						operands[2]));
> +	    break;
> +	  case US_PLUS:
> +	    emit_insn (gen_add<mode>3_compare0 (operands[0], operands[1],
> +						operands[2]));
> +	    break;
> +	  default:
> +	    break;
> +	  }
> +      }
> +    else
> +      {
> +	unsigned long imm = UINTVAL (operands[2]);
> +	gcc_assert (imm != 0);
> +	rtx neg_imm = gen_int_mode (-imm, <MODE>mode);
> +	switch (<CODE>)
> +	  {
> +	  case US_MINUS:
> +	    emit_insn (gen_sub<mode>3_compare1_imm (operands[0], operands[1],
> +						    operands[2], neg_imm));
> +	    break;
> +	  case US_PLUS:
> +	    emit_insn (gen_sub<mode>3_compare1_imm (operands[0], operands[1],
> +						    neg_imm, operands[2]));
> +	    break;
> +	  default:
> +	    break;
> +	  }
> +      }
> +
> +    rtx ccin = gen_rtx_REG (CC_Cmode, CC_REGNUM);
> +    switch (<CODE>)
> +      {
> +      case US_PLUS:
> +	operands[3] = gen_rtx_LTU (<MODE>mode, ccin, const0_rtx);
> +	operands[4] = gen_int_mode (-1, <MODE>mode);
> +	break;
> +      case US_MINUS:
> +	operands[3] = gen_rtx_GEU (<MODE>mode, ccin, const0_rtx);
> +	operands[4] = const0_rtx;
> +	break;
> +      default:
> +	break;
> +      }
> +  }
> +)
> +
>  ;; suqadd and usqadd
>
>  (define_insn "aarch64_<sur>qadd<mode><vczle><vczbe>"
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 0bc98315bb6..10720b5e66b 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -93,6 +93,10 @@
>  ;; integer modes; 64-bit scalar integer mode.
>  (define_mode_iterator VSDQ_I_DI [V8QI V16QI V4HI V8HI V2SI V4SI V2DI DI])
>
> +;; Advanced SIMD and scalar, 64 & 128-bit container; 8 and 16-bit scalar
> +;; integer modes.
> +(define_mode_iterator VSDQ_I_QI_HI [V8QI V16QI V4HI V8HI V2SI V4SI V2DI HI QI])

I think this is one case where inheriting from VSDQ_I would be
worthwhile:

  (define_mode_iterator VSDQ_I_QI_HI [VSDQ_I QI HI])

> +
>  ;; Double vector modes.
>  (define_mode_iterator VD [V8QI V4HI V4HF V2SI V2SF V4BF])
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> new file mode 100644
> index 00000000000..2b72be7b0d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> @@ -0,0 +1,79 @@
> +/* { dg-do assemble { target { aarch64*-*-* } } } */
> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
> +
> +/*
> +** uadd_lane: { xfail *-*-* }
> +**	dup\tv([0-9]+).8b, w0
> +**	uqadd\tb([0-9]+), (?:b\1, b0|b0, b\1)
> +**	umov\tw0, v\2.b\[0\]
> +**	ret
> +*/

What's the reason behind the xfail?  Is it the early-ra thing, or
something else?  (You might already have covered this, sorry.)
xfailing is fine if it needs further optimisation, was just curious :)

> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/saturating-arithmetic-signed.c b/gcc/testsuite/gcc.target/aarch64/saturating-arithmetic-signed.c
> new file mode 100644
> index 00000000000..0fc6804683a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/saturating-arithmetic-signed.c
> @@ -0,0 +1,270 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 --save-temps -mearly-ra=none" } */

It'd be worth adding -fno-schedule-insns2 here.  Same for
saturating_arithmetic_1.c and saturating_arithmetic_2.c.
The reason is that:

> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <limits.h>
> +#include <stdbool.h>
> +#include <stdint.h>
> +
> +/*
> +** sadd32:
> +**	asr	w([0-9]+), w1, 31
> +**	adds	w([0-9]+), (?:w0, w1|w1, w0)
> +**	eor	w\1, w\1, -2147483648
> +**	csinv	w0, w\2, w\1, vc
> +**	ret
> +*/

...the first two instructions can be in either order, and similarly
for the second and third.

Really nice tests though :)  Like Kyrill says, thanks a lot for
doing this.

Richard

> +int32_t __attribute__((noipa))
> +sadd32 (int32_t __a, int32_t __b)
> +{
> +  int32_t sum;
> +  bool overflow = __builtin_add_overflow (__a, __b, &sum);
> +  return !overflow ? sum : __a < 0 ? INT_MIN : INT_MAX;
> +}
> [...]
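(For comparison with the quoted sadd32 above, an unsigned counterpart would
look something like the sketch below.  This is illustrative only and is not
taken from the patch's own saturating_arithmetic_* tests; it just shows the
kind of source the unsigned GPR pattern discussed earlier is meant to catch,
which per the unsigned split above should come out as an adds plus a single
conditional select rather than a round trip through the FP registers.

  uint32_t __attribute__((noipa))
  uadd32 (uint32_t __a, uint32_t __b)
  {
    uint32_t sum;
    bool overflow = __builtin_add_overflow (__a, __b, &sum);
    /* Unsigned addition saturates to the type maximum on overflow.  */
    return !overflow ? sum : UINT_MAX;
  }

)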