Jonathan Wright via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi,
>
> V2 of the patch addresses the initial review comments, factors out
> common code (as we discussed off-list) and adds a set of unit tests
> to verify the code generation benefit.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  <jonathan.wri...@arm.com>
>
>         * config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
>         Define.
>         (aarch64_rtx_mult_cost): Traverse RTL tree to prevent
>         vec_select cost from being added into Neon multiply cost.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/vmul_element_cost.c: New test.
>
>
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: 22 July 2021 18:16
> To: Jonathan Wright <jonathan.wri...@arm.com>
> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost
>
> Jonathan Wright <jonathan.wri...@arm.com> writes:
>> Hi,
>>
>> The Neon multiply/multiply-accumulate/multiply-subtract instructions
>> can take various forms - multiplying full vector registers of values
>> or multiplying one vector by a single element of another.  Regardless
>> of the form used, these instructions have the same cost, and this
>> should be reflected by the RTL cost function.
>>
>> This patch adds RTL tree traversal in the Neon multiply cost function
>> to match the vec_select used by the lane-referencing forms of the
>> instructions already mentioned.  This traversal prevents the cost of
>> the vec_select from being added into the cost of the multiply -
>> meaning that these instructions can now be emitted in the combine
>> pass as they are no longer deemed prohibitively expensive.
>>
>> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
>> issues.
>>
>> Ok for master?
>>
>> Thanks,
>> Jonathan
>>
>> ---
>>
>> gcc/ChangeLog:
>>
>> 2021-07-19  Jonathan Wright  <jonathan.wri...@arm.com>
>>
>>         * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
>>         RTL tree to prevent vec_select from being added into Neon
>>         multiply cost.
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>>         op0 = XEXP (op0, 0);
>>        else if (GET_CODE (op1) == VEC_DUPLICATE)
>>         op1 = XEXP (op1, 0);
>> +      /* The same argument applies to the VEC_SELECT when using the lane-
>> +         referencing forms of the MUL/MLA/MLS instructions.  Without the
>> +         traversal here, the combine pass deems these patterns too
>> +         expensive and subsequently does not emit the lane-referencing
>> +         forms of the instructions.  In addition, canonical form is for the
>> +         VEC_SELECT to be the second argument of the multiply - thus only
>> +         op1 is traversed.  */
>> +      if (GET_CODE (op1) == VEC_SELECT
>> +          && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
>> +        op1 = XEXP (op1, 0);
>> +      else if ((GET_CODE (op1) == ZERO_EXTEND
>> +                || GET_CODE (op1) == SIGN_EXTEND)
>> +               && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
>> +               && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
>> +        op1 = XEXP (XEXP (op1, 0), 0);
>
> I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
> if block, since the condition is never true otherwise.  We can probably
> skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
> it would be better to add them to the existing VEC_DUPLICATE tests rather
> than restrict them to the VEC_SELECT ones.
>
> Also, although this is in Advanced SIMD-specific code, I think it'd be
> better to use:
>
>   is_a<scalar_mode> (GET_MODE (op1))
>
> instead of:
>
>   GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1
>
> Do you have a testcase?
>
> Thanks,
> Richard
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 3bdf19d71b54d0ade8e5648323f6e1f012bc4f8f..5809887997305317c5a81421089db431685e2927 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11908,6 +11908,26 @@ aarch64_strip_extend (rtx x, bool strip_shift)
>    return x;
>  }
>
> +
> +/* Helper function for rtx cost calculation.  Strip VEC_DUPLICATE as well as
> +   any subsequent extend and VEC_SELECT from X.  Returns the inner scalar
> +   operand if successful, or the original expression on failure.  */
> +static rtx
> +aarch64_strip_duplicate_vec_elt (rtx x)
> +{
> +  if (GET_CODE (x) == VEC_DUPLICATE
> +      && is_a<scalar_mode> (GET_MODE (XEXP (x, 0))))
> +    {
> +      x = XEXP (x, 0);
> +      if (GET_CODE (x) == VEC_SELECT)
> +        x = XEXP (x, 0);
> +      else if ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
> +               && GET_CODE (XEXP (x, 0)) == VEC_SELECT)
> +        x = XEXP (XEXP (x, 0), 0);
> +    }
> +  return x;
> +}
> +
>  /* Return true iff CODE is a shift supported in combination
>     with arithmetic instructions.  */
>
> @@ -11977,14 +11997,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>      {
>        /* The by-element versions of the instruction have the same costs as
>           the normal 3-vector version.  So don't add the costs of the
> -         duplicate into the costs of the multiply.  We make an assumption
> -         that the input to the VEC_DUPLICATE is already on the FP & SIMD
> -         side.  This means costing of a MUL by element pre RA is a bit
> -         optimistic.  */
> +         duplicate or subsequent select into the costs of the multiply.  We

Very pedantic, but: the select conceptually happens before the duplicate.
TBH I think we can probably just drop this sentence, since the calls make
the operation self-documenting.  (The other parts of the comment are still
useful.)

> +         make an assumption that the input to the VEC_DUPLICATE is already
> +         on the FP & SIMD side.  This means costing of a MUL by element pre
> +         RA is a bit optimistic.  */
>        if (GET_CODE (op0) == VEC_DUPLICATE)
> -        op0 = XEXP (op0, 0);
> +        op0 = aarch64_strip_duplicate_vec_elt (op0);
>        else if (GET_CODE (op1) == VEC_DUPLICATE)
> -        op1 = XEXP (op1, 0);
> +        op1 = aarch64_strip_duplicate_vec_elt (op1);

I think we might as well call aarch64_strip_duplicate_vec_elt
unconditionally, without the VEC_DUPLICATE tests.

OK with those changes, and sorry for the slow review.

Thanks,
Richard

>      }
>    cost += rtx_cost (op0, mode, MULT, 0, speed);
>    cost += rtx_cost (op1, mode, MULT, 1, speed);
> diff --git a/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c b/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c153775f0914072fb985b18516f110aded7dccd5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c
> @@ -0,0 +1,94 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +
> +#include <arm_neon.h>
> +
> +#define TEST_MUL_UNIFORM(name, q, vectype, ts) \
> +  vectype test_ ## name ## q ## _ ## ts (vectype a, vectype b, vectype c) \
> +  { \
> +    vectype t0 = name ## q ## _n_ ## ts (a, c[1]); \
> +    vectype t1 = name ## q ## _n_ ## ts (b, c[1]); \
> +    return vmul ## q ## _ ## ts (t0, t1); \
> +  }
> +
> +TEST_MUL_UNIFORM (vmul, , int16x4_t, s16)
> +TEST_MUL_UNIFORM (vmul, , uint16x4_t, u16)
> +TEST_MUL_UNIFORM (vmul, , int32x2_t, s32)
> +TEST_MUL_UNIFORM (vmul, , uint32x2_t, u32)
> +TEST_MUL_UNIFORM (vmul, , float32x2_t, f32)
> +TEST_MUL_UNIFORM (vmul, q, int16x8_t, s16)
> +TEST_MUL_UNIFORM (vmul, q, uint16x8_t, u16)
> +TEST_MUL_UNIFORM (vmul, q, int32x4_t, s32)
> +TEST_MUL_UNIFORM (vmul, q, uint32x4_t, u32)
> +TEST_MUL_UNIFORM (vmul, q, float32x4_t, f32)
> +TEST_MUL_UNIFORM (vmul, q, float64x2_t, f64)
> +
> +#define TEST_MLX_UNIFORM(name, q, vectype, ts) \
> +  vectype test_ ## name ## q ## _ ## ts (vectype acc, vectype a, vectype b) \
> +  { \
> +    acc = name ## q ## _n_ ## ts (acc, a, b[1]); \
> +    return name ## q ## _n_ ## ts (acc, a, b[1]); \
> +  }
> +
> +TEST_MLX_UNIFORM (vmla, , int16x4_t, s16)
> +TEST_MLX_UNIFORM (vmla, , uint16x4_t, u16)
> +TEST_MLX_UNIFORM (vmla, , int32x2_t, s32)
> +TEST_MLX_UNIFORM (vmla, , uint32x2_t, u32)
> +TEST_MLX_UNIFORM (vmla, , float32x2_t, f32)
> +TEST_MLX_UNIFORM (vmla, q, int16x8_t, s16)
> +TEST_MLX_UNIFORM (vmla, q, uint16x8_t, u16)
> +TEST_MLX_UNIFORM (vmla, q, int32x4_t, s32)
> +TEST_MLX_UNIFORM (vmla, q, uint32x4_t, u32)
> +TEST_MLX_UNIFORM (vmla, q, float32x4_t, f32)
> +
> +TEST_MLX_UNIFORM (vmls, , int16x4_t, s16)
> +TEST_MLX_UNIFORM (vmls, , uint16x4_t, u16)
> +TEST_MLX_UNIFORM (vmls, , int32x2_t, s32)
> +TEST_MLX_UNIFORM (vmls, , uint32x2_t, u32)
> +TEST_MLX_UNIFORM (vmls, , float32x2_t, f32)
> +TEST_MLX_UNIFORM (vmls, q, int16x8_t, s16)
> +TEST_MLX_UNIFORM (vmls, q, uint16x8_t, u16)
> +TEST_MLX_UNIFORM (vmls, q, int32x4_t, s32)
> +TEST_MLX_UNIFORM (vmls, q, uint32x4_t, u32)
> +TEST_MLX_UNIFORM (vmls, q, float32x4_t, f32)
> +
> +#define TEST_MUL_LONG(name, rettype, intype, ts, rs) \
> +  rettype test_ ## name ## ts (intype a, intype b, intype c) \
> +  { \
> +    rettype t0 = name ## ts (a, c[1]); \
> +    rettype t1 = name ## ts (b, c[1]); \
> +    return vqaddq ## _ ## rs (t0, t1); \
> +  }
> +
> +TEST_MUL_LONG (vmull_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MUL_LONG (vmull_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MUL_LONG (vmull_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MUL_LONG (vmull_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MUL_LONG (vqdmull_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MUL_LONG (vqdmull_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +#define TEST_MLX_LONG(name, rettype, intype, ts, rs) \
> +  rettype test_ ## name ## _ ## ts (rettype acc, intype a, intype b) \
> +  { \
> +    acc = name ## ts (acc, a, b[1]); \
> +    return name ## ts (acc, a, b[1]); \
> +  }
> +
> +TEST_MLX_LONG (vmlal_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vmlal_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MLX_LONG (vmlal_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MLX_LONG (vmlal_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MLX_LONG (vmlsl_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vmlsl_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MLX_LONG (vmlsl_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MLX_LONG (vmlsl_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MLX_LONG (vqdmlal_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vqdmlal_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +TEST_MLX_LONG (vqdmlsl_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vqdmlsl_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +/* { dg-final { scan-assembler-not "dup\\t" } } */
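
P.S. For anyone reading along, a rough hand-expansion of the first
TEST_MUL_UNIFORM invocation above (not part of the patch; every name in it
comes straight from the macro and <arm_neon.h>):

  #include <arm_neon.h>

  /* TEST_MUL_UNIFORM (vmul, , int16x4_t, s16) expands to roughly:  */
  int16x4_t
  test_vmul_s16 (int16x4_t a, int16x4_t b, int16x4_t c)
  {
    /* Two multiplies by the same lane of c...  */
    int16x4_t t0 = vmul_n_s16 (a, c[1]);
    int16x4_t t1 = vmul_n_s16 (b, c[1]);
    /* ...combined with a plain vector multiply.  */
    return vmul_s16 (t0, t1);
  }

As I read the cover letter, the point is that with the corrected costs,
combine keeps the by-element forms of the multiplies rather than
materialising the lane with a separate DUP, which is what the final
scan-assembler-not "dup\\t" directive checks.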