Hi,

V2 of the patch addresses the initial review comments, factors out common code (as we discussed off-list) and adds a set of unit tests to verify the code-generation benefit.
Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright  <jonathan.wri...@arm.com>

	* config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
	Define.
	(aarch64_rtx_mult_cost): Traverse RTL tree to prevent
	vec_select cost from being added into Neon multiply cost.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/vmul_element_cost.c: New test.

From: Richard Sandiford <richard.sandif...@arm.com>
Sent: 22 July 2021 18:16
To: Jonathan Wright <jonathan.wri...@arm.com>
Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
Subject: Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost

Jonathan Wright <jonathan.wri...@arm.com> writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can take various forms - multiplying full vector registers of values
> or multiplying one vector by a single element of another. Regardless
> of the form used, these instructions have the same cost, and this
> should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match the vec_select used by the lane-referencing forms of the
> instructions already mentioned. This traversal prevents the cost of
> the vec_select from being added into the cost of the multiply -
> meaning that these instructions can now be emitted in the combine
> pass as they are no longer deemed prohibitively expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  <jonathan.wri...@arm.com>
>
>	* config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
>	RTL tree to prevent vec_select from being added into Neon
>	multiply cost.
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>  	    op0 = XEXP (op0, 0);
>  	  else if (GET_CODE (op1) == VEC_DUPLICATE)
>  	    op1 = XEXP (op1, 0);
> +	  /* The same argument applies to the VEC_SELECT when using the lane-
> +	     referencing forms of the MUL/MLA/MLS instructions. Without the
> +	     traversal here, the combine pass deems these patterns too
> +	     expensive and subsequently does not emit the lane-referencing
> +	     forms of the instructions. In addition, canonical form is for the
> +	     VEC_SELECT to be the second argument of the multiply - thus only
> +	     op1 is traversed.  */
> +	  if (GET_CODE (op1) == VEC_SELECT
> +	      && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> +	    op1 = XEXP (op1, 0);
> +	  else if ((GET_CODE (op1) == ZERO_EXTEND
> +		    || GET_CODE (op1) == SIGN_EXTEND)
> +		   && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
> +		   && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> +	    op1 = XEXP (XEXP (op1, 0), 0);

I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
if block, since the condition is never true otherwise.  We can probably
skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
it would be better to add them to the existing VEC_DUPLICATE tests rather
than restrict them to the VEC_SELECT ones.

Also, although this is in Advanced SIMD-specific code, I think it'd be
better to use:

  is_a<scalar_mode> (GET_MODE (op1))

instead of:

  GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1

Do you have a testcase?

Thanks,
Richard
rb14675.patch