Jonathan Wright <jonathan.wri...@arm.com> writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can take various forms - multiplying full vector registers of values
> or multiplying one vector by a single element of another. Regardless
> of the form used, these instructions have the same cost, and this
> should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match the vec_select used by the lane-referencing forms of the
> instructions already mentioned. This traversal prevents the cost of
> the vec_select from being added into the cost of the multiply -
> meaning that these instructions can now be emitted in the combine
> pass as they are no longer deemed prohibitively expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  <jonathan.wri...@arm.com>
>
>         * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
>         RTL tree to prevents vec_select from being added into Neon
>         multiply cost.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
> int outer, bool speed)
>           op0 = XEXP (op0, 0);
>         else if (GET_CODE (op1) == VEC_DUPLICATE)
>           op1 = XEXP (op1, 0);
> +       /* The same argument applies to the VEC_SELECT when using the lane-
> +          referencing forms of the MUL/MLA/MLS instructions. Without the
> +          traversal here, the combine pass deems these patterns too
> +          expensive and subsequently does not emit the lane-referencing
> +          forms of the instructions. In addition, canonical form is for the
> +          VEC_SELECT to be the second argument of the multiply - thus only
> +          op1 is traversed.  */
> +       if (GET_CODE (op1) == VEC_SELECT
> +           && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> +         op1 = XEXP (op1, 0);
> +       else if ((GET_CODE (op1) == ZERO_EXTEND
> +                 || GET_CODE (op1) == SIGN_EXTEND)
> +                && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
> +                && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
> +         op1 = XEXP (XEXP (op1, 0), 0);

I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
if block, since the condition is never true otherwise.  We can probably
skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
it would be better to add them to the existing VEC_DUPLICATE tests rather
than restrict them to the VEC_SELECT ones.

Also, although this is in Advanced SIMD-specific code, I think it'd be
better to use:

  is_a<scalar_mode> (GET_MODE (op1))

instead of:

  GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1

Do you have a testcase?

Thanks,
Richard

Reply via email to