Jonathan Wright via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi,
>
> V2 of the patch addresses the initial review comments, factors out
> common code (as we discussed off-list) and adds a set of unit tests
> to verify the code generation benefit.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright  <jonathan.wri...@arm.com>
>
>         * config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt):
>         Define.
>         (aarch64_rtx_mult_cost): Traverse RTL tree to prevent
>         vec_select cost from being added into Neon multiply cost.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/vmul_element_cost.c: New test.
>
>
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: 22 July 2021 18:16
> To: Jonathan Wright <jonathan.wri...@arm.com>
> Cc: gcc-patches@gcc.gnu.org <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
> Subject: Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost
>
> Jonathan Wright <jonathan.wri...@arm.com> writes:
>> Hi,
>>
>> The Neon multiply/multiply-accumulate/multiply-subtract instructions
>> can take various forms - multiplying full vector registers of values
>> or multiplying one vector by a single element of another.  Regardless
>> of the form used, these instructions have the same cost, and this
>> should be reflected by the RTL cost function.
>>
>> This patch adds RTL tree traversal in the Neon multiply cost function
>> to match the vec_select used by the lane-referencing forms of the
>> instructions already mentioned.  This traversal prevents the cost of
>> the vec_select from being added into the cost of the multiply -
>> meaning that these instructions can now be emitted in the combine
>> pass as they are no longer deemed prohibitively expensive.
>>
>> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
>> issues.
>>
>> Ok for master?
>>
>> Thanks,
>> Jonathan
>>
>> ---
>>
>> gcc/ChangeLog:
>>
>> 2021-07-19  Jonathan Wright  <jonathan.wri...@arm.com>
>>
>>         * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse
>>         RTL tree to prevent vec_select from being added into Neon
>>         multiply cost.
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>>         op0 = XEXP (op0, 0);
>>        else if (GET_CODE (op1) == VEC_DUPLICATE)
>>         op1 = XEXP (op1, 0);
>> +      /* The same argument applies to the VEC_SELECT when using the lane-
>> +         referencing forms of the MUL/MLA/MLS instructions.  Without the
>> +         traversal here, the combine pass deems these patterns too
>> +         expensive and subsequently does not emit the lane-referencing
>> +         forms of the instructions.  In addition, canonical form is for the
>> +         VEC_SELECT to be the second argument of the multiply - thus only
>> +         op1 is traversed.  */
>> +      if (GET_CODE (op1) == VEC_SELECT
>> +          && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
>> +        op1 = XEXP (op1, 0);
>> +      else if ((GET_CODE (op1) == ZERO_EXTEND
>> +                || GET_CODE (op1) == SIGN_EXTEND)
>> +               && GET_CODE (XEXP (op1, 0)) == VEC_SELECT
>> +               && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1)
>> +        op1 = XEXP (XEXP (op1, 0), 0);
>
> I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE”
> if block, since the condition is never true otherwise.  We can probably
> skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think
> it would be better to add them to the existing VEC_DUPLICATE tests rather
> than restrict them to the VEC_SELECT ones.
>
> Also, although this is in Advanced SIMD-specific code, I think it'd be
> better to use:
>
>   is_a<scalar_mode> (GET_MODE (op1))
>
> instead of:
>
>   GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1
>
> Do you have a testcase?
>
> Thanks,
> Richard
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 3bdf19d71b54d0ade8e5648323f6e1f012bc4f8f..5809887997305317c5a81421089db431685e2927 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11908,6 +11908,26 @@ aarch64_strip_extend (rtx x, bool strip_shift)
>    return x;
>  }
>
> +
> +/* Helper function for rtx cost calculation.  Strip VEC_DUPLICATE as well as
> +   any subsequent extend and VEC_SELECT from X.  Returns the inner scalar
> +   operand if successful, or the original expression on failure.  */
> +static rtx
> +aarch64_strip_duplicate_vec_elt (rtx x)
> +{
> +  if (GET_CODE (x) == VEC_DUPLICATE
> +      && is_a<scalar_mode> (GET_MODE (XEXP (x, 0))))
> +    {
> +      x = XEXP (x, 0);
> +      if (GET_CODE (x) == VEC_SELECT)
> +        x = XEXP (x, 0);
> +      else if ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
> +               && GET_CODE (XEXP (x, 0)) == VEC_SELECT)
> +        x = XEXP (XEXP (x, 0), 0);
> +    }
> +  return x;
> +}
> +
>  /* Return true iff CODE is a shift supported in combination
>     with arithmetic instructions.  */
>
> @@ -11977,14 +11997,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>      {
>        /* The by-element versions of the instruction have the same costs as
>           the normal 3-vector version.  So don't add the costs of the
> -         duplicate into the costs of the multiply.  We make an assumption
> -         that the input to the VEC_DUPLICATE is already on the FP & SIMD
> -         side.  This means costing of a MUL by element pre RA is a bit
> -         optimistic.  */
> +         duplicate or subsequent select into the costs of the multiply.  We

Very pedantic, but: the select conceptually happens before the duplicate.
TBH I think we can probably just drop this sentence, since the calls make
the operation self-documenting.  (The other parts of the comment are still
useful.)

> +         make an assumption that the input to the VEC_DUPLICATE is already
> +         on the FP & SIMD side.  This means costing of a MUL by element pre
> +         RA is a bit optimistic.  */
>        if (GET_CODE (op0) == VEC_DUPLICATE)
> -        op0 = XEXP (op0, 0);
> +        op0 = aarch64_strip_duplicate_vec_elt (op0);
>        else if (GET_CODE (op1) == VEC_DUPLICATE)
> -        op1 = XEXP (op1, 0);
> +        op1 = aarch64_strip_duplicate_vec_elt (op1);

I think we might as well call aarch64_strip_duplicate_vec_elt
unconditionally, without the VEC_DUPLICATE tests.

OK with those changes, and sorry for the slow review.

Thanks,
Richard

>      }
>    cost += rtx_cost (op0, mode, MULT, 0, speed);
>    cost += rtx_cost (op1, mode, MULT, 1, speed);
> diff --git a/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c b/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c153775f0914072fb985b18516f110aded7dccd5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vmul_element_cost.c
> @@ -0,0 +1,94 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3" } */
> +
> +#include <arm_neon.h>
> +
> +#define TEST_MUL_UNIFORM(name, q, vectype, ts) \
> +  vectype test_ ## name ## q ## _ ## ts (vectype a, vectype b, vectype c) \
> +  { \
> +    vectype t0 = name ## q ## _n_ ## ts (a, c[1]); \
> +    vectype t1 = name ## q ## _n_ ## ts (b, c[1]); \
> +    return vmul ## q ## _ ## ts (t0, t1); \
> +  }
> +
> +TEST_MUL_UNIFORM (vmul, , int16x4_t, s16)
> +TEST_MUL_UNIFORM (vmul, , uint16x4_t, u16)
> +TEST_MUL_UNIFORM (vmul, , int32x2_t, s32)
> +TEST_MUL_UNIFORM (vmul, , uint32x2_t, u32)
> +TEST_MUL_UNIFORM (vmul, , float32x2_t, f32)
> +TEST_MUL_UNIFORM (vmul, q, int16x8_t, s16)
> +TEST_MUL_UNIFORM (vmul, q, uint16x8_t, u16)
> +TEST_MUL_UNIFORM (vmul, q, int32x4_t, s32)
> +TEST_MUL_UNIFORM (vmul, q, uint32x4_t, u32)
> +TEST_MUL_UNIFORM (vmul, q, float32x4_t, f32)
> +TEST_MUL_UNIFORM (vmul, q, float64x2_t, f64)
> +
> +#define TEST_MLX_UNIFORM(name, q, vectype, ts) \
> +  vectype test_ ## name ## q ## _ ## ts (vectype acc, vectype a, vectype b) \
> +  { \
> +    acc = name ## q ## _n_ ## ts (acc, a, b[1]); \
> +    return name ## q ## _n_ ## ts (acc, a, b[1]); \
> +  }
> +
> +TEST_MLX_UNIFORM (vmla, , int16x4_t, s16)
> +TEST_MLX_UNIFORM (vmla, , uint16x4_t, u16)
> +TEST_MLX_UNIFORM (vmla, , int32x2_t, s32)
> +TEST_MLX_UNIFORM (vmla, , uint32x2_t, u32)
> +TEST_MLX_UNIFORM (vmla, , float32x2_t, f32)
> +TEST_MLX_UNIFORM (vmla, q, int16x8_t, s16)
> +TEST_MLX_UNIFORM (vmla, q, uint16x8_t, u16)
> +TEST_MLX_UNIFORM (vmla, q, int32x4_t, s32)
> +TEST_MLX_UNIFORM (vmla, q, uint32x4_t, u32)
> +TEST_MLX_UNIFORM (vmla, q, float32x4_t, f32)
> +
> +TEST_MLX_UNIFORM (vmls, , int16x4_t, s16)
> +TEST_MLX_UNIFORM (vmls, , uint16x4_t, u16)
> +TEST_MLX_UNIFORM (vmls, , int32x2_t, s32)
> +TEST_MLX_UNIFORM (vmls, , uint32x2_t, u32)
> +TEST_MLX_UNIFORM (vmls, , float32x2_t, f32)
> +TEST_MLX_UNIFORM (vmls, q, int16x8_t, s16)
> +TEST_MLX_UNIFORM (vmls, q, uint16x8_t, u16)
> +TEST_MLX_UNIFORM (vmls, q, int32x4_t, s32)
> +TEST_MLX_UNIFORM (vmls, q, uint32x4_t, u32)
> +TEST_MLX_UNIFORM (vmls, q, float32x4_t, f32)
> +
> +#define TEST_MUL_LONG(name, rettype, intype, ts, rs) \
> +  rettype test_ ## name ## ts (intype a, intype b, intype c) \
> +  { \
> +    rettype t0 = name ## ts (a, c[1]); \
> +    rettype t1 = name ## ts (b, c[1]); \
> +    return vqaddq ## _ ## rs (t0, t1); \
> +  }
> +
> +TEST_MUL_LONG (vmull_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MUL_LONG (vmull_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MUL_LONG (vmull_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MUL_LONG (vmull_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MUL_LONG (vqdmull_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MUL_LONG (vqdmull_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +#define TEST_MLX_LONG(name, rettype, intype, ts, rs) \
> +  rettype test_ ## name ## _ ## ts (rettype acc, intype a, intype b) \
> +  { \
> +    acc = name ## ts (acc, a, b[1]); \
> +    return name ## ts (acc, a, b[1]); \
> +  }
> +
> +TEST_MLX_LONG (vmlal_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vmlal_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MLX_LONG (vmlal_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MLX_LONG (vmlal_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MLX_LONG (vmlsl_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vmlsl_n_, uint32x4_t, uint16x4_t, u16, u32)
> +TEST_MLX_LONG (vmlsl_n_, int64x2_t, int32x2_t, s32, s64)
> +TEST_MLX_LONG (vmlsl_n_, uint64x2_t, uint32x2_t, u32, u64)
> +
> +TEST_MLX_LONG (vqdmlal_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vqdmlal_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +TEST_MLX_LONG (vqdmlsl_n_, int32x4_t, int16x4_t, s16, s32)
> +TEST_MLX_LONG (vqdmlsl_n_, int64x2_t, int32x2_t, s32, s64)
> +
> +/* { dg-final { scan-assembler-not "dup\\t" } } */
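
P.S. For anyone reading along, a rough hand-expansion of the first
TEST_MUL_UNIFORM invocation above (not part of the patch; every name in it
comes straight from the macro and <arm_neon.h>):

  #include <arm_neon.h>

  /* TEST_MUL_UNIFORM (vmul, , int16x4_t, s16) expands to roughly:  */
  int16x4_t
  test_vmul_s16 (int16x4_t a, int16x4_t b, int16x4_t c)
  {
    /* Two multiplies by the same lane of c...  */
    int16x4_t t0 = vmul_n_s16 (a, c[1]);
    int16x4_t t1 = vmul_n_s16 (b, c[1]);
    /* ...combined with a plain vector multiply.  */
    return vmul_s16 (t0, t1);
  }

As I read the cover letter, the point is that with the corrected costs,
combine keeps the by-element forms of the multiplies rather than
materialising the lane with a separate DUP, which is what the final
scan-assembler-not "dup\\t" directive checks.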