Tamar Christina <tamar.christ...@arm.com> writes:
>> >> i.e. we use separate address arithmetic and avoid UMOVs.  Counting
>> >> two loads and one store for each element of the scatter store seems
>> >> like overkill for that.
>> >
>> > Hmm agreed..
>> >
>> > How about for stores we increase the load counts by count / 2?
>> >
>> > This would account for the fact that we know we have indexed stores
>> > and so the data-to-scalar operation is free?
>>
>> Yeah, sounds good.  We should probably divide count itself by 2,
>> then apply the new count to both the load heuristic and the general ops,
>> to avoid double-counting in both.  (The V pipe usage for stores is
>> modelled as part of the scalar_store itself.)  But like you say,
>> we should probably drop the - 1 from the load adjustment for stores,
>> because that - 1 would also be applied twice.
>>
>
> Here's a new version of the patch.  For now I've punted on the data bound
> variant, as to do that properly we do need to model the transfer
> throughputs.  So deferring that to GCC 16.
OK, so just to check I understand, that's why the patch isn't just:

  if (kind == vec_to_scalar
      && (m_vec_flags & VEC_ADVSIMD)
      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
    {
      if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
        ops->loads += count - 1;
      else
        /* Stores want to count both the index to array and data to
           array using vec_to_scalar.  However we have index stores in
           Adv.SIMD and so we only want to adjust the index loads.  */
        ops->loads += count / 2;
    }

(with no early return).  Is that right?  If so, that's ok with me, but...

> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>       PR target/118188
>       * config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops): Adjust
>       throughput of emulated gather and scatters.
>
> gcc/testsuite/ChangeLog:
>
>       PR target/118188
>       * gcc.target/aarch64/sve/gather_load_12.c: New test.
>       * gcc.target/aarch64/sve/gather_load_13.c: New test.
>       * gcc.target/aarch64/sve/gather_load_14.c: New test.
>
> -- inline copy of patch --
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d6a8e4c209527f53e6292887b3fbc8a0ac3245da..efe9a935192748fba4e71e98a55693317f5ca786 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17348,6 +17348,47 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>        return;
>      }
>
> +  /* Detect the case where we are using an emulated gather/scatter.  When a
> +     target does not support gathers and scatters directly the vectorizer
> +     emulates these by constructing an index vector and then issuing an
> +     extraction for every lane in the vector.  This is subsequently lowered
> +     by veclower into a series of loads which creates the scalar indexes for
> +     the subsequent loads.

I think the last sentence should be "If the index vector is loaded from
memory, the vector load and extractions are subsequently lowered by
veclower into a series of scalar index loads".  (The key part being the
"if".)

> +     After the final loads are done it issues a
> +     vec_construct to recreate the vector from the scalar.  For costing when
> +     we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we are dealing
> +     with an emulated instruction and should adjust costing properly.  */
> +  if (kind == vec_to_scalar
> +      && (m_vec_flags & VEC_ADVSIMD)
> +      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
> +    {
> +      auto dr = STMT_VINFO_DATA_REF (stmt_info);
> +      tree dr_ref = DR_REF (dr);
> +      /* Only really expect MEM_REF or ARRAY_REF here.  Ignore the rest.  */
> +      switch (TREE_CODE (dr_ref))
> +        {
> +        case MEM_REF:
> +        case ARRAY_REF:
> +        case TARGET_MEM_REF:
> +          {
> +            tree offset = TREE_OPERAND (dr_ref, 1);
> +            if (SSA_VAR_P (offset)
> +                && gimple_vuse (SSA_NAME_DEF_STMT (offset)))
> +              {
> +                if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
> +                  ops->loads += count - 1;
> +                else
> +                  /* Stores want to count both the index to array and data to
> +                     array using vec_to_scalar.  However we have index stores in
> +                     Adv.SIMD and so we only want to adjust the index loads.  */
> +                  ops->loads += count / 2;
> +                return;
> +              }
> +            break;
> +          }
> +        default:
> +          break;
> +        }
> +    }

Operand 1 of MEM_REF and TARGET_MEM_REF are always constant, so the
handling of those codes looks redundant.  Perhaps we should instead use:

  while (handled_component_p (dr_ref))
    {
      if (TREE_CODE (dr_ref) == ARRAY_REF)
        {
          ...early return or break if SSA offset found...
        }
      dr_ref = TREE_OPERAND (dr_ref, 0);
    }

A gather load could reasonably be to COMPONENT_REFs or BIT_FIELD_REFs of
an ARRAY_REF, so the ARRAY_REF might not be the outermost code.

Thanks,
Richard

>    /* Count the basic operation cost associated with KIND.  */
>    switch (kind)
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..d550f005d638f260973312fbccd05e2bb9aef1a7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
> +
> +#define iterations 100000
> +#define LEN_1D 32000
> +
> +float a[LEN_1D], b[LEN_1D];
> +
> +float
> +s4115 (int *ip)
> +{
> +  float sum = 0.;
> +  for (int i = 0; i < LEN_1D; i++)
> +    {
> +      sum += a[i] * b[ip[i]];
> +    }
> +  return sum;
> +}
> +
> +/* { dg-final { scan-assembler {\s+ld1w\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..24da0646a75e41eae0c0dd6a2b8af58631ac94cb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
> +
> +#define iterations 100000
> +#define LEN_1D 32000
> +
> +float a[LEN_1D], b[LEN_1D];
> +
> +float
> +s4115 (int *ip)
> +{
> +  float sum = 0.;
> +  for (int i = 0; i < LEN_1D; i++)
> +    {
> +      sum += a[i] * b[ip[i] + 1];
> +    }
> +  return sum;
> +}
> +
> +/* { dg-final { scan-assembler {\s+ld1w\t} { xfail *-*-* } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..77d06d2a0e0ee89e5e45730eebf9b4bebfa2aaed
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
> +
> +#define iterations 100000
> +#define LEN_1D 32000
> +
> +float a[LEN_1D], b[LEN_1D];
> +
> +float
> +s4115 (int *ip)
> +{
> +  float sum = 0.;
> +  for (int i = 0; i < LEN_1D; i++)
> +    {
> +      sum += a[i] * b[ip[i]];
> +    }
> +  return sum;
> +}
> +
> +/* { dg-final { scan-assembler-not {\s+st1w\t} } } */
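
For anyone following the thread, here is a rough, illustrative sketch of how
the suggested handled_component_p walk might slot into count_ops together
with the count / 2 store adjustment discussed above.  This is not the
committed patch: in particular, the TREE_CODE (index) == SSA_NAME test is
just one possible reading of "SSA offset found" (the posted patch used
SSA_VAR_P), and there is deliberately no early return, so the generic
costing below the block would still run.

  if (kind == vec_to_scalar
      && (m_vec_flags & VEC_ADVSIMD)
      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
    {
      /* Walk the data reference from the outside in, looking for an
         ARRAY_REF whose index is an SSA name defined by a statement that
         reads memory, i.e. an index that itself comes from a load.  */
      tree dr_ref = DR_REF (STMT_VINFO_DATA_REF (stmt_info));
      while (handled_component_p (dr_ref))
        {
          if (TREE_CODE (dr_ref) == ARRAY_REF)
            {
              tree index = TREE_OPERAND (dr_ref, 1);
              if (TREE_CODE (index) == SSA_NAME
                  && gimple_vuse (SSA_NAME_DEF_STMT (index)))
                {
                  if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
                    /* Emulated gather: the lane extractions end up as
                       scalar index loads.  */
                    ops->loads += count - 1;
                  else
                    /* Emulated scatter: count covers both index and data
                       extractions, but Adv. SIMD has indexed stores, so
                       only the index half needs extra loads.  */
                    ops->loads += count / 2;
                }
              break;
            }
          dr_ref = TREE_OPERAND (dr_ref, 0);
        }
    }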