> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Wednesday, January 8, 2025 10:30 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH]AArch64: Fix costing of emulated gathers/scatters
> [PR118188]
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> >> i.e. we use separate address arithmetic and avoid UMOVs.  Counting
> >> >> two loads and one store for each element of the scatter store seems
> >> >> like overkill for that.
> >> >
> >> > Hmm agreed..
> >> >
> >> > How about for stores we increase the load counts by count / 2?
> >> >
> >> > This would account for the fact that we know we have indexed stores
> >> > and so the data-to-scalar operation is free?
> >>
> >> Yeah, sounds good.  We should probably divide count itself by 2,
> >> then apply the new count to both the load heuristic and the general ops,
> >> to avoid double-counting in both.  (The V pipe usage for stores is
> >> modelled as part of the scalar_store itself.)  But like you say,
> >> we should probably drop the - 1 from the load adjustment for stores,
> >> because that - 1 would also be applied twice.
> >>
> >
> > Here's a new version of the patch.  For now I've punted on the data-bound
> > variant, as doing that properly requires modelling the transfer
> > throughputs, so I'm deferring it to GCC 16.
>
> OK, so just to check I understand, that's why the patch isn't just:
>
>   if (kind == vec_to_scalar
>       && (m_vec_flags & VEC_ADVSIMD)
>       && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
>     {
>       if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
>         ops->loads += count - 1;
>       else
>         /* Stores want to count both the index to array and data to
>            array using vec_to_scalar.  However we have index stores in
>            Adv.SIMD and so we only want to adjust the index loads.  */
>         ops->loads += count / 2;
>     }
>
> (with no early return).  Is that right?  If so, that's ok with me, but...
Indeed.  Removing the return would only have been needed to address the
data-bound version, which I've deferred to GCC 16 for more accurate costing,
since the vectorizer cost model is going to change anyway.

> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	PR target/118188
> > 	* config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops): Adjust
> > 	throughput of emulated gather and scatters.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 	PR target/118188
> > 	* gcc.target/aarch64/sve/gather_load_12.c: New test.
> > 	* gcc.target/aarch64/sve/gather_load_13.c: New test.
> > 	* gcc.target/aarch64/sve/gather_load_14.c: New test.
> >
> > -- inline copy of patch --
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index d6a8e4c209527f53e6292887b3fbc8a0ac3245da..efe9a935192748fba4e71e98a55693317f5ca786 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -17348,6 +17348,47 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
> >        return;
> >      }
> >
> > +  /* Detect the case where we are using an emulated gather/scatter.  When a
> > +     target does not support gathers and scatters directly the vectorizer
> > +     emulates these by constructing an index vector and then issuing an
> > +     extraction for every lane in the vector.  This is subsequently lowered
> > +     by veclower into a series of loads which creates the scalar indexes for
> > +     the subsequent loads.
>
> I think the last sentence should be "If the index vector is loaded from
> memory, the vector load and extractions are subsequently lowered by veclower
> into a series of scalar index loads".  (The key part being the "if".)
>
> > +     After the final loads are done it issues a
> > +     vec_construct to recreate the vector from the scalar.  For costing when
> > +     we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we are dealing
> > +     with an emulated instruction and should adjust costing properly.  */
> > +  if (kind == vec_to_scalar
> > +      && (m_vec_flags & VEC_ADVSIMD)
> > +      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
> > +    {
> > +      auto dr = STMT_VINFO_DATA_REF (stmt_info);
> > +      tree dr_ref = DR_REF (dr);
> > +      /* Only really expect MEM_REF or ARRAY_REF here.  Ignore the rest.  */
> > +      switch (TREE_CODE (dr_ref))
> > +        {
> > +        case MEM_REF:
> > +        case ARRAY_REF:
> > +        case TARGET_MEM_REF:
> > +          {
> > +            tree offset = TREE_OPERAND (dr_ref, 1);
> > +            if (SSA_VAR_P (offset)
> > +                && gimple_vuse (SSA_NAME_DEF_STMT (offset)))
> > +              {
> > +                if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
> > +                  ops->loads += count - 1;
> > +                else
> > +                  /* Stores want to count both the index to array and data to
> > +                     array using vec_to_scalar.  However we have index stores in
> > +                     Adv.SIMD and so we only want to adjust the index loads.  */
> > +                  ops->loads += count / 2;
> > +                return;
> > +              }
> > +            break;
> > +          }
> > +        default:
> > +          break;
> > +        }
> > +    }
>
> Operand 1 of MEM_REF and TARGET_MEM_REF are always constant, so the
> handling of those codes looks redundant.  Perhaps we should instead use:
>
>   while (handled_component_p (dr_ref))
>     {
>       if (TREE_CODE (dr_ref) == ARRAY_REF)
>         {
>           ...early return or break if SSA offset found...
>         }
>       dr_ref = TREE_OPERAND (dr_ref, 0);
>     }
>
> A gather load could reasonably be to COMPONENT_REFs or BIT_FIELD_REFs of
> an ARRAY_REF, so the ARRAY_REF might not be the outermost code.
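At the source level, a gather of that shape would be a load through a struct
member, e.g. points[ip[i]].x, where DR_REF is a COMPONENT_REF wrapping the
ARRAY_REF that carries the SSA offset.  The kernel below is only an
illustrative sketch (the struct, array and function names are invented, and it
is not one of the new tests):

#define LEN_1D 32000

struct point { float x, y; };

struct point points[LEN_1D];
float a[LEN_1D];

/* Emulated gather whose data reference is points[ip[i]].x: the ARRAY_REF
   with the variable index sits underneath a COMPONENT_REF, so it is only
   found after stripping the outer handled components.  */
float
gather_member (int *ip)
{
  float sum = 0.f;
  for (int i = 0; i < LEN_1D; i++)
    sum += a[i] * points[ip[i]].x;
  return sum;
}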
Ah, I tried to find their layouts but couldn't readily find one.  Interesting;
I would have thought a scatter/gather on either of those would be done more
efficiently through bit masking.

Here's the updated version of the patch.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/118188
	* config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops): Adjust
	throughput of emulated gather and scatters.

gcc/testsuite/ChangeLog:

	PR target/118188
	* gcc.target/aarch64/sve/gather_load_12.c: New test.
	* gcc.target/aarch64/sve/gather_load_13.c: New test.
	* gcc.target/aarch64/sve/gather_load_14.c: New test.

-- inline copy of patch --

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 3e700ed41e97a98dc844cad1c8a66a3555d82221..6a7abceda466f5569d743b0d3a0eed1bbeb79a2c 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -17378,6 +17378,46 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
       return;
     }
 
+  /* Detect the case where we are using an emulated gather/scatter.  When a
+     target does not support gathers and scatters directly the vectorizer
+     emulates these by constructing an index vector and then issuing an
+     extraction for every lane in the vector.  If the index vector is loaded
+     from memory, the vector load and extractions are subsequently lowered by
+     veclower into a series of scalar index loads.  After the final loads are
+     done it issues a vec_construct to recreate the vector from the scalar.  For
+     costing when we see a vec_to_scalar on a stmt with VMAT_GATHER_SCATTER we
+     are dealing with an emulated instruction and should adjust costing
+     properly.  */
+  if (kind == vec_to_scalar
+      && (m_vec_flags & VEC_ADVSIMD)
+      && vect_mem_access_type (stmt_info, node) == VMAT_GATHER_SCATTER)
+    {
+      auto dr = STMT_VINFO_DATA_REF (stmt_info);
+      tree dr_ref = DR_REF (dr);
+      while (handled_component_p (dr_ref))
+        {
+          if (TREE_CODE (dr_ref) == ARRAY_REF)
+            {
+              tree offset = TREE_OPERAND (dr_ref, 1);
+              if (SSA_VAR_P (offset)
+                  && gimple_vuse (SSA_NAME_DEF_STMT (offset)))
+                {
+                  if (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type)
+                    ops->loads += count - 1;
+                  else
+                    /* Stores want to count both the index to array and data to
+                       array using vec_to_scalar.  However we have index stores
+                       in Adv.SIMD and so we only want to adjust the index
+                       loads.  */
+                    ops->loads += count / 2;
+                  return;
+                }
+              else
+                break;
+            }
+          dr_ref = TREE_OPERAND (dr_ref, 0);
+        }
+    }
 
   /* Count the basic operation cost associated with KIND.  */
   switch (kind)
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
new file mode 100644
index 0000000000000000000000000000000000000000..d550f005d638f260973312fbccd05e2bb9aef1a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_12.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
+
+#define iterations 100000
+#define LEN_1D 32000
+
+float a[LEN_1D], b[LEN_1D];
+
+float
+s4115 (int *ip)
+{
+  float sum = 0.;
+  for (int i = 0; i < LEN_1D; i++)
+    {
+      sum += a[i] * b[ip[i]];
+    }
+  return sum;
+}
+
+/* { dg-final { scan-assembler {\s+ld1w\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c
new file mode 100644
index 0000000000000000000000000000000000000000..24da0646a75e41eae0c0dd6a2b8af58631ac94cb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_13.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
+
+#define iterations 100000
+#define LEN_1D 32000
+
+float a[LEN_1D], b[LEN_1D];
+
+float
+s4115 (int *ip)
+{
+  float sum = 0.;
+  for (int i = 0; i < LEN_1D; i++)
+    {
+      sum += a[i] * b[ip[i] + 1];
+    }
+  return sum;
+}
+
+/* { dg-final { scan-assembler {\s+ld1w\t} { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c
new file mode 100644
index 0000000000000000000000000000000000000000..77d06d2a0e0ee89e5e45730eebf9b4bebfa2aaed
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/gather_load_14.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-Ofast -mcpu=neoverse-v2" } */
+
+#define iterations 100000
+#define LEN_1D 32000
+
+float a[LEN_1D], b[LEN_1D];
+
+float
+s4115 (int *ip)
+{
+  float sum = 0.;
+  for (int i = 0; i < LEN_1D; i++)
+    {
+      sum += a[i] * b[ip[i]];
+    }
+  return sum;
+}
+
+/* { dg-final { scan-assembler-not {\s+st1w\t} } } */
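For reference, the scatter-store side of the heuristic (the
ops->loads += count / 2 path) corresponds to a kernel like the one below: for
a four-lane vector the store is costed with eight vec_to_scalar ops, four for
the indexes and four for the data, and count / 2 then adds only the four
scalar index loads.  The numbers are illustrative and the sketch is not part
of the patch or testsuite; it just mirrors the style of the tests above:

#define LEN_1D 32000

float a[LEN_1D], b[LEN_1D];

/* Emulated scatter store: each lane needs a scalar index load from ip[],
   but the data element can be stored directly with an Adv. SIMD indexed
   store, so only the index side is counted as extra loads.  */
void
scatter (int *ip)
{
  for (int i = 0; i < LEN_1D; i++)
    b[ip[i]] = a[i];
}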