Re: [Vectorizer] Add SLP support for masked loads

Richard Sandiford Wed, 08 May 2019 06:38:00 -0700

Richard Biener <richard.guent...@gmail.com> writes:
> On Fri, Apr 26, 2019 at 3:14 PM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Alejandro Martinez Vicente <alejandro.martinezvice...@arm.com> writes:
>> > Hi,
>> >
>> > Current vectorizer doesn't support masked loads for SLP. We should add 
>> > that, to
>> > allow things like:
>> >
>> > void
>> > f (int *restrict x, int *restrict y, int *restrict z, int n)
>> > {
>> >   for (int i = 0; i < n; i += 2)
>> >     {
>> >       x[i] = y[i] ? z[i] : 1;
>> >       x[i + 1] = y[i + 1] ? z[i + 1] : 2;
>> >     }
>> > }
>> >
>> > to be vectorized using contiguous loads rather than LD2 and ST2.
>> >
>> > This patch was motivated by SVE, but it is completely generic and should 
>> > apply
>> > to any architecture with masked loads.
>> >
>> > After the patch is applied, the above code generates this output
>> > (-march=armv8.2-a+sve -O2 -ftree-vectorize):
>> >
>> > 0000000000000000 <f>:
>> >    0: 7100007f        cmp     w3, #0x0
>> >    4: 540002cd        b.le    5c <f+0x5c>
>> >    8: 51000464        sub     w4, w3, #0x1
>> >    c: d2800003        mov     x3, #0x0                        // #0
>> >   10: 90000005        adrp    x5, 0 <f>
>> >   14: 25d8e3e0        ptrue   p0.d
>> >   18: 53017c84        lsr     w4, w4, #1
>> >   1c: 910000a5        add     x5, x5, #0x0
>> >   20: 11000484        add     w4, w4, #0x1
>> >   24: 85c0e0a1        ld1rd   {z1.d}, p0/z, [x5]
>> >   28: 2598e3e3        ptrue   p3.s
>> >   2c: d37ff884        lsl     x4, x4, #1
>> >   30: 25a41fe2        whilelo p2.s, xzr, x4
>> >   34: d503201f        nop
>> >   38: a5434820        ld1w    {z0.s}, p2/z, [x1, x3, lsl #2]
>> >   3c: 25808c11        cmpne   p1.s, p3/z, z0.s, #0
>> >   40: 25808810        cmpne   p0.s, p2/z, z0.s, #0
>> >   44: a5434040        ld1w    {z0.s}, p0/z, [x2, x3, lsl #2]
>> >   48: 05a1c400        sel     z0.s, p1, z0.s, z1.s
>> >   4c: e5434800        st1w    {z0.s}, p2, [x0, x3, lsl #2]
>> >   50: 04b0e3e3        incw    x3
>> >   54: 25a41c62        whilelo p2.s, x3, x4
>> >   58: 54ffff01        b.ne    38 <f+0x38>  // b.any
>> >   5c: d65f03c0        ret
>> >
>> >
>> > I tested this patch in an aarch64 machine bootstrapping the compiler and
>> > running the checks.
>> >
>> > Alejandro
>> >
>> > gcc/Changelog:
>> >
>> > 2019-01-16  Alejandro Martinez  <alejandro.martinezvice...@arm.com>
>> >
>> >       * config/aarch64/aarch64-sve.md (copysign<mode>3): New define_expand.
>> >       (xorsign<mode>3): Likewise.
>> >       internal-fn.c: Marked mask_load_direct and mask_store_direct as
>> >       vectorizable.
>> >       tree-data-ref.c (data_ref_compare_tree): Fixed comment typo.
>> >       tree-vect-data-refs.c (can_group_stmts_p): Allow masked loads to be
>> >       combined even if masks different.
>> >       (slp_vect_only_p): New function to detect masked loads that are only
>> >       vectorizable using SLP.
>> >       (vect_analyze_data_ref_accesses): Mark SLP only vectorizable groups.
>> >       tree-vect-loop.c (vect_dissolve_slp_only_groups): New function to
>> >       dissolve SLP-only vectorizable groups when SLP has been discarded.
>> >       (vect_analyze_loop_2): Call vect_dissolve_slp_only_groups when 
>> > needed.
>> >       tree-vect-slp.c (vect_get_and_check_slp_defs): Check masked loads
>> >       masks.
>> >       (vect_build_slp_tree_1): Fixed comment typo.
>> >       (vect_build_slp_tree_2): Include masks from masked loads in SLP tree.
>> >       tree-vect-stmts.c (vect_get_vec_defs_for_operand): New function to 
>> > get
>> >       vec_defs for operand with optional SLP and vectype.
>> >       (vectorizable_load): Allow vectorizaion of masked loads for SLP only.
>> >       tree-vectorizer.h (_stmt_vec_info): Added flag for SLP-only
>> >       vectorizable.
>> >       tree-vectorizer.c (vec_info::new_stmt_vec_info): Likewise.
>> >
>> > gcc/testsuite/Changelog:
>> >
>> > 2019-01-16  Alejandro Martinez  <alejandro.martinezvice...@arm.com>
>> >
>> >       * gcc.target/aarch64/sve/mask_load_slp_1.c: New test for SLP
>> >       vectorized masked loads.
>> >
>> > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
>> > index 4f2ef45..67eee59 100644
>> > --- a/gcc/internal-fn.c
>> > +++ b/gcc/internal-fn.c
>> > @@ -100,11 +100,11 @@ init_internal_fns ()
>> >  /* Create static initializers for the information returned by
>> >     direct_internal_fn.  */
>> >  #define not_direct { -2, -2, false }
>> > -#define mask_load_direct { -1, 2, false }
>> > +#define mask_load_direct { -1, 2, true }
>> >  #define load_lanes_direct { -1, -1, false }
>> >  #define mask_load_lanes_direct { -1, -1, false }
>> >  #define gather_load_direct { -1, -1, false }
>> > -#define mask_store_direct { 3, 2, false }
>> > +#define mask_store_direct { 3, 2, true }
>> >  #define store_lanes_direct { 0, 0, false }
>> >  #define mask_store_lanes_direct { 0, 0, false }
>> >  #define scatter_store_direct { 3, 3, false }
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/mask_load_slp_1.c 
>> > b/gcc/testsuite/gcc.target/aarch64/sve/mask_load_slp_1.c
>> > new file mode 100644
>> > index 0000000..b106cae
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/mask_load_slp_1.c
>> > @@ -0,0 +1,74 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-options "-O2 -ftree-vectorize" } */
>> > +
>> > +#include <stdint.h>
>> > +
>> > +#define MASK_SLP_2(TYPE_COND, ALT_VAL)                                    
>> >    \
>> > +void __attribute__ ((noinline, noclone))                             \
>> > +mask_slp_##TYPE_COND##_2_##ALT_VAL (int *restrict x, int *restrict y,     
>> >    \
>> > +                                 TYPE_COND *restrict z, int n)       \
>> > +{                                                                    \
>> > +  for (int i = 0; i < n; i += 2)                                     \
>> > +    {                                                                     
>> >    \
>> > +      x[i] = y[i] ? z[i] : 1;                                             
>> >    \
>> > +      x[i + 1] = y[i + 1] ? z[i + 1] : ALT_VAL;                           
>> >    \
>> > +    }                                                                     
>> >    \
>> > +}
>> > +
>> > +#define MASK_SLP_4(TYPE_COND, ALT_VAL)                                    
>> >    \
>> > +void __attribute__ ((noinline, noclone))                             \
>> > +mask_slp_##TYPE_COND##_4_##ALT_VAL (int *restrict x, int *restrict y,     
>> >    \
>> > +                                 TYPE_COND *restrict z, int n)       \
>> > +{                                                                    \
>> > +  for (int i = 0; i < n; i += 4)                                     \
>> > +    {                                                                     
>> >    \
>> > +      x[i] = y[i] ? z[i] : 1;                                             
>> >    \
>> > +      x[i + 1] = y[i + 1] ? z[i + 1] : ALT_VAL;                           
>> >    \
>> > +      x[i + 2] = y[i + 2] ? z[i + 2] : 1;                            \
>> > +      x[i + 3] = y[i + 3] ? z[i + 3] : ALT_VAL;                           
>> >    \
>> > +    }                                                                     
>> >    \
>> > +}
>> > +
>> > +#define MASK_SLP_8(TYPE_COND, ALT_VAL)                                    
>> >    \
>> > +void __attribute__ ((noinline, noclone))                             \
>> > +mask_slp_##TYPE_COND##_8_##ALT_VAL (int *restrict x, int *restrict y,     
>> >    \
>> > +                                 TYPE_COND *restrict z, int n)       \
>> > +{                                                                    \
>> > +  for (int i = 0; i < n; i += 8)                                     \
>> > +    {                                                                     
>> >    \
>> > +      x[i] = y[i] ? z[i] : 1;                                             
>> >    \
>> > +      x[i + 1] = y[i + 1] ? z[i + 1] : ALT_VAL;                           
>> >    \
>> > +      x[i + 2] = y[i + 2] ? z[i + 2] : 1;                            \
>> > +      x[i + 3] = y[i + 3] ? z[i + 3] : ALT_VAL;                           
>> >    \
>> > +      x[i + 4] = y[i + 4] ? z[i + 4] : 1;                            \
>> > +      x[i + 5] = y[i + 5] ? z[i + 5] : ALT_VAL;                           
>> >    \
>> > +      x[i + 6] = y[i + 6] ? z[i + 6] : 1;                            \
>> > +      x[i + 7] = y[i + 7] ? z[i + 7] : ALT_VAL;                           
>> >    \
>> > +    }                                                                     
>> >    \
>> > +}
>> > +
>> > +MASK_SLP_2(int8_t, 1)
>> > +MASK_SLP_2(int8_t, 2)
>> > +MASK_SLP_2(int, 1)
>> > +MASK_SLP_2(int, 2)
>> > +MASK_SLP_2(int64_t, 1)
>> > +MASK_SLP_2(int64_t, 2)
>> > +
>> > +MASK_SLP_4(int8_t, 1)
>> > +MASK_SLP_4(int8_t, 2)
>> > +MASK_SLP_4(int, 1)
>> > +MASK_SLP_4(int, 2)
>> > +MASK_SLP_4(int64_t, 1)
>> > +MASK_SLP_4(int64_t, 2)
>> > +
>> > +MASK_SLP_8(int8_t, 1)
>> > +MASK_SLP_8(int8_t, 2)
>> > +MASK_SLP_8(int, 1)
>> > +MASK_SLP_8(int, 2)
>> > +MASK_SLP_8(int64_t, 1)
>> > +MASK_SLP_8(int64_t, 2)
>> > +
>> > +/* { dg-final { scan-assembler-not {\tld2w\t} } } */
>> > +/* { dg-final { scan-assembler-not {\tst2w\t} } } */
>> > +/* { dg-final { scan-assembler-times {\tld1w\t} 48 } } */
>> > +/* { dg-final { scan-assembler-times {\tst1w\t} 40 } } */
>> > diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
>> > index 7d1f03c..1833a5f 100644
>> > --- a/gcc/tree-data-ref.c
>> > +++ b/gcc/tree-data-ref.c
>> > @@ -1272,7 +1272,7 @@ create_data_ref (edge nest, loop_p loop, tree 
>> > memref, gimple *stmt,
>> >    return dr;
>> >  }
>> >
>> > -/*  A helper function computes order between two tree epxressions T1 and 
>> > T2.
>> > +/*  A helper function computes order between two tree expressions T1 and 
>> > T2.
>> >      This is used in comparator functions sorting objects based on the 
>> > order
>> >      of tree expressions.  The function returns -1, 0, or 1.  */
>> >
>> > diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
>> > index 7bbd47f..8a82147 100644
>> > --- a/gcc/tree-vect-data-refs.c
>> > +++ b/gcc/tree-vect-data-refs.c
>> > @@ -2837,22 +2837,72 @@ can_group_stmts_p (stmt_vec_info stmt1_info, 
>> > stmt_vec_info stmt2_info)
>> >        if (ifn != gimple_call_internal_fn (call2))
>> >       return false;
>> >
>> > -      /* Check that the masks are the same.  Cope with casts of masks,
>> > +      /* Check that the masks can be combined.  */
>> > +      tree mask1 = gimple_call_arg (call1, 2);
>> > +      tree mask2 = gimple_call_arg (call2, 2);
>> > +      if (!operand_equal_p (mask1, mask2, 0))
>> > +     {
>> > +       /* Stores need identical masks.  */
>> > +       if (ifn == IFN_MASK_STORE)
>> > +         {
>> > +           mask1 = strip_conversion (mask1);
>> > +           if (!mask1)
>> > +             return false;
>> > +           mask2 = strip_conversion (mask2);
>> > +           if (!mask2)
>> > +             return false;
>> > +           if (!operand_equal_p (mask1, mask2, 0))
>> > +             return false;
>> > +         }
>> > +       /* Loads are allowed different masks under SLP only.
>> > +          (See slp_vect_only_p () below).  */
>> > +     }
>> > +      return true;
>> > +    }
>> > +
>> > +  return false;
>> > +}
>> > +
>> > +/* Return true if vectorizable_* routines can handle statements STMT1_INFO
>> > +   and STMT2_INFO being in a single group for SLP only.  */
>> > +
>> > +static bool
>> > +slp_vect_only_p (stmt_vec_info stmt1_info, stmt_vec_info stmt2_info)
>> > +{
>> > +  if (gimple_assign_single_p (stmt1_info->stmt))
>> > +    {
>> > +      gcc_assert (gimple_assign_single_p (stmt2_info->stmt));
>> > +      return false;
>> > +    }
>> > +
>> > +  gcall *call1 = dyn_cast <gcall *> (stmt1_info->stmt);
>> > +  if (call1 && gimple_call_internal_p (call1))
>> > +    {
>> > +      /* Check for two masked loads or two masked stores.  */
>> > +      gcall *call2 = dyn_cast <gcall *> (stmt2_info->stmt);
>> > +      gcc_assert (call2 && gimple_call_internal_p (call2));
>> > +      internal_fn ifn = gimple_call_internal_fn (call1);
>> > +      if (ifn != IFN_MASK_LOAD)
>> > +     return false;
>> > +      gcc_assert (ifn == gimple_call_internal_fn (call2));
>> > +
>> > +      /* Check if the masks are the same.  Cope with casts of masks,
>> >        like those created by build_mask_conversion.  */
>> >        tree mask1 = gimple_call_arg (call1, 2);
>> >        tree mask2 = gimple_call_arg (call2, 2);
>> >        if (!operand_equal_p (mask1, mask2, 0))
>> >       {
>> > +       /* This is the only case that is just for SLP: non-identical but
>> > +          otherwise slp-compatible masks.  */
>> >         mask1 = strip_conversion (mask1);
>> >         if (!mask1)
>> > -         return false;
>> > +         return true;
>> >         mask2 = strip_conversion (mask2);
>> >         if (!mask2)
>> > -         return false;
>> > +         return true;
>> >         if (!operand_equal_p (mask1, mask2, 0))
>> > -         return false;
>> > +         return true;
>> >       }
>> > -      return true;
>> >      }
>> >
>> >    return false;
>>
>> Normally I'd say it would be better to add a bool argument to
>> can_group_stmts_p that says whether we want non-SLP-only rules,
>> or perhaps convert the return type to an enum.  But given that the
>> non-SLP path is going away soon anyway, I guess separate functions
>> are better despite the cut-&-paste.
>>
>> > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> > index afbf9a9..754a2e4 100644
>> > --- a/gcc/tree-vect-loop.c
>> > +++ b/gcc/tree-vect-loop.c
>> > @@ -1755,6 +1755,49 @@ vect_get_datarefs_in_loop (loop_p loop, basic_block 
>> > *bbs,
>> >    return opt_result::success ();
>> >  }
>> >
>> > +/* Look for SLP-only access groups and turn each individual access into 
>> > its own
>> > +   group.  */
>> > +static void
>> > +vect_dissolve_slp_only_groups (loop_vec_info loop_vinfo)
>> > +{
>> > +  unsigned int i;
>> > +  struct data_reference *dr;
>> > +
>> > +  DUMP_VECT_SCOPE ("vect_dissolve_slp_only_groups");
>> > +
>> > +  vec<data_reference_p> datarefs = loop_vinfo->shared->datarefs;
>> > +  FOR_EACH_VEC_ELT (datarefs, i, dr)
>> > +    {
>> > +      gcc_assert (DR_REF (dr));
>> > +      stmt_vec_info stmt_info = loop_vinfo->lookup_stmt (DR_STMT (dr));
>> > +
>> > +      /* Check if the load is a part of an interleaving chain.  */
>> > +      if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>> > +     {
>> > +       stmt_vec_info first_element = DR_GROUP_FIRST_ELEMENT (stmt_info);
>> > +       unsigned int group_size = DR_GROUP_SIZE (first_element);
>> > +
>> > +       /* Check if SLP-only groups.  */
>> > +       if (STMT_VINFO_SLP_VECT_ONLY (first_element))
>> > +         {
>> > +             /* Dissolve the group.  */
>> > +             STMT_VINFO_SLP_VECT_ONLY (first_element) = false;
>> > +
>> > +             stmt_vec_info vinfo = first_element;
>> > +             while (vinfo)
>> > +               {
>> > +                 stmt_vec_info next = DR_GROUP_NEXT_ELEMENT (vinfo);
>> > +                 DR_GROUP_FIRST_ELEMENT (vinfo) = NULL;
>> > +                 DR_GROUP_NEXT_ELEMENT (vinfo) = NULL;
>> > +                 DR_GROUP_SIZE (vinfo) = 1;
>> > +                 DR_GROUP_GAP (vinfo) = group_size - 1;
>> > +                 vinfo = next;
>>
>> I think DR_GROUP_FIRST_ELEMENT should be vinfo here, so that it remains
>> a grouped access with only one element.
>
> Then the above looks like single-element interleaving?  Do we handle
> interleaving
> at all for masked loads/stores?


Not yet, but it's on the wishlist.

> Generally a no longer grouped access would have DR_GROUP_FIRST_ELEMENT NULL
> and "no" size/gap (well, nobody looks at those fields then).  It would
> need vectorization
> with strided accesses then though thus you need to set the strided flag.

But with the way get_load_store_type is structured, single-element
groups give strictly more information than a strided access.  We still
fall back on gather/scatter or elementwise accesses if necessary.

(One of the reasons for adding get_load_store_type was to avoid
the group/stride choice dictating a particular implementation.)

So I think single element groups are better here.

> I think you want to have a testcase exercising this path.

No argument with this of course. :-)

Richard

Re: [Vectorizer] Add SLP support for masked loads

Reply via email to