On Wed, 10 Jul 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Wednesday, July 10, 2024 10:04 AM
> > To: gcc-patches@gcc.gnu.org
> > Subject: [PATCH 2/3] Support group-size of three in SLP load permutation 
> > lowering
> > 
> > The following adds support for group-size three in SLP load permutation
> > lowering to match the non-SLP capabilities.  This is done by using
> > the non-interleaving fallback code which then creates at VF == 4 from
> > { { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
> > the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
> > to produce { c0, c1, c2, c3 }.
> > 
> 
> Just curious, is this only for the 3rd component then? I'm assuming the firs 
> two get handled by
> {a0, b0, a1, b1} {a2, b2, a3, b3} still?

Yes, the example is only for the third component, the first and second
get handled by a similar two-stage approach so in total you have
6 permutes to reduce the four vectors down to three.  When you have
ld3 that's obviously going to be better to use, see the load/store-lane
followup in [3/3].

Richard.

> Regards,
> Tamar
> 
> > This turns out to be more effective than the scheme implemented
> > for non-SLP for SSE and only slightly worse for AVX512 and a bit
> > more worse for AVX2.  It seems to me that this would extend to
> > other non-power-of-two group-sizes though (but the patch does not).
> > Optimal schemes are likely difficult to lay out in VF agnostic form.
> > 
> > I'll note that while the lowering assumes even/odd extract is
> > generally available for all vector element sizes (which is probably
> > a good assumption), it doesn't in any way constrain the other
> > permutes it generates based on target availability.  Again difficult
> > to do in a VF agnostic way (but at least currently the vector type
> > is fixed).
> > 
> > I'll also note that the SLP store side merges lanes in a way
> > producing three-vector permutes for store group-size of three, so
> > the testcase uses a store group-size of four.
> > 
> >     * tree-vect-slp.cc (vect_lower_load_permutations): Support
> >     group-size of three.
> > 
> >     * gcc.dg/vect/slp-52.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.dg/vect/slp-52.c | 14 ++++++++++++
> >  gcc/tree-vect-slp.cc               | 35 +++++++++++++++++-------------
> >  2 files changed, 34 insertions(+), 15 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-52.c
> > 
> > diff --git a/gcc/testsuite/gcc.dg/vect/slp-52.c 
> > b/gcc/testsuite/gcc.dg/vect/slp-52.c
> > new file mode 100644
> > index 00000000000..ba49f0046e2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/slp-52.c
> > @@ -0,0 +1,14 @@
> > +/* { dg-do compile } */
> > +
> > +void foo (int * __restrict x, int *y)
> > +{
> > +  for (int i = 0; i < 1024; ++i)
> > +    {
> > +      x[4*i+0] = y[3*i+0];
> > +      x[4*i+1] = y[3*i+1] * 2;
> > +      x[4*i+2] = y[3*i+2] + 3;
> > +      x[4*i+3] = y[3*i+2] * 2 - 5;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { 
> > target {
> > vect_int && vect_int_mult } } } } */
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 0f830c1ad9c..2dc6d365303 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -3710,7 +3710,8 @@ vect_build_slp_instance (vec_info *vinfo,
> >              with the least number of lanes to one and then repeat until
> >              we end up with two inputs.  That scheme makes sure we end
> >              up with permutes satisfying the restriction of requiring at
> > -            most two vector inputs to produce a single vector output.  */
> > +            most two vector inputs to produce a single vector output
> > +            when the number of lanes is even.  */
> >           while (SLP_TREE_CHILDREN (perm).length () > 2)
> >             {
> >               /* When we have three equal sized groups left the pairwise
> > @@ -4050,11 +4051,10 @@ vect_lower_load_permutations (loop_vec_info
> > loop_vinfo,
> >      = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
> > 
> >    /* Only a power-of-two number of lanes matches interleaving with N 
> > levels.
> > -     The non-SLP path also supports DR_GROUP_SIZE == 3.
> >       ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 
> > lanes
> >       at each step.  */
> >    unsigned group_lanes = DR_GROUP_SIZE (first);
> > -  if (exact_log2 (group_lanes) == -1)
> > +  if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
> >      return;
> > 
> >    for (slp_tree load : loads)
> > @@ -4071,7 +4071,7 @@ vect_lower_load_permutations (loop_vec_info
> > loop_vinfo,
> >      with a non-1:1 load permutation around instead of canonicalizing
> >      those into a load and a permute node.  Removing this early
> >      check would do such canonicalization.  */
> > -      if (SLP_TREE_LANES (load) >= group_lanes / 2)
> > +      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
> >     continue;
> > 
> >        /* First build (and possibly re-use) a load node for the
> > @@ -4107,7 +4107,7 @@ vect_lower_load_permutations (loop_vec_info
> > loop_vinfo,
> >        while (1)
> >     {
> >       unsigned group_lanes = SLP_TREE_LANES (l0);
> > -     if (SLP_TREE_LANES (load) >= group_lanes / 2)
> > +     if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
> >         break;
> > 
> >       /* Try to lower by reducing the group to half its size using an
> > @@ -4117,19 +4117,24 @@ vect_lower_load_permutations (loop_vec_info
> > loop_vinfo,
> >          Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
> >          with N == 2.  */
> >       /* ???  Only an even number of lanes can be handed this way, but the
> > -        fallback below could work for any number.  */
> > -     gcc_assert ((group_lanes & 1) == 0);
> > -     unsigned even = (1 << ceil_log2 (group_lanes)) - 1;
> > -     unsigned odd = even;
> > -     for (auto l : final_perm)
> > +        fallback below could work for any number.  We have to make sure
> > +        to round up in that case.  */
> > +     gcc_assert ((group_lanes & 1) == 0 || group_lanes == 3);
> > +     unsigned even = 0, odd = 0;
> > +     if ((group_lanes & 1) == 0)
> >         {
> > -         even &= ~l.second;
> > -         odd &= l.second;
> > +         even = (1 << ceil_log2 (group_lanes)) - 1;
> > +         odd = even;
> > +         for (auto l : final_perm)
> > +           {
> > +             even &= ~l.second;
> > +             odd &= l.second;
> > +           }
> >         }
> > 
> >       /* Now build an even or odd extraction from the unpermuted load.  */
> >       lane_permutation_t perm;
> > -     perm.create (group_lanes / 2);
> > +     perm.create ((group_lanes + 1) / 2);
> >       unsigned level;
> >       if (even
> >           && ((level = 1 << ctz_hwi (even)), true)
> > @@ -4164,7 +4169,7 @@ vect_lower_load_permutations (loop_vec_info
> > loop_vinfo,
> >           bitmap_iterator bi;
> >           EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi)
> >               perm.quick_push (std::make_pair (0, i));
> > -         while (perm.length () < group_lanes / 2)
> > +         while (perm.length () < (group_lanes + 1) / 2)
> >             perm.quick_push (perm.last ());
> >         }
> > 
> > @@ -4200,7 +4205,7 @@ vect_lower_load_permutations (loop_vec_info
> > loop_vinfo,
> >          have a "local" CSE map here.  */
> >       SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
> > 
> > -     /* We now have a node for group_lanes / 2 lanes.  */
> > +     /* We now have a node for (group_lanes + 1) / 2 lanes.  */
> >       l0 = p;
> >     }
> > 
> > --
> > 2.35.3
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to