On Fri, 24 Jun 2022, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Thu, 23 Jun 2022, Richard Sandiford wrote:
> >> In a reduction pair like:
> >> 
> >>   typedef float T;
> >> 
> >>   void
> >>   f1 (T *x)
> >>   {
> >>     T res1 = 0;
> >>     T res2 = 0;
> >>     for (int i = 0; i < 100; ++i)
> >>       {
> >>    res1 += x[i * 2];
> >>    res2 += x[i * 2 + 1];
> >>       }
> >>     x[0] = res1;
> >>     x[1] = res2;
> >>   }
> >> 
> >> it isn't easy to predict whether the initial reduction group will
> >> be { res1, res2 } or { res2, res1 }.  If the initial group ends up
> >> being { res2, res1 }, vect_optimize_slp will try to swap it around
> >> in order to remove an unnecessary permute on the load.  This gives:
> >> 
> >> f1:
> >> .LFB0:
> >>         .cfi_startproc
> >>         movi    v0.4s, 0
> >>         mov     x1, x0
> >>         add     x2, x0, 800
> >>         .p2align 3,,7
> >> .L2:
> >>         ldr     q1, [x1], 16
> >>         fadd    v0.4s, v0.4s, v1.4s
> >>         cmp     x2, x1
> >>         bne     .L2
> >>         dup     d1, v0.d[1]
> >>         fadd    v0.2s, v0.2s, v1.2s
> >>         str     d0, [x0]
> >>         ret
> >> 
> >> But there are no specific ordering constraints for non-consecutive
> >> loads.  For example, in:
> >> 
> >>   void
> >>   f2 (T *x)
> >>   {
> >>     T res1 = 0;
> >>     T res2 = 0;
> >>     for (int i = 0; i < 100; ++i)
> >>       {
> >>    res1 += x[i * 4];
> >>    res2 += x[i * 4 + 2];
> >>       }
> >>     x[0] = res1;
> >>     x[1] = res2;
> >>   }
> >> 
> >> the reduction group happens to be { res2, res1 }, and so we get
> >> a load permutation of { 2, 0, 6, 4 }.  On aarch64 this gives:
> >> 
> >> f2:
> >> .LFB1:
> >>         .cfi_startproc
> >>         adrp    x2, .LC0
> >>         mov     x1, x0
> >>         movi    v2.4s, 0
> >>         ldr     q3, [x2, #:lo12:.LC0]
> >>         add     x2, x0, 1568
> >>         .p2align 3,,7
> >> .L6:
> >>         ldp     q0, q1, [x1]
> >>         add     x1, x1, 32
> >>         tbl     v0.16b, {v0.16b - v1.16b}, v3.16b
> >>         fadd    v2.4s, v2.4s, v0.4s
> >>         cmp     x1, x2
> >>         bne     .L6
> >>    ...untidy code, 18 insns...
> >>         ret
> >> 
> >> where we have to load the permute selector from memory and use
> >> the general TBL instruction.  If the order happened to be { res1, res2 }
> >> instead we would generate the much tidier:
> >> 
> >> f2:
> >> .LFB1:
> >>         .cfi_startproc
> >>         movi    v1.4s, 0
> >>         mov     x1, x0
> >>         add     x2, x0, 1568
> >>         .p2align 3,,7
> >> .L6:
> >>         ldp     q0, q2, [x1]
> >>         add     x1, x1, 32
> >>         uzp1    v0.4s, v0.4s, v2.4s
> >>         fadd    v1.4s, v1.4s, v0.4s
> >>         cmp     x1, x2
> >>         bne     .L6
> >>         ldr     d0, [x0, 1568]
> >>         ldr     d5, [x0, 1576]
> >>         ldr     s2, [x0, 1584]
> >>         dup     d3, v1.d[1]
> >>         ldr     s4, [x0, 1592]
> >>         zip1    v0.2s, v0.2s, v5.2s
> >>         ins     v2.s[1], v4.s[0]
> >>         fadd    v0.2s, v0.2s, v2.2s
> >>         fadd    v0.2s, v0.2s, v1.2s
> >>         fadd    v0.2s, v0.2s, v3.2s
> >>         str     d0, [x0]
> >>         ret
> >> 
> >> This WIP patch makes vect_optimize_slp try to put load permutations
> >> into index order.  However, this new transform might be a loss if it
> >> forces permutations elsewhere.  For example, given:
> >> 
> >> void
> >> f3 (T *restrict x, T *restrict y, T *restrict z)
> >> {
> >>   for (int i = 0; i < 100; ++i)
> >>     {
> >>       x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2];
> >>       x[i * 2 + 1] = y[i * 4] + z[i * 4];
> >>     }
> >> }
> >> 
> >> we would generate:
> >> 
> >> .L9:
> >>         ldr     q0, [x1, x3]
> >>         ldr     q3, [x6, x3]
> >>         ldr     q1, [x2, x3]
> >>         ldr     q2, [x5, x3]
> >>         add     x3, x3, 32
> >>         uzp1    v0.4s, v0.4s, v3.4s
> >>         uzp1    v1.4s, v1.4s, v2.4s
> >>         fadd    v0.4s, v0.4s, v1.4s
> >>         rev64   v0.4s, v0.4s
> >>         str     q0, [x4], 16
> >>         cmp     x3, 1568
> >>         bne     .L9
> >> 
> >> (3 permutes) rather than:
> >> 
> >> .L9:
> >>         ldr     q0, [x1, x3]
> >>         ldr     q1, [x6, x3]
> >>         ldr     q2, [x2, x3]
> >>         ldr     q3, [x5, x3]
> >>         tbl     v0.16b, {v0.16b - v1.16b}, v4.16b
> >>         add     x3, x3, 32
> >>         tbl     v2.16b, {v2.16b - v3.16b}, v4.16b
> >>         fadd    v0.4s, v0.4s, v2.4s
> >>         str     q0, [x4], 16
> >>         cmp     x3, 1568
> >>         bne     .L9
> >> 
> >> (2 permutes).
> >> 
> >> One simple(ish) fix would be to conditionalise the new case on
> >> whether all "roots" of the load are reduction groups.  Alternatively,
> >> we could treat the new case as a soft preference and back out if it
> >> would require any new VEC_PERM_EXPR nodes to be created.  This would
> >> require a propagation back from uses to definitions.
> >> 
> >> WDYT?  Does this look like a feasible direction?  If so, any thoughts
> >> on when the new case should be enabled?
> >> 
> >> The patch keeps the bijection requirement, since relaxing that seems
> >> like separate work.
> >> 
> >> Tested on aarch64-linux-gnu, no regressions.
> >> 
> >> Thanks,
> >> Richard
> >> 
> >> ---
> >>  gcc/tree-vect-slp.cc | 41 ++++++++++++++---------------------------
> >>  1 file changed, 14 insertions(+), 27 deletions(-)
> >> 
> >> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> >> index dab5daddcc5..fde35d8c954 100644
> >> --- a/gcc/tree-vect-slp.cc
> >> +++ b/gcc/tree-vect-slp.cc
> >> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3.  If not see
> >>  <http://www.gnu.org/licenses/>.  */
> >>  
> >>  #include "config.h"
> >> +#define INCLUDE_ALGORITHM
> >>  #include "system.h"
> >>  #include "coretypes.h"
> >>  #include "backend.h"
> >> @@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo)
> >>        if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> >>    continue;
> >>        dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> >> -      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> >> -      bool any_permute = false;
> >> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> -  {
> >> -    unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> >> -    imin = MIN (imin, idx);
> >> -    imax = MAX (imax, idx);
> >> -    if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> >> -      any_permute = true;
> >> -  }
> >> -      /* If there's no permute no need to split one out.  */
> >> -      if (!any_permute)
> >> -  continue;
> >> -      /* If the span doesn't match we'd disrupt VF computation, avoid
> >> -   that for now.  */
> >> -      if (imax - imin + 1 != SLP_TREE_LANES (node))
> >> +
> >> +      auto_vec<unsigned> sorted;
> >> +      sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
> >> +      std::sort (sorted.begin (), sorted.end ());
> >> +      if (std::equal (sorted.begin (), sorted.end (),
> >> +                SLP_TREE_LOAD_PERMUTATION (node).begin ()))
> >>    continue;
> >>  
> >>        /* For now only handle true permutes, like
> >>     vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> >>     when permuting constants and invariants keeping the permute
> >>     bijective.  */
> >> -      auto_sbitmap load_index (SLP_TREE_LANES (node));
> >> -      bitmap_clear (load_index);
> >> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> -  bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> >> -      unsigned j;
> >> -      for (j = 0; j < SLP_TREE_LANES (node); ++j)
> >> -  if (!bitmap_bit_p (load_index, j))
> >> -    break;
> >> -      if (j != SLP_TREE_LANES (node))
> >> +      if (std::adjacent_find (sorted.begin (), sorted.end ()) != sorted.end ())
> >>    continue;
> >
> > So without the std:: rewrite this is the only change, where
> > previously we rejected { 0, 2, ... } because '1' was missing
> > but now we only reject duplicates?
> 
> All three changes work together really.  I thought the std:: stuff
> made things simpler, but the things that use std:: would still need
> to be rewritten if the patch didn't use std::.
> 
> For a load permute (Lo) like { 2, 0, 6, 4 } we want the lane permute (La)
> to be { 1, 0, 3, 2 }, so that the new load permute (Lo') is { 0, 2, 4, 6 }.
> 
> So the "SLP_TREE_LOAD_PERMUTATION (node)[j] - imin" below is no longer
> enough when computing La, since subtracting the minimum doesn't have
> the necessary compressive effect.  We instead need to set La[i] to the
> index of Lo[i] in Lo'.  And the easiest way of doing that seemed to be
> to compute Lo' (by sorting) and then doing a binary search for each
> element.  Having the sorted array also makes it easy to see whether
> the permute is a no-op and whether the same source element appears
> twice.
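> 
> Stand-alone, the La computation looks like this (a self-contained
> sketch, using std::vector rather than GCC's vec<> so that it
> compiles on its own):
> 
>   #include <algorithm>
>   #include <cstdio>
>   #include <vector>
> 
>   int main ()
>   {
>     std::vector<unsigned> lo = { 2, 0, 6, 4 };    // Lo
>     std::vector<unsigned> sorted (lo);
>     std::sort (sorted.begin (), sorted.end ());   // Lo' = { 0, 2, 4, 6 }
>     std::vector<unsigned> la (lo.size ());
>     for (unsigned j = 0; j < lo.size (); ++j)
>       // La[j] is the position of Lo[j] in the sorted permute Lo'.
>       la[j] = (std::lower_bound (sorted.begin (), sorted.end (), lo[j])
>                - sorted.begin ());
>     for (unsigned x : la)
>       printf ("%u ", x);                          // prints 1 0 3 2
>     printf ("\n");
>   }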
> 
> > As the comment says this
> > is what vect_attempt_slp_rearrange_stmts did so we would need
> > to adjust the comment as well.
> 
> I thought the comment meant that the permute entered into the perm
> array is a bijection on [0, SLP_TREE_LANES (node) - 1].  That's still
> true with the patch, and so for example we still don't need to worry
> about:
> 
>                 /* ???  But for constants once we want to handle
>                    non-bijective permutes we have to verify the permute,
>                    when unifying lanes, will not unify different constants.
>                    For example see gcc.dg/vect/bb-slp-14.c for a case
>                    that would break.  */
> 
> (I hope).

Yes.  I think the comment meant that permutes that only select a part
of the vector are not supported, but I'd have to go back to the original
vect_attempt_slp_rearrange_stmts (which was quite limited) to check.
The idea there was that the "unpermuted" vector load would always be
a natively supported operation and thus always better than the
permuted one.

I think extending that is OK.  Ideally targets would expose
multi-step permutes here so we can better assess cost.

> > (I'm unsure about the std:: stuff but mostly because I'm not
> > familiar with it and its semantics)
> >
> > The current "propagation" isn't really designed to find
> > optimal solutions; rather, it relies on the initial
> > permute "anticipation" we set up and simply propagates
> > those up as far as possible.  I think at some point we need
> > to come up with something more elaborate and rewrite this all.
> >
> > It might also be that a target handles { 2, 0 } but not
> > { 0, 2 }, while we know it always handles the unpermuted case.
> 
> OK.  I guess we should reject new permutes if vect_transform_slp_perm_load
> returns false for them and true for the original.

Yes, for example.  Or somehow "compute" a permute that the target
does support (since when the permute ends up unsupported we
disqualify the vectorization later).
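
For concreteness, the check could look something like this (a sketch
only: it assumes the vector type and VF are already known at this
point, which may not be true in vect_optimize_slp, and it relies on
the analyze-only mode of vect_transform_slp_perm_load):

  unsigned n_perms;
  /* Whether the target can code-generate the original permutation.  */
  bool old_ok = vect_transform_slp_perm_load (vinfo, node, vNULL, NULL,
                                              vf, true, &n_perms);
  /* ...temporarily install the sorted load permutation in NODE...  */
  bool new_ok = vect_transform_slp_perm_load (vinfo, node, vNULL, NULL,
                                              vf, true, &n_perms);
  if (old_ok && !new_ok)
    {
      /* Back out: restore and keep the original load permutation.  */
    }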

> > Can you think of a way to express this as a dataflow problem
> > that captures finding the optimal solution as well?
> 
> I think it's inevitably going to be a two-way thing: propagate what's
> best for producers as far as possible, then propagate back what's
> best for consumers as far as possible, or the other way around.

Yeah, with the additional constraint of what the target handles.
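
In pseudocode the iteration might look like the following (purely
illustrative: forward_order, backward_order and the meet_* routines
are made-up names, not existing code):

  /* Alternate forward and backward passes until a fixed point,
     accepting a permute only where the target supports it.  */
  bool changed = true;
  while (changed)
    {
      changed = false;
      /* Forward: push each producer's preferred permute towards
         its uses.  */
      for (slp_tree node : forward_order (graph))
        changed |= meet_with_producers (node);
      /* Backward: push each consumer's preferred permute towards
         its defs.  */
      for (slp_tree node : backward_order (graph))
        changed |= meet_with_consumers (node);
    }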

Richard.

> Thanks,
> Richard
> 
> >
> >>        vec<unsigned> perm = vNULL;
> >>        perm.safe_grow (SLP_TREE_LANES (node), true);
> >>        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> -  perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> >> +  {
> >> +    auto it = std::lower_bound (sorted.begin (), sorted.end (),
> >> +                                SLP_TREE_LOAD_PERMUTATION (node)[j]);
> >> +    perm[j] = it - sorted.begin ();
> >> +  }
> >>        perms.safe_push (perm);
> >>        vertices[idx].perm_in = perms.length () - 1;
> >>        vertices[idx].perm_out = perms.length () - 1;
> >> @@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo,
> >>  {
> >>    stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> >>    int vec_index = 0;
> >> -  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> >> +  tree vectype = SLP_TREE_VECTYPE (node);
> >
> > that looks like an obvious fix.
> >
> > Richard.
> >
> >>    unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
> >>    unsigned int mask_element;
> >>    machine_mode mode;
> >> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstraße 146, 90461 Nürnberg, Germany
