On Wed, Jun 5, 2024 at 11:07 AM Richard Biener <rguent...@suse.de> wrote:
>
> On Tue, 4 Jun 2024, Manolis Tsamis wrote:
>
> > This change adds a function that checks for SLP nodes with multiple 
> > occurrences
> > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the 
> > node
> > so that there are no duplicates. A vec_perm is then introduced to recreate 
> > the
> > original ordering. These duplicates can appear due to how two_operators 
> > nodes
> > are handled, and they prevent vectorization in some cases.
>
> So the trick is that when we have two operands we elide duplicate lanes
> so we can do discovery for a single combined operand instead which we
> then decompose into the required two again.  That's a nice one.
>
> But as implemented this will fail SLP discovery if the combined operand
> fails discovery possibly because of divergence in downstream defs.  That
> is, it doesn't fall back to separate discovery.  I suspect the situation
> of duplicate lanes isn't common but then I would also suspect that
> divergence _is_ common.
>
That's a good point; I checked out and at least for the x264 testcase
provided SLP discovery succeeds in both cases but in one case
vectorization fails later on due to the unsupported load permutations
among others.
I think that's what Tamar also mentioned and it makes it hard to
decide whether to apply the pattern based on if discovery fails.

> The discovery code is already quite complex with the way it possibly
> swaps operands of lanes, fitting in this as another variant to try (first)
> is likely going to be a bit awkward.  A way out might be to split the
> function or to make the re-try in the caller which could indicate whether
> to apply this pattern trick or not.  That said - can you try to get
> data on how often the trick applies and discovery succeeds and how
> often discovery fails but discovery would suceed without applying the
> pattern (say, on SPEC)?
>
I checked out SPEC and this pattern only triggers on x264 and in that
case discovery succeeds. So we don't have any data on the pattern
applying but discovery failing.

> I also suppose instead of hardcoding three patterns for a fixed
> size it should be possible to see there's
> only (at most) half unique lanes in both operands (and one less in one
> operand if the number of lanes is odd) and compute the un-swizzling lane
> permutes during this discovery, removing the need of the explicit enum
> and open-coding each case?
>
Yes, that's a fair point. I will change that in the next iteration.

> Another general note is that trying (and then undo on fail) such ticks
> eats at the discovery limit we have in place to avoid exponential run-off
> in exactly this degenerate cases.
>

So, most importantly, the points you and Tamar mentioned got me
thinking about the transformation again, why it is useful and when it
applies.
In this initial implementation I tried to make this independant from
the two_operators logic and apply it when possible, which brings up
all these issues about discovery and usefulness of the pattern in
general.
E.g. If we had just [a, b, a, b] + [c, d, c, d] without two_operators
I sort of doubt it would be worth it to apply the transformation in
most cases (except of course if it enables vectorization, but as I
understand it it is hard to tell when that happens).
On the other hand, if we know that we're dealing with two_operators
nodes then the argument changes, as we know that we'll duplicate these
nodes.

In turn, it may be best to try to see this as a 'two_operators
lowering strategy' improvement instead of a generic rearrangement
pattern.
Specifically for x264, we're given code like

int t0 = s0 + s1;
int t1 = s0 - s1;
int t2 = s2 + s3;
int t3 = s2 - s3;

and currently we lower that to VEC_PERM<(A + B), (A - B)>(...) with A
= [s0, s0, s2, s2], B = [s1, s1, s3, s3] which doesn't work very well
(due to element duplication).
With this patch we do VEC_PERM<(A + B), (A - B)>(...) with A =
VEC_PERM<C, C>(...), B = VEC_PERM<C, C>(...),  C = [s0, s1, s2, s3]
instead which works good.
But it is obvious that there are other strategies to lower this too
and they may be even better (by taking advantage of the fact that we
know we're dealing with a two_operators node *and* have duplicate
elements).
For example doing VEC_PERM<(A + B), (A - B)>(...) with A = [s0, s1,
s2, s3] and B = VEC_PERM<A, A>(1, 0, 3, 2) looks interesting too and
is only possible because we combine two_operators and rearrangement.

Do you believe that narrowing this to a "two_operators lowering
improvement" makes more sense and addresses at least some of the
issues mentioned?
I'm currently testing to see the code that we generate with other
strategies and will reach out once I have new results.

Thanks,
Manolis

> Thanks,
> Richard.
>
> > This targets the vectorization of the SPEC2017 x264 pixel_satd functions.
> > In some processors a larger than 10% improvement on x264 has been observed.
> >
> > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
> >
> > gcc/ChangeLog:
> >
> >       * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for 
> > rearrangement
> >       patterns.
> >       (try_rearrange_oprnd_info): Detect if a node corresponds to one of the
> >       patterns.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/aarch64/vect-slp-two-operator.c: New test.
> >
> > Signed-off-by: Manolis Tsamis <manolis.tsa...@vrull.eu>
> > ---
> >
> >  .../aarch64/vect-slp-two-operator.c           |  42 ++++
> >  gcc/tree-vect-slp.cc                          | 234 ++++++++++++++++++
> >  2 files changed, 276 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c 
> > b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
> > new file mode 100644
> > index 00000000000..2db066a0b6e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
> > @@ -0,0 +1,42 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect 
> > -fdump-tree-vect-details" } */
> > +
> > +typedef unsigned char uint8_t;
> > +typedef unsigned int uint32_t;
> > +
> > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> > +    int t0 = s0 + s1;\
> > +    int t1 = s0 - s1;\
> > +    int t2 = s2 + s3;\
> > +    int t3 = s2 - s3;\
> > +    d0 = t0 + t2;\
> > +    d1 = t1 + t3;\
> > +    d2 = t0 - t2;\
> > +    d3 = t1 - t3;\
> > +}
> > +
> > +static uint32_t abs2( uint32_t a )
> > +{
> > +    uint32_t s = ((a>>15)&0x10001)*0xffff;
> > +    return (a+s)^s;
> > +}
> > +
> > +void sink(uint32_t tmp[4][4]);
> > +
> > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int 
> > i_pix2 )
> > +{
> > +    uint32_t tmp[4][4];
> > +    int sum = 0;
> > +    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
> > +    {
> > +        uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> > +        uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> > +        uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> > +        uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> > +        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 
> > );
> > +    }
> > +    sink(tmp);
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
> > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index bf1f467f53f..e395db0e185 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "tree-vectorizer.h"
> >  #include "langhooks.h"
> >  #include "gimple-walk.h"
> > +#include "gimple-pretty-print.h"
> >  #include "dbgcnt.h"
> >  #include "tree-vector-builder.h"
> >  #include "vec-perm-indices.h"
> > @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree perm, 
> > tree vectype,
> >    SLP_TREE_CHILDREN (perm).quick_push (child2);
> >  }
> >
> > +enum slp_oprnd_pattern
> > +{
> > +  SLP_OPRND_PATTERN_NONE,
> > +  SLP_OPRND_PATTERN_ABAB,
> > +  SLP_OPRND_PATTERN_AABB,
> > +  SLP_OPRND_PATTERN_ABBA
> > +};
> > +
> > +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a 
> > predefined
> > +   pattern described by SLP_OPRND_PATTERN and return it.  */
> > +
> > +static int
> > +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned 
> > group_size)
> > +{
> > +  unsigned i;
> > +  slp_oprnd_info info;
> > +
> > +  if (oprnds_info.length () != 2 || group_size % 4 != 0)
> > +    return SLP_OPRND_PATTERN_NONE;
> > +
> > +  if (!oprnds_info[0]->def_stmts[0]
> > +      || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt))
> > +    return SLP_OPRND_PATTERN_NONE;
> > +
> > +  enum tree_code code
> > +    = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt);
> > +  FOR_EACH_VEC_ELT (oprnds_info, i, info)
> > +    for (unsigned int j = 0; j < group_size; j += 1)
> > +      {
> > +     if (!info->def_stmts[j]
> > +         || !is_a<gassign *> (info->def_stmts[j]->stmt)
> > +         || STMT_VINFO_DATA_REF (info->def_stmts[j]))
> > +       return SLP_OPRND_PATTERN_NONE;
> > +     /* Don't mix different operations.  */
> > +     if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code)
> > +       return SLP_OPRND_PATTERN_NONE;
> > +      }
> > +
> > +  if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt)
> > +      != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt))
> > +    return SLP_OPRND_PATTERN_NONE;
> > +
> > +  int pattern = SLP_OPRND_PATTERN_NONE;
> > +  FOR_EACH_VEC_ELT (oprnds_info, i, info)
> > +    for (unsigned int j = 0; j < group_size; j += 4)
> > +      {
> > +     int cur_pattern = SLP_OPRND_PATTERN_NONE;
> > +     /* Check for an ABAB... pattern.  */
> > +     if ((info->def_stmts[j] == info->def_stmts[j + 2])
> > +         && (info->def_stmts[j + 1] == info->def_stmts[j + 3])
> > +         && (info->def_stmts[j] != info->def_stmts[j + 1]))
> > +       cur_pattern = SLP_OPRND_PATTERN_ABAB;
> > +     /* Check for an AABB... pattern.  */
> > +     else if ((info->def_stmts[j] == info->def_stmts[j + 1])
> > +              && (info->def_stmts[j + 2] == info->def_stmts[j + 3])
> > +              && (info->def_stmts[j] != info->def_stmts[j + 2]))
> > +       cur_pattern = SLP_OPRND_PATTERN_AABB;
> > +     /* Check for an ABBA... pattern.  */
> > +     else if ((info->def_stmts[j] == info->def_stmts[j + 3])
> > +              && (info->def_stmts[j + 1] == info->def_stmts[j + 2])
> > +              && (info->def_stmts[j] != info->def_stmts[j + 1]))
> > +       cur_pattern = SLP_OPRND_PATTERN_ABBA;
> > +     /* Unrecognised pattern.  */
> > +     else
> > +       return SLP_OPRND_PATTERN_NONE;
> > +
> > +     if (pattern == SLP_OPRND_PATTERN_NONE)
> > +       pattern = cur_pattern;
> > +     /* Multiple patterns detected.  */
> > +     else if (cur_pattern != pattern)
> > +       return SLP_OPRND_PATTERN_NONE;
> > +      }
> > +
> > +  gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE);
> > +
> > +  if (dump_enabled_p ())
> > +    {
> > +      if (pattern == SLP_OPRND_PATTERN_ABAB)
> > +     dump_printf (MSG_NOTE, "ABAB");
> > +      else if (pattern == SLP_OPRND_PATTERN_AABB)
> > +     dump_printf (MSG_NOTE, "AABB");
> > +      else if (pattern == SLP_OPRND_PATTERN_ABBA)
> > +     dump_printf (MSG_NOTE, "ABBA");
> > +      dump_printf (MSG_NOTE, " pattern detected.\n");
> > +    }
> > +
> > +  if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == 
> > SLP_OPRND_PATTERN_ABBA)
> > +    for (unsigned int j = 0; j < group_size; j += 4)
> > +      {
> > +     /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ...
> > +        Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ...
> > +        Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ...  */
> > +     oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j];
> > +     oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j];
> > +     oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1];
> > +     oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1];
> > +      }
> > +  else if (pattern == SLP_OPRND_PATTERN_AABB)
> > +    for (unsigned int j = 0; j < group_size; j += 4)
> > +      {
> > +     /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ...
> > +        Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ...
> > +        Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ...  */
> > +
> > +     /* The ordering here is at least to some extent arbitrary.
> > +        A generilized version needs to use some explicit ordering.  */
> > +     oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j];
> > +     oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j];
> > +     oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2];
> > +     oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2];
> > +     oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2];
> > +     oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2];
> > +      }
> > +
> > +  if (dump_enabled_p ())
> > +    {
> > +      dump_printf (MSG_NOTE, "Recurse with:\n");
> > +      for (unsigned int j = 0; j < group_size; j++)
> > +     {
> > +       dump_printf (MSG_NOTE, "  ");
> > +       print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, 
> > 0);
> > +     }
> > +    }
> > +
> > +  /* Since we've merged the two nodes in one, make the second one a copy of
> > +     the first.  */
> > +  for (unsigned int j = 0; j < group_size; j++)
> > +    {
> > +      oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j];
> > +      oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j];
> > +    }
> > +
> > +  return pattern;
> > +}
> > +
> >  /* Recursively build an SLP tree starting from NODE.
> >     Fail (and return a value not equal to zero) if def-stmts are not
> >     isomorphic, require data permutation or are of unsupported types of
> > @@ -2409,6 +2545,10 @@ out:
> >
> >    stmt_info = stmts[0];
> >
> > +  int rearrange_pattern = SLP_OPRND_PATTERN_NONE;
> > +  if (is_a<gassign *> (stmt_info->stmt))
> > +    rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, group_size);
> > +
> >    /* Create SLP_TREE nodes for the definition node/s.  */
> >    FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
> >      {
> > @@ -2669,6 +2809,100 @@ fail:
> >    *tree_size += this_tree_size + 1;
> >    *max_nunits = this_max_nunits;
> >
> > +  /* If we applied any rearrangmenets then we need to reconstruct the 
> > original
> > +     elements with an additional permutation layer.  */
> > +  if (rearrange_pattern != SLP_OPRND_PATTERN_NONE)
> > +    {
> > +      slp_tree one =  new _slp_tree;
> > +      slp_tree two = new _slp_tree;
> > +      SLP_TREE_DEF_TYPE (one) = vect_internal_def;
> > +      SLP_TREE_DEF_TYPE (two) = vect_internal_def;
> > +      SLP_TREE_VECTYPE (one) = vectype;
> > +      SLP_TREE_VECTYPE (two) = vectype;
> > +      SLP_TREE_CHILDREN (one).safe_splice (children);
> > +      SLP_TREE_CHILDREN (two).safe_splice (children);
> > +
> > +      SLP_TREE_CODE (one) = VEC_PERM_EXPR;
> > +      SLP_TREE_CODE (two) = VEC_PERM_EXPR;
> > +      SLP_TREE_REPRESENTATIVE (one) = stmts[0];
> > +      SLP_TREE_REPRESENTATIVE (two) = stmts[2];
> > +      lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one);
> > +      lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two);
> > +
> > +      if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB)
> > +     {
> > +        /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ...
> > +           Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ...
> > +           Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ...  */
> > +
> > +       for (unsigned int j = 0; j < group_size; j += 4)
> > +         {
> > +           perm_one.safe_push (std::make_pair (0, j + 0));
> > +           perm_one.safe_push (std::make_pair (0, j + 1));
> > +           perm_one.safe_push (std::make_pair (0, j + 0));
> > +           perm_one.safe_push (std::make_pair (0, j + 1));
> > +
> > +           perm_two.safe_push (std::make_pair (0, j + 2));
> > +           perm_two.safe_push (std::make_pair (0, j + 3));
> > +           perm_two.safe_push (std::make_pair (0, j + 2));
> > +           perm_two.safe_push (std::make_pair (0, j + 3));
> > +         }
> > +     }
> > +      else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB)
> > +     {
> > +        /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ...
> > +           Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ...
> > +           Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ...  */
> > +
> > +       for (unsigned int j = 0; j < group_size; j += 4)
> > +         {
> > +           perm_one.safe_push (std::make_pair (0, j + 0));
> > +           perm_one.safe_push (std::make_pair (0, j + 0));
> > +           perm_one.safe_push (std::make_pair (0, j + 2));
> > +           perm_one.safe_push (std::make_pair (0, j + 2));
> > +
> > +           perm_two.safe_push (std::make_pair (0, j + 1));
> > +           perm_two.safe_push (std::make_pair (0, j + 1));
> > +           perm_two.safe_push (std::make_pair (0, j + 3));
> > +           perm_two.safe_push (std::make_pair (0, j + 3));
> > +         }
> > +     }
> > +      else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA)
> > +     {
> > +        /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ...
> > +           Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ...
> > +           Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ...  */
> > +
> > +       for (unsigned int j = 0; j < group_size; j += 4)
> > +         {
> > +           perm_one.safe_push (std::make_pair (0, j + 0));
> > +           perm_one.safe_push (std::make_pair (0, j + 1));
> > +           perm_one.safe_push (std::make_pair (0, j + 1));
> > +           perm_one.safe_push (std::make_pair (0, j + 0));
> > +
> > +           perm_two.safe_push (std::make_pair (0, j + 2));
> > +           perm_two.safe_push (std::make_pair (0, j + 3));
> > +           perm_two.safe_push (std::make_pair (0, j + 3));
> > +           perm_two.safe_push (std::make_pair (0, j + 2));
> > +         }
> > +     }
> > +
> > +      slp_tree child;
> > +      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child)
> > +       SLP_TREE_REF_COUNT (child)++;
> > +
> > +      node = vect_create_new_slp_node (node, stmts, 2);
> > +      SLP_TREE_VECTYPE (node) = vectype;
> > +      SLP_TREE_CHILDREN (node).quick_push (one);
> > +      SLP_TREE_CHILDREN (node).quick_push (two);
> > +
> > +      SLP_TREE_LANES (one) = stmts.length ();
> > +      SLP_TREE_LANES (two) = stmts.length ();
> > +
> > +      children.truncate (0);
> > +      children.safe_splice (SLP_TREE_CHILDREN (node));
> > +    }
> > +
> >    if (two_operators)
> >      {
> >        /* ???  We'd likely want to either cache in bst_map sth like
> >
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to