On Fri, Oct 18, 2024 at 1:08 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Fri, 18 Oct 2024, Tamar Christina wrote:
>
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: Friday, October 18, 2024 11:03 AM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: Christoph Müllner <christoph.muell...@vrull.eu>; gcc-patches@gcc.gnu.org;
> > > Philipp Tomsich <philipp.toms...@vrull.eu>; Jeff Law <jeffreya...@gmail.com>;
> > > Robin Dapp <rd...@ventanamicro.com>
> > > Subject: RE: [PATCH 2/2] Add a new permute optimization step in SLP
> > >
> > > On Thu, 17 Oct 2024, Tamar Christina wrote:
> > >
> > > > Hi Christoph,
> > > >
> > > > > -----Original Message-----
> > > > > From: Christoph Müllner <christoph.muell...@vrull.eu>
> > > > > Sent: Tuesday, October 15, 2024 3:57 PM
> > > > > To: gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.toms...@vrull.eu>;
> > > > > Tamar Christina <tamar.christ...@arm.com>; Richard Biener <rguent...@suse.de>
> > > > > Cc: Jeff Law <jeffreya...@gmail.com>; Robin Dapp <rd...@ventanamicro.com>;
> > > > > Christoph Müllner <christoph.muell...@vrull.eu>
> > > > > Subject: [PATCH 2/2] Add a new permute optimization step in SLP
> > > > >
> > > > > This commit adds a new permute optimization step after running SLP
> > > > > vectorization.
> > > > > Although there are existing places where individual or nested permutes
> > > > > can be optimized, there are cases where independent permutes can be
> > > > > optimized, which cannot be expressed in the current pattern matching
> > > > > framework.  The optimization step is run at the end so that permutes
> > > > > from completely different SLP builds can be optimized.
> > > > >
> > > > > The initial optimizations implemented can detect some cases where
> > > > > different "select permutes" (permutes that only use some of the
> > > > > incoming vector lanes) can be co-located in a single permute.  This
> > > > > can optimize some cases where two_operator SLP nodes have duplicate
> > > > > elements.
> > > > >
> > > > > Bootstrapped and reg-tested on AArch64 (C, C++, Fortran).
> > > > >
> > > > > Manolis Tsamis was the patch's initial author before I took it over.
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * tree-vect-slp.cc (get_tree_def): Return the definition of a
> > > > >         name.
> > > > >         (recognise_perm_binop_perm_pattern): Helper function.
> > > > >         (vect_slp_optimize_permutes): New permute optimization step.
> > > > >         (vect_slp_function): Run the new permute optimization step.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.dg/vect/slp-perm-14.c: New test.
> > > > >         * gcc.target/aarch64/sve/slp-perm-14.c: New test.
> > > > >
> > > > > Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> > > > > ---
> > > > >  gcc/testsuite/gcc.dg/vect/slp-perm-14.c       |  42 +++
> > > > >  .../gcc.target/aarch64/sve/slp-perm-14.c      |   3 +
> > > > >  gcc/tree-vect-slp.cc                          | 248 ++++++++++++++++++
> > > > >  3 files changed, 293 insertions(+)
> > > > >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > > > >
> > > > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-14.c b/gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > > > > new file mode 100644
> > > > > index 00000000000..f56e3982a62
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-14.c
> > > > > @@ -0,0 +1,42 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-additional-options "-O3 -fdump-tree-slp1-details" } */
> > > > > +
> > > > > +#include <stdint.h>
> > > > > +
> > > > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> > > > > +    int t0 = s0 + s1;\
> > > > > +    int t1 = s0 - s1;\
> > > > > +    int t2 = s2 + s3;\
> > > > > +    int t3 = s2 - s3;\
> > > > > +    d0 = t0 + t2;\
> > > > > +    d1 = t1 + t3;\
> > > > > +    d2 = t0 - t2;\
> > > > > +    d3 = t1 - t3;\
> > > > > +}
> > > > > +
> > > > > +int
> > > > > +x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
> > > > > +{
> > > > > +  uint32_t tmp[4][4];
> > > > > +  uint32_t a0, a1, a2, a3;
> > > > > +  int sum = 0;
> > > > > +
> > > > > +  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
> > > > > +    {
> > > > > +      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> > > > > +      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> > > > > +      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
> > > > > +      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
> > > > > +      HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
> > > > > +    }
> > > > > +
> > > > > +  for (int i = 0; i < 4; i++)
> > > > > +    {
> > > > > +      HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
> > > > > +      sum += a0 + a1 + a2 + a3;
> > > > > +    }
> > > > > +
> > > > > +  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
> > > > > +}
> > > > > +
> > > > > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "slp1" } } */
> > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > > > > new file mode 100644
> > > > > index 00000000000..4e0d5175be8
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-14.c
> > > > > @@ -0,0 +1,3 @@
> > > > > +#include "../../../gcc.dg/vect/slp-perm-14.c"
> > > > > +
> > > > > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */
> > > > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > > > > index 8794c94ef90..4bf5ccb9cdf 100644
> > > > > --- a/gcc/tree-vect-slp.cc
> > > > > +++ b/gcc/tree-vect-slp.cc
> > > > > @@ -9478,6 +9478,252 @@ vect_slp_if_converted_bb (basic_block bb, loop_p orig_loop)
> > > > >    return vect_slp_bbs (bbs, orig_loop);
> > > > >  }
> > > > >
> > > > > +/* If NAME is an SSA_NAME defined by an assignment, return that assignment.
> > > > > +   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  */
> > > > > +
> > > > > +static gassign *
> > > > > +get_tree_def (tree name, bool single_use_only)
> > > > > +{
> > > > > +  if (TREE_CODE (name) != SSA_NAME)
> > > > > +    return NULL;
> > > > > +
> > > > > +  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
> > > > > +
> > > > > +  if (single_use_only && !has_single_use (name))
> > > > > +    return NULL;
> > > > > +
> > > > > +  if (!is_gimple_assign (def_stmt))
> > > > > +    return NULL;
> > > >
> > > > Probably cheaper to test this before the single_use. But..
> > > > > +
> > > > > +  return dyn_cast <gassign *> (def_stmt);
> > > >
> > > > Why not just do the dyn_cast and assign that to a result and check
> > > > if the dyn_cast succeeded.  Saves you the second check of the type.
> > > >
> > > > > +}
> > > > > +
> > > > > +/* Helper function for vect_slp_optimize_permutes.  Return true if STMT is an
> > > > > +   expression of the form:
> > > > > +
> > > > > +     src1_perm = VEC_PERM_EXPR <SRC1, SRC1, CONST_VEC>
> > > > > +     src2_perm = VEC_PERM_EXPR <SRC2, SRC2, CONST_VEC>
> > > > > +     bop1 = src1_perm BINOP1 src2_perm
> > > > > +     bop2 = src1_perm BINOP2 src2_perm
> > > > > +     STMT = VEC_PERM_EXPR <bop1, bop2, CONST_VEC>
> > > > > +
> > > > > +   and src1_perm, src2_perm, bop1, bop2 are not used outside of STMT.
> > > > > +   Return the first two permute statements and the binops through the
> > > > > +   corresponding pointer arguments.  */
> > > > > +
> > > > > +static bool
> > > > > +recognise_perm_binop_perm_pattern (gassign *stmt,
> > > > > +                                  gassign **bop1_out, gassign **bop2_out,
> > > > > +                                  gassign **perm1_out, gassign **perm2_out)
> > > > > +{
> > > > > +  if (gimple_assign_rhs_code (stmt) != VEC_PERM_EXPR)
> > > > > +    return false;
> > > > > +
> > > > > +  gassign *bop1, *bop2;
> > > > > +  if (!(bop1 = get_tree_def (gimple_assign_rhs1 (stmt), true))
> > > > > +      || !(bop2 = get_tree_def (gimple_assign_rhs2 (stmt), true))
> > > > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (stmt)).is_constant ()
> > > > > +      || TREE_CODE_CLASS (gimple_assign_rhs_code (bop1)) != tcc_binary
> > > > > +      || TREE_CODE_CLASS (gimple_assign_rhs_code (bop2)) != tcc_binary)
> > > > > +    return false;
> > > > > +
> > > > > +  if (gimple_assign_rhs1 (bop1) != gimple_assign_rhs1 (bop2)
> > > > > +      || gimple_assign_rhs2 (bop1) != gimple_assign_rhs2 (bop2)
> > > > > +      || bop1 == bop2)
> > > > > +    return false;
> > > > > +
> > > > > +  tree src1_perm = gimple_assign_rhs1 (bop1);
> > > > > +  tree src2_perm = gimple_assign_rhs2 (bop1);
> > > > > +
> > > > > +  gassign *perm1, *perm2;
> > > > > +  if (!(perm1 = get_tree_def (src1_perm, false))
> > > > > +      || !(perm2 = get_tree_def (src2_perm, false))
> > > > > +      || num_imm_uses (src1_perm) != 2
> > > > > +      || num_imm_uses (src2_perm) != 2
> > > > > +      || gimple_assign_rhs_code (perm1) != VEC_PERM_EXPR
> > > > > +      || gimple_assign_rhs_code (perm2) != VEC_PERM_EXPR
> > > > > +      || gimple_assign_rhs1 (perm1) != gimple_assign_rhs2 (perm1)
> > > > > +      || gimple_assign_rhs1 (perm2) != gimple_assign_rhs2 (perm2)
> > > > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (perm1)).is_constant ()
> > > > > +      || !VECTOR_CST_NELTS (gimple_assign_rhs3 (perm2)).is_constant ())
> > > > > +    return false;
> > > > > +
> > > > > +  *bop1_out = bop1;
> > > > > +  *bop2_out = bop2;
> > > > > +  *perm1_out = perm1;
> > > > > +  *perm2_out = perm2;
> > > > > +
> > > > > +  return true;
> > > > > +}
> > > > > +
> > > > > +typedef hash_map<int_hash<unsigned HOST_WIDE_INT, -1, -1>,
> > > > > +                unsigned HOST_WIDE_INT> lane_map;
> > > > > +
> > > > > +/* Iterate over the basic blocks of FUN and try to optimize VEC_PERM
> > > > > +   expressions.  This is done at the end of vectorization so it can optimize
> > > > > +   expressions that are part of multiple different vectorized blocks/loops.  */
> > > > > +
> > > > > +static void
> > > > > +vect_slp_optimize_permutes (function *fun)
> > > > > +{
> > > > > +  basic_block bb;
> > > > > +  FOR_EACH_BB_FN (bb, fun)
> > > > > +    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
> > > > > +      {
> > > > > +       gimple_stmt_iterator gsi_stmt1 = gsi;
> > > > > +       gassign *stmt1 = dyn_cast <gassign *> (gsi_stmt (gsi));
> > > > > +       gsi_next (&gsi);
> > >
> > > gsi_next_nondebug (&gsi);
> > >
> > > to avoid compare-debug issues.
> > >
> > > It seems quite limited that you restrict this to adjacent stmts.
> > >
> > > > > +
> > > > > +       if (gsi_end_p (gsi))
> > > > > +         break;
> > > > > +
> > > > > +       gassign *stmt2 = dyn_cast <gassign *> (gsi_stmt (gsi));
> > > > > +
> > > > > +       if (!stmt1 || !stmt2)
> > > > > +         continue;
> > > > > +
> > > > > +       /* Try to identify select permutes which utilize only part of a
> > > > > +          vector and merge two of them into one.  This case can arise from
> > > > > +          TWO_OPERATOR SLP patterns because the final permute uses only half
> > > > > +          of each input vector.  */
> > > > > +       gassign *bop1_1, *bop1_2, *bop2_1, *bop2_2;
> > > > > +       gassign *src1_1, *src1_2, *src2_1, *src2_2;
> > > > > +
> > > > > +       if (!recognise_perm_binop_perm_pattern (stmt1, &bop1_1, &bop1_2,
> > > > > +                                              &src1_1, &src1_2))
> > > > > +         continue;
> > > > > +
> > > > > +       if (!recognise_perm_binop_perm_pattern (stmt2, &bop2_1, &bop2_2,
> > > > > +                                              &src2_1, &src2_2))
> > > > > +         continue;
> > > > > +
> > > > > +       if (gimple_assign_rhs_code (bop1_1) != gimple_assign_rhs_code (bop2_1))
> > > > > +         continue;
> > > > > +
> > > > > +       if (gimple_assign_rhs_code (bop1_2) != gimple_assign_rhs_code (bop2_2))
> > > > > +         continue;
> > >
> > > Hmm, up to here we've only verified both "binop_perm" are using the
> > > same binop operation code but we've not related their SRC1 or SRC2?
> > >
> > > From 1/2 in the series I guess we're trying to see whether we can
> > > re-use "unused lanes" here?
> > >
> > > > > +
> > > > > +       tree mask1 = gimple_assign_rhs3 (stmt1);
> > > > > +       tree mask2 = gimple_assign_rhs3 (stmt2);
> > > > > +
> > > > > +       /* Calculate the lanes that the first permute needs.  */
> > > > > +       lane_map mask1_lanes;
> > > > > +       unsigned HOST_WIDE_INT i;
> > > > > +       unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (mask1).to_constant ();
> > > > > +       for (i = 0; i < nelts; i++)
> > > > > +         {
> > > > > +           tree val = VECTOR_CST_ELT (mask1, i);
> > > > > +           gcc_assert (TREE_CODE (val) == INTEGER_CST);
> > > > > +           unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) & (nelts - 1);
> > > > > +           mask1_lanes.put (j, j);
> > > > > +         }
> > > > > +
> > > > > +       /* Calculate the lanes that the second permute needs and rewrite
> > > > > +          them to use the lanes that are unused from the first permute.  */
> > > > > +       lane_map mask2_lanes;
> > > > > +       unsigned HOST_WIDE_INT lane_gen = 0;
> > > > > +       for (i = 0; i < nelts; i++)
> > > > > +         {
> > > > > +           tree val = VECTOR_CST_ELT (mask2, i);
> > > > > +           gcc_assert (TREE_CODE (val) == INTEGER_CST);
> > > > > +           unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) & (nelts - 1);
> > > > > +           bool existed;
> > > > > +           unsigned HOST_WIDE_INT &rewrite_lane
> > > > > +             = mask2_lanes.get_or_insert (j, &existed);
> > > > > +           if (!existed)
> > > > > +             {
> > > > > +               while (mask1_lanes.get (lane_gen))
> > > > > +                 lane_gen++;
> > > > > +               if (lane_gen >= nelts)
> > > > > +                 break;
> > > > > +               rewrite_lane = lane_gen;
> > > > > +               lane_gen++;
> > > > > +             }
> > > > > +         }
> > > > > +
> > > > > +       /* The requested lanes need more than one permute.  */
> > > > > +       if (i < nelts)
> > > > > +         continue;
> > > > > +
> > > > > +       vec_perm_builder sel (nelts, nelts, 1);
> > > > > +       vec_perm_builder sel_1 (nelts, nelts, 1);
> > > > > +       vec_perm_builder sel_2 (nelts, nelts, 1);
> > > > > +
> > > > > +       /* Rewrite the permutations based on MASK1_LANES/MASK2_LANES.  */
> > > > > +       for (i = 0; i < nelts; i++)
> > > > > +         {
> > > > > +           /* Calculate new mask for STMT2.  */
> > > > > +           tree val = VECTOR_CST_ELT (mask2, i);
> > > > > +           unsigned HOST_WIDE_INT lane
> > > > > +             = TREE_INT_CST_LOW (val) & (2 * nelts - 1);
> > > > > +           unsigned off = (lane < nelts) ? 0 : nelts;
> > > > > +           unsigned HOST_WIDE_INT new_lane
> > > > > +             = *mask2_lanes.get (lane - off) + off;
> > > > > +           sel.quick_push (new_lane);
> > > > > +
> > > > > +           /* Calculate new masks for SRC1_1 and SRC1_2.  */
> > > > > +           bool use_src1 = mask1_lanes.get (i);
> > > > > +           tree mask1 = gimple_assign_rhs3 (use_src1 ? src1_1 : src2_1);
> > > > > +           tree mask2 = gimple_assign_rhs3 (use_src1 ? src1_2 : src2_2);
> > > > > +           unsigned HOST_WIDE_INT lane1
> > > > > +             = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) & (nelts - 1);
> > > > > +           unsigned HOST_WIDE_INT lane2
> > > > > +             = TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) & (nelts - 1);
> > > > > +           sel_1.quick_push (lane1 + (use_src1 ? 0 : nelts));
> > > > > +           sel_2.quick_push (lane2 + (use_src1 ? 0 : nelts));
> > > > > +         }
> > > > > +
> > > > > +       vec_perm_indices indices (sel, 2, nelts);
> > > > > +       vec_perm_indices indices1_1 (sel_1, 2, nelts);
> > > > > +       vec_perm_indices indices1_2 (sel_2, 2, nelts);
> > > > > +
> > > > > +       tree vectype = TREE_TYPE (gimple_assign_lhs (stmt2));
> > > > > +       if (!can_vec_perm_const_p (TYPE_MODE (vectype),
> > > > > +                                  TYPE_MODE (vectype), indices)
> > > > > +           || !can_vec_perm_const_p (TYPE_MODE (vectype),
> > > > > +                                     TYPE_MODE (vectype), indices1_1)
> > > > > +           || !can_vec_perm_const_p (TYPE_MODE (vectype),
> > > > > +                                     TYPE_MODE (vectype), indices1_2))
> > > > > +         continue;
> > > > > +
> > > > > +       /* Success.  Update all the statements.  */
> > > > > +       gimple_assign_set_rhs1 (stmt2, gimple_assign_rhs1 (stmt1));
> > > > > +       gimple_assign_set_rhs2 (stmt2, gimple_assign_rhs2 (stmt1));
> > > > > +       tree m1 = vect_gen_perm_mask_checked (vectype, indices);
> > > > > +       gimple_assign_set_rhs3 (stmt2, m1);
> > > > > +
> > > > > +       gimple_assign_set_rhs2 (src1_1, gimple_assign_rhs1 (src2_1));
> > > > > +       gimple_assign_set_rhs2 (src1_2, gimple_assign_rhs1 (src2_2));
> > > > > +
> > > > > +       tree m2 = vect_gen_perm_mask_checked (vectype, indices1_1);
> > > > > +       gimple_assign_set_rhs3 (src1_1, m2);
> > > > > +       tree m3 = vect_gen_perm_mask_checked (vectype, indices1_2);
> > > > > +       gimple_assign_set_rhs3 (src1_2, m3);
> > > > > +
> > > > > +       update_stmt (stmt2);
> > > > > +       update_stmt (src1_1);
> > > > > +       update_stmt (src1_2);
> > > > > +
> > > > > +       /* We need to move the updated statements because otherwise they may
> > > > > +          come before some variable that they depend on.  Since we know that
> > > > > +          they don't have uses outside the pattern, we can remove them and
> > > > > +          add them back in order.  */
> > > > > +       gimple_stmt_iterator gsi_rm = gsi_for_stmt (bop1_1);
> > > > > +       gsi_remove (&gsi_rm, false);
> > > > > +       gsi_rm = gsi_for_stmt (bop1_2);
> > > > > +       gsi_remove (&gsi_rm, false);
> > > > > +       gsi_rm = gsi_for_stmt (src1_1);
> > > > > +       gsi_remove (&gsi_rm, false);
> > > > > +       gsi_rm = gsi_for_stmt (src1_2);
> > > > > +       gsi_remove (&gsi_rm, false);
> > > > > +
> > > > > +       gsi_insert_before (&gsi_stmt1, src1_1, GSI_SAME_STMT);
> > > > > +       gsi_insert_before (&gsi_stmt1, src1_2, GSI_SAME_STMT);
> > > > > +       gsi_insert_before (&gsi_stmt1, bop1_1, GSI_SAME_STMT);
> > > > > +       gsi_insert_before (&gsi_stmt1, bop1_2, GSI_SAME_STMT);
> > > > > +      }
> > > > > +}
> > > > > +
> > > >
> > > > So this is trying to recognize two back to back permutes? I'm having a
> > > > bit of a hard time seeing what it's trying to match and what it's
> > > > optimizing into.  Can you add an example in the description of the
> > > > function?
> > > >
> > > > I'm having trouble with what the relationship between src_1_* and
> > > > src2_2_* are or should be as they're never checked.
> > > >
> > > > Aside from that though this doesn't feel like it should be done in the
> > > > vectorizer.  It feels odd to have a post vectorization permute
> > > > optimization pass in the vectorizer.
> > > >
> > > > I think this fits better in tree-ssa-forwprop.cc in simplify_permutation.
> > >
> > > Since the stmt groups are completely unrelated I don't think this fits.
> > >
> >
> > I was thinking you could use forwprop to mark all existing permutes that
> > have the properties you're looking to merge and then iterate over them
> > after.  Forwprop already walks all permutes.
> >
> > > I also see how in full generality such opportunities are difficult to
> > > exploit directly with the way we do greedy SLP discovery right now.
> > > With the idea of starting from SLP matching the SSA graph and instead
> > > merging nodes to form larger groups it might be able to catch this
> > > though.
> > >
> > > But yes, this doesn't seem to belong to tree-vect-slp.cc but elsewhere,
> > > iff we want to have it.  As said at the very beginning it seems to
> > > be very much a "SPEC hack" thing given it's very narrow focus.
> > >
> > > For more general application I would assume that a forward simulation
> > > of a function to compute lane equivalences in vector (those "unused"
> > > or "duplicate" lanes) and then shrinking and re-combining (re-vectorizing)
> > > based on that info sounds like a better thing to do than this hard
> > > pattern matching approach.  This would also sort-of mimic the
> > > "new SLP discovery" way.
> > >
> >
> > Isn't that too broad a problem though? Given, as you say, the permutes are
> > unrelated.  So if you generalize it to any permute that seems like it's
> > computationally expensive.
>
> For most generality you'd try to combine all PLUS operations in a
> basic-block/function/program to a single vector operation and then
> split along those pairs you cannot combine.  But sure, this is
> computationally expensive unless you manage to quickly lookup
> likely win candidates.  For the case in question you'd want O(quick)
> lookup of same BINOP with each half unused lanes and key off validation
> from the hopefully not too many of such candidates.
>
> That said - the patch as-is also tries to match up two operation chains
> in the attempt to combine them, it just very much restricts itself
> to a pair of adjacent stmts to identify those two chains.

Thank you, Richard and Tamar, for the feedback.
I'm responding with some delay because I first had to close a few knowledge
gaps about SLP to avoid making wrong assumptions or statements.

So, what these two patches do is exploit redundancy in lane utilization
to parallelize vectorized sequences.
This redundancy in lane utilization comes from the way certain scalar
statements end up vectorized: two VEC_PERMs on top, binary operations
on both of them, and a final VEC_PERM to create the result.
Here is an example of this sequence:

[...] // v_in = {e0, e1, e2, e3}
v_1 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}> // v_1 = {e1, e3, e1, e3}
v_2 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}> // v_2 = {e0, e2, e0, e2}
v_y = v_2 - v_1                           // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
v_x = v_2 + v_1                           // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}

The first patch frees up lanes 2 and 3 and changes the last statement to:
v_out = VEC_PERM <v_x, v_y, {0, 1, 4, 5}> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}

The second patch searches for a pair of such lane-compressed sequences in a BB
(with identical vector operations) and merges them into a single sequence.
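
To make this concrete, here is a sketch of the merged result; the second
sequence, operating on an independent input w_in = {f0, f1, f2, f3}, is
assumed here purely for illustration:

m_1 = VEC_PERM <v_in, w_in, {1, 3, 5, 7}> // m_1 = {e1, e3, f1, f3}
m_2 = VEC_PERM <v_in, w_in, {0, 2, 4, 6}> // m_2 = {e0, e2, f0, f2}
m_y = m_2 - m_1                           // m_y = {e0-e1, e2-e3, f0-f1, f2-f3}
m_x = m_2 + m_1                           // m_x = {e0+e1, e2+e3, f0+f1, f2+f3}
v_out = VEC_PERM <m_x, m_y, {0, 1, 4, 5}> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
w_out = VEC_PERM <m_x, m_y, {2, 3, 6, 7}> // w_out = {f0+f1, f2+f3, f0-f1, f2-f3}

The { 2, 3, 6, 7 } selector that the new slp-perm-14.c test scans for
corresponds to the second of these final permutes.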

Overall, on success, we replace the 2x (2x VEC_PERM + 2x BINOP + 1x VEC_PERM)
= 10 statements by 2x VEC_PERM + 2x BINOP + 2x VEC_PERM = 6 statements,
i.e. 2x VEC_PERM and 2x BINOP are eliminated.

So, the two optimizations are:
* lane compression in a certain sequence of vector statements
* merge two lane-compressed sequences of vector statements

Richard stated that the lane compression on its own is a possibly
pessimizing change.  Therefore, I've merged the two optimizations into
a new pass ("slp_perm_simplify") that runs after slp1.
The pass now searches for lane-compression candidates and calculates what
the resulting selector of the final VEC_PERM would look like.
Then we attempt to pair up two such sequences (which don't need to be
adjacent) for merging.  If we find a pair, both optimizations are applied.
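
For illustration, here is a toy model (plain C, not the actual pass code)
of the candidate check: given the selector of a sequence's final two-input
VEC_PERM, it computes which lanes each binop result must provide; if each
operand needs at most half of its lanes, the sequence is a lane-compression
candidate:

#include <stdio.h>

#define NELTS 4

/* For a two-input permute selector SEL, compute bitmasks of the lanes
   read from operand 0 and operand 1.  */
static void
operand_lanes (const unsigned sel[NELTS], unsigned *op0, unsigned *op1)
{
  *op0 = *op1 = 0;
  for (unsigned i = 0; i < NELTS; i++)
    if (sel[i] < NELTS)
      *op0 |= 1u << sel[i];            /* lane from the first operand  */
    else
      *op1 |= 1u << (sel[i] - NELTS);  /* lane from the second operand */
}

int
main (void)
{
  /* Final permute from the example above:
     v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>.  */
  unsigned sel[NELTS] = { 0, 1, 6, 7 };
  unsigned op0, op1;
  operand_lanes (sel, &op0, &op1);
  /* Prints "v_x lanes: 0x3, v_y lanes: 0xc": only half of each binop
     result is live, so the sequence can be lane-compressed and paired
     with a second such sequence.  */
  printf ("v_x lanes: 0x%x, v_y lanes: 0x%x\n", op0, op1);
  return 0;
}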

This targets x264_pixel_satd_8x4, which calculates the sum of absolute
transformed differences (SATD) using a Hadamard transformation.
We have seen a 7% speedup on SPEC's x264 (on an AArch64 machine).

I still need to clean up the code and optimize its analysis part
before sending out a patch, but I wanted to show a sign of life and
provide the missing context that was requested.



>
> Richard.
>
> > Tamar
> >
> > > Richard.
> > >
> > > > But if you're rewriting the permutes then there must be some kind of
> > > > link, could you not find them through the use/def chain instead of
> > > > relying on them being next to each other?
> > > >
> > > > An example of what it's trying to match and rewrite to would be helpful
> > > > here.
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > >  /* Main entry for the BB vectorizer.  Analyze and transform BB, 
> > > > > returns
> > > > >     true if anything in the basic-block was vectorized.  */
> > > > >
> > > > > @@ -9586,6 +9832,8 @@ vect_slp_function (function *fun)
> > > > >    if (!bbs.is_empty ())
> > > > >      r |= vect_slp_bbs (bbs, NULL);
> > > > >
> > > > > +  vect_slp_optimize_permutes (fun);
> > > > > +
> > > > >    free (rpo);
> > > > >
> > > > >    return r;
> > > > > --
> > > > > 2.46.0
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguent...@suse.de>
> > > SUSE Software Solutions Germany GmbH,
> > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
