On Wed, Oct 22, 2025 at 4:48 PM Tamar Christina <[email protected]> wrote:
>
> Hi Kugan,
>
> The 10/22/2025 11:14, Kugan Vivekanandarajah wrote:
> >
> > > Did you send the right version of the patch Kugan?  It's identical to
> > > the one you sent before and also has some changes in
> > > gcc/fortran/resolve.cc not specified, and your changelog seems to have
> > > an incorrect format; the files containing what you changed aren't
> > > mentioned.
> >
> > Apologies again.  Here is the correct version.  This also does not have
> > the changes for resolve.cc.
> >
>
> The patch doesn't seem quite correct to me, because it doesn't take into
> account that SLP nodes (and thus stmt_vinfo) are shared between SLP
> instances.
>
> So for instance
>
> #define N 32000
>
> void test_reverse_loop (float * __restrict__ a,
>                         float * __restrict__ b,
>                         float * __restrict__ c)
> {
>   for (int i = N - 1; i >= 0; i--)
>     {
>       a[i] = b[i] + 1.0f;
>       c[N-1-i] = b[i] + 1.0f;
>     }
> }
>
> generates the wrong code, because `b[i]` is shared but the permutes on `a`
> and `c` are different.
>
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index dcb25225b1b..cafc5a55dbe 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -65,6 +65,95 @@ along with GCC; see the file COPYING3.  If not see
> >  static tree vector_vector_composition_type (tree, poly_uint64, tree *,
> >                                              bool = false);
> >
> > +/* Recursively search for an SLP node whose representative matches
> > +   stmt_info.  */
> > +
> > +static slp_tree
> > +find_slp_node_in_children (slp_tree node, stmt_vec_info stmt_info)
> > +{
> > +  if (!node)
> > +    return NULL;
> > +
> > +  if (SLP_TREE_REPRESENTATIVE (node) == stmt_info)
> > +    return node;
> > +
> > +  slp_tree child;
> > +  unsigned int i;
> > +  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
> > +    {
> > +      slp_tree found = find_slp_node_in_children (child, stmt_info);
> > +      if (found)
> > +        return found;
> > +    }
> > +
> > +  return NULL;
> > +}
> > +
> > +/* Check if we can elide reverse permutes in a back-to-back load/store
> > +   pattern.  Returns true if the optimization applies and marks both load
> > +   and store to skip reverse permute generation.  */
> > +
> > +static bool
> > +vect_optimize_reverse_permutes (vec_info *vinfo, stmt_vec_info stmt_info,
> > +                                slp_tree slp_node)
> > +{
> > +  /* Get the RHS of the store statement.  */
> > +  tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
> > +  if (TREE_CODE (rhs) != SSA_NAME)
> > +    return false;
> > +
>
> I think in these functions you're mixing scalar analysis and SLP traversals.
> For SLP the nodes would have been classified and SLP_TREE_TYPE would return
> store_vec_info_type or load_vec_info_type.  But you definitely can't rely on
> the scalar statements here.
>
> > +  gimple *def = SSA_NAME_DEF_STMT (rhs);
> > +  if (!def || !is_gimple_assign (def))
> > +    return false;
> > +
> > +  tree_code code = gimple_assign_rhs_code (def);
> > +  /* Direct copy or element-wise operations.  */
> > +  if (code != SSA_NAME
> > +      && TREE_CODE_CLASS (code) != tcc_unary
> > +      && TREE_CODE_CLASS (code) != tcc_binary)
> > +    return false;
> > +
> > +  tree op1 = gimple_assign_rhs1 (def);
> > +  if (TREE_CODE (op1) != SSA_NAME)
> > +    return false;
> > +
>
> You're traversing the use-def chain of the SLP representative statement,
> but those scalar statements may not be relevant for vectorization: they
> could be unused or be part of a pattern.
>
> So for processing SLP graphs you can only look at the SLP children.
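To make that concrete, here is a rough sketch of a check that stays entirely
on the SLP graph, written against the macros already mentioned in this thread
(SLP_TREE_CHILDREN, SLP_TREE_TYPE, SLP_TREE_DEF_TYPE, SLP_TREE_LOAD_PERMUTATION).
The function name is made up and the sketch deliberately leaves out the test
that the load really is a contiguous reverse access; it only shows the shape
of a traversal that never touches scalar use-def chains:

/* Sketch only: return true if every internal leaf reachable from NODE is a
   load without a load permutation.  Walks the SLP graph via the children
   vector instead of following scalar SSA use-def chains.  */

static bool
all_leaf_loads_unpermuted_p (slp_tree node)
{
  /* Constants and externals never need a reverse permute.  */
  if (!node || SLP_TREE_DEF_TYPE (node) != vect_internal_def)
    return true;

  /* A leaf load qualifies only if it carries no load permutation.  */
  if (SLP_TREE_TYPE (node) == load_vec_info_type)
    return !SLP_TREE_LOAD_PERMUTATION (node).exists ();

  /* Otherwise recurse into the children of the (element-wise) operation.  */
  slp_tree child;
  unsigned int i;
  FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
    if (!all_leaf_loads_unpermuted_p (child))
      return false;

  return true;
}

Whether "no load permutation" is the right notion of a plain reverse load at
that point, rather than also checking the node's memory access type, is left
open here; the point is only that the walk uses the SLP children.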
>
> > +  /* For binary operations, check that op2 is a constant.  */
> > +  if (TREE_CODE_CLASS (code) == tcc_binary)
> > +    {
> > +      tree op2 = gimple_assign_rhs2 (def);
> > +      if (TREE_CODE (op2) == INTEGER_CST)
> > +        return false;
> > +    }
> > +  stmt_vec_info op1_info = vinfo->lookup_def (op1);
> > +  if (!op1_info || !STMT_VINFO_DATA_REF (op1_info)
> > +      || !DR_IS_READ (STMT_VINFO_DATA_REF (op1_info)))
> > +    return false;
> > +
> > +  /* Check if the load result is used only once (by this operation).
> > +     If it has multiple uses, we can't skip the permute.  */
> > +  if (!has_single_use (op1))
> > +    return false;
> > +
>
> I think you're attempting to see if the computation isn't shared; however,
> SLP trees track this differently, through reference counting on the node
> rather than on the SSA name.  Typically we just have to create a new node
> with the linear nodes.
>
> > +  /* Recursively find the SLP node for the load in the store's SLP
> > +     tree.  */
> > +  slp_tree op1_node = find_slp_node_in_children (slp_node, op1_info);
> > +
>
> This wouldn't be needed if you were traversing the SLP node itself.
>
> > +  if (!op1_node)
> > +    return false;
> > +
> > +  /* Check if this is also a permutation.  */
> > +  if (SLP_TREE_LOAD_PERMUTATION (op1_node).exists ())
> > +    return false;
> > +  /* Both are accesses in a reverse loop.
> > +     Mark both to skip reverse permutes.  */
> > +  op1_info->skip_reverse_permute = true;
> > +  stmt_info->skip_reverse_permute = true;
> > +
>
> Because you're updating the stmt_vinfo, which may be shared between SLP
> instances, you would incorrectly mark objects here.  See the example
> provided at the top.
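One way to avoid that, sketched here only to make the point concrete (the
set, its name, and where it would live are assumptions, not something the
patch or this thread defines), is to key the decision on the SLP node rather
than on the stmt_vinfo:

/* Sketch: remember the decision per SLP node, so a stmt_vinfo that is
   shared with another SLP instance is never affected.  The set could be
   local to the analysis or live in the vec_info; that is left open.  */
hash_set<slp_tree> elide_reverse_permute;

/* During analysis, record the load and store nodes of the matched pair
   (load_node and store_node are placeholders for the nodes found).  */
elide_reverse_permute.add (load_node);
elide_reverse_permute.add (store_node);

/* During costing/code generation, query the node, not the scalar stmt.  */
bool skip_permute = elide_reverse_permute.contains (slp_node);

That keeps the decision with the SLP instance that made it instead of leaking
it into shared per-statement state.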
>
> > +  if (dump_enabled_p ())
> > +    dump_printf_loc (MSG_NOTE, vect_location,
> > +                     "Optimized back-to-back reverse permutes\n");
> > +
> > +  return true;
> > +}
> > +
> >  /* Return TRUE iff the given statement is in an inner loop relative to
> >     the loop being vectorized.  */
> >  bool
> > @@ -7885,7 +7974,6 @@ vectorizable_scan_store (vec_info *vinfo, stmt_vec_info stmt_info,
> >    return true;
> >  }
> >
> > -
> >  /* Function vectorizable_store.
> >
> >     Check if STMT_INFO defines a non scalar data-ref (array/pointer/structure)
> > @@ -8172,7 +8260,7 @@ vectorizable_store (vec_info *vinfo,
> >    tree offvar = NULL_TREE;
> >    tree ivstep;
> >    tree running_off;
> > -  tree stride_base, stride_step, alias_off;
> > +  tree stride_base, stride_step = NULL_TREE, alias_off;
> >    tree vec_oprnd = NULL_TREE;
> >    tree dr_offset;
> >    /* Checked by get_load_store_type.  */
> > @@ -9073,6 +9161,12 @@ vectorizable_store (vec_info *vinfo,
> >
> >    new_stmt = NULL;
> >    gcc_assert (!grouped_store);
> > +  /* During the initial analysis or code generation (when cost_vec is NULL),
> > +     check if we can elide reverse permute by analyzing the scalar statement
> > +     pattern.  For code generation, check the flags; for initial analysis,
> > +     set them.  */
> > +  if (costing_p && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> > +    vect_optimize_reverse_permutes (vinfo, stmt_info, slp_node);
> > +
> >    for (i = 0; i < vec_num; i++)
> >      {
> >        if (!costing_p)
> > @@ -9080,10 +9174,15 @@ vectorizable_store (vec_info *vinfo,
> >
> >        if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> >          {
> > -          if (costing_p)
> > +          /* Check if the reverse permute should be skipped.  This flag is
> > +             set during costing when we detect a back-to-back load/store
> > +             pattern.  */
> > +          bool skip_permute = stmt_info->skip_reverse_permute;
> > +
> > +          /* Generate the reverse permute if not skipped.  */
> > +          if (costing_p && !skip_permute)
> >              inside_cost += record_stmt_cost (cost_vec, 1, vec_perm,
> >                                               slp_node, 0, vect_body);
> > -          else
> > +          else if (!costing_p && !skip_permute)
> >              {
> >                tree perm_mask = perm_mask_for_reverse (vectype);
> >                tree new_temp = make_ssa_name (vectype);
> > @@ -9449,7 +9548,7 @@ vectorizable_load (vec_info *vinfo,
> >    gphi *phi = NULL;
> >    vec<tree> dr_chain = vNULL;
> >    bool grouped_load = false;
> > -  stmt_vec_info first_stmt_info;
> > +  stmt_vec_info first_stmt_info = NULL;
> >    stmt_vec_info first_stmt_info_for_drptr = NULL;
> >    bool compute_in_loop = false;
> >    class loop *at_loop;
> > @@ -9859,7 +9958,7 @@ vectorizable_load (vec_info *vinfo,
> >    tree ivstep;
> >    tree running_off;
> >    vec<constructor_elt, va_gc> *v = NULL;
> > -  tree stride_base, stride_step, alias_off;
> > +  tree stride_base, stride_step = NULL_TREE, alias_off;
> >    /* Checked by get_load_store_type.  */
> >    unsigned int const_nunits = nunits.to_constant ();
> >    unsigned HOST_WIDE_INT cst_offset = 0;
> > @@ -11517,15 +11616,22 @@ vectorizable_load (vec_info *vinfo,
> >
> >        if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> >          {
> > -          if (costing_p)
> > -            inside_cost = record_stmt_cost (cost_vec, 1, vec_perm,
> > -                                            slp_node, 0, vect_body);
> > -          else
> > +          /* Check if the reverse permute should be skipped.  This flag is
> > +             set during store costing when we detect a back-to-back
> > +             load/store pattern.  */
> > +          bool skip_permute = stmt_info->skip_reverse_permute;
> > +
> > +          if (!skip_permute)
> >              {
> > -              tree perm_mask = perm_mask_for_reverse (vectype);
> > -              new_temp = permute_vec_elements (vinfo, new_temp, new_temp,
> > -                                               perm_mask, stmt_info, gsi);
> > -              new_stmt = SSA_NAME_DEF_STMT (new_temp);
> > +              if (costing_p)
> > +                inside_cost = record_stmt_cost (cost_vec, 1, vec_perm,
> > +                                                slp_node, 0, vect_body);
> > +              else
> > +                {
> > +                  tree perm_mask = perm_mask_for_reverse (vectype);
> > +                  new_temp = permute_vec_elements (vinfo, new_temp, new_temp,
> > +                                                   perm_mask, stmt_info, gsi);
> > +                  new_stmt = SSA_NAME_DEF_STMT (new_temp);
> > +                }
> >              }
> >          }
> >
>
> I agree with Richi here in that there are really two ways forward here:
>
> 1. You move this to somewhere that post-processes VEC_PERM_EXPR, like
>    forwprop, which would work as well for intrinsics code.

This would work.
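As a side note on what such a post-processing pass would be looking at: after
vectorization the back-to-back reversals are just two VEC_PERM_EXPRs with the
same constant mask, and composing them gives the identity.  The same property
is easy to see with plain GCC vector extensions (self-contained illustration
only; the type and function names here are made up):

typedef int   v4si __attribute__ ((vector_size (16)));
typedef float v4sf __attribute__ ((vector_size (16)));

/* Reverse X twice with the same constant selector; the two shuffles compose
   to the identity, which is what lets a VEC_PERM_EXPR post-processing pass
   cancel the load-side and store-side reversals.  */
v4sf
double_reverse (v4sf x)
{
  v4si rev = { 3, 2, 1, 0 };
  v4sf r = __builtin_shuffle (x, rev);   /* reversal after the load */
  return __builtin_shuffle (r, rev);     /* reversal before the store */
}

At -O2 the existing VEC_PERM_EXPR simplifications should already reduce this
particular function to a plain copy of x; the interesting part for the loop
case is doing the equivalent across the load/arithmetic/store chain the
vectorizer emits.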
> 2. a) You either implement this into vect_optimize_slp, which knows how to
>       deal with the complexity of SLP graph permutes, so that by the time we
>       get to vectorizable_loads/stores we have no permutes to begin with.
>
>    b) You create an SLP pattern which materializes the inverse permutes,
>       which would allow vect_optimize_slp to remove the redundant permutes.

I think those two do not work easily, given that we cannot express the
"reverse permute" applied after the load for negative strided loads in load
or lane permutations or in the scalar stmt order.  At least I couldn't come
up with a way ;)

So my suggestion was to pattern-match, on the SLP graph, the situation where
negative strided loads feed fully into negative strided stores.  As there are
no backedges in the SLP graph, this analysis is suited to reside inside
vect_optimize_slp (where we build a full graph), but the current permute
optimization cannot do this because of the representation issue above.

But still, after the optimization the same graph can be used to flag the
nodes where we can elide the reverse permutes.

Richard.

> Richi has the final say of course, but one of those two would have my vote.
>
> Thanks,
> Tamar
>
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index fdb76ad4d4d..94002304f18 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -1619,6 +1619,11 @@ public:
> >    /* True if this is a pattern that can only be handled by SLP
> >       vectorization.  */
> >    bool slp_vect_pattern_only_p;
> > +
> > +  /* For loads/stores with VMAT_CONTIGUOUS_REVERSE, indicates that the
> > +     reverse permute should be skipped because it cancels with a reverse
> > +     permute in the paired store/load.  */
> > +  bool skip_reverse_permute;
> >  };
> >
> >  /* Information about a gather/scatter call.  */
> > --
> > 2.50.1 (Apple Git-155)
> >
> >
