On Fri, Jul 12, 2024 at 7:20 AM Feng Xue OS <f...@os.amperecomputing.com> wrote:
>
> Hi, Richard,
>
> Let me explain the design choice that has to be made for lane-reducing ops.
> The key complication is that these ops are associated with two kinds of
> vec_nums: one is the number of effective vector stmts, which is used by
> partial vectorization functions such as vect_get_loop_mask; the other is the
> number of total created vector stmts. The latter should be aligned with that
> of a normal op, so that lane-reducing and normal ops can interoperate.
> Suppose expressions mixing lane-reducing and normal ops such as:
>
>     temp = lane_reducing<16*char> + expr<4*int>;
>     temp = cst<4*int> * lane_reducing<16*char>;
>
> If we only generate the effective vector stmts for lane_reducing, the vector
> def/use between the ops will never match up, so extra pass-through copies are
> necessary. This is why we say "refit a lane-reducing to be a fake normal op".
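>
> A small sketch of the first expression above may make this clearer (just an
> illustration, assuming a 128-bit vector size and VF = 16, with made-up
> operand names):
>
>     vector<4> int sum_v0, sum_v1, sum_v2, sum_v3;   // reduction PHIs
>
>     for (i / 16)
>       {
>         // lane-reducing op: only one effective vector stmt ...
>         sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>         sum_v1 = sum_v1;  // ... plus pass-through copies to fill the gap
>         sum_v2 = sum_v2;
>         sum_v3 = sum_v3;
>
>         // normal <4*int> op: consumes and produces 4 vector defs
>         sum_v0 = e_v0[i: 0 ~ 3] + sum_v0;
>         sum_v1 = e_v1[i: 4 ~ 7] + sum_v1;
>         sum_v2 = e_v2[i: 8 ~ 11] + sum_v2;
>         sum_v3 = e_v3[i: 12 ~ 15] + sum_v3;
>       }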

And this only happens in vect_transform_reduction, right?

The other pre-existing issue is that for the single_defuse_cycle optimization
SLP_TREE_NUMBER_OF_VEC_STMTS is also off (too large).  But here
the transform also goes through vect_transform_reduction.

> The requirement for the two vec_nums is independent of how we implement
> SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor the vect code
> to unify the ncopies/vec_num computation and completely remove
> SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a large task, and might be
> overkill for these lane-reducing patches. So I will keep it as it is and not
> touch it, which is what I have done in this patch.
>
> Since a single SLP_TREE_NUMBER_OF_VEC_STMTS cannot serve both purposes,
> your previous suggestion might not work:
>
> > As said, I don't like this much.  vect_slp_analyze_node_operations_1
> > sets this and I think the existing "exception"
> >
> >  /* Calculate the number of vector statements to be created for the
> >     scalar stmts in this node.  For SLP reductions it is equal to the
> >     number of vector statements in the children (which has already been
> >     calculated by the recursive call).  Otherwise it is the number of
> >     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> >     VF divided by the number of elements in a vector.  */
> >  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> >      && !STMT_VINFO_DATA_REF (stmt_info)
> >      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> >    {
> >      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
> >        if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def)
> >          {
> >            SLP_TREE_NUMBER_OF_VEC_STMTS (node)
> >              = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
> >            break;
> >          }
> >    }
> >
> > could be changed (or amended if replacing doesn't work out) to
> >
> >   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> >       && STMT_VINFO_REDUC_IDX (stmt_info)
> >       // do we have this always set?
> >       && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info))
> >    {
> >       do the same as in else {} but using VECTYPE_IN
> >    }
> >
> > Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN
> > when that's set instead of SLP_TREE_VECTYPE?  As said, having a wrong
> > SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire.
>
> Then the alternative is to limit the special handling related to the vec_num
> to vect_transform_reduction only. Is that ok? Or do you have any other
> suggestion?

I think that's kind-of in line with the suggestion of a reduction-specific
VF, so yes, not using SLP_TREE_NUMBER_OF_VEC_STMTS in
vect_transform_reduction sounds fine to me.  It would be a step towards not
having SLP_TREE_NUMBER_OF_VEC_STMTS at all, where the function would be
responsible for the appropriate allocation as well.
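
Roughly (just a sketch, using the vect_get_num_copies overload suggested
below and the vectype_in already computed there):

  /* Inside vect_transform_reduction, compute the numbers locally from the
     input vectype instead of reading SLP_TREE_NUMBER_OF_VEC_STMTS.  */
  if (slp_node)
    {
      ncopies = 1;
      vec_num = vect_get_num_copies (loop_vinfo, vectype_in, slp_node);
    }
  else
    {
      ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
      vec_num = 1;
    }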

Richard.

> Thanks,
> Feng
> ________________________________________
> From: Richard Biener <richard.guent...@gmail.com>
> Sent: Thursday, July 11, 2024 5:43 PM
> To: Feng Xue OS; Richard Sandiford
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing
>
> On Thu, Jul 11, 2024 at 10:53 AM Feng Xue OS
> <f...@os.amperecomputing.com> wrote:
> >
> > The vector stmts number of an operation is calculated based on its output
> > vectype.  This is over-estimated for a lane-reducing operation, so sometimes
> > we have to rely on additional logic to deduce an exactly accurate number by
> > other means to work around the issue.  To avoid that inconvenience, this
> > patch "turns" a lane-reducing operation into a normal one by inserting new
> > trivial statements like zero-valued PHIs and pass-through copies, which can
> > be optimized away by later passes.  At the same time, a new field is added
> > to the slp node to hold the number of vector stmts that are really effective
> > after vectorization.  For example:
>
> Adding Richard into the loop.
>
> I'm sorry, but this feels a bit backwards - in the end I was hoping that we
> can get rid of SLP_TREE_NUMBER_OF_VEC_STMTS completely.
> We do currently have the odd ncopies (non-SLP) vs. vec_num (SLP)
> duality but in reality all vectorizable_* should know the number of
> stmt copies (or output vector defs) to produce by looking at the vector
> type and the vectorization factor (and in the SLP case the number of
> lanes represented by the node).
>
> That means that in the end vectorizable_* could at transform time
> simply make sure that SLP_TREE_VEC_DEF is appropriately
> created (currently generic code does this based on
> SLP_TREE_NUMBER_OF_VEC_STMTS and also generic code
> tries to determine SLP_TREE_NUMBER_OF_VEC_STMTS).
>
> There are a lot of (well, not too many) uses of SLP_TREE_NUMBER_OF_VEC_STMTS
> that short-cut the "appropriate" vec_num computation based on
> VF, vector type and lanes.  I hope all of them could vanish.
>
> You add vect_get_[slp_]num_vectors, but I'd rather see a single
>
> inline unsigned int
> vect_get_num_copies (vec_info *vinfo, tree vectype, slp_tree node = nullptr)
> {
>   if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
>     {
>       if (node)
>         return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>                                      * SLP_TREE_LANES (node), vectype);
>       else
>         return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
>                                      vectype);
>     }
>   else
>     return vect_get_num_vectors (SLP_TREE_LANES (node), vectype);
> }
>
> so that
>
>   if (slp_node)
>     {
>       ncopies = 1;
>       vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>     }
>   else
>     {
>       ncopies = vect_get_num_copies (loop_vinfo, vectype);
>       vec_num = 1;
>     }
>
> can become
>
>   if (slp_node)
>     {
>       ncopies = 1;
>       vec_num = vect_get_num_copies (loop_vinfo, vectype, slp_node);
>     }
>   else
>     {
>       ncopies = vect_get_num_copies (loop_vinfo, vectype/*, slp_node */);
>       vec_num = 1;
>     }
>
> without actually resolving the ncopies/vec_num duality (that will solve itself
> when we have achieved only-SLP).
>
> I think if SLP_TREE_NUMBER_OF_VEC_STMTS is gone then having the
> few places you needed to fix compute the correct number of stmts should be OK.
> Note this will probably require moving the allocation of SLP_TREE_VEC_DEFS
> to the code doing the transform (and removing/adjusting some sanity checking
> we do).
>
> Thanks,
> Richard.
>
> >   int sum = 1;
> >   for (i)
> >     {
> >       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >     }
> >
> >   The vector size is 128-bit, vectorization factor is 16.  Reduction
> >   statements would be transformed as:
> >
> >   vector<4> int sum_v0 = { 0, 0, 0, 1 };
> >   vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >   vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >   vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >
> >   for (i / 16)
> >     {
> >       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >       sum_v1 = sum_v1;  // copy
> >       sum_v2 = sum_v2;  // copy
> >       sum_v3 = sum_v3;  // copy
> >     }
> >
> >   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
> >
> > Thanks,
> > Feng
> >
> > ---
> > gcc/
> >         * tree-vectorizer.h (vec_stmts_effec_size): New field in _slp_tree.
> >         (SLP_TREE_VEC_STMTS_EFFEC_NUM): New macro.
> >         (vect_get_num_vectors): New overload function.
> >         (vect_get_slp_num_vectors): New function.
> >         * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
> >         Use effective vector stmts number.
> >         (vectorizable_reduction): Compute number of effective vector stmts
> >         for lane-reducing op and reduction PHI.
> >         (vect_transform_reduction): Insert copies for lane-reducing so as to
> >         fix inaccurate vector stmts number.
> >         (vect_transform_cycle_phi): Only need to calculate vector PHI number
> >         based on input vectype for non-slp code path.
> >         * tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize effective
> >         vector stmts number to zero.
> >         (vect_slp_analyze_node_operations_1): Remove adjustment on vector
> >         stmts number specific to slp reduction.
> >         (vect_slp_analyze_node_operations): Compute number of vector
> >         elements for constant/external slp node with vect_get_slp_num_vectors.
> > ---
> >  gcc/tree-vect-loop.cc | 139 ++++++++++++++++++++++++++++++++++++------
> >  gcc/tree-vect-slp.cc  |  56 ++++++-----------
> >  gcc/tree-vectorizer.h |  45 ++++++++++++++
> >  3 files changed, 183 insertions(+), 57 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index c183e2b6068..5ad9836d6c8 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -7471,7 +7471,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >        unsigned nvectors;
> >
> >        if (slp_node)
> > -       nvectors = SLP_TREE_VEC_STMTS_NUM (slp_node);
> > +       nvectors = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
> >        else
> >         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >
> > @@ -7594,6 +7594,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    stmt_vec_info phi_info = stmt_info;
> >    if (!is_a <gphi *> (stmt_info->stmt))
> >      {
> > +      if (lane_reducing_stmt_p (stmt_info->stmt) && slp_node)
> > +       {
> > +         /* Compute number of effective vector statements for lane-reducing
> > +            ops.  */
> > +         vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +         gcc_assert (vectype_in);
> > +         SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
> > +           = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
> > +       }
> >        STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >        return true;
> >      }
> > @@ -8012,14 +8021,25 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    if (STMT_VINFO_LIVE_P (phi_info))
> >      return false;
> >
> > -  if (slp_node)
> > -    ncopies = 1;
> > -  else
> > -    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> > +  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
> >
> > -  gcc_assert (ncopies >= 1);
> > +  if (slp_node)
> > +    {
> > +      ncopies = 1;
> >
> > -  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
> > +      if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype_in), nunits_out))
> > +       {
> > +         /* Not all vector reduction PHIs would be used, compute number
> > +            of the effective statements.  */
> > +         SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
> > +           = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
> > +       }
> > +    }
> > +  else
> > +    {
> > +      ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> > +      gcc_assert (ncopies >= 1);
> > +    }
> >
> >    if (nested_cycle)
> >      {
> > @@ -8360,7 +8380,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >         || (slp_node
> >            && !REDUC_GROUP_FIRST_ELEMENT (stmt_info)
> >            && SLP_TREE_LANES (slp_node) == 1
> > -          && vect_get_num_copies (loop_vinfo, vectype_in) > 1))
> > +          && SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node) > 1))
> >        && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
> >        && reduc_chain_length == 1
> >        && loop_vinfo->suggested_unroll_factor == 1)
> > @@ -8595,12 +8615,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >    stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >    gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
> >    int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
> > -  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> > +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +
> > +  if (!vectype_in)
> > +    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> >
> >    if (slp_node)
> >      {
> >        ncopies = 1;
> > -      vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
> > +      vec_num = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
> >      }
> >    else
> >      {
> > @@ -8702,6 +8725,64 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >                          &vec_oprnds[2]);
> >      }
> > +  else if (lane_reducing)
> > +    {
> > +      /* For normal reduction, consistency between vectorized def/use is
> > +        naturally ensured when mapping from scalar statement.  But if lane-
> > +        reducing op is involved in reduction, thing would become somewhat
> > +        complicated in that the op's result and operand for accumulation are
> > +        limited to fewer lanes than other operands, which certainly causes
> > +        def/use mismatch on adjacent statements around the op if we do not
> > +        have any kind of specific adjustment.  One approach is to refit lane-
> > +        reducing op in the way of introducing new trivial pass-through copies
> > +        to fix possible def/use gap, so as to make it behave like a normal op.
> > +        And vector reduction PHIs are always generated to the full extent, no
> > +        matter lane-reducing op exists or not.  If some copies or PHIs are
> > +        actually superfluous, they would be cleaned up by passes after
> > +        vectorization.  An example for single-lane slp is given below.
> > +        Similarly, this handling is applicable for multiple-lane slp as well.
> > +
> > +          int sum = 1;
> > +          for (i)
> > +            {
> > +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> > +            }
> > +
> > +        The vector size is 128-bit, vectorization factor is 16.  Reduction
> > +        statements would be transformed as:
> > +
> > +          vector<4> int sum_v0 = { 0, 0, 0, 1 };
> > +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> > +
> > +          for (i / 16)
> > +            {
> > +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> > +              sum_v1 = sum_v1;  // copy
> > +              sum_v2 = sum_v2;  // copy
> > +              sum_v3 = sum_v3;  // copy
> > +            }
> > +
> > +          sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
> > +       */
> > +      unsigned effec_ncopies = vec_oprnds[0].length ();
> > +      unsigned total_ncopies = vec_oprnds[reduc_index].length ();
> > +
> > +      if (slp_node)
> > +       gcc_assert (effec_ncopies == SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node));
> > +
> > +      gcc_assert (effec_ncopies <= total_ncopies);
> > +
> > +      if (effec_ncopies < total_ncopies)
> > +       {
> > +         for (unsigned i = 0; i < op.num_ops - 1; i++)
> > +           {
> > +             gcc_assert (vec_oprnds[i].length () == effec_ncopies);
> > +             vec_oprnds[i].safe_grow_cleared (total_ncopies);
> > +           }
> > +       }
> > +    }
> >
> >    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> > @@ -8710,7 +8791,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >      {
> >        gimple *new_stmt;
> >        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> > -      if (masked_loop_p && !mask_by_cond_expr)
> > +      if (!vop[0] || !vop[1])
> > +       {
> > +         tree reduc_vop = vec_oprnds[reduc_index][i];
> > +
> > +         /* If we could not generate an effective vector statement for the
> > +            current portion of the reduction operand, insert a trivial copy
> > +            to simply hand over the operand to other dependent statements.  */
> > +         gcc_assert (reduc_vop);
> > +
> > +         if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME
> > +             && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop))
> > +           new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
> > +         else
> > +           {
> > +             new_temp = make_ssa_name (vec_dest);
> > +             new_stmt = gimple_build_assign (new_temp, reduc_vop);
> > +             vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> > +                                          gsi);
> > +           }
> > +       }
> > +      else if (masked_loop_p && !mask_by_cond_expr)
> >         {
> >           /* No conditional ifns have been defined for lane-reducing op
> >              yet.  */
> > @@ -8810,21 +8911,19 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
> >      /* Leave the scalar phi in place.  */
> >      return true;
> >
> > -  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> > -  /* For a nested cycle we do not fill the above.  */
> > -  if (!vectype_in)
> > -    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> > -  gcc_assert (vectype_in);
> > -
> >    if (slp_node)
> >      {
> > -      /* The size vect_schedule_slp_instance computes is off for us.  */
> > -      vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> > -                                     * SLP_TREE_LANES (slp_node), vectype_in);
> > +      vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
> >        ncopies = 1;
> >      }
> >    else
> >      {
> > +      tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> > +      /* For a nested cycle we do not fill the above.  */
> > +      if (!vectype_in)
> > +       vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> > +      gcc_assert (vectype_in);
> > +
> >        vec_num = 1;
> >        ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> >      }
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 00863398fd0..3c163c16fc8 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -114,6 +114,7 @@ _slp_tree::_slp_tree ()
> >    SLP_TREE_SCALAR_OPS (this) = vNULL;
> >    SLP_TREE_VEC_DEFS (this) = vNULL;
> >    SLP_TREE_VEC_STMTS_NUM (this) = 0;
> > +  SLP_TREE_VEC_STMTS_EFFEC_NUM (this) = 0;
> >    SLP_TREE_CHILDREN (this) = vNULL;
> >    SLP_TREE_LOAD_PERMUTATION (this) = vNULL;
> >    SLP_TREE_LANE_PERMUTATION (this) = vNULL;
> > @@ -6552,38 +6553,16 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
> >                                     slp_instance node_instance,
> >                                     stmt_vector_for_cost *cost_vec)
> >  {
> > -  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > -
> >    /* Calculate the number of vector statements to be created for the
> > -     scalar stmts in this node.  For SLP reductions it is equal to the
> > -     number of vector statements in the children (which has already been
> > -     calculated by the recursive call).  Otherwise it is the number of
> > -     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> > -     VF divided by the number of elements in a vector.  */
> > -  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> > -      && !STMT_VINFO_DATA_REF (stmt_info)
> > -      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> > -    {
> > -      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
> > -       if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def)
> > -         {
> > -           SLP_TREE_VEC_STMTS_NUM (node)
> > -             = SLP_TREE_VEC_STMTS_NUM (SLP_TREE_CHILDREN (node)[i]);
> > -           break;
> > -         }
> > -    }
> > -  else
> > -    {
> > -      poly_uint64 vf;
> > -      if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > -       vf = loop_vinfo->vectorization_factor;
> > -      else
> > -       vf = 1;
> > -      unsigned int group_size = SLP_TREE_LANES (node);
> > -      tree vectype = SLP_TREE_VECTYPE (node);
> > -      SLP_TREE_VEC_STMTS_NUM (node)
> > -       = vect_get_num_vectors (vf * group_size, vectype);
> > -    }
> > +     scalar stmts in this node.  */
> > +  unsigned int num = vect_get_slp_num_vectors (vinfo, node);
> > +
> > +  SLP_TREE_VEC_STMTS_NUM (node) = num;
> > +
> > +  /* Initialize with the number of vector statements, which would be changed
> > +     to some smaller one for reduction PHI and lane-reducing statement
> > +     after the real input vectype is determined by vectorizable analysis.  */
> > +  SLP_TREE_VEC_STMTS_EFFEC_NUM (node) = num;
> >
> >    /* Handle purely internal nodes.  */
> >    if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> > @@ -6605,7 +6584,9 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
> >        return true;
> >      }
> >
> > +  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> >    bool dummy;
> > +
> >    return vect_analyze_stmt (vinfo, stmt_info, &dummy,
> >                             node, node_instance, cost_vec);
> >  }
> > @@ -6851,12 +6832,13 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
> >                           && j == 1);
> >               continue;
> >             }
> > -         unsigned group_size = SLP_TREE_LANES (child);
> > -         poly_uint64 vf = 1;
> > -         if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > -           vf = loop_vinfo->vectorization_factor;
> > -         SLP_TREE_VEC_STMTS_NUM (child)
> > -           = vect_get_num_vectors (vf * group_size, vector_type);
> > +
> > +         /* Calculate number of vector elements for constant slp node.  */
> > +         unsigned int num = vect_get_slp_num_vectors (vinfo, child);
> > +
> > +         SLP_TREE_VEC_STMTS_NUM (child) = num;
> > +         SLP_TREE_VEC_STMTS_EFFEC_NUM (child) = num;
> > +
> >           /* And cost them.  */
> >           vect_prologue_cost_for_slp (child, cost_vec);
> >         }
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index ef89ff1b3e4..f769c2b8c92 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -211,6 +211,15 @@ struct _slp_tree {
> >       divided by vector size.  */
> >    unsigned int vec_stmts_size;
> >
> > +  /* Number of vector stmts that are really effective after transformation.
> > +     Lane-reducing operation requires fewer vector stmts than normal, so
> > +     trivial copies will be inserted to make vector def-use chain align
> > +     mutually.  Vector reduction PHIs are fully generated with VF as above;
> > +     some may be optimized away if it starts a copy-self def-use cycle, which
> > +     occurs in reduction involving only lane-reducing.  Note: the enforcement
> > +     of single-defuse-cycle would not impact this, as it is specially
> > +     handled.  */
> > +  unsigned int vec_stmts_effec_size;
> > +
> >    /* Reference count in the SLP graph.  */
> >    unsigned int refcnt;
> >    /* The maximum number of vector elements for the subtree rooted
> > @@ -304,6 +313,7 @@ public:
> >  #define SLP_TREE_REF_COUNT(S)                    (S)->refcnt
> >  #define SLP_TREE_VEC_DEFS(S)                     (S)->vec_defs
> >  #define SLP_TREE_VEC_STMTS_NUM(S)                (S)->vec_stmts_size
> > +#define SLP_TREE_VEC_STMTS_EFFEC_NUM(S)          (S)->vec_stmts_effec_size
> >  #define SLP_TREE_LOAD_PERMUTATION(S)             (S)->load_permutation
> >  #define SLP_TREE_LANE_PERMUTATION(S)             (S)->lane_permutation
> >  #define SLP_TREE_SIMD_CLONE_INFO(S)              (S)->simd_clone_info
> > @@ -2080,6 +2090,41 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
> >    return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant ();
> >  }
> >
> > +/* Return the number of vectors in the context of vectorization region VINFO,
> > +   needed for a group of total SIZE statements that are supposed to be
> > +   interleaved together with no gap, and all operate on vectors of type
> > +   VECTYPE.  */
> > +
> > +inline unsigned int
> > +vect_get_num_vectors (vec_info *vinfo, tree vectype, unsigned int size)
> > +{
> > +  poly_uint64 vf;
> > +
> > +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > +    vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > +  else
> > +    vf = 1;
> > +
> > +  return vect_get_num_vectors (vf * size, vectype);
> > +}
> > +
> > +/* Return the number of vectors needed for SLP_NODE in the context of
> > +   VINFO.  If not NULL, VECTYPE gives a new vectype to operate on, rather
> > +   than SLP_TREE_VECTYPE of the node.  */
> > +
> > +inline unsigned int
> > +vect_get_slp_num_vectors (vec_info *vinfo, slp_tree slp_node,
> > +                         tree vectype = NULL_TREE)
> > +{
> > +  if (!vectype)
> > +    {
> > +      vectype = SLP_TREE_VECTYPE (slp_node);
> > +      gcc_checking_assert (vectype);
> > +    }
> > +
> > +  return vect_get_num_vectors (vinfo, vectype, SLP_TREE_LANES (slp_node));
> > +}
> > +
> >  /* Return the number of copies needed for loop vectorization when
> >     a statement operates on vectors of type VECTYPE.  This is the
> >     vectorization factor divided by the number of elements in
> > --
> > 2.17.1
