On Tue, Jun 25, 2024 at 11:32 AM Feng Xue OS <f...@os.amperecomputing.com> wrote:
>
> >>
> >> >> -  if (slp_node)
> >> >> +  if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> >> >
> >> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is
> >> > off instead, which is bad.
> >> >
> >> >>      nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >> >>    else
> >> >>      nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >> >>      }
> >> >>  }
> >> >>
> >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> >> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> >> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> >> >> +   (sum-of-absolute-differences).
> >> >> +
> >> >> +   For a lane-reducing operation, the loop reduction path that it lies in,
> >> >> +   may contain normal operation, or other lane-reducing operation of different
> >> >> +   input type size, an example as:
> >> >> +
> >> >> +     int sum = 0;
> >> >> +     for (i)
> >> >> +       {
> >> >> +         ...
> >> >> +         sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >> >> +         sum += w[i];               // widen-sum <vector(16) char>
> >> >> +         sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> >> +         sum += n[i];               // normal <vector(4) int>
> >> >> +         ...
> >> >> +       }
> >> >> +
> >> >> +   Vectorization factor is essentially determined by operation whose input
> >> >> +   vectype has the most lanes ("vector(16) char" in the example), while we
> >> >> +   need to choose input vectype with the least lanes ("vector(4) int" in the
> >> >> +   example) for the reduction PHI statement.  */
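
To make the comment above concrete: with 128-bit vectors, one DOT_PROD step
folds sixteen char-lane products into a four-lane int accumulator.  A scalar
C sketch of that single step (my illustration, not part of the patch; the
name dot_prod_step is made up, the exact lane grouping is target-defined,
and only the lane total matters once the final reduction sums the
accumulator):

  /* Scalar model of one lane-reducing DOT_PROD step: 16 products are
     folded into a 4-lane accumulator, 4 products per lane.  */
  static void
  dot_prod_step (const signed char d0[16], const signed char d1[16],
                 int sum_v[4])
  {
    for (int lane = 0; lane < 4; lane++)
      for (int j = 0; j < 4; j++)
        sum_v[lane] += d0[lane * 4 + j] * d1[lane * 4 + j];
  }

The widen-sum and sad statements fold lanes the same way; only the per-lane
operation differs, which is why a single reduction PHI in vector(4) int can
serve all of them.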
> >> >> +
> >> >> +bool
> >> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> >> >> +                            slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> >> >> +{
> >> >> +  gimple *stmt = stmt_info->stmt;
> >> >> +
> >> >> +  if (!lane_reducing_stmt_p (stmt))
> >> >> +    return false;
> >> >> +
> >> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> >> >> +
> >> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> >> >> +    return false;
> >> >> +
> >> >> +  /* Do not try to vectorize bit-precision reductions.  */
> >> >> +  if (!type_has_mode_precision_p (type))
> >> >> +    return false;
> >> >> +
> >> >> +  if (!slp_node)
> >> >> +    return false;
> >> >> +
> >> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> >> >> +    {
> >> >> +      stmt_vec_info def_stmt_info;
> >> >> +      slp_tree slp_op;
> >> >> +      tree op;
> >> >> +      tree vectype;
> >> >> +      enum vect_def_type dt;
> >> >> +
> >> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> >> >> +                               &slp_op, &dt, &vectype, &def_stmt_info))
> >> >> +        {
> >> >> +          if (dump_enabled_p ())
> >> >> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                             "use not simple.\n");
> >> >> +          return false;
> >> >> +        }
> >> >> +
> >> >> +      if (!vectype)
> >> >> +        {
> >> >> +          vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> >> >> +                                                 slp_op);
> >> >> +          if (!vectype)
> >> >> +            return false;
> >> >> +        }
> >> >> +
> >> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> >> >> +        {
> >> >> +          if (dump_enabled_p ())
> >> >> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                             "incompatible vector types for invariants\n");
> >> >> +          return false;
> >> >> +        }
> >> >> +
> >> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> >> >> +        continue;
> >> >> +
> >> >> +      /* There should be at most one cycle def in the stmt.  */
> >> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> >> >> +        return false;
> >> >> +    }
> >> >> +
> >> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >> >> +
> >> >> +  /* TODO: Support lane-reducing operation that does not directly participate
> >> >> +     in loop reduction.  */
> >> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> >> >> +    return false;
> >> >> +
> >> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> >> >> +     recoginized.  */
> >> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> >> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> >> >> +
> >> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +  int ncopies_for_cost;
> >> >> +
> >> >> +  if (SLP_TREE_LANES (slp_node) > 1)
> >> >> +    {
> >> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> >> >> +         come from the same loop reduction path.  */
> >> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> >> >> +      ncopies_for_cost = 1;
> >> >> +    }
> >> >> +  else
> >> >> +    {
> >> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >
> >> > OK, so the fact that the ops are lane-reducing means they effectively
> >> > change the VF for the result.  That's only possible as we tightly control
> >> > code generation and "adjust" to the expected VF (by inserting the copies
> >> > you mentioned above), but only up to the highest number of outputs
> >> > created in the reduction chain.  In that sense instead of talking and
> >> > recording "input vector types" wouldn't it make more sense to record the
> >> > effective vectorization factor for the reduction instance?  That VF would
> >> > be at most the loops VF but could be as low as 1.  Once we have a
> >> > non-lane-reducing operation in the reduction chain it would be always
> >> > equal to the loops VF.
> >> >
> >> > ncopies would then be always determined by that reduction instance VF
> >> > and the accumulator vector type (STMT_VINFO_VECTYPE).
> >> > This reduction instance VF would also trivially indicate the
> >> > force-single-def-use-cycle case, possibly simplifying code?
> >>
> >> I tried to add such an effective VF, while the vectype_in is still needed
> >> in some scenarios, such as when checking whether a dot-prod stmt is
> >> emulated or not.  The former could be deduced from the latter, so
> >> recording both things seems to be redundant.  Another consideration is
> >> that for a normal op, ncopies is determined from the type
> >> (STMT_VINFO_VECTYPE), but for a lane-reducing op, it is from the VF.  So,
> >> is there a better means to unify them?
> >
> > AFAICS reductions are special in that they, for the accumulation SSA cycle,
> > do not adhere to the loops VF but as an optimization can choose a smaller
> > one.  OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> > operations, which even lane-reducing ops adhere to - those just may use a
> > smaller VF, that of the reduction SSA cycle.
> >
> > So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather it's not
> > fully redundant but needlessly replicated over all stmts participating in
> > the reduction, instead of recording the reduction VF in the reduc_info and
> > using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies for
> > stmts in the reduction cycle.
> >
> > At least that was my idea ...
>
> For lane-reducing ops and single-defuse-cycle optimization, we could assume
> no lane would be reduced, and always generate vectorization statements
> according to the normal VF; if a placeholder is needed, just insert some
> trivial statement like a zero-initialization or a pass-through copy.  And
> define an "effective VF or ncopies" to control the lane-reducing related
> aspects in analysis and codegen (such as the below vect_get_loop_mask).
> Since all things will become SLP-based finally, I think a suitable place to
> add such a field might be in slp_node, as a supplement to "vect_stmts_size",
> and it is expected to be adjusted in vectorizable_reduction.  So could we do
> the refinement as separate patches when the non-slp code path is to be
> removed?

I suppose so.
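
For concreteness, the ncopies arithmetic being discussed, under the
assumptions of the running example (128-bit vectors, loop VF 16; num_copies
is a hypothetical stand-in for what vect_get_num_copies computes):

  /* Vector statements needed to cover the loop VF with a given vector
     type; assumes VF is a multiple of the lane count.  */
  static unsigned
  num_copies (unsigned vf, unsigned lanes)
  {
    return vf / lanes;
  }

  /* Accumulator vector(4) int:    num_copies (16, 4)  == 4 def-use cycles.
     Dot-prod on vector(16) char:  num_copies (16, 16) == 1 real statement.
     Sad on vector(8) short:       num_copies (16, 8)  == 2 real statements.
     The shortfall (4 - 1, or 4 - 2) is what the trivial pass-through
     copies would fill.  */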

Thanks,
Richard.

> >> >> +      gcc_assert (ncopies_for_cost >= 1);
> >> >> +    }
> >> >> +
> >> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> +    {
> >> >> +      /* We need extra two invariants: one that contains the minimum signed
> >> >> +         value and one that contains half of its negative.  */
> >> >> +      int prologue_stmts = 2;
> >> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> >> >> +                                        scalar_to_vec, stmt_info, 0,
> >> >> +                                        vect_prologue);
> >> >> +      if (dump_enabled_p ())
> >> >> +        dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> >> >> +                     "extra prologue_cost = %d .\n", cost);
> >> >> +
> >> >> +      /* Three dot-products and a subtraction.  */
> >> >> +      ncopies_for_cost *= 4;
> >> >> +    }
> >> >> +
> >> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> >> >> +                    vect_body);
> >> >> +
> >> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> >> >> +    {
> >> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> >> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> >> >> +                                                  slp_node, code, type,
> >> >> +                                                  vectype_in);
> >> >> +    }
> >> >> +
> >> >
> >> > Add a comment:
> >> >
> >> >   /* Transform via vect_transform_reduction.  */
> >> >
> >> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >> >> +  return true;
> >> >> +}
> >> >> +
> >> >>  /* Function vectorizable_reduction.
> >> >>
> >> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> >> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    if (!type_has_mode_precision_p (op.type))
> >> >>      return false;
> >> >>
> >> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> >> >> -     which means the only use of that may be in the lane-reducing operation.  */
> >> >> -  if (lane_reducing
> >> >> -      && reduc_chain_length != 1
> >> >> -      && !only_slp_reduc_chain)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                         "lane-reducing reduction with extra stmts.\n");
> >> >> -      return false;
> >> >> -    }
> >> >> -
> >> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
> >> >>       since we'll mix lanes belonging to different reductions.  But it's
> >> >>       OK to use them in a reduction chain or when the reduction group
> >> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>        && loop_vinfo->suggested_unroll_factor == 1)
> >> >>      single_defuse_cycle = true;
> >> >>
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> +  if (single_defuse_cycle && !lane_reducing)
> >> >
> >> > If there's also a non-lane-reducing plus in the chain don't we have to
> >> > check for that reduction op?  So shouldn't it be
> >> > single_defuse_cycle && ... fact that we don't record
> >> > (non-lane-reducing op there) ...
> >>
> >> I don't quite understand this point.  For a non-lane-reducing op in the
> >> chain, it should be handled in its own vectorizable_xxx function?  The
> >> below check is only for the first statement (vect_reduction_def) in the
> >> reduction.
> >
> > Hmm.  So we have vectorizable_lane_reducing_* for the check on the
> > lane-reducing stmts, vectorizable_* for !single-def-use stmts.  And the
> > following is then just for the case there's a single def that's not
> > lane-reducing and we're forcing a single-def-use and thus go via
> > vect_transform_reduction?
>
> Yes.
> Non-lane-reducing with single-defuse-cycle is handled in the function.
> This logic is the same as the original.
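
As background for the single-defuse-cycle distinction here: by default the
vectorizer keeps ncopies independent accumulators and combines them after
the loop, while the single def-use cycle threads one accumulator through all
copies, so the reduction PHI has exactly one vector def.  Illustrative
pseudo-code in the notation of the patch comments, assuming ncopies = 2 and
a plain add (x_v0/x_v1 are the two vectorized input copies):

  /* Multi def-use (default): independent accumulators per copy.  */
  sum_v0 = sum_v0 + x_v0[i];
  sum_v1 = sum_v1 + x_v1[i];
  ...
  sum = REDUCE (sum_v0 + sum_v1);   // after the loop

  /* Single def-use cycle: each copy feeds the next.  */
  sum_v = sum_v + x_v0[i];
  sum_v = sum_v + x_v1[i];
  ...
  sum = REDUCE (sum_v);             // after the loop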
>
> >> >
> >> >>      {
> >> >>        gcc_assert (op.code != COND_EXPR);
> >> >>
> >> >> -      /* 4. Supportable by target?  */
> >> >> -      bool ok = true;
> >> >> -
> >> >> -      /* 4.1. check support for the operation in the loop
> >> >> +      /* 4. check support for the operation in the loop
> >> >>
> >> >>          This isn't necessary for the lane reduction codes, since they
> >> >>          can only be produced by pattern matching, and it's up to the
> >> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>          mixed-sign dot-products can be implemented using signed
> >> >>          dot-products.  */
> >> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> >> >> -      if (!lane_reducing
> >> >> -          && !directly_supported_p (op.code, vectype_in, optab_vector))
> >> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >> >>         {
> >> >>           if (dump_enabled_p ())
> >> >>             dump_printf (MSG_NOTE, "op not supported by target.\n");
> >> >>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >> >>               || !vect_can_vectorize_without_simd_p (op.code))
> >> >> -           ok = false;
> >> >> +           single_defuse_cycle = false;
> >> >>           else
> >> >>             if (dump_enabled_p ())
> >> >>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> >> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>           dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >> >>           return false;
> >> >>         }
> >> >> -
> >> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
> >> >> -         For the other cases try without the single cycle optimization.  */
> >> >> -      if (!ok)
> >> >> -       {
> >> >> -         if (lane_reducing)
> >> >> -           return false;
> >> >> -         else
> >> >> -           single_defuse_cycle = false;
> >> >> -       }
> >> >>      }
> >> >>    if (dump_enabled_p () && single_defuse_cycle)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>                      "multiple vectors to one in the loop body\n");
> >> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >> >>
> >> >> -  /* If the reduction stmt is one of the patterns that have lane
> >> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> >> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
> >> >> -      && lane_reducing)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                        "multi def-use cycle not possible for lane-reducing "
> >> >> -                        "reduction operation\n");
> >> >> -      return false;
> >> >> -    }
> >> >> +  /* For lane-reducing operation, the below processing related to single
> >> >> +     defuse-cycle will be done in its own vectorizable function.  One more
> >> >> +     thing to note is that the operation must not be involved in fold-left
> >> >> +     reduction.  */
> >> >> +  single_defuse_cycle &= !lane_reducing;
> >> >>
> >> >>    if (slp_node
> >> >> -      && !(!single_defuse_cycle
> >> >> -          && !lane_reducing
> >> >> -          && reduction_type != FOLD_LEFT_REDUCTION))
> >> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >> >>      for (i = 0; i < (int) op.num_ops; i++)
> >> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >> >>         {
> >> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >> >>                              reduction_type, ncopies, cost_vec);
> >> >>    /* Cost the reduction op inside the loop if transformed via
> >> >> -     vect_transform_reduction.  Otherwise this is costed by the
> >> >> -     separate vectorizable_* routines.  */
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> -    {
> >> >> -      int factor = 1;
> >> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> -       /* Three dot-products and a subtraction.  */
> >> >> -       factor = 4;
> >> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> >> >> -                       stmt_info, 0, vect_body);
> >> >> -    }
> >> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> >> >> +     this is costed by the separate vectorizable_* routines.  */
> >> >> +  if (single_defuse_cycle)
> >> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >> >>
> >> >>    if (dump_enabled_p ()
> >> >>        && reduction_type == FOLD_LEFT_REDUCTION)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >>                      "using an in-order (fold-left) reduction.\n");
> >> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> >> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> >> >> -     reductions go through their own vectorizable_* routines.  */
> >> >> -  if (!single_defuse_cycle
> >> >> -      && !lane_reducing
> >> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
> >> >> +
> >> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
> >> >> +     through their own vectorizable_* routines.  */
> >> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >> >>      {
> >> >>        stmt_vec_info tem
> >> >>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> >> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>    bool lane_reducing = lane_reducing_op_p (code);
> >> >>    gcc_assert (single_defuse_cycle || lane_reducing);
> >> >>
> >> >> +  if (lane_reducing)
> >> >> +    {
> >> >> +      /* The last operand of lane-reducing op is for reduction.  */
> >> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> >> >> +
> >> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
> >> >> +      gcc_assert (slp_node);
> >> >> +    }
> >> >> +
> >> >>    /* Create the destination vector  */
> >> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >> >>                          &vec_oprnds[2]);
> >> >>      }
> >> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> >> >> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> >> >> +    {
> >> >> +      /* For lane-reducing op covered by single-lane slp node, the input
> >> >> +        vectype of the reduction PHI determines copies of vectorized def-use
> >> >> +        cycles, which might be more than effective copies of vectorized lane-
> >> >> +        reducing reduction statements.  This could be complemented by
> >> >> +        generating extra trivial pass-through copies.  For example:
> >> >> +
> >> >> +          int sum = 0;
> >> >> +          for (i)
> >> >> +            {
> >> >> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >> >> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> >> +              sum += n[i];               // normal <vector(4) int>
> >> >> +            }
> >> >> +
> >> >> +        The vector size is 128-bit,vectorization factor is 16.  Reduction
> >> >> +        statements would be transformed as:
> >> >> +
> >> >> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >> >> +
> >> >> +          for (i / 16)
> >> >> +            {
> >> >> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >> >> +              sum_v1 = sum_v1;  // copy
> >> >> +              sum_v2 = sum_v2;  // copy
> >> >> +              sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >> >> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >> >> +              sum_v2 = sum_v2;  // copy
> >> >> +              sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> >> >> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> >> >> +              sum_v2 += n_v2[i: 8  ~ 11];
> >> >> +              sum_v3 += n_v3[i: 12 ~ 15];
> >> >> +            }
> >> >> +       */
> >> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
> >> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> >> >> +
> >> >
> >> > assert reduc_ncopies >= using_ncopies?  Maybe assert
> >> > reduc_index == op.num_ops - 1 given you use one above
> >> > and the other below?  Or simply iterate till op.num_ops
> >> > and skip i == reduc_index.
> >> >
> >> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> >> >> +       {
> >> >> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> >> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> >> +       }
> >> >> +    }
> >> >>
> >> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> >> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>      {
> >> >>        gimple *new_stmt;
> >> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> >> >> -      if (masked_loop_p && !mask_by_cond_expr)
> >> >> +
> >> >> +      if (!vop[0] || !vop[1])
> >> >> +       {
> >> >> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> >> >> +
> >> >> +         /* Insert trivial copy if no need to generate vectorized
> >> >> +            statement.  */
> >> >> +         gcc_assert (reduc_vop);
> >> >> +
> >> >> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> >> >> +         new_temp = make_ssa_name (vec_dest, new_stmt);
> >> >> +         gimple_set_lhs (new_stmt, new_temp);
> >> >> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> >> >
> >> > I think you could simply do
> >> >
> >> >   slp_node->push_vec_def (reduc_vop);
> >> >   continue;
> >> >
> >> > without any code generation.
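
That is, the pass-through case would not materialize any statement: the SLP
node just records the incoming reduction vector as this copy's def.  A rough
sketch of how the loop body would then read, as one reading of the
suggestion (not the committed code):

      if (!vop[0] || !vop[1])
        {
          /* Nothing to compute for this copy: record the incoming
             reduction operand as its result, emitting no statement.  */
          slp_node->push_vec_def (vec_oprnds[reduc_index][i]);
          continue;
        }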
> >> >
> >>
> >> OK, that would be easy.  Here comes another question: this patch assumes
> >> a lane-reducing op would always be contained in a slp node, since the
> >> single-lane slp node feature has been enabled.  But I got some regressions
> >> when I enforced such a constraint on the lane-reducing op check.  Those
> >> cases are found to be unvectorizable with single-lane slp, so this should
> >> not be what we want, and needs to be fixed?
> >
> > Yes, in the end we need to chase down all unsupported cases and fix them
> > (there's known issues with load permutes, I'm working on that - hopefully
> > when finding a continuous stretch of time...).
> >
> >>
> >> >> +       }
> >> >> +      else if (masked_loop_p && !mask_by_cond_expr)
> >> >>         {
> >> >>           /* No conditional ifns have been defined for lane-reducing op
> >> >>              yet.  */
> >> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>
> >> >>        if (masked_loop_p && mask_by_cond_expr)
> >> >>         {
> >> >> +         tree stmt_vectype_in = vectype_in;
> >> >> +         unsigned nvectors = vec_num * ncopies;
> >> >> +
> >> >> +         if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> >> >> +           {
> >> >> +             /* Input vectype of the reduction PHI may be defferent from
> >> >
> >> > different
> >> >
> >> >> +                that of lane-reducing operation.  */
> >> >> +             stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +             nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> >> >
> >> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
> >>
> >> To partially vectorize a dot_prod<16 * char> with 128-bit vector width,
> >> we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1,
> >> vectype=<16 *char>) to vect_get_loop_mask?
> >
> > Probably - it depends on the vectorization factor.  What I wanted to point
> > out is that vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong.
> > The place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted, or we
> > should forgo with it (but that's possibly a post-only-SLP cleanup to be
> > done).
> >
> > See vect_slp_analyze_node_operations_1 where that's computed.  For
> > reductions it's probably not quite right (and we might have latent issues
> > like those you are "fixing" with code like above).  The order we analyze
> > stmts might also be not optimal for reductions with SLP - in fact given
> > that stmt analysis relies on a fixed VF it would probably make sense to
> > determine the reduction VF in advance as well.  But again this sounds like
> > post-only-SLP cleanup opportunities.
> >
> > In the end I might suggest to always use the reduction VF and vectype to
> > determine the number of vector stmts rather than computing ncopies/vec_num
> > separately.
>
> Thanks,
> Feng