On Tue, Jun 25, 2024 at 11:32 AM Feng Xue OS <f...@os.amperecomputing.com> wrote:
>
> >>
> >> >> -  if (slp_node)
> >> >> +  if (slp_node && SLP_TREE_LANES (slp_node) > 1)
> >> >
> >> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is
> >> > off instead, which is bad.
> >> >
> >> >>      nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> >> >>    else
> >> >>      nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >> >>      }
> >> >>  }
> >> >>
> >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
> >> >> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
> >> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
> >> >> +   (sum-of-absolute-differences).
> >> >> +
> >> >> +   For a lane-reducing operation, the loop reduction path that it lies in,
> >> >> +   may contain normal operation, or other lane-reducing operation of different
> >> >> +   input type size, an example as:
> >> >> +
> >> >> +     int sum = 0;
> >> >> +     for (i)
> >> >> +       {
> >> >> +         ...
> >> >> +         sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >> >> +         sum += w[i];               // widen-sum <vector(16) char>
> >> >> +         sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> >> +         sum += n[i];               // normal <vector(4) int>
> >> >> +         ...
> >> >> +       }
> >> >> +
> >> >> +   Vectorization factor is essentially determined by operation whose input
> >> >> +   vectype has the most lanes ("vector(16) char" in the example), while we
> >> >> +   need to choose input vectype with the least lanes ("vector(4) int" in the
> >> >> +   example) for the reduction PHI statement.  */
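
To make the comment above concrete: with 128-bit vectors, one DOT_PROD step
folds sixteen char-lane products into a four-lane int accumulator.  A scalar
C sketch of that single step (my illustration, not part of the patch; the
name dot_prod_step is made up, the exact lane grouping is target-defined,
and only the lane total matters once the final reduction sums the
accumulator):

  /* Scalar model of one lane-reducing DOT_PROD step: 16 products are
     folded into a 4-lane accumulator, 4 products per lane.  */
  static void
  dot_prod_step (const signed char d0[16], const signed char d1[16],
                 int sum_v[4])
  {
    for (int lane = 0; lane < 4; lane++)
      for (int j = 0; j < 4; j++)
        sum_v[lane] += d0[lane * 4 + j] * d1[lane * 4 + j];
  }

The widen-sum and sad statements fold lanes the same way; only the per-lane
operation differs, which is why a single reduction PHI in vector(4) int can
serve all of them.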
> >> >> +
> >> >> +bool
> >> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> >> >> +                            slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> >> >> +{
> >> >> +  gimple *stmt = stmt_info->stmt;
> >> >> +
> >> >> +  if (!lane_reducing_stmt_p (stmt))
> >> >> +    return false;
> >> >> +
> >> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> >> >> +
> >> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
> >> >> +    return false;
> >> >> +
> >> >> +  /* Do not try to vectorize bit-precision reductions.  */
> >> >> +  if (!type_has_mode_precision_p (type))
> >> >> +    return false;
> >> >> +
> >> >> +  if (!slp_node)
> >> >> +    return false;
> >> >> +
> >> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
> >> >> +    {
> >> >> +      stmt_vec_info def_stmt_info;
> >> >> +      slp_tree slp_op;
> >> >> +      tree op;
> >> >> +      tree vectype;
> >> >> +      enum vect_def_type dt;
> >> >> +
> >> >> +      if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
> >> >> +                               &slp_op, &dt, &vectype, &def_stmt_info))
> >> >> +        {
> >> >> +          if (dump_enabled_p ())
> >> >> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                             "use not simple.\n");
> >> >> +          return false;
> >> >> +        }
> >> >> +
> >> >> +      if (!vectype)
> >> >> +        {
> >> >> +          vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op),
> >> >> +                                                 slp_op);
> >> >> +          if (!vectype)
> >> >> +            return false;
> >> >> +        }
> >> >> +
> >> >> +      if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
> >> >> +        {
> >> >> +          if (dump_enabled_p ())
> >> >> +            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> +                             "incompatible vector types for invariants\n");
> >> >> +          return false;
> >> >> +        }
> >> >> +
> >> >> +      if (i == STMT_VINFO_REDUC_IDX (stmt_info))
> >> >> +        continue;
> >> >> +
> >> >> +      /* There should be at most one cycle def in the stmt.  */
> >> >> +      if (VECTORIZABLE_CYCLE_DEF (dt))
> >> >> +        return false;
> >> >> +    }
> >> >> +
> >> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >> >> +
> >> >> +  /* TODO: Support lane-reducing operation that does not directly participate
> >> >> +     in loop reduction.  */
> >> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
> >> >> +    return false;
> >> >> +
> >> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
> >> >> +     recoginized.  */
> >> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
> >> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
> >> >> +
> >> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +  int ncopies_for_cost;
> >> >> +
> >> >> +  if (SLP_TREE_LANES (slp_node) > 1)
> >> >> +    {
> >> >> +      /* Now lane-reducing operations in a non-single-lane slp node should only
> >> >> +         come from the same loop reduction path.  */
> >> >> +      gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
> >> >> +      ncopies_for_cost = 1;
> >> >> +    }
> >> >> +  else
> >> >> +    {
> >> >> +      ncopies_for_cost = vect_get_num_copies (loop_vinfo, vectype_in);
> >> >
> >> > OK, so the fact that the ops are lane-reducing means they effectively
> >> > change the VF for the result.  That's only possible as we tightly control
> >> > code generation and "adjust" to the expected VF (by inserting the copies
> >> > you mentioned above), but only up to the highest number of outputs
> >> > created in the reduction chain.  In that sense instead of talking and
> >> > recording "input vector types" wouldn't it make more sense to record the
> >> > effective vectorization factor for the reduction instance?  That VF would
> >> > be at most the loops VF but could be as low as 1.  Once we have a
> >> > non-lane-reducing operation in the reduction chain it would be always
> >> > equal to the loops VF.
> >> >
> >> > ncopies would then be always determined by that reduction instance VF
> >> > and the accumulator vector type (STMT_VINFO_VECTYPE).
> >> > This reduction instance VF would also trivially indicate the
> >> > force-single-def-use-cycle case, possibly simplifying code?
> >>
> >> I tried to add such an effective VF, while the vectype_in is still needed
> >> in some scenarios, such as when checking whether a dot-prod stmt is
> >> emulated or not.  The former could be deduced from the latter, so
> >> recording both things seems to be redundant.  Another consideration is
> >> that for a normal op, ncopies is determined from the type
> >> (STMT_VINFO_VECTYPE), but for a lane-reducing op, it is from the VF.  So,
> >> is there a better means to unify them?
> >
> > AFAICS reductions are special in that they, for the accumulation SSA cycle,
> > do not adhere to the loops VF but as an optimization can choose a smaller
> > one.  OTOH STMT_VINFO_VECTYPE is for the vector type used for individual
> > operations, which even lane-reducing ops adhere to - those just may use a
> > smaller VF, that of the reduction SSA cycle.
> >
> > So what's redundant is STMT_VINFO_REDUC_VECTYPE_IN - or rather it's not
> > fully redundant but needlessly replicated over all stmts participating in
> > the reduction, instead of recording the reduction VF in the reduc_info and
> > using that (plus STMT_VINFO_VECTYPE) to compute the effective ncopies for
> > stmts in the reduction cycle.
> >
> > At least that was my idea ...
>
> For lane-reducing ops and single-defuse-cycle optimization, we could assume
> no lane would be reduced, and always generate vectorization statements
> according to the normal VF; if a placeholder is needed, just insert some
> trivial statement like a zero-initialization or a pass-through copy.  And
> define an "effective VF or ncopies" to control the lane-reducing related
> aspects in analysis and codegen (such as the below vect_get_loop_mask).
> Since all things will become SLP-based finally, I think a suitable place to
> add such a field might be in slp_node, as a supplement to "vect_stmts_size",
> and it is expected to be adjusted in vectorizable_reduction.  So could we do
> the refinement as separate patches when the non-slp code path is to be
> removed?

I suppose so.
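
For concreteness, the ncopies arithmetic being discussed, under the
assumptions of the running example (128-bit vectors, loop VF 16; num_copies
is a hypothetical stand-in for what vect_get_num_copies computes):

  /* Vector statements needed to cover the loop VF with a given vector
     type; assumes VF is a multiple of the lane count.  */
  static unsigned
  num_copies (unsigned vf, unsigned lanes)
  {
    return vf / lanes;
  }

  /* Accumulator vector(4) int:    num_copies (16, 4)  == 4 def-use cycles.
     Dot-prod on vector(16) char:  num_copies (16, 16) == 1 real statement.
     Sad on vector(8) short:       num_copies (16, 8)  == 2 real statements.
     The shortfall (4 - 1, or 4 - 2) is what the trivial pass-through
     copies would fill.  */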

Thanks,
Richard.

> >> >> +      gcc_assert (ncopies_for_cost >= 1);
> >> >> +    }
> >> >> +
> >> >> +  if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> +    {
> >> >> +      /* We need extra two invariants: one that contains the minimum signed
> >> >> +         value and one that contains half of its negative.  */
> >> >> +      int prologue_stmts = 2;
> >> >> +      unsigned cost = record_stmt_cost (cost_vec, prologue_stmts,
> >> >> +                                        scalar_to_vec, stmt_info, 0,
> >> >> +                                        vect_prologue);
> >> >> +      if (dump_enabled_p ())
> >> >> +        dump_printf (MSG_NOTE, "vectorizable_lane_reducing: "
> >> >> +                     "extra prologue_cost = %d .\n", cost);
> >> >> +
> >> >> +      /* Three dot-products and a subtraction.  */
> >> >> +      ncopies_for_cost *= 4;
> >> >> +    }
> >> >> +
> >> >> +  record_stmt_cost (cost_vec, ncopies_for_cost, vector_stmt, stmt_info, 0,
> >> >> +                    vect_body);
> >> >> +
> >> >> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> >> >> +    {
> >> >> +      enum tree_code code = gimple_assign_rhs_code (stmt);
> >> >> +      vect_reduction_update_partial_vector_usage (loop_vinfo, reduc_info,
> >> >> +                                                  slp_node, code, type,
> >> >> +                                                  vectype_in);
> >> >> +    }
> >> >> +
> >> >
> >> > Add a comment:
> >> >
> >> >   /* Transform via vect_transform_reduction.  */
> >> >
> >> >> +  STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >> >> +  return true;
> >> >> +}
> >> >> +
> >> >>  /* Function vectorizable_reduction.
> >> >>
> >> >>     Check if STMT_INFO performs a reduction operation that can be vectorized.
> >> >> @@ -7804,18 +7944,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    if (!type_has_mode_precision_p (op.type))
> >> >>      return false;
> >> >>
> >> >> -  /* For lane-reducing ops we're reducing the number of reduction PHIs
> >> >> -     which means the only use of that may be in the lane-reducing operation.  */
> >> >> -  if (lane_reducing
> >> >> -      && reduc_chain_length != 1
> >> >> -      && !only_slp_reduc_chain)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                         "lane-reducing reduction with extra stmts.\n");
> >> >> -      return false;
> >> >> -    }
> >> >> -
> >> >>    /* Lane-reducing ops also never can be used in a SLP reduction group
> >> >>       since we'll mix lanes belonging to different reductions.  But it's
> >> >>       OK to use them in a reduction chain or when the reduction group
> >> >> @@ -8354,14 +8482,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>        && loop_vinfo->suggested_unroll_factor == 1)
> >> >>      single_defuse_cycle = true;
> >> >>
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> +  if (single_defuse_cycle && !lane_reducing)
> >> >
> >> > If there's also a non-lane-reducing plus in the chain don't we have to
> >> > check for that reduction op?  So shouldn't it be
> >> > single_defuse_cycle && ... fact that we don't record
> >> > (non-lane-reducing op there) ...
> >>
> >> I don't quite understand this point.  For a non-lane-reducing op in the
> >> chain, it should be handled in its own vectorizable_xxx function?  The
> >> below check is only for the first statement (vect_reduction_def) in the
> >> reduction.
> >
> > Hmm.  So we have vectorizable_lane_reducing_* for the check on the
> > lane-reducing stmts, vectorizable_* for !single-def-use stmts.  And the
> > following is then just for the case there's a single def that's not
> > lane-reducing and we're forcing a single-def-use and thus go via
> > vect_transform_reduction?
>
> Yes.
> Non-lane-reducing with single-defuse-cycle is handled in the function.
> This logic is the same as the original.
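
As background for the single-defuse-cycle distinction here: by default the
vectorizer keeps ncopies independent accumulators and combines them after
the loop, while the single def-use cycle threads one accumulator through all
copies, so the reduction PHI has exactly one vector def.  Illustrative
pseudo-code in the notation of the patch comments, assuming ncopies = 2 and
a plain add (x_v0/x_v1 are the two vectorized input copies):

  /* Multi def-use (default): independent accumulators per copy.  */
  sum_v0 = sum_v0 + x_v0[i];
  sum_v1 = sum_v1 + x_v1[i];
  ...
  sum = REDUCE (sum_v0 + sum_v1);   // after the loop

  /* Single def-use cycle: each copy feeds the next.  */
  sum_v = sum_v + x_v0[i];
  sum_v = sum_v + x_v1[i];
  ...
  sum = REDUCE (sum_v);             // after the loop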
>
> >> >
> >> >>      {
> >> >>        gcc_assert (op.code != COND_EXPR);
> >> >>
> >> >> -      /* 4. Supportable by target?  */
> >> >> -      bool ok = true;
> >> >> -
> >> >> -      /* 4.1. check support for the operation in the loop
> >> >> +      /* 4. check support for the operation in the loop
> >> >>
> >> >>          This isn't necessary for the lane reduction codes, since they
> >> >>          can only be produced by pattern matching, and it's up to the
> >> >> @@ -8370,14 +8495,13 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>          mixed-sign dot-products can be implemented using signed
> >> >>          dot-products.  */
> >> >>        machine_mode vec_mode = TYPE_MODE (vectype_in);
> >> >> -      if (!lane_reducing
> >> >> -          && !directly_supported_p (op.code, vectype_in, optab_vector))
> >> >> +      if (!directly_supported_p (op.code, vectype_in, optab_vector))
> >> >>         {
> >> >>           if (dump_enabled_p ())
> >> >>             dump_printf (MSG_NOTE, "op not supported by target.\n");
> >> >>           if (maybe_ne (GET_MODE_SIZE (vec_mode), UNITS_PER_WORD)
> >> >>               || !vect_can_vectorize_without_simd_p (op.code))
> >> >> -           ok = false;
> >> >> +           single_defuse_cycle = false;
> >> >>           else
> >> >>             if (dump_enabled_p ())
> >> >>               dump_printf (MSG_NOTE, "proceeding using word mode.\n");
> >> >> @@ -8390,16 +8514,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>           dump_printf (MSG_NOTE, "using word mode not possible.\n");
> >> >>           return false;
> >> >>         }
> >> >> -
> >> >> -      /* lane-reducing operations have to go through vect_transform_reduction.
> >> >> -         For the other cases try without the single cycle optimization.  */
> >> >> -      if (!ok)
> >> >> -       {
> >> >> -         if (lane_reducing)
> >> >> -           return false;
> >> >> -         else
> >> >> -           single_defuse_cycle = false;
> >> >> -       }
> >> >>      }
> >> >>    if (dump_enabled_p () && single_defuse_cycle)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >> @@ -8407,22 +8521,14 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>                      "multiple vectors to one in the loop body\n");
> >> >>    STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info) = single_defuse_cycle;
> >> >>
> >> >> -  /* If the reduction stmt is one of the patterns that have lane
> >> >> -     reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
> >> >> -  if ((ncopies > 1 && ! single_defuse_cycle)
> >> >> -      && lane_reducing)
> >> >> -    {
> >> >> -      if (dump_enabled_p ())
> >> >> -       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> >> -                        "multi def-use cycle not possible for lane-reducing "
> >> >> -                        "reduction operation\n");
> >> >> -      return false;
> >> >> -    }
> >> >> +  /* For lane-reducing operation, the below processing related to single
> >> >> +     defuse-cycle will be done in its own vectorizable function.  One more
> >> >> +     thing to note is that the operation must not be involved in fold-left
> >> >> +     reduction.  */
> >> >> +  single_defuse_cycle &= !lane_reducing;
> >> >>
> >> >>    if (slp_node
> >> >> -      && !(!single_defuse_cycle
> >> >> -          && !lane_reducing
> >> >> -          && reduction_type != FOLD_LEFT_REDUCTION))
> >> >> +      && (single_defuse_cycle || reduction_type == FOLD_LEFT_REDUCTION))
> >> >>      for (i = 0; i < (int) op.num_ops; i++)
> >> >>        if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
> >> >>         {
> >> >> @@ -8435,28 +8541,20 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >> >>    vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
> >> >>                              reduction_type, ncopies, cost_vec);
> >> >>    /* Cost the reduction op inside the loop if transformed via
> >> >> -     vect_transform_reduction.  Otherwise this is costed by the
> >> >> -     separate vectorizable_* routines.  */
> >> >> -  if (single_defuse_cycle || lane_reducing)
> >> >> -    {
> >> >> -      int factor = 1;
> >> >> -      if (vect_is_emulated_mixed_dot_prod (stmt_info))
> >> >> -       /* Three dot-products and a subtraction.  */
> >> >> -       factor = 4;
> >> >> -      record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
> >> >> -                       stmt_info, 0, vect_body);
> >> >> -    }
> >> >> +     vect_transform_reduction for non-lane-reducing operation.  Otherwise
> >> >> +     this is costed by the separate vectorizable_* routines.  */
> >> >> +  if (single_defuse_cycle)
> >> >> +    record_stmt_cost (cost_vec, ncopies, vector_stmt, stmt_info, 0, vect_body);
> >> >>
> >> >>    if (dump_enabled_p ()
> >> >>        && reduction_type == FOLD_LEFT_REDUCTION)
> >> >>      dump_printf_loc (MSG_NOTE, vect_location,
> >> >>                      "using an in-order (fold-left) reduction.\n");
> >> >>    STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type;
> >> >> -  /* All but single defuse-cycle optimized, lane-reducing and fold-left
> >> >> -     reductions go through their own vectorizable_* routines.  */
> >> >> -  if (!single_defuse_cycle
> >> >> -      && !lane_reducing
> >> >> -      && reduction_type != FOLD_LEFT_REDUCTION)
> >> >> +
> >> >> +  /* All but single defuse-cycle optimized and fold-left reductions go
> >> >> +     through their own vectorizable_* routines.  */
> >> >> +  if (!single_defuse_cycle && reduction_type != FOLD_LEFT_REDUCTION)
> >> >>      {
> >> >>        stmt_vec_info tem
> >> >>         = vect_stmt_to_vectorize (STMT_VINFO_REDUC_DEF (phi_info));
> >> >> @@ -8646,6 +8744,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>    bool lane_reducing = lane_reducing_op_p (code);
> >> >>    gcc_assert (single_defuse_cycle || lane_reducing);
> >> >>
> >> >> +  if (lane_reducing)
> >> >> +    {
> >> >> +      /* The last operand of lane-reducing op is for reduction.  */
> >> >> +      gcc_assert (reduc_index == (int) op.num_ops - 1);
> >> >> +
> >> >> +      /* Now all lane-reducing ops are covered by some slp node.  */
> >> >> +      gcc_assert (slp_node);
> >> >> +    }
> >> >> +
> >> >>    /* Create the destination vector  */
> >> >>    tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
> >> >>    tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
> >> >> @@ -8689,6 +8796,58 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >> >>                          &vec_oprnds[2]);
> >> >>      }
> >> >> +  else if (lane_reducing && SLP_TREE_LANES (slp_node) == 1
> >> >> +          && vec_oprnds[0].length () < vec_oprnds[reduc_index].length ())
> >> >> +    {
> >> >> +      /* For lane-reducing op covered by single-lane slp node, the input
> >> >> +        vectype of the reduction PHI determines copies of vectorized def-use
> >> >> +        cycles, which might be more than effective copies of vectorized lane-
> >> >> +        reducing reduction statements.  This could be complemented by
> >> >> +        generating extra trivial pass-through copies.  For example:
> >> >> +
> >> >> +          int sum = 0;
> >> >> +          for (i)
> >> >> +            {
> >> >> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >> >> +              sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
> >> >> +              sum += n[i];               // normal <vector(4) int>
> >> >> +            }
> >> >> +
> >> >> +        The vector size is 128-bit,vectorization factor is 16.  Reduction
> >> >> +        statements would be transformed as:
> >> >> +
> >> >> +          vector<4> int sum_v0 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >> >> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >> >> +
> >> >> +          for (i / 16)
> >> >> +            {
> >> >> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >> >> +              sum_v1 = sum_v1;  // copy
> >> >> +              sum_v2 = sum_v2;  // copy
> >> >> +              sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +              sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
> >> >> +              sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
> >> >> +              sum_v2 = sum_v2;  // copy
> >> >> +              sum_v3 = sum_v3;  // copy
> >> >> +
> >> >> +              sum_v0 += n_v0[i: 0  ~ 3 ];
> >> >> +              sum_v1 += n_v1[i: 4  ~ 7 ];
> >> >> +              sum_v2 += n_v2[i: 8  ~ 11];
> >> >> +              sum_v3 += n_v3[i: 12 ~ 15];
> >> >> +            }
> >> >> +       */
> >> >> +      unsigned using_ncopies = vec_oprnds[0].length ();
> >> >> +      unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
> >> >> +
> >> >
> >> > assert reduc_ncopies >= using_ncopies?  Maybe assert
> >> > reduc_index == op.num_ops - 1 given you use one above
> >> > and the other below?  Or simply iterate till op.num_ops
> >> > and skip i == reduc_index.
> >> >
> >> >> +      for (unsigned i = 0; i < op.num_ops - 1; i++)
> >> >> +       {
> >> >> +         gcc_assert (vec_oprnds[i].length () == using_ncopies);
> >> >> +         vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
> >> >> +       }
> >> >> +    }
> >> >>
> >> >>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >> >>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> >> >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>      {
> >> >>        gimple *new_stmt;
> >> >>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> >> >> -      if (masked_loop_p && !mask_by_cond_expr)
> >> >> +
> >> >> +      if (!vop[0] || !vop[1])
> >> >> +       {
> >> >> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> >> >> +
> >> >> +         /* Insert trivial copy if no need to generate vectorized
> >> >> +            statement.  */
> >> >> +         gcc_assert (reduc_vop);
> >> >> +
> >> >> +         new_stmt = gimple_build_assign (vec_dest, reduc_vop);
> >> >> +         new_temp = make_ssa_name (vec_dest, new_stmt);
> >> >> +         gimple_set_lhs (new_stmt, new_temp);
> >> >> +         vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> >> >
> >> > I think you could simply do
> >> >
> >> >   slp_node->push_vec_def (reduc_vop);
> >> >   continue;
> >> >
> >> > without any code generation.
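
That is, the pass-through case would not materialize any statement: the SLP
node just records the incoming reduction vector as this copy's def.  A rough
sketch of how the loop body would then read, as one reading of the
suggestion (not the committed code):

      if (!vop[0] || !vop[1])
        {
          /* Nothing to compute for this copy: record the incoming
             reduction operand as its result, emitting no statement.  */
          slp_node->push_vec_def (vec_oprnds[reduc_index][i]);
          continue;
        }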
> >> >
> >>
> >> OK, that would be easy.  Here comes another question: this patch assumes
> >> a lane-reducing op would always be contained in a slp node, since the
> >> single-lane slp node feature has been enabled.  But I got some regressions
> >> when I enforced such a constraint on the lane-reducing op check.  Those
> >> cases are found to be unvectorizable with single-lane slp, so this should
> >> not be what we want, and needs to be fixed?
> >
> > Yes, in the end we need to chase down all unsupported cases and fix them
> > (there's known issues with load permutes, I'm working on that - hopefully
> > when finding a continuous stretch of time...).
> >
> >>
> >> >> +       }
> >> >> +      else if (masked_loop_p && !mask_by_cond_expr)
> >> >>         {
> >> >>           /* No conditional ifns have been defined for lane-reducing op
> >> >>              yet.  */
> >> >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >> >>
> >> >>        if (masked_loop_p && mask_by_cond_expr)
> >> >>         {
> >> >> +         tree stmt_vectype_in = vectype_in;
> >> >> +         unsigned nvectors = vec_num * ncopies;
> >> >> +
> >> >> +         if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
> >> >> +           {
> >> >> +             /* Input vectype of the reduction PHI may be defferent from
> >> >
> >> > different
> >> >
> >> >> +                that of lane-reducing operation.  */
> >> >> +             stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> >> >> +             nvectors = vect_get_num_copies (loop_vinfo, stmt_vectype_in);
> >> >
> >> > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.
> >>
> >> To partially vectorize a dot_prod<16 * char> with 128-bit vector width,
> >> we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1,
> >> vectype=<16 *char>) to vect_get_loop_mask?
> >
> > Probably - it depends on the vectorization factor.  What I wanted to point
> > out is that vec_num (likely from SLP_TREE_NUMBER_OF_VEC_STMTS) is wrong.
> > The place setting SLP_TREE_NUMBER_OF_VEC_STMTS needs to be adjusted, or we
> > should forgo with it (but that's possibly a post-only-SLP cleanup to be
> > done).
> >
> > See vect_slp_analyze_node_operations_1 where that's computed.  For
> > reductions it's probably not quite right (and we might have latent issues
> > like those you are "fixing" with code like above).  The order we analyze
> > stmts might also be not optimal for reductions with SLP - in fact given
> > that stmt analysis relies on a fixed VF it would probably make sense to
> > determine the reduction VF in advance as well.  But again this sounds like
> > post-only-SLP cleanup opportunities.
> >
> > In the end I might suggest to always use the reduction VF and vectype to
> > determine the number of vector stmts rather than computing ncopies/vec_num
> > separately.
>
> Thanks,
> Feng