Hi Richard,

Thanks a lot for your great comments!
on 2020/6/2 7:50 PM, Richard Sandiford wrote:
> "Kewen.Lin" <li...@linux.ibm.com> writes:
>> Hi Richard,
>>
>> on 2020/5/29 4:32 PM, Richard Sandiford wrote:
>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>> on 2020/5/27 6:02 PM, Richard Sandiford wrote:
>>>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>>>> Hi Richard,
>>>>>>
snip ...
>
> It would be easier to review, and perhaps easier to bisect,
> if some of the mechanical changes were split out.  E.g.:
>
> - Rename can_fully_mask_p to can_use_partial_vectors_p.
>
> - Rename fully_masked_p to using_partial_vectors_p.
>
> - Rename things related to rgroup_masks.  I think “rgroup_controls”
>   or “rgroup_guards” might be more descriptive than “rgroup_objs”.
>
> These should be fairly mechanical changes and can happen ahead of
> the main series.  It'll then be easier to see what's different
> for masks and lengths, separately from the more mechanical stuff.
>

Good suggestion.  My fault, I should have done this earlier.  I will
split these out into some NFC patches.

> As far as:
>
> +  union
> +    {
> +      /* The type of mask to use, based on the highest nS recorded above.  */
> +      tree mask_type;
> +      /* Any vector type to use these lengths.  */
> +      tree vec_type;
> +    };
>
> goes, some parts of the code seem to use mask_type for lengths too,
> which I'm a bit nervous about.  I think we should either be consistent
> about which union field we use (always mask_type for masks, always
> vec_type for lengths) or we should just rename mask_type to something
> more generic.  Just "type" might be good enough with a suitable comment.

Will fix it.
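Concretely, together with the rgroup_controls renaming you suggested,
I would drop the union in favour of a single generic field with one
comment, something like the rough sketch below (names and comment
wording are tentative, not final):

/* Sketch only: rgroup_objs renamed to rgroup_controls, with the
   union replaced by one generic field.  */
struct rgroup_controls {
  /* The largest nS for all the statements in the group.  */
  unsigned int max_nscalars_per_iter;

  /* The type of control to use: a mask type when masking, or the
     vector type that the lengths are measured against when using
     lengths.  */
  tree type;

  /* The controls themselves (masks or lengths).  */
  vec<tree> controls;
};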
>
>> {
>>   tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
>>   tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
>> -  tree mask_type = rgm->mask_type;
>> -  unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
>> -  poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
>> +
>> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>> +  if (!vect_for_masking)
>> +    {
>> +      /* Obtain target supported length type.  */
>> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;
>> +      unsigned int len_prec = GET_MODE_PRECISION (len_mode);
>> +      compare_type = build_nonstandard_integer_type (len_prec, true);
>> +      /* Simply set iv_type as same as compare_type.  */
>> +      iv_type = compare_type;
>
> This might not be the best time to bring this up :-) but it seems
> odd to be asking the target for the induction variable type here.
> I got the impression that the hook was returning DImode, whereas
> the PowerPC instructions only looked at the low 8 bits of the length.
> If so, forcing a naturally 32-bit IV to DImode would insert extra
> sign/zero extensions, even though the input to the length intrinsics
> would have been happy with the 32-bit IV.
>

Good point, I'll check it with some cases.  As Segher pointed out,
the length is taken from bits 0-7 (the top 8 bits, abnormal I admit),
and these vector-with-length instructions are guarded as 64-bit only.
IIUC, with the current setting the extra sign/zero extensions would
only exist in the preheader?  At that time I thought an IV with less
precision than the length mode would have to be converted later for
the length's need anyway, so it seemed fine to simply use the length
mode.

> I think it would make sense to ask the target for its minimum
> precision P (which would be 8 bits if the above is correct).
> The starting point would then be the maximum of:
>
> - this P
> - the IV's natural precision
> - the precision needed to hold:
>     the maximum number of scalar iterations multiplied by the scale factor
>     (to convert scalar counts to bytes)
>
> If the IV might wrap at that precision without producing all-zero lengths,
> it would be worth doubling the precision to avoid the wrapping issue,
> provided that we don't go beyond BITS_PER_WORD.
>

Thanks!  Will think about and test this part more (a rough sketch of
my current understanding follows further below).

>> +      tree obj_type = rgo->mask_type;
>> +      /* Here, take nscalars_per_iter as nbytes_per_iter for length.  */
>> +      unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter;
>
> I think whether we count scalars or count bytes is really a separate
> decision that shouldn't be tied directly to using lengths.  Length-based
> loads and stores on other arches might want to count scalars too.
> I'm not saying you should add support for that (it wouldn't be tested),
> but I think we should avoid structuring the code to make it harder to
> add in future.
>

It makes sense, will update it.

> So I think nscalars_per_iter should always count scalars and anything
> length-based should be separate.  Would it make sense to store the
> length scale factor as a separate field?  I.e. using the terms
> above the rgroup_masks comment, the length IV step is:
>
>   factor * nS * VF == factor * nV * nL
>

Yeah, factor * nS becomes what we want for the length-based approach
(a count in bytes), and factor * nL would be the vector size in bytes.

> That way, applying the factor becomes separate from lengths vs. masks.
> The factor would also be useful in calculating the IV precision above.
>

Yeah, nice!

>> [...]
>> -/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls.
>> -   LOOP_VINFO describes the vectorization of LOOP.  NITERS is the
>> -   number of iterations of the original scalar loop that should be
>> -   handled by the vector loop.  NITERS_MAYBE_ZERO and FINAL_IV are
>> -   as for vect_set_loop_condition.
>> +/* Make LOOP iterate NITERS times using objects like masks (and
>> +   WHILE_ULT calls) or lengths.  LOOP_VINFO describes the vectorization
>> +   of LOOP.  NITERS is the number of iterations of the original scalar
>> +   loop that should be handled by the vector loop.  NITERS_MAYBE_ZERO
>> +   and FINAL_IV are as for vect_set_loop_condition.
>>
>>    Insert the branch-back condition before LOOP_COND_GSI and return the
>>    final gcond.  */
>>
>>  static gcond *
>> -vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo,
>> -                                tree niters, tree final_iv,
>> -                                bool niters_maybe_zero,
>> -                                gimple_stmt_iterator loop_cond_gsi)
>> +vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo,
>> +                                 tree niters, tree final_iv,
>> +                                 bool niters_maybe_zero,
>> +                                 gimple_stmt_iterator loop_cond_gsi)
>>  {
>>    gimple_seq preheader_seq = NULL;
>>    gimple_seq header_seq = NULL;
>>
>> +  bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>> +
>>    tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
>> +  if (!vect_for_masking)
>> +    {
>> +      /* Obtain target supported length type as compare_type.  */
>> +      scalar_int_mode len_mode = targetm.vectorize.length_mode;
>> +      unsigned len_prec = GET_MODE_PRECISION (len_mode);
>> +      compare_type = build_nonstandard_integer_type (len_prec, true);
>
> Same comment as above about the choice of IV type.  We shouldn't
> recalculate this multiple times.  It would be better to calculate
> it upfront and store it in the loop_vinfo.

OK.
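To double check my understanding of the precision suggestion, the
upfront calculation I have in mind is roughly the sketch below.  The
helper name vect_min_prec_for_max_niters is made up by me here, and
how the target reports its minimum precision P is still to be settled:

/* Sketch only: the minimum precision needed to hold the maximum
   number of scalar iterations multiplied by the scale FACTOR
   (to convert scalar counts to bytes).  */

unsigned int
vect_min_prec_for_max_niters (loop_vec_info loop_vinfo, unsigned int factor)
{
  /* The maximum iteration count representable in the niters type.  */
  tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo));
  widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1;

  /* Refine it with the estimate from loop analysis if available.  */
  widest_int max_back_edges;
  if (max_loop_iterations (LOOP_VINFO_LOOP (loop_vinfo), &max_back_edges))
    max_ni = wi::smin (max_ni, max_back_edges + 1);

  /* Work out how many bits we need to represent the limit.  */
  return wi::min_precision (max_ni * factor, UNSIGNED);
}

The final precision would then be the maximum of this value, the IV's
natural precision and the target's minimum precision P, doubled if
wrapping is a concern (but capped at BITS_PER_WORD), computed once and
stored in the loop_vinfo.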
>
>> @@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>>    if (vect_epilogues
>>        && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>>        && prolog_peeling >= 0
>> -      && known_eq (vf, lowest_vf))
>> +      && known_eq (vf, lowest_vf)
>> +      && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo))
>
> Why's this check needed?
>

It's mainly for the length-based epilogue handling.

  while (!(constant_multiple_p (GET_MODE_SIZE (loop_vinfo->vector_mode),
                                GET_MODE_SIZE (epilogue_vinfo->vector_mode),
                                &ratio)
           && eiters >= lowest_vf / ratio + epilogue_gaps))

This "if" arm checks whether the remaining eiters are enough for the
epilogue; if eiters is less than the epilogue's lowest_vf, it backs
the epilogue out.  But for partial vectors this should be acceptable,
since they can deal with the partial iteration.  Probably I should use
using_partial_vectors_p here instead of LOOP_VINFO_FULLY_WITH_LENGTH_P;
although masking won't have the possibility to handle the epilogue,
the concept would be the same.

>>      {
>>        unsigned HOST_WIDE_INT eiters
>>          = (LOOP_VINFO_INT_NITERS (loop_vinfo)
>> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
>> index 80e33b61be7..99e6cb904ba 100644
>> --- a/gcc/tree-vect-loop.c
>> +++ b/gcc/tree-vect-loop.c
>> @@ -813,8 +813,10 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>>      vec_outside_cost (0),
>>      vec_inside_cost (0),
>>      vectorizable (false),
>> -    can_fully_mask_p (true),
>> +    can_partial_vect_p (true),
>
> I think “can_use_partial_vectors_p” reads better
>

Will update with it.

>
>>      fully_masked_p (false),
>> +    fully_with_length_p (false),
>
> I think it would be better if these two were a single flag
> (using_partial_vectors_p), with masking vs. lengths being derived
> information.
>

Will update it.

>> +    epil_partial_vect_p (false),
>>      peeling_for_gaps (false),
>>      peeling_for_niter (false),
>>      no_data_dependencies (false),
>> @@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>>  void
>>  release_vec_loop_masks (vec_loop_masks *masks)
>>  {
>> -  rgroup_masks *rgm;
>> +  rgroup_objs *rgm;
>>    unsigned int i;
>>    FOR_EACH_VEC_ELT (*masks, i, rgm)
>> -    rgm->masks.release ();
>> +    rgm->objs.release ();
>>    masks->release ();
>>  }
>>
>> +/* Free all levels of LENS.  */
>> +
>> +void
>> +release_vec_loop_lens (vec_loop_lens *lens)
>> +{
>> +  rgroup_objs *rgl;
>> +  unsigned int i;
>> +  FOR_EACH_VEC_ELT (*lens, i, rgl)
>> +    rgl->objs.release ();
>> +  lens->release ();
>> +}
>> +
>
> There's no need to duplicate this function.
>

Good catch, will rename and merge them (a quick sketch is in the PS
below).

> The overall approach looks good though.  I just think we need to work
> through the details a bit more.
>
> Thanks,
> Richard
>

BR,
Kewen
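PS: for the record, the merged release helper could be as simple as
the sketch below, assuming the rgroup_controls renaming; the function
name is tentative:

/* Free all levels of rgroup CONTROLS (shared by masks and lengths).  */

void
release_vec_loop_controls (vec<rgroup_controls> *controls)
{
  rgroup_controls *rgc;
  unsigned int i;
  FOR_EACH_VEC_ELT (*controls, i, rgc)
    rgc->controls.release ();
  controls->release ();
}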