On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <li...@linux.ibm.com> wrote: > > Hi Bin, > > Thanks for your time! > > on 2019/8/21 下午8:32, Bin.Cheng wrote: > > On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <li...@linux.ibm.com> wrote: > >> > >> Hi! > >> > >> Comparing to the previous versions of implementation mainly based on the > >> existing IV cands but zeroing the related group/use cost, this new one is > >> based > >> on Richard and Segher's suggestion introducing one doloop dedicated IV > >> cand. > >> > >> Some key points are listed below: > >> 1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV > >> cand. > >> 2) Special name "doloop" assigned. > >> 3) Doloop IV cand with form (niter+1, +, -1) > >> 4) For doloop IV cand, no extra one cost like BIV, assign zero cost for > >> step. > >> 5) Support may_be_zero (regressed PR is in this case), the base of > >> doloop IV > >> can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv. > >> 6) Add more expr support in force_expr_to_var_cost for reasonable cost > >> calculation on the IV base with may_be_zero (like COND_EXPR). > >> 7) Set zero cost when using doloop IV cand for doloop use. > >> 8) Add three hooks (should we merge _generic and _address?). > >> *) have_count_reg_decr_p, is to indicate the target has special > >> hardware > >> count register, we shouldn't consider the impact of doloop IV when > >> calculating register pressures. > >> *) doloop_cost_for_generic, is the extra cost when using doloop IV > >> cand for > >> generic type IV use. > >> *) doloop_cost_for_address, is the extra cost when using doloop IV > >> cand for > >> address type IV use. > > What will happen if doloop IV cand be used for generic/address type iv > > use? Can RTL doloop can still perform doloop optimization in this > > case? > > > > On Power, we put the iteration count into hardware count register, it takes > very > high cost to move the count to GPR, so the cost is set as INF to make it > impossible > to use it for generic/address type iv use. But as some discussion before, on > some > targets using GPR instead of hardware count register, they probably want to > use this > doloop iv used for other uses if profitable. These two hooks offer the > possibility. > In that case, I think RTL doloop can still perform since it can still get the > pattern and transform. The generic/address uses can still use it. > >> > >> Bootstrapped on powerpc64le-linux-gnu and regression testing passed > >> excepting > >> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is > >> tracked > >> by PR89983. > >> > >> Any comments and suggestions are highly appreciated. Thanks! > > Not sure if I understand the patch correctly, some comments embedded. > > > > + /* The number of doloop candidate in the set. */ > > + unsigned n_doloop_cands; > > + > > This is unnecessary. See below comments. > > > > - add_candidate_1 (data, base, step, important, > > - IP_NORMAL, use, NULL, orig_iv); > > + add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, > > doloop, > > + orig_iv); > > if (ip_end_pos (data->current_loop) > > && allow_ip_end_pos_p (data->current_loop)) > > - add_candidate_1 (data, base, step, important, IP_END, use, NULL, > > orig_iv); > > + add_candidate_1 (data, base, step, important, IP_END, use, NULL, > > doloop, > > + orig_iv); > > Do we need to skip ip_end_pos case for doloop candidate? Because the > > candidate increment will be inserted in latch, i.e, increment position > > is after exit condition. > > > > Yes, we should skip it. Currently function find_doloop_use has the check on > an > empty latch and gimple_cond to latch, partially excluding it. But it's still > good > to guard it directly here. > > > - tree_to_aff_combination (iv->base, type, val); > > + tree base = iv->base; > > + /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to > > extract > > + the value under !may_be_zero to get the compact bound which also well > > fits > > + for may_be_zero since we ensure the value for it is const one. */ > > + if (cand->doloop_p && desc->may_be_zero && !integer_zerop > > (desc->may_be_zero)) > > + base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter), > > + unshare_expr (rewrite_to_non_trapping_overflow > > (niter)), > > + build_int_cst (TREE_TYPE (niter), 1)); > > + tree_to_aff_combination (base, type, val); > > I don't quite follow here. The iv->base is computed from niter, I > > suppose compact bound is for cheaper candidate initialization? Why > > it's possible to extract !may_be_zero niter for may_be_zero here? The > > niter under !may_be_zero has no indication about the real niter under > > may_be_zero. > > > > As you note below, the cand_value for doloop would be zero, but for the case > may_be_zero set, the current calculation would take care of the whole niter > expression including the cond_expr introduced by may_be_zero check, it's > unexpected. The purpose is to use the value under condition !may_be_zero > for the calculation, and yes, to get expected zero finally. > > > - cand_value_at (loop, cand, use->stmt, desc->niter, &bnd); > > + cand_value_at (loop, cand, use->stmt, desc, &bnd); > > If I understand correctly, doloop use/cand will only be > > identified/added for single exit loop, and there will be only one > > cond(doloop) iv_use and only one doloop cand for doloop loop. So the > > cand_value at niter at use position would be 0. If that's the case, > > we can skip calling cand_value_at here for doloop cand. The change to > > cand_value_at would be unnecessary neither. > > > > Exactly, I'll add the early return with zero bound for doloop. > > > - expensive. */ > > - if (!integer_zerop (desc->may_be_zero)) > > + expensive. > > + > > + For doloop candidate, we have considered MAY_BE_ZERO for IV base, > > need to > > + support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check. */ > > + if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p) > > return iv_elimination_compare_lt (data, cand, comp, desc); > > And we can early return before this? > > > > OK. > > > + if (may_be_zero) > > + { > > + if (COMPARISON_CLASS_P (may_be_zero)) > > + { > > + niter = fold_build3 (COND_EXPR, ntype, may_be_zero, > > + build_int_cst (ntype, 0), > > + rewrite_to_non_trapping_overflow (niter)); > > + } > > + /* Don't try to obtain the iteration count expression when > > may_be_zero is > > + integer_nonzerop (actually iteration count is one) or else. */ > > + else > > + return; > > + } > > + > > + tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter), > > + build_int_cst (ntype, 1)); > > niter is the number of latch executions, so niter + 1 could wrap here, > > but guess it's not a problem the similar issue is not handled in > > vectorizer neither. > > > > OK. > > > + unsigned n_old = data->regs_used, n_spr_for_doloop = 0; > > + /* If target supports count register for doloop, it doesn't take GPR. */ > > + if (targetm.have_count_reg_decr_p) > > + n_spr_for_doloop = n_doloop_cands; > > + unsigned n_new = n_invs + n_cands - n_spr_for_doloop; > > Not necessary. See below. > > > - cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands); > > + cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands, > > + ivs->n_doloop_cands); > > Also. > > > > ivs->n_cands--; > > + if (cp->cand->doloop_p) > > + ivs->n_doloop_cands--; > > > > ivs->n_cands++; > > + if (cp->cand->doloop_p) > > + ivs->n_doloop_cands++; > > You can just book n_cands under condition !cp->cand->doloop_p. > > If my understanding is correct, you are suggesting the code like: > > if (!cp->cand->doloop_p) > ivs->n_cands++; > > But I'm afraid that it can NOT satisfy the need in function > ivopts_estimate_reg_pressure. As the comments, "if target supports > count register for doloop it doesn't take GPR.". If we make doloop > cand invisible in n_cands, it's fine for target with count register, > but we may miss to count them on targets without count register. Why not one more step do checks: if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p) ivs->n_cands++;
Thanks, bin > > > > > + if (flag_branch_on_count_reg && generic_predict_doloop_p (data)) > > + { > > + if (find_doloop_use (data)) > > + { > > + data->doloop_use_p = true; > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + { > > + fprintf (dump_file, > > + "Predict loop %d can perform" > > + " doloop optimization later.\n", > > + loop->num); > > + flow_loop_dump (loop, dump_file, NULL, 1); > > + } > > + } > > + } > > + > > Please factor this into a function to keep caller short. > > > > OK. > > > Thanks! > Kewen >