Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

Bin.Cheng Wed, 21 Aug 2019 22:47:13 -0700

On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> Hi Bin,
>
> Thanks for your time!
>
> on 2019/8/21 下午8:32, Bin.Cheng wrote:
> > On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <li...@linux.ibm.com> wrote:
> >>
> >> Hi!
> >>
> >> Comparing to the previous versions of implementation mainly based on the
> >> existing IV cands but zeroing the related group/use cost, this new one is 
> >> based
> >> on Richard and Segher's suggestion introducing one doloop dedicated IV 
> >> cand.
> >>
> >> Some key points are listed below:
> >>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV 
> >> cand.
> >>   2) Special name "doloop" assigned.
> >>   3) Doloop IV cand with form (niter+1, +, -1)
> >>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for 
> >> step.
> >>   5) Support may_be_zero (regressed PR is in this case), the base of 
> >> doloop IV
> >>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
> >>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
> >>      calculation on the IV base with may_be_zero (like COND_EXPR).
> >>   7) Set zero cost when using doloop IV cand for doloop use.
> >>   8) Add three hooks (should we merge _generic and _address?).
> >>     *) have_count_reg_decr_p, is to indicate the target has special 
> >> hardware
> >>        count register, we shouldn't consider the impact of doloop IV when
> >>        calculating register pressures.
> >>     *) doloop_cost_for_generic, is the extra cost when using doloop IV 
> >> cand for
> >>        generic type IV use.
> >>     *) doloop_cost_for_address, is the extra cost when using doloop IV 
> >> cand for
> >>        address type IV use.
> > What will happen if doloop IV cand be used for generic/address type iv
> > use?  Can RTL doloop can still perform doloop optimization in this
> > case?
> >
>
> On Power, we put the iteration count into hardware count register, it takes 
> very
> high cost to move the count to GPR, so the cost is set as INF to make it 
> impossible
> to use it for generic/address type iv use.  But as some discussion before, on 
> some
> targets using GPR instead of hardware count register, they probably want to 
> use this
> doloop iv used for other uses if profitable.  These two hooks offer the 
> possibility.
> In that case, I think RTL doloop can still perform since it can still get the
> pattern and transform.  The generic/address uses can still use it.
> >>
> >> Bootstrapped on powerpc64le-linux-gnu and regression testing passed 
> >> excepting
> >> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is 
> >> tracked
> >> by PR89983.
> >>
> >> Any comments and suggestions are highly appreciated.  Thanks!
> > Not sure if I understand the patch correctly, some comments embedded.
> >
> > +  /* The number of doloop candidate in the set.  */
> > +  unsigned n_doloop_cands;
> > +
> > This is unnecessary.  See below comments.
> >
> > -    add_candidate_1 (data, base, step, important,
> > -                    IP_NORMAL, use, NULL, orig_iv);
> > +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, 
> > doloop,
> > +                    orig_iv);
> >    if (ip_end_pos (data->current_loop)
> >        && allow_ip_end_pos_p (data->current_loop))
> > -    add_candidate_1 (data, base, step, important, IP_END, use, NULL, 
> > orig_iv);
> > +    add_candidate_1 (data, base, step, important, IP_END, use, NULL, 
> > doloop,
> > +                    orig_iv);
> > Do we need to skip ip_end_pos case for doloop candidate?  Because the
> > candidate increment will be inserted in latch, i.e, increment position
> > is after exit condition.
> >
>
> Yes, we should skip it.  Currently function find_doloop_use has the check on 
> an
> empty latch and gimple_cond to latch, partially excluding it.  But it's still 
> good
> to guard it directly here.
>
> > -  tree_to_aff_combination (iv->base, type, val);
> > +  tree base = iv->base;
> > +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to 
> > extract
> > +     the value under !may_be_zero to get the compact bound which also well 
> > fits
> > +     for may_be_zero since we ensure the value for it is const one.  */
> > +  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
> > (desc->may_be_zero))
> > +    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
> > +                       unshare_expr (rewrite_to_non_trapping_overflow 
> > (niter)),
> > +                       build_int_cst (TREE_TYPE (niter), 1));
> > +  tree_to_aff_combination (base, type, val);
> > I don't quite follow here.  The iv->base is computed from niter, I
> > suppose compact bound is for cheaper candidate initialization?  Why
> > it's possible to extract !may_be_zero niter for may_be_zero here?  The
> > niter under !may_be_zero has no indication about the real niter under
> > may_be_zero.
> >
>
> As you note below, the cand_value for doloop would be zero, but for the case
> may_be_zero set, the current calculation would take care of the whole niter
> expression including the cond_expr introduced by may_be_zero check, it's
> unexpected.  The purpose is to use the value under condition !may_be_zero
> for the calculation, and yes, to get expected zero finally.
>
> > -  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
> > +  cand_value_at (loop, cand, use->stmt, desc, &bnd);
> > If I understand correctly, doloop use/cand will only be
> > identified/added for single exit loop, and there will be only one
> > cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
> > cand_value at niter at use position would be 0.  If that's the case,
> > we can skip calling cand_value_at here for doloop cand.  The change to
> > cand_value_at would be unnecessary neither.
> >
>
> Exactly, I'll add the early return with zero bound for doloop.
>
> > -          expensive.  */
> > -  if (!integer_zerop (desc->may_be_zero))
> > +          expensive.
> > +
> > +     For doloop candidate, we have considered MAY_BE_ZERO for IV base, 
> > need to
> > +     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
> > +  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
> >      return iv_elimination_compare_lt (data, cand, comp, desc);
> > And we can early return before this?
> >
>
> OK.
>
> > +  if (may_be_zero)
> > +    {
> > +      if (COMPARISON_CLASS_P (may_be_zero))
> > +       {
> > +         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
> > +                              build_int_cst (ntype, 0),
> > +                              rewrite_to_non_trapping_overflow (niter));
> > +       }
> > +      /* Don't try to obtain the iteration count expression when 
> > may_be_zero is
> > +        integer_nonzerop (actually iteration count is one) or else.  */
> > +      else
> > +       return;
> > +    }
> > +
> > +  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> > +                          build_int_cst (ntype, 1));
> > niter is the number of latch executions, so niter + 1 could wrap here,
> > but guess it's not a problem the similar issue is not handled in
> > vectorizer neither.
> >
>
> OK.
>
> > +  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
> > +  /* If target supports count register for doloop, it doesn't take GPR.  */
> > +  if (targetm.have_count_reg_decr_p)
> > +    n_spr_for_doloop = n_doloop_cands;
> > +  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
> > Not necessary.  See below.
>
> > -  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
> > +  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
> > +                                       ivs->n_doloop_cands);
> > Also.
> >
> >        ivs->n_cands--;
> > +      if (cp->cand->doloop_p)
> > +       ivs->n_doloop_cands--;
> >
> >           ivs->n_cands++;
> > +         if (cp->cand->doloop_p)
> > +           ivs->n_doloop_cands++;
> > You can just book n_cands under condition !cp->cand->doloop_p.
>
> If my understanding is correct, you are suggesting the code like:
>
> if (!cp->cand->doloop_p)
>   ivs->n_cands++;
>
> But I'm afraid that it can NOT satisfy the need in function
> ivopts_estimate_reg_pressure.  As the comments, "if target supports
> count register for doloop it doesn't take GPR.".  If we make doloop
> cand invisible in n_cands, it's fine for target with count register,
> but we may miss to count them on targets without count register.
Why not one more step do checks:
if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
  ivs->n_cands++;


Thanks,
bin
>
> >
> > +  if (flag_branch_on_count_reg && generic_predict_doloop_p (data))
> > +    {
> > +      if (find_doloop_use (data))
> > +       {
> > +         data->doloop_use_p = true;
> > +         if (dump_file && (dump_flags & TDF_DETAILS))
> > +           {
> > +             fprintf (dump_file,
> > +                      "Predict loop %d can perform"
> > +                      " doloop optimization later.\n",
> > +                      loop->num);
> > +             flow_loop_dump (loop, dump_file, NULL, 1);
> > +           }
> > +       }
> > +    }
> > +
> > Please factor this into a function to keep caller short.
> >
>
> OK.
>
>
> Thanks!
> Kewen
>

Re: [PATCH v6 3/3] PR80791 Consider doloop cmp use in ivopts

Reply via email to