"Kewen.Lin" <li...@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/7/21 下午3:57, Richard Biener wrote:
>> On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>>
>>> Hi,
>>>
>>> This patch adds the cost modeling for vector with length; it
>>> mainly follows what we generate for vector with length in the
>>> functions vect_set_loop_controls_directly and vect_gen_len,
>>> assuming the worst case.
>>>
>>> For Power, the length is expected to be in bits 0-7 (the high bits),
>>> so we have to model the cost of shifting it into place.  To avoid
>>> making other targets pay for this, I used a target hook to describe
>>> the extra cost, though I'm not sure that's the right approach.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
>>> param vect-partial-vector-usage=1.
>>>
>>> Any comments/suggestions are highly appreciated!
>> 
>> I don't like the introduction of an extra target hook for this.  All
>> vectorizer cost modeling should ideally go through
>> init_cost/add_stmt_cost/finish_cost.  If the extra costing is
>> not per stmt then either init_cost or finish_cost is appropriate.
>> Currently init_cost only gets a struct loop while we should
>> probably give it a vec_info * parameter so targets can
>> check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
>> 
>
> Thanks!  Your suggested way looks better.  I've removed the hook
> and now handle this in finish_cost.  The updated v2 is attached.
>
> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>       * config/rs6000/rs6000.c (adjust_vect_cost): New function.
>       (rs6000_finish_cost): Call function adjust_vect_cost.
>       * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
>       modeling for vector with length.
>
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 5a4f07d5810..f2724e792c9 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>    return retval;
>  }
>  
> +/* For target-specific vectorization costs that can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if they are satisfied.  One typical example is to model the
> +   shift cost for vector with length by counting the number of required
> +   lengths under condition LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +adjust_vect_cost (rs6000_cost_data *data)
> +{
> +  struct loop *loop = data->loop_info;
> +  gcc_assert (loop);
> +  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
> +
> +  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> +    {
> +      rgroup_controls *rgc;
> +      unsigned int num_vectors_m1;
> +      unsigned int shift_cnt = 0;
> +      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
> +     if (rgc->type)
> +       /* Each length needs one shift to fill into bits 0-7.  */
> +       shift_cnt += (num_vectors_m1 + 1);
> +
> +      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
> +                         NULL, NULL_TREE, 0, vect_body);
> +    }
> +}
> +
>  /* Implement targetm.vectorize.finish_cost.  */
>  
>  static void
> @@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
>    rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
>  
>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      adjust_vect_cost (cost_data);
> +      rs6000_density_test (cost_data);
> +    }
>  
>    /* Don't vectorize minimum-vectorization-factor, simple copy loops
>       that require versioning for any reason.  The vectorization is at
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index e933441b922..99e1fd7bdd0 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>       TODO: Build an expression that represents peel_iters for prologue and
>       epilogue to be used in a run-time test.  */
>  
> -  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
>      {
>        peel_iters_prologue = 0;
>        peel_iters_epilogue = 0;
> @@ -3663,45 +3663,145 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>         peel_iters_epilogue += 1;
>         stmt_info_for_cost *si;
>         int j;
> -       FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
> -                         j, si)
> +       FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
> +                         si)
>           (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
>                                 si->kind, si->stmt_info, si->vectype,
>                                 si->misalign, vect_epilogue);
>       }
>  
> -      /* Calculate how many masks we need to generate.  */
> -      unsigned int num_masks = 0;
> -      rgroup_controls *rgm;
> -      unsigned int num_vectors_m1;
> -      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> -     if (rgm->type)
> -       num_masks += num_vectors_m1 + 1;
> -      gcc_assert (num_masks > 0);
> -
> -      /* In the worst case, we need to generate each mask in the prologue
> -      and in the loop body.  One of the loop body mask instructions
> -      replaces the comparison in the scalar loop, and since we don't
> -      count the scalar comparison against the scalar body, we shouldn't
> -      count that vector instruction against the vector body either.
> -
> -      Sometimes we can use unpacks instead of generating prologue
> -      masks and sometimes the prologue mask will fold to a constant,
> -      so the actual prologue cost might be smaller.  However, it's
> -      simpler and safer to use the worst-case cost; if this ends up
> -      being the tie-breaker between vectorizing or not, then it's
> -      probably better not to vectorize.  */
> -      (void) add_stmt_cost (loop_vinfo,
> -                         target_cost_data, num_masks, vector_stmt,
> -                         NULL, NULL_TREE, 0, vect_prologue);
> -      (void) add_stmt_cost (loop_vinfo,
> -                         target_cost_data, num_masks - 1, vector_stmt,
> -                         NULL, NULL_TREE, 0, vect_body);
> -    }
> -  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
> -    {
> -      peel_iters_prologue = 0;
> -      peel_iters_epilogue = 0;
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +     {
> +       /* Calculate how many masks we need to generate.  */
> +       unsigned int num_masks = 0;
> +       rgroup_controls *rgm;
> +       unsigned int num_vectors_m1;
> +       FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
> +         if (rgm->type)
> +           num_masks += num_vectors_m1 + 1;
> +       gcc_assert (num_masks > 0);
> +
> +       /* In the worst case, we need to generate each mask in the prologue
> +          and in the loop body.  One of the loop body mask instructions
> +          replaces the comparison in the scalar loop, and since we don't
> +          count the scalar comparison against the scalar body, we shouldn't
> +          count that vector instruction against the vector body either.
> +
> +          Sometimes we can use unpacks instead of generating prologue
> +          masks and sometimes the prologue mask will fold to a constant,
> +          so the actual prologue cost might be smaller.  However, it's
> +          simpler and safer to use the worst-case cost; if this ends up
> +          being the tie-breaker between vectorizing or not, then it's
> +          probably better not to vectorize.  */
> +       (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
> +                             vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
> +       (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
> +                             vector_stmt, NULL, NULL_TREE, 0, vect_body);
> +     }
> +      else
> +     {
> +       gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
> +
> +       /* Account for the cost of LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
> +       if (npeel < 0)
> +         {
> +           peel_iters_prologue = assumed_vf / 2;
> +           /* See below: if peeled iterations are unknown, count a taken
> +              branch and a not taken branch per peeled loop.  */
> +           (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +                                 cond_branch_taken, NULL, NULL_TREE, 0,
> +                                 vect_prologue);
> +           (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +                                 cond_branch_not_taken, NULL, NULL_TREE, 0,
> +                                 vect_prologue);
> +         }
> +       else
> +         {
> +           peel_iters_prologue = npeel;
> +           if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +             /* See vect_get_known_peeling_cost: if peeled iterations are
> +                known but the number of scalar loop iterations is unknown,
> +                count a taken branch per peeled loop.  */
> +             (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +                                   cond_branch_taken, NULL, NULL_TREE, 0,
> +                                   vect_prologue);
> +         }

I think it'd be good to avoid duplicating this.  How about the
following structure?

  if (vect_use_loop_mask_for_alignment_p (…))
    {
      peel_iters_prologue = 0;
      peel_iters_epilogue = 0;
    }
  else if (npeel < 0)
    {
      … // A
    }
  else
    {
      …vect_get_known_peeling_cost stuff…
    }

but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:

  LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0

for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to
whatever value we'd normally use.  Then wrap:

      (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
                            NULL, NULL_TREE, 0, vect_epilogue);
      (void) add_stmt_cost (loop_vinfo,
                            target_cost_data, 1, cond_branch_not_taken,
                            NULL, NULL_TREE, 0, vect_epilogue);

in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue
stuff in A conditional on peel_iters_epilogue != 0.

This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P
code:

      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
        {
          /* We need to peel exactly one iteration.  */
          peel_iters_epilogue += 1;
          stmt_info_for_cost *si;
          int j;
          FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
                            j, si)
            (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
                                  si->kind, si->stmt_info, si->vectype,
                                  si->misalign, vect_epilogue);
        }

Then, after the above, have:

  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
    …add costs for mask overhead…
  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
    …add costs for lengths overhead…

So we'd have one block of code for estimating the prologue and epilogue
peeling cost, and a separate block of code for the loop control overhead.
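
Putting the two together, the overall shape would then be (again just
a sketch):

  /* Prologue/epilogue peeling costs.  */
  if (vect_use_loop_mask_for_alignment_p (loop_vinfo))
    {
      peel_iters_prologue = 0;
      peel_iters_epilogue = 0;
    }
  else if (npeel < 0)
    …A…
  else
    …vect_get_known_peeling_cost stuff…

  /* Loop control overhead.  */
  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
    …mask generation costs…
  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
    …length generation costs…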

Thanks,
Richard
