On Tue, 4 Jul 2023, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Tue, 4 Jul 2023, Richard Biener wrote:
> >
> >> On Mon, 3 Jul 2023, Richard Sandiford wrote:
> >> 
> >> > Richard Biener <rguent...@suse.de> writes:
> >> > > The following removes late deciding to elide vectorized epilogues to
> >> > > the analysis phase and also avoids altering the epilogues niter.
> >> > > The costing part from vect_determine_partial_vectors_and_peeling is
> >> > > moved to vect_analyze_loop_costing where we use the main loop
> >> > > analysis to constrain the epilogue scalar iterations.
> >> > >
> >> > > I have not tried to integrate this with 
> >> > > vect_known_niters_smaller_than_vf.
> >> > >
> >> > > It seems the for_epilogue_p parameter in
> >> > > vect_determine_partial_vectors_and_peeling is largely useless and
> >> > > we could compute that in the function itself.
> >> > >
> >> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?
> >> > >
> >> > > I suppose testing on aarch64 would be nice-to-have - any takers?
> >> > 
> >> > Sorry, ran this earlier today and then forgot about it.  And yeah,
> >> > it passes bootstrap & regtest on aarch64-linux-gnu (all languages).
> >> > 
> >> > LGTM FWIW, except:
> >> > 
> >> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> >> > > index 0a03f56aae7..f39a1ecb306 100644
> >> > > --- a/gcc/tree-vect-loop.cc
> >> > > +++ b/gcc/tree-vect-loop.cc
> >> > > @@ -2144,14 +2144,76 @@ vect_analyze_loop_costing (loop_vec_info 
> >> > > loop_vinfo,
> >> > >  
> >> > >    /* Only loops that can handle partially-populated vectors can have 
> >> > > iteration
> >> > >       counts less than the vectorization factor.  */
> >> > > -  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> >> > > +  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> >> > > +      && vect_known_niters_smaller_than_vf (loop_vinfo))
> >> > >      {
> >> > > -      if (vect_known_niters_smaller_than_vf (loop_vinfo))
> >> > > +      if (dump_enabled_p ())
> >> > > +      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> > > +                       "not vectorized: iteration count smaller than "
> >> > > +                       "vectorization factor.\n");
> >> > > +      return 0;
> >> > > +    }
> >> > > +
> >> > > +  /* If we know the number of iterations we can do better, for the
> >> > > +     epilogue we can also decide whether the main loop leaves us
> >> > > +     with enough iterations, prefering a smaller vector epilog then
> >> > > +     also possibly used for the case we skip the vector loop.  */
> >> > > +  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> >> > > +      && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> >> > > +    {
> >> > > +      widest_int scalar_niters
> >> > > +      = wi::to_widest (LOOP_VINFO_NITERSM1 (loop_vinfo)) + 1;
> >> > > +      if (LOOP_VINFO_EPILOGUE_P (loop_vinfo))
> >> > > +      {
> >> > > +        loop_vec_info orig_loop_vinfo
> >> > > +          = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
> >> > > +        unsigned lowest_vf
> >> > > +          = constant_lower_bound (LOOP_VINFO_VECT_FACTOR 
> >> > > (orig_loop_vinfo));
> >> > > +        int prolog_peeling = 0;
> >> > > +        if (!vect_use_loop_mask_for_alignment_p (loop_vinfo))
> >> > > +          prolog_peeling = LOOP_VINFO_PEELING_FOR_ALIGNMENT 
> >> > > (orig_loop_vinfo);
> >> > > +        if (prolog_peeling >= 0
> >> > > +            && known_eq (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo),
> >> > > +                         lowest_vf))
> >> > > +          {
> >> > > +            unsigned gap
> >> > > +              = LOOP_VINFO_PEELING_FOR_GAPS (orig_loop_vinfo) ? 1 : 0;
> >> > > +            scalar_niters = ((scalar_niters - gap - prolog_peeling)
> >> > > +                             % lowest_vf + gap);
> >> > 
> >> > Are you sure we want this + gap?  A vectorised epilogue can't handle the
> >> > gap either, at least for things that use (say) the first vector of LD2
> >> > and ignore the second vector.
> >> 
> >> I must confess I blindly copied this code from vect_do_peeling which did
> >> 
> >> -      unsigned HOST_WIDE_INT eiters
> >> -       = (LOOP_VINFO_INT_NITERS (loop_vinfo)
> >> -          - LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo));
> >> -
> >> -      eiters -= prolog_peeling;
> >> -      eiters
> >> -       = eiters % lowest_vf + LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
> >> 
> >> I think it's correct - the main vector loop processes
> >> (scalar_niters - gap - prolog_peeling) % vf iterations and it leaves
> >> that one 'gap' iteration to the epilogue.  Yes, that cannot handle
> >> the gap either, so we should subtract it's gap (maybe we were able
> >> to elide the gap peeling with lower VF) which means altering the
> >> two conditions below.  That btw also holds for the main vector
> >> loop.
> >> 
> >> I think I'm going to incrementally try to fix this so we can bisect
> >> issues from moving this code from transform to analysis separately
> >> from changing the actual heuristics.
> >
> > So it looks correct even since we below check both
> >
> >       /* Check that the loop processes at least one full vector.  */
> >       poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> >       if (known_lt (scalar_niters, vf))
> >         {
> >
> > and
> >
> >       /* If we need to peel an extra epilogue iteration to handle data
> >          accesses with gaps, check that there are enough scalar iterations
> >          available.
> >
> >          The check above is redundant with this one when peeling for gaps,
> >          but the distinction is useful for diagnostics.  */
> >       if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> >           && known_le (scalar_niters, vf))
> >
> > so here we take that into consideration.
> 
> Ah, I see.  I was thinking mostly of the following:
> 
> >> > > +            if (scalar_niters == 0)
> >> > > +              {
> >> > > +                if (dump_enabled_p ())
> >> > > +                  dump_printf_loc (MSG_MISSED_OPTIMIZATION, 
> >> > > vect_location,
> >> > > +                                   "not vectorized: loop never 
> >> > > entered\n");
> >> > > +                return 0;
> >> > > +              }
> 
> which looked odd when we'd just added "gaps" back in.  The loop is never
> entered when there's a single iteration left over for gaps either.
> 
> Maybe it would be clearer to delete the == 0 check?  Or alternatively,
> do that check before adding gaps back in?

Yeah, I think the check is now useless - 'scalar_niters' used to be
unsigned int and we'd subtract one from it.  But that's all gone now
and widest_int will handle possibly negative scalar_niters just fine.

I'll delete that check.

Thanks,
Richard.

> Thanks,
> Richard
> 
> 
> >> > > +          }
> >> > > +      }
> >> > > +
> >> > > +      /* Check that the loop processes at least one full vector.  */
> >> > > +      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> >> > > +      if (known_lt (scalar_niters, vf))
> >> > >        {
> >> > >          if (dump_enabled_p ())
> >> > >            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> > > -                           "not vectorized: iteration count smaller 
> >> > > than "
> >> > > -                           "vectorization factor.\n");
> >> > > +                           "loop does not have enough iterations "
> >> > > +                           "to support vectorization.\n");
> >> > > +        return 0;
> >> > > +      }
> >> > > +
> >> > > +      /* If we need to peel an extra epilogue iteration to handle data
> >> > > +       accesses with gaps, check that there are enough scalar 
> >> > > iterations
> >> > > +       available.
> >> > > +
> >> > > +       The check above is redundant with this one when peeling for 
> >> > > gaps,
> >> > > +       but the distinction is useful for diagnostics.  */
> >> > > +      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> >> > > +        && known_le (scalar_niters, vf))
> >> > > +      {
> >> > > +        if (dump_enabled_p ())
> >> > > +          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >> > > +                           "loop does not have enough iterations "
> >> > > +                           "to support peeling for gaps.\n");
> >> > >          return 0;
> >> > >        }
> >> > >      }
> >> > > @@ -2502,31 +2564,6 @@ vect_determine_partial_vectors_and_peeling 
> >> > > (loop_vec_info loop_vinfo,
> >> > >                              LOOP_VINFO_VECT_FACTOR 
> >> > > (orig_loop_vinfo)));
> >> > >      }
> >> > >  
> >> > > -  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >> > > -      && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> >> > > -    {
> >> > > -      /* Check that the loop processes at least one full vector.  */
> >> > > -      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> >> > > -      tree scalar_niters = LOOP_VINFO_NITERS (loop_vinfo);
> >> > > -      if (known_lt (wi::to_widest (scalar_niters), vf))
> >> > > -      return opt_result::failure_at (vect_location,
> >> > > -                                     "loop does not have enough 
> >> > > iterations"
> >> > > -                                     " to support vectorization.\n");
> >> > > -
> >> > > -      /* If we need to peel an extra epilogue iteration to handle data
> >> > > -       accesses with gaps, check that there are enough scalar 
> >> > > iterations
> >> > > -       available.
> >> > > -
> >> > > -       The check above is redundant with this one when peeling for 
> >> > > gaps,
> >> > > -       but the distinction is useful for diagnostics.  */
> >> > > -      tree scalar_nitersm1 = LOOP_VINFO_NITERSM1 (loop_vinfo);
> >> > > -      if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> >> > > -        && known_lt (wi::to_widest (scalar_nitersm1), vf))
> >> > > -      return opt_result::failure_at (vect_location,
> >> > > -                                     "loop does not have enough 
> >> > > iterations"
> >> > > -                                     " to support peeling for 
> >> > > gaps.\n");
> >> > > -    }
> >> > > -
> >> > >    LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> >> > >      = (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> >> > >         && need_peeling_or_partial_vectors_p);
> >> > > @@ -3002,7 +3039,8 @@ start_over:
> >> > >       assuming that the loop will be used as a main loop.  We will redo
> >> > >       this analysis later if we instead decide to use the loop as an
> >> > >       epilogue loop.  */
> >> > > -  ok = vect_determine_partial_vectors_and_peeling (loop_vinfo, false);
> >> > > +  ok = vect_determine_partial_vectors_and_peeling
> >> > > +       (loop_vinfo, LOOP_VINFO_EPILOGUE_P (loop_vinfo));
> >> > >    if (!ok)
> >> > >      return ok;
> >> > 
> >> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

Reply via email to