Re: [RFC PATCH 1/5] vect: Force alignment peeling to vectorize more early break loops

Alex Coplan Tue, 29 Oct 2024 08:54:22 -0700

On 29/10/2024 13:39, Richard Biener wrote:
> On Mon, 28 Oct 2024, Alex Coplan wrote:
> 
> > This allows us to vectorize more loops with early exits by forcing
> > peeling for alignment to make sure that we're guaranteed to be able to
> > safely read an entire vector iteration without crossing a page boundary.
> > 
> > To make this work for VLA architectures we have to allow compile-time
> > non-constant target alignments.  We also have to override the result of
> > the target's preferred_vector_alignment hook if it isn't a power-of-two
> > multiple of the TYPE_SIZE of the chosen vector type.
> > 
> > There is currently an implicit assumption that the TYPE_SIZE of the
> > vector type is itself a power of two.  For non-VLA types this
> > could be checked directly in the vectorizer.  For VLA types I
> > had discussed offline with Richard S about adding a target hook to allow
> > the vectorizer to query the backend to confirm that a given VLA type
> > is known to have a power-of-two size at runtime.
> 
> GCC assumes all vectors have power-of-two size, so I don't think we
> need to check anything but we'd instead have to make sure the
> target constrains the hardware when this assumption doesn't hold
> in silicon.


Ah, I didn't realise this was already assumed to be the case (even for
VLA in a target-independent way).  In that case it sounds like we might
not need the check/hook.  Thanks.

> 
> >  I thought we
> > might be able to do this check in vector_alignment_reachable_p.  Any
> > thoughts on that, richi?
> 
> For the purpose of alignment peeling yeah, I guess this would be
> a possible place to check this.  The hook is currently used for
> the case where the element has a lower alignment than its
> size and thus vector alignment cannot be reached by peeling.
> 
> Btw, I thought we can already apply peeling for alignment for
> VLA vectors ...

Yes.  However, without this patch we don't attempt alignment peeling by
a compile-time unknown quantity (e.g. to get aligned to the size of a
VLA vector).  As things stand we take whatever
targetm.preferred_vector_alignment says we should align to, and for SVE
that is typically a compile-time constant, I think (see
aarch64_vectorize_preferred_vector_alignment).

> 
> > gcc/ChangeLog:
> > 
> >     * tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
> >     Set need_peeling_for_alignment flag on read DRs instead of
> >     failing vectorization.  Punt on gathers.
> >     (dr_misalignment): Handle non-constant target alignments.
> >     (vect_compute_data_ref_alignment): If need_peeling_for_alignment
> >     flag is set on the DR, then override the target alignment chosen
> >     by the preferred_vector_alignment hook to choose a safe
> >     alignment.
> >     (vect_supportable_dr_alignment): Override
> >     support_vector_misalignment hook if need_peeling_for_alignment
> >     is set on the DR: in this case we must return
> >     dr_unaligned_unsupported in order to force peeling.
> >     * tree-vect-loop-manip.cc (vect_do_peeling): Allow prolog
> >     peeling by a compile-time non-constant amount.
> >     * tree-vectorizer.h (dr_vec_info): Add new flag
> >     need_peeling_for_alignment.
> > ---
> >  gcc/tree-vect-data-refs.cc  | 77 ++++++++++++++++++++++++++++++-------
> >  gcc/tree-vect-loop-manip.cc |  6 ---
> >  gcc/tree-vectorizer.h       |  5 +++
> >  3 files changed, 68 insertions(+), 20 deletions(-)
> 
> Eh, where's the inline copy ...

Hmm, I suppose I should be passing --inline to git format-patch rather
than --attach.

> 
> @@ -739,15 +739,22 @@ vect_analyze_early_break_dependences (loop_vec_info 
> loop_vinfo)
>           if (DR_IS_READ (dr_ref)
>               && !ref_within_array_bound (stmt, DR_REF (dr_ref)))
>             {
> +             if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo))
> +               {
> +                 const char *msg
> 
> you want to add STMT_VINFO_STRIDED_P as well.

Ack, thanks.

> 
>       /* Vector size in bytes.  */
> +      poly_uint64 safe_align
> +       = exact_div (tree_to_poly_uint64 (TYPE_SIZE (vectype)), 
> BITS_PER_UNIT);
> 
> safe_align = TYPE_SIZE_UNIT (vectype);

Nice, thanks.

> 
> +      /* Multiply by the unroll factor to get the number of bytes read
> +        per vector iteration.  */
> +      if (loop_vinfo)
> +       {
> +         auto num_copies = vect_get_num_copies (loop_vinfo, vectype);
> +         gcc_checking_assert (pow2p_hwi (num_copies));
> +         safe_align *= num_copies;
> 
> the unroll factor is the vectorization factor

I'm probably being dense here, but I thought that the VF was the number
of scalar iterations we perform per vector iteration.  Is that not the
case?

> - I think the above goes
> wrong for grouped accesses like an early break condition
> 
>  if (a[2*i] == a[2*i+1])
> 
> or so.  Thus, multiply by LOOP_VINFO_VECT_FACTOR (loop_vinfo).

If my understanding of VF above is correct, then won't this end up being
too large a number?  E.g. (AIUI) if we vectorize with a single DR with
vectype having TYPE_MODE of V16QI, then isn't the VF = 16?  So we'd take
the safe_align=16 alignment (size of the vectype) we actually need,
multiply it by VF = 16, and try to align unnecessarily to 64 bytes?

With the above logic (multiplying by num_copies) I was trying to handle
cases like:

  short a[256];
  void f(unsigned *b)
  {
      for (int i = 0; i < 128; i++) {
        if (b[i] == 42)
          break;
        a[2*i] = b[i];
        a[2*i+1] = b[i];
      }
  }

where for every vector iteration we have to load two vectors for b (and
only store one vector's worth for a).  Hence if we need to peel for b,
we need to align to (num_copies = 2) * TREE_SIZE (vectype) bytes.

> Note this number doesn't need to be a power of two (and num_copies
> above neither)

To clarify, do you mean that there is no guarantee that the VF (or
indeed num_copies above) will be a power of two, and as such we
shouldn't assert it but rather fail vectorization if this is the case?

The other interpretation of the above is that we don't need to require
this for correctness, but I think we do (perhaps I'm missing something,
though).

Thanks a lot for the reviews.

Alex

> 
> The rest of the patch looks good to me.
> 
> Richard.
>

Re: [RFC PATCH 1/5] vect: Force alignment peeling to vectorize more early break loops

Reply via email to