On 29/10/2024 15:53, Alex Coplan wrote: > On 29/10/2024 13:39, Richard Biener wrote: > > On Mon, 28 Oct 2024, Alex Coplan wrote: > > > > > This allows us to vectorize more loops with early exits by forcing > > > peeling for alignment to make sure that we're guaranteed to be able to > > > safely read an entire vector iteration without crossing a page boundary. > > > > > > To make this work for VLA architectures we have to allow compile-time > > > non-constant target alignments. We also have to override the result of > > > the target's preferred_vector_alignment hook if it isn't a power-of-two > > > multiple of the TYPE_SIZE of the chosen vector type. > > > > > > There is currently an implicit assumption that the TYPE_SIZE of the > > > vector type is itself a power of two. For non-VLA types this > > > could be checked directly in the vectorizer. For VLA types I > > > had discussed offline with Richard S about adding a target hook to allow > > > the vectorizer to query the backend to confirm that a given VLA type > > > is known to have a power-of-two size at runtime. > > > > GCC assumes all vectors have power-of-two size, so I don't think we > > need to check anything but we'd instead have to make sure the > > target constrains the hardware when this assumption doesn't hold > > in silicon. > > Ah, I didn't realise this was already assumed to be the case (even for > VLA in a target-independent way). In that case it sounds like we might > not need the check/hook. Thanks. > > > > > > I thought we > > > might be able to do this check in vector_alignment_reachable_p. Any > > > thoughts on that, richi? > > > > For the purpose of alignment peeling yeah, I guess this would be > > a possible place to check this. The hook is currently used for > > the case where the element has a lower alignment than its > > size and thus vector alignment cannot be reached by peeling. > > > > Btw, I thought we can already apply peeling for alignment for > > VLA vectors ... > > Yes. However, without this patch we don't attempt alignment peeling by > a compile-time unknown quantity (e.g. to get aligned to the size of a > VLA vector). As things stand we take whatever > targetm.preferred_vector_alignment says we should align to, and for SVE > that is typically a compile-time constant, I think (see > aarch64_vectorize_preferred_vector_alignment). > > > > > > gcc/ChangeLog: > > > > > > * tree-vect-data-refs.cc (vect_analyze_early_break_dependences): > > > Set need_peeling_for_alignment flag on read DRs instead of > > > failing vectorization. Punt on gathers. > > > (dr_misalignment): Handle non-constant target alignments. > > > (vect_compute_data_ref_alignment): If need_peeling_for_alignment > > > flag is set on the DR, then override the target alignment chosen > > > by the preferred_vector_alignment hook to choose a safe > > > alignment. > > > (vect_supportable_dr_alignment): Override > > > support_vector_misalignment hook if need_peeling_for_alignment > > > is set on the DR: in this case we must return > > > dr_unaligned_unsupported in order to force peeling. > > > * tree-vect-loop-manip.cc (vect_do_peeling): Allow prolog > > > peeling by a compile-time non-constant amount. > > > * tree-vectorizer.h (dr_vec_info): Add new flag > > > need_peeling_for_alignment. > > > --- > > > gcc/tree-vect-data-refs.cc | 77 ++++++++++++++++++++++++++++++------- > > > gcc/tree-vect-loop-manip.cc | 6 --- > > > gcc/tree-vectorizer.h | 5 +++ > > > 3 files changed, 68 insertions(+), 20 deletions(-) > > > > Eh, where's the inline copy ... > > Hmm, I suppose I should be passing --inline to git format-patch rather > than --attach. > > > > > @@ -739,15 +739,22 @@ vect_analyze_early_break_dependences (loop_vec_info > > loop_vinfo) > > if (DR_IS_READ (dr_ref) > > && !ref_within_array_bound (stmt, DR_REF (dr_ref))) > > { > > + if (STMT_VINFO_GATHER_SCATTER_P (stmt_vinfo)) > > + { > > + const char *msg > > > > you want to add STMT_VINFO_STRIDED_P as well. > > Ack, thanks. > > > > > /* Vector size in bytes. */ > > + poly_uint64 safe_align > > + = exact_div (tree_to_poly_uint64 (TYPE_SIZE (vectype)), > > BITS_PER_UNIT); > > > > safe_align = TYPE_SIZE_UNIT (vectype); > > Nice, thanks. > > > > > + /* Multiply by the unroll factor to get the number of bytes read > > + per vector iteration. */ > > + if (loop_vinfo) > > + { > > + auto num_copies = vect_get_num_copies (loop_vinfo, vectype); > > + gcc_checking_assert (pow2p_hwi (num_copies)); > > + safe_align *= num_copies; > > > > the unroll factor is the vectorization factor > > I'm probably being dense here, but I thought that the VF was the number > of scalar iterations we perform per vector iteration. Is that not the > case? > > > - I think the above goes > > wrong for grouped accesses like an early break condition > > > > if (a[2*i] == a[2*i+1]) > > > > or so. Thus, multiply by LOOP_VINFO_VECT_FACTOR (loop_vinfo). > > If my understanding of VF above is correct, then won't this end up being > too large a number? E.g. (AIUI) if we vectorize with a single DR with > vectype having TYPE_MODE of V16QI, then isn't the VF = 16? So we'd take > the safe_align=16 alignment (size of the vectype) we actually need, > multiply it by VF = 16, and try to align unnecessarily to 64 bytes?
of course 16 * 16 = 256, not 64 (but hopefully the point still stands despite my failure to do basic arithmetic) :) > > With the above logic (multiplying by num_copies) I was trying to handle > cases like: > > short a[256]; > void f(unsigned *b) > { > for (int i = 0; i < 128; i++) { > if (b[i] == 42) > break; > a[2*i] = b[i]; > a[2*i+1] = b[i]; > } > } > > where for every vector iteration we have to load two vectors for b (and > only store one vector's worth for a). Hence if we need to peel for b, > we need to align to (num_copies = 2) * TREE_SIZE (vectype) bytes. > > > Note this number doesn't need to be a power of two (and num_copies > > above neither) > > To clarify, do you mean that there is no guarantee that the VF (or > indeed num_copies above) will be a power of two, and as such we > shouldn't assert it but rather fail vectorization if this is the case? > > The other interpretation of the above is that we don't need to require > this for correctness, but I think we do (perhaps I'm missing something, > though). > > Thanks a lot for the reviews. > > Alex > > > > > The rest of the patch looks good to me. > > > > Richard. > >