On Thu, 15 Feb 2024, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Wed, 14 Feb 2024, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > On Wed, 14 Feb 2024, Richard Sandiford wrote:
> >> >
> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> > The following avoids accessing out-of-bound vector elements when
> >> >> > native encoding a boolean vector with sub-BITS_PER_UNIT precision
> >> >> > elements.  The error was basing the number of elements to extract
> >> >> > on the rounded-up total byte size involved; the patch bases
> >> >> > everything on the total number of elements to extract instead.
> >> >>
> >> >> It's too long ago to be certain, but I think this was a deliberate
> >> >> choice.  The point of the new vector constant encoding is that it
> >> >> can give an allegedly sensible value for any given index, even
> >> >> out-of-range ones.
> >> >>
> >> >> Since the padding bits are undefined, we should in principle have a
> >> >> free choice of what to use.  And for VLA, it's often better to
> >> >> continue the existing pattern rather than force it to zero.
> >> >>
> >> >> I don't strongly object to changing it.  I think we should be
> >> >> careful about relying on zeroing for correctness though.  The bits
> >> >> are in principle undefined and we can't rely on reading zeros from
> >> >> equivalent memory or register values.
> >> >
> >> > The main motivation for a change here is to allow catching
> >> > out-of-bound indices again for VECTOR_CST_ELT, at least for constant
> >> > nunits, because an out-of-bound index might be a programming error
> >> > like fat-fingering the index.  I do think it's a regression that we
> >> > no longer catch those.
> >> >
> >> > It's probably also a bit non-obvious how an encoding continues, and
> >> > there might be DImode masks that can be represented by a
> >> > zero-extended QImode immediate but that, "continued", would require
> >> > a larger immediate.
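[Editor's note: for readers unfamiliar with the encoding under discussion, a
VECTOR_CST stores npatterns interleaved patterns of nelts_per_pattern encoded
values each, and a stepped pattern can be continued to any index, which is how
out-of-range reads get "allegedly sensible" values.  The following toy model
is illustrative only; the struct and names are not GCC's vector_builder API.]

```cpp
#include <cassert>
#include <vector>

// Toy model of the compressed VECTOR_CST encoding discussed above:
// npatterns interleaved patterns, each with nelts_per_pattern encoded
// values.  Indexing never fails: a stepped pattern (nelts_per_pattern
// == 3) continues with a constant step, so even indices beyond the
// vector's nominal element count yield a value.
struct encoded_vec
{
  unsigned npatterns;          // number of interleaved patterns
  unsigned nelts_per_pattern;  // 1 = splat, 2 = leading elt + splat, 3 = stepped
  std::vector<long> encoded;   // npatterns * nelts_per_pattern values

  long elt (unsigned i) const
  {
    if (i < encoded.size ())
      return encoded[i];
    unsigned pattern = i % npatterns;
    unsigned pos = i / npatterns;  // position within the pattern
    if (nelts_per_pattern < 3)
      // Non-stepped: the pattern's final encoded value repeats.
      return encoded[pattern + (nelts_per_pattern - 1) * npatterns];
    long last = encoded[pattern + 2 * npatterns];
    long step = last - encoded[pattern + npatterns];
    return last + (long) (pos - 2) * step;
  }
};
```

Under this model, { 0, 1, 2 } encoded as one stepped pattern continues as
3, 4, 5, ..., and a splat stays constant at every index.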
> >> >
> >> > The change also effectively only changes something for 1-byte
> >> > encodings, since nunits is a power of two and so is the element
> >> > size in bits.
> >>
> >> Yeah, but even there, there's an argument that all-1s (0xff) is a
> >> more obvious value for an all-1s mask.
> >>
> >> > A patch restoring the VECTOR_CST_ELT checking might be the
> >> > following:
> >> >
> >> > diff --git a/gcc/tree.cc b/gcc/tree.cc
> >> > index 046a558d1b0..4c9b05167fd 100644
> >> > --- a/gcc/tree.cc
> >> > +++ b/gcc/tree.cc
> >> > @@ -10325,6 +10325,9 @@ vector_cst_elt (const_tree t, unsigned int i)
> >> >    if (i < encoded_nelts)
> >> >      return VECTOR_CST_ENCODED_ELT (t, i);
> >> >
> >> > +  /* Catch out-of-bound element accesses.  */
> >> > +  gcc_checking_assert (maybe_gt (VECTOR_CST_NELTS (t), i));
> >> > +
> >> >    /* If there are no steps, the final encoded value is the right one.  */
> >> >    if (!VECTOR_CST_STEPPED_P (t))
> >> >      {
> >> >
> >> > but it triggers quite a bit via const_binop, for example
> >> >
> >> > #2  0x00000000011c1506 in const_binop (code=PLUS_EXPR,
> >> >     arg1=<vector_cst 0x7ffff6f40b40>, arg2=<vector_cst 0x7ffff6e731e0>)
> >> > (gdb) p debug_generic_expr (arg1)
> >> > { 12, 13, 14, 15 }
> >> > $5 = void
> >> > (gdb) p debug_generic_expr (arg2)
> >> > { -2, -2, -2, -3 }
> >> > (gdb) p count
> >> > $4 = 6
> >> > (gdb) l
> >> > 1711        if (!elts.new_binary_operation (type, arg1, arg2, step_ok_p))
> >> > 1712          return NULL_TREE;
> >> > 1713        unsigned int count = elts.encoded_nelts ();
> >> > 1714        for (unsigned int i = 0; i < count; ++i)
> >> > 1715          {
> >> > 1716            tree elem1 = VECTOR_CST_ELT (arg1, i);
> >> > 1717            tree elem2 = VECTOR_CST_ELT (arg2, i);
> >> > 1718
> >> > 1719            tree elt = const_binop (code, elem1, elem2);
> >> >
> >> > this seems like an error to me - why would we, for fixed-size
> >> > vectors and for PLUS, ever create a vector encoding with 6
> >> > elements?!  That seems at least inefficient to me?
> >>
> >> It's a case of picking your poison.  On the other side, operating
> >> individually on each element of a V64QI is inefficient when the
> >> representation says up-front that all elements are equal.
>
> True, though I wonder why, for VLS vectors, new_binary_operation
> doesn't cap the number of encoded elts at the fixed vector size,
> like doing
>
>   encoded_elts = ordered_min (TYPE_VECTOR_SUBPARTS (..), encoded_elts);
>
> or, if there's no good way to write it applying to both VLA and VLS,
> do it only when TYPE_VECTOR_SUBPARTS is constant.

ordered_min can't be used because there's no guarantee that encoded_elts
and TYPE_VECTOR_SUBPARTS are well-ordered for the VLA case.  E.g.
for a stepped (3-element) encoding and a length of 2+2X, the stepped
encoding is longer for X==0 and the vector is longer for X>0.

But yeah, in general, trying to enforce this for VLS would probably
lead to a proliferation of more "if VLA do one thing, if VLS do some
other thing".  The aim was to avoid that where it didn't seem strictly
necessary.

> >> Fundamentally, operations on VLA vectors are treated as functions
> >> that map patterns to patterns.  The number of elements that are
> >> consumed isn't really relevant to the function itself.  The VLA
> >> folders therefore rely on being able to read an element from a
> >> pattern even if the index is outside TYPE_VECTOR_SUBPARTS.
> >>
> >> There were two reasons for using VLA paths for VLS vectors.
> >> One I mentioned above: it saves time when all elements are equal,
> >> or have a similarly compact representation.  The other is that it
> >> makes VLA less special and ensures that the code gets more testing.
> >>
> >> Maybe one compromise between that and the assert would be:
> >>
> >> (1) enforce the assert only for VLS and
>
> > that's what I did by using maybe_gt?
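[Editor's note: the well-ordering point above can be made concrete.  A GCC
poly_int with one indeterminate is a value c0 + c1*X with X >= 0, and two such
values are ordered only when one comparison direction holds for every X.  The
sketch below is a toy model of that idea, not poly-int.h's real API.]

```cpp
#include <cassert>

// Toy degree-1 poly_int: value = c0 + c1 * X for an unknown X >= 0.
struct poly
{
  long c0, c1;
};

// a <= b for every X >= 0: must hold at X == 0 (constant terms)
// and as X grows without bound (coefficients of X).
static bool
known_le (poly a, poly b)
{
  return a.c0 <= b.c0 && a.c1 <= b.c1;
}

// The two values are well-ordered if one direction holds for all X.
static bool
ordered_p (poly a, poly b)
{
  return known_le (a, b) || known_le (b, a);
}
```

For the example in the thread, a 3-element encoding versus a 2+2X vector
length gives ordered_p (poly{3, 0}, poly{2, 2}) == false: at X == 0 the
encoding is longer, and for X > 0 the vector is.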
> >
> >> (2) add new checks to ensure that a VLA-friendly operation will never
> >>     read out-of-bounds for VLS vectors
> >>
> >> But I think this would be awkward.  E.g. we now build reversed
> >> vectors as a 3-element series N-1, N-2, N-3.  It would be nice not
> >> to have to special-case N==2 by suppressing N-3.  And the condition
> >> for (2) might not always be obvious.
>
> > If we know only N <= 2 is problematic then having must_eq ()
> > special cases there makes sense IMO.

I'm not sure it's just that.  E.g. a zip is { 0, N, 1, N+1, 2, N+2, ... },
which would need to be curtailed for N==2 or N==4.

> >> Another option would be to have special accessors that are allowed
> >> to read out of bounds, and add the assert (for both VLA & VLS) to
> >> VECTOR_CST_ELT.  It might take a while to root out all the places
> >> that need to change though.
>
> > Yes, that's another possibility.
> >
> > The point is (and we're seeing it now with the behavior of
> > encoding QImode bitmasks) that the behavior isn't entirely
> > obvious when VLS vectors are involved.  There's a memset zero
> > of the QImode buffer, so anything other than zero-padding is odd.

Which memset do you mean?  The one in:

  /* Zero the buffer and then set bits later where necessary.  */
  int extract_bytes = MIN (len, total_bytes - off);
  if (ptr)
    memset (ptr, 0, extract_bytes);

?  If so, that's there for elt_bits > 1.  It means that the canonical
form of a 2-bit active mask element is 01 rather than 11.  (And that's
something that might end up needing to be target-configurable.  The
current behaviour is guided by SVE, although it shouldn't matter much
in practice there.)

> > It's also not clear to me we'd get reliable sign extension
> > (we probably don't?).

For which case?  The idea isn't that we always sign-extend or always
zero-extend, but that we repeat the pattern.  So an all-1s mask would
give an all-1s QImode.  A 1010 mask would give a 10101010 QI, etc.
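[Editor's note: the two padding policies contrasted here (zero-filling the
unused high bits versus continuing the mask pattern into them) can be shown
with a standalone sketch of packing a 1-bit-per-element mask into a byte.
It is illustrative only and is not tree.cc's native_encode logic.]

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pack an n-element 1-bit mask into a byte, LSB first.  With
// repeat_padding the unused high bits continue the mask pattern
// (an all-1s 4-element mask becomes 0xff, a 1010 mask becomes
// 10101010); without it they are zero-filled (0x0f and 00000101
// respectively).
static uint8_t
pack_mask (const std::vector<int> &bits, bool repeat_padding)
{
  uint8_t byte = 0;
  for (unsigned i = 0; i < 8; ++i)
    {
      int bit = (i < bits.size () || repeat_padding)
                ? bits[i % bits.size ()] : 0;
      byte |= (uint8_t) ((bit & 1) << i);
    }
  return byte;
}
```

Which of the two results is "correct" is exactly the question in the thread:
both are valid fillings of undefined padding bits, so neither can be relied
on for correctness.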
But wouldn't 0 1 2 get you 0 1 2 3 4 5?  (not for bools, of course)

Anyway, for QImode the efficiency argument is likely moot unless some
targets only have, say, 4-bit immediates, and even those probably always
sign- or zero-extend rather than pad the upper 4 bits with a repeat of
the lower 4 ... so "repeating" isn't the natural thing to do; the
natural thing is zero- or sign-extending into the padding.

Richard.