On Thu, 15 Feb 2024, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Wed, 14 Feb 2024, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > On Wed, 14 Feb 2024, Richard Sandiford wrote:
> >> >
> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> > The following avoids accessing out-of-bound vector elements when
> >> >> > native encoding a boolean vector with sub-BITS_PER_UNIT precision
> >> >> > elements.  The error was basing the number of elements to extract
> >> >> > on the rounded-up total byte size involved; the patch bases
> >> >> > everything on the total number of elements to extract instead.
> >> >> 
> >> >> It's too long ago to be certain, but I think this was a
> >> >> deliberate choice.
> >> >> The point of the new vector constant encoding is that it can give an
> >> >> allegedly sensible value for any given index, even out-of-range ones.
> >> >> 
> >> >> Since the padding bits are undefined, we should in principle have a free
> >> >> choice of what to use.  And for VLA, it's often better to continue the
> >> >> existing pattern rather than force to zero.
> >> >> 
> >> >> I don't strongly object to changing it.  I think we should be careful
> >> >> about relying on zeroing for correctness though.  The bits are in
> >> >> principle undefined and we can't rely on reading zeros from
> >> >> equivalent memory or register values.
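
For readers who don't have the encoding in their head, here is a minimal
illustrative sketch of the idea under discussion (simplified names and
layout, not the tree.cc implementation): a vector constant is stored as
NPATTERNS interleaved patterns with up to three leading elements each, and
any index, in range or not, maps to a value by continuing its pattern.

    #include <cstdint>
    #include <vector>

    /* Illustrative only: element I of an encoded vector constant,
       obtained by continuing the pattern that I belongs to.  */
    struct encoded_vector
    {
      unsigned npatterns;          /* number of interleaved patterns */
      unsigned nelts_per_pattern;  /* 1 = splat, 2 = "rest equal", 3 = stepped */
      std::vector<int64_t> elts;   /* npatterns * nelts_per_pattern values */

      int64_t element (unsigned i) const
      {
        unsigned pattern = i % npatterns;
        unsigned pos = i / npatterns;   /* position within that pattern */
        if (pos < nelts_per_pattern)
          return elts[pos * npatterns + pattern];
        if (nelts_per_pattern == 3)
          {
            /* Stepped: continue the series defined by the pattern's
               last two encoded elements.  */
            int64_t a = elts[npatterns + pattern];
            int64_t b = elts[2 * npatterns + pattern];
            return b + (int64_t) (pos - 2) * (b - a);
          }
        /* Non-stepped: repeat the pattern's final encoded element.  */
        return elts[(nelts_per_pattern - 1) * npatterns + pattern];
      }
    };

So a reversal series { N-1, N-2, N-3 } keeps going to N-4, N-5, ... for
indices past the encoded elements, which is the "continue the existing
pattern" behaviour described above.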
> >> >
> >> > The main motivation for a change here is to allow catching out-of-bound
> >> > indices again for VECTOR_CST_ELT, at least for constant nunits because
> >> > it might be a programming error like fat-fingering the index.  I do
> >> > think it's a regression that we no longer catch those.
> >> >
> >> > It's probably also a bit non-obvious how an encoding continues and
> >> > there might be DImode masks that can be represented by a 
> >> > zero-extended QImode immediate but that, when "continued", would
> >> > require a larger immediate.
> >> >
> >> > The change also effectively only changes something for 1 byte
> >> > encodings since nunits is a power of two and so is the element
> >> > size in bits.
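
A quick hand-worked check of that claim: with both nunits and the element
precision powers of two, the total bit count is itself a power of two, so
it is either a multiple of 8 (no padding bits at all) or 1, 2 or 4 bits
(padding confined to a single byte).  E.g. a 4 x 1-bit mask uses 4 bits of
its 1-byte encoding, while 8 x 1-bit or 16 x 1-bit masks fill their bytes
exactly.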
> >> 
> >> Yeah, but even there, there's an argument that all-1s (0xff) is a more
> >> obvious value for an all-1s mask.
> >> 
> >> > A patch restoring the VECTOR_CST_ELT checking might be the
> >> > following
> >> >
> >> > diff --git a/gcc/tree.cc b/gcc/tree.cc
> >> > index 046a558d1b0..4c9b05167fd 100644
> >> > --- a/gcc/tree.cc
> >> > +++ b/gcc/tree.cc
> >> > @@ -10325,6 +10325,9 @@ vector_cst_elt (const_tree t, unsigned int i)
> >> >    if (i < encoded_nelts)
> >> >      return VECTOR_CST_ENCODED_ELT (t, i);
> >> >  
> >> > +  /* Catch out-of-bound element accesses.  */
> >> > +  gcc_checking_assert (maybe_gt (VECTOR_CST_NELTS (t), i));
> >> > +
> >> >    /* If there are no steps, the final encoded value is the right one.  */
> >> >    if (!VECTOR_CST_STEPPED_P (t))
> >> >      {
> >> >
> >> > but it triggers quite a bit via const_binop, for example
> >> >
> >> > #2  0x00000000011c1506 in const_binop (code=PLUS_EXPR, 
> >> >     arg1=<vector_cst 0x7ffff6f40b40>, arg2=<vector_cst 0x7ffff6e731e0>)
> >> > (gdb) p debug_generic_expr (arg1)
> >> > { 12, 13, 14, 15 }
> >> > $5 = void
> >> > (gdb) p debug_generic_expr (arg2)
> >> > { -2, -2, -2, -3 }
> >> > (gdb) p count
> >> > $4 = 6
> >> > (gdb) l
> >> > 1711          if (!elts.new_binary_operation (type, arg1, arg2, step_ok_p))
> >> > 1712            return NULL_TREE;
> >> > 1713          unsigned int count = elts.encoded_nelts ();
> >> > 1714          for (unsigned int i = 0; i < count; ++i)
> >> > 1715            {
> >> > 1716              tree elem1 = VECTOR_CST_ELT (arg1, i);
> >> > 1717              tree elem2 = VECTOR_CST_ELT (arg2, i);
> >> > 1718
> >> > 1719              tree elt = const_binop (code, elem1, elem2);
> >> >
> >> > this seems like an error to me - why would we, for fixed-size
> >> > vectors and for PLUS, ever create a vector encoding with 6 elements?!
> >> > That seems at least inefficient to me?
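
For what it's worth, the 6 can be hand-worked from how the operand
encodings would combine, if memory serves about the lcm/max rule for
binary operations (treat the arithmetic as an illustration, not a quote of
vector-builder.h): { 12, 13, 14, 15 } is one stepped pattern (3 encoded
elements) and { -2, -2, -2, -3 } is two patterns of two, so the result
gets lcm (1, 2) = 2 patterns times max (3, 2) = 3 elements per pattern,
i.e. 6 encoded elements for a 4-element vector.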
> >> 
> >> It's a case of picking your poison.  On the other side, operating
> >> individually on each element of a V64QI is inefficient when the
> >> representation says up-front that all elements are equal.
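
To make that concrete with the illustrative struct sketched earlier: a
V64QI splat encodes as one pattern with one element, so folding a binary
operation over the encoding touches one element instead of sixty-four.

    /* Hypothetical use of the sketch above, not real GCC code.  */
    encoded_vector splat = { 1, 1, { 7 } };
    /* splat.element (0) == splat.element (63) == 7 */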
> >
> > True, though I wonder why for VLS vectors new_binary_operation
> > doesn't cap the number of encoded elts at the fixed vector size,
> > like doing
> >
> >   encoded_elts = ordered_min (TYPE_VECTOR_SUBPARTS (..), encoded_elts);
> >
> > or, if there's no good way to write it so it applies to both VLA and
> > VLS, do it only when TYPE_VECTOR_SUBPARTS is constant.
> 
> ordered_min can't be used because there's no guarantee that encoded_elts
> and TYPE_VECTOR_SUBPARTS are well-ordered for the VLA case.  E.g.
> for a stepped (3-element) encoding and a length of 2+2X, the stepped
> encoding is longer for X==0 and the vector is longer for X>0.
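
Substituting values by hand to see the non-ordering (encoding length 3
vs. vector length 2 + 2X):

    X == 0:  2 + 2*0 = 2 < 3   (encoding longer than the vector)
    X == 1:  2 + 2*1 = 4 > 3   (vector longer than the encoding)

so neither length is known to be <= the other for every X, and an ordered
minimum of the two poly_ints isn't well defined.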
> 
> But yeah, in general, trying to enforce this for VLS would probably
> lead to a proliferation of more "if VLA do one thing, if VLS do some
> other thing".  The aim was to avoid that where it didn't seem strictly
> necessary.
> 
> >> Fundamentally, operations on VLA vectors are treated as functions
> >> that map patterns to patterns.  The number of elements that are
> >> consumed isn't really relevant to the function itself.  The VLA
> >> folders therefore rely on being able to read an element from a pattern
> >> even if the index is outside TYPE_VECTOR_SUBPARTS.
> >> 
> >> There were two reasons for using VLA paths for VLS vectors.
> >> One I mentioned above: it saves time when all elements are equal,
> >> or have a similarly compact representation.  The other is that it
> >> makes VLA less special and ensures that the code gets more testing.
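
A tiny hand-worked example of that pattern-to-pattern view: negating the
stepped encoding { 1, 2, 3 } gives { -1, -2, -3 } regardless of how many
elements the vector ends up having; the fold only needs enough leading
elements to pin down the result's pattern, never TYPE_VECTOR_SUBPARTS
itself.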
> >> 
> >> Maybe one compromise between that and the assert would be:
> >> 
> >> (1) enforce the assert only for VLS and
> >
> > that's what I did by using maybe_gt?
> >
> >> (2) add new checks to ensure that a VLA-friendly operation will never
> >>     read out-of-bounds for VLS vectors
> >> 
> >> But I think this would be awkward.  E.g. we now build reversed vectors
> >> as a 3-element series N-1, N-2, N-3.  It would be nice not to have
> >> to special-case N==2 by suppressing N-3.  And the condition for (2)
> >> might not always be obvious.
> >
> > If we know only N <= 2 is problematic then having must_eq ()
> > special cases there makes sense IMO.
> 
> I'm not sure it's just that.  E.g. a zip is { 0, N, 1, N+1, 2, N+2, ... },
> which would need to be curtailed for N==2 or N==4.
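
Hand-working the N == 2 / N == 4 cases makes the problem visible (reading
this as the N-element selector of a low-half zip over a 2N-element
concatenation, encoded as two stepped patterns of three elements each):

    N == 8:  { 0, 8, 1, 9, 2, 10, 3, 11 }   (8 elements, 6 encoded)
    N == 4:  { 0, 4, 1, 5 }                 (4 elements, 6 encoded)
    N == 2:  { 0, 2 }                       (2 elements, 6 encoded)

so for N == 2 and N == 4 the encoding already has more elements than the
selector itself, and a strict bound check on element reads would force the
series to be curtailed for just those sizes.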
> 
> >> Another option would be to have special accessors that are allowed
> >> to read out of bounds, and add the assert (for both VLA & VLS) to
> >> VECTOR_CST_ELT.  It might take a while to root out all the places that
> >> need to change though.
> >
> > Yes, that's another possibility.
> >
> > The point is (and we're seeing it now with the behavior of
> > encoding QImode bitmasks) that the behavior isn't entirely
> > obvious when VLS vectors are involved.  There's a memset zero
> > of the QImode so anything other than zero-padding is odd.
> 
> Which memset do you mean?  The one in:
> 
>       /* Zero the buffer and then set bits later where necessary.  */
>       int extract_bytes = MIN (len, total_bytes - off);
>       if (ptr)
>         memset (ptr, 0, extract_bytes);
> 
> ?  If so, that's there for elt_bits>1.  It means that the canonical
> form of a 2-bit active mask element is 01 rather than 11.  (And that's
> something that might end up needing to be target-configurable.  The current
> behaviour is guided by SVE, although it shouldn't matter much in practice
> there.)
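
As a sketch of that zero-then-set scheme for elt_bits == 2 (an
illustration with made-up names, not the native-encode code itself):

    #include <cstring>

    /* Illustrative only: encode an NLANES-element boolean mask with two
       bits per element.  The buffer is zeroed first, so an active lane
       ends up as the canonical 01 rather than 11, and inactive lanes and
       padding stay 0.  */
    void encode_mask_2bit (const bool *lane, unsigned nlanes, unsigned char *buf)
    {
      unsigned nbytes = (nlanes * 2 + 7) / 8;
      memset (buf, 0, nbytes);
      for (unsigned i = 0; i < nlanes; ++i)
        if (lane[i])
          buf[(i * 2) / 8] |= 1u << ((i * 2) % 8);  /* low bit of the lane */
    }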
> 
> > It's also not clear to me we'd get reliable sign extension
> > (we probably don't?).
> 
> For which case?  The idea isn't that we always sign-extend or always
> zero-extend, but that we repeat the pattern.  So an all-1s mask would
> give an all-1s QImode.  A 1010 mask would give a 10101010 QI, etc.
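
A sketch of that repetition, with a made-up helper name (not claiming this
is the routine actually used):

    /* Illustrative only: fill a QImode byte by repeating a
       PATTERN_BITS-wide boolean pattern, lowest lane first.  */
    unsigned char repeat_mask (unsigned pattern, unsigned pattern_bits)
    {
      unsigned char byte = 0;
      for (unsigned bit = 0; bit < 8; ++bit)
        if ((pattern >> (bit % pattern_bits)) & 1)
          byte |= 1u << bit;
      return byte;
    }
    /* repeat_mask (0b1010, 4) == 0xaa (binary 10101010);
       an all-1s 4-bit mask repeats to 0xff.  */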

But wouldn't 0 1 2 get you 0 1 2 3 4 5?  (not for bools of course)

Anyway, for QImode the efficiency argument is likely moot unless
some targets only have, say, 4-bit immediates, and even then they
probably always sign- or zero-extend rather than pad the 4 bits
out with a repeated copy of themselves ... so "repeating" isn't the
natural thing to do; zero- or sign-extending into the padding is.
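
For concreteness, padding the same 4-bit 1010 mask out to a QImode
immediate the three possible ways (hand-worked):

    zero-extended:  0000 1010  (0x0a)
    sign-extended:  1111 1010  (0xfa)
    repeated:       1010 1010  (0xaa)

and targets' immediate handling matches the first two, not the third.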

Richard.
