Richard Biener <rguent...@suse.de> writes:
> On Wed, 14 Feb 2024, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > On Wed, 14 Feb 2024, Richard Sandiford wrote:
>> >
>> >> Richard Biener <rguent...@suse.de> writes:
>> >> > The following avoids accessing out-of-bound vector elements when
>> >> > native encoding a boolean vector with sub-BITS_PER_UNIT precision
>> >> > elements.  The error was basing the number of elements to extract
>> >> > on the rounded-up total byte size involved; the patch instead bases
>> >> > everything on the total number of elements to extract.
>> >> 
>> >> It's too long ago to be certain, but I think this was a deliberate choice.
>> >> The point of the new vector constant encoding is that it can give an
>> >> allegedly sensible value for any given index, even out-of-range ones.
>> >> 
>> >> Since the padding bits are undefined, we should in principle have a free
>> >> choice of what to use.  And for VLA, it's often better to continue the
>> >> existing pattern rather than force to zero.
>> >> 
>> >> I don't strongly object to changing it.  I think we should be careful
>> >> about relying on zeroing for correctness though.  The bits are in
>> >> principle undefined and we can't rely on reading zeros from
>> >> equivalent memory or register values.
>> >
>> > The main motivation for a change here is to allow catching out-of-bound
>> > indices again for VECTOR_CST_ELT, at least for constant nunits because
>> > it might be a programming error like fat-fingering the index.  I do
>> > think it's a regression that we no longer catch those.
>> >
>> > It's probably also a bit non-obvious how an encoding continues,
>> > and there might be DImode masks that can be represented by a
>> > zero-extended QImode immediate but that, when "continued", would
>> > require a larger immediate.
>> >
>> > The change also effectively only changes something for 1 byte
>> > encodings since nunits is a power of two and so is the element
>> > size in bits.
>> 
>> Yeah, but even there, there's an argument that all-1s (0xff) is a more
>> obvious value for an all-1s mask.
>> 
>> > A patch restoring the VECTOR_CST_ELT checking might be the
>> > following
>> >
>> > diff --git a/gcc/tree.cc b/gcc/tree.cc
>> > index 046a558d1b0..4c9b05167fd 100644
>> > --- a/gcc/tree.cc
>> > +++ b/gcc/tree.cc
>> > @@ -10325,6 +10325,9 @@ vector_cst_elt (const_tree t, unsigned int i)
>> >    if (i < encoded_nelts)
>> >      return VECTOR_CST_ENCODED_ELT (t, i);
>> >  
>> > +  /* Catch out-of-bound element accesses.  */
>> > +  gcc_checking_assert (maybe_gt (VECTOR_CST_NELTS (t), i));
>> > +
>> >    /* If there are no steps, the final encoded value is the right one.  */
>> >    if (!VECTOR_CST_STEPPED_P (t))
>> >      {
>> >
>> > but it triggers quite a bit via const_binop for, for example
>> >
>> > #2  0x00000000011c1506 in const_binop (code=PLUS_EXPR, 
>> >     arg1=<vector_cst 0x7ffff6f40b40>, arg2=<vector_cst 0x7ffff6e731e0>)
>> > (gdb) p debug_generic_expr (arg1)
>> > { 12, 13, 14, 15 }
>> > $5 = void
>> > (gdb) p debug_generic_expr (arg2)
>> > { -2, -2, -2, -3 }
>> > (gdb) p count
>> > $4 = 6
>> > (gdb) l
>> > 1711          if (!elts.new_binary_operation (type, arg1, arg2, step_ok_p))
>> > 1712            return NULL_TREE;
>> > 1713          unsigned int count = elts.encoded_nelts ();
>> > 1714          for (unsigned int i = 0; i < count; ++i)
>> > 1715            {
>> > 1716              tree elem1 = VECTOR_CST_ELT (arg1, i);
>> > 1717              tree elem2 = VECTOR_CST_ELT (arg2, i);
>> > 1718
>> > 1719              tree elt = const_binop (code, elem1, elem2);
>> >
>> > this seems like an error to me - why would we, for fixed-size
>> > vectors and for PLUS, ever create a vector encoding with 6 elements?!
>> > That seems at least inefficient to me?
>> 
>> It's a case of picking your poison.  On the other side, operating
>> individually on each element of a V64QI is inefficient when the
>> representation says up-front that all elements are equal.
>
> True, though I wonder why, for VLS vectors, new_binary_operation
> doesn't cap the number of encoded elts at the fixed vector size,
> like doing
>
>   encoded_elts = ordered_min (TYPE_VECTOR_SUBPARTS (..), encoded_elts);
>
> or, if there's no good way to write it so that it applies to both
> VLA and VLS, do it only when TYPE_VECTOR_SUBPARTS is constant.

ordered_min can't be used because there's no guarantee that encoded_elts
and TYPE_VECTOR_SUBPARTS are well-ordered for the VLA case.  E.g.
for a stepped (3-element) encoding and a length of 2+2X, the stepped
encoding is longer for X==0 and the vector is longer for X>0.
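
To make the failure concrete in poly_int terms (a sketch only, assuming
a target with NUM_POLY_INT_COEFFS == 2, such as AArch64):

  /* encoded_nelts () for a stepped encoding.  */
  unsigned int encoded = 3;
  /* A VLA element count of 2 + 2X.  */
  poly_uint64 subparts (2, 2);
  /* Neither value is known_le the other for all X: X == 0 gives
     subparts == 2 (< 3), X == 1 gives subparts == 4 (> 3), so
     ordered_min would assert here.  */
  gcc_checking_assert (!ordered_p (subparts, encoded));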

But yeah, in general, trying to enforce this for VLS would probably
lead to a proliferation of more "if VLA do one thing, if VLS do some
other thing".  The aim was to avoid that where it didn't seem strictly
necessary.

>> Fundamentally, operations on VLA vectors are treated as functions
>> that map patterns to patterns.  The number of elements that are
>> consumed isn't really relevant to the function itself.  The VLA
>> folders therefore rely on being able to read an element from a pattern
>> even if the index is outside TYPE_VECTOR_SUBPARTS.
>> 
>> There were two reasons for using VLA paths for VLS vectors.
>> One I mentioned above: it saves time when all elements are equal,
>> or have a similarly compact representation.  The other is that it
>> makes VLA less special and ensures that the code gets more testing.
>> 
>> Maybe one compromise between that and the assert would be:
>> 
>> (1) enforce the assert only for VLS and
>
> that's what I did by using maybe_gt?
>
>> (2) add new checks to ensure that a VLA-friendly operation will never
>>     read out-of-bounds for VLS vectors
>> 
>> But I think this would be awkward.  E.g. we now build reversed vectors
>> as a 3-element series N-1, N-2, N-3.  It would be nice not to have
>> to special-case N==2 by suppressing N-3.  And the condition for (2)
>> might not always be obvious.
>
> If we know only N <= 2 is problematic then having must_eq ()
> special cases there makes sense IMO.

I'm not sure it's just that.  E.g. a zip is { 0, N, 1, N+1, 2, N+2, ... },
which would need to be curtailed for N==2 or N==4.
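
For concreteness: the natural encoding of that zip uses npatterns == 2
and nelts_per_pattern == 3, i.e. the six encoded elements
{ 0, N, 1, N+1, 2, N+2 }.  For N == 2 the result vector only has
4 elements, so the last two encoded elements are exactly the kind of
out-of-range read the assert would flag - the same situation as the
count == 6 const_binop example above.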

>> Another option would be to have special accessors that are allowed
>> to read out of bounds, and add the assert (for both VLA & VLS) to
>> VECTOR_CST_ELT.  It might take a while to root out all the places that
>> need to change though.
>
> Yes, that's another possibility.
>
> The point is (and we're seeing it now with the behavior of
> encoding QImode bitmasks) that the behavior isn't entirely
> obvious when VLS vectors are involved.  There's a memset zero
> of the QImode so anything other than zero-padding is odd.

Which memset do you mean?  The one in:

      /* Zero the buffer and then set bits later where necessary.  */
      int extract_bytes = MIN (len, total_bytes - off);
      if (ptr)
        memset (ptr, 0, extract_bytes);

?  If so, that's there for elt_bits > 1.  It means that the canonical
form of a 2-bit active mask element is 01 rather than 11.  (And that's
something that might end up needing to be target-configurable.  The current
behaviour is guided by SVE, although it shouldn't matter much in practice
there.)
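
A minimal sketch of that convention (not the actual native_encode_vector
code; the names are made up for illustration):

  #include <string.h>
  #include <stdbool.h>

  /* Zero the buffer, then set only the low bit of each elt_bits-wide
     lane for active elements, so an active 2-bit lane encodes as 01
     rather than 11.  */
  static void
  encode_bool_vector (unsigned char *ptr, const bool *active,
                      unsigned int nelts, unsigned int elt_bits)
  {
    memset (ptr, 0, (nelts * elt_bits + 7) / 8);
    for (unsigned int i = 0; i < nelts; ++i)
      if (active[i])
        ptr[(i * elt_bits) / 8] |= 1u << ((i * elt_bits) % 8);
  }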

> It's also not clear to me we'd get reliable sign extension
> (we probably don't?).

For which case?  The idea isn't that we always sign-extend or always
zero-extend, but that we repeat the pattern.  So an all-1s mask would
give an all-1s QImode.  A 1010 mask would give a 10101010 QI, etc.
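
As a sketch of that repetition rule (illustrative only, not GCC code):

  /* Repeat a pattern_bits-wide mask pattern to fill a byte, e.g.
     repeat_pattern (0xf, 4) == 0xff (all-1s stays all-1s) and
     repeat_pattern (0xa, 4) == 0xaa (1010 becomes 10101010).  */
  static unsigned char
  repeat_pattern (unsigned int pattern, unsigned int pattern_bits)
  {
    unsigned char byte = 0;
    for (unsigned int bit = 0; bit < 8; ++bit)
      byte |= ((pattern >> (bit % pattern_bits)) & 1) << bit;
    return byte;
  }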

Thanks,
Richard
