On Wed, 2 Aug 2023, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Tue, 1 Aug 2023, Richard Sandiford wrote:
> >
> >> Richard Sandiford <richard.sandif...@arm.com> writes:
> >> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> The following makes sure to limit the shift operand when vectorizing
> >> >> (short)((int)x >> 31) via (short)x >> 31 as the out-of-bounds shift
> >> >> operand otherwise invokes undefined behavior.  When we determine
> >> >> whether we can demote the operand we know we at most shift in the
> >> >> sign bit so we can adjust the shift amount.
> >> >>
> >> >> Note this has the possibility of un-CSEing common shift operands
> >> >> as there's no good way to share pattern stmts between patterns.
> >> >> We'd have to separately pattern-recognize the definition.
> >> >>
> >> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >> >>
> >> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
> >> >> the fallback optimistic zero for out-of-range shifts is at least
> >> >> "correct".  Not sure we ever try to demote rotates (probably not).
> >> >
> >> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
> >> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
> >
> > I meant "correct" for the constant folding that evaluates out-of-bound
> > shifts as zero.
> >
> >> Sorry for the multiple messages.  Wanted to get something out quickly
> >> because I wasn't sure how long it would take me to write this...
> >>
> >> On rotates, for:
> >>
> >>   void
> >>   foo (unsigned short *restrict ptr)
> >>   {
> >>     for (int i = 0; i < 200; ++i)
> >>       {
> >>         unsigned int x = ptr[i] & 0xff0;
> >>         ptr[i] = (x << 1) | (x >> 31);
> >>       }
> >>   }
> >>
> >> we do get:
> >>
> >>   can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
> >>
> >> although aarch64 doesn't provide rrotate patterns, so nothing actually
> >> comes of it.
> >
> > I think it's still correct that we only need unsigned:13 for the input,
> > we know the other bits are zero.  But of course when actually applying
> > this as documented
> >
> >   /* Record that STMT_INFO could be changed from operating on TYPE to
> >      operating on a type with the precision and sign given by PRECISION
> >      and SIGN respectively.
> >
> > the operation itself has to be altered (the above doesn't suggest
> > promoting/demoting the operands to TYPE is the only thing to do).
> >
> > So it seems the burden is on the consumers of the information?
>
> Yeah, textually that seems fair.  Not sure I was thinking of it in
> those terms at the time though. :)
>
> >> I think the handling of variable shifts is flawed for other reasons.
> >> Given:
> >>
> >>   void
> >>   uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
> >>   {
> >>     for (int i = 0; i < 200; ++i)
> >>       ptr1[i] = ptr1[i] >> ptr2[i];
> >>   }
> >>
> >>   void
> >>   us (unsigned short *restrict ptr1, short *restrict ptr2)
> >>   {
> >>     for (int i = 0; i < 200; ++i)
> >>       ptr1[i] = ptr1[i] >> ptr2[i];
> >>   }
> >>
> >>   void
> >>   su (short *restrict ptr1, unsigned short *restrict ptr2)
> >>   {
> >>     for (int i = 0; i < 200; ++i)
> >>       ptr1[i] = ptr1[i] >> ptr2[i];
> >>   }
> >>
> >>   void
> >>   ss (short *restrict ptr1, short *restrict ptr2)
> >>   {
> >>     for (int i = 0; i < 200; ++i)
> >>       ptr1[i] = ptr1[i] >> ptr2[i];
> >>   }
> >>
> >> we only narrow uu and ss, due to:
> >>
> >>   /* Ignore codes that don't take uniform arguments.  */
> >>   if (!types_compatible_p (TREE_TYPE (op), type))
> >>     return;
> >
> > I suppose that's because we look at the shift operand at all here.
> > We could possibly use [0 .. precision-1] as the known range for it
> > and only give up if that doesn't fit 'type' (and otherwise simply
> > ignore the input range of the shift operands here).
> >
> >> in vect_determine_precisions_from_range.
> >> Maybe we should drop
> >> the shift handling from there and instead rely on
> >> vect_determine_precisions_from_users, extending:
> >>
> >>   if (TREE_CODE (shift) != INTEGER_CST
> >>       || !wi::ltu_p (wi::to_widest (shift), precision))
> >>     return;
> >>
> >> to handle ranges where the max is known to be < precision.
> >>
> >> There again, if masking is enough for right shifts and right rotates,
> >> maybe we should keep the current handling for them (with your fix)
> >> and skip the types_compatible_p check for those cases.
> >
> > I think it should be enough for left-shifts as well?  If we lshift
> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
> > op0 will still make us use HImode.  I think we only ever get overly
> > conservative answers for left-shifts from this function?
>
> But if we have:
>
>   short x, y;
>   int z = (int) x << (int) y;
>
> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
> whereas x << y would invoke UB and x << (y & 15) would be 1.
True, but we start with the range of the LHS, which in this case would
be of type 'int' and thus 1 << 16 and not zero.  You might call that
a failure of vect_determine_precisions_from_range of course, since it
makes it not exactly a forward propagation ...

> > Whatever works for RROTATE should also work for LROTATE.
>
> I think the same problem affects LROTATE.
>
> >> So:
> >>
> >> - restrict shift handling in vect_determine_precisions_from_range to
> >>   RSHIFT_EXPR and RROTATE_EXPR
> >>
> >> - remove the types_compatible_p restriction for those cases
> >>
> >> - extend vect_determine_precisions_from_users shift handling to check
> >>   for ranges on the shift amount
> >>
> >> Does that sound right?
> >
> > I'm not sure.  This all felt somewhat fragile when looking closer
> > (I was hoping you would eventually tackle it from the older
> > referenced bug) ... so my main idea was to perform incremental changes
> > where I have test coverage (as with the new wrong-code bug).
>
> Fair :)
>
> > Originally I completely disabled shift support but that regressed
> > the over-widen testcases a lot, which at least have widened shifts
> > by constants a lot.
> >
> > x86 has vector rotates only for AMD XOP (which is dead) plus
> > some for V1TImode AFAICS, but I think we pattern-match rotates
> > to shifts, so maybe the precision stuff is interesting for the
> > case where we match the pattern rotate sequence for widenings?
> >
> > So for the types_compatible_p issue, something along
> > the following?  We could also exempt the shift operand from
> > being covered by min_precision so the consumer would have
> > to make sure it can be represented (I think that's never going
> > to be an issue in practice until we get 256-bit integers vectorized).
> > It will have to fix up the shift operands anyway.
> >
> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > index e4ab8c2d65b..cdeeaf98a47 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range (stmt_vec_info stmt_info, gassign *stmt)
> >      }
> >    else if (TREE_CODE (op) == SSA_NAME)
> >      {
> > -      /* Ignore codes that don't take uniform arguments.  */
> > -      if (!types_compatible_p (TREE_TYPE (op), type))
> > +      /* Ignore codes that don't take uniform arguments.  For shifts
> > +         the shift amount is known to be in-range.  */
>
> I guess it's more "we can assume that the amount is in range"?

Yes.

> > +      if (code == LSHIFT_EXPR
> > +          || code == RSHIFT_EXPR
> > +          || code == LROTATE_EXPR
> > +          || code == RROTATE_EXPR)
> > +        {
> > +          min_value = wi::min (min_value, 0, sign);
> > +          max_value = wi::max (max_value, TYPE_PRECISION (type), sign);
>
> LGTM for right shifts.  Because of the above lshift thing, I think we
> need something like:
>
>   if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
>     {
>       wide_int op_min_value, op_max_value;
>       if (!vect_get_range_info (op, &op_min_value, &op_max_value))
>         return;
>
>       /* We can ignore left shifts by negative amounts, which are UB.  */
>       min_value = wi::min (min_value, 0, sign);
>
>       /* Make sure the highest non-UB shift amount doesn't become UB.  */
>       op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
>       auto mask = wi::mask (TYPE_PRECISION (type), false,
>                             op_max_value.to_uhwi ());
>       max_value = wi::max (max_value, mask, sign);
>     }
>
> Does that look right?

As said, it looks overly conservative to me?
For example, with my patch, for

  void
  foo (signed char *v, int s)
  {
    if (s < 1 || s > 7)
      return;
    for (int i = 0; i < 1024; ++i)
      v[i] = v[i] << s;
  }

I get

  t.c:5:21: note:   _7 has range [0xffffc000, 0x3f80]
  t.c:5:21: note:   can narrow to signed:15 without loss of precision: _7 = _6 << s_12(D);
  t.c:5:21: note:   only the low 15 bits of _6 are significant
  t.c:5:21: note:   _6 has range [0xffffff80, 0x7f]
  ...
  t.c:5:21: note:   vect_recog_over_widening_pattern: detected: _7 = _6 << s_12(D);
  t.c:5:21: note:   demoting int to signed short
  t.c:5:21: note:   Splitting statement: _6 = (int) _5;
  t.c:5:21: note:   into pattern statements: patt_24 = (signed short) _5;
  t.c:5:21: note:   and: patt_23 = (int) patt_24;
  t.c:5:21: note:   created pattern stmt: patt_21 = patt_24 << patt_22;
  t.c:5:21: note:   over_widening pattern recognized: patt_18 = (int) patt_21;

and eventually

  patt_22 = (signed short) s_12(D);
  ...
  vect__5.7_28 = MEM <vector(16) signed char> [(signed char *)vectp_v.5_10];
  vect_patt_24.8_29 = [vec_unpack_lo_expr] vect__5.7_28;
  vect_patt_24.8_30 = [vec_unpack_hi_expr] vect__5.7_28;
  vect_patt_21.9_31 = vect_patt_24.8_29 << patt_22;
  vect_patt_21.9_32 = vect_patt_24.8_30 << patt_22;
  vect_patt_17.10_33 = VEC_PACK_TRUNC_EXPR <vect_patt_21.9_31, vect_patt_21.9_32>;
  MEM <vector(16) signed char> [(signed char *)vectp_v.11_34] = vect_patt_17.10_33;

which I think is correct.  If I upped the range of the shift operand
then the range for _7 would be larger and we wouldn't be able to
narrow the operation.  I suppose _from_users might back-propagate
the truncation to signed char and enable that though?

> Thanks, and sorry for the scope creep.

Well, if we can agree that the patch as originally posted is correct
(but incomplete) we can maybe still install it.

As seen above, I wonder if there's really an existing problem for
LSHIFT_EXPR.  I'm sure we mishandle rotates, but again on the
consumer side - in principle we could analyze them as lshift + rshift,
which means the existing behavior is probably also correct.
Richard.

> Richard
>
> > +        }
> > +      else if (types_compatible_p (TREE_TYPE (op), type))
> > +        {
> > +          wide_int op_min_value, op_max_value;
> > +          if (!vect_get_range_info (op, &op_min_value, &op_max_value))
> > +            return;
> > +          min_value = wi::min (min_value, op_min_value, sign);
> > +          max_value = wi::max (max_value, op_max_value, sign);
> > +        }
> > +      else
> >          return;
> > -
> > -      wide_int op_min_value, op_max_value;
> > -      if (!vect_get_range_info (op, &op_min_value, &op_max_value))
> > -        return;
> > -
> > -      min_value = wi::min (min_value, op_min_value, sign);
> > -      max_value = wi::max (max_value, op_max_value, sign);
> >      }
> >    else
> >      return;