Richard Biener <rguent...@suse.de> writes:
> [...]
>> >> in vect_determine_precisions_from_range.  Maybe we should drop
>> >> the shift handling from there and instead rely on
>> >> vect_determine_precisions_from_users, extending:
>> >>
>> >>       if (TREE_CODE (shift) != INTEGER_CST
>> >>           || !wi::ltu_p (wi::to_widest (shift), precision))
>> >>         return;
>> >>
>> >> to handle ranges where the max is known to be < precision.
>> >>
>> >> There again, if masking is enough for right shifts and right rotates,
>> >> maybe we should keep the current handling for them (with your fix)
>> >> and skip the types_compatible_p check for those cases.
>> >
>> > I think it should be enough for left-shifts as well?  If we shift
>> > out all the bits, like 0x100 << 9, so the lhs range is [0,0], the
>> > input range from op0 will still make us use HImode.  I think we only
>> > ever get overly conservative answers for left-shifts from this
>> > function?
>>
>> But if we have:
>>
>>   short x, y;
>>   int z = (int) x << (int) y;
>>
>> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
>> whereas x << y would invoke UB and x << (y & 15) would be 1.
>
> True, but we start with the range of the LHS, which in this case
> would be of type 'int' and thus 1 << 16 and not zero.  You
> might call that a failure of vect_determine_precisions_from_range,
> of course, since it makes it not exactly a forward propagation ...
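[Editorial aside, not part of the original exchange: a standalone C
testcase making the example above concrete.  The values match the
scenario discussed, x == 1 and y == 16 with 32-bit int.]

  #include <stdio.h>

  int
  main (void)
  {
    short x = 1, y = 16;
    /* Well-defined: both operands promote to int, so the shift amount
       16 is less than the precision (32) of the shifted type.  */
    int z = (int) x << (int) y;
    /* Truncating the result back to short gives 0 ...  */
    printf ("%d\n", (short) z);         /* prints 0 */
    /* ... whereas doing the shift in 16 bits would be UB for x << y
       (amount == precision), and masking the amount gives 1 instead,
       so a narrowed left shift can't simply mask the shift count.  */
    printf ("%d\n", x << (y & 15));     /* prints 1 */
    return 0;
  }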
Ah, right, sorry.  I should have done more checking.

> [...]
>> > Originally I completely disabled shift support, but that regressed
>> > the over-widen testcases a lot, since they make heavy use of widened
>> > shifts by constants.
>> >
>> > x86 has vector rotates only for AMD XOP (which is dead) plus
>> > some for V1TImode AFAICS, but I think we pattern-match rotates
>> > to shifts, so maybe the precision stuff is interesting for the
>> > case where we match the pattern rotate sequence for widenings?
>> >
>> > So for the types_compatible_p issue, something along the lines of
>> > the following?  We could also exempt the shift operand from
>> > being covered by min_precision, so the consumer would have
>> > to make sure it can be represented (I think that's never going
>> > to be an issue in practice until we get 256-bit integers vectorized).
>> > It will have to fix up the shift operands anyway.
>> >
>> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> > index e4ab8c2d65b..cdeeaf98a47 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range (stmt_vec_info stmt_info, gassign *stmt)
>> >          }
>> >        else if (TREE_CODE (op) == SSA_NAME)
>> >          {
>> > -          /* Ignore codes that don't take uniform arguments.  */
>> > -          if (!types_compatible_p (TREE_TYPE (op), type))
>> > +          /* Ignore codes that don't take uniform arguments.  For shifts
>> > +             the shift amount is known to be in-range.  */
>>
>> I guess it's more "we can assume that the amount is in range"?
>
> Yes.
>
>> > +          if (code == LSHIFT_EXPR
>> > +              || code == RSHIFT_EXPR
>> > +              || code == LROTATE_EXPR
>> > +              || code == RROTATE_EXPR)
>> > +            {
>> > +              min_value = wi::min (min_value, 0, sign);
>> > +              max_value = wi::max (max_value, TYPE_PRECISION (type), sign);
>>
>> LGTM for right shifts.  Because of the above lshift thing, I think we
>> need something like:
>>
>>   if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
>>     {
>>       wide_int op_min_value, op_max_value;
>>       if (!vect_get_range_info (op, &op_min_value, &op_max_value))
>>         return;
>>
>>       /* We can ignore left shifts by negative amounts, which are UB.  */
>>       min_value = wi::min (min_value, 0, sign);
>>
>>       /* Make sure the highest non-UB shift amount doesn't become UB.  */
>>       op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
>>       auto mask = wi::mask (TYPE_PRECISION (type), false,
>>                             op_max_value.to_uhwi ());
>>       max_value = wi::max (max_value, mask, sign);
>>     }
>>
>> Does that look right?
>
> As said, it looks overly conservative to me?  For example, with my
> patch, for
>
> void foo (signed char *v, int s)
> {
>   if (s < 1 || s > 7)
>     return;
>   for (int i = 0; i < 1024; ++i)
>     v[i] = v[i] << s;
> }
>
> I get
>
> t.c:5:21: note: _7 has range [0xffffc000, 0x3f80]
> t.c:5:21: note: can narrow to signed:15 without loss of precision: _7 = _6 << s_12(D);
> t.c:5:21: note: only the low 15 bits of _6 are significant
> t.c:5:21: note: _6 has range [0xffffff80, 0x7f]
> ...
> t.c:5:21: note: vect_recog_over_widening_pattern: detected: _7 = _6 << s_12(D);
> t.c:5:21: note: demoting int to signed short
> t.c:5:21: note: Splitting statement: _6 = (int) _5;
> t.c:5:21: note: into pattern statements: patt_24 = (signed short) _5;
> t.c:5:21: note: and: patt_23 = (int) patt_24;
> t.c:5:21: note: created pattern stmt: patt_21 = patt_24 << patt_22;
> t.c:5:21: note: over_widening pattern recognized: patt_18 = (int) patt_21;
>
> and eventually
>
>   patt_22 = (signed short) s_12(D);
> ...
>   vect__5.7_28 = MEM <vector(16) signed char> [(signed char *)vectp_v.5_10];
>   vect_patt_24.8_29 = [vec_unpack_lo_expr] vect__5.7_28;
>   vect_patt_24.8_30 = [vec_unpack_hi_expr] vect__5.7_28;
>   vect_patt_21.9_31 = vect_patt_24.8_29 << patt_22;
>   vect_patt_21.9_32 = vect_patt_24.8_30 << patt_22;
>   vect_patt_17.10_33 = VEC_PACK_TRUNC_EXPR <vect_patt_21.9_31, vect_patt_21.9_32>;
>   MEM <vector(16) signed char> [(signed char *)vectp_v.11_34] = vect_patt_17.10_33;
>
> which I think is correct.

Agreed.

> If I upped the range of the shift operand, then the range for _7
> would be larger and we wouldn't be able to shrink the range.  I
> suppose the _from_users side might back-propagate the truncation to
> signed char and enable that, though?

Yeah, that's the mechanism I was getting confused with, but like you
say, it isn't a factor for vect_determine_precisions_from_range.

>> Thanks, and sorry for the scope creep.
>
> Well, if we can agree that the patch as originally posted is correct
> (but incomplete), we can maybe still install it.

Yeah, works for me.

> As seen above, I wonder if there's really an existing problem
> for LSHIFT_EXPR.

No, maybe not.  For:

void foo (int *v, int s)
{
  for (int i = 0; i < 1024; ++i)
    v[i] = 2 << ((v[i] & 3) * 8 + 7);
}

we get:

note: _8 has range [0x0, 0x1010100]
note: _7 has range [0x7, 0x1f]
note: can narrow to unsigned:25 without loss of precision: _8 = 2 << _7;

which isn't true, but I don't think anything will come of it.

With a bit of work, we could in principle realise that:

void foo (short *v, int s)
{
  for (int i = 0; i < 1024; ++i)
    v[i] = 2 << (((v[i] & 3) - 1) & 31);
}

gives a shift range of [0, 2] ∪ {31}, and so the shift result is
either 0, 2, 4 or 8 (range [0, 8]), but we don't compute that result
range at present.

So yeah, I still can't trigger an actual problem.

> I'm sure we mishandle rotates, but again at the consumer side - in
> principle we could analyze them as lshift + rshift, which means the
> existing behavior is probably also correct.

Yeah, please go ahead with the original patch.  And sorry for the
distraction.  It's been a useful exercise for me at least...

Richard
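[Editorial appendix, not part of the original exchange: the last quoted
point — that a rotate can be analyzed as lshift + rshift — in code form.
A minimal sketch assuming a 16-bit element type; rotl16 is a
hypothetical helper, not anything from the patch.]

  #include <stdint.h>

  /* Rotate x left by s.  Masking both shift amounts with 15 keeps each
     individual shift in range even for s == 0, mirroring how a rotate
     can be treated as the OR of two in-range shifts.  */
  static inline uint16_t
  rotl16 (uint16_t x, unsigned int s)
  {
    return (uint16_t) ((x << (s & 15)) | (x >> ((16 - s) & 15)));
  }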