On Tue, 1 Aug 2023, Richard Sandiford wrote:

> Richard Sandiford <richard.sandif...@arm.com> writes:
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> The following makes sure to limit the shift operand when vectorizing
> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> >> operand otherwise invokes undefined behavior.  When we determine
> >> whether we can demote the operand we know we at most shift in the
> >> sign bit so we can adjust the shift amount.
> >>
> >> Note this has the possibility of un-CSEing common shift operands
> >> as there's no good way to share pattern stmts between patterns.
> >> We'd have to separately pattern recognize the definition.
> >>
> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >>
> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
> >> the fallback optimistic zero for out-of-range shifts is at least
> >> "correct".  Not sure we ever try to demote rotates (probably not).
> >
> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
I meant "correct" for the constant folding that evaluates out-of-bound
shifts as zero.

> Sorry for the multiple messages.  Wanted to get something out quickly
> because I wasn't sure how long it would take me to write this...
>
> On rotates, for:
>
> void
> foo (unsigned short *restrict ptr)
> {
>   for (int i = 0; i < 200; ++i)
>     {
>       unsigned int x = ptr[i] & 0xff0;
>       ptr[i] = (x << 1) | (x >> 31);
>     }
> }
>
> we do get:
>
>   can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
>
> although aarch64 doesn't provide rrotate patterns, so nothing actually
> comes of it.

I think it's still correct that we only need unsigned:13 for the input;
we know the other bits are zero.  But of course when actually applying
this as documented,

  /* Record that STMT_INFO could be changed from operating on TYPE to
     operating on a type with the precision and sign given by PRECISION
     and SIGN respectively.  */

the operation itself has to be altered (the above doesn't suggest
promoting/demoting the operands to TYPE is the only thing to do).
So it seems the burden is on the consumers of the information?

> I think the handling of variable shifts is flawed for other reasons.
> Given:
>
> void
> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
>     ptr1[i] = ptr1[i] >> ptr2[i];
> }
>
> void
> us (unsigned short *restrict ptr1, short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
>     ptr1[i] = ptr1[i] >> ptr2[i];
> }
>
> void
> su (short *restrict ptr1, unsigned short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
>     ptr1[i] = ptr1[i] >> ptr2[i];
> }
>
> void
> ss (short *restrict ptr1, short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
>     ptr1[i] = ptr1[i] >> ptr2[i];
> }
>
> we only narrow uu and ss, due to:
>
>   /* Ignore codes that don't take uniform arguments.  */
>   if (!types_compatible_p (TREE_TYPE (op), type))
>     return;

I suppose that's because we'd otherwise care about the range of the
shift operand here.  We could possibly use [0 .. precision-1] as the
known range for it and only give up if that doesn't fit 'type' (and
otherwise simply ignore the input range of the shift operand here).

> in vect_determine_precisions_from_range.  Maybe we should drop
> the shift handling from there and instead rely on
> vect_determine_precisions_from_users, extending:
>
>   if (TREE_CODE (shift) != INTEGER_CST
>       || !wi::ltu_p (wi::to_widest (shift), precision))
>     return;
>
> to handle ranges where the max is known to be < precision.
>
> There again, if masking is enough for right shifts and right rotates,
> maybe we should keep the current handling for them (with your fix)
> and skip the types_compatible_p check for those cases.

I think it should be enough for left shifts as well?  If we lshift out
bits, as in 0x100 << 9, so that the lhs range is [0, 0], the input
range from op0 will still make us use HImode.  I think we only ever
get overly conservative answers for left shifts from this function?
And whatever works for RROTATE should also work for LROTATE.

> So:
>
> - restrict shift handling in vect_determine_precisions_from_range to
>   RSHIFT_EXPR and RROTATE_EXPR
>
> - remove the types_compatible_p restriction for those cases
>
> - extend vect_determine_precisions_from_users shift handling to check
>   for ranges on the shift amount
>
> Does that sound right?

I'm not sure.  This all felt somewhat fragile when I looked closer
(I was hoping you would eventually tackle it from the older referenced
bug) ... so my main idea was to perform incremental changes where I
have test coverage (as with the new wrong-code bug).  Originally I
completely disabled shift support, but that regressed the over-widen
testcases a lot, which at least exercise widened shifts by constants
a lot.  x86 has vector rotates only for AMD XOP (which is dead) plus
some for V1TImode AFAICS, but I think we pattern-match rotates to
shifts, so maybe the precision stuff is interesting for the case where
we match the pattern rotate sequence for widenings?
So for the types_compatible_p issue, something along the following
lines?  We could also exempt the shift operand from being covered by
min_precision, so the consumer would have to make sure it can be
represented (I think that's never going to be an issue in practice
until we get 256-bit integers vectorized).  It will have to fix up
the shift operands anyway.

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index e4ab8c2d65b..cdeeaf98a47 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range (stmt_vec_info stmt_info, gassign *stmt)
 	}
       else if (TREE_CODE (op) == SSA_NAME)
 	{
-	  /* Ignore codes that don't take uniform arguments.  */
-	  if (!types_compatible_p (TREE_TYPE (op), type))
+	  /* Ignore codes that don't take uniform arguments.  For shifts
+	     the shift amount is known to be in-range.  */
+	  if (code == LSHIFT_EXPR
+	      || code == RSHIFT_EXPR
+	      || code == LROTATE_EXPR
+	      || code == RROTATE_EXPR)
+	    {
+	      min_value = wi::min (min_value, 0, sign);
+	      max_value = wi::max (max_value, TYPE_PRECISION (type), sign);
+	    }
+	  else if (types_compatible_p (TREE_TYPE (op), type))
+	    {
+	      wide_int op_min_value, op_max_value;
+	      if (!vect_get_range_info (op, &op_min_value, &op_max_value))
+		return;
+	      min_value = wi::min (min_value, op_min_value, sign);
+	      max_value = wi::max (max_value, op_max_value, sign);
+	    }
+	  else
 	    return;
-
-	  wide_int op_min_value, op_max_value;
-	  if (!vect_get_range_info (op, &op_min_value, &op_max_value))
-	    return;
-
-	  min_value = wi::min (min_value, op_min_value, sign);
-	  max_value = wi::max (max_value, op_max_value, sign);
 	}
       else
 	return;