Tamar Christina <tamar.christ...@arm.com> writes:
>> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >        return pattern_stmt;
>> >      }
>> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> > -     && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
>> > -                                                    wi::to_wide (cst),
>> > -                                                    NULL, NULL_RTX,
>> > -                                                    NULL_RTX))
>> > +     && TYPE_UNSIGNED (itype)
>> > +     && rhs_code == TRUNC_DIV_EXPR
>> > +     && vectype
>> > +     && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> > +                                        OPTIMIZE_FOR_SPEED))
>> >      {
>> > -      return NULL;
>> > +      /* Div optimizations using narrowings:
>> > +       we can do the division of e.g. shorts by 255 faster by calculating it as
>> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> > +       double the precision of x.
>> > +
>> > +       If we imagine a short as being composed of two blocks of bytes then
>> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> > +       adding 1 to each sub component:
>> > +
>> > +      short value of 16-bits
>> > +       ┌──────────────┬────────────────┐
>> > +       │              │                │
>> > +       └──────────────┴────────────────┘
>> > +   8-bit part1 ▲  8-bit part2   ▲
>> > +               │                │
>> > +               │                │
>> > +              +1               +1
>> > +
>> > +       After the first addition we have to shift right by 8 and narrow the
>> > +       results back to a byte.  Remember that the addition must be done in
>> > +       double the precision of the input.  However, if we know that the addition
>> > +       `x + 257` does not overflow then we can do the operation in the current
>> > +       precision, in which case we don't need the packs and unpacks.  */
>> > +      auto wcst = wi::to_wide (cst);
>> > +      int pow = wi::exact_log2 (wcst + 1);
>> > +      if (pow == (int) (element_precision (vectype) / 2))
>> > +  {
>> > +    wide_int min, max;
>> > +    /* If we're in a pattern we need to find the original definition.  */
>> > +    tree op0 = oprnd0;
>> > +    gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> > +    stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> > +    if (is_pattern_stmt_p (stmt_info))
>> > +      {
>> > +        auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> > +        if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> > +          op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> > +      }
>> 
>> If this is generally safe (I'm skipping thinking about it in the
>> interests of a quick review :-)), then I think it should be done in
>> vect_get_range_info instead.  Using gimple_get_lhs would be more
>> general than handling just assignments.
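
(As an aside, the identity in the comment above is cheap to sanity-check
exhaustively.  A minimal standalone sketch, independent of the patch,
verifying the double-precision form for every 16-bit input:

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    for (uint32_t x = 0; x <= 0xffff; x++)
      {
        /* Done in 32 bits, i.e. double the precision of the 16-bit input,
           since x + 257 can reach 0x10100 and so overflows 16 bits.  */
        uint32_t q = (x + ((x + 257) >> 8)) >> 8;
        assert (q == x / 255);
      }
    return 0;
  }

Note that x + 257 peaks at 0x10100, which is exactly why the overflow check
below is needed before the addition can be done in the narrower precision.)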
>> 
>> > +
>> > +    /* Check that no overflow will occur.  If we don't have range
>> > +       information we can't perform the optimization.  */
>> > +    if (vect_get_range_info (op0, &min, &max))
>> > +      {
>> > +        wide_int one = wi::to_wide (build_one_cst (itype));
>> > +        wide_int adder = wi::add (one, wi::lshift (one, pow));
>> > +        wi::overflow_type ovf;
>> > +        /* We need adder and max in the same precision.  */
>> > +        wide_int zadder
>> > +          = wide_int_storage::from (adder, wi::get_precision (max),
>> > +                                    UNSIGNED);
>> > +        wi::add (max, zadder, UNSIGNED, &ovf);
>> 
>> Could you explain this a bit more?  When do we have mismatched
>> precisions?
>
> C promotion rules will promote e.g.
>
> void fun2(uint8_t* restrict pixel, uint8_t level, int n)
> {
>   for (int i = 0; i < n; i+=1)
>     pixel[i] = (pixel[i] + level) / 0xff;
> }
>
> And have the addition be done as a 32-bit integer.  The vectorizer will
> demote this down to a short, but range information is not stored for
> patterns.  So in the above the range will correctly be 0x1fe, but the
> precision will be that of the original expression, so 32.  This will be
> a mismatch with itype, which is derived from the size the vectorizer
> will perform the operation in.
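
(Modelling the mismatch with concrete scalar values -- the numbers below are
taken from the example above, and plain integer types stand in for the
wide_int arithmetic in the patch:

  #include <stdint.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Range info comes from the original 32-bit expression...  */
    uint32_t max = 0x1fe;            /* pixel[i] + level is at most 0xff + 0xff.  */
    /* ...but the demoted operation is done in 16 bits (itype).  */
    uint32_t adder = 1 + (1u << 8);  /* 0x101, the constant the trick adds.  */

    /* max + 0x101 = 0x2ff fits in 16 bits, so the addition can be done in
       the narrow precision without the pack/unpack sequence.  */
    printf ("%s\n", max + adder <= UINT16_MAX ? "narrow ok" : "needs widening");
    return 0;
  }

so extending adder to the precision of max is just what makes the two
wide_ints comparable.)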

Gah, missed this first time round, sorry.

Richi would know better than me, but I think it's dangerous to rely on
the orig/pattern link for range information.  The end result of a pattern
(vect_stmt_to_vectorize) has to have the same type as the lhs of the
original statement.  But the other statements in the pattern sequence
can do arbitrary things.  Their range isn't predictable from the range
of the original statement result.

IIRC, the addition above is converted to:

  a' = (uint16_t) pixel[i]
  b' = (uint16_t) level
  sum' = a' + b'
  sum = (int) sum'

where sum is the direct replacement of "pixel[i] + level", with the
same type and range.  The division then uses sum' instead of sum.

But the fact that sum' is part of the same pattern as sum doesn't
guarantee that sum' has the same range as sum.  E.g. the pattern
statements added by the division optimisation wouldn't have this
property.
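
Concretely, using the division pattern itself: a scalar sketch of the
statements it would add (hypothetical, just to illustrate the ranges):

  #include <stdint.h>

  uint32_t
  div255 (uint16_t x)
  {
    /* Intermediate pattern statements: sum has range [0x101, 0x10100],
       well outside the range of the final result.  */
    uint32_t sum = (uint32_t) x + 257;
    uint32_t t = sum >> 8;              /* range [1, 0x101].  */
    /* Final pattern result, the direct replacement of x / 255:
       range [0, 0x101].  */
    return ((uint32_t) x + t) >> 8;
  }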

Is it possible to tell ranger to compute the range of expressions that
haven't been added to the IL?  (Genuine question, haven't looked.
It seems pretty powerful though.)
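
(One possible direction, sketched purely from memory of the range-op
interface rather than anything verified -- the names and signatures here
are assumptions, not a recipe:

  /* Fold a range for op0 + oprnd1 without the addition existing
     in the IL.  */
  int_range_max r, r0, r1;
  range_op_handler handler (PLUS_EXPR, itype);
  if (handler
      && ranger->range_of_expr (r0, op0)
      && ranger->range_of_expr (r1, oprnd1)
      && handler.fold_range (r, itype, r0, r1))
    /* r now bounds op0 + oprnd1.  */;

but that would need checking against the actual ranger interface.)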

Thanks,
Richard
