Re: [WIP PATCH] Re: Inefficient end-of-loop value computation - missed optimization somewhere?

Richard Guenther Mon, 12 Mar 2012 03:56:50 -0700

On Thu, Mar 8, 2012 at 3:29 PM, Ulrich Weigand <uweig...@de.ibm.com> wrote:
> Richard Guenther wrote:
>> On Tue, Feb 28, 2012 at 4:10 PM, Ulrich Weigand <uweig...@de.ibm.com> wrote:
>> > I'll still need to do proper testing and benchmarking, but I thought
>> > I'd post the patch anyway just as a heads-up ...
>> >
>> > Does this look reasonable to you?
>>
>> Yes, that looks reasonable.  Though instead of unsigned_type_for
>> please use build_nonstandard_integer_type instead (if the type
>> is not already unsigned).  unsigned_type_for does not necessarily
>> preserve type-precision and may even return NULL.
>
> OK, I've changed the patch to use build_nonstandard_integer_type, and it
> still fixes the original problem.  However, on further examination it
> turns out that this "fix" is not because the expression is actually
> simplified at tree level.  Rather, it is just that because of some
> minor reassociation (the +1 is moved to the end), the *RTL* optimizers
> now see a (just as complex but) slightly different RTL sequence, and
> therefore combine now just happens to be able to fold everything
> together (using the RTL-level simplify_plus_minus machinery).
>
> Even worse, the patch causes a number of regressions:
>
>> FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized "\\+" 0
>> FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized "n_. \\* n_." 
>> 1
>> FAIL: gcc.dg/vect/no-scevccp-noreassoc-outer-4.c scan-tree-dump-times vect 
>> "OUTER LOOP VECTORIZED." 1
>
>
> The loop-15.c problem is probably just a missed optimization
> elsewhere that can be fixed.  The problem is that a loop-end
> value used to be computed as "(n - 1) * n + n" which got
> simplified via fold_plusminus_mult_expr to "n * n".  When
> computing in unsigned, however, the expression becomes something
> along the lines of
>
>  "[...] * (unsigned) n + (unsigned) n"
>
> and the top-level of fold_binary_loc removes the *outermost* NOP_EXPRs
> resulting in
>
>  "[...] * (unsigned) n + n"
>
> which fold_plusminus_mult_expr no longer recognizes as something
> it can handle (since "(unsigned) n" is not operand_equal_p to "n").
>
> I think this can be fixed by having fold_plusminus_mult_expr call
> STRIP_NOPS on sub-operands, as is already done elsewhere in fold_binary_loc.
>
> This basically fixes the test case (due to some remaining casts, the
> scan-tree-dump patterns still don't match as-is, but the expected code
> is again generated).
>
>
> The no-scevccp-noreassoc-outer-4.c problem is more difficult, and would
> seem to point to a fundamental problem with performing the computation
> in unsigned mode.  In this case, the end value of an inner loop forms
> part of a scalar evolution of the outer loop.  When computing the end
> value in unsigned type, that scalar evolution is no longer detected.
> In part, this is because the scalar evolution machinery might need to
> be improved in handling casts.  In particular, in some places it only
> does STRIP_USELESS_TYPE_CONVERSION; changing those to STRIP_NOPS makes
> it detect the evolution again.  But I'm not fully convinced this is
> a safe change ...   In any case, this still doesn't make the outer
> loop vectorizable, since it is no longer provably finite.  Without
> the patch, this could be proven *because* the computation of the
> inner loop's end value used signed arithmetic which cannot overflow.
> By performing the computation in unsigned, this information is in
> fact lost.
>
>
> This would seem to indicate that unconditionally using unsigned
> arithmetic even if not strictly necessary might not always be
> the best solution either.  I've gone back and looked that at the
> original problem that
>
>  (start + 1) + (int)((unsigned int) ~start + (unsigned int) end)
>
> is not simplified by fold-const.
>
> Interestingly enough, it *is* valid to simplify this expression
> to just plain "end", even with signed arithmetic, since the
> transformation does not introduce any new potential overflows.
>
> This is actually recognized in fold_binary_loc, but only one special
> case.  See the code around the following comment:
>                  /* The only case we can still associate with two variables
>                     is if they are the same, modulo negation.  */
>
> Unfortunately, this handles only A and -A; it doesn't recognize that
> A+1 is in fact the negation of ~A ...
>
> There is also code in tree-ssa-forwprop.c:associate_plusminus that
> handles quite a number of special cases where re-association *is*
> allowed even for signed arithmetic:
>
>        (A +- B) - A       ->  +- B
>        (A +- B) -+ B      ->  A
>        (CST +- A) +- CST  ->  CST +- A
>        (A + CST) +- CST   ->  A + CST
>        ~A + A             ->  -1
>        ~A + 1             ->  -A
>        A - (A +- B)       ->  -+ B
>        A +- (B +- A)      ->  +- B
>        CST +- (CST +- A)  ->  CST +- A
>        CST +- (A +- CST)  ->  CST +- A
>        A + ~A             ->  -1
>
> This handles some cases involving ~, but still not quite the case
> we need for this expression.   In addition, the forward propagtion logic
> doesn't seem to handle casts very well (as opposed to fold-const, which
> makes liberal use of STRIP_NOPS).
>
>
> I'm wondering where to go from there:
>
> - Why are those special re-association cases handled only in forwprop,
>  and not in fold-cost?   I would have expected forwprop to just propagate
>  subexpressions into a tree and then use fold-const to actually simplify ...


We are moving away from GENERIC expression simplification towards
simplifications/combines on GIMPLE SSA (Andrew Pinski is working
on this a lot on his branch at the moment).

> - Should I try to improve forwprop to handle casts and additional re-
>  association cases until it handles the above expression?

Yes, ideally by trying to sub-divide this task into separate profitable
transforms.  Maybe Andrew can chime in here?

> - Or should the logic in fold-const be improved to add additional cases
>  of re-association?

I'd say no.

> Thanks for any further suggestions!
>
>
> I've attached some of the changes mentioned above for reference.
>
> Bye,
> Ulrich
>
>
> Patch: Compute loop-end value using unsigned type.
>
> Index: gcc/tree-scalar-evolution.c
> ===================================================================
> --- gcc/tree-scalar-evolution.c (revision 184625)
> +++ gcc/tree-scalar-evolution.c (working copy)
> @@ -474,11 +474,22 @@
>            return chrec_dont_know;
>          else
>            {
> +             tree type = TREE_TYPE (evolution_fn), utype = type;
>              tree res;
>
> +             /* Since the number of iterations is always unsigned, we perform
> +                the computation in an unsigned type (if feasible) to allow 
> for
> +                improved simplification of subexpressions.  */
> +             if ((INTEGRAL_TYPE_P (type) && !TYPE_UNSIGNED (type))
> +                 || POINTER_TYPE_P (type))
> +               utype = build_nonstandard_integer_type
> +                         (TYPE_PRECISION (type), 1);
> +
>              /* evolution_fn is the evolution function in LOOP.  Get
>                 its value in the nb_iter-th iteration.  */
> -             res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);
> +             res = chrec_convert (utype, evolution_fn, NULL);
> +             res = chrec_apply (inner_loop->num, res, nb_iter);
> +             res = chrec_convert (type, res, NULL);
>
>              if (chrec_contains_symbols_defined_in_loop (res, loop->num))
>                res = instantiate_parameters (loop, res);
>
>
>
> Patch: Use STRIP_NOPS on subexpressions in fold_plusminus_mult_expr
>
> Index: gcc/fold-const.c
> ===================================================================
> --- gcc/fold-const.c    (revision 184625)
> +++ gcc/fold-const.c    (working copy)
> @@ -7081,6 +7081,8 @@
>     {
>       arg00 = TREE_OPERAND (arg0, 0);
>       arg01 = TREE_OPERAND (arg0, 1);
> +      STRIP_NOPS (arg00);
> +      STRIP_NOPS (arg01);
>     }
>   else if (TREE_CODE (arg0) == INTEGER_CST)
>     {
> @@ -7099,6 +7101,8 @@
>     {
>       arg10 = TREE_OPERAND (arg1, 0);
>       arg11 = TREE_OPERAND (arg1, 1);
> +      STRIP_NOPS (arg10);
> +      STRIP_NOPS (arg11);
>     }
>   else if (TREE_CODE (arg1) == INTEGER_CST)
>     {
>
>
>
> Patch: Use STRIP_NOPS when following scalar evolutions
>
> Index: gcc/tree-scalar-evolution.c
> ===================================================================
> --- gcc/tree-scalar-evolution.c (revision 184625)
> +++ gcc/tree-scalar-evolution.c (working copy)
> @@ -1154,8 +1165,8 @@
>       rhs0 = TREE_OPERAND (expr, 0);
>       rhs1 = TREE_OPERAND (expr, 1);
>       type = TREE_TYPE (rhs0);
> -      STRIP_USELESS_TYPE_CONVERSION (rhs0);
> -      STRIP_USELESS_TYPE_CONVERSION (rhs1);
> +      STRIP_NOPS (rhs0);
> +      STRIP_NOPS (rhs1);
>       res = follow_ssa_edge_binary (loop, at_stmt, type, rhs0, code, rhs1,
>                                    halting_phi, evolution_of_loop, limit);
>       break;
> @@ -1168,8 +1179,8 @@
>          rhs0 = TREE_OPERAND (expr, 0);
>          rhs1 = TREE_OPERAND (expr, 1);
>          type = TREE_TYPE (rhs0);
> -         STRIP_USELESS_TYPE_CONVERSION (rhs0);
> -         STRIP_USELESS_TYPE_CONVERSION (rhs1);
> +         STRIP_NOPS (rhs0);
> +         STRIP_NOPS (rhs1);
>          res = follow_ssa_edge_binary (loop, at_stmt, type,
>                                        rhs0, POINTER_PLUS_EXPR, rhs1,
>                                        halting_phi, evolution_of_loop, limit);
>
>
>
> --
>  Dr. Ulrich Weigand
>  GNU Toolchain for Linux on System z and Cell BE
>  ulrich.weig...@de.ibm.com
>

Re: [WIP PATCH] Re: Inefficient end-of-loop value computation - missed optimization somewhere?

Reply via email to