Inefficient end-of-loop value computation - missed optimization somewhere?

Ulrich Weigand Mon, 20 Feb 2012 14:20:17 -0800

Hello,

we've noticed that the loop optimizer sometimes inserts weirdly
inefficient code to compute the value of an induction variable
at the end of the loop.


As a test case stripped down to the core of the problem, consider:

int test (int start, int end)
{
  int i;

  for (i = start; i < end; i++)
    ;

  return i;
}

That function really ought to be equivalent to 

int test2 (int start, int end)
{
  return start < end ? end : start;
}

And indeed on i386-linux-gnu, we get mostly identical code
(except for the branch direction prediction).

However, on arm-linux-gnuabi (using -mthumb and -march=armv7-a),
we see for the first function:

test:
        cmp     r0, r1
        bge     .L2
        adds    r3, r0, #1
        mvns    r0, r0
        adds    r1, r1, r0
        adds    r0, r3, r1
.L2:
        bx      lr

instead of simply (as for the second function):

test2:
        cmp     r1, r0
        it      ge
        movge   r0, r1
        bx      lr



Where does that weird addition-and-negation sequence come from?
In 100t.dceloop we still have:

<bb 4>:
  # i_9 = PHI <i_5(5), start_2(D)(3)>
  i_5 = i_9 + 1;
  if (end_4(D) > i_5)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

<bb 6>:
  # i_1 = PHI <i_5(4)>


But 102t.sccp has:

<bb 4>:
  # i_9 = PHI <i_5(5), start_2(D)(3)>
  i_5 = i_9 + 1;
  if (end_4(D) > i_5)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

<bb 6>:
  D.4123_10 = start_2(D) + 1;
  D.4124_11 = ~start_2(D);
  D.4125_12 = (unsigned int) D.4124_11;
  D.4126_13 = (unsigned int) end_4(D);
  D.4127_14 = D.4125_12 + D.4126_13;
  D.4128_15 = (int) D.4127_14;
  i_1 = D.4123_10 + D.4128_15;

This is the sequence that ends up in the final assembler code.

Note that this computes:

  i_1 = (start_2(D) + 1)
         + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))

which is (since those casts are no-ops on the assembler level):

  i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)

which is due to two's-complement arithmetic:

  i_1 = start_2(D) - start_2(D) + end_4(D)

that is simply:

  i_1 = end_4(D)


Unfortunately, no tree-based optimization pass after 102t.sccp is able to
clean up that mess.  This is true even on *i386*: the only reason we get
nice assembler there is due to *RTX* optimization, notably in combine.
This hits on i386 because of an intermediate stage that adds two registers
and the constant 1 (which matches the lea pattern).  On arm, there is no
instruction that would handle that intermediate stage, and combine is not
powerful enough to perform the whole simplification in a single step.


Now that sequence of gimple operations is generated in scev_const_prop
in tree-scalar-evolution.c.  First, the phi node corresponding to the
loop exit value is evaluated:

  def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);

which returns a chrec of the form

  { 1, +, (start + 1) }

This is then further simplified by

  def = compute_overall_effect_of_inner_loop (ex_loop, def);

which first computes the number of loop iterations:

  tree nb_iter = number_of_latch_executions (loop);

which returns a tree representing

  (unsigned int) ~start + (unsigned int) end

(which at this point seems to be the validly most efficient way to
compute end - start - 1).

The chrec is then evaluated via chrec_apply which basically computes

  (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)

which ends up being converted to the long gimple sequence seen above.


Now I guess my questions are:

- In the computation shown above, should that tree have been folded
  into just "end" to start with?  The tree is constructed at its
  outermost level in chrec_fold_plus, which simply calls fold_build2.
  While simplify-rtx.c has code to detect sequences of operations
  that cancel out while folding a PLUS or MINUS RTX, there does
  not appear to be corresponding code in fold-const.c to handle
  the equivalent optimization at the tree level.

- If not, should there be another tree pass later on that ought to
  clean this up?  I notice there is code to handle plus/minus
  sequences in tree-ssa-reassoc.c.  But this doesn't seem to be
  able to handle this particular case, possibly because the presence
  of the casts (nop_exprs) ...

- Anywhere else I'm overlooking right now?


I'd appreciate any tips / suggestions how to fix this ...


Thanks,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com

Inefficient end-of-loop value computation - missed optimization somewhere?

Reply via email to