Hello, we've noticed that the loop optimizer sometimes inserts weirdly inefficient code to compute the value of an induction variable at the end of the loop.
As a test case stripped down to the core of the problem, consider:

int test (int start, int end)
{
  int i;
  for (i = start; i < end; i++)
    ;
  return i;
}

That function really ought to be equivalent to

int test2 (int start, int end)
{
  return start < end ? end : start;
}

And indeed on i386-linux-gnu, we get mostly identical code (except for the
branch direction prediction).  However, on arm-linux-gnueabi (using -mthumb
and -march=armv7-a), we see for the first function:

test:
        cmp     r0, r1
        bge     .L2
        adds    r3, r0, #1
        mvns    r0, r0
        adds    r1, r1, r0
        adds    r0, r3, r1
.L2:
        bx      lr

instead of simply (as for the second function):

test2:
        cmp     r1, r0
        it      ge
        movge   r0, r1
        bx      lr

Where does that weird addition-and-negation sequence come from?

In 100t.dceloop we still have:

<bb 4>:
  # i_9 = PHI <i_5(5), start_2(D)(3)>
  i_5 = i_9 + 1;
  if (end_4(D) > i_5)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

<bb 6>:
  # i_1 = PHI <i_5(4)>

But 102t.sccp has:

<bb 4>:
  # i_9 = PHI <i_5(5), start_2(D)(3)>
  i_5 = i_9 + 1;
  if (end_4(D) > i_5)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

<bb 6>:
  D.4123_10 = start_2(D) + 1;
  D.4124_11 = ~start_2(D);
  D.4125_12 = (unsigned int) D.4124_11;
  D.4126_13 = (unsigned int) end_4(D);
  D.4127_14 = D.4125_12 + D.4126_13;
  D.4128_15 = (int) D.4127_14;
  i_1 = D.4123_10 + D.4128_15;

This is the sequence that ends up in the final assembler code.

Note that this computes:

  i_1 = (start_2(D) + 1) + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))

which is (since those casts are no-ops at the assembler level):

  i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)

which is, due to two's-complement arithmetic:

  i_1 = start_2(D) - start_2(D) + end_4(D)

that is, simply:

  i_1 = end_4(D)

Unfortunately, no tree-based optimization pass after 102t.sccp is able to
clean up that mess.  This is true even on *i386*: the only reason we get
nice assembler there is *RTX* optimization, notably in combine.  This
succeeds on i386 because there is an intermediate step that adds two
registers and the constant 1, which matches the lea pattern.  On arm,
there is no instruction that would handle that intermediate step, and
combine is not powerful enough to perform the whole simplification in a
single step.

Now, that sequence of gimple operations is generated in scev_const_prop
in tree-scalar-evolution.c.  First, the phi node corresponding to the
loop exit value is evaluated:

  def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);

which returns a chrec of the form

  {(start + 1), +, 1}

This is then further simplified by

  def = compute_overall_effect_of_inner_loop (ex_loop, def);

which first computes the number of loop iterations:

  tree nb_iter = number_of_latch_executions (loop);

which returns a tree representing

  (unsigned int) ~start + (unsigned int) end

(which at this point seems to be the most efficient valid way to compute
end - start - 1).  The chrec is then evaluated via chrec_apply, which
basically computes

  (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)

which ends up being converted to the long gimple sequence seen above.
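(For reference, here is a minimal stand-alone sketch of that expanded
computation; the helper name "expanded" and the test driver are purely
illustrative and not taken from any dump.  It just checks that the sccp
expression agrees with the loop's exit value for a few start < end pairs,
which matches the path on which the exit PHI value is actually used:)

#include <stdio.h>

/* The original loop.  */
static int
test (int start, int end)
{
  int i;
  for (i = start; i < end; i++)
    ;
  return i;
}

/* The computation emitted by 102t.sccp, written out by hand:
   (start + 1) + (int)((unsigned int) ~start + (unsigned int) end).  */
static int
expanded (int start, int end)
{
  unsigned int niter = (unsigned int) ~start + (unsigned int) end;
  return (start + 1) + (int) niter;
}

int
main (void)
{
  static const int pairs[][2] = { {0, 1}, {3, 10}, {-5, 3}, {-100, -99} };
  for (unsigned int i = 0; i < sizeof pairs / sizeof pairs[0]; i++)
    {
      int start = pairs[i][0], end = pairs[i][1];
      printf ("start=%4d end=%4d  loop=%4d  expanded=%4d\n",
              start, end, test (start, end), expanded (start, end));
    }
  return 0;
}

The expanded function is also a convenient thing to feed to individual
tree passes when checking whether they can fold the expression back to
just "end".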
Now I guess my questions are:

- In the computation shown above, should that tree have been folded into
  just "end" to start with?  The tree is constructed at its outermost
  level in chrec_fold_plus, which simply calls fold_build2.  While
  simplify-rtx.c has code to detect sequences of operations that cancel
  out while folding a PLUS or MINUS RTX, there does not appear to be
  corresponding code in fold-const.c to handle the equivalent
  optimization at the tree level.

- If not, should there be another tree pass later on that ought to clean
  this up?  I notice there is code to handle plus/minus sequences in
  tree-ssa-reassoc.c, but it does not seem to be able to handle this
  particular case, possibly because of the presence of the casts
  (NOP_EXPRs) ...

- Anywhere else I'm overlooking right now?

I'd appreciate any tips / suggestions on how to fix this ...

Thanks,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com