http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52272
--- Comment #4 from Richard Guenther <rguenth at gcc dot gnu.org> 2012-02-16 12:40:06 UTC ---
Before the patch we choose

Improved to:
  cost: 128 (complexity 0)
  cand_cost: 19
  cand_use_cost: 28 (complexity 0)
  candidates: 2, 4, 7
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:2, cost=(0,0)
   use:3 --> iv_cand:7, cost=(0,0)
   use:4 --> iv_cand:7, cost=(4,0)
   use:5 --> iv_cand:7, cost=(4,0)
   use:6 --> iv_cand:7, cost=(4,0)
   use:7 --> iv_cand:7, cost=(4,0)
   use:8 --> iv_cand:7, cost=(4,0)
   use:9 --> iv_cand:7, cost=(4,0)

and now we do not consider for example candidate 7 for use 4:

candidate 7
  var_before ivtmp.190
  var_after ivtmp.190
  incremented before exit test
  type character(kind=4)
  base (character(kind=4)) (a_296(D) + (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8)
  step 8
  base object (void *) a_296(D)

use 4
  generic
  in statement D.2322_387 = axp_318(D) + D.2321_367;
  at position
  type real(kind=8)[0:D.1963] * restrict
  base axp_318(D) + (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
  base object (void *) axp_318(D)
  related candidates

and we really do not want to do that because of the wrong-code issue.

We instead end up with

Improved to:
  cost: 133 (complexity 7)
  cand_cost: 13
  cand_use_cost: 39 (complexity 7)
  candidates: 4, 5
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:5, cost=(0,0)
   use:3 --> iv_cand:5, cost=(5,1)
   use:4 --> iv_cand:5, cost=(5,1)
   use:5 --> iv_cand:5, cost=(5,1)
   use:6 --> iv_cand:5, cost=(5,1)
   use:7 --> iv_cand:5, cost=(5,1)
   use:8 --> iv_cand:5, cost=(5,1)
   use:9 --> iv_cand:5, cost=(5,1)

where

candidate 5 (important)
  var_before ivtmp.188
  var_after ivtmp.188
  incremented before exit test
  type sizetype
  base 0
  step 8

I think what we are missing in order to relate uses 4 to 9, which are all
of the form

  base <parameter> + (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8

is a candidate which has the base object stripped and thus only tracks

  (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8

which we have as IV at least:

ssa name D.2332_451
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8

and redundantly as:

ssa name D.2354_680
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2343_692
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2365_752
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8
ssa name D.2376_763
  type sizetype
  base (((sizetype) stride.88_9 + (sizetype) pretmp.141_661) + 1) * 8
  step 8

but there are no associated candidate(s).

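To illustrate what a candidate with the base object stripped would buy, here is a hand-written C sketch. It is not the test case from this PR and not what ivopts emits; the function names and the two-array kernel are hypothetical. The first function keeps one pointer IV per array (each tied to its own base object), the second walks a single sizetype byte offset with step 8 and adds it to each base.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel for illustration only, not the code from this PR. */

/* One pointer IV per array: every IV carries its own base object, so an
   IV based on `a` cannot be reused to express an access through `b`. */
static double sum_per_pointer_iv(const double *a, const double *b,
                                 size_t start, size_t n)
{
    const double *pa = a + start;
    const double *pb = b + start;
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += *pa * *pb;
        pa++;               /* step: 8 bytes */
        pb++;               /* step: 8 bytes */
    }
    return s;
}

/* One shared byte-offset IV with no base object: both accesses are
   expressed as base + off, so a single IV serves every array. */
static double sum_shared_offset_iv(const double *a, const double *b,
                                   size_t start, size_t n)
{
    size_t off = start * sizeof(double);
    const size_t end = (start + n) * sizeof(double);
    double s = 0.0;
    for (; off != end; off += sizeof(double)) {   /* step: 8 bytes */
        const double *pa = (const double *)((const char *)a + off);
        const double *pb = (const double *)((const char *)b + off);
        s += *pa * *pb;
    }
    return s;
}
```

Both functions perform the same multiplications and additions in the same order; the only difference is how the addresses are formed, which is the property the stripped-base candidate is meant to capture.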
If we add a candidate for it (9) we end up with

Improved to:
  cost: 131 (complexity 0)
  cand_cost: 15
  cand_use_cost: 35 (complexity 0)
  candidates: 4, 9
   use:0 --> iv_cand:4, cost=(2,0)
   use:1 --> iv_cand:4, cost=(2,0)
   use:2 --> iv_cand:9, cost=(3,0)
   use:3 --> iv_cand:9, cost=(4,0)
   use:4 --> iv_cand:9, cost=(4,0)
   use:5 --> iv_cand:9, cost=(4,0)
   use:6 --> iv_cand:9, cost=(4,0)
   use:7 --> iv_cand:9, cost=(4,0)
   use:8 --> iv_cand:9, cost=(4,0)
   use:9 --> iv_cand:9, cost=(4,0)

but with that change we now unroll the innermost loop twice, so I'm not
sure it will pay off. The code generation differences, even for the
original patch that caused the regression, are only in scheduling and
register allocation (so -fschedule-insns may recover it, or
-fsched-pressure).
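As a sanity check on the three dumps, the per-use costs can be summed and compared with the reported cand_use_cost values. The sketch below only redoes arithmetic on numbers copied from this comment; the array and function names are hypothetical. It also shows that cost minus cand_cost minus cand_use_cost is 81 in all three cases. The dumps quoted here do not itemize that remainder, so what it covers is left open.

```c
#include <assert.h>
#include <stddef.h>

/* Per-use costs (first component of cost=(x,y)) copied from the three
   "Improved to:" dumps in this comment, uses 0..9. */
static const int before_patch[10]    = {2, 2, 0, 0, 4, 4, 4, 4, 4, 4};
static const int after_patch[10]     = {2, 2, 0, 5, 5, 5, 5, 5, 5, 5};
static const int with_candidate9[10] = {2, 2, 3, 4, 4, 4, 4, 4, 4, 4};

/* Sum n per-use costs. */
static int sum_costs(const int *costs, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += costs[i];
    return total;
}

/* Total cost minus the two itemized components from the dump. */
static int unitemized_cost(int cost, int cand_cost, int cand_use_cost)
{
    return cost - cand_cost - cand_use_cost;
}

/* Check the dumps against each other; aborts on any mismatch. */
static void check_dump_arithmetic(void)
{
    assert(sum_costs(before_patch, 10) == 28);     /* cand_use_cost: 28 */
    assert(sum_costs(after_patch, 10) == 39);      /* cand_use_cost: 39 */
    assert(sum_costs(with_candidate9, 10) == 35);  /* cand_use_cost: 35 */

    assert(unitemized_cost(128, 19, 28) == 81);
    assert(unitemized_cost(133, 13, 39) == 81);
    assert(unitemized_cost(131, 15, 35) == 81);
}
```

Since the remainder is identical, the ranking of the three candidate sets comes down to cand_cost plus cand_use_cost: 47 before the patch, 52 after it, and 50 with candidate 9 added.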