http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34723
--- Comment #3 from Jeffrey A. Law <law at redhat dot com> ---
Andrew, no. 4.2 didn't muck things up at all. The 4.2 code is clearly better (unless you're vectorizing the loop).

What's happening is that the IV code changes the loop structure enough that VRP2/DOM2 are unable to peel the first iteration off the loop. In gcc-4.2, just prior to VRP2 we have:

# BLOCK 2 freq:1000
# PRED: ENTRY [100.0%]  (fallthru,exec)
# SUCC: 3 [100.0%]  (fallthru,exec)

# BLOCK 3 freq:10000
# PRED: 3 [90.0%]  (dfs_back,true,exec) 2 [100.0%]  (fallthru,exec)
  # ivtmp.43_3 = PHI <ivtmp.43_5(3), 0(2)>;
  # val_16 = PHI <val_11(3), 0(2)>;
<L0>:;
  D.1880_7 = MEM[symbol: table, index: ivtmp.43_3]{table[i]};
  D.1881_8 = (unsigned char) D.1880_7;
  val.1_9 = (unsigned char) val_16;
  D.1883_10 = val.1_9 + D.1881_8;
  val_11 = (char) D.1883_10;
  ivtmp.43_5 = ivtmp.43_3 + 1;
  if (ivtmp.43_5 != 10) goto <L0>; else goto <L2>;
# SUCC: 3 [90.0%]  (dfs_back,true,exec) 4 [10.0%]  (loop_exit,false,exec)

VRP threads the jump through the backedge for the first iteration of the loop, resulting in:

# BLOCK 2 freq:1000
# PRED: ENTRY [100.0%]  (fallthru,exec)
  goto <bb 5> (<L8>);
# SUCC: 5 [100.0%]  (fallthru,exec)

# BLOCK 3 freq:9000
# PRED: 5 [100.0%]  (fallthru) 3 [88.9%]  (true,exec)
  # ivtmp.43_3 = PHI <ivtmp.43_23(5), ivtmp.43_5(3)>;
  # val_16 = PHI <val_22(5), val_11(3)>;
<L0>:;
  D.1880_7 = MEM[symbol: table, index: ivtmp.43_3]{table[i]};
  D.1881_8 = (unsigned char) D.1880_7;
  val.1_9 = (unsigned char) val_16;
  D.1883_10 = val.1_9 + D.1881_8;
  val_11 = (char) D.1883_10;
  ivtmp.43_5 = ivtmp.43_3 + 1;
  if (ivtmp.43_5 != 10) goto <L0>; else goto <L2>;
# SUCC: 3 [88.9%]  (true,exec) 4 [11.1%]  (loop_exit,false,exec)

# BLOCK 4 freq:1000
# PRED: 3 [11.1%]  (loop_exit,false,exec)
  # val_2 = PHI <val_11(3)>;
<L2>:;
  D.1884_13 = (int) val_2;
  return D.1884_13;
# SUCC: EXIT [100.0%]

# BLOCK 5 freq:1000
# PRED: 2 [100.0%]  (fallthru,exec)
  # ivtmp.43_17 = PHI <0(2)>;
  # val_1 = PHI <0(2)>;
<L8>:;
  D.1880_18 = MEM[symbol: table, index: ivtmp.43_17]{table[i]};
  D.1881_19 = (unsigned char) D.1880_18;
  val.1_20 = (unsigned char) val_1;
  D.1883_21 = val.1_20 + D.1881_19;
  val_22 = (char) D.1883_21;
  ivtmp.43_23 = ivtmp.43_17 + 1;
  goto <bb 3> (<L0>);
# SUCC: 3 [100.0%]  (fallthru)

This ultimately compiles down to the efficient code in which the first iteration has been peeled off.

On the trunk, the order of DOM2/VRP2 has changed, so if we look at the code immediately prior to DOM2 we have:

<bb 2>:
  ivtmp.10_16 = (unsigned long) &table;
  _12 = (unsigned long) &MEM[(void *)&table + 10B];
  goto <bb 4>;

<bb 3>:

<bb 4>:
  # val_14 = PHI <val_8(3), 0(2)>
  # ivtmp.10_18 = PHI <ivtmp.10_17(3), ivtmp.10_16(2)>
  _13 = (void *) ivtmp.10_18;
  _4 = MEM[base: _13, offset: 0B];
  _5 = (unsigned char) _4;
  val.0_6 = (unsigned char) val_14;
  _7 = _5 + val.0_6;
  val_8 = (char) _7;
  ivtmp.10_17 = ivtmp.10_18 + 1;
  if (ivtmp.10_17 != _12)
    goto <bb 3>;
  else
    goto <bb 5>;

Note how the test to go back to the top of the loop has changed. It's no longer testing a simple integer counter, which threading handled nicely. Instead it's a more complex test involving two objects, and neither DOM2 nor VRP2 is able to untangle it to get the code we want.

ISTM this should get a regression marker and be attached to the jump-threading meta-bug.
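For reference, here is a source-level sketch of what the dumps describe. The function names and the exact shape of the test case are my reconstruction from the GIMPLE above (a char accumulator summed over table[0..9]), not necessarily the reporter's original source: f() is the plain loop, f_peeled() is what 4.2's jump threading effectively produces, and f_trunk_iv() shows the pointer-based exit test the trunk's IV selection creates, which the threaders can't see through.

```c
/* Hypothetical reconstruction of the test case, inferred from the GIMPLE
   dumps above; the actual source in the bug report may differ.  */
char table[10];

int f(void)
{
    char val = 0;
    for (int i = 0; i < 10; i++)   /* 4.2 IV: plain counter, exit test i != 10 */
        val += table[i];
    return val;
}

/* What 4.2's VRP jump threading effectively achieves: the first
   iteration is peeled out of the loop (the <L8> block in the dump).  */
int f_peeled(void)
{
    char val = table[0];           /* peeled first iteration */
    for (int i = 1; i < 10; i++)
        val += table[i];
    return val;
}

/* Source-level picture of the trunk's IV choice: the exit test compares
   a pointer against &table + 10B (two objects), which DOM2/VRP2 cannot
   untangle to peel the first iteration.  */
int f_trunk_iv(void)
{
    char val = 0;
    for (char *p = table; p != table + 10; p++)
        val += *p;
    return val;
}
```

All three compute the same value; the difference is only in how much work the jump threaders must do to recover the peeled form.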