http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58626
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
I have updated my do-proper-partition-dependencies patch and verified
it fixes this issue.  We now generate

  <bb 2>:
  __builtin_memmove (&MEM[(void *)&a + 24B], &MEM[(void *)&a + 48B], 8);

  <bb 3>:
  # b.0_10 = PHI <b.1_12(4), 0(2)>
  a[1][3] = 0;
  _25 = a[3][b.0_10];
  a[2][b.0_10] = _25;
  b.1_12 = b.0_10 + 1;
  if (b.1_12 <= 1)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

  <bb 5>:
  b = 2;
  _14 = a[1][1];
  if (_14 != 1)

Note that the inner loop is completely peeled before loop distribution,
and we see

  <bb 3>:
  # b.0_10 = PHI <b.1_12(4), 0(2)>
  a[1][3] = 0;
  _19 = a[2][b.0_10];
  a[1][b.0_10] = _19;
  _25 = a[3][b.0_10];
  a[2][b.0_10] = _25;
  b.1_12 = b.0_10 + 1;
  if (b.1_12 <= 1)

The patch still needs quite some TLC though.
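
For readers who prefer C to GIMPLE, the two dumps can be approximated by the sketch below. It is only an illustration, not the PR's testcase: the array shape (4 rows of 6 ints, with a 4-byte int, so that byte offset 24 is a[1][0] and byte offset 48 is a[2][0]) is inferred from the memmove operands, and the names fused, distributed and forms_agree are made up here.

```c
#include <string.h>

/* Assumed shape, inferred from the byte offsets in the dump; the real
   declaration of `a` is not quoted in the comment. */
enum { ROWS = 4, COLS = 6 };

/* Loop body as seen before loop distribution (second dump). */
static void fused(int a[ROWS][COLS])
{
    for (int b = 0; b <= 1; b++) {
        a[1][3] = 0;
        a[1][b] = a[2][b];   /* reads a[2][b] before it is overwritten */
        a[2][b] = a[3][b];
    }
}

/* Shape after loop distribution (first dump): the a[1][b] = a[2][b]
   copy becomes one 8-byte memmove placed ahead of the remaining loop. */
static void distributed(int a[ROWS][COLS])
{
    memmove(&a[1][0], &a[2][0], 2 * sizeof(int));
    for (int b = 0; b <= 1; b++) {
        a[1][3] = 0;
        a[2][b] = a[3][b];
    }
}

/* Returns 1 if both forms leave identical array contents. */
static int forms_agree(void)
{
    int x[ROWS][COLS], y[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = y[i][j] = 10 * i + j + 1;
    fused(x);
    distributed(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

In this sketch the copy partition has to run before the loop that overwrites a[2][b]; running it afterwards would copy the new values instead of the old ones.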