int main()
{
  static int i, n;
  static double a[200], b[200];
  ... (more variables and control flow)
  for (i = 0; i < n; i++)
    a[i] = b[i];
  ...
}
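
Because i and n are static, they live in memory, and the loop as written reads both and writes i on every iteration. A hand-written sketch of the form one would hope the optimizers reach is given below. It assumes that the elided code never takes the address of i or n and that the loop body contains no calls. The names copy_as_written, copy_hoisted, i_local and n_local are illustrative only and are not part of the test case; the variables are moved to file scope here just so both versions can share them.

```c
/* Stand-ins for the function-local statics in the test case
   (file scope is an assumption made for this sketch). */
static int i, n;
static double a[200], b[200];

/* The loop as written: i and n are memory operands, so unless the
   compiler promotes them to registers, each iteration loads i and n
   and stores i, in addition to the load from b and the store to a. */
void copy_as_written(void)
{
    for (i = 0; i < n; i++)
        a[i] = b[i];
}

/* Hand-applied transformation: n is loaded once before the loop,
   i is kept in a local while the loop runs, and the final value of i
   is stored once after the loop. */
void copy_hoisted(void)
{
    const int n_local = n;   /* load of n hoisted out of the loop */
    int i_local;             /* i held in a local inside the loop */

    for (i_local = 0; i_local < n_local; i_local++)
        a[i_local] = b[i_local];

    i = i_local;             /* store to i sunk below the loop */
}
```

Both functions leave a[] and i in the same state; the second one needs only the load from b and the store to a inside the loop.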

Tree-level optimizations do not hoist the loads of i and n out of the loop, nor do they sink the store to i below it. As a result, GCC generates five memory accesses per iteration on ia64 (4.3.0 20071112):

.L9:
	.mii
	nop 0
	sxt4 r14 = r16
	nop 0
	.mmi
	ld4 r15 = [r32]
	ld4 r58 = [r33]
	nop 0
	;;
	.mii
	shladd r14 = r14, 3, r0
	adds r16 = 1, r15
	;;
	add r15 = r35, r14
	.mmi
	add r14 = r34, r14
	st4 [r32] = r16
	cmp4.lt p6, p7 = r16, r58
	;;
	.mmi
	nop 0
	ldfd f6 = [r15]
	nop 0
	;;
	.mib
	stfd [r14] = f6
	nop 0
	(p6) br.cond.dptk .L9

On x86_64 the situation is better (4.3.0 20070930), but still not good, since the store to i remains inside the loop:

.L13:
	movslq %eax,%rdx
	movq b.3894(,%rdx,8), %rax
	movq %rax, x.3895(,%rdx,8)
	leal 1(%rcx), %eax
	cmpl %eax, %edi
	movl %eax, %ecx
	movl %eax, i.3912(%rip)
	jg .L13

However, that improvement happens at the RTL level, not at the tree level, as the final_cleanup dump reads:

<bb 13>:
  # MPT.140_429 = VDEF <MPT.140_645>
  x[i.265] = b[i.265];
  # VUSE <MPT.140_429>
  i.23 = i;
  i.265 = i.23 + 1;
  # MPT.140_430 = VDEF <MPT.140_429>
  i = i.265;
  # VUSE <MPT.140_430>
  n.274 = n;
  if (n.274 > i.265) goto <bb 13>; else goto <bb 14>;

-- 
           Summary: Useful loop invariant motion missing
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: amonakov at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34160