http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58112
Bug ID: 58112
Summary: Ineffective addressing mode used in loop.
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: neleai at seznam dot cz

Hi, for the following testcase gcc -O3 generates the following loop:

        movq    %rsi, %r9
        subq    %rdx, %r9
        movq    %r9, %rdi
        movq    %r9, %rsi
        leaq    16(%r9), %r8
        addq    $32, %rdi
        addq    $48, %rsi
        .p2align 4,,10
        .p2align 3
.L14:
        movdqu  (%rdx,%r9), %xmm0
        addq    $64, %rdx
        movdqa  %xmm0, -64(%rdx)
        movdqu  -64(%rdx,%r8), %xmm0
        movdqa  %xmm0, -48(%rdx)
        movdqu  -64(%rdx,%rdi), %xmm0
        movdqa  %xmm0, -32(%rdx)
        movdqu  -64(%rdx,%rsi), %xmm0
        movdqa  %xmm0, -16(%rdx)
        cmpq    %rdx, %rcx
        jne     .L14
        rep; ret

This saves one "addq $64, %rsi" instruction per iteration. However, it
occupies four extra registers (%r9, %r8, %rdi, %rsi), and the indexed
address calculations done in each iteration cost more and lead to bigger
code than the one instruction that was saved.
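
For illustration only, the loop shape the report argues for (advancing both the source and the destination pointer by 64 each iteration, so every access is a plain base+displacement address) looks roughly like the C sketch below. This is not the reporter's testcase. The function name copy_blocks, the use of memcpy for the 16-byte moves, and the requirement that n be a multiple of 64 are all assumptions made for the example. No claim is made about the code any particular GCC version emits for it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the bug report.
   Copies n bytes (n must be a multiple of 64) in 64-byte blocks.
   Both pointers are advanced explicitly, so each 16-byte chunk is
   addressed as pointer + constant displacement rather than through
   a precomputed (src - dst) index register. */
static void copy_blocks(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 64 == 0);
    unsigned char *end = dst + n;

    while (dst != end) {
        memcpy(dst,      src,      16);
        memcpy(dst + 16, src + 16, 16);
        memcpy(dst + 32, src + 32, 16);
        memcpy(dst + 48, src + 48, 16);
        dst += 64;   /* corresponds to the addq $64 on the destination */
        src += 64;   /* the extra addq $64 that the generated code avoids */
    }
}
```

Written this way, the loop needs one additional add per iteration but only two pointer registers, which is the trade-off the report describes.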