Hello! Consider this simple testcase:

#define N 16

short ia[N];
short ic[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};
short ib[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};

int main ()
{
  int i;

  for (i = 0; i < N; i++)
    ia[i] = ib[i] + ic[i];

  return 0;
}

The loop in this testcase is compiled with 'gcc -O2 -ftree-vectorize -msse2' into:

.L2:
        movdqa  ib(%eax), %xmm0
        paddw   ic(%eax), %xmm0
        incl    %edx
        movdqa  %xmm0, ia(%eax)
        addl    $16, %eax
        cmpl    $2, %edx
        jne     .L2

There is no (,%reg,16) SIB addressing mode on i386, and it looks to me as if
the loop optimizer falls back to the simplest addressing mode in this case.
Unfortunately, the %edx register is wasted in the above code: it serves only
as an iteration counter, although %eax already tracks the loop's progress.

Better code would be:

.L2:
        movdqa  ib(,%eax,8), %xmm0
        paddw   ic(,%eax,8), %xmm0
        movdqa  %xmm0, ia(,%eax,8)
        addl    $2, %eax
        cmpl    $4, %eax
        jne     .L2

or, with the simplest addressing scheme:

.L2:
        movdqa  ib(%eax), %xmm0
        paddw   ic(%eax), %xmm0
        movdqa  %xmm0, ia(%eax)
        addl    $16, %eax
        cmpl    $32, %eax
        jne     .L2

Uros.

-- 
           Summary: A register is wasted in simple vectorised loops
           Product: gcc
           Version: 4.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P2
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: uros at kss-loka dot si
                CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22497
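
To illustrate the two loop shapes compared in this report, here is a rough scalar C model. It is only a sketch, not what the vectorizer or the loop optimizer actually does internally: the names VEC_BYTES, add_two_ivs, add_one_iv and check_equivalent are made up for this example, and the inner loop merely stands in for one 16-byte movdqa/paddw/movdqa group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 16
#define VEC_BYTES 16  /* assumption: one SSE2 register = 16 bytes = 8 shorts */

static short ib[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};
static short ic[N] = {0,3,6,9,12,15,18,21,24,27,30,33,36,39,42,45};

/* Stand-in for one vector add: processes the 8 shorts at byte offset 'off'. */
static void vec_add_at(short *dst, size_t off)
{
    size_t first = off / sizeof(short);
    for (size_t k = 0; k < VEC_BYTES / sizeof(short); k++)
        dst[first + k] = ib[first + k] + ic[first + k];
}

/* Shape of the reported code: a byte offset (the %eax role) for addressing
   plus a separate iteration counter (the %edx role) for the exit test. */
static void add_two_ivs(short *dst)
{
    size_t off = 0;
    size_t count = 0;
    do {
        vec_add_at(dst, off);
        count++;
        off += VEC_BYTES;
    } while (count != N * sizeof(short) / VEC_BYTES);   /* cmpl $2 */
}

/* Shape of the suggested code: the byte offset alone drives both the
   addressing and the exit test, so no second counter is needed. */
static void add_one_iv(short *dst)
{
    size_t off = 0;
    do {
        vec_add_at(dst, off);
        off += VEC_BYTES;
    } while (off != N * sizeof(short));                 /* cmpl $32 */
}

/* Both shapes must produce identical results. */
static void check_equivalent(void)
{
    short a[N], b[N];
    add_two_ivs(a);
    add_one_iv(b);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

Calling check_equivalent() confirms that dropping the separate counter does not change the result; the exit bound is simply rescaled from the iteration count (2) to the byte count (32).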