http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51499
--- Comment #2 from fb.programming at gmail dot com 2011-12-11 08:33:40 UTC ---
(In reply to comment #1)

    g++-4.6.2 -S -Wall -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 \
        -ffast-math -fno-vect-cost-model

gives me exactly the same assembly code as above (which surprises me a bit, as
-funsafe-math-optimizations might as well have eliminated the loop completely).

The optimal assembly, however, I would expect to be something like:

.L3:
    addq    $1, %rax
    addpd   %xmm0, %xmm3
    cmpq    %rdi, %rax
    addpd   %xmm0, %xmm2
    addpd   %xmm0, %xmm1
    jne     .L3

where the vector (sum1,sum2) is stored in xmm1, (sum3,sum4) is stored in xmm2,
etc., and (a,a) is stored in xmm0.

This speeds it up by a factor of 2 and is completely equivalent to the scalar
case, so I don't see why -ffast-math (which implies
-funsafe-math-optimizations) should be necessary in this case, either.
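
To illustrate the equivalence claim, here is a minimal sketch of the pairing described above, written with the GCC/Clang vector extension instead of hand-written assembly. The original test case is not part of this comment, so everything here is an assumption: the six accumulators are inferred from the three addpd instructions, and the names Sums, scalar_sums, paired_sums and same_results are hypothetical. Each lane performs the same sequence of additions as its scalar counterpart, so no reassociation is involved.

```cpp
#include <cstddef>

// Two packed doubles, using the GCC/Clang vector extension.
typedef double v2df __attribute__((vector_size(16)));

// Hypothetical container for the six running sums (sum1..sum6).
struct Sums {
    double s[6];
};

// Scalar reference: six independent accumulators, each incremented by a.
Sums scalar_sums(double a, std::size_t n) {
    Sums r = {{0.0, 0.0, 0.0, 0.0, 0.0, 0.0}};
    for (std::size_t i = 0; i < n; ++i)
        for (int k = 0; k < 6; ++k)
            r.s[k] += a;
    return r;
}

// Paired version: (sum1,sum2), (sum3,sum4), (sum5,sum6) each live in one
// vector, and (a,a) is added to each of them once per iteration.
Sums paired_sums(double a, std::size_t n) {
    const v2df va = {a, a};
    v2df s12 = {0.0, 0.0};
    v2df s34 = {0.0, 0.0};
    v2df s56 = {0.0, 0.0};
    for (std::size_t i = 0; i < n; ++i) {
        s12 += va;
        s34 += va;
        s56 += va;
    }
    Sums r = {{s12[0], s12[1], s34[0], s34[1], s56[0], s56[1]}};
    return r;
}

// Compares both versions element by element using exact equality.
bool same_results(double a, std::size_t n) {
    const Sums x = scalar_sums(a, n);
    const Sums y = paired_sums(a, n);
    for (int k = 0; k < 6; ++k)
        if (x.s[k] != y.s[k])
            return false;
    return true;
}
```

Calling same_results(0.1, 1000) without -ffast-math is one way to check that the paired form gives the same values as the scalar form. Whether the compiler actually emits three addpd instructions for paired_sums depends on the target and the options used.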