http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51492
             Bug #: 51492
           Summary: vectorizer generates unnecessary code
    Classification: Unclassified
           Product: gcc
           Version: 4.6.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: drepper....@gmail.com
             Build: x86_64-linux

Compile this code with 4.6.2 on an x86-64 machine with -O3:

    #define SIZE 65536
    #define WSIZE 64
    unsigned short head[SIZE] __attribute__((aligned(64)));

    void
    f(void)
    {
      for (unsigned n = 0; n < SIZE; ++n) {
        unsigned short m = head[n];
        head[n] = (unsigned short)(m >= WSIZE ? m-WSIZE : 0);
      }
    }

The result I see is this:

    0000000000000000 <f>:
       0:   66 0f ef d2             pxor   %xmm2,%xmm2
       4:   b8 00 00 00 00          mov    $0x0,%eax
                            5: R_X86_64_32  head
       9:   66 0f 6f 25 00 00 00    movdqa 0x0(%rip),%xmm4        # 11 <f+0x11>
      10:   00
                            d: R_X86_64_PC32        .LC0-0x4
      11:   66 0f 6f 1d 00 00 00    movdqa 0x0(%rip),%xmm3        # 19 <f+0x19>
      18:   00
                            15: R_X86_64_PC32       .LC1-0x4
      19:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
      20:   66 0f 6f 00             movdqa (%rax),%xmm0
      24:   66 0f 6f c8             movdqa %xmm0,%xmm1
      28:   66 0f d9 c4             psubusw %xmm4,%xmm0
      2c:   66 0f 75 c2             pcmpeqw %xmm2,%xmm0
      30:   66 0f fd cb             paddw  %xmm3,%xmm1
      34:   66 0f df c1             pandn  %xmm1,%xmm0
      38:   66 0f 7f 00             movdqa %xmm0,(%rax)
      3c:   48 83 c0 10             add    $0x10,%rax
      40:   48 3d 00 00 00 00       cmp    $0x0,%rax
                            42: R_X86_64_32S        head+0x20000
      46:   75 d8                   jne    20 <f+0x20>
      48:   f3 c3                   repz retq

There is a lot of unnecessary code.  The psubusw instruction alone is
sufficient, since its purpose is to implement unsigned saturating
subtraction.  In the generated loop, psubusw and pcmpeqw are used only to
build a mask of the elements that saturate to zero; the actual result comes
from the separate paddw, and pandn then clears the masked elements.  Why
does gcc create all this extra code?

The code should just be

    movdqa (%rax), %xmm0
    psubusw %xmm1, %xmm0
    movdqa %xmm0, (%rax)

where %xmm1 has WSIZE in each of its 16-bit elements.
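For reference, the claim that psubusw alone is sufficient can be checked with a small sketch using SSE2 intrinsics. This is not part of the original test case: the function names sub_scalar, sub_saturating and check_equivalence are made up for illustration, and the intrinsic version assumes a 16-byte-aligned buffer whose length is a multiple of 8. _mm_subs_epu16 is the intrinsic that maps to psubusw.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

#define SIZE 65536
#define WSIZE 64

/* Scalar reference: the same loop body as in the report. */
static void sub_scalar(unsigned short *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        unsigned short m = buf[i];
        buf[i] = (unsigned short)(m >= WSIZE ? m - WSIZE : 0);
    }
}

/* Hand-written version: one psubusw per 8 elements.
   Assumption: buf is 16-byte aligned and n is a multiple of 8. */
static void sub_saturating(unsigned short *buf, size_t n)
{
    assert(n % 8 == 0);
    const __m128i w = _mm_set1_epi16(WSIZE);    /* WSIZE in all 8 lanes */
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        v = _mm_subs_epu16(v, w);               /* unsigned saturating subtract */
        _mm_store_si128((__m128i *)(buf + i), v);
    }
}

/* Returns 1 if both versions agree for every possible 16-bit input. */
static int check_equivalence(void)
{
    static unsigned short a[SIZE] __attribute__((aligned(64)));
    static unsigned short b[SIZE] __attribute__((aligned(64)));

    for (unsigned i = 0; i < SIZE; ++i)
        a[i] = b[i] = (unsigned short)i;

    sub_scalar(a, SIZE);
    sub_saturating(b, SIZE);

    for (unsigned i = 0; i < SIZE; ++i)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Since SIZE is 65536, filling the arrays with their own indices covers every possible unsigned short value, so a successful check_equivalence() is an exhaustive comparison rather than a spot check.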