http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56935
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |WAITING

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> 2013-04-15 14:38:40 UTC ---
Reduced testcase:

typedef struct {
    long int x;
    long int y;
} S;
void foo (S *s)
{
  s->x--;
  s->y--;
}

Difference in cost model analysis:

before:

t.c:7: note: Cost model analysis:
  Vector inside of basic block cost: 5
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar cost of basic block: 6

after:

t.c:7: note: Cost model analysis:
  Vector inside of basic block cost: 5
  Vector prologue cost: 1
  Vector epilogue cost: 0
  Scalar cost of basic block: 6

after is more correct, as we need to synthesize the { 1, 1 } vector.
what isn't really optimal is the unchanged vector inside cost.  It's
an unaligned load with cost 2, the vector operation with cost 1 and
the unaligned store with cost 2.

Before we generated

        pcmpeqd %xmm0, %xmm0
        movdqu  (%rdi), %xmm1
        paddq   %xmm1, %xmm0
        movdqu  %xmm0, (%rdi)
        ret

and afterwards

        subq    $1, (%rdi)
        subq    $1, 8(%rdi)

I'd say it's obvious that the non-vectorized variant is better.

So, are you sure _this_ basic-block is really the issue?
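
The arithmetic behind the two dumps can be sketched in C. This is only an illustration, not GCC's code: the struct, the enum constants and the function names are hypothetical, and the decision rule (vectorize only when inside + prologue + epilogue cost is strictly less than the scalar cost) is an assumption that matches the observed before/after behaviour. The per-statement costs are the ones quoted in the comment.

```c
#include <stdbool.h>

/* Per-statement costs as quoted in the comment (names are hypothetical). */
enum {
    UNALIGNED_LOAD_COST  = 2,
    VECTOR_OP_COST       = 1,
    UNALIGNED_STORE_COST = 2
};

/* Hypothetical container for the four numbers printed in the dump. */
struct bb_costs {
    int vec_inside;
    int vec_prologue;
    int vec_epilogue;
    int scalar;
};

/* Vector inside cost: one unaligned load, one vector op, one unaligned store. */
int inside_cost(void)
{
    return UNALIGNED_LOAD_COST + VECTOR_OP_COST + UNALIGNED_STORE_COST;
}

/* Assumed rule: the basic block is vectorized only if the total vector
   cost is strictly below the scalar cost. */
bool bb_vectorization_profitable(struct bb_costs c)
{
    int vec_total = c.vec_inside + c.vec_prologue + c.vec_epilogue;
    return vec_total < c.scalar;
}
```

With the "before" numbers (5, 0, 0 against scalar 6) the total is 5 and the block is vectorized. With the "after" numbers (5, 1, 0 against scalar 6) the total is 6, which is no longer below the scalar cost, so the scalar subq sequence is kept.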