https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
0x398f310 _2 * _4 1 times scalar_stmt costs 12 in body
...
0x392b3f0 _1 w* _3 2 times vec_promote_demote costs 8 in body
...
t4.c:4:12: note:  Cost model analysis:
  Vector inside of loop cost: 40
  Vector prologue cost: 4
  Vector epilogue cost: 108
  Scalar iteration cost: 40
  Scalar outside cost: 32
  Vector outside cost: 112
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3

so clearly the widening multiplication is not costed correctly.  With SSE 4.2
we can do better:

.L4:
        movdqu  (%rcx,%rax), %xmm0
        movdqu  (%rsi,%rax), %xmm1
        addq    $16, %rax
        movdqa  %xmm0, %xmm3
        movdqa  %xmm1, %xmm4
        punpckldq       %xmm0, %xmm3
        punpckldq       %xmm1, %xmm4
        punpckhdq       %xmm0, %xmm0
        pmuldq  %xmm4, %xmm3
        punpckhdq       %xmm1, %xmm1
        pmuldq  %xmm1, %xmm0
        paddq   %xmm3, %xmm2
        paddq   %xmm0, %xmm2
        cmpq    %rdi, %rax
        jne     .L4

but even there the costing is imprecise.  The vectorizer is unhelpful
in categorizing the widen mult as vec_promote_demote which then fails to
run into

        case MULT_EXPR:
        case WIDEN_MULT_EXPR:
        case MULT_HIGHPART_EXPR:
          stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
          break;

fixing that yields

0x392b3f0 _1 w* _3 2 times vector_stmt costs 136 in body

for both SSE2 and SSE4.2 and AVX2 so that's over-estimating cost then via

  /* V*DImode is emulated with 5-8 insns.  */
  else if (mode == V2DImode || mode == V4DImode)
    {
      if (TARGET_XOP && mode == V2DImode)
        return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 3);
      else
        return ix86_vec_cost (mode, cost->mulss * 3 + cost->sse_op * 5);
    }

with cost->mulss == 16.  I suppose it is somehow failing to realize it's
doing a widening multiply.