https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we now would vectorize this but compute the SLP vectorization as never
profitable. Generated code with -fno-vect-cost-model (-Ofast -march=corei7):

.L4:
        movupd  (%rdx,%rax), %xmm0
        movapd  %xmm3, %xmm1
        mulpd   %xmm0, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm2, %xmm0
        addsubpd        %xmm0, %xmm1
        movupd  (%rcx,%rax), %xmm0
        addpd   %xmm0, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L4

t.f:6:0: note: Cost model analysis:
  Vector inside of loop cost: 15
  Vector prologue cost: 10
  Vector epilogue cost: 0
  Scalar iteration cost: 14
  Scalar outside cost: 6
  Vector outside cost: 10
  prologue iterations: 0
  epilogue iterations: 0
t.f:6:0: note: cost model: the vector iteration cost = 15 divided by the scalar iteration cost = 14 is greater or equal to the vectorization factor = 1.
t.f:6:0: note: not vectorized: vectorization not profitable.
t.f:6:0: note: not vectorized: vector version will never be profitable.

as it is basically equal to basic-block vectorizing the loop body. Note that
we pessimistically handle addsubpd as if it were not present and the code
would really end up as

  vect__31.16_42 = vect__27.10_49 - vect__28.15_43;
  vect__31.17_41 = vect__27.10_49 + vect__28.15_43;
  _40 = VEC_PERM_EXPR <vect__31.16_42, vect__31.17_41, { 0, 3 }>;

which is what the vectorizer handles this with (two vector_stmt plus one
vec_perm cost). The x86 vectorizer cost model would need to be adjusted
for this.
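For readers unfamiliar with the pattern, here is a minimal scalar sketch in C of what the three GIMPLE statements above compute for a two-lane double vector. The function name addsub_emulated is hypothetical and used only for illustration; it is not GCC code. The point is that a full subtract, a full add, and a permute selecting lanes { 0, 3 } produce the same result as a single addsubpd (subtract in the low lane, add in the high lane), which is why costing it as three statements looks pessimistic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: emulates
     diff = a - b;  sum = a + b;  out = VEC_PERM_EXPR <diff, sum, { 0, 3 }>
   on two-lane double vectors. */
void addsub_emulated(const double a[2], const double b[2], double out[2])
{
    assert(a != NULL && b != NULL && out != NULL);

    /* Two full vector statements: one subtract, one add. */
    const double diff[2] = { a[0] - b[0], a[1] - b[1] };
    const double sum[2]  = { a[0] + b[0], a[1] + b[1] };

    /* The permute indexes into the concatenation of both inputs:
       indices 0..1 select from diff, indices 2..3 select from sum. */
    const double concat[4] = { diff[0], diff[1], sum[0], sum[1] };
    out[0] = concat[0];   /* low lane:  a[0] - b[0] */
    out[1] = concat[3];   /* high lane: a[1] + b[1] */
}
```

Half of the arithmetic done by the two vector statements is discarded by the permute, while addsubpd does only the needed work in one instruction.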
With cost model enabled we fall back to vectorization using interleaving:

.L5:
        movupd  (%rax), %xmm5
        addl    $1, %r9d
        addq    $32, %rax
        addq    $32, %r8
        movupd  -32(%r8), %xmm1
        movupd  -16(%r8), %xmm0
        movapd  %xmm5, %xmm2
        movapd  %xmm1, %xmm6
        movupd  -16(%rax), %xmm9
        unpckhpd        %xmm0, %xmm1
        unpcklpd        %xmm0, %xmm6
        movapd  %xmm6, %xmm0
        mulpd   %xmm8, %xmm0
        unpcklpd        %xmm9, %xmm2
        unpckhpd        %xmm9, %xmm5
        mulpd   %xmm7, %xmm6
        addpd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        mulpd   %xmm7, %xmm0
        mulpd   %xmm8, %xmm1
        subpd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        addpd   %xmm6, %xmm0
        movapd  %xmm2, %xmm1
        addpd   %xmm5, %xmm0
        unpcklpd        %xmm0, %xmm1
        unpckhpd        %xmm0, %xmm2
        movups  %xmm1, -32(%rax)
        movups  %xmm2, -16(%rax)
        cmpl    %esi, %r9d
        jb      .L5

t.f:6:0: note: Cost model analysis:
  Vector inside of loop cost: 26
  Vector prologue cost: 10
  Vector epilogue cost: 14
  Scalar iteration cost: 14
  Scalar outside cost: 6
  Vector outside cost: 24
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 6
t.f:6:0: note: Runtime profitability threshold = 5
t.f:6:0: note: Static estimate profitability threshold = 16

Thus this is now a cost model issue.
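To make the numbers in the two dumps easier to follow, here is a rough C sketch of a break-even search over the reported costs. It is a simplified model, not the vectorizer's actual implementation: the function name min_profitable_iters is hypothetical, and the vectorization factor of 2 used for the interleaving case is an assumption (the dump does not print it; 2 is the number of doubles in an SSE register). The sketch looks for the smallest iteration count n at which the scalar total, outside cost plus n times the iteration cost, exceeds the vector total, outside cost plus the inside cost for the iterations not handled by peeling.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only. Returns the smallest
   iteration count at which the vector loop is cheaper than the scalar
   loop under this simplified model, or -1 if it never is. */
int min_profitable_iters(int vec_inside, int vec_outside,
                         int scalar_iter, int scalar_outside,
                         int vf, int peel_iters)
{
    assert(vf > 0 && peel_iters >= 0);

    /* One vector iteration replaces vf scalar iterations. If it costs at
       least as much as those, the vector version can never win. */
    if (vec_inside >= scalar_iter * vf)
        return -1;

    for (int n = peel_iters; ; n++) {
        /* Both sides are scaled by vf to stay in integer arithmetic. */
        long scalar_total = (long)vf * (scalar_outside + (long)scalar_iter * n);
        long vector_total = (long)vec_inside * (n - peel_iters)
                            + (long)vf * vec_outside;
        if (scalar_total > vector_total)
            return n;
    }
}
```

With the interleaving figures (inside 26, outside 24, scalar iteration 14, scalar outside 6, one epilogue iteration) and the assumed factor of 2, this model yields 6, matching the "Calculated minimum iters for profitability" line. With the SLP figures (inside 15, scalar iteration 14, factor 1) it reports that vectorization never pays off, which is the rejection shown in the first dump.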