https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65962
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
While strided stores are now implemented the case is still not handled because
single-element interleaving takes precedence (and single-element interleaving
isn't supported for stores as that always produces gaps).

I have a patch that produces

.L2:
        movdqu  16(%rax), %xmm1
        addq    $32, %rax
        movdqu  -32(%rax), %xmm0
        shufps  $136, %xmm1, %xmm0
        paddd   %xmm2, %xmm0
        pshufd  $85, %xmm0, %xmm1
        movd    %xmm0, -32(%rax)
        movd    %xmm1, -24(%rax)
        movdqa  %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movd    %xmm1, -16(%rax)
        movd    %xmm0, -8(%rax)
        cmpq    %rdx, %rax
        jne     .L2

when you disable the cost model.  Otherwise it's deemed not profitable.  Using
scatters for AVX could in theory make it profitable (not sure).

t.c:5:3: note: Cost model analysis:
  Vector inside of loop cost: 13
  Vector prologue cost: 1
  Vector epilogue cost: 12
  Scalar iteration cost: 3
  Scalar outside cost: 0
  Vector outside cost: 13
  prologue iterations: 0
  epilogue iterations: 4
t.c:5:3: note: cost model: the vector iteration cost = 13 divided by the scalar
iteration cost = 3 is greater or equal to the vectorization factor = 4.
t.c:5:3: note: not vectorized: vectorization not profitable.
t.c:5:3: note: not vectorized: vector version will never be profitable.

t.c:5:3: note: ==> examining statement: *_8 = _10;
t.c:5:3: note: vect_is_simple_use: operand _10
t.c:5:3: note: def_stmt: _10 = _9 + 7;
t.c:5:3: note: type of def: internal
t.c:5:3: note: vect_model_store_cost: inside_cost = 8, prologue_cost = 0 .

so the strided store has cost 8, that's 4 extracts plus 4 scalar stores.
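
The arithmetic in the dump above can be restated as a small C sketch. This is not GCC's code: the function names are hypothetical, the comparison is only a cross-multiplied form of the message printed in the dump, and the unit cost per extract and per scalar store is an assumption inferred from "4 extracts plus 4 scalar stores".

```c
#include <stdbool.h>

/* Hypothetical helper: cost of a strided store that extracts every lane
   and writes it with a scalar store.  Unit costs are an assumption.  */
int strided_store_cost(int nlanes, int extract_cost, int scalar_store_cost)
{
    return nlanes * (extract_cost + scalar_store_cost);
}

/* Hypothetical helper mirroring the dump message: the vector version is
   rejected when vec_inside_cost / scalar_iter_cost >= vf.  For positive
   integers this is the same as vec_inside_cost >= scalar_iter_cost * vf,
   which avoids the division.  */
bool vector_never_profitable(int vec_inside_cost, int scalar_iter_cost, int vf)
{
    return vec_inside_cost >= scalar_iter_cost * vf;
}
```

With the numbers from the dump, `strided_store_cost(4, 1, 1)` gives the reported store cost of 8, and `vector_never_profitable(13, 3, 4)` is true because 13 >= 3 * 4 = 12, so the loop is rejected by a margin of one cost unit.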
With AVX we generate

        vmovd   %xmm0, -32(%rax)
        vpextrd $1, %xmm0, -24(%rax)
        vpextrd $2, %xmm0, -16(%rax)
        vpextrd $3, %xmm0, -8(%rax)

so it can combine extract and store, with SSE2 we get

        pshufd  $85, %xmm0, %xmm1
        movd    %xmm0, -32(%rax)
        movd    %xmm1, -24(%rax)
        movdqa  %xmm0, %xmm1
        punpckhdq       %xmm0, %xmm1
        pshufd  $255, %xmm0, %xmm0
        movd    %xmm1, -16(%rax)
        movd    %xmm0, -8(%rax)

which is even worse than expected ;)

As usual the cost model isn't target aware enough here (and it errs on the
conservative side here)
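
For readers following the assembly, here is a scalar C model of what one vector iteration appears to compute. It is reconstructed from the instruction sequence and the `_10 = _9 + 7` statement in the dump, not taken from the bug's testcase: the function name, the 32-bit element type, and the assumption that `%xmm2` holds the constant 7 in every lane are all mine.

```c
#include <stdint.h>

enum { VF = 4, STRIDE = 2, ADDEND = 7 };

/* Hypothetical scalar model of one vector iteration.
   It updates p[0], p[2], p[4] and p[6] and leaves the odd elements alone.  */
void strided_add_iteration(int32_t *p)
{
    int32_t lanes[VF];

    /* Two 16-byte loads plus shufps $136 keep the even-indexed elements.  */
    for (int k = 0; k < VF; k++)
        lanes[k] = p[k * STRIDE];

    /* paddd: add the constant to all four lanes.  */
    for (int k = 0; k < VF; k++)
        lanes[k] += ADDEND;

    /* Strided store: every lane is extracted and written on its own,
       8 bytes apart, which is where the 4 extracts + 4 stores come from.  */
    for (int k = 0; k < VF; k++)
        p[k * STRIDE] = lanes[k];
}
```

The last loop is the part the two code sequences differ on: per the comment, AVX folds each extract into its store with `vpextrd`, while SSE2 needs separate shuffles before each `movd`.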