https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86_64 core-avx2 we get

t.c:18:3: note: Cost model analysis:
  Vector inside of loop cost: 9
  Vector prologue cost: 7
  Vector epilogue cost: 3
  Scalar iteration cost: 3
  Scalar outside cost: 6
  Vector outside cost: 10
  prologue iterations: 0
  epilogue iterations: 1
t.c:18:3: note: cost model: the vector iteration cost = 9 divided by the scalar
iteration cost = 3 is greater or equal to the vectorization factor = 2.
t.c:18:3: note: not vectorized: vectorization not profitable.

forcing avx128 and no cost model we'd get

.L4:
        vmovdqu (%rax), %xmm0
        vpunpcklqdq     16(%rax), %xmm0, %xmm0
        addl    $1, %ecx
        addq    $32, %rax
        vpxor   %xmm1, %xmm0, %xmm0
        vmovq   %xmm0, -32(%rax)
        vpextrq $1, %xmm0, -16(%rax)
        cmpl    %r9d, %ecx
        jb      .L4

vs.

.L3:
        movslq  %edx, %rax
        addl    $1, %edx
        salq    $4, %rax
        xorq    %rdi, 8(%rsi,%rax)
        cmpl    %r8d, %edx
        jge     .L7

note that one of the issues with the scalar store cost model is that it
re-uses vec_to_scalar which was originally meant to be only used for
vector reduction result to scalar reg cost (aka zero on x86_64).  We failed
to add a vec_extract_element "simple" cost.

The avx256 code looks like

.L4:
        vmovdqu (%rdx), %ymm0
        vpunpcklqdq     32(%rdx), %ymm0, %ymm0
        addl    $1, %esi
        addq    $64, %rdx
        vpermq  $216, %ymm0, %ymm0
        vpxor   %ymm2, %ymm0, %ymm0
        vmovq   %xmm0, -64(%rdx)
        vpextrq $1, %xmm0, -48(%rdx)
        vextracti128    $0x1, %ymm0, %xmm0
        vmovq   %xmm0, -32(%rdx)
        vpextrq $1, %xmm0, -16(%rdx)
        cmpl    %r9d, %esi
        jb      .L4

given x86_64 can successfully cost-model this (reject the vectorization)
this is a target issue.
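
For readers unfamiliar with the dump, the rejection quoted in the "cost model" note comes down to a single comparison. The following is a minimal sketch of that comparison only, not GCC's actual implementation; the function name vector_body_profitable and its parameter names are hypothetical and chosen to mirror the labels in the dump.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not a GCC function): returns true when one vector
   iteration is cheaper than the vf scalar iterations it replaces.

   The dump phrases the rejection as
     "vector iteration cost / scalar iteration cost >= vectorization factor".
   For positive costs that is the same as
     vec_inside_cost >= scalar_iter_cost * vf,
   which is used here to avoid integer-division rounding.  */
static bool
vector_body_profitable (int vec_inside_cost, int scalar_iter_cost, int vf)
{
  /* Assumption of this sketch: costs and the factor are positive.  */
  assert (vec_inside_cost > 0 && scalar_iter_cost > 0 && vf > 0);

  return vec_inside_cost < scalar_iter_cost * vf;
}
```

With the numbers from the dump (vector inside of loop cost 9, scalar iteration cost 3, vectorization factor 2), the test is 9 < 3 * 2, which is false, so the loop is rejected as "not profitable". This sketch covers only the per-iteration comparison and ignores the prologue, epilogue and outside costs that the dump also lists.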