https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123163

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
*_3.x 1 times scalar_load costs 12 in epilogue
_4 + 18446744073709551608 1 times scalar_stmt costs 4 in epilogue
_5 1 times scalar_store costs 12 in epilogue
<unknown> 1 times cond_branch_taken costs 16 in epilogue
t.c:7:23: note:  Cost model analysis:
  Vector inside of loop cost: 64
  Vector prologue cost: 12
  Vector epilogue cost: 44
  Scalar iteration cost: 28
  Scalar outside cost: 32
  Vector outside cost: 56
  prologue iterations: 0
  epilogue iterations: 1
t.c:7:23: missed:  cost model: the vector iteration cost = 64 divided by the
scalar iteration cost = 28 is greater or equal to the vectorization factor = 2.
t.c:7:23: missed:  not vectorized: vectorization not profitable.
t.c:7:23: missed:  not vectorized: vector version will never be profitable.
t.c:7:23: missed:  Loop costings may not be worthwhile.
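
As a sanity check, the arithmetic in the dump can be replayed by hand. The following is only a sketch: the numbers are copied from the dump above, while the helper names are made up for illustration and the profitability test is a simplified reading of the "missed" diagnostic, not GCC's actual implementation.

```c
#include <stdbool.h>

/* Costs copied from the dump above. */
enum {
  VEC_INSIDE_COST   = 64,
  VEC_PROLOGUE_COST = 12,
  VEC_EPILOGUE_COST = 44,
  SCALAR_ITER_COST  = 28,
  VECT_FACTOR       = 2
};

/* Sum of the four epilogue statements listed in the dump:
   scalar_load + scalar_stmt + scalar_store + cond_branch_taken.  */
static int epilogue_cost(void)
{
  return 12 + 4 + 12 + 16;
}

/* "Vector outside cost" is the prologue plus epilogue cost.  */
static int vector_outside_cost(void)
{
  return VEC_PROLOGUE_COST + VEC_EPILOGUE_COST;
}

/* Simplified reading of the diagnostic (hypothetical helper): if one
   vector iteration costs at least as much as VF scalar iterations,
   the vector loop can never catch up, whatever the trip count.  */
static bool never_profitable(int vec_inside, int scalar_iter, int vf)
{
  return vec_inside / scalar_iter >= vf;
}
```

With the values above, 64 / 28 is 2 in integer arithmetic, which is not below the vectorization factor of 2, hence "vector version will never be profitable".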

The issue is that the p[i].x are not contiguous; there's 'next' in between.
With just x86-64-v2, aka SSE, there's no benefit to performing scalar loads
of two pointers, composing a vector, subtracting 8, and decomposing for the
scalar stores.  You'd get

.L4:
        movdqu  (%rax), %xmm0
        pinsrq  $1, 16(%rax), %xmm0
        addq    $32, %rax
        paddq   %xmm1, %xmm0
        movq    %xmm0, -32(%rax)
        pextrq  $1, %xmm0, -16(%rax)
        cmpq    %rax, %rdx
        jne     .L4

Even with x86-64-v3 (aka AVX2) you get

.L4:
        vmovdqu (%rax), %ymm0
        vpunpcklqdq     32(%rax), %ymm0, %ymm0
        addq    $64, %rax
        vpermq  $216, %ymm0, %ymm0
        vpaddq  %ymm2, %ymm0, %ymm0
        vmovq   %xmm0, -64(%rax)
        vpextrq $1, %xmm0, -48(%rax)
        vextracti128    $0x1, %ymm0, %xmm0
        vmovq   %xmm0, -32(%rax)
        vpextrq $1, %xmm0, -16(%rax)
        cmpq    %rcx, %rax
        jne     .L4

and that's not deemed profitable either.
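
For reference, a loop of roughly the following shape produces this kind of strided access. This is a hypothetical reconstruction, not the testcase from the report: the struct name, the function name and the field types are assumptions, chosen only so that each element is 16 bytes on x86-64 and the 'x' fields sit at byte offsets 0, 16, 32, ... as in the assembly above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: an 8-byte 'x' followed by an 8-byte 'next' pointer,
   so consecutive p[i].x are 16 bytes apart, not adjacent.  */
struct node {
  uint64_t x;
  struct node *next;
};

/* Hypothetical loop: only 'x' is touched, 'next' is skipped over.
   A vectorizer has to gather every other 8-byte slot into a vector,
   do one add, and scatter the lanes back with scalar stores.  */
void sub8(struct node *p, size_t n)
{
  for (size_t i = 0; i < n; i++)
    p[i].x -= 8;
}
```

The single vector add saves very little compared to the lane insert and extract instructions needed around it, which is what the cost model is reflecting.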

For 'baz' the issue is indeed that with N == 16 you get all loops unrolled
and the vec[] temporary array elided, so it is the same issue as above.

So IMO it all works as intended?
