[Bug target/79262] [6/7 Regression] load gap with store gap causing performance regression in 462.libquantum

rguenth at gcc dot gnu.org Mon, 30 Jan 2017 01:46:19 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79262


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |target

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86_64 core-avx2 we get

t.c:18:3: note: Cost model analysis:
  Vector inside of loop cost: 9
  Vector prologue cost: 7
  Vector epilogue cost: 3
  Scalar iteration cost: 3
  Scalar outside cost: 6
  Vector outside cost: 10
  prologue iterations: 0
  epilogue iterations: 1
t.c:18:3: note: cost model: the vector iteration cost = 9 divided by the scalar
iteration cost = 3 is greater or equal to the vectorization factor = 2.
t.c:18:3: note: not vectorized: vectorization not profitable.
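
The comparison reported in the dump above can be sketched as follows. This is only an illustration of the arithmetic, not GCC's actual code. The function and parameter names are hypothetical, and the test is written as a multiplication rather than a division to avoid integer truncation.

```c
#include <stdbool.h>

/* Sketch only, not GCC's implementation: one vector iteration replaces
   vf scalar iterations, so it can only pay off if it costs less than
   vf scalar iterations taken together. */
bool vector_body_can_pay_off(int vec_inside_cost,
                             int scalar_iter_cost,
                             int vf)
{
    return scalar_iter_cost * vf > vec_inside_cost;
}
```

With the values from the dump (vector inside cost 9, scalar iteration cost 3, vectorization factor 2), 3 * 2 = 6 is not greater than 9, so the check fails, which matches the "not vectorized" note.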

Forcing avx128 and disabling the cost model, we'd get

.L4:
        vmovdqu (%rax), %xmm0
        vpunpcklqdq     16(%rax), %xmm0, %xmm0
        addl    $1, %ecx
        addq    $32, %rax
        vpxor   %xmm1, %xmm0, %xmm0
        vmovq   %xmm0, -32(%rax)
        vpextrq $1, %xmm0, -16(%rax)
        cmpl    %r9d, %ecx
        jb      .L4

vs.

.L3:
        movslq  %edx, %rax
        addl    $1, %edx
        salq    $4, %rax
        xorq    %rdi, 8(%rsi,%rax)
        cmpl    %r8d, %edx
        jge     .L7
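
For orientation, the scalar loop above is consistent with source of roughly the following shape. t.c is not quoted in this comment, so the struct layout and all names below are assumptions inferred from the 16-byte stride and the 8-byte offset in the xorq instruction.

```c
#include <stdint.h>

/* Hypothetical 16-byte element. Only the second 8-byte field is read and
   written, so both the loads and the stores skip over the first field
   (the "load gap" and "store gap" in the bug title). */
struct node {
    uint64_t pad;    /* untouched field at offset 0 */
    uint64_t state;  /* field at offset 8 that the loop updates */
};

/* XOR a loop-invariant mask into the state field of every element. */
void toggle_bits(struct node *nodes, int n, uint64_t mask)
{
    for (int i = 0; i < n; i++)
        nodes[i].state ^= mask;
}
```

Because only every other 8-byte slot is touched, a vectorized version has to gather the used fields into a vector and then scatter them back element by element, which is what the vpunpcklqdq and vmovq/vpextrq sequences in the vector listings do.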

Note that one issue with the scalar store cost model is that it re-uses
vec_to_scalar, which was originally meant only for the cost of moving a vector
reduction result to a scalar register (aka zero on x86_64).  We failed to add a
"simple" vec_extract_element cost.

The avx256 code looks like

.L4:
        vmovdqu (%rdx), %ymm0
        vpunpcklqdq     32(%rdx), %ymm0, %ymm0
        addl    $1, %esi
        addq    $64, %rdx
        vpermq  $216, %ymm0, %ymm0
        vpxor   %ymm2, %ymm0, %ymm0
        vmovq   %xmm0, -64(%rdx)
        vpextrq $1, %xmm0, -48(%rdx)
        vextracti128    $0x1, %ymm0, %xmm0
        vmovq   %xmm0, -32(%rdx)
        vpextrq $1, %xmm0, -16(%rdx)
        cmpl    %r9d, %esi
        jb      .L4

Given that x86_64 can successfully cost-model this (and reject the
vectorization), this is a target issue.

[Bug target/79262] [6/7 Regression] load gap with store gap causing performance regression in 462.libquantum
