[Bug tree-optimization/54939] Very poor vectorization of loops with complex arithmetic

rguenth at gcc dot gnu.org Wed, 08 Jun 2016 06:46:44 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939


--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we would now vectorize this, but the cost model rates the SLP vectorization
as never profitable.  Generated code with -fno-vect-cost-model (-Ofast
-march=corei7):

.L4:
        movupd  (%rdx,%rax), %xmm0
        movapd  %xmm3, %xmm1
        mulpd   %xmm0, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm2, %xmm0
        addsubpd        %xmm0, %xmm1
        movupd  (%rcx,%rax), %xmm0
        addpd   %xmm0, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L4

t.f:6:0: note: Cost model analysis:
  Vector inside of loop cost: 15
  Vector prologue cost: 10
  Vector epilogue cost: 0
  Scalar iteration cost: 14
  Scalar outside cost: 6
  Vector outside cost: 10
  prologue iterations: 0
  epilogue iterations: 0
t.f:6:0: note: cost model: the vector iteration cost = 15 divided by the scalar
iteration cost = 14 is greater or equal to the vectorization factor = 1.
t.f:6:0: note: not vectorized: vectorization not profitable.
t.f:6:0: note: not vectorized: vector version will never be profitable.

This is rated never profitable because it is basically equivalent to
basic-block vectorizing the loop body (the vectorization factor is 1).  Note
that the cost model pessimistically treats addsubpd as if it were not
available, i.e. as if the code would really end up as

  vect__31.16_42 = vect__27.10_49 - vect__28.15_43;
  vect__31.17_41 = vect__27.10_49 + vect__28.15_43;
  _40 = VEC_PERM_EXPR <vect__31.16_42, vect__31.17_41, { 0, 3 }>;

which is how the vectorizer represents and costs it (two vector_stmt plus one
vec_perm).  The x86 vectorizer cost model would need to be adjusted to
account for addsubpd.
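
To make the equivalence concrete, here is a minimal C sketch of the three
statements above next to the lane-wise behaviour of addsubpd (subtract in the
low lane, add in the high lane).  The v2df struct and all function names are
hypothetical illustrations, not GCC internals; the permutation follows the
usual VEC_PERM_EXPR convention that indices 0..1 select from the first
operand and 2..3 from the second.

```c
#include <assert.h>

/* Hypothetical 2-lane double vector, standing in for an xmm register. */
typedef struct { double lane[2]; } v2df;

static v2df v2_sub(v2df a, v2df b)
{
    return (v2df){ { a.lane[0] - b.lane[0], a.lane[1] - b.lane[1] } };
}

static v2df v2_add(v2df a, v2df b)
{
    return (v2df){ { a.lane[0] + b.lane[0], a.lane[1] + b.lane[1] } };
}

/* Models VEC_PERM_EXPR <a, b, { s0, s1 }>: selector values 0..1 pick
   lanes of a, values 2..3 pick lanes of b. */
static v2df v2_perm(v2df a, v2df b, int s0, int s1)
{
    assert(s0 >= 0 && s0 < 4 && s1 >= 0 && s1 < 4);
    double cat[4] = { a.lane[0], a.lane[1], b.lane[0], b.lane[1] };
    return (v2df){ { cat[s0], cat[s1] } };
}

/* What the vectorizer costs: one subtract, one add, one permute. */
v2df addsub_via_perm(v2df a, v2df b)
{
    return v2_perm(v2_sub(a, b), v2_add(a, b), 0, 3);
}

/* What a single addsubpd computes: low lane a - b, high lane a + b. */
v2df addsub_direct(v2df a, v2df b)
{
    return (v2df){ { a.lane[0] - b.lane[0], a.lane[1] + b.lane[1] } };
}
```

Both functions return the same vector for any input, which is why counting
two vector_stmt plus one vec_perm overestimates the cost of what ends up as
one instruction.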

With the cost model enabled we fall back to vectorization using interleaving:

.L5:
        movupd  (%rax), %xmm5
        addl    $1, %r9d
        addq    $32, %rax
        addq    $32, %r8
        movupd  -32(%r8), %xmm1
        movupd  -16(%r8), %xmm0
        movapd  %xmm5, %xmm2
        movapd  %xmm1, %xmm6
        movupd  -16(%rax), %xmm9
        unpckhpd        %xmm0, %xmm1
        unpcklpd        %xmm0, %xmm6
        movapd  %xmm6, %xmm0
        mulpd   %xmm8, %xmm0
        unpcklpd        %xmm9, %xmm2
        unpckhpd        %xmm9, %xmm5
        mulpd   %xmm7, %xmm6
        addpd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        mulpd   %xmm7, %xmm0
        mulpd   %xmm8, %xmm1
        subpd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        addpd   %xmm6, %xmm0
        movapd  %xmm2, %xmm1
        addpd   %xmm5, %xmm0
        unpcklpd        %xmm0, %xmm1
        unpckhpd        %xmm0, %xmm2
        movups  %xmm1, -32(%rax)
        movups  %xmm2, -16(%rax)
        cmpl    %esi, %r9d
        jb      .L5
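
For readers who prefer source to assembly, the interleaving strategy can be
sketched in C roughly as follows.  This assumes the kernel is a complex
y(i) = y(i) + a*x(i) update, which is inferred from the assembly and not
taken from the t.f testcase; the function name and the array layout
({re0, im0, re1, im1, ...}) are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] += a * x[i] on n complex doubles stored
   interleaved as {re0, im0, re1, im1, ...}. */
void zaxpy_interleaving(size_t n, double a_re, double a_im,
                        const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;

    /* Main loop: two complex elements (32 bytes) per iteration. */
    for (; i + 2 <= n; i += 2) {
        /* De-interleave into real and imaginary lanes (the unpcklpd /
           unpckhpd steps in the listing). */
        double xr[2] = { x[2 * i],     x[2 * i + 2] };
        double xi[2] = { x[2 * i + 1], x[2 * i + 3] };
        double yr[2] = { y[2 * i],     y[2 * i + 2] };
        double yi[2] = { y[2 * i + 1], y[2 * i + 3] };

        /* Lane-wise complex multiply-accumulate. */
        for (int l = 0; l < 2; l++) {
            yr[l] += a_re * xr[l] - a_im * xi[l];
            yi[l] += a_re * xi[l] + a_im * xr[l];
        }

        /* Re-interleave before storing. */
        for (int l = 0; l < 2; l++) {
            y[2 * (i + l)]     = yr[l];
            y[2 * (i + l) + 1] = yi[l];
        }
    }

    /* Scalar epilogue for an odd trailing element. */
    for (; i < n; i++) {
        double re = a_re * x[2 * i] - a_im * x[2 * i + 1];
        double im = a_re * x[2 * i + 1] + a_im * x[2 * i];
        y[2 * i]     += re;
        y[2 * i + 1] += im;
    }
}
```

The extra shuffles on both the load and the store side are what the SLP
variant avoids by keeping each complex number in a single register.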

t.f:6:0: note: Cost model analysis:
  Vector inside of loop cost: 26
  Vector prologue cost: 10
  Vector epilogue cost: 14
  Scalar iteration cost: 14
  Scalar outside cost: 6
  Vector outside cost: 24
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 6
t.f:6:0: note:   Runtime profitability threshold = 5
t.f:6:0: note:   Static estimate profitability threshold = 16
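
The numbers in the two cost model dumps can be reproduced with a small
break-even calculation.  The following C sketch is a simplified model and
not the vectorizer's actual source: the struct, the function names and the
rounding step are assumptions, clamping of the result is omitted, and the
vectorization factor of 2 for the interleaved loop is inferred from the
32-byte stride in the listing, since the dump does not print it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the figures printed in the dump. */
typedef struct {
    int vec_inside_cost;
    int vec_outside_cost;
    int scalar_iter_cost;
    int scalar_outside_cost;
    int prologue_iters;
    int epilogue_iters;
    int vf;                 /* vectorization factor */
} loop_costs;

/* "vector iteration cost divided by scalar iteration cost is greater or
   equal to the vectorization factor", rearranged to avoid the division. */
bool never_profitable(const loop_costs *c)
{
    assert(c->vf > 0 && c->scalar_iter_cost > 0);
    return c->vec_inside_cost >= c->scalar_iter_cost * c->vf;
}

/* Smallest iteration count at which the vector loop is estimated to win. */
int min_profitable_iters(const loop_costs *c)
{
    assert(!never_profitable(c));
    int saved    = c->scalar_iter_cost * c->vf - c->vec_inside_cost;
    int overhead = (c->vec_outside_cost - c->scalar_outside_cost) * c->vf;
    int peel     = c->vec_inside_cost
                   * (c->prologue_iters + c->epilogue_iters);
    int n = (overhead - peel) / saved;

    /* Bump by one if the scalar loop is still no more expensive at n. */
    if (c->scalar_iter_cost * c->vf * n
        <= c->vec_inside_cost * n + overhead)
        n++;
    return n;
}
```

With the SLP figures (15, 10, 14, 6, 0, 0, vf = 1) never_profitable holds
because 15 >= 14 * 1.  With the interleaving figures (26, 24, 14, 6, 0, 1,
vf = 2) the sketch gives (36 - 26) / 2 = 5, bumped to 6, which matches the
calculated minimum in the dump; the runtime threshold of 5 is one less than
that.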



Thus this is now a cost model issue.
