[Bug target/39821] 120% slowdown with vectorizer

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 27 Jul 2021 00:24:16 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39821


--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
0x398f310 _2 * _4 1 times scalar_stmt costs 12 in body
...
0x392b3f0 _1 w* _3 2 times vec_promote_demote costs 8 in body
...
t4.c:4:12: note:  Cost model analysis:
  Vector inside of loop cost: 40
  Vector prologue cost: 4
  Vector epilogue cost: 108
  Scalar iteration cost: 40
  Scalar outside cost: 32
  Vector outside cost: 112
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3

so clearly the widening multiplication is not costed correctly.  With SSE 4.2
we can do better:

.L4:
        movdqu  (%rcx,%rax), %xmm0
        movdqu  (%rsi,%rax), %xmm1
        addq    $16, %rax
        movdqa  %xmm0, %xmm3
        movdqa  %xmm1, %xmm4
        punpckldq       %xmm0, %xmm3
        punpckldq       %xmm1, %xmm4
        punpckhdq       %xmm0, %xmm0
        pmuldq  %xmm4, %xmm3
        punpckhdq       %xmm1, %xmm1
        pmuldq  %xmm1, %xmm0
        paddq   %xmm3, %xmm2
        paddq   %xmm0, %xmm2
        cmpq    %rdi, %rax
        jne     .L4

but even there the costing is imprecise.  The vectorizer unhelpfully
categorizes the widening multiply as vec_promote_demote, so costing never
reaches

        case MULT_EXPR:
        case WIDEN_MULT_EXPR:
        case MULT_HIGHPART_EXPR:
          stmt_cost = ix86_multiplication_cost (ix86_cost, mode);
          break;

fixing that yields

0x392b3f0 _1 w* _3 2 times vector_stmt costs 136 in body

for SSE2, SSE4.2 and AVX2 alike, so the cost is now over-estimated, via

      /* V*DImode is emulated with 5-8 insns.  */
      else if (mode == V2DImode || mode == V4DImode)
        {
          if (TARGET_XOP && mode == V2DImode)
            return ix86_vec_cost (mode, cost->mulss * 2 + cost->sse_op * 3);
          else
            return ix86_vec_cost (mode, cost->mulss * 3 + cost->sse_op * 5);
        }

with cost->mulss == 16.  I suppose the costing is somehow failing to realize
it's doing a widening multiply.
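
As a rough check of where the 136 comes from, the non-XOP branch quoted above
can be evaluated by hand.  This is only a sketch: the comment gives
cost->mulss == 16, while sse_op == 4 is an assumption (picked because it
reproduces the number in the dump), and ix86_vec_cost is assumed to return its
argument unchanged for this mode.  The helper names are hypothetical and not
part of GCC.

```c
/* Assumed cost-table values.  MULSS is stated in the comment above;
   SSE_OP is an assumption, not taken from the dump.  */
enum { MULSS = 16, SSE_OP = 4 };

/* Hypothetical helper mirroring the non-XOP V2DImode/V4DImode branch:
   cost->mulss * 3 + cost->sse_op * 5.  Assumes ix86_vec_cost applies
   no further scaling for the mode.  */
static int emulated_vdi_mult_cost(int mulss, int sse_op)
{
    return mulss * 3 + sse_op * 5;
}

/* The dump line reads "2 times vector_stmt", so the per-statement cost
   is counted twice in the loop body.  */
static int widen_mult_body_cost(int mulss, int sse_op, int count)
{
    return count * emulated_vdi_mult_cost(mulss, sse_op);
}
```

Under those assumptions emulated_vdi_mult_cost (MULSS, SSE_OP) is
16 * 3 + 4 * 5 = 68 and widen_mult_body_cost (MULSS, SSE_OP, 2) is 136,
matching the vector_stmt line above.  The SSE 4.2 loop shown earlier needs only
two pmuldq plus a few unpacks for the same work, which is why the emulated
V*DImode figure looks too high for a widening multiply.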
