https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621

--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk, only the cost model now prevents vectorization of the s32 loop (with
generic tuning/arch).  With core-avx2 I get the following for both innermost loops

.L6:
        addl    $1, %r10d
        vmovapd (%rbx,%r8), %ymm3
        vfmadd231pd     (%rax,%r8), %ymm3, %ymm0
        addq    $32, %r8
        cmpl    %r12d, %r10d
        jb      .L6
...

.L26:
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %ymm4
        vfmadd231pd     (%rsi,%rax), %ymm4, %ymm0
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L26
...

with only the reduction code after each loop differing.  When forcing avx128, the
s32 loop isn't vectorized (cost model again):

t.f90:22:0: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 12
  Scalar iteration cost: 8
  Scalar outside cost: 6
  Vector outside cost: 20
  prologue iterations: 0
  epilogue iterations: 1
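
The figures in this dump can be checked by hand. The sketch below is a simplified reading of the numbers, not GCC's actual profitability code. It assumes a vectorization factor of 2 (128-bit vectors holding two doubles), which the dump does not state, and the helper names are hypothetical.

```python
# Numbers copied from the cost model dump above.
vec_inside_cost = 16      # Vector inside of loop cost
vec_prologue_cost = 8     # Vector prologue cost
vec_epilogue_cost = 12    # Vector epilogue cost
scalar_iter_cost = 8      # Scalar iteration cost

# Assumption (not in the dump): avx128 with doubles gives 2 lanes per vector.
vf = 2


def vector_outside_cost(prologue, epilogue):
    """Cost paid once outside the vector loop body."""
    return prologue + epilogue


def per_iteration_gain(scalar_iter, vec_inside, lanes):
    """Cost saved by one vector iteration versus `lanes` scalar iterations."""
    return scalar_iter * lanes - vec_inside


outside = vector_outside_cost(vec_prologue_cost, vec_epilogue_cost)
gain = per_iteration_gain(scalar_iter_cost, vec_inside_cost, vf)
profitable = gain > 0

print(outside, gain, profitable)  # prints: 20 0 False
```

Under that assumption, one vector iteration costs as much as the two scalar iterations it replaces, so the vector loop could never recover its outside cost of 20. That would be consistent with the cost model rejecting the loop.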
