https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25621
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
On trunk only the cost model prevents vectorization of the s32 loop now
(with generic tuning/arch).  With core-avx2 I get for both innermost loops

.L6:
        addl    $1, %r10d
        vmovapd (%rbx,%r8), %ymm3
        vfmadd231pd     (%rax,%r8), %ymm3, %ymm0
        addq    $32, %r8
        cmpl    %r12d, %r10d
        jb      .L6
...
.L26:
        addl    $1, %ecx
        vmovupd (%rdi,%rax), %ymm4
        vfmadd231pd     (%rsi,%rax), %ymm4, %ymm0
        addq    $32, %rax
        cmpl    %r8d, %ecx
        jb      .L26
...

with only the reduction after it varying.  With forcing avx128 the s32 loop
isn't vectorized (cost model again):

t.f90:22:0: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 12
  Scalar iteration cost: 8
  Scalar outside cost: 6
  Vector outside cost: 20
  prologue iterations: 0
  epilogue iterations: 1
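
One way to read the avx128 numbers above is sketched below. It is only a rough illustration, not the vectorizer's actual profitability computation. The cost figures are copied from the dump; the vectorization factor of 2 (two doubles per 128-bit vector) is an assumption that the dump does not state, and the function names are made up for the example. Under that assumption one vector iteration costs as much as the two scalar iterations it replaces, so the higher outside cost can never be recovered.

```python
# Figures copied from the cost model dump above.
VEC_INSIDE_COST = 16
VEC_OUTSIDE_COST = 20
SCALAR_ITER_COST = 8
SCALAR_OUTSIDE_COST = 6

# Assumption (not in the dump): two doubles fit in one 128-bit vector.
ASSUMED_VF = 2


def per_vector_iteration_gain(vec_inside, scalar_iter, vf):
    """Cost units saved by one vector iteration versus vf scalar iterations."""
    return scalar_iter * vf - vec_inside


def simple_total_costs(n_iters, vf=ASSUMED_VF):
    """Return (scalar_total, vector_total) under a simplified model.

    Simplification: n_iters is a multiple of vf, and all prologue and
    epilogue work is already folded into the outside costs.
    """
    scalar_total = SCALAR_OUTSIDE_COST + SCALAR_ITER_COST * n_iters
    vector_total = VEC_OUTSIDE_COST + VEC_INSIDE_COST * (n_iters // vf)
    return scalar_total, vector_total


if __name__ == "__main__":
    gain = per_vector_iteration_gain(VEC_INSIDE_COST, SCALAR_ITER_COST, ASSUMED_VF)
    print("gain per vector iteration:", gain)
    for n in (2, 100, 1000):
        scalar_total, vector_total = simple_total_costs(n)
        print(n, "iterations: scalar", scalar_total, "vector", vector_total)
```

With a per-iteration gain of zero, the vector version stays behind by the difference in outside costs regardless of the trip count, which matches the loop being rejected.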