https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The core loop is
.L8:
	addq	$1, %rdx		# counter IV
	vaddps	(%r8), %ymm1, %ymm1	# accumulate 8 floats
	addq	$32, %r8		# pointer IV, advanced by 32 bytes
	cmpq	%rdx, %rcx
	ja	.L8			# loop while counter < trip count
which, unlike the code LLVM generates, is not unrolled. You can use
-funroll-loops to force unrolling, which probably fixes the performance
difference relative to LLVM. For a loop as short as this one I also
suspect the induction variable choice is not optimal: the loop increments
both a counter (%rdx) and a pointer (%r8) where a single IV would do.

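For reference, a minimal sketch of the kind of reduction that produces a
vaddps loop like the one above (my own reconstruction, not necessarily the
testcase attached to this PR; the function name and the exact flags are
assumptions):

#include <stddef.h>

/* Sum of n floats; GCC vectorizes this into an AVX vaddps loop once
   float reassociation is allowed. */
float
sum_floats (const float *a, size_t n)
{
  float s = 0.0f;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* For example:
     gcc -O3 -mavx -ffast-math sum.c                  vectorized, not unrolled
     gcc -O3 -mavx -ffast-math -funroll-loops sum.c   vectorized and unrolled */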