https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The core loop is
.L8:
	addq	$1, %rdx		# counter IV
	vaddps	(%r8), %ymm1, %ymm1	# accumulate 8 floats
	addq	$32, %r8		# pointer IV, advanced by 32 bytes
	cmpq	%rdx, %rcx
	ja	.L8			# loop while counter < trip count
which, unlike the code LLVM generates, is not unrolled. You can use
-funroll-loops to force unrolling, which probably fixes the performance
difference relative to LLVM. For a loop as short as this one I also
suspect the induction variable choice is not optimal: the loop increments
both a counter (%rdx) and a pointer (%r8) where a single IV would do.

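For reference, a minimal sketch of the kind of reduction that produces a
vaddps loop like the one above (my own reconstruction, not necessarily the
testcase attached to this PR; the function name and the exact flags are
assumptions):

#include <stddef.h>

/* Sum of n floats; GCC vectorizes this into an AVX vaddps loop once
   float reassociation is allowed. */
float
sum_floats (const float *a, size_t n)
{
  float s = 0.0f;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* For example:
     gcc -O3 -mavx -ffast-math sum.c                  vectorized, not unrolled
     gcc -O3 -mavx -ffast-math -funroll-loops sum.c   vectorized and unrolled */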