[Bug other/71414] 2x slower than clang summing small float array

2016-06-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414 --- Comment #4 from Yichao Yu --- The C code is in the linked gist. `a` is a cacheline-aligned pointer and `n` is 1024, so `a` should easily fit in L1d, which is 32kB on both processors I benchmarked. More precise timing (ns per loop) 6700K ``` %

[Bug other/71414] 2x slower than clang summing small float array

2016-06-06 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414 Richard Biener changed:

What    |Removed |Added
Keywords|        |missed-optimization
CC      |

[Bug other/71414] 2x slower than clang summing small float array

2016-06-06 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414 --- Comment #2 from Marc Glisse ---
(In reply to Richard Biener from comment #1)
> The core loop is
>
> .L8:
>         addq    $1, %rdx
>         vaddps  (%r8), %ymm1, %ymm1
>         addq    $32, %r8
>         cmpq    %rdx, %rcx
>         ja

[Bug other/71414] 2x slower than clang summing small float array

2016-06-06 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414 --- Comment #1 from Richard Biener ---
The core loop is

.L8:
        addq    $1, %rdx
        vaddps  (%r8), %ymm1, %ymm1
        addq    $32, %r8
        cmpq    %rdx, %rcx
        ja      .L8

which compared to LLVM is not unrolled. You can u
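For readers following along: the loop quoted in Comment #1 issues a single vaddps into %ymm1 per iteration, so every add depends on the result of the previous one. The sketch below illustrates, in scalar C, the general idea behind unrolling a reduction with several independent accumulators. It is not the code from the gist and not what either compiler emits; the function names `sum_simple` and `sum_unrolled4` and the unroll factor of 4 are assumptions made for illustration only.

```c
#include <stddef.h>

/* Plain reduction: one accumulator, so each addition has to wait
 * for the previous addition to finish (a single dependency chain). */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical hand-unrolled variant (illustration only): four
 * independent accumulators give four dependency chains, so additions
 * from different chains can overlap in the pipeline. */
float sum_unrolled4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: consume four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the partial sums, then handle the remaining 0..3 elements. */
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Note that the second version adds the elements in a different order than the first, so the two results can differ in rounding for general floating-point input.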