https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
--- Comment #4 from Yichao Yu ---
The C code is in the linked gist. `a` is a cacheline-aligned pointer and `n` is
1024, so `a` should even fit in L1d, which is 32kB on both processors I
benchmarked.
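For readers without the gist at hand, here is a rough sketch of the kind of setup described above. It is not the code from the gist. It assumes the elements are `float` (the `vaddps` in the generated loop operates on packed single-precision values), a 64-byte cache line, and a plain sum reduction as the kernel. The names `alloc_aligned_floats`, `sum_floats`, `N_FLOATS` and `CACHELINE` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed sizes: n = 1024 as stated above; 64-byte cache lines are an assumption. */
enum { N_FLOATS = 1024, CACHELINE = 64 };

/* Hypothetical helper: allocate n floats starting on a cache-line boundary. */
static float *alloc_aligned_floats(size_t n)
{
    /* C11 aligned_alloc wants the size to be a multiple of the alignment,
       so round the byte count up to the next cache line. */
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + CACHELINE - 1) / CACHELINE * CACHELINE;
    return aligned_alloc(CACHELINE, rounded);
}

/* Assumed kernel: a plain float sum over a[0..n).
   Whether this matches the gist exactly is a guess. */
static float sum_floats(const float *a, size_t n)
{
    /* Precondition used in this sketch: a is cache-line aligned. */
    assert((uintptr_t)a % CACHELINE == 0);

    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Under these assumptions the working set is 1024 * 4 = 4096 bytes, far below the 32kB L1d, so the timings should reflect the loop itself rather than cache misses. Note that a compiler may only vectorize this reduction when floating-point reassociation is allowed (for example with `-ffast-math` or `-Ofast` in GCC).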
More precise timing (ns per loop)
6700K
```
%
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
Richard Biener changed:
What    |Removed |Added
Keywords|        |missed-optimization
CC|
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
--- Comment #2 from Marc Glisse ---
(In reply to Richard Biener from comment #1)
> The core loop is
>
> .L8:
> addq $1, %rdx
> vaddps (%r8), %ymm1, %ymm1
> addq $32, %r8
> cmpq %rdx, %rcx
> ja
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414
--- Comment #1 from Richard Biener ---
The core loop is
.L8:
addq $1, %rdx
vaddps (%r8), %ymm1, %ymm1
addq $32, %r8
cmpq %rdx, %rcx
ja .L8
which compared to LLVM is not unrolled. You can u