https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117875
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |tree-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 530K of event 'cycles:Pu', Event count (approx.): 680879110118
Overhead  Samples  Command          Shared Object                  Symbol
  51.45%   273953  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi
  38.49%   202968  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi

     71 │4135c0┌─  vmovd   (%r11,%rdi,4),%xmm3
   1361 │4135c6│   vpaddd  %xmm3,%xmm0,%xmm0
  29411 │4135ca│   mov     %rdi,%r8
     15 │4135cd│   vmovd   %xmm0,0x4(%rdx,%rdi,4)
   5826 │4135d3│   vmovd   (%rax,%rdi,4),%xmm4
    981 │4135d8│   vmovd   (%r10,%rdi,4),%xmm3
    725 │4135de│   vpaddd  %xmm3,%xmm4,%xmm3
   3186 │4135e2│   vmovdqa 0x47346(%rip),%xmm4
    787 │4135ea│   vpmaxsd %xmm4,%xmm3,%xmm3
   3801 │4135ef│   vpmaxsd %xmm0,%xmm3,%xmm0
  28932 │4135f4│   vmovd   %xmm0,0x4(%rdx,%rdi,4)
   2073 │4135fa│   inc     %rdi
   3464 │4135fd├── cmp     %r8,%r9
     11 │413600└── jne     4135c0 <P7Viterbi+0x1100>

vs.

        │413aa0┌─  vmovd   (%r11,%rdi,4),%xmm3
    208 │413aa6│   mov     %rdi,%r8
    393 │413aa9│   vpaddd  %xmm3,%xmm0,%xmm0
  11199 │413aad│   vmovd   %xmm0,0x4(%rdx,%rdi,4)
  11840 │413ab3│   vmovd   (%rax,%rdi,4),%xmm5
   3889 │413ab8│   vmovd   (%r10,%rdi,4),%xmm3
    340 │413abe│   vpaddd  %xmm3,%xmm5,%xmm3
   2829 │413ac2│   vmovdqa 0x48656(%rip),%xmm5
    720 │413aca│   vpmaxsd %xmm5,%xmm3,%xmm3
   1047 │413acf│   vpmaxsd %xmm0,%xmm3,%xmm0
  10698 │413ad4│   vmovd   %xmm0,0x4(%rdx,%rdi,4)
  12478 │413ada│   inc     %rdi
   2966 │413add├── cmp     %r8,%r9
      1 │413ae0└── jne     413aa0 <P7Viterbi+0x1760>

that's the scalar epilog, -mtune-ctrl=^avx512_two_epilogues does not help.
The regression also shows up on Icelake.

For some reason we're dealing with branch misses here which we have none
for BASE for the above loop but plenty with PEAK.  This seems to be related
to loop splitting - for PEAK we have two iterating loops while for BASE
there's simply fallthru code before.
-fno-split-loops fixes this.  We do not seem to realize that splitting

  for (k = 1; k <= M; k++)
    {
      if (k < M)
        {
        }
    }

has the k == M loop run only once.  That causes us to vectorize the epilog
loop as well.

A simplified testcase looks like

  int a[1024], b[1024];

  void foo (int M)
  {
    for (int k = 1; k <= M; ++k)
      {
        a[k] = a[k] + 1;
        if (k < M)
          b[k] = b[k] + 1;
      }
  }

likely "caused" by the loop splitting improvements, though for the
simplified testcase above the generated code is the same.

I'll note that with GCC 14 we do

fast_algorithms.c:145:10: optimized: loop split
fast_algorithms.c:133:19: optimized: Loop 3 distributed: split to 3 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: Loop 5 distributed: split to 2 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 7100547)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 20163246)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 6 iterations completely unrolled (header execution count 16089390)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops

while trunk does

fast_algorithms.c:145:10: optimized: loop split
fast_algorithms.c:133:19: optimized: Loop 3 distributed: split to 3 loops and 0 library calls.
fast_algorithms.c:133:19: optimized: Loop 5 distributed: split to 2 loops and 0 library calls.
fast_algorithms.c:165:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:165:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:165:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 64 byte vectors
fast_algorithms.c:133:19: optimized: loop versioned for vectorization because of possible aliasing
fast_algorithms.c:133:19: optimized: loop vectorized using 32 byte vectors
fast_algorithms.c:133:19: optimized: loop vectorized using 16 byte vectors
fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 21835320)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:133:19: optimized: loop with 2 iterations completely unrolled (header execution count 13974604)
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops
fast_algorithms.c:134:7: optimized: loop turned into non-loop; it never loops

Which is mostly the same (but all do not realize the loop from splitting
doesn't iterate).  The loop splitting is quite pointless (but it elides
the condition).
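
For illustration, the effect of loop splitting on the simplified testcase can be approximated by hand as below. This is only a sketch of the transformation's shape, not actual GCC output. The names foo_split, a2, b2 and same_result are hypothetical and exist only so the two variants can be compared side by side.

```c
#include <string.h>

#define N 1024

/* a/b are used by the original loop, a2/b2 (hypothetical names) by the
   hand-split variant, so the results can be compared afterwards.  */
int a[N], b[N];
int a2[N], b2[N];

/* The simplified testcase from the comment.  */
void foo (int M)
{
  for (int k = 1; k <= M; ++k)
    {
      a[k] = a[k] + 1;
      if (k < M)
        b[k] = b[k] + 1;
    }
}

/* Hand-written approximation of the split form (assumption: this mirrors
   the idea of the transformation, not the exact code GCC emits).  */
void foo_split (int M)
{
  int k = 1;

  /* First loop: k < M is known to be true, so the condition is elided.  */
  for (; k < M; ++k)
    {
      a2[k] = a2[k] + 1;
      b2[k] = b2[k] + 1;
    }

  /* Second loop: k < M is known to be false.  Only k == M remains, so
     this loop runs at most once.  */
  for (; k <= M; ++k)
    a2[k] = a2[k] + 1;
}

/* Hypothetical helper: returns 1 if both variants agree for a given M
   (valid for M < N).  */
int same_result (int M)
{
  memset (a, 0, sizeof a);
  memset (b, 0, sizeof b);
  memset (a2, 0, sizeof a2);
  memset (b2, 0, sizeof b2);
  foo (M);
  foo_split (M);
  return memcmp (a, a2, sizeof a) == 0 && memcmp (b, b2, sizeof b) == 0;
}
```

Written this way it is visible that the second loop only ever handles k == M, which is the fact the comment says is not being recognized when the split-off loop is treated as iterating.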