https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108724
Bug ID: 108724
Summary: [11 regression] Poor codegen when summing two arrays without AVX or SSE
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gbs at canishe dot com
Target Milestone: ---

This program:

void foo(int *a, const int *__restrict b, const int *__restrict c) {
    for (int i = 0; i < 16; i++) {
        a[i] = b[i] + c[i];
    }
}

when compiled for x86 by GCC 11.1+ with -O3 -mno-avx -mno-sse, produces:

foo:
        movq    %rdx, %rax
        subq    $8, %rsp
        movl    (%rsi), %edx
        movq    %rsi, %rcx
        addl    (%rax), %edx
        movl    4(%rax), %esi
        movq    $0, (%rsp)
        movl    %edx, (%rsp)
        movq    (%rsp), %rdx
        addl    4(%rcx), %esi
        movq    %rdx, -8(%rsp)
        movl    %esi, -4(%rsp)
        movq    -8(%rsp), %rdx
        movq    %rdx, (%rdi)
        movl    8(%rax), %edx
        addl    8(%rcx), %edx
        movq    $0, -16(%rsp)
        movl    %edx, -16(%rsp)
        movq    -16(%rsp), %rdx
        movl    12(%rcx), %esi
        addl    12(%rax), %esi
        movq    %rdx, -24(%rsp)
        movl    %esi, -20(%rsp)
        movq    -24(%rsp), %rdx
        movq    %rdx, 8(%rdi)
        [snip more of the same]
        movl    48(%rcx), %edx
        movq    $0, -96(%rsp)
        addl    48(%rax), %edx
        movl    %edx, -96(%rsp)
        movq    -96(%rsp), %rdx
        movl    52(%rcx), %esi
        addl    52(%rax), %esi
        movq    %rdx, -104(%rsp)
        movl    %esi, -100(%rsp)
        movq    -104(%rsp), %rdx
        movq    %rdx, 48(%rdi)
        movl    56(%rcx), %edx
        movq    $0, -112(%rsp)
        addl    56(%rax), %edx
        movl    %edx, -112(%rsp)
        movq    -112(%rsp), %rdx
        movl    60(%rcx), %ecx
        addl    60(%rax), %ecx
        movq    %rdx, -120(%rsp)
        movl    %ecx, -116(%rsp)
        movq    -120(%rsp), %rdx
        movq    %rdx, 56(%rdi)
        addq    $8, %rsp
        ret

(Godbolt link: https://godbolt.org/z/qq9dbP8ed)

This is bizarre: instead of keeping intermediate results in registers or writing them directly to *a, the code bounces them through the stack, which is bound to be slow. (GCC 10.4, and Clang, produce more or less what I would expect, using only the provided arrays and a register.) I haven't done any benchmarking myself, but Jonathan Wakely's results (on list: https://gcc.gnu.org/pipermail/gcc-help/2023-February/142181.html) seem to bear this out.
From a bisect, this behavior seems to have been introduced by commit 33c0f246f799b7403171e97f31276a8feddd05c9 (tree-optimization/97626 - handle SCCs properly in SLP stmt analysis) from Oct 2020, and persists into GCC trunk.