https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91594
Bug ID: 91594
Summary: Missing horizontal addition auto-vectorization
Product: gcc
Version: 9.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: diegoandres91b at hotmail dot com
Target Milestone: ---

The following code (compiled with -O3 -ffast-math -msse3):

    float a2[4], b2[4], c2[4];

    void hadd2() {
        c2[0] = a2[0] + a2[1];
        c2[1] = a2[2] + a2[3];
        c2[2] = b2[0] + b2[1];
        c2[3] = b2[2] + b2[3];
    }

is compiled without auto-vectorization:

    hadd2():
            movss   xmm0, DWORD PTR a2[rip]
            addss   xmm0, DWORD PTR a2[rip+4]
            movss   DWORD PTR c2[rip], xmm0
            movss   xmm0, DWORD PTR a2[rip+8]
            addss   xmm0, DWORD PTR a2[rip+12]
            movss   DWORD PTR c2[rip+4], xmm0
            movss   xmm0, DWORD PTR b2[rip]
            addss   xmm0, DWORD PTR b2[rip+4]
            movss   DWORD PTR c2[rip+8], xmm0
            movss   xmm0, DWORD PTR b2[rip+8]
            addss   xmm0, DWORD PTR b2[rip+12]
            movss   DWORD PTR c2[rip+12], xmm0
            ret

The expected code, using the HADDPS instruction (which GCC does not generate):

    hadd2():
            movaps  xmm0, XMMWORD PTR a2[rip]
            haddps  xmm0, XMMWORD PTR b2[rip]
            movaps  XMMWORD PTR c2[rip], xmm0
            ret

In contrast, the ordinary element-wise addition:

    void add2() {
        c2[0] = a2[0] + b2[0];
        c2[1] = a2[1] + b2[1];
        c2[2] = a2[2] + b2[2];
        c2[3] = a2[3] + b2[3];
    }

is compiled with auto-vectorization:

    add2():
            movaps  xmm0, XMMWORD PTR a2[rip]
            addps   xmm0, XMMWORD PTR b2[rip]
            movaps  XMMWORD PTR c2[rip], xmm0
            ret

Compiler Explorer code: https://gcc.godbolt.org/z/9Hs9su
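For reference, the lane mapping that makes HADDPS a match for hadd2() can be written out as a scalar model. This is only an illustrative sketch in portable C, not part of the original test case: the names haddps_model and check_hadd2_matches_model are hypothetical, and the model follows the documented semantics of "HADDPS dst, src" (low two result lanes come from adjacent pairs of dst, high two from adjacent pairs of src).

```c
#include <assert.h>

/* Hypothetical scalar model of "HADDPS dst, src" (not from the report):
 * lanes 0-1 are pairwise sums of dst, lanes 2-3 are pairwise sums of src. */
static void haddps_model(const float dst[4], const float src[4], float out[4])
{
    out[0] = dst[0] + dst[1];
    out[1] = dst[2] + dst[3];
    out[2] = src[0] + src[1];
    out[3] = src[2] + src[3];
}

/* Globals and function as given in the report. */
float a2[4], b2[4], c2[4];

void hadd2(void)
{
    c2[0] = a2[0] + a2[1];
    c2[1] = a2[2] + a2[3];
    c2[2] = b2[0] + b2[1];
    c2[3] = b2[2] + b2[3];
}

/* Hypothetical self-check: hadd2() computes the same four lanes as the
 * HADDPS model with dst = a2 and src = b2. The inputs are small integers,
 * so the float comparisons are exact. */
void check_hadd2_matches_model(void)
{
    float expected[4];
    int i;

    for (i = 0; i < 4; i++) {
        a2[i] = (float)(i + 1);        /* 1, 2, 3, 4 */
        b2[i] = (float)(10 * (i + 1)); /* 10, 20, 30, 40 */
    }

    hadd2();
    haddps_model(a2, b2, expected);

    for (i = 0; i < 4; i++)
        assert(c2[i] == expected[i]);
}
```

Calling check_hadd2_matches_model() from a test driver exercises the comparison; it says nothing about what GCC emits, only that the source pattern and the instruction agree lane by lane.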