https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114187
Bug ID: 114187
Summary: [14 regression] bizarre register dance on x86_64 for pass-by-value struct
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: matteo at mitalia dot net
Target Milestone: ---

Sample code (+ godbolt link https://godbolt.org/z/zf6e16Wcq )

```
struct P2d {
    double x, y;
};

double sumxy(double x, double y) {
    return x + y;
}

double sumxy_p(P2d p) {
    return p.x + p.y;
}

double sumxy_p_ref(const P2d& p) {
    return p.x + p.y;
}
```

With g++ 13.2 -O3 this generates perfectly reasonable code:

```
sumxy(double, double):
        addsd   xmm0, xmm1
        ret
sumxy_p(P2d):
        addsd   xmm0, xmm1
        ret
sumxy_p_ref(P2d const&):
        movsd   xmm0, QWORD PTR [rdi]
        addsd   xmm0, QWORD PTR [rdi+8]
        ret
```

Instead, with g++ 14 (g++ (Compiler-Explorer-Build-gcc-b05f474c8f7768dad50a99a2d676660ee4db09c6-binutils-2.40) 14.0.1 20240301 (experimental)) we get:

```
sumxy(double, double):
        addsd   xmm0, xmm1
        ret
sumxy_p(P2d):
        movq    rax, xmm1
        movq    rdx, xmm0
        xchg    rdx, rax
        movq    xmm0, rax
        movq    xmm2, rdx
        addsd   xmm0, xmm2
        ret
sumxy_p_ref(P2d const&):
        movsd   xmm0, QWORD PTR [rdi]
        addsd   xmm0, QWORD PTR [rdi+8]
        ret
```

Notice the bizarre register dance for sumxy_p(P2d): p.x goes through xmm0 → rdx → rax → xmm0; p.y in turn goes through xmm1 → rax → rdx → xmm2; only then do they finally get summed. sumxy(double, double), which register-wise should be the same, is unaffected.

This exact same code (both for gcc 13 and gcc 14) is generated at all the optimization levels I tested (-Og, -O1, -O2, -O3), except -O0 of course, so it doesn't seem to depend on particular optimization passes enabled only at high optimization levels.

Also (as is reasonable) it doesn't seem to depend on the C++ frontend, as compiling this with plain gcc (adding a typedef for the struct and changing the reference to a pointer) yields the exact same results.
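The plain C variant mentioned above is not included in the report. The following is a reconstruction from that description only (typedef for the struct, pointer in place of the reference), so it may differ in detail from what was actually compiled:

```c
/* C translation of the C++ sample: the struct gets a typedef and the
   reference parameter becomes a pointer. Nothing else changes.
   Reconstructed from the description; not taken from the report. */
typedef struct P2d {
    double x, y;
} P2d;

double sumxy(double x, double y) {
    return x + y;
}

/* Pass-by-value struct: the case reported as regressing on x86_64. */
double sumxy_p(P2d p) {
    return p.x + p.y;
}

/* Pointer version standing in for the C++ const reference. */
double sumxy_p_ref(const P2d *p) {
    return p->x + p->y;
}
```

Compiling this with something like `gcc -O2 -S -masm=intel` should allow the assembly for sumxy_p to be compared against the listings above.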

Most importantly, it seems to be something target-specific, as ARM64 builds don't exhibit any particular problem, and produce pretty much the same (reasonable) code on both 14.0 and 13.2:

```
sumxy(double, double):
        fadd    d0, d0, d1
        ret
sumxy_p(P2d):
        fadd    d0, d0, d1
        ret
sumxy_p_ref(P2d const&):
        ldp     d0, d31, [x0]
        fadd    d0, d0, d31
        ret
```

(gcc 13.2 generates slightly different code for sumxy_p_ref, but in a very minor way)

Fiddling around, with -march=nocona (which leaves gcc 13.2 unaffected) I get a more compact but still absurd dance:

```
sumxy_p(P2d):
        movsd   QWORD PTR [rsp-8], xmm1
        mov     rdx, QWORD PTR [rsp-8]
        movq    xmm2, rdx
        addsd   xmm0, xmm2
        ret
```

Here p.x is left in xmm0 where it should be, but xmm1 goes through the stack (!), then a GP register (rdx), and finally to xmm2.

It feels like in general it wants to launder xmm1 through a 64-bit GP register before summing it, a bit like a light version of -ffloat-store.
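
To make the "laundering" concrete: moving a double from an SSE register to a 64-bit GP register and back is a bit-for-bit copy, so it cannot change the result of the sum. A minimal C sketch of that round trip follows; the helper names `through_gpr` and `sumxy_laundered` are hypothetical, and this is an illustration of the semantics only, not a reproducer (an optimizing compiler would normally be expected to fold the copies away):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the report: copy the bits of a double
   into a 64-bit integer and back, which is what an
   xmm -> GP register -> xmm round trip amounts to. */
double through_gpr(double v) {
    uint64_t bits;
    double out;
    memcpy(&bits, &v, sizeof bits);   /* roughly: movq rdx, xmm1 */
    memcpy(&out, &bits, sizeof out);  /* roughly: movq xmm2, rdx */
    return out;
}

/* Same value as x + y, since through_gpr() returns its input unchanged. */
double sumxy_laundered(double x, double y) {
    return x + through_gpr(y);
}
```

Since the round trip is an identity on the value, the extra moves in the GCC 14 output are pure overhead compared with the single addsd that GCC 13.2 emits.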