[Bug rtl-optimization/114187] New: [14 regression] bizarre register dance on x86_64 for pass-by-value struct

matteo at mitalia dot net via Gcc-bugs Fri, 01 Mar 2024 01:17:05 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114187


            Bug ID: 114187
           Summary: [14 regression] bizarre register dance on x86_64 for
                    pass-by-value struct
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: matteo at mitalia dot net
  Target Milestone: ---

Sample code (+ godbolt link https://godbolt.org/z/zf6e16Wcq )

```
struct P2d {
    double x, y;
};

double sumxy(double x, double y) {
    return x + y;
}

double sumxy_p(P2d p) {
    return p.x + p.y;
}

double sumxy_p_ref(const P2d& p) {
    return p.x + p.y;
}
```

With g++ 13.2 at -O3, this generates perfectly reasonable code:

```
sumxy(double, double):
        addsd   xmm0, xmm1
        ret
sumxy_p(P2d):
        addsd   xmm0, xmm1
        ret
sumxy_p_ref(P2d const&):
        movsd   xmm0, QWORD PTR [rdi]
        addsd   xmm0, QWORD PTR [rdi+8]
        ret
```

With g++ 14 (g++
(Compiler-Explorer-Build-gcc-b05f474c8f7768dad50a99a2d676660ee4db09c6-binutils-2.40)
14.0.1 20240301 (experimental)) we instead get

```
sumxy(double, double):
        addsd   xmm0, xmm1
        ret
sumxy_p(P2d):
        movq    rax, xmm1
        movq    rdx, xmm0
        xchg    rdx, rax
        movq    xmm0, rax
        movq    xmm2, rdx
        addsd   xmm0, xmm2
        ret
sumxy_p_ref(P2d const&):
        movsd   xmm0, QWORD PTR [rdi]
        addsd   xmm0, QWORD PTR [rdi+8]
        ret
```

Notice the bizarre register dance for sumxy_p(P2d): p.x goes through xmm0 →
rdx → rax → xmm0, p.y in turn goes through xmm1 → rax → rdx → xmm2, and only
then do they finally get summed. sumxy(double, double), which should be
identical register-wise, is unaffected.
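
To make the data flow concrete, here is a rough C++ model of the gcc 14 sequence for sumxy_p. It only mirrors the assembly listing line by line; it says nothing about what GCC does internally. The names to_gp, to_xmm and sumxy_p_dance are made up for illustration, and each movq is modelled as a bit-exact copy via memcpy.

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

struct P2d {
    double x, y;
};

// Models `movq r64, xmm`: bit-exact copy of a double into a 64-bit integer.
// (Hypothetical helper, not part of the original report.)
static std::uint64_t to_gp(double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    return bits;
}

// Models `movq xmm, r64`: bit-exact copy of a 64-bit integer into a double.
// (Hypothetical helper, not part of the original report.)
static double to_xmm(std::uint64_t bits) {
    double v;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Step-by-step model of the gcc 14 output for sumxy_p(P2d).
double sumxy_p_dance(P2d p) {
    std::uint64_t rax = to_gp(p.y);  // movq rax, xmm1
    std::uint64_t rdx = to_gp(p.x);  // movq rdx, xmm0
    std::swap(rdx, rax);             // xchg rdx, rax
    double xmm0 = to_xmm(rax);       // movq xmm0, rax  (p.x, back where it started)
    double xmm2 = to_xmm(rdx);       // movq xmm2, rdx  (p.y, now in xmm2)
    return xmm0 + xmm2;              // addsd xmm0, xmm2
}
```

Since every step is a bit-exact copy or a swap, the whole sequence computes the same value as a plain p.x + p.y; the five moves before the addsd are pure overhead.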

The exact same code (for gcc 13 and gcc 14 respectively) is generated at all
optimization levels I tested (-Og, -O1, -O2, -O3), except -O0 of course, so it
doesn't seem to depend on particular optimization passes enabled only at high
optimization levels. Also (as expected) it doesn't seem to depend on the C++
frontend, as compiling this with plain gcc (adding a typedef for the struct and
changing the reference to a pointer) yields the exact same results.
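
For reference, a sketch of what the plain-C variant described above could look like (the report does not include the exact file, so this is a reconstruction; the name sumxy_p_ptr is an assumption). It is written so that it compiles both as C and as C++.

```cpp
/* Reconstructed C-compatible variant of the sample (assumed, not the
   reporter's exact file). */
typedef struct P2d {
    double x, y;
} P2d;

double sumxy(double x, double y) {
    return x + y;
}

/* Struct passed by value: the case that regresses with gcc 14 on x86_64. */
double sumxy_p(P2d p) {
    return p.x + p.y;
}

/* Pointer replaces the C++ reference; sumxy_p_ptr is a hypothetical name. */
double sumxy_p_ptr(const P2d *p) {
    return p->x + p->y;
}
```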

Most importantly, it seems to be target-specific, as ARM64 builds don't
exhibit any particular problem and produce pretty much the same (reasonable)
code on both 14.0 and 13.2:

```
sumxy(double, double):
        fadd    d0, d0, d1
        ret
sumxy_p(P2d):
        fadd    d0, d0, d1
        ret
sumxy_p_ref(P2d const&):
        ldp     d0, d31, [x0]
        fadd    d0, d0, d31
        ret
```

(gcc 13.2 generates slightly different code for sumxy_p_ref, but in a very
minor way)

Fiddling around, with -march=nocona (which leaves gcc 13.2 unaffected) I get a
more compact but still absurd dance:

```
sumxy_p(P2d):
        movsd   QWORD PTR [rsp-8], xmm1
        mov     rdx, QWORD PTR [rsp-8]
        movq    xmm2, rdx
        addsd   xmm0, xmm2
        ret
```

Here p.x is left in xmm0 where it should be, but xmm1 goes through the stack
(!), then a GP register (rdx), and finally to xmm2. It feels like in general it
wants to launder xmm1 through a 64-bit GP register before summing it, a bit
like a light version of -ffloat-store.
