https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632
Bug ID: 109632
Summary: Inefficient codegen when complex numbers are emulated with structs
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*

The following two cases are equivalent:

struct complx_t {
    float re;
    float im;
};

complx_t add(const complx_t &a, const complx_t &b) {
    return {a.re + b.re, a.im + b.im};
}

_Complex float add(const _Complex float *a, const _Complex float *b) {
    return {__real__ *a + __real__ *b, __imag__ *a + __imag__ *b};
}

but we generate very different code for them (looking at -O2).  For the first one we do:

        ldr     d1, [x1]
        ldr     d0, [x0]
        fadd    v0.2s, v0.2s, v1.2s
        fmov    x0, d0
        lsr     x1, x0, 32
        lsr     w0, w0, 0
        fmov    s1, w1
        fmov    s0, w0
        ret

which is bad for obvious reasons, but it also never needed to go through the general registers for such a reversal: many other NEON instructions could have done the rearrangement.  For the second one we generate the good sequence:

add(float _Complex const*, float _Complex const*):
        ldp     s3, s2, [x0]
        ldp     s0, s1, [x1]
        fadd    s1, s2, s1
        fadd    s0, s3, s0
        ret

The difference is that in the second one the initial structure has been decomposed by loading the elements:

  <bb 2> [local count: 1073741824]:
  _1 = REALPART_EXPR <*a_8(D)>;
  _2 = REALPART_EXPR <*b_9(D)>;
  _3 = _1 + _2;
  _4 = IMAGPART_EXPR <*a_8(D)>;
  _5 = IMAGPART_EXPR <*b_9(D)>;
  _6 = _4 + _5;
  _10 = COMPLEX_EXPR <_3, _6>;
  return _10;

while in the first one they have been kept as vectors:

  <bb 2> [local count: 1073741824]:
  vect__1.6_13 = MEM <const vector(2) float> [(float *)a_8(D)];
  vect__2.9_15 = MEM <const vector(2) float> [(float *)b_9(D)];
  vect__3.10_16 = vect__1.6_13 + vect__2.9_15;
  MEM <vector(2) float> [(float *)&D.4435] = vect__3.10_16;
  return D.4435;

This part is probably a costing issue: we SLP them even though it is not profitable, because for the AAPCS we have to return the values in separate registers.
Using -fno-tree-vectorize gets the gimple code right:

  <bb 2> [local count: 1073741824]:
  _1 = a_8(D)->re;
  _2 = b_9(D)->re;
  _3 = _1 + _2;
  D.4435.re = _3;
  _4 = a_8(D)->im;
  _5 = b_9(D)->im;
  _6 = _4 + _5;
  D.4435.im = _6;
  return D.4435;

but we still generate worse code:

        ldp     s1, s0, [x0]
        mov     x2, 0
        ldp     s3, s2, [x1]
        fadd    s1, s1, s3
        fadd    s0, s0, s2
        fmov    w1, s1
        fmov    w0, s0
        bfi     x2, x1, 0, 32
        bfi     x2, x0, 32, 32
        lsr     x0, x2, 32
        lsr     w2, w2, 0
        fmov    s1, w0
        fmov    s0, w2

where we again use the general registers as a very complicated way to do a no-op.

So there are two bugs here:
1. a costing issue: we shouldn't SLP here, and
2. an expansion issue: the code out of expand is bad to begin with.