https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408
Bug ID: 104408
Summary: SLP discovery fails due to -Ofast rewriting
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---

The following testcase:

typedef struct { float r, i; } cf;

void
f (cf *restrict a, cf *restrict b, cf *restrict c, cf *restrict d, cf e)
{
  for (int i = 0; i < 100; ++i)
    {
      b[i].r = e.r * (c[i].r - d[i].r) - e.i * (c[i].i - d[i].i);
      b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);
    }
}

when compiled at -O3 forms an SLP tree but fails at -Ofast because match.pd
rewrites the expression into

      b[i].r = e.r * (c[i].r - d[i].r) + e.i * (d[i].i - c[i].i);
      b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);

and so introduces a different interleaving in the second multiply operation.

It's unclear to me what the gain of doing this is, as it results in worse
vector and scalar code because the computed values of the nodes can no longer
be shared.

Without the rewriting the first code can re-use the load from the first vector
and just reverse the elements:

.L2:
        ldr     q1, [x3, x0]
        ldr     q0, [x2, x0]
        fsub    v0.4s, v0.4s, v1.4s
        fmul    v1.4s, v2.4s, v0.4s
        fmul    v0.4s, v3.4s, v0.4s
        rev64   v1.4s, v1.4s
        fneg    v0.2d, v0.2d
        fadd    v0.4s, v0.4s, v1.4s
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 800
        bne     .L2

With the rewrite, it forces an increase in VF to be able to handle the
interleaving:

.L2:
        ld2     {v0.4s - v1.4s}, [x3], 32
        ld2     {v4.4s - v5.4s}, [x2], 32
        fsub    v2.4s, v1.4s, v5.4s
        fsub    v3.4s, v4.4s, v0.4s
        fsub    v5.4s, v5.4s, v1.4s
        fmul    v2.4s, v2.4s, v6.4s
        fmul    v4.4s, v6.4s, v3.4s
        fmla    v2.4s, v7.4s, v3.4s
        fmla    v4.4s, v5.4s, v7.4s
        mov     v0.16b, v2.16b
        mov     v1.16b, v4.16b
        st2     {v0.4s - v1.4s}, [x1], 32
        cmp     x5, x1
        bne     .L2

In scalar code you lose the ability to re-use the subtract, so you get an
extra sub.
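
The scalar point can be illustrated with a small sketch of one loop iteration
written both ways. This is not compiler output: the function names
original_form, rewritten_form and check_forms_agree are hypothetical and exist
only for this illustration. The original form needs two subtracts that both
outputs share, while the rewritten form needs a third one (d.i - c.i) that
nothing else uses.

```c
#include <assert.h>

typedef struct { float r, i; } cf;

/* Form as written in the testcase: both outputs use the same two
   differences, so each subtract is computed once and shared.  */
cf original_form (cf e, cf c, cf d)
{
  float dr = c.r - d.r;
  float di = c.i - d.i;
  cf out;
  out.r = e.r * dr - e.i * di;
  out.i = e.r * di + e.i * dr;
  return out;
}

/* Form after the rewrite described in the report: the real part now
   needs d.i - c.i, which the imaginary part cannot share.  */
cf rewritten_form (cf e, cf c, cf d)
{
  float dr = c.r - d.r;
  float di = c.i - d.i;
  float di_swapped = d.i - c.i;   /* the extra subtract */
  cf out;
  out.r = e.r * dr + e.i * di_swapped;
  out.i = e.r * di + e.i * dr;
  return out;
}

/* Hypothetical helper: checks that both forms agree for one input.
   Exact comparison is only meant for inputs whose intermediate
   results are exactly representable (e.g. small integers).  */
void check_forms_agree (cf e, cf c, cf d)
{
  cf a = original_form (e, c, d);
  cf b = rewritten_form (e, c, d);
  assert (a.r == b.r);
  assert (a.i == b.i);
}
```

For example, with e = {2, 3}, c = {5, 7} and d = {1, 4}, both forms give
r = -1 and i = 18; they differ only in how many subtracts are needed to get
there.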