https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119960
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
   Last reconfirmed|2025-04-27 00:00:00           |2025-04-28
             Status|UNCONFIRMED                   |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
It seems that -O2 performance is now faster but -O3 regressed, and
specifically -O3 is slower than -O2.  With GCC 14 we vectorize the stores
in (inlined)

static void pushEdgeFifo(EdgeFifo fifo, unsigned int a, unsigned int b, size_t& offset)
{
        fifo[offset][0] = a;
        fifo[offset][1] = b;
        offset = (offset + 1) & 15;
}

while with GCC 15 we only vectorize (as with GCC 14) the lower part of the
grouped store to (inlined) 'destination'.

static void writeTriangle(void* destination, size_t offset, size_t index_size, unsigned int a, unsigned int b, unsigned int c)
{
        if (index_size == 2)
          ...
        else
        {
                static_cast<unsigned int*>(destination)[offset + 0] = a;
                static_cast<unsigned int*>(destination)[offset + 1] = b;
                static_cast<unsigned int*>(destination)[offset + 2] = c;
        }
}

and the reason is we reject this with the default cost model (as we don't
emit vector CTORs from PHI args; the incoming 'a' and 'b' are quite
elaborately computed):

t2.c:1641:18: note: Costing subgraph:
t2.c:1641:18: note: node 0x1382e240 (max_nunits=2, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note: op template: (*_202)[0] = a_618;
t2.c:1641:18: note:     stmt 0 (*_202)[0] = a_618;
t2.c:1641:18: note:     stmt 1 (*_202)[1] = c_76;
t2.c:1641:18: note:     children 0x1382e900
t2.c:1641:18: note: node (external) 0x1382e900 (max_nunits=1, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note:     { a_618, c_76 }
t2.c:1641:18: note: Cost model analysis:
a_618 1 times scalar_store costs 12 in body
c_76 1 times scalar_store costs 12 in body
a_618 1 times vector_store costs 12 in body
node 0x1382e900 1 times vec_construct costs 16 in prologue
t2.c:1641:18: note: Cost model analysis for part in loop 1:
  Vector cost: 28
  Scalar cost: 24
t2.c:1641:18: missed: not vectorized: vectorization is not profitable.

The reason is that the vector construction requires a GPR<->XMM move.  If
you use any non-generic tuning like -mtune=intel or -mtune=znver4 you get
the stores vectorized again.

Note the regression is in some of the cases, those where GCC 14 has

t2.c:1641:18: note: Costing subgraph:
t2.c:1641:18: note: node 0x350534d8 (max_nunits=2, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note: op template: (*_147)[0] = c_64;
t2.c:1641:18: note:     stmt 0 (*_147)[0] = c_64;
t2.c:1641:18: note:     stmt 1 (*_147)[1] = b_114;
t2.c:1641:18: note:     children 0x350535e8
t2.c:1641:18: note: node (external) 0x350535e8 (max_nunits=1, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note:     { c_64, b_114 }
t2.c:1641:18: note: Cost model analysis:
c_64 1 times scalar_store costs 12 in body
b_114 1 times scalar_store costs 12 in body
c_64 1 times vector_store costs 12 in body
node 0x350535e8 1 times vec_construct costs 10 in prologue

Those do not happen with GCC 15 because of the change, as the load that
would result in a reduction in cost is in a different basic block, where
the fix is required for correctness.
One example is:

  [t2.c:2032:17] b_845 = [t2.c:2032:63] [t2.c:2032:60] edgefifo[_847][1];
  _356 = codetri_842 & 15;
  [t2.c:2035:8] fec_357 = (int) _356;
  [t2.c:2039:4] if (fecmax_62 > fec_357)
    goto <bb 211>; [50.00%]
  else
    goto <bb 198>; [50.00%]

  <bb 195> [local count: 316429835]:
  # next_410 = PHI <next_863(199), [t2.c:2046:10] next_949(213)>
  # last_412 = PHI <[t2.c:2055:10 discrim 5] c_892(199), last_862(213)>
  # c_360 = PHI <c_892(199), c_946(213)>
  # vertexfifooffset_393 = PHI <[t2.c:1662:9] _898(199), [t2.c:1662:9] _955(213)>
  # data_396 = PHI <data_893(199), data_856(213)>
  [t2.c:1641:13] _402 = edgefifooffset_858 * 8;
  [t2.c:1641:13] _406 = [t2.c:2062:17] &edgefifo + _402;
  [t2.c:1641:18] [t2.c:1641:16] (*_406)[0] = c_360;
  [t2.c:1642:18] [t2.c:1642:16] (*_406)[1] = b_845;

where we place the vector initializer into the BB of the store and the
costing assumes we'd manage to put the load into an XMM reg directly.

So confirmed.  I'll think about whether we can do something here.
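
For readers following the numbers, the profitability decisions in the two costing dumps come down to simple arithmetic.  The sketch below only restates that arithmetic; it is not the vectorizer's code.  The struct and function names are hypothetical, the comparison rule is a simplifying assumption (reject when the vector cost exceeds the scalar cost), and the total of 22 for the GCC 14 dump is computed here from the listed entries since that dump does not show its summary line.

```cpp
// Illustrative only: hypothetical names, not GCC internals.
// The per-entry costs are copied from the dumps in this comment.
struct SlpCost {
    int scalar_stores;  // number of scalar stores in the group
    int scalar_store;   // cost of one scalar store ("in body")
    int vector_store;   // cost of the single vector store ("in body")
    int vec_construct;  // cost of building { a, b } ("in prologue")
};

constexpr int scalar_cost(SlpCost c) { return c.scalar_stores * c.scalar_store; }
constexpr int vector_cost(SlpCost c) { return c.vector_store + c.vec_construct; }

// Assumption: a subgraph is rejected when the vector cost exceeds the scalar cost.
constexpr bool profitable(SlpCost c) { return vector_cost(c) <= scalar_cost(c); }

// GCC 15 dump: vec_construct costs 16 (includes the GPR<->XMM move).
constexpr SlpCost gcc15_case{2, 12, 12, 16};
// GCC 14 dump: vec_construct costs 10.
constexpr SlpCost gcc14_case{2, 12, 12, 10};

static_assert(scalar_cost(gcc15_case) == 24, "Scalar cost: 24 as in the dump");
static_assert(vector_cost(gcc15_case) == 28, "Vector cost: 28 as in the dump");
static_assert(!profitable(gcc15_case), "28 > 24: not vectorized");
static_assert(vector_cost(gcc14_case) == 22, "12 + 10, computed from the entries");
static_assert(profitable(gcc14_case), "22 < 24: vectorized");
```

The only differing input is the vec_construct entry (16 versus 10), which is enough to flip the decision for a two-element store group.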