https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119960

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2025-04-27 00:00:00         |2025-04-28
             Status|UNCONFIRMED                 |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
It seems that -O2 is now faster, but -O3 regressed; specifically, -O3 is
now slower than -O2.

With GCC 14 we vectorize the stores in (inlined)

static void pushEdgeFifo(EdgeFifo fifo, unsigned int a, unsigned int b, size_t& offset)
{ 
 fifo[offset][0] = a;
 fifo[offset][1] = b;
 offset = (offset + 1) & 15;
}

while with GCC 15 we only vectorize (as with GCC 14) the lower part of the
grouped store to (inlined) 'destination'.

static void writeTriangle(void* destination, size_t offset, size_t index_size,
unsigned int a, unsigned int b, unsigned int c)
{  
 if (index_size == 2)
...
 else
 {
  static_cast<unsigned int*>(destination)[offset + 0] = a;
  static_cast<unsigned int*>(destination)[offset + 1] = b;
  static_cast<unsigned int*>(destination)[offset + 2] = c;
 }
}

and the reason is that we reject this with the default cost model (as we
don't emit vector CTORs from PHI args; the incoming 'a' and 'b' are quite
elaborately computed):

t2.c:1641:18: note: Costing subgraph:
t2.c:1641:18: note: node 0x1382e240 (max_nunits=2, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note: op template: (*_202)[0] = a_618;
t2.c:1641:18: note:     stmt 0 (*_202)[0] = a_618;
t2.c:1641:18: note:     stmt 1 (*_202)[1] = c_76;
t2.c:1641:18: note:     children 0x1382e900
t2.c:1641:18: note: node (external) 0x1382e900 (max_nunits=1, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note:     { a_618, c_76 }
t2.c:1641:18: note: Cost model analysis:
a_618 1 times scalar_store costs 12 in body
c_76 1 times scalar_store costs 12 in body
a_618 1 times vector_store costs 12 in body
node 0x1382e900 1 times vec_construct costs 16 in prologue
t2.c:1641:18: note: Cost model analysis for part in loop 1:
  Vector cost: 28
  Scalar cost: 24
t2.c:1641:18: missed: not vectorized: vectorization is not profitable.

the reason is the vector construction requires a GPR<->XMM move.  If you
use any non-generic tuning like -mtune=intel or -mtune=znver4 you get
the stores vectorized again.
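
The arithmetic behind the rejection can be replayed with the numbers from the
dump above.  This is only a sketch: the constant and function names are made
up for illustration, and the rule "vector cost must be strictly below scalar
cost" is an assumption about how the comparison is done, not a quote of the
vectorizer source.

```cpp
// Costs copied from the GCC 15 dump above (generic tuning).
constexpr int kScalarStore = 12;   // each scalar_store
constexpr int kVectorStore = 12;   // the single vector_store
constexpr int kVecConstruct = 16;  // vec_construct of { a_618, c_76 }

// Hypothetical helpers mirroring the "Scalar cost" / "Vector cost" lines.
constexpr int scalarCost(int stores) { return stores * kScalarStore; }
constexpr int vectorCost(int vecConstruct) { return kVectorStore + vecConstruct; }

// Assumed rule: profitable only if the vector cost is strictly lower.
constexpr bool profitable(int vecConstruct)
{
    return vectorCost(vecConstruct) < scalarCost(2);
}

static_assert(scalarCost(2) == 24, "matches 'Scalar cost: 24'");
static_assert(vectorCost(kVecConstruct) == 28, "matches 'Vector cost: 28'");
static_assert(!profitable(kVecConstruct), "rejected, as in the dump");
static_assert(profitable(11), "a vec_construct cost below 12 would be accepted");
```

Under that assumption, with the store costs unchanged, the construction
would have to cost less than 12 for the two-lane store to be accepted.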

Note the regression is in some of the cases where GCC 14 has

t2.c:1641:18: note: Costing subgraph: 
t2.c:1641:18: note: node 0x350534d8 (max_nunits=2, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note: op template: (*_147)[0] = c_64;
t2.c:1641:18: note:     stmt 0 (*_147)[0] = c_64;
t2.c:1641:18: note:     stmt 1 (*_147)[1] = b_114;
t2.c:1641:18: note:     children 0x350535e8
t2.c:1641:18: note: node (external) 0x350535e8 (max_nunits=1, refcnt=1) vector(2) unsigned int
t2.c:1641:18: note:     { c_64, b_114 }
t2.c:1641:18: note: Cost model analysis:
c_64 1 times scalar_store costs 12 in body
b_114 1 times scalar_store costs 12 in body
c_64 1 times vector_store costs 12 in body
node 0x350535e8 1 times vec_construct costs 10 in prologue

Those do not happen with GCC 15 because of the change, as the load that
would result in a reduction in cost is in a different basic block, where
the fix is required for correctness.  One example is:

  [t2.c:2032:17] b_845 = [t2.c:2032:63] [t2.c:2032:60] edgefifo[_847][1];
  _356 = codetri_842 & 15;
  [t2.c:2035:8] fec_357 = (int) _356;
  [t2.c:2039:4] if (fecmax_62 > fec_357)
    goto <bb 211>; [50.00%]
  else
    goto <bb 198>; [50.00%]

  <bb 195> [local count: 316429835]:
  # next_410 = PHI <next_863(199), [t2.c:2046:10] next_949(213)>
  # last_412 = PHI <[t2.c:2055:10 discrim 5] c_892(199), last_862(213)>
  # c_360 = PHI <c_892(199), c_946(213)>
  # vertexfifooffset_393 = PHI <[t2.c:1662:9] _898(199), [t2.c:1662:9] _955(213)>
  # data_396 = PHI <data_893(199), data_856(213)>
  [t2.c:1641:13] _402 = edgefifooffset_858 * 8;
  [t2.c:1641:13] _406 = [t2.c:2062:17] &edgefifo + _402;
  [t2.c:1641:18] [t2.c:1641:16] (*_406)[0] = c_360;
  [t2.c:1642:18] [t2.c:1642:16] (*_406)[1] = b_845;

where the vector initializer is put into the BB of the store and the
costing assumes we'd manage to put the load into an XMM reg directly.
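
In source form, the shape of the GIMPLE above looks roughly like the sketch
below.  It is a hypothetical reduction written for illustration only: it is
not taken from the testcase and has not been checked to reproduce the missed
vectorization.  The EdgeFifo layout is an assumption inferred from the
'& 15' wrap-around and the [0]/[1] accesses, and every function name is made
up.

```cpp
#include <cassert>
#include <stddef.h>

// Assumed layout: 16 edges of two indices each (not shown in the excerpt).
typedef unsigned int EdgeFifo[16][2];

// Hypothetical reduction of the pattern in the dump.
static void storeAfterJoin(EdgeFifo fifo, size_t src, size_t dst, bool takeNext,
                           unsigned int next, unsigned int last)
{
    // Load in the block before the branch (compare b_845).
    unsigned int b = fifo[src][1];

    // The two incoming edges make 'c' a PHI in the join block (compare c_360).
    unsigned int c;
    if (takeNext)
        c = next;
    else
        c = last;

    // Adjacent stores in the join block; { c, b } is the external operand.
    fifo[dst][0] = c;
    fifo[dst][1] = b;
}

// Hypothetical check that the sketch stores what the scalar code would.
static bool checkStoreAfterJoin()
{
    EdgeFifo fifo = {};
    fifo[3][1] = 7;
    storeAfterJoin(fifo, 3, 5, true, 11, 22);
    assert(fifo[5][0] == 11 && fifo[5][1] == 7);
    storeAfterJoin(fifo, 3, 6, false, 11, 22);
    assert(fifo[6][0] == 22 && fifo[6][1] == 7);
    return true;
}
```

The point of the sketch is only that the load of 'b' and the stores are in
different basic blocks, so the { c, b } initializer built next to the stores
cannot simply reuse a load straight into a vector register.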

So confirmed.  I'll think about whether we can do something here.
