https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117874

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
In particular

t.c:11:12: note:   Starting SLP discovery for
t.c:11:12: note:     c_84(D)->e[2][j_110].real = _55;
t.c:11:12: note:     c_84(D)->e[2][j_110].imag = _66;
t.c:11:12: note:   starting SLP discovery for node 0x51582a0
t.c:11:12: note:   SLP discovery for node 0x51582a0 failed 
t.c:11:12: note:   SLP discovery failed 

we fail to align

  _45 = b0r_89 * a0r_97;
  _46 = b0i_90 * a0i_98;
  _47 = _45 + _46; 
  _48 = a1r_99 * b1r_101;
  _49 = _47 + _48; 
  _50 = a1i_100 * b1i_102;
  _51 = _49 + _50; 
  _52 = b2r_82 * a2r_103;
  _53 = _51 + _52; 
  _54 = b2i_83 * a2i_104;
  _55 = _53 + _54; 
  c_84(D)->e[2][j_110].real = _55;
  _56 = b0i_90 * a0r_97;
  _57 = b0r_89 * a0i_98;
  _59 = a1r_99 * b1i_102;
  _117 = _56 + _59;
  _61 = a1i_100 * b1r_101;
  _63 = b2i_83 * a2r_103;
  _118 = _63 + _117;
  _119 = _118 - _57;
  _64 = _119 - _61;
  _65 = b2r_82 * a2i_104;
  _66 = _64 - _65; 
  c_84(D)->e[2][j_110].imag = _66;
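
Read back into C, the two store lanes above are roughly the following. This is only a sketch to make the lane mismatch visible: the type `cplx`, the function `store_lanes` and the array-style parameters are assumptions made for illustration, and only the field names `real` and `imag` and the operand pairing are taken from the GIMPLE.

```c
/* Hypothetical complex type; only the field names come from the dump. */
typedef struct { double real, imag; } cplx;

/* Sketch of the two lanes, with a[k] standing for (akr, aki) and
   b[k] for (bkr, bki).  Floating-point evaluation order is simplified
   relative to the GIMPLE. */
void store_lanes(const cplx a[3], const cplx b[3], cplx *out)
{
    /* Lane 0 (.real): six products, all combined with '+'. */
    out->real = b[0].real * a[0].real + b[0].imag * a[0].imag
              + a[1].real * b[1].real + a[1].imag * b[1].imag
              + b[2].real * a[2].real + b[2].imag * a[2].imag;

    /* Lane 1 (.imag): six products, three added and three subtracted. */
    out->imag = b[0].imag * a[0].real + a[1].real * b[1].imag
              + b[2].imag * a[2].real
              - b[0].real * a[0].imag - a[1].imag * b[1].real
              - b[2].real * a[2].imag;
}
```

Lane 0 is a pure addition chain while lane 1 mixes additions and subtractions, which is what the pre-sorted chains in the dump below show.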

t.c:11:12: note:   pre-sorted chains of plus_expr
plus_expr _54 plus_expr _52 plus_expr _50 plus_expr _48 plus_expr _45 plus_expr _46
plus_expr _63 plus_expr _56 plus_expr _59 minus_expr _65 minus_expr _61 minus_expr _57
t.c:11:12: note:   starting SLP discovery for node 0x52393c0
t.c:11:12: note:   Build SLP for _54 = b2i_83 * a2i_104;
t.c:11:12: note:   precomputed vectype: vector(8) double
t.c:11:12: note:   nunits = 8
t.c:11:12: note:   Build SLP for _63 = b2i_83 * a2r_103;
t.c:11:12: note:   precomputed vectype: vector(8) double
t.c:11:12: note:   nunits = 8
t.c:11:12: note:   vect_is_simple_use: operand b_73(D)->e[2][j_110].imag, type of def: internal
t.c:11:12: note:   vect_is_simple_use: operand a_70(D)->e[2][2].imag, type of def: internal
t.c:11:12: note:   failed to line up SLP graph by re-associating operations in lanes trying regular discovery

which fails quickly, though not exactly verbosely, because we exceed the
discovery limit here.

One issue this highlights is that when we run into hybrid stmts we fail to
consider single-lane SLP as a fallback.  Unfortunately "non-SLP" (aka
interleaving) isn't enough to fix the slowdown.

Doubling the discovery limit allows us to SLP vectorize with AVX, but that
isn't faster either.  I don't think we did a particularly good job with
GCC 14 here; it seems we were lucky somehow.

BB vectorization ends up not profitable.
