https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117874
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
In particular

t.c:11:12: note: Starting SLP discovery for
t.c:11:12: note: c_84(D)->e[2][j_110].real = _55;
t.c:11:12: note: c_84(D)->e[2][j_110].imag = _66;
t.c:11:12: note: starting SLP discovery for node 0x51582a0
t.c:11:12: note: SLP discovery for node 0x51582a0 failed
t.c:11:12: note: SLP discovery failed

we fail to align

  _45 = b0r_89 * a0r_97;
  _46 = b0i_90 * a0i_98;
  _47 = _45 + _46;
  _48 = a1r_99 * b1r_101;
  _49 = _47 + _48;
  _50 = a1i_100 * b1i_102;
  _51 = _49 + _50;
  _52 = b2r_82 * a2r_103;
  _53 = _51 + _52;
  _54 = b2i_83 * a2i_104;
  _55 = _53 + _54;
  c_84(D)->e[2][j_110].real = _55;
  _56 = b0i_90 * a0r_97;
  _57 = b0r_89 * a0i_98;
  _59 = a1r_99 * b1i_102;
  _117 = _56 + _59;
  _61 = a1i_100 * b1r_101;
  _63 = b2i_83 * a2r_103;
  _118 = _63 + _117;
  _119 = _118 - _57;
  _64 = _119 - _61;
  _65 = b2r_82 * a2i_104;
  _66 = _64 - _65;
  c_84(D)->e[2][j_110].imag = _66;

t.c:11:12: note: pre-sorted chains of plus_expr
plus_expr _54 plus_expr _52 plus_expr _50 plus_expr _48 plus_expr _45 plus_expr _46
plus_expr _63 plus_expr _56 plus_expr _59 minus_expr _65 minus_expr _61 minus_expr _57
t.c:11:12: note: starting SLP discovery for node 0x52393c0
t.c:11:12: note: Build SLP for _54 = b2i_83 * a2i_104;
t.c:11:12: note: precomputed vectype: vector(8) double
t.c:11:12: note: nunits = 8
t.c:11:12: note: Build SLP for _63 = b2i_83 * a2r_103;
t.c:11:12: note: precomputed vectype: vector(8) double
t.c:11:12: note: nunits = 8
t.c:11:12: note: vect_is_simple_use: operand b_73(D)->e[2][j_110].imag, type of def: internal
t.c:11:12: note: vect_is_simple_use: operand a_70(D)->e[2][2].imag, type of def: internal
t.c:11:12: note: failed to line up SLP graph by re-associating operations in lanes trying regular discovery

which fails quickly, but not exactly verbosely, because we exceed the discovery
limit here.

One issue this highlights is that when we run into hybrid stmts we fail to
consider single-lane SLP as a fallback.
Unfortunately "non-SLP" (aka interleaving) isn't enough to fix the slowdown. Doubling the discovery limit allows us to SLP vectorize with AVX, but that isn't faster either. I don't think we did a particularly good job with GCC 14 here; it seems we were lucky somehow. BB vectorization ends up not being profitable.
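
For readers without the testcase at hand, here is a minimal C sketch of a kernel with the same shape as the two store lanes in the dump above. It is an assumption reconstructed from the GIMPLE, not the actual t.c from this PR: the type names (cplx, mat3), the function name (row2_adjoint_mul), and hoisting the loads of a out of the loop are all hypothetical. The point it illustrates is that the .real lane is a chain of six additions while the .imag lane mixes three additions with three subtractions, which is the mismatch the re-association step has to line up.

```c
#include <assert.h>

/* Hypothetical types; names and layout are assumptions, not taken from t.c. */
typedef struct { double real, imag; } cplx;
typedef struct { cplx e[3][3]; } mat3;

/* Computes row 2 of c = adjoint(a) * b, i.e.
   c->e[2][j] = sum over k of conj(a->e[k][2]) * b->e[k][j].
   Each store is written as one scalar chain, mirroring the two lanes
   (.real and .imag) that SLP discovery starts from in the dump.  */
void row2_adjoint_mul(const mat3 *a, const mat3 *b, mat3 *c)
{
  double a0r = a->e[0][2].real, a0i = a->e[0][2].imag;
  double a1r = a->e[1][2].real, a1i = a->e[1][2].imag;
  double a2r = a->e[2][2].real, a2i = a->e[2][2].imag;

  for (int j = 0; j < 3; j++)
    {
      double b0r = b->e[0][j].real, b0i = b->e[0][j].imag;
      double b1r = b->e[1][j].real, b1i = b->e[1][j].imag;
      double b2r = b->e[2][j].real, b2i = b->e[2][j].imag;

      /* Real lane: six products, all added.  */
      c->e[2][j].real = a0r * b0r + a0i * b0i + a1r * b1r
                        + a1i * b1i + a2r * b2r + a2i * b2i;

      /* Imaginary lane: three products added, three subtracted.  */
      c->e[2][j].imag = a0r * b0i - a0i * b0r + a1r * b1i
                        - a1i * b1r + a2r * b2i - a2i * b2r;
    }
}
```

Whether a sketch like this reproduces the exact statement order and SSA names shown above depends on the compiler version and flags, so treat it only as an aid for reading the dump.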