https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117874
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note m_mat_na.c and m_mat_nn.c are completely unrolled instead and not
vectorized by GCC 14 (nor trunk), still slower as reported (mul_su3_na/nn).

trunk seems to unroll less,

m_mat_nn.c:73:30: optimized: loop with 3 iterations completely unrolled (header execution count 268435456)

on trunk vs.

m_mat_nn.c:73:30: optimized: loop with 3 iterations completely unrolled (header execution count 268435456)
m_mat_nn.c:73:14: optimized: loop with 2 iterations completely unrolled (header execution count 89478486)

on branch. In particular cunroll on GIMPLE does not unroll the outer loop
on trunk:

Loop 1 iterates 2 times.
Loop 1 iterates at most 2 times.
Loop 1 likely iterates at most 2 times.
size: 104-4, last_iteration: 104-4
  Loop size: 104
  Estimated size after unrolling: 300
Not unrolling loop 1: number of insns in the unrolled sequence reaches --param max-completely-peeled-insns limit.
Not peeling: upper bound is known so can unroll completely

vs branch:

Loop 1 iterates 2 times.
Loop 1 iterates at most 2 times.
Loop 1 likely iterates at most 2 times.
size: 104-4, last_iteration: 104-4
  Loop size: 104
  Estimated size after unrolling: 200

that's the r15-919-gef27b91b62c3aa change I think. The heuristic, while
careful, doesn't accurately remember what's "innermost" in this case though
it's still correct that the body isn't simplified by 1/3 - in this case
cunroll has 306 stmts while optimized 276 (FMA disabled), so that's purely
CSE.
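
For reference, the two size estimates in the dumps above can be reproduced with a small sketch. This is not GCC's code. It assumes that the estimate is (size - eliminated) * n_unroll plus the last iteration, that the branch scales the result by 2/3 while trunk skips that scaling for this loop, and that --param max-completely-peeled-insns is at its default of 200. The function names are made up for illustration.

```c
#include <stdbool.h>

/* Assumed default of --param max-completely-peeled-insns.  */
enum { SKETCH_MAX_COMPLETELY_PEELED_INSNS = 200 };

/* Hypothetical helper, not a GCC function.  "size: 104-4" in the dump is
   read as 104 insns of which 4 are expected to be eliminated per copy;
   "last_iteration: 104-4" likewise for the final copy.  */
int
sketch_unrolled_size (int overall, int eliminated,
                      int last_overall, int last_eliminated,
                      int n_unroll, bool assume_one_third_simplified)
{
  int insns = (overall - eliminated) * n_unroll
              + (last_overall - last_eliminated);
  /* Assumption: the branch always expects later passes to remove about
     one third of the unrolled body; trunk does not for this loop.  */
  if (assume_one_third_simplified)
    insns = insns * 2 / 3;
  return insns > 0 ? insns : 1;
}

/* Unrolling is rejected only when the estimate exceeds the limit.  */
bool
sketch_would_unroll (int estimated_size)
{
  return estimated_size <= SKETCH_MAX_COMPLETELY_PEELED_INSNS;
}
```

With the numbers from the dump (104, 4, 104, 4, n_unroll 2) the unscaled estimate is 300, which is over the limit, and the scaled estimate is 200, which just fits. That matches trunk refusing to unroll and the branch unrolling the outer loop.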

Testcase:

typedef struct {
  double real;
  double imag;
} complex;

typedef struct { complex e[3][3]; } su3_matrix;

void mult_su3_nn( su3_matrix *a, su3_matrix *b, su3_matrix *c )
{
  int i,j;
  double t,ar,ai,br,bi,cr,ci;
  for(i=0;i<3;i++)for(j=0;j<3;j++){

      ar=a->e[i][0].real; ai=a->e[i][0].imag;
      br=b->e[0][j].real; bi=b->e[0][j].imag;
      cr=ar*br; t=ai*bi; cr -= t;
      ci=ar*bi; t=ai*br; ci += t;

      ar=a->e[i][1].real; ai=a->e[i][1].imag;
      br=b->e[1][j].real; bi=b->e[1][j].imag;
      t=ar*br; cr += t; t=ai*bi; cr -= t;
      t=ar*bi; ci += t; t=ai*br; ci += t;

      ar=a->e[i][2].real; ai=a->e[i][2].imag;
      br=b->e[2][j].real; bi=b->e[2][j].imag;
      t=ar*br; cr += t; t=ai*bi; cr -= t;
      t=ar*bi; ci += t; t=ai*br; ci += t;

      c->e[i][j].real=cr;
      c->e[i][j].imag=ci;
  }
}