https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117875
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #17)
> -sre_math.c:174:17: optimized: loop vectorized using 16 byte vectors
>
> -sre_math.c:192:17: optimized: loop vectorized using 16 byte vectors

Those two are identical,

float **
FMX2Alloc(int rows, int cols)
{
  float **mx;
  int r;
  mx = (float **) __builtin_malloc (sizeof(float *) * rows);
  mx[0] = (float *) __builtin_malloc (sizeof(float) * rows * cols);
  for (r = 1; r < rows; r++)
    mx[r] = mx[0] + r*cols;
  return mx;
}

where the "failure" is a missed epilogue vectorization due to cost
(reproducible with Zen2 and Zen4 tuning, not with generic), where SLP costs

t.c:9:17: note:  Cost model analysis:
  Vector inside of loop cost: 136
  Vector prologue cost: 86
  Vector epilogue cost: 128
  Scalar iteration cost: 56
  Scalar outside cost: 32
  Vector outside cost: 214
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 6

and classical loop vect

t.c:9:17: note:  Cost model analysis:
  Vector inside of loop cost: 136
  Vector prologue cost: 68
  Vector epilogue cost: 128
  Scalar iteration cost: 56
  Scalar outside cost: 32
  Vector outside cost: 196
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 5

where the difference is in

cols_21(D) * r_42 1 times vector_stmt costs 12 in body
node 0x25bf6f00 1 times scalar_to_vec costs 10 in prologue
_8 w* 4 1 times vector_stmt costs 40 in prologue
<unknown> 1 times vector_load costs 12 in prologue

vs.

cols_21(D) * r_42 1 times scalar_to_vec costs 4 in prologue
cols_21(D) * r_42 1 times vector_stmt costs 12 in body
_8 w* 4 1 times vector_stmt costs 40 in prologue

we seem to forget to cost the constant 4 load cost in non-SLP and we
run into target specific costing of scalar_to_vec applying a GPR->XMM
move penalty which we only do for SLP.

So, SLP looks fine here.  This looks like a not important vectorization.

I verified that with Zen2 and epilogue vectorization disabled the regression triggered by --param vect-force-slp=1 remains.