https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102750
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So for some unknown reason the more vectorized version of the function is
slower.  Note that all BB vectorization in this function happens when
triggered from the loop vectorizer on the if-converted loop body after loop
vectorization failed.

One difference is

-make_ahmat.c:37:13: missed: desired vector type conflicts with earlier one for _334 = _5->c[0].real;
-make_ahmat.c:37:13: note: removing SLP instance operations starting from: MEM <struct site> [(struct anti_hermitmat *)s_339].mom[dir_8].m00im = _45;
+make_ahmat.c:37:13: note: vect_compute_data_ref_alignment:
+make_ahmat.c:37:13: note: can't force alignment of ref: MEM <struct site> [(struct anti_hermitmat *)s_339].mom[dir_8].m00im

and the extra vectorization has live lanes that we think we cannot reliably
place:

+make_ahmat.c:37:13: missed: Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed: Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed: Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed: Cannot determine insertion place for lane extract

That will cause the scalar definitions to be retained (but they will not be
costed as removed then), possibly causing redundant computations and
resource competition.

+su3_proj.c:44:24: note: Cost model analysis for part in loop 1:
+  Vector cost: 776
+  Scalar cost: 836

and before the change, for a much smaller portion of the block,

make_ahmat.c:40:60: note: Cost model analysis for part in loop 1:
  Vector cost: 224
  Scalar cost: 328

the scalar/vector ratio was better there.
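
Put in numbers (computed from the cost-model output quoted above): before the
change the vector/scalar cost ratio of the vectorized part was
224/328 ~= 0.68, while after the change it is 776/836 ~= 0.93, so the
estimated benefit of the larger vectorized region is much thinner.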
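
For illustration, a minimal hypothetical C sketch (GNU vector extensions; the
names are made up and do not come from the milc sources or the dump) of what
retaining the scalar definition of a live lane means: when no insertion place
for the lane extract can be determined, the scalar statement stays around next
to the vector one, so the value for that lane is effectively computed twice.

typedef float v4sf __attribute__ ((vector_size (16)));

/* All four lanes are computed by the vector add, but the scalar add for
   lane 2 is kept as well because its "live" (scalar) use could not be
   served by a lane extract with a known insertion place.  */
float sketch (const float *a, const float *b, float *out)
{
  v4sf va = { a[0], a[1], a[2], a[3] };
  v4sf vb = { b[0], b[1], b[2], b[3] };
  v4sf vc = va + vb;                 /* vectorized computation */
  __builtin_memcpy (out, &vc, sizeof vc);

  float live = a[2] + b[2];          /* retained scalar definition,
                                        redundant with vc[2] */
  return live;
}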