https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65660
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Looks like we now vectorize using loop vectorization instead of basic-block
vectorization.  The overhead might be noticeable.  For example

 ./ggSpectrum.h:48:4: note: loop vectorized
-./ggSpectrum.h:49:18: note: basic block vectorized
-./ggSpectrum.h:49:18: note: basic block vectorized
-ggPathDielectricMaterial.cc:36:60: note: basic block vectorized
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment
+./ggSpectrum.h:48:4: note: loop vectorized
+./ggSpectrum.h:48:4: note: loop peeled for vectorization to enhance alignment

is all from ggPathDielectricMaterial.cc.

Not sure why we peel for alignment at all, as bdver2 has
vec_align_load_cost == vec_unalign_load_cost == vec_store_cost (there isn't
any unaligned store cost, but IIRC an unaligned store consumes two store
buffers, thus aligning the stores might be profitable).

Btw, the loop in question is:

  void Set(float d) {
    for (int i = 0; i < nComponents(); i++)
      data[i] = d;
  }

where I can very well imagine that nComponents() is _not_ large enough to
warrant loop vectorization (data is an array of 8 floats).  nComponents()
returns constant 8.

With bdver2 we now have

t.c:4:20: note: vectorization_factor = 4, niters = 8
t.c:4:20: note: === vect_update_slp_costs_according_to_vf ===
cost model: prologue peel iters set to vf/2.
cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown.
t.c:4:20: note: Cost model analysis:
  Vector inside of loop cost: 4
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 2
t.c:4:20: note:   Runtime profitability threshold = 3
t.c:4:20: note:   Static estimate profitability threshold = 3
t.c:4:20: note: epilog loop required

while generic has

t.c:4:20: note: vectorization_factor = 4, niters = 8
t.c:4:20: note: === vect_update_slp_costs_according_to_vf ===
cost model: prologue peel iters set to vf/2.
cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown.
t.c:4:20: note: Cost model analysis:
  Vector inside of loop cost: 1
  Vector prologue cost: 11
  Vector epilogue cost: 2
  Scalar iteration cost: 1
  Scalar outside cost: 0
  Vector outside cost: 13
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 17
t.c:4:20: note:   Runtime profitability threshold = 16
t.c:4:20: note:   Static estimate profitability threshold = 16
t.c:4:20: note: not vectorized: vectorization not profitable.

Somehow the prologue cost looks off for bdver2.

Testcase:

struct ggSpectrum {
  void Set (float d) {
    for (int i = 0; i < 8; i++)
      data[i] = d;
  }
  float data[8];
};

void foo (ggSpectrum *s, float d)
{
  s->Set(d);
}

Now the best course of action is of course to not even consider peeling this
loop for alignment ... (if it can otherwise vectorize).

I think we run into round-off errors with my fix on bdver2; I have a crude fix
for that.
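
For reference, the "minimum iters" and "Runtime profitability threshold"
values in the two dumps above can be reproduced with the sketch below.  The
formula is an assumption, reconstructed from memory of the vectorizer's
profitability estimate and checked only against the numbers quoted in this
comment; it is not a copy of the GCC source.  The names LoopCosts,
min_profitable_iters and runtime_threshold are hypothetical.

```cpp
#include <algorithm>

// Costs as printed in the "Cost model analysis" dump (hypothetical struct).
struct LoopCosts {
  int vf;              // vectorization_factor
  int vec_inside;      // "Vector inside of loop cost"
  int vec_outside;     // "Vector outside cost" (prologue + epilogue)
  int scalar_iter;     // "Scalar iteration cost"
  int scalar_outside;  // "Scalar outside cost"
  int prologue_iters;  // "prologue iterations"
  int epilogue_iters;  // "epilogue iterations"
};

// Assumed formula: smallest iteration count at which the vector loop is
// estimated to be cheaper than the scalar loop.  Returns -1 when one vector
// iteration costs at least as much as vf scalar iterations.
int min_profitable_iters(const LoopCosts &c) {
  const int scalar_per_vec_iter = c.scalar_iter * c.vf;
  if (scalar_per_vec_iter <= c.vec_inside)
    return -1;

  const int extra_outside = c.vec_outside - c.scalar_outside;
  if (extra_outside <= 0)
    return 1;

  // Integer division truncates toward zero here.
  int n = (extra_outside * c.vf
           - c.vec_inside * c.prologue_iters
           - c.vec_inside * c.epilogue_iters)
          / (scalar_per_vec_iter - c.vec_inside);

  // If the truncated value is still at or below break-even, take one more.
  if (scalar_per_vec_iter * n <= c.vec_inside * n + extra_outside * c.vf)
    ++n;
  return n;
}

// Assumed threshold rule: loops with niters <= threshold stay scalar, and
// the vector loop needs at least vf iterations.
int runtime_threshold(const LoopCosts &c) {
  const int n = min_profitable_iters(c);
  if (n < 0)
    return -1;
  return std::max(n, c.vf) - 1;
}
```

With the bdver2 numbers (vf 4, inside 4, outside 8, scalar 4, 2 + 2 peel
iterations) this gives 2 and a threshold of 3, so niters = 8 is vectorized.
With the generic numbers (inside 1, outside 13, scalar 1) it gives 17 and a
threshold of 16, so the same loop is rejected as not profitable.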