https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 52640
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52640&action=edit
patch

Like this - this counts the number of vector stmts and the number of
strided loads/stores and then when finishing up:

+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+{
+  m_finished = true;
+  if (m_costing_for_scalar)
+    return;
+
+  /* When we have more than one strided load or store and the
+     number of strided stores is high compared to all vector
+     stmts in the body we require at least an estimated
+     improvement due to the vectorization of a factor of two.  */
+  if (m_n_body_strided_load_store > 1
+      && m_n_body_stmts / m_n_body_strided_load_store < 4)
+    {
+      unsigned vf = 1;
+      if (is_a <loop_vec_info> (m_vinfo))
+        vf = vect_vf_for_cost (as_a <loop_vec_info> (m_vinfo));
+      if (scalar_costs->prologue_cost () * vf < 2 * body_cost ())
+        m_costs[vect_body] *= 2;
+    }
+}

the scaling of m_costs[vect_body] will make the vectorization
unprofitable.  Instead of a hard limit like this we could also scale
the strided load cost based on the overall number of them, like if
adding m_n_body_strided_load_store squared to the cost.  Note that the
"true" cost would only be visible when doing a scheduling model with
dependences in mind.

Note that for this particular case this is all hand-waving since the
true cost is the versioning/branching overhead, not the vectorized loop
body and the low number of iterations makes this particularly visible.
So for 416.gamess it will be all a hack...
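
To make the two options above easier to compare, here is a minimal standalone sketch of the arithmetic only. It is not GCC code: the BodyStats struct, its field names and both function names are hypothetical stand-ins for the counters and cost accessors used in the patch, and the single scalar_cost field merely plays the role of whatever scalar-side cost the patch compares against.

```cpp
#include <cassert>

// Hypothetical stand-in for the state the patch tracks; not GCC's real class.
struct BodyStats {
  unsigned n_stmts;               // all vector stmts counted in the loop body
  unsigned n_strided_load_store;  // strided loads/stores among those stmts
  unsigned scalar_cost;           // assumed scalar-side cost used in the comparison
  unsigned vector_body_cost;      // vector loop body cost before adjustment
  unsigned vf;                    // vectorization factor used for costing
};

// Hard limit, following the shape of the patch: when more than one strided
// access is present and fewer than four stmts (integer division) exist per
// strided access, demand an estimated 2x improvement.  If that is not met,
// double the body cost so the vectorization looks unprofitable.
unsigned hard_limit_cost(const BodyStats &s) {
  unsigned cost = s.vector_body_cost;
  if (s.n_strided_load_store > 1
      && s.n_stmts / s.n_strided_load_store < 4) {
    if (s.scalar_cost * s.vf < 2 * cost)
      cost *= 2;
  }
  return cost;
}

// Alternative mentioned in the comment: no threshold, instead add the
// square of the number of strided accesses to the body cost.
unsigned squared_penalty_cost(const BodyStats &s) {
  return s.vector_body_cost
         + s.n_strided_load_store * s.n_strided_load_store;
}
```

With illustrative numbers (8 stmts, 4 strided accesses, scalar cost 10, body cost 30, vf 2) the hard limit doubles the body cost to 60, while the squared variant only raises it to 46; how either interacts with the real cost comparison in the vectorizer is not modelled here.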