https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104912

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 52640
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52640&action=edit
patch

Something like this: the patch counts the number of vector stmts and the
number of strided loads/stores in the body and then, when finishing up the
costing, does:

+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+{
+  m_finished = true;
+  if (m_costing_for_scalar)
+    return;
+
+  /* When we have more than one strided load or store and the
+     number of strided loads/stores is high compared to all vector
+     stmts in the body we require at least an estimated
+     improvement due to the vectorization of a factor of two.  */
+  if (m_n_body_strided_load_store > 1
+      && m_n_body_stmts / m_n_body_strided_load_store < 4)
+    {
+      unsigned vf = 1;
+      if (is_a <loop_vec_info> (m_vinfo))
+       vf = vect_vf_for_cost (as_a <loop_vec_info> (m_vinfo));
+      if (scalar_costs->body_cost () * vf < 2 * body_cost ())
+       m_costs[vect_body] *= 2;
+    }
+}
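
To see the check in isolation, here is a minimal standalone sketch, not GCC
code.  BodyStats, strided_heavy and penalized_body_cost are hypothetical names
that only model the members used in the patch, and the scalar_body parameter
assumes the comparison is meant against the scalar body cost, as the
"factor of two" comment suggests.

```cpp
// Standalone model of the check in the patch above; not GCC code.
// All names are hypothetical stand-ins for the vector_costs members.
struct BodyStats {
  unsigned n_stmts;               // models m_n_body_stmts
  unsigned n_strided_load_store;  // models m_n_body_strided_load_store
};

// True when there is more than one strided access and fewer than four
// vector stmts per strided access (integer division, as in the patch).
constexpr bool strided_heavy(BodyStats s) {
  return s.n_strided_load_store > 1
         && s.n_stmts / s.n_strided_load_store < 4;
}

// Vector body cost after the penalty: doubled when the body is
// strided-heavy and scalar_body * vf is less than twice the vector body
// cost (estimated speedup below two), unchanged otherwise.
constexpr unsigned penalized_body_cost(BodyStats s, unsigned scalar_body,
                                       unsigned vector_body, unsigned vf) {
  return (strided_heavy(s) && scalar_body * vf < 2 * vector_body)
             ? vector_body * 2
             : vector_body;
}
```

With 10 stmts, 4 of them strided, a scalar body cost of 10, a vector body
cost of 30 and vf = 4, the estimated speedup is 40/30, so the body cost is
doubled to 60.  With a scalar body cost of 20 the speedup is 80/30 and the
cost stays at 30.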


Scaling m_costs[vect_body] this way will make the vectorization unprofitable.
Instead of a hard limit like this, we could also scale the strided load/store
cost based on the overall number of them, for example by adding
m_n_body_strided_load_store squared to the cost.
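
A rough sketch of that smoother alternative, again outside of GCC.  The
function name and the plain unsigned cost type are assumptions made for
illustration; nothing like this is in the attached patch.

```cpp
// Hypothetical alternative to the hard cutoff: add the square of the
// strided access count to the vector body cost, so the penalty grows
// smoothly with the number of strided loads/stores.
constexpr unsigned quadratic_strided_cost(unsigned vector_body,
                                          unsigned n_strided_load_store) {
  return vector_body + n_strided_load_store * n_strided_load_store;
}
```

One strided access adds 1 to the body cost and four add 16, so a body with
few strided accesses is barely affected while a strided-heavy one is
penalized much more strongly.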

Note that the "true" cost would only be visible with a scheduling model that
takes dependences into account.  For this particular case it is all
hand-waving anyway, since the true cost is the versioning/branching overhead,
not the vectorized loop body, and the low number of iterations makes this
particularly visible.  So for 416.gamess it will be all a hack...
