https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102750

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So for some unknown reason the more vectorized version of the function is
slower.

Note that all BB vectorization in this function happens when triggered from
the loop vectorizer on the if-converted loop body after loop vectorization
failed.  One difference is

-make_ahmat.c:37:13: missed:   desired vector type conflicts with earlier one for _334 = _5->c[0].real;
-make_ahmat.c:37:13: note:  removing SLP instance operations starting from: MEM <struct site> [(struct anti_hermitmat *)s_339].mom[dir_8].m00im = _45;
+make_ahmat.c:37:13: note:   vect_compute_data_ref_alignment:
+make_ahmat.c:37:13: note:   can't force alignment of ref: MEM <struct site> [(struct anti_hermitmat *)s_339].mom[dir_8].m00im
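
For illustration, a minimal C sketch (hypothetical declarations, not the actual milc struct layout) of the kind of reference the dump complains about: the store target is a struct member reached through a pointer parameter, so the vectorizer cannot simply raise the alignment of the underlying object the way it could for an array it sees the definition of, and has to fall back to unaligned vector accesses or peeling/versioning:

/* Hypothetical sketch, not the milc source.  The stores form a group the
   SLP vectorizer could handle, but the base object is only known through
   the pointer 's', so its alignment cannot be forced.  */
struct anti_hermitmat { double m00im, m11im, m22im, pad; };
struct site { struct anti_hermitmat mom[4]; };

void
store_group (struct site *s, int dir,
             double v0, double v1, double v2, double v3)
{
  s->mom[dir].m00im = v0;   /* alignment of *s is fixed by the caller */
  s->mom[dir].m11im = v1;
  s->mom[dir].m22im = v2;
  s->mom[dir].pad   = v3;
}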

In addition, the extra vectorization has live lanes that we think we cannot
reliably place:

+make_ahmat.c:37:13: missed:   Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed:   Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed:   Cannot determine insertion place for lane extract
+make_ahmat.c:37:13: missed:   Cannot determine insertion place for lane extract

This causes the scalar definitions to be retained (but they are then not
costed as removed), possibly leading to redundant computation and resource
competition.
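
To make the "live lane" situation concrete, here is a hypothetical C sketch (not the milc source): a vectorizable group of stores where one lane also has a scalar use, so either a lane extract has to be inserted at a valid point or the scalar definition stays alive next to the vector code.

/* Hypothetical sketch, not the milc source.  The four stores could be
   SLP-vectorized, but t2 is also needed as a scalar afterwards (a "live"
   lane).  If no insertion place for the lane extract is found, the
   scalar statement computing t2 is kept alongside the vector code, so
   part of the work is done twice.  */
struct quad { double x[4]; };

double
store_with_live_lane (struct quad *out, const double *b)
{
  double t0 = b[0] * 2.0;
  double t1 = b[1] * 2.0;
  double t2 = b[2] * 2.0;   /* live outside the SLP instance */
  double t3 = b[3] * 2.0;
  out->x[0] = t0;
  out->x[1] = t1;
  out->x[2] = t2;
  out->x[3] = t3;
  return t2;                /* scalar use requiring a lane extract */
}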

+su3_proj.c:44:24: note: Cost model analysis for part in loop 1:
+  Vector cost: 776
+  Scalar cost: 836

and before the change, for a much smaller portion of the block:

make_ahmat.c:40:60: note: Cost model analysis for part in loop 1:
  Vector cost: 224
  Scalar cost: 328

the scalar/vector cost ratio was better there (328/224 ≈ 1.46 versus
836/776 ≈ 1.08 after the change).
