https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37021
--- Comment #22 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #21)
> (In reply to Bill Schmidt from comment #20)
...<snip>...
>
> I see it only failing due to cost issues (tried ppc64le and -mcpu=power8).
> The unaligned loads cost 3 and we end up with
>
> t.f90:8:0: note: Cost model analysis:
>   Vector inside of loop cost: 40
>   Vector prologue cost: 8
>   Vector epilogue cost: 4
>   Scalar iteration cost: 12
>   Scalar outside cost: 6
>   Vector outside cost: 12
>   prologue iterations: 0
>   epilogue iterations: 0
> t.f90:8:0: note: cost model: the vector iteration cost = 40 divided by the
> scalar iteration cost = 12 is greater or equal to the vectorization factor =
> 1.
>
> Note that we are (still) not very good in estimating the SLP cost as we
> account 4 vector loads here (because we essentially will end up with
> 4 different permutations used), so the "unaligned" part is accounted for
> too much and likely the permutation cost as well.  Both are a limitation
> of the SLP data structures and not easily fixable.  With
> -fvect-cost-model=unlimited I see both loops vectorized.

Yes, I get these same results for the loop vectorizer (using -O2
-ftree-vectorize -mcpu=power8 -ffast-math).  But I was looking at the failure
to do SLP vectorization.  In comment 19 you indicated this was now working,
presumably on x86, but for Power we fail to SLP-vectorize
fast-math-pr37021.f90:9:0.

However, with today's trunk my SLP dump looks slightly different, so I need to
have another look at whether this is still failing due to alignment or
something else.  I'll comment again when I've dug into it further.
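
For reference, the rejection message in the quoted dump boils down to one comparison: a vector iteration that costs 40 replaces only 1 scalar iteration that costs 12. The sketch below redoes that arithmetic. It assumes the check is exactly what the message states (vector iteration cost divided by scalar iteration cost, compared against the vectorization factor). The function name `never_profitable` is hypothetical and this is not GCC's own code.

```python
# Hypothetical helper, not taken from GCC: it restates the comparison that the
# "cost model:" dump message describes, using integer arithmetic only.
def never_profitable(vec_inside_cost: int, scalar_iter_cost: int, vf: int) -> bool:
    """Return True when one vector iteration costs at least as much as the
    vf scalar iterations it would replace.

    vec_inside_cost / scalar_iter_cost >= vf is rewritten as
    vec_inside_cost >= scalar_iter_cost * vf to avoid division.
    """
    return vec_inside_cost >= scalar_iter_cost * vf


# Numbers copied from the quoted dump for t.f90:8:0.
print(never_profitable(vec_inside_cost=40, scalar_iter_cost=12, vf=1))  # prints True
```

With the dump's numbers, 40 >= 12 * 1, so under this reading the loop is rejected regardless of trip count. That matches both loops being vectorized once `-fvect-cost-model=unlimited` disables the check.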