https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79946
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-03-08
          Component|target                      |tree-optimization
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Well, this is what we end up with at the GIMPLE level before RTL expansion:

;; basic block 2, loop depth 0
;;  pred: ENTRY
  _30 = MEM[(struct vect3d[16] *)d_87(D)];
  dx[0] = _30;
  _27 = MEM[(struct vect3d[16] *)d_87(D) + 24B];
  dx[1] = _27;
  _14 = MEM[(struct vect3d[16] *)d_87(D) + 48B];
  dx[2] = _14;
  _11 = MEM[(struct vect3d[16] *)d_87(D) + 72B];
  dx[3] = _11;
  _367 = *d_87(D)[4].x;
  dx[4] = _367;
  _342 = MEM[(struct vect3d[16] *)d_87(D) + 120B];
  dx[5] = _342;
  _335 = MEM[(struct vect3d[16] *)d_87(D) + 144B];
  dx[6] = _335;
...
  vect__226.88_97 = MEM[(real(kind=8) *)&tmp];
  vect__233.91_91 = MEM[(real(kind=8) *)&dx];
  vect__233.92_88 = MEM[(real(kind=8) *)&dx + 32B];
  vect__233.93_85 = MEM[(real(kind=8) *)&dx + 64B];
  vect__233.94_81 = MEM[(real(kind=8) *)&dx + 96B];
...

which eventually is coming from the FE:

        if (S.0 > 4) goto L.2;
        {
          integer(kind=8) S.1;
          integer(kind=8) D.3520;
          integer(kind=8) D.3521;

          D.3520 = S.0 * 4 + -5;
          D.3521 = S.0 * 4 + -5;
          S.1 = 1;
          while (1)
            {
              if (S.1 > 4) goto L.1;
              dx[S.1 + D.3521] = (*d)[S.1 + D.3520].x;
              S.1 = S.1 + 1;
            }
          L.1:;
        }
...

and we do not consider vectorizing this with AVX because of the large stride:

t.f90:14:0: note: Load permutation 0 3 6 9
t.f90:14:0: note: permutation requires at least three vectors   _327 = *d_87(D)[_326].x;
t.f90:14:0: note: Build SLP failed: unsupported load permutation   dx[_326] = _327;

and with SSE because

t.f90:14:0: note: Cost model analysis:
  Vector inside of loop cost: 14
  Vector prologue cost: 0
  Vector epilogue cost: 8
  Scalar iteration cost: 8
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 1
t.f90:14:0: note: cost model: the vector iteration cost = 14 divided by the scalar iteration cost = 8 is greater or equal to the vectorization factor = 1.
t.f90:14:0: note: not vectorized: vectorization not profitable.
t.f90:14:0: note: not vectorized: vector version will never be profitable.

Only late full unrolling exposes the fact that we could elide dx completely
by, say, SRA.

In this case the vectorizer could consider using strided loads (not sure if
the cost model would be in favor of that idea though).
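
For illustration only, here is a rough C sketch of the access pattern behind
the "Load permutation 0 3 6 9" note. It is not the reporter's testcase. It
assumes vect3d is three doubles (x, y, z) with no padding, which would match
the 24-byte steps in the GIMPLE dump; the names gather_x, x_index and
vectors_spanned are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: three doubles per element, so consecutive .x fields
   sit 24 bytes apart (the +24B, +48B, ... offsets in the dump). */
struct vect3d { double x, y, z; };

enum { N = 16 };

/* Scalar form of the copy: each load skips over y and z, i.e. it is a
   load with a stride of sizeof(struct vect3d). */
static void gather_x(const struct vect3d *d, double *dx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dx[i] = d[i].x;
}

/* Index of d[i].x when the array is viewed as a flat run of doubles.
   For i = 0..3 this gives 0, 3, 6, 9. */
static size_t x_index(size_t i)
{
    return (i * sizeof(struct vect3d) + offsetof(struct vect3d, x))
           / sizeof(double);
}

/* Number of contiguous `lanes`-wide vectors of doubles that must be
   loaded to reach the .x fields of `lanes` consecutive elements. */
static size_t vectors_spanned(size_t lanes)
{
    assert(lanes > 0);
    return x_index(lanes - 1) / lanes + 1;
}
```

With four doubles per AVX vector, vectors_spanned(4) is 3 under these
assumptions: the four x values are spread over three input vectors, which
is what the "permutation requires at least three vectors" note complains
about.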