[Bug tree-optimization/79946] Suboptimal code with AVX2 copying all arguments to stack

rguenth at gcc dot gnu.org Wed, 08 Mar 2017 01:03:07 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79946


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-03-08
          Component|target                      |tree-optimization
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Well, this is what we end up with on the GIMPLE level before RTL expansion:

;;   basic block 2, loop depth 0
;;    pred:       ENTRY
  _30 = MEM[(struct vect3d[16] *)d_87(D)];
  dx[0] = _30;
  _27 = MEM[(struct vect3d[16] *)d_87(D) + 24B];
  dx[1] = _27;
  _14 = MEM[(struct vect3d[16] *)d_87(D) + 48B];
  dx[2] = _14;
  _11 = MEM[(struct vect3d[16] *)d_87(D) + 72B];
  dx[3] = _11;
  _367 = *d_87(D)[4].x;
  dx[4] = _367;
  _342 = MEM[(struct vect3d[16] *)d_87(D) + 120B];
  dx[5] = _342;
  _335 = MEM[(struct vect3d[16] *)d_87(D) + 144B];
  dx[6] = _335;
...
  vect__226.88_97 = MEM[(real(kind=8) *)&tmp];
  vect__233.91_91 = MEM[(real(kind=8) *)&dx];
  vect__233.92_88 = MEM[(real(kind=8) *)&dx + 32B];
  vect__233.93_85 = MEM[(real(kind=8) *)&dx + 64B];
  vect__233.94_81 = MEM[(real(kind=8) *)&dx + 96B];
...

which ultimately comes from the FE:

          if (S.0 > 4) goto L.2;
          {
            integer(kind=8) S.1;
            integer(kind=8) D.3520;
            integer(kind=8) D.3521;

            D.3520 = S.0 * 4 + -5;
            D.3521 = S.0 * 4 + -5;
            S.1 = 1;
            while (1)
              {
                if (S.1 > 4) goto L.1;
                dx[S.1 + D.3521] = (*d)[S.1 + D.3520].x;
                S.1 = S.1 + 1;
              }
            L.1:;
          }
...

and we do not consider vectorizing this with AVX because of the large stride:

t.f90:14:0: note: Load permutation 0 3 6 9
t.f90:14:0: note: permutation requires at least three vectors _327 = *d_87(D)[_326].x;
t.f90:14:0: note: Build SLP failed: unsupported load permutation dx[_326] = _327;

and with SSE because of the cost model:

t.f90:14:0: note: Cost model analysis:
  Vector inside of loop cost: 14
  Vector prologue cost: 0
  Vector epilogue cost: 8
  Scalar iteration cost: 8
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 1
t.f90:14:0: note: cost model: the vector iteration cost = 14 divided by the scalar iteration cost = 8 is greater or equal to the vectorization factor = 1.
t.f90:14:0: note: not vectorized: vectorization not profitable.
t.f90:14:0: note: not vectorized: vector version will never be profitable.
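
To spell out the "Load permutation 0 3 6 9" note from the AVX attempt: seen as a flat array of doubles, the four .x loads sit at element indices 0, 3, 6 and 9, so with four doubles per AVX vector they fall into three different vectors. That is presumably what "requires at least three vectors" refers to. The following is only a rough C sketch of that counting, not vectorizer code; vectors_touched and load_perm are names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many distinct vectors of `lanes` elements are touched by the
   given element indices.  The indices are assumed to be sorted ascending.
   Hypothetical helper, not part of GCC. */
static size_t vectors_touched(const size_t *idx, size_t n, size_t lanes)
{
    assert(lanes > 0);
    size_t count = 0;
    size_t last = (size_t)-1;          /* no vector seen yet */
    for (size_t i = 0; i < n; i++) {
        size_t v = idx[i] / lanes;     /* which vector holds this element */
        if (v != last) {
            count++;
            last = v;
        }
    }
    return count;
}

/* Load permutation from the dump: the .x fields of four consecutive
   elements, indexed in units of one double. */
static const size_t load_perm[4] = { 0, 3, 6, 9 };
```

Calling vectors_touched(load_perm, 4, 4) models the AVX case with four doubles per vector; passing 2 as the last argument models SSE.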

Only late full unrolling exposes the fact that we could elide dx completely,
say by SRA.
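
As a rough C analogue of what eliding the temporary would mean (a sketch under assumptions, not the testcase from the report): the struct layout with three doubles is inferred from the 24B stride in the dump, the field names y and z are invented, and the summing consumer is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: three real(kind=8) components, 24 bytes per element.
   Only .x appears in the dump; y and z are hypothetical names. */
struct vect3d { double x, y, z; };

enum { N = 16 };

/* What the front end emits: copy every .x into a contiguous temporary. */
static void gather_x(const struct vect3d d[N], double dx[N])
{
    assert(d != NULL && dx != NULL);
    for (size_t i = 0; i < N; i++)
        dx[i] = d[i].x;        /* stride of sizeof(struct vect3d) bytes */
}

/* Illustrative consumer that goes through the stack temporary. */
static double sum_via_temp(const struct vect3d d[N])
{
    double dx[N];
    double s = 0.0;
    gather_x(d, dx);
    for (size_t i = 0; i < N; i++)
        s += dx[i];
    return s;
}

/* The same computation with dx elided: read d[i].x directly. */
static double sum_direct(const struct vect3d d[N])
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += d[i].x;
    return s;
}
```

Both functions compute the same value; the second simply never materializes dx on the stack.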

In this case the vectorizer could consider using strided loads (not sure
if the cost model would be in favor of that idea though).
