https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #7)
> I will have a look (and for run validation try to reproduce with gfx1036).

OK, so with gfx1036 we end up using 16-byte vectors and the testcase
passes.  The difference compared to gfx908 is

/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
note:   ==> examining statement: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
note:   vect_model_load_cost: aligned.
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
note:   vect_model_load_cost: inside_cost = 2, prologue_cost = 0 .

vs.

/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
note:   ==> examining statement: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
missed:   unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10
10 11 11 12 12 13 13 14 14 15 15 }
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
missed:   unsupported load permutation
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72:
missed:   not vectorized: relevant stmt not supported: _14 = aa[_13];
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
note:   removing SLP instance operations starting from: REALPART_EXPR
<(*hadcur_24(D))[_2]> = _86;
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
missed:  unsupported SLP instances
/space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
note:  re-trying with SLP disabled

so gfx1036 cannot do such permutes but gfx908 can?
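
For reference, a scalar sketch of what that load permutation asks the
target for (names and the element count are mine, not GCC internals):
lane k of the permuted vector reads loaded element k / 2, i.e. every
element of aa is duplicated into two adjacent lanes, one for the real
and one for the imaginary part of the complex multiply.

  /* Scalar model of the permute { 0 0 1 1 ... 15 15 } from the dump:
     result lane k reads loaded element perm[k] = k / 2.  */
  #include <stdio.h>

  int
  main (void)
  {
    float loaded[16], permuted[32];
    for (int k = 0; k < 16; k++)
      loaded[k] = (float) k;
    for (int k = 0; k < 32; k++)
      permuted[k] = loaded[k / 2];
    for (int k = 0; k < 32; k++)
      printf ("%g ", permuted[k]);
    printf ("\n");
    return 0;
  }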

On aarch64 with SVE we use non-SLP and do load-lanes in the outer
loop.  The reason also seems to be the unsupported load permutation,
but that is possibly because of VLA vectors - GCN uses fixed-size
vectors but with loop masking.  So the better equivalent would have
been x86-64 with loop masking.

So looking again I think the loop mask in the inner loop is wrong.  We have

      do i = 1,4
         do j = 1,4
            HADCUR(I)=
     $         HADCUR(I)+CMPLX(COEF1)*FORM1*AA(I,J)
         end do
      end do
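
In C terms, with Fortran's column-major AA(4,4) flattened by hand, the
nest is the sketch below (the float/complex types are my assumption);
the point is that the zero-based index i + 4*j - 5 is exactly the _13
computed in the GIMPLE that follows.

  /* Hand-flattened C version of the nest above: AA(I,J) in a
     column-major 4x4 array is aa[(i - 1) + (j - 1) * 4], i.e.
     aa[i + 4 * j - 5] with 1-based i and j.  */
  #include <complex.h>

  void
  hadcur_kernel (float complex hadcur[4], const float aa[16],
                 float coef1, float complex form1)
  {
    for (int i = 1; i <= 4; i++)
      for (int j = 1; j <= 4; j++)
        hadcur[i - 1] += coef1 * form1 * aa[i + 4 * j - 5];
  }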

The vectorizer sees

  <bb 3> [local count: 214748368]:
  # i_35 = PHI <i_27(7), 1(2)>
  # ivtmp_82 = PHI <ivtmp_81(7), 4(2)>
  _1 = (integer(kind=8)) i_35;
  _2 = _1 + -1;
  hadcur__I_RE_lsm.15_8 = REALPART_EXPR <(*hadcur_24(D))[_2]>;
  hadcur__I_IM_lsm.16_9 = IMAGPART_EXPR <(*hadcur_24(D))[_2]>;

  <bb 4> [local count: 858993456]:
  # j_36 = PHI <j_26(8), 1(3)>
...
  _10 = (integer(kind=8)) j_36;
  _11 = _10 * 4;
  _12 = _1 + _11;
  _13 = _12 + -5;
  _14 = aa[_13];
...
  j_26 = j_36 + 1;

  <bb 5> [local count: 214748368]:
  # _86 = PHI <_49(4)>
  # _85 = PHI <_50(4)>
  REALPART_EXPR <(*hadcur_24(D))[_2]> = _86;
  IMAGPART_EXPR <(*hadcur_24(D))[_2]> = _85;
  i_27 = i_35 + 1;

the loop mask { -1, -1, -1, -1, -1, -1, -1, -1, 0, .... } is OK for
the outer loop grouped load

  vect_hadcur__I_RE_lsm.20_76 = .MASK_LOAD (vectp_hadcur.18_79, 64B,
loop_mask_77);
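
As a reading aid, a scalar model of the .MASK_LOAD semantics assumed
here (my model, not the GCN definition): lane l is loaded iff its mask
bit is set.  The outer loop runs 4 iterations and hadcur is complex, so
the grouped load covers 4 * 2 = 8 float lanes - matching the 8 set
lanes of the mask.

  /* Scalar model of a masked load: lane l is read iff mask[l] is set,
     otherwise the destination lane is left as zero here.  */
  void
  mask_load (float *dst, const float *src, const signed char *mask,
             int nlanes)
  {
    for (int l = 0; l < nlanes; l++)
      dst[l] = mask[l] ? src[l] : 0.0f;
  }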

But for the inner loop we do

  vect__14.23_71 = .MASK_LOAD (vectp_aa.21_73, 64B, loop_mask_77);

with the same mask.  This fails to be pruned for the gap, which means
that my improvement of the gap handling relies on this case not ending
up in the masked load handling.  In fact get_group_load_store_type
doesn't seem to be prepared for outer loop vectorization.  OTOH the
inner loop isn't "unrolled" (it has a VF of 1), so this might be a
mistake in the loop mask handling and bad re-use of the mask.
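
To make the suspected bad re-use concrete, a sketch under my own
assumptions (a model, not GCC's mask bookkeeping; 64 lanes assumed for
GCN): the number of active lanes depends on how many vector elements an
access touches per scalar iteration, so the hadcur group and the aa
load need different masks.

  /* Model: build the lane mask for an access touching NELTS elements
     per scalar iteration, with REMAINING iterations left.  */
  void
  build_mask (signed char *mask, int nlanes, int nelts, int remaining)
  {
    for (int l = 0; l < nlanes; l++)
      mask[l] = (l / nelts) < remaining ? -1 : 0;
  }

  /* build_mask (m, 64, 2, 4) -> { -1 x 8, 0 ... }  (hadcur group)
     build_mask (m, 64, 1, 4) -> { -1 x 4, 0 ... }  (aa load)
     Reusing the first mask for the aa load enables four lanes too
     many, which is the suspected mistake.  */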

As was said elsewhere, outer loop vectorization with inner loop
datarefs compensates for a missed loop interchange (see the sketch
below).
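
For illustration, the interchanged nest (a sketch, reusing the
hand-flattened C form from above) makes the i loop innermost with unit
stride in both hadcur and aa, so it vectorizes as a plain inner loop:

  /* Interchanged nest: the per-element accumulation order into
     hadcur[i - 1] is unchanged, and the innermost loop now has unit
     stride in both arrays.  */
  #include <complex.h>

  void
  hadcur_kernel_interchanged (float complex hadcur[4],
                              const float aa[16],
                              float coef1, float complex form1)
  {
    for (int j = 1; j <= 4; j++)
      for (int i = 1; i <= 4; i++)
        hadcur[i - 1] += coef1 * form1 * aa[i + 4 * j - 5];
  }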
