https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710

            Bug ID: 69710
           Summary: performance issue with SP Linpack with
                    Autovectorization
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: doug.gilmore at imgtec dot com
  Target Milestone: ---

Created attachment 37614
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37614&action=edit
extracted daxpy example

We've noticed a performance problem in single-precision
Linpack with the MSA patch applied:

https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00177.html

which I have been able to reproduce with ARM Neon.

The problem is that autovectorization generates more induction
variables than necessary for memory references in daxpy (this is an
issue on all architectures).  That is, when the statement:

  dy[i] = dy[i] + da*dx[i];

is vectorized, the vector load of dy[i] uses a different Induction
Variable (IV) than the subsequent vector store to dy[i].  For example,
for ARM Neon after the vect pass we see:

  <bb 12>:
  # i_26 = PHI <i_44(11), i_19(20)>
  # vectp_dy.12_83 = PHI <vectp_dy.13_81(11), vectp_dy.12_84(20)>
  # vectp_dx.15_88 = PHI <vectp_dx.16_86(11), vectp_dx.15_89(20)>
  # vectp_dy.20_96 = PHI <vectp_dy.21_94(11), vectp_dy.20_97(20)>
  # ivtmp_99 = PHI <0(11), ivtmp_100(20)>
  i.0_7 = (unsigned int) i_26;
  _8 = i.0_7 * 4;
  _10 = dy_9(D) + _8;
  vect__12.14_85 = MEM[(float *)vectp_dy.12_83];
  _12 = *_10;
  _14 = dx_13(D) + _8;
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  _15 = *_14;
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  _16 = da_6(D) * _15;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  _17 = _12 + _16;
  MEM[(float *)vectp_dy.20_96] = vect__17.19_93;
  i_19 = i_26 + 1;
  vectp_dy.12_84 = vectp_dy.12_83 + 16;
  vectp_dx.15_89 = vectp_dx.15_88 + 16;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp_100 = ivtmp_99 + 1;
  if (ivtmp_100 < bnd.9_53)
    goto <bb 20>;
  else
    goto <bb 15>;
...
  <bb 20>:
  goto <bb 12>;
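
For reference, the scalar loop being vectorized is essentially the
following (a minimal sketch modeled on the statement above; the
attached example is the authoritative version, and the function name
is just illustrative):

  void saxpy (int n, float da, float *dx, float *dy)
  {
    int i;
    /* dx and dy are not restrict-qualified, so the vectorizer must
       guard against overlap (e.g. via a runtime alias check).  */
    for (i = 0; i < n; i++)
      dy[i] = dy[i] + da * dx[i];
  }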

Note that the use of separate IVs for the load and the store off of dy
(vectp_dy.12 vs. vectp_dy.20 above) can introduce a false memory
dependency, which causes poor scheduling after unrolling.  From what I
have seen so far, for double precision the ivopts pass is able to clean
up the induction variables so that the false memory dependency is
removed.  However, the cleanup does not happen for single precision.
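
At the source level, the cleanup ivopts performs in the double
precision case amounts to using a single pointer for both the load
and the store to dy, roughly as follows (an illustrative sketch only,
assuming n >= 0; the function name is made up):

  void saxpy_single_iv (int n, float da, float *dx, float *dy)
  {
    float *py = dy;
    float *end = dy + n;
    /* One IV (py) serves both the access that reads dy[i] and the
       store that writes it back, so the scheduler does not see two
       distinct IVs for the same array.  */
    for (; py < end; dx++, py++)
      *py = *py + da * *dx;
  }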

Attached is a simple example for single precision; more to follow.
