https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69710
Bug ID: 69710
Summary: performance issue with SP Linpack with Autovectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: doug.gilmore at imgtec dot com
Target Milestone: ---

Created attachment 37614
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37614&action=edit
extracted daxpy example

We've noticed a performance problem in single-precision Linpack with the MSA patch applied:

https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00177.html

which I have been able to reproduce with ARM Neon. The problem is that autovectorization generates more induction variables than necessary for the memory references in daxpy (this is an issue on all architectures). That is, when the statement:

    dy[i] = dy[i] + da*dx[i];

is vectorized, the vector load of dy[i] uses a different induction variable (IV) than the subsequent vector store to dy[i]. For example, for ARM Neon after the vect pass we see:

  <bb 12>:
  # i_26 = PHI <i_44(11), i_19(20)>
  # vectp_dy.12_83 = PHI <vectp_dy.13_81(11), vectp_dy.12_84(20)>
  # vectp_dx.15_88 = PHI <vectp_dx.16_86(11), vectp_dx.15_89(20)>
  # vectp_dy.20_96 = PHI <vectp_dy.21_94(11), vectp_dy.20_97(20)>
  # ivtmp_99 = PHI <0(11), ivtmp_100(20)>
  i.0_7 = (unsigned int) i_26;
  _8 = i.0_7 * 4;
  _10 = dy_9(D) + _8;
  vect__12.14_85 = MEM[(float *)vectp_dy.12_83];
  _12 = *_10;
  _14 = dx_13(D) + _8;
  vect__15.17_90 = MEM[(float *)vectp_dx.15_88];
  _15 = *_14;
  vect__16.18_92 = vect_cst__91 * vect__15.17_90;
  _16 = da_6(D) * _15;
  vect__17.19_93 = vect__12.14_85 + vect__16.18_92;
  _17 = _12 + _16;
  MEM[(float *)vectp_dy.20_96] = vect__17.19_93;
  i_19 = i_26 + 1;
  vectp_dy.12_84 = vectp_dy.12_83 + 16;
  vectp_dx.15_89 = vectp_dx.15_88 + 16;
  vectp_dy.20_97 = vectp_dy.20_96 + 16;
  ivtmp_100 = ivtmp_99 + 1;
  if (ivtmp_100 < bnd.9_53)
    goto <bb 20>;
  else
    goto <bb 15>;
...
  <bb 20>:
  goto <bb 12>;

Note that using a separate IV for the load and the store off of dy can introduce a false memory dependency, which causes poor scheduling after unrolling. From what I have seen so far, for double precision the ivopts phase is able to clean up the induction variables so that the false memory dependency is removed. However, the cleanup does not happen for single precision. Attached is a simple example for single precision; more to follow.