Hi Steve,
On Mon, 6 Mar 2017, Steve Ellcey wrote:
> I was looking at the spec 456.hmmer benchmark and this email string
> from Jeff Law and Michael Matz:
>
> https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01970.html
>
> and was wondering if anyone was looking at what more it would take
> for GCC to vectorize the loop in P7Viterbi.
It takes what I wrote in there. There are two important things that need
to happen to get the best performance (at least from an analysis I did in
2011, but nothing material should have changed since then):
(1) loop distribution to make some memory streams vectorizable (and leave
the others in non-vectorized form).
(1a) loop splitting based on the conditional (to remove the k < M check
from the loop body).
(2) carrying the dc[k-1] value in a scalar across the dc[] loop (the
transformation shown further below).

After distribution and splitting, the body of the P7Viterbi inner loop
looks roughly like so:

for (k = 1; k < M; k++) {
  mc[k] = mpp[k-1] + tpmm[k-1];
  if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
  if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
  mc[k] += ms[k];
  if (mc[k] < -INFTY) mc[k] = -INFTY;
}
for (k = 1; k < M; k++) {
  dc[k] = dc[k-1] + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  if (dc[k] < -INFTY) dc[k] = -INFTY;
}
for (k = 1; k < M; k++) {
  ic[k] = mpp[k] + tpmi[k];
  if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
  ic[k] += is[k];
  if (ic[k] < -INFTY) ic[k] = -INFTY;
}
/* last iteration of original loop */
k = M;
mc[k] = mpp[k-1] + tpmm[k-1];
if ((sc = ip[k-1] + tpim[k-1]) > mc[k]) mc[k] = sc;
if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
mc[k] += ms[k];
if (mc[k] < -INFTY) mc[k] = -INFTY;
dc[k] = dc[k-1] + tpdd[k-1];
if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
if (dc[k] < -INFTY) dc[k] = -INFTY;
(Note again that this is only valid with disambiguation.)  Adding
restrict qualifiers at the top of the routine like so:
#define R __restrict
int * R mc, * R dc, * R ic;    /* pointers to rows of mmx, dmx, imx */
int * R ms, * R is;            /* pointers to msc[i], isc[i] */
int * R mpp, * R mpc, * R ip;  /* ptrs to mmx[i-1], mmx[i], imx[i-1] */
int * R bp;                    /* ptr into bsc[] */
int * R ep;                    /* ptr into esc[] */
int * R dpp;                   /* ptr into dmx[i-1] (previous row) */
int * R tpmm, * R tpmi, * R tpmd, * R tpim, * R tpii, * R tpdm, * R tpdd;
                               /* ptrs into tsc */
helps to vectorize this. To get the rest of the performance, this
transformation also needs to happen on the dc[] loop:
dctemp = dc[0];
for (k = 1; k < M; k++) {
  dctemp = dctemp + tpdd[k-1];
  if ((sc = mc[k-1] + tpmd[k-1]) > dctemp) dctemp = sc;
  if (dctemp < -INFTY) dctemp = -INFTY;
  dc[k] = dctemp;
}
Our loop distribution should actually already be able to split off the
three memory streams when restrict is added everywhere; in the 2011 time
frame it nevertheless didn't do so (and I haven't looked at whether it
would be able to do that now).
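
To make concrete what that splitting means, here is a toy example (made-up
names, not the hmmer code) of the transformation loop distribution
performs: one loop touching two independent streams becomes two loops, so
the vectorizable stream can be handled on its own:

/* Toy example, made-up names (not the hmmer code): one loop writes two
   independent streams; distribution splits it so the vectorizable
   stream is handled on its own.  */
void toy (int * __restrict a, int * __restrict b,
          const int * __restrict c, int n)
{
  for (int i = 1; i < n; i++) {
    a[i] = a[i] + c[i];     /* independent per iteration: vectorizable */
    b[i] = b[i-1] + c[i];   /* loop-carried dependence: stays scalar */
  }
}

/* After distribution: */
void toy_distributed (int * __restrict a, int * __restrict b,
                      const int * __restrict c, int n)
{
  for (int i = 1; i < n; i++)
    a[i] = a[i] + c[i];
  for (int i = 1; i < n; i++)
    b[i] = b[i-1] + c[i];
}

In hmmer the same idea applies to the mc[]/dc[]/ic[] streams shown above.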
Predictive commoning could do the dc[] transformation (part (2)), except
that it can't without disambiguation. The fact that adding restrict
doesn't help here is PR50419, but ultimately it would have to work on the
disambiguated loop (without the restrict pointers).
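
To see why predictive commoning needs that disambiguation, consider a toy
loop (made-up names, not the PR50419 testcase): keeping p[i-1] in a scalar
across iterations is only valid if the other store in the loop cannot
clobber it:

/* Toy example, made-up names (not the PR50419 testcase).  Keeping
   p[i-1] in a scalar is only valid when the store through q cannot
   clobber it.  */
void pcom_before (int *p, int *q, const int *a, const int *b, int n)
{
  for (int i = 1; i < n; i++) {
    p[i] = p[i-1] + a[i];
    q[i] = b[i];            /* may alias p[i-1]: blocks the rewrite */
  }
}

/* Only with p and q disambiguated may the reload of p[i-1] be replaced
   by the value computed in the previous iteration (assumes n >= 1 so
   that p[0] may be read).  */
void pcom_after (int *p, int *q, const int *a, const int *b, int n)
{
  int tmp = p[0];
  for (int i = 1; i < n; i++) {
    tmp = tmp + a[i];
    p[i] = tmp;
    q[i] = b[i];
  }
}

The dctemp rewrite above is the same idea applied to the dc[] stream.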
So really the prerequisite for optimizing hmmer is disambiguating the
loop's memory accesses, even with the many streams (and hence runtime
checks) involved. And it needs to happen well before the loop vectorizer,
because loop splitting and distribution, _and_ predictive commoning all
have that disambiguation as a prerequisite in this testcase.
After that, loop distribution needs to be looked at to see why it doesn't
want to distribute the streams, and then a variant of PR50419 needs to be
fixed based on disambiguation info (not based on restrict). For that we
need infrastructure that lets us disambiguate memory accesses in the
"good" version after loop nest versioning has happened.
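
For illustration, roughly the shape such versioning takes (a hand-written
sketch with made-up names; the checks GCC actually emits are more
involved):

#include <stdint.h>

/* Hand-written sketch, made-up names: the loop is duplicated under a
   runtime overlap test, and only in the "good" copy may later passes
   treat dst[] and src[] as non-overlapping.  */
void versioned (int *dst, const int *src, int n)
{
  uintptr_t d = (uintptr_t) dst, s = (uintptr_t) src;
  if (d + n * sizeof (int) <= s || s + n * sizeof (int) <= d) {
    /* "good" version: dst and src proven disjoint at runtime.  */
    for (int i = 1; i < n; i++)
      dst[i] = dst[i-1] + src[i];
  } else {
    /* fallback version: possible overlap, keep the conservative code.  */
    for (int i = 1; i < n; i++)
      dst[i] = dst[i-1] + src[i];
  }
}

The infrastructure question is how to record, for passes running after the
versioning, that in the first copy the overlap test is known to have
succeeded.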
Ciao,
Michael.