On 14/06/2021 11:57, Richard Biener wrote:
On Mon, 14 Jun 2021, Richard Biener wrote:

Indeed. For example a simple
int a[1024], b[1024], c[1024];

void foo(int n)
{
   for (int i = 0; i < n; ++i)
     a[i+1] += c[i+i] ? b[i+1] : 0;
}

should usually see peeling for alignment (though on x86 you need
exotic -march= since cost models generally have equal aligned and
unaligned access costs).  For example with -mavx2 -mtune=atom
we'll see an alignment peeling prologue, an AVX2 vector loop,
an SSE2 vectorized epilogue and a scalar epilogue.  It also
shows the original scalar loop being used in the scalar prologue
and epilogue.

We're not even trying to make the counting IV easily used
across loops (we're not counting scalar iterations in the
vector loops).
Specifically we see

<bb 33> [local count: 94607391]:
niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
_67 = niters_vector_mult_vf.10_62 + 7;
_64 = (int) niters_vector_mult_vf.10_62;
tmp.11_63 = i_43 + _64;
if (niters.8_45 == niters_vector_mult_vf.10_62)
   goto <bb 37>; [12.50%]
else
   goto <bb 36>; [87.50%]

after the main vect loop, recomputing the original IV (i) rather
than using the inserted canonical IV.  And then the vectorized
epilogue header check doing

<bb 36> [local count: 93293400]:
# i_59 = PHI <tmp.11_63(33), 0(18)>
# _66 = PHI <_67(33), 0(18)>
_96 = (unsigned int) n_10(D);
niters.26_95 = _96 - _66;
_108 = (unsigned int) n_10(D);
_109 = _108 - _66;
_110 = _109 + 4294967295;
if (_110 <= 3)
   goto <bb 47>; [10.00%]
else
   goto <bb 40>; [90.00%]

re-computing everything from scratch again (also notice how
the main vect loop guard jumps around the alignment prologue
as well and lands here - and the vectorized epilogue using
unaligned accesses - good!).
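
As a rough aid for reading the two dumps above, the bookkeeping on the
edge from bb 33 into bb 36 can be modelled in plain C. This is only a
sketch of one possible reading, not vectorizer code: it assumes the
"<< 3" is a multiplication by a main-loop VF of 8, that the "+ 7" is
the number of scalar iterations peeled by the alignment prologue, and
that bb 47 is the path that skips the vectorized epilogue. The struct
and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions taken from the dump: "<< 3" (VF of 8), "+ 7" (peeled
   prologue iterations) and the "<= 3" bound in the epilogue guard.  */
enum { VF_SHIFT = 3, PEELED = 7, EPILOGUE_LIMIT = 3 };

/* Hypothetical record of what bb 36 computes on entry.  */
struct epilogue_entry {
  uint32_t done;           /* _66: scalar iterations already executed */
  uint32_t remaining;      /* niters.26_95: n - _66 */
  int skip_vect_epilogue;  /* outcome of the (_110 <= 3) test */
};

/* Models the edge bb 33 -> bb 36.  bnd is the number of main vector
   loop iterations (bnd.9_61 in the dump).  */
struct epilogue_entry epilogue_entry_after_main(uint32_t n, uint32_t bnd)
{
  struct epilogue_entry e;
  uint32_t niters_vector_mult_vf = bnd << VF_SHIFT;

  e.done = niters_vector_mult_vf + PEELED;          /* _67 */
  assert(e.done <= n);                              /* sanity check for this model only */
  e.remaining = n - e.done;                         /* _96 - _66 */

  /* _110 = _109 + 4294967295, i.e. remaining - 1 in 32-bit unsigned
     arithmetic; the subtraction is recomputed from n in the dump.  */
  uint32_t remaining_minus_one = e.remaining + UINT32_C(4294967295);
  e.skip_vect_epilogue = remaining_minus_one <= EPILOGUE_LIMIT;
  return e;
}
```

Under these assumptions, n = 100 with bnd = 11 gives 95 iterations done
and 5 remaining, so the guard does not skip the vectorized epilogue.
The point of the model is that everything is rederived from n and the
iteration counts of earlier loops rather than read off a shared IV.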

That is, I'd expect _much_ easier jobs if we'd manage to
track the number of performed scalar iterations (or the
number of scalar iterations remaining) using the canonical
IV we add to all loops across all of the involved loops.

Richard.


So I am now looking at using an IV that counts scalar iterations rather than vector iterations and reusing it across all loops (prologue, main loop, vect_epilogue and scalar epilogue). The first part is easy, since that's what we already do for partial vectors or non-constant VFs. The second requires some plumbing and removing a lot of the code in there that creates new IVs going from [0, niters - previous iterations].

I don't yet have a clear-cut view of how to do this. I first thought of keeping track of the 'control' IV in the loop_vinfo, but the prologue and scalar epilogues won't have one. 'loop' keeps a control_ivs struct, but that is used for overflow detection and only keeps track of what looks like a constant 'base' and 'step'. I'm not quite sure how all that works, but intuitively it doesn't seem like the right thing to reuse.
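
To make the idea concrete at the source level, here is a minimal C sketch of the four loops sharing one counter of executed scalar iterations. It is an illustration only and not how the vectorizer is implemented: the function name foo_shared_iv, the stage_counts struct, the peel parameter and the VF values 8 and 4 are assumptions, and "vector" iterations are emulated by running VF scalar iterations. It also does not model the main loop guard that jumps around the prologue.

```c
#include <assert.h>

enum { N = 1024, VF_MAIN = 8, VF_EPILOGUE = 4 };  /* assumed VFs (AVX2, SSE2 with int) */

int a[N], b[N], c[N];

/* One scalar iteration of the loop body from the example.  */
static void body(int i)
{
  a[i + 1] += c[i + i] ? b[i + 1] : 0;
}

/* Stand-in for one vector iteration: vf consecutive scalar iterations.  */
static void vector_body(int i, int vf)
{
  for (int lane = 0; lane < vf; ++lane)
    body(i + lane);
}

/* Hypothetical record of how many iterations each loop executed.  */
struct stage_counts {
  unsigned prologue, main_loop, vect_epilogue, scalar_epilogue;
};

struct stage_counts foo_shared_iv(int n, unsigned peel)
{
  struct stage_counts s = { 0, 0, 0, 0 };
  assert(n >= 0 && n <= N / 2);   /* keeps c[i + i] and a[i + 1] in bounds */

  unsigned niters = (unsigned) n;
  unsigned done = 0;              /* the single IV: scalar iterations executed so far */

  /* Alignment prologue: scalar iterations until peel are done.  */
  for (; done < niters && s.prologue < peel; ++done, ++s.prologue)
    body((int) done);

  /* Main vector loop: the guard only needs niters and done.  */
  for (; niters - done >= VF_MAIN; done += VF_MAIN, ++s.main_loop)
    vector_body((int) done, VF_MAIN);

  /* Vectorized epilogue: same IV, smaller step.  */
  for (; niters - done >= VF_EPILOGUE; done += VF_EPILOGUE, ++s.vect_epilogue)
    vector_body((int) done, VF_EPILOGUE);

  /* Scalar epilogue: whatever is left.  */
  for (; done < niters; ++done, ++s.scalar_epilogue)
    body((int) done);

  return s;
}
```

With n = 100 and peel = 7 this runs 7 prologue iterations, 11 main loop iterations, 1 vectorized epilogue iteration and 1 scalar epilogue iteration. Every guard is expressed as niters - done, so no loop has to rebuild a fresh IV from the iteration counts of the loops before it.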

I'll go hack around and keep you posted on progress.

Regards,
Andre
