On 03/03/2020 15:57, Richard Sandiford wrote:
Andrew Stubbs <a...@codesourcery.com> writes:
Hi all,
Up until now the AMD GCN port has been using exclusively 64-lane vectors
with masking for smaller sizes.
This works quite well, where it works, but there remain many test cases
(and no doubt some real code) that refuse to vectorize because the
number of iterations (or SLP equivalent) is smaller than the
vectorization factor.
My question is: are there any plans to fill in these missing cases? Or,
is relying on masking alone just not feasible?
This is supported for loop vectorisation. E.g.:
    void f (short *x) { for (int i = 0; i < 7; ++i) x[i] += 1; }
generates:
    ptrue   p0.h, vl7
    ld1h    z0.h, p0/z, [x0]
    add     z0.h, z0.h, #1
    st1h    z0.h, p0, [x0]
    ret
Yes, this works on GCN, albeit not quite so prettily:
    s_mov_b64       exec, -1
    v_mov_b32       v0, 0
    s_mov_b64       exec, 127
    flat_load_ushort        v0, v[4:5]
    s_waitcnt       0
    s_mov_b64       exec, -1
    v_add_u32       v0, vcc, 1, v0
    s_mov_b64       exec, 127
    flat_store_short        v[4:5], v0
    s_setpc_b64     s[18:19]
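For readers less familiar with GCN: exec is a 64-bit mask with one bit per lane, and 127 (0x7f) sets the low seven bits, so only lanes 0-6 take part in the load and the store. A rough scalar C sketch of the net effect follows. The name masked_add1 and the LANES macro are made up for illustration, and this is not what GCC emits.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 64  /* GCN vector width discussed in this thread */

/* Hypothetical helper, for illustration only: add 1 to each 16-bit
   element whose lane bit is set in exec_mask.  Elements belonging to
   inactive lanes are never read or written.  */
void masked_add1 (uint16_t *x, uint64_t exec_mask)
{
  for (size_t lane = 0; lane < LANES; ++lane)
    if ((exec_mask >> lane) & 1)
      x[lane] += 1;
}
```

Calling masked_add1 (x, 127) updates x[0] to x[6] and leaves everything else alone, which matches the seven-iteration loop. In the real code above the add itself runs with exec set to -1; only the memory accesses are masked.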
for SVE. BB SLP is on the wish-list for GCC 11, but no promises. :-)
Early peeling/complete unrolling can cause loops to be straight-line
code by the time the vectoriser sees them. E.g. the loop above doesn't
use masked SVE for "i < 3".
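A small test file for comparing the two trip counts might look like the sketch below. The function names f7 and f3 are invented here; only the loop body comes from the example in this thread. Compiling with something like -O3 -march=armv8.2-a+sve and -fdump-tree-vect-details should show whether each function still reaches the vectoriser as a loop, though the exact outcome will depend on the GCC version and options.

```c
/* Trip count 7: the example from this thread, which the vectoriser
   still sees as a loop.  */
void f7 (short *x)
{
  for (int i = 0; i < 7; ++i)
    x[i] += 1;
}

/* Trip count 3: per the message above, early peeling/complete
   unrolling may turn this into straight-line code before the
   vectoriser runs, so masked SVE is not used.  */
void f3 (short *x)
{
  for (int i = 0; i < 3; ++i)
    x[i] += 1;
}
```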
Which kinds of cases fail for GCN?
Certainly SLP accounts for many of them; gfortran.dg/vect/pr62283-2.f
says "unsupported data-type real(kind=4)", which I think is another way
of saying it wants a vector of precisely 4 elements.
For loops, examples are gcc.dg/vect/vect-reduc-1char.c and its
relations. The "big-array" variants of the same tests vectorize just fine.
Andrew