On 03/03/2020 15:57, Richard Sandiford wrote:
Andrew Stubbs <a...@codesourcery.com> writes:
Hi all,

Up until now the AMD GCN port has been using exclusively 64-lane vectors
with masking for smaller sizes.

This works quite well, where it works, but there remain many test cases
(and no doubt some real code) that refuse to vectorize because the
number of iterations (or SLP equivalent) is smaller than the
vectorization factor.

My question is: are there any plans to fill in these missing cases? Or,
is relying on masking alone just not feasible?

This is supported for loop vectorisation.  E.g.:

   void f (short *x) { for (int i = 0; i < 7; ++i) x[i] += 1; }

generates:

         ptrue   p0.h, vl7
         ld1h    z0.h, p0/z, [x0]
         add     z0.h, z0.h, #1
         st1h    z0.h, p0, [x0]
         ret

Yes, this works on GCN, albeit not quite so prettily:

     s_mov_b64       exec, -1
     v_mov_b32       v0, 0
     s_mov_b64       exec, 127
     flat_load_ushort        v0, v[4:5]
     s_waitcnt       0
     s_mov_b64       exec, -1
     v_add_u32       v0, vcc, 1, v0
     s_mov_b64       exec, 127
     flat_store_short        v[4:5], v0
     s_setpc_b64     s[18:19]

for SVE.  BB SLP is on the wish-list for GCC 11, but no promises. :-)
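
For anyone reading along, here is a rough scalar model in C of what the masked versions above are doing. It is only a sketch: lane_mask and masked_add_one are hypothetical names made up for illustration, not anything in GCC or the GCN backend, and the model folds the separate exec settings for the load, add and store into a single mask. The point is just that a mask of 127 (the low 7 bits set) lets a 64-lane operation touch exactly 7 elements.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_LANES 64   /* 64-lane vectors, as described for GCN in this thread */

/* Hypothetical helper: an exec-style mask with the low n lanes enabled. */
static uint64_t lane_mask(unsigned n)
{
    assert(n <= WAVE_LANES);
    /* Shifting a 64-bit value by 64 is undefined, so handle n == 64 apart. */
    return n == WAVE_LANES ? ~UINT64_C(0) : (UINT64_C(1) << n) - 1;
}

/* Scalar model of a masked 64-lane "x[i] += 1".
   Inactive lanes never read or write memory. */
static void masked_add_one(short *x, uint64_t exec)
{
    for (unsigned lane = 0; lane < WAVE_LANES; ++lane)
        if ((exec >> lane) & 1)
            x[lane] = (short)(x[lane] + 1);
}
```

Calling masked_add_one (x, lane_mask (7)) on an array of at least 7 shorts increments x[0] through x[6] and leaves everything after that alone, which is the behaviour the 7-iteration loop asks for.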

Early peeling/complete unrolling can cause loops to be straight-line
code by the time the vectoriser sees them.  E.g. the loop above doesn't
use masked SVE for "i < 3".
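
To illustrate the "i < 3" case, the two functions below are equivalent. This is only a sketch of the kind of straight-line code that complete unrolling can leave behind; the function names are made up, and whether a given GCC version and set of options really unrolls this loop before vectorisation is an assumption here, not something checked.

```c
/* The loop as written in the source. */
void f3_loop(short *x)
{
    for (int i = 0; i < 3; ++i)
        x[i] += 1;
}

/* Roughly what is left once the loop has been completely unrolled:
   no loop remains, so only SLP could vectorise these three statements. */
void f3_unrolled(short *x)
{
    x[0] += 1;
    x[1] += 1;
    x[2] += 1;
}
```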

Which kind of cases fail for GCN?

Certainly SLP accounts for many of them; gfortran.dg/vect/pr62283-2.f says "unsupported data-type real(kind=4)", which I think is another way of saying it wants a vector of precisely 4 elements.

For loops, examples are gcc.dg/vect/vect-reduc-1char.c and its relations. The "big-array" variants of the same tests vectorize just fine.

Andrew
