Richard,

Thanks a lot for the response! I will definitely
try the constructor approach.

You mentioned non-grouped store. Is the handling
of such stores actually there and I just missed
it? It was a big hassle for me as I didn't manage
to find it and assumed there isn't one.

I have a question (a lot of them, though, but this
one is bothering me most). It is actually paragraph
4 of my previous letter. In real world code there can
be a case that the loading of the elements and the use
of them (for a reduction) are in different BBs (I saw
this myself). So, not only it complicates the things
in general but for me it breaks some SLP code assuming
single BB operation (IIRC, some dataref analysis phase
assumes single BB). Did anybody consider this before?

OK, I know I start looking kind of stubborn here but
what about the case:

foo(A[0]+...A[n])

There won't be any store here by definition while a
reduction will. Or is it something too rarely seen?

--
  Thanks,
  Anton


On 24/1/2019 13:47, Richard Biener wrote:
On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch
<anton.youdkevi...@bell-sw.com> wrote:

Here is the prototype for doing vectorized reduction
using SLP approach. I would appreciate feedback if this
is a feasible approach and if overall the direction is
right.

The idea is to vectorize reduction like this

S = A[0]+A[1]+...A[N];

into

Sv = Av[0]+Av[1]+...+Av[N/VL];


So that, for instance, the following code:

typedef double T;
T sum;

void foo (T*  __restrict__ a)
{
         sum = a[0]+ a[1] + a[2]+ a[3] + a[4]+ a[5] + a[6]+ a[7];
}


instead of:

foo:
.LFB23:
         .cfi_startproc
         movsd   (%rdi), %xmm0
         movsd   16(%rdi), %xmm1
         addsd   8(%rdi), %xmm0
         addsd   24(%rdi), %xmm1
         addsd   %xmm1, %xmm0
         movsd   32(%rdi), %xmm1
         addsd   40(%rdi), %xmm1
         addsd   %xmm1, %xmm0
         movsd   48(%rdi), %xmm1
         addsd   56(%rdi), %xmm1
         addsd   %xmm1, %xmm0
         movsd   %xmm0, sum2(%rip)
         ret
         .cfi_endproc


be compiled into:

foo:
.LFB11:
         .cfi_startproc
         movupd  32(%rdi), %xmm0
         movupd  48(%rdi), %xmm3
         movupd  (%rdi), %xmm1
         movupd  16(%rdi), %xmm2
         addpd   %xmm3, %xmm0
         addpd   %xmm2, %xmm1
         addpd   %xmm1, %xmm0
         haddpd  %xmm0, %xmm0
         movlpd  %xmm0, sum(%rip)
         ret
         .cfi_endproc


As this is a very crude prototype there are some things
to consider.

1. As the current SLP framework assumes presence of
group stores I cannot use directly it as reduction
does not require group stores (or even stores at all),
so, I'm partially using the existing functionality but
sometimes I have to create a stripped down version
of it for my own needs;

2. The current version considers only PLUS reduction
as it is encountered most often and therefore is the
most practical;

3. While normally SLP transformation should operate
inside single basic block this requirement greatly
restricts it's practical application as in a code
complex enough there will be vectorizable subexpressions
defined in basic block(s) different from that where the
reduction result resides. However, for the sake of
simplicity only single uses in the same block are
considered now;

4. For the same sake the current version does not deal
with partial reductions which would require partial sum
merging and careful removal of the scalars that participate
in the vector part. The latter gets done automatically
by DCE in the case of full reduction vectorization;

5. There is no cost model yet for the reasons mentioned
in the paragraphs 3 and 4.

First sorry for the late response.

No, I don't think your approach of bypassing the "rest"
is OK.  I've once started to implement BB reduction
support but somehow got distracted IIRC.  Your testcase
(and the prototype I worked on) still has a (scalar non-grouped)
store which can be keyed on in SLP discovery phase.

You should be able to "re-use" (by a lot of refactoring I guess)
the reduction finding code (vect_is_slp_reduction) to see
wheter such a store is fed by a reduction chain.  Note that
if you adjust the testcase to have

  sum[0] = a[0] + ... + a[n];
  sum[1] = b[0] + ... + b[n];

then you'll have a grouped store fed by reductions.  You
can also consider

  for (i = ...)
   {
      sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
   }

which we should be able to handle.

So the whole problem of doing BB reductions boils down
to SLP tree discovery, the rest should be more straight-forward
(of course code-gen needs to be adapted for the non-loop case
as well).

It's not the easiest problem you try to tackle btw ;)  May
I suggest you become familiar with the code by BB vectorizing
vector CONSTRUCTORs instead?

typedef int v4si __attribute__((vector_size(16)));

v4si foo (int *i, *j)
{
   return (v4si) { i[0] + j[0], i[1] + j[1], i[2] + j[2], i[3] + j[3] };
}

it has the same SLP discovery "issue", this time somewhat
easier as a CONSTRUCTOR directly plays the role of the
"grouped store".

Richard.

Thanks in advance.

--
   Anton

Reply via email to