Re: SLP-based reduction vectorization

Richard Biener Thu, 24 Jan 2019 04:17:08 -0800

On Thu, Jan 24, 2019 at 1:04 PM Anton Youdkevitch
<[email protected]> wrote:
>
> Richard,
>
> Thanks a lot for the response! I will definitely
> try the constructor approach.
>
> You mentioned non-grouped store. Is the handling
> of such stores actually there and I just missed
> it? It was a big hassle for me as I didn't manage
> to find it and assumed there isn't one.


No, it isn't there.  On a branch I'm working on I'm
just doing sth like

+      /* Find SLP sequences starting from single stores.  */
+      data_reference_p dr;
+      FOR_EACH_VEC_ELT (vinfo->shared->datarefs, i, dr)
+       if (DR_IS_WRITE (dr))
+         {
+           stmt_vec_info stmt_info = vinfo->lookup_dr (dr)->stmt;
+           if (STMT_SLP_TYPE (stmt_info))
+             continue;
+           if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+             continue;
+           vect_analyze_slp_instance (vinfo, stmt_info, max_tree_size);
+         }

note that this alone won't work since you have to actually
build a first set of scalar stmts (like vect_analyze_slp_instance
does for groups just by gathering its elements) from the
reduction.  But here you could (and IIRC I did back in time
when prototyping reduction BB vect support) hack in
some ad-hoc pattern-matching of a series of PLUS.

> I have a question (a lot of them, though, but this
> one is bothering me most). It is actually paragraph
> 4 of my previous letter. In real world code there can
> be a case that the loading of the elements and the use
> of them (for a reduction) are in different BBs (I saw
> this myself). So, not only it complicates the things
> in general but for me it breaks some SLP code assuming
> single BB operation (IIRC, some dataref analysis phase
> assumes single BB). Did anybody consider this before?

Sure I considered this but usually restricting one to a
single BB works quite well and simplifies dependence
analysis a lot.

> OK, I know I start looking kind of stubborn here but
> what about the case:
>
> foo(A[0]+...A[n])
>
> There won't be any store here by definition while a
> reduction will. Or is it something too rarely seen?

You are right - in principle a reduction can be "rooted"
at any point.  But you need to come up with an
algorithm with sensible cost (in time and memory)
to detect the reduction group.  The greedy matching
I talked about above can be applied anywhere, not
just at stores.

> --
>    Thanks,
>    Anton
>
>
> On 24/1/2019 13:47, Richard Biener wrote:
> > On Mon, Jan 21, 2019 at 2:20 PM Anton Youdkevitch
> > <[email protected]> wrote:
> >>
> >> Here is the prototype for doing vectorized reduction
> >> using SLP approach. I would appreciate feedback if this
> >> is a feasible approach and if overall the direction is
> >> right.
> >>
> >> The idea is to vectorize reduction like this
> >>
> >> S = A[0]+A[1]+...A[N];
> >>
> >> into
> >>
> >> Sv = Av[0]+Av[1]+...+Av[N/VL];
> >>
> >>
> >> So that, for instance, the following code:
> >>
> >> typedef double T;
> >> T sum;
> >>
> >> void foo (T*  __restrict__ a)
> >> {
> >>          sum = a[0]+ a[1] + a[2]+ a[3] + a[4]+ a[5] + a[6]+ a[7];
> >> }
> >>
> >>
> >> instead of:
> >>
> >> foo:
> >> .LFB23:
> >>          .cfi_startproc
> >>          movsd   (%rdi), %xmm0
> >>          movsd   16(%rdi), %xmm1
> >>          addsd   8(%rdi), %xmm0
> >>          addsd   24(%rdi), %xmm1
> >>          addsd   %xmm1, %xmm0
> >>          movsd   32(%rdi), %xmm1
> >>          addsd   40(%rdi), %xmm1
> >>          addsd   %xmm1, %xmm0
> >>          movsd   48(%rdi), %xmm1
> >>          addsd   56(%rdi), %xmm1
> >>          addsd   %xmm1, %xmm0
> >>          movsd   %xmm0, sum2(%rip)
> >>          ret
> >>          .cfi_endproc
> >>
> >>
> >> be compiled into:
> >>
> >> foo:
> >> .LFB11:
> >>          .cfi_startproc
> >>          movupd  32(%rdi), %xmm0
> >>          movupd  48(%rdi), %xmm3
> >>          movupd  (%rdi), %xmm1
> >>          movupd  16(%rdi), %xmm2
> >>          addpd   %xmm3, %xmm0
> >>          addpd   %xmm2, %xmm1
> >>          addpd   %xmm1, %xmm0
> >>          haddpd  %xmm0, %xmm0
> >>          movlpd  %xmm0, sum(%rip)
> >>          ret
> >>          .cfi_endproc
> >>
> >>
> >> As this is a very crude prototype there are some things
> >> to consider.
> >>
> >> 1. As the current SLP framework assumes presence of
> >> group stores I cannot use directly it as reduction
> >> does not require group stores (or even stores at all),
> >> so, I'm partially using the existing functionality but
> >> sometimes I have to create a stripped down version
> >> of it for my own needs;
> >>
> >> 2. The current version considers only PLUS reduction
> >> as it is encountered most often and therefore is the
> >> most practical;
> >>
> >> 3. While normally SLP transformation should operate
> >> inside single basic block this requirement greatly
> >> restricts it's practical application as in a code
> >> complex enough there will be vectorizable subexpressions
> >> defined in basic block(s) different from that where the
> >> reduction result resides. However, for the sake of
> >> simplicity only single uses in the same block are
> >> considered now;
> >>
> >> 4. For the same sake the current version does not deal
> >> with partial reductions which would require partial sum
> >> merging and careful removal of the scalars that participate
> >> in the vector part. The latter gets done automatically
> >> by DCE in the case of full reduction vectorization;
> >>
> >> 5. There is no cost model yet for the reasons mentioned
> >> in the paragraphs 3 and 4.
> >
> > First sorry for the late response.
> >
> > No, I don't think your approach of bypassing the "rest"
> > is OK.  I've once started to implement BB reduction
> > support but somehow got distracted IIRC.  Your testcase
> > (and the prototype I worked on) still has a (scalar non-grouped)
> > store which can be keyed on in SLP discovery phase.
> >
> > You should be able to "re-use" (by a lot of refactoring I guess)
> > the reduction finding code (vect_is_slp_reduction) to see
> > wheter such a store is fed by a reduction chain.  Note that
> > if you adjust the testcase to have
> >
> >   sum[0] = a[0] + ... + a[n];
> >   sum[1] = b[0] + ... + b[n];
> >
> > then you'll have a grouped store fed by reductions.  You
> > can also consider
> >
> >   for (i = ...)
> >    {
> >       sum[i] = a[i*4] + a[i*4+1] + a[i*4+2] + a[i*4+3];
> >    }
> >
> > which we should be able to handle.
> >
> > So the whole problem of doing BB reductions boils down
> > to SLP tree discovery, the rest should be more straight-forward
> > (of course code-gen needs to be adapted for the non-loop case
> > as well).
> >
> > It's not the easiest problem you try to tackle btw ;)  May
> > I suggest you become familiar with the code by BB vectorizing
> > vector CONSTRUCTORs instead?
> >
> > typedef int v4si __attribute__((vector_size(16)));
> >
> > v4si foo (int *i, *j)
> > {
> >    return (v4si) { i[0] + j[0], i[1] + j[1], i[2] + j[2], i[3] + j[3] };
> > }
> >
> > it has the same SLP discovery "issue", this time somewhat
> > easier as a CONSTRUCTOR directly plays the role of the
> > "grouped store".
> >
> > Richard.
> >
> >> Thanks in advance.
> >>
> >> --
> >>    Anton
>

Re: SLP-based reduction vectorization

Reply via email to