Hello,
On 18 Sep 10:31, Richard Biener wrote:
> On Thu, 17 Sep 2015, Ilya Enkovich wrote:
> 
> > 2015-09-16 15:30 GMT+03:00 Richard Biener <rguent...@suse.de>:
> > > On Mon, 14 Sep 2015, Kirill Yukhin wrote:
> > >
> > >> Hello,
> > >> I'd like to initiate a discussion on the vectorization of loops whose
> > >> boundaries are not aligned to the VF. The main target for this
> > >> optimization right now is x86's AVX-512, which features per-element
> > >> embedded masking for all instructions. The main goal of this mail is
> > >> to agree on the overall design of the feature.
> > >>
> > >> This approach was presented @ GNU Cauldron 2015 by Ilya Enkovich [1].
> > >>
> > >> Here's a sketch of the algorithm:
> > >>   1. Add a check on basic stmts for masking: the possibility to
> > >>      introduce an index vector and a corresponding mask
> > >>   2. When checking whether statements are vectorizable, additionally
> > >>      check whether stmts need and can be masked, and compute the
> > >>      masking cost.  The result is stored in `stmt_vinfo`.
> > >>      We are going to mask only mem. accesses and reductions, and
> > >>      modify the mask for already masked stmts (mask load, mask store
> > >>      and vect. condition)
> > >
> > > I think you also need to mask divisions (for integer divide by zero) and
> > > want to mask FP ops which may result in NaNs or denormals (because those
> > > generally slow down execution a lot in my experience).
> > >
> > > Why not simply mask all stmts?
> > 
> > Hi,
> > 
> > Statement masking may not be free, especially if we need to transform
> > the mask somehow to do it. It also may be unsupported on a platform
> > (e.g. for AVX-512 not all instructions support masking) while it is
> > still not a problem to mask the loop. BTW for AVX-512 masking doesn't
> > boost performance even if we have some special cases like NaNs. We
> > don't consider exceptions in vector code (and that seems to be the
> > case now?), otherwise we would need to mask them also.
> 
> Well, we do need to honor
> 
>   if (x != 0.)
>    y[i] = z[i] / x;
> 
> in some way.  I think if-conversion currently simply gives up here.
> So if we have the epilogue using masked loads, what are the contents
> of the 'masked' elements (IIRC they are zero or all-ones, right)?
> If they end up as zero then even simple code like
> 
>   for (i;;)
>    a[i] = b[i] / c[i];
> 
> cannot be transformed in the suggested way with -ftrapping-math
> and the remainder iteration might get slow if processing NaN
> operands is still as slow as it was 10 years ago.
> 
> IMHO for if-converting possibly trapping stmts (like the above
> example) we need some masking support anyway (and a way to express
> the masking in GIMPLE).
We'll use the if-conversion technique: if an op is trapping, we do not
apply masking to the loop remainder. This is a subject for further
development. Currently we don't try to truly mask existing GIMPLE stmts;
all masking is achieved using `vec_cond`, and we're not sure that
trapping is a really useful feature while vectorization is on.

> > >>   3. Make a decision about masking: take the computed costs and
> > >>      est. iteration count into consideration
> > >>   4. Modify prologue/epilogue generation according to the decision
> > >>      made at analysis.  Three options are available:
> > >>     a. Use a scalar remainder
> > >>     b. Use a masked remainder.  Won't be supported in the first version
> > >>     c. Mask the main loop
> > >>   5. Support vectorized loop masking:
> > >>     - Create stmts for mask generation
> > >>     - Support generation of masked vector code (create generic
> > >>       vector code, then patch it w/ masks)
> > >>       - Mask loads/stores/vconds/reductions only
> > >>
> > >>  In the first version (targeting v6) we're not going to support 4.b
> > >> and loop mask pack/unpack. No `pack/unpack` means that masking will
> > >> be supported only for types w/ the same size as the index variable
> > >
> > > This means that if ncopies for any stmt is > 1 masking won't be supported,
> > > right?  (you'd need two or more different masks)
> > 
> > We don't think it is a very important feature to have in initial
> > version. It can be added later and shouldn't affect overall
> > implementation design much. BTW currently masked loads and stores
> > don't support masks of other sizes and don't do masks pack/unpack.
> 
> I think masked loads/stores support this just fine.  Remember the
> masks are regular vectors generated by cond exprs in the current code.
Not quite true: masked loads/stores are not supported when the mask and
data elements have different sizes. E.g. this example is not vectorized:
  int a[LENGTH], b[LENGTH];
  long long c[LENGTH];

  void test ()
  {
    int i;
    #pragma omp simd safelen(16)
    for (i = 0; i < LENGTH; i++)
      if (a[i] > b[i])
        c[i] = 1;
  }

> > >> [1] - 
> > >> https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Vectorization+for+Intel+AVX-512.pdf
> > >>
> > >> What do you think?
> > >
> > > There was the idea some time ago to use single-iteration vector
> > > variants for prologues/epilogues by simply overlapping them with
> > > the vector loop (and either making sure to mask out the overlap
> > > area or make sure the result stays the same).  This kind-of is
> > > similar to 4b and thus IMHO it's better to get 4b implemented
> > > rather than trying 4c.  So for example
> > >
> > >  int a[];
> > >  for (i=0; i < 13; ++i)
> > >    a[i] = i;
> > >
> > > would be vectorized (with v4si) as
> > >
> > >  for (i=0; i < 13 / 4; ++i)
> > >    ((v4si *)a)[i] = { ... };
> > >  *(v4si *)(&a[9]) = { ... };
> > >
> > > where the epilogue store of course would be unaligned.  The masked
> > > variant can avoid the data pointer adjustment and instead use a masked
> > > store.
> > >
> > > OTOH it might be that the unaligned scheme is as efficient as the
> > > masked version.  Only the masked version is more trivially correct,
> > > data dependences can make the above idea not work without masking
> > > out stores like for
> > >
> > >  for (i=0; i < 13; ++i)
> > >    a[i] = a[i+1];
> > >
> > > obviously the overlapping iterations in the epilogue would
> > > compute bogus values.  To avoid this we can merge the result
> > > with the previously stored values (using properly computed masks)
> > > before storing it.
> > >
> > > Basically both 4b and the above idea need to peel a vector
> > > iteration and "modify" it.  The same trick can be applied to
> > > prologue loops of course.
> > >
> > > Any chance you can try working on 4b instead?  It also feels
> > > like it would need less hacks throughout the vectorizer
> > > (basically post-processing the generated vector loop).
> > >
> > > If 4b is implemented I don't think 4c is worth doing.
> > 
> > I agree 4b is a more universal approach and should cover more cases,
> > but we consider 4c the first step towards 4b. I think the significant
> > difference of 4b from the replay you described is that in 4b the
> > masked remainder may be used to execute short-trip-count loops, which
> > means we can execute the masked remainder without actually executing
> > the main vectorized loop. That has various implications and makes
> > peeling more complex due to multiple remainder entries.
> 
> Can you elaborate on this?
We need to transform the CFG for the peeled remainder. It is entered
from both the main loop and the align-peeled loop, but has uses of
values from the main loop's prologue. Thus we need to copy the prologue
and fix up the peeled remainder accordingly, or do it in some other way.
It is doable, but we can avoid it for now (4c, v6) and do it later (4b, v7).

> > We see the first goal as a 4c implementation, with re-use of all its
> > parts later (GCC 7) for 4b. I don't see how 4c may require more
> > hacking in the vectorizer. Basically 4c consists of two parts:
> > 
> > 1. Masks applicability analysis for all statements (require/not
> > require [mem ref, reduction, exceptions etc.], need/don't need [due to
> > perf considerations], maskable/not maskable, masking cost). This is
> > integrated into current statement analysis and is used by both 4b and
> > 4c to make a decision about a remainder.
> > 2. Loop transformation (masking). We are not sure yet what is the best
> > way to implement it. We see two options here:
> >   a) transform already vectorized loop as a separate post-processing
> >   b) initially generate masked statements on-the-fly in vect_transform_stmt
> 
> I'd prefer a) as we can't re-use b) for 4b.
> 
> > 2b looks simpler to implement because we have all the info required
> > for masking in place. 2a seems a more universal approach and can be
> > re-used for 4b for the peeled iteration. The problem with 2a is that
> > we may miss the required vec_info for peeled statements. E.g. if we
> > have several ncopies of some statement then we need a way to identify
> > which part of the mask corresponds to which statement.
> 
> I think the vectorizers IL "carefully" creates vinfos for all vector
> stmts generated and links them up to the "main" stmt via the
> STMT_VINFO_RELATED_STMT chain.  There may of course be bugs so that
> this doesn't happen 100% correct, but I think it should.
> 
> Of course it would be nice to have a way better internal representation
> for all of the vectorizer state...
> 
> In the end I think 2a is easier to implement (in fact analysis whether
> masking is possible and masking itself can be done at the same time - 
> after the main loop vectorization, if analysis fails we create the
> regular epilogue loop).
Isn't the scalar remainder already created by that moment? We want to
avoid unnecessary creation/removal of copies.

> The interesting parts to handle are reductions and the reduction
> epilogue anyway (not sure if you propose to do that with the
> first iteration).
We'd like to support reductions in the first iteration (4c, v6).

> > Do you have an idea how "masking" is better be organized to be usable
> > for both 4b and 4c?
> 
> Do 2a ...
Okay.

--
Thanks, K
> > Thanks,
> > Ilya
> > 
> > 
> > >
> > > Thanks,
> > > Richard.
> > 
> > 
> 
> -- 
> Richard Biener <rguent...@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nuernberg)
