On Fri, Sep 18, 2015 at 6:07 PM, Kirill Yukhin <kirill.yuk...@gmail.com> wrote:
> Hello,
> On 18 Sep 10:31, Richard Biener wrote:
>> On Thu, 17 Sep 2015, Ilya Enkovich wrote:
>>
>> > 2015-09-16 15:30 GMT+03:00 Richard Biener <rguent...@suse.de>:
>> > > On Mon, 14 Sep 2015, Kirill Yukhin wrote:
>> > >
>> > >> Hello,
>> > >> I'd like to initiate a discussion on vectorization of loops whose
>> > >> boundaries are not aligned to the VF.  The main target for this
>> > >> optimization right now is x86's AVX-512, which features per-element
>> > >> embedded masking for all instructions.  The main goal of this mail
>> > >> is to agree on the overall design of the feature.
>> > >>
>> > >> This approach was presented @ GNU Cauldron 2015 by Ilya Enkovich [1].
>> > >>
>> > >> Here's a sketch of the algorithm:
>> > >>   1. Add a check on basic stmts for masking: the possibility to
>> > >>      introduce an index vector and a corresponding mask
>> > >>   2. While checking whether statements are vectorizable, additionally
>> > >>      check whether stmts need and can be masked, and compute the
>> > >>      masking cost.  The result is stored in `stmt_vinfo`.  We are
>> > >>      going to mask only mem. accesses and reductions, and to modify
>> > >>      the mask for already masked stmts (mask load, mask store and
>> > >>      vect. condition)
>> > >
>> > > I think you also need to mask divisions (for integer divide by zero)
>> > > and want to mask FP ops which may result in NaNs or denormals
>> > > (because those generally slow down execution a lot in my experience).
>> > >
>> > > Why not simply mask all stmts?
>> >
>> > Hi,
>> >
>> > Statement masking may not be free, especially if we need to transform
>> > the mask somehow to do it.  It also may be unsupported on a platform
>> > (e.g. for AVX-512 not all instructions support masking) while it is
>> > still not a problem to mask the loop.  BTW for AVX-512 masking doesn't
>> > boost performance even if we have some special cases like NaNs.
>> > We don't consider exceptions in vector code (and it seems to be the
>> > case now?), otherwise we would need to mask them as well.
>>
>> Well, we do need to honor
>>
>>   if (x != 0.)
>>     y[i] = z[i] / x;
>>
>> in some way.  I think if-conversion currently simply gives up here.
>> So if we have the epilogue and use masked loads, what are the contents
>> of the 'masked' elements (IIRC they are zero or all-ones, right)?
>> If they end up as zero then even simple code like
>>
>>   for (i;;)
>>     a[i] = b[i] / c[i];
>>
>> cannot be transformed in the suggested way with -ftrapping-math,
>> and the remainder iteration might get slow if processing NaN
>> operands is still as slow as it was 10 years ago.
>>
>> IMHO for if-converting possibly trapping stmts (like the above
>> example) we need some masking support anyway (and a way to express
>> the masking in GIMPLE).
> We'll use the if-cvt technique.  If an op is trapping we do not apply
> masking for the loop remainder.  This is subject for further
> development.  Currently we don't try to truly mask existing GIMPLE
> stmts.  All masking is achieved using `vec_cond`, and we're not sure
> that trapping is a really useful feature while vectorization is on.
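To make the `vec_cond`-based scheme concrete, here is a minimal scalar model of how a trapping division can be if-converted without per-stmt masking: substitute a harmless divisor in inactive lanes, divide unconditionally, and blend the result.  This is an illustrative sketch only (the function name and the neutral-divisor choice are assumptions, not the actual GIMPLE if-conversion):

```c
/* Scalar model of vec_cond-style masking for a conditional division.
   Inactive lanes get the neutral divisor 1 so the division never traps,
   and the final select keeps the old value in those lanes.  Hypothetical
   sketch; names are made up for illustration.  */
static void masked_div (int *y, const int *z, const int *x, int n)
{
  for (int i = 0; i < n; i++)
    {
      int active = (x[i] != 0);       /* the per-lane mask */
      int safe_x = active ? x[i] : 1; /* neutralized divisor, never 0 */
      int q = z[i] / safe_x;          /* unconditional, trap-free divide */
      y[i] = active ? q : y[i];       /* blend: masked-off lanes unchanged */
    }
}
```

For FP under -ftrapping-math the same blend-before-divide trick applies, though as noted above the cost model has to decide whether it pays off.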
Ok.  And yes, we'd need to have a way to predicate such stmts directly.

>> > >>   3. Make a decision about masking: take the computed costs and
>> > >>      est. iteration count into consideration
>> > >>   4. Modify prologue/epilogue generation according to the decision
>> > >>      made at analysis time.  Three options are available:
>> > >>        a. Use a scalar remainder
>> > >>        b. Use a masked remainder.  Won't be supported in the first
>> > >>           version
>> > >>        c. Mask the main loop
>> > >>   5. Support vectorized loop masking:
>> > >>        - Create stmts for mask generation
>> > >>        - Support generation of masked vector code (create generic
>> > >>          vector code then patch it w/ masks)
>> > >>        - Mask loads/stores/vconds/reductions only
>> > >>
>> > >> In the first version (targeted for v6) we're not going to support
>> > >> 4.b and loop mask pack/unpack.  No `pack/unpack` means that masking
>> > >> will be supported only for types w/ the same size as the index
>> > >> variable
>> > >
>> > > This means that if ncopies for any stmt is > 1 masking won't be
>> > > supported, right?  (you'd need two or more different masks)
>> >
>> > We don't think it is a very important feature to have in the initial
>> > version.  It can be added later and shouldn't affect the overall
>> > implementation design much.  BTW currently masked loads and stores
>> > don't support masks of other sizes and don't do mask pack/unpack.
>>
>> I think masked loads/stores support this just fine.  Remember the
>> masks are regular vectors generated by cond exprs in the current code.
> Not quite true, mask loads/stores are not supported for different sizes.
> E.g. this example is not vectorized:
>
>   int a[LENGTH], b[LENGTH];
>   long long c[LENGTH];
>
>   int test ()
>   {
>     int i;
>   #pragma omp simd safelen(16)
>     for (i = 0; i < LENGTH; i++)
>       if (a[i] > b[i])
>         c[i] = 1;
>   }

I see.  We do if-convert this though.  For some reason vectorization
fails with

  t.c:10:10: note: not vectorized: relevant stmt not supported:
  _ifc__23 = _5 > _6 ? 18446744073709551615 : 0;

so it doesn't fail on the masked store but on the RHS.

>> > >> [1] - https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Vectorization+for+Intel+AVX-512.pdf
>> > >>
>> > >> What do you think?
>> > >
>> > > There was the idea some time ago to use single-iteration vector
>> > > variants for prologues/epilogues by simply overlapping them with
>> > > the vector loop (and either making sure to mask out the overlap
>> > > area or making sure the result stays the same).  This kind-of is
>> > > similar to 4b, and thus IMHO it's better to get 4b implemented
>> > > rather than trying 4c.  So for example
>> > >
>> > >   int a[];
>> > >   for (i = 0; i < 13; ++i)
>> > >     a[i] = i;
>> > >
>> > > would be vectorized (with v4si) as
>> > >
>> > >   for (i = 0; i < 13 / 4; ++i)
>> > >     ((v4si *)a)[i] = { ... };
>> > >   *(v4si *)(&a[9]) = { ... };
>> > >
>> > > where the epilogue store of course would be unaligned.  The masked
>> > > variant can avoid the data pointer adjustment and instead use a
>> > > masked store.
>> > >
>> > > OTOH it might be that the unaligned scheme is as efficient as the
>> > > masked version.  Only the masked version is more trivially correct;
>> > > data dependences can make the above idea not work without masking
>> > > out stores, as for
>> > >
>> > >   for (i = 0; i < 13; ++i)
>> > >     a[i] = a[i+1];
>> > >
>> > > obviously the overlapping iterations in the epilogue would
>> > > compute bogus values.  To avoid this we can merge the result
>> > > with the previously stored values (using properly computed masks)
>> > > before storing it.
>> > >
>> > > Basically both 4b and the above idea need to peel a vector
>> > > iteration and "modify" it.  The same trick can be applied to
>> > > prologue loops of course.
>> > >
>> > > Any chance you can try working on 4b instead?  It also feels
>> > > like it would need less hacking throughout the vectorizer
>> > > (basically post-processing the generated vector loop).
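The overlapped-epilogue scheme quoted above can be written out concretely.  The following is a hedged sketch (the `v4si` typedef via GCC/Clang generic vector extensions and the function name are assumptions for illustration, not generated code); the overlapping recomputation of a[9..11] is safe here precisely because there are no backward dependences:

```c
/* Sketch of the overlapping vector epilogue for
     for (i = 0; i < 13; ++i) a[i] = i;
   The main loop covers a[0..11] with whole vectors; one final unaligned
   vector store over a[9..12] handles the tail, redundantly rewriting
   a[9..11] with the same values.  Illustrative only.  */
typedef int v4si __attribute__ ((vector_size (16)));

static void fill_iota13 (int *a)
{
  int i;
  for (i = 0; i + 4 <= 13; i += 4)            /* main loop: a[0..11] */
    {
      v4si v = { i, i + 1, i + 2, i + 3 };
      __builtin_memcpy (&a[i], &v, sizeof v); /* unaligned-safe store */
    }
  v4si tail = { 9, 10, 11, 12 };              /* peeled vector epilogue */
  __builtin_memcpy (&a[13 - 4], &tail, sizeof tail); /* overlapping store */
}
```

For the a[i] = a[i+1] case this store would have to be replaced by a load/blend/store (or a true masked store) so the overlapped lanes keep their already-stored values.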
>> > >
>> > > If 4b is implemented I don't think 4c is worth doing.
>> >
>> > I agree 4b is a more universal approach and should cover more cases,
>> > but we consider 4c the first step towards 4b.  I think a significant
>> > difference of 4b from the replay you described is that in 4b the
>> > masked remainder may be used to execute short trip count loops, which
>> > means we can execute the masked remainder without actually executing
>> > the main vectorized loop.  This causes various implications and makes
>> > peeling more complex due to multiple remainder entries.
>>
>> Can you elaborate on this?
> We need to transform the CFG for the peeled remainder.  It is entered
> from both the main loop and the align-peeled loop, but has uses from
> the main loop prolog.  Thus we need to copy the prolog and fix the
> peeled remainder accordingly, or do it in some other way.
> It is doable, but we can avoid it for now (4c, v6) and do it later
> (4b, v7).

That's hardly a blocker.  The "CFG transform" should be easy (just don't
share the main loop epilogue with the peeled remainder epilogue).

>> > We see the first goal as a 4c implementation with subsequent re-use
>> > of all its parts later (GCC 7) for 4b.  I don't see how 4c may
>> > require more hacking in the vectorizer.  Basically 4c consists of two
>> > parts:
>> >
>> > 1. Mask applicability analysis for all statements (require/not
>> >    require [mem ref, reduction, exceptions etc.], need/don't need
>> >    [due to perf considerations], maskable/not maskable, masking
>> >    cost).  This is integrated into the current statement analysis and
>> >    is used by both 4b and 4c to make a decision about the remainder.
>> > 2. Loop transformation (masking).  We are not sure yet what is the
>> >    best way to implement it.  We see two options here:
>> >      a) transform the already vectorized loop as a separate
>> >         post-processing step
>> >      b) initially generate masked statements on-the-fly in
>> >         vect_transform_stmt
>>
>> I'd prefer a) as we can't re-use b) for 4b.
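The short-trip-count property of the masked remainder (4b) mentioned above can be modeled in scalar code.  This is a hypothetical sketch (VF and the helper name are invented for illustration; a real masked store simply leaves inactive lanes untouched rather than branching per lane):

```c
/* Scalar model of option 4b: the tail runs as one more "vector"
   iteration under a lane predicate i + l < n.  Because the remainder is
   self-sufficient, a short trip count (n < VF) skips the main loop
   entirely and executes only the masked iteration.  Illustrative only. */
#define VF 4

static void copy_with_masked_tail (int *dst, const int *src, int n)
{
  int i = 0;
  for (; i + VF <= n; i += VF)       /* main vector loop, modeled scalarly */
    for (int l = 0; l < VF; l++)
      dst[i + l] = src[i + l];
  if (i < n)                         /* masked remainder iteration */
    for (int l = 0; l < VF; l++)
      if (i + l < n)                 /* lane predicate: active lanes only */
        dst[i + l] = src[i + l];
}
```

The CFG complication discussed above comes from exactly this structure: the remainder block needs entries both from the main loop exit and from the loop preheader (the n < VF case).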
>>
>> > 2b looks simpler to implement because we have all the info required
>> > for masking in place.  2a seems a more universal approach and can be
>> > re-used for 4b for the peeled iteration.  The problem for 2a is that
>> > we may miss required vec_info for peeled statements.  E.g. if we have
>> > several ncopies of some statement then we have to have a way to
>> > identify which part of the mask corresponds to which statement.
>>
>> I think the vectorizer's IL "carefully" creates vinfos for all vector
>> stmts generated and links them up to the "main" stmt via the
>> STMT_VINFO_RELATED_STMT chain.  There may of course be bugs so that
>> this doesn't happen 100% correctly, but I think it should.
>>
>> Of course it would be nice to have a way better internal representation
>> for all of the vectorizer state...
>>
>> In the end I think 2a is easier to implement (in fact the analysis of
>> whether masking is possible and the masking itself can be done at the
>> same time - after the main loop vectorization; if the analysis fails
>> we create the regular epilogue loop).
> Isn't the scalar remainder already created by that moment?  We want to
> avoid unnecessary copy creation/removal.
>
>> The interesting parts to handle are reductions and the reduction
>> epilogue anyway (not sure if you propose to do that with the
>> first iteration).
> We'd like to support reductions in the first iteration (4c, v6).
>
>> > Do you have an idea how "masking" is better organized to be usable
>> > for both 4b and 4c?
>>
>> Do 2a ...
> Okay.  Thanks, Richard.
> --
> Thanks, K
>> > Thanks,
>> > Ilya
>> >
>> >
>> > >
>> > > Thanks,
>> > > Richard.
>> >
>>
>> --
>> Richard Biener <rguent...@suse.de>
>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
>> HRB 21284 (AG Nuernberg)