On Fri, Sep 18, 2015 at 6:07 PM, Kirill Yukhin <kirill.yuk...@gmail.com> wrote:
> Hello,
> On 18 Sep 10:31, Richard Biener wrote:
>> On Thu, 17 Sep 2015, Ilya Enkovich wrote:
>>
>> > 2015-09-16 15:30 GMT+03:00 Richard Biener <rguent...@suse.de>:
>> > > On Mon, 14 Sep 2015, Kirill Yukhin wrote:
>> > >
>> > >> Hello,
>> > >> I'd like to initiate a discussion on vectorization of loops whose
>> > >> boundaries are not aligned to the VF.  The main target for this
>> > >> optimization right now is x86's AVX-512, which features per-element
>> > >> embedded masking for all instructions.  The main goal of this mail
>> > >> is to agree on the overall design of the feature.
>> > >>
>> > >> This approach was presented @ GNU Cauldron 2015 by Ilya Enkovich [1].
>> > >>
>> > >> Here's a sketch of the algorithm:
>> > >>   1. Add a check on basic stmts for masking: the possibility to
>> > >>      introduce an index vector and a corresponding mask
>> > >>   2. While checking whether statements are vectorizable, additionally
>> > >>      check whether stmts need and can be masked, and compute the
>> > >>      masking cost.  The result is stored in `stmt_vinfo`.  We are
>> > >>      going to mask only mem. accesses and reductions, and to modify
>> > >>      the mask for already masked stmts (mask load, mask store and
>> > >>      vect. condition)
>> > >
>> > > I think you also need to mask divisions (for integer divide by zero)
>> > > and want to mask FP ops which may result in NaNs or denormals
>> > > (because those generally slow down execution a lot in my experience).
>> > >
>> > > Why not simply mask all stmts?
>> >
>> > Hi,
>> >
>> > Statement masking may not be free, especially if we need to transform
>> > the mask somehow to do it.  It also may be unsupported on a platform
>> > (e.g. for AVX-512 not all instructions support masking) while it is
>> > still not a problem to mask the loop.  BTW for AVX-512 masking doesn't
>> > boost performance even if we have some special cases like NaNs.
>> > We don't consider exceptions in vector code (and it seems to be the
>> > case now?), otherwise we would need to mask them as well.
>>
>> Well, we do need to honor
>>
>>   if (x != 0.)
>>     y[i] = z[i] / x;
>>
>> in some way.  I think if-conversion currently simply gives up here.
>> So if we have the epilogue and use masked loads, what are the contents
>> of the 'masked' elements (IIRC they are zero or all-ones, right)?
>> If they end up as zero then even simple code like
>>
>>   for (i;;)
>>     a[i] = b[i] / c[i];
>>
>> cannot be transformed in the suggested way with -ftrapping-math,
>> and the remainder iteration might get slow if processing NaN
>> operands is still as slow as it was 10 years ago.
>>
>> IMHO for if-converting possibly trapping stmts (like the above
>> example) we need some masking support anyway (and a way to express
>> the masking in GIMPLE).
> We'll use the if-cvt technique.  If an op is trapping we do not apply
> masking for the loop remainder.  This is subject for further
> development.  Currently we don't try to truly mask existing GIMPLE
> stmts.  All masking is achieved using `vec_cond`, and we're not sure
> that trapping is a really useful feature while vectorization is on.
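To make the `vec_cond`-based scheme concrete, here is a minimal scalar model of how a trapping division can be if-converted without per-stmt masking: substitute a harmless divisor in inactive lanes, divide unconditionally, and blend the result.  This is an illustrative sketch only (the function name and the neutral-divisor choice are assumptions, not the actual GIMPLE if-conversion):

```c
/* Scalar model of vec_cond-style masking for a conditional division.
   Inactive lanes get the neutral divisor 1 so the division never traps,
   and the final select keeps the old value in those lanes.  Hypothetical
   sketch; names are made up for illustration.  */
static void masked_div (int *y, const int *z, const int *x, int n)
{
  for (int i = 0; i < n; i++)
    {
      int active = (x[i] != 0);       /* the per-lane mask */
      int safe_x = active ? x[i] : 1; /* neutralized divisor, never 0 */
      int q = z[i] / safe_x;          /* unconditional, trap-free divide */
      y[i] = active ? q : y[i];       /* blend: masked-off lanes unchanged */
    }
}
```

For FP under -ftrapping-math the same blend-before-divide trick applies, though as noted above the cost model has to decide whether it pays off.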
Ok.  And yes, we'd need to have a way to predicate such stmts directly.

>> > >>   3. Make a decision about masking: take the computed costs and
>> > >>      est. iteration count into consideration
>> > >>   4. Modify prologue/epilogue generation according to the decision
>> > >>      made at analysis time.  Three options are available:
>> > >>        a. Use a scalar remainder
>> > >>        b. Use a masked remainder.  Won't be supported in the first
>> > >>           version
>> > >>        c. Mask the main loop
>> > >>   5. Support vectorized loop masking:
>> > >>        - Create stmts for mask generation
>> > >>        - Support generation of masked vector code (create generic
>> > >>          vector code then patch it w/ masks)
>> > >>        - Mask loads/stores/vconds/reductions only
>> > >>
>> > >> In the first version (targeted for v6) we're not going to support
>> > >> 4.b and loop mask pack/unpack.  No `pack/unpack` means that masking
>> > >> will be supported only for types w/ the same size as the index
>> > >> variable
>> > >
>> > > This means that if ncopies for any stmt is > 1 masking won't be
>> > > supported, right?  (you'd need two or more different masks)
>> >
>> > We don't think it is a very important feature to have in the initial
>> > version.  It can be added later and shouldn't affect the overall
>> > implementation design much.  BTW currently masked loads and stores
>> > don't support masks of other sizes and don't do mask pack/unpack.
>>
>> I think masked loads/stores support this just fine.  Remember the
>> masks are regular vectors generated by cond exprs in the current code.
> Not quite true, mask loads/stores are not supported for different sizes.
> E.g. this example is not vectorized:
>
>   int a[LENGTH], b[LENGTH];
>   long long c[LENGTH];
>
>   int test ()
>   {
>     int i;
>   #pragma omp simd safelen(16)
>     for (i = 0; i < LENGTH; i++)
>       if (a[i] > b[i])
>         c[i] = 1;
>   }

I see.  We do if-convert this though.  For some reason vectorization
fails with

  t.c:10:10: note: not vectorized: relevant stmt not supported:
  _ifc__23 = _5 > _6 ? 18446744073709551615 : 0;

so it doesn't fail on the masked store but on the RHS.

>> > >> [1] - https://gcc.gnu.org/wiki/cauldron2015?action=AttachFile&do=view&target=Vectorization+for+Intel+AVX-512.pdf
>> > >>
>> > >> What do you think?
>> > >
>> > > There was the idea some time ago to use single-iteration vector
>> > > variants for prologues/epilogues by simply overlapping them with
>> > > the vector loop (and either making sure to mask out the overlap
>> > > area or making sure the result stays the same).  This kind-of is
>> > > similar to 4b, and thus IMHO it's better to get 4b implemented
>> > > rather than trying 4c.  So for example
>> > >
>> > >   int a[];
>> > >   for (i = 0; i < 13; ++i)
>> > >     a[i] = i;
>> > >
>> > > would be vectorized (with v4si) as
>> > >
>> > >   for (i = 0; i < 13 / 4; ++i)
>> > >     ((v4si *)a)[i] = { ... };
>> > >   *(v4si *)(&a[9]) = { ... };
>> > >
>> > > where the epilogue store of course would be unaligned.  The masked
>> > > variant can avoid the data pointer adjustment and instead use a
>> > > masked store.
>> > >
>> > > OTOH it might be that the unaligned scheme is as efficient as the
>> > > masked version.  Only the masked version is more trivially correct;
>> > > data dependences can make the above idea not work without masking
>> > > out stores, as for
>> > >
>> > >   for (i = 0; i < 13; ++i)
>> > >     a[i] = a[i+1];
>> > >
>> > > obviously the overlapping iterations in the epilogue would
>> > > compute bogus values.  To avoid this we can merge the result
>> > > with the previously stored values (using properly computed masks)
>> > > before storing it.
>> > >
>> > > Basically both 4b and the above idea need to peel a vector
>> > > iteration and "modify" it.  The same trick can be applied to
>> > > prologue loops of course.
>> > >
>> > > Any chance you can try working on 4b instead?  It also feels
>> > > like it would need less hacking throughout the vectorizer
>> > > (basically post-processing the generated vector loop).
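The overlapped-epilogue scheme quoted above can be written out concretely.  The following is a hedged sketch (the `v4si` typedef via GCC/Clang generic vector extensions and the function name are assumptions for illustration, not generated code); the overlapping recomputation of a[9..11] is safe here precisely because there are no backward dependences:

```c
/* Sketch of the overlapping vector epilogue for
     for (i = 0; i < 13; ++i) a[i] = i;
   The main loop covers a[0..11] with whole vectors; one final unaligned
   vector store over a[9..12] handles the tail, redundantly rewriting
   a[9..11] with the same values.  Illustrative only.  */
typedef int v4si __attribute__ ((vector_size (16)));

static void fill_iota13 (int *a)
{
  int i;
  for (i = 0; i + 4 <= 13; i += 4)            /* main loop: a[0..11] */
    {
      v4si v = { i, i + 1, i + 2, i + 3 };
      __builtin_memcpy (&a[i], &v, sizeof v); /* unaligned-safe store */
    }
  v4si tail = { 9, 10, 11, 12 };              /* peeled vector epilogue */
  __builtin_memcpy (&a[13 - 4], &tail, sizeof tail); /* overlapping store */
}
```

For the a[i] = a[i+1] case this store would have to be replaced by a load/blend/store (or a true masked store) so the overlapped lanes keep their already-stored values.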
>> > >
>> > > If 4b is implemented I don't think 4c is worth doing.
>> >
>> > I agree 4b is a more universal approach and should cover more cases,
>> > but we consider 4c the first step towards 4b.  I think a significant
>> > difference of 4b from the replay you described is that in 4b the
>> > masked remainder may be used to execute short trip count loops, which
>> > means we can execute the masked remainder without actually executing
>> > the main vectorized loop.  This causes various implications and makes
>> > peeling more complex due to multiple remainder entries.
>>
>> Can you elaborate on this?
> We need to transform the CFG for the peeled remainder.  It is entered
> from both the main loop and the align-peeled loop, but has uses from
> the main loop prolog.  Thus we need to copy the prolog and fix the
> peeled remainder accordingly, or do it in some other way.
> It is doable, but we can avoid it for now (4c, v6) and do it later
> (4b, v7).

That's hardly a blocker.  The "CFG transform" should be easy (just don't
share the main loop epilogue with the peeled remainder epilogue).

>> > We see the first goal as a 4c implementation with subsequent re-use
>> > of all its parts later (GCC 7) for 4b.  I don't see how 4c may
>> > require more hacking in the vectorizer.  Basically 4c consists of two
>> > parts:
>> >
>> > 1. Mask applicability analysis for all statements (require/not
>> >    require [mem ref, reduction, exceptions etc.], need/don't need
>> >    [due to perf considerations], maskable/not maskable, masking
>> >    cost).  This is integrated into the current statement analysis and
>> >    is used by both 4b and 4c to make a decision about the remainder.
>> > 2. Loop transformation (masking).  We are not sure yet what is the
>> >    best way to implement it.  We see two options here:
>> >      a) transform the already vectorized loop as a separate
>> >         post-processing step
>> >      b) initially generate masked statements on-the-fly in
>> >         vect_transform_stmt
>>
>> I'd prefer a) as we can't re-use b) for 4b.
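The short-trip-count property of the masked remainder (4b) mentioned above can be modeled in scalar code.  This is a hypothetical sketch (VF and the helper name are invented for illustration; a real masked store simply leaves inactive lanes untouched rather than branching per lane):

```c
/* Scalar model of option 4b: the tail runs as one more "vector"
   iteration under a lane predicate i + l < n.  Because the remainder is
   self-sufficient, a short trip count (n < VF) skips the main loop
   entirely and executes only the masked iteration.  Illustrative only. */
#define VF 4

static void copy_with_masked_tail (int *dst, const int *src, int n)
{
  int i = 0;
  for (; i + VF <= n; i += VF)       /* main vector loop, modeled scalarly */
    for (int l = 0; l < VF; l++)
      dst[i + l] = src[i + l];
  if (i < n)                         /* masked remainder iteration */
    for (int l = 0; l < VF; l++)
      if (i + l < n)                 /* lane predicate: active lanes only */
        dst[i + l] = src[i + l];
}
```

The CFG complication discussed above comes from exactly this structure: the remainder block needs entries both from the main loop exit and from the loop preheader (the n < VF case).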
>>
>> > 2b looks simpler to implement because we have all the info required
>> > for masking in place.  2a seems a more universal approach and can be
>> > re-used for 4b for the peeled iteration.  The problem for 2a is that
>> > we may miss required vec_info for peeled statements.  E.g. if we have
>> > several ncopies of some statement then we have to have a way to
>> > identify which part of the mask corresponds to which statement.
>>
>> I think the vectorizer's IL "carefully" creates vinfos for all vector
>> stmts generated and links them up to the "main" stmt via the
>> STMT_VINFO_RELATED_STMT chain.  There may of course be bugs so that
>> this doesn't happen 100% correctly, but I think it should.
>>
>> Of course it would be nice to have a way better internal representation
>> for all of the vectorizer state...
>>
>> In the end I think 2a is easier to implement (in fact the analysis of
>> whether masking is possible and the masking itself can be done at the
>> same time - after the main loop vectorization; if the analysis fails
>> we create the regular epilogue loop).
> Isn't the scalar remainder already created by that moment?  We want to
> avoid unnecessary copy creation/removal.
>
>> The interesting parts to handle are reductions and the reduction
>> epilogue anyway (not sure if you propose to do that with the
>> first iteration).
> We'd like to support reductions in the first iteration (4c, v6).
>
>> > Do you have an idea how "masking" is better organized to be usable
>> > for both 4b and 4c?
>>
>> Do 2a ...
> Okay.  Thanks, Richard.
> --
> Thanks, K
>> > Thanks,
>> > Ilya
>> >
>> >
>> > >
>> > > Thanks,
>> > > Richard.
>> >
>>
>> --
>> Richard Biener <rguent...@suse.de>
>> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
>> HRB 21284 (AG Nuernberg)