On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
> Richard,
>
> It looks like a misunderstanding - we assume that for GCCv6 the simple
> scheme of remainder will be used through introducing a new IV:
> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>
> Is that true, or did we miss something?
<quote>
> > Do you have an idea how "masking" would best be organized to be usable
> > for both 4b and 4c?
>
> Do 2a ...
Okay.
</quote>

Richard.

> Now we are testing vectorization of loops with small non-constant trip count.
> Yuri.
>
> 2015-11-03 14:47 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>> On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>>> Hi All,
>>>
>>> Here is a preliminary patch to combine a vectorized loop with its scalar
>>> remainder, a draft of which was proposed by Kirill Yukhin a month ago:
>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>> It was tested with the '-mavx2' option to run on a Haswell processor.
>>> The main goal is to improve the performance of vectorized loops for
>>> AVX512.
>>> Note that only loads/stores and simple reductions with binary operations are
>>> converted to masked form, e.g. load --> masked load, and a reduction like
>>> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed
>>> by creating a new vector induction variable initialized with consecutive
>>> values 0 .. VF-1, a new constant vector upper bound that contains the
>>> number of iterations, and the result of their comparison, which is used
>>> as the mask vector.
>>> This implementation has several restrictions:
>>>
>>> 1. Multiple types are not supported.
>>> 2. SLP is not supported.
>>> 3. Gathers/scatters are also not supported.
>>> 4. Vectorization of loops with a low trip count is not implemented yet,
>>> since it requires additional design and tuning.
>>>
>>> We are planning to eliminate all these restrictions in GCCv7.
>>>
>>> This patch will be extended to include a cost model to reject unprofitable
>>> transformations, e.g. the new vector body cost will be evaluated through a
>>> new target hook which estimates the cost of masking different vector
>>> statements.
>>> A new threshold parameter will be introduced which determines the
>>> permissible cost increase; it will be tuned on an AVX512 machine.
>>> This patch is not in sync with Ilya Enkovich's changes for AVX512 masked
>>> load/store support, since only part of them is in the trunk compiler.
>>>
>>> Any comments will be appreciated.
>>
>> As stated in the previous discussion I don't think the extra mask IV
>> is a good idea, and we should instead have a masked final iteration for
>> the epilogue (yes, that's not really "combined" then). This is because in
>> the end we'd want not only AVX512 to benefit from this work but also other
>> ISAs that can do unaligned or masked operations (we can overlap the epilogue
>> work with the vectorized work or use the masked loads/stores available with
>> AVX). Note that the same applies to the alignment prologue if present; I
>> can't see how you can handle that with the in-loop approach.
>>
>> Richard.
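
For readers following the thread, the two schemes under discussion can be sketched in plain C for a simple sum reduction. This is only an illustration under stated assumptions, not code from the patch or from GCC: the vectorization factor VF = 8 and the function names (sum_scalar, reduce_lanes, sum_masked_loop, sum_masked_epilogue) are hypothetical, and an inner loop over lanes stands in for real vector instructions. The first function models the in-loop mask IV from the patch description (mask = IV < upper bound, applied in every iteration); the second models the alternative suggested in the reply, an unmasked vector body followed by one masked final iteration.

```c
#include <assert.h>
#include <stddef.h>

#define VF 8 /* assumed vectorization factor; not taken from the patch */

/* Scalar reference: r += a[i] for every i in [0, n). */
static int sum_scalar(const int *a, size_t n)
{
    int r = 0;
    for (size_t i = 0; i < n; i++)
        r += a[i];
    return r;
}

/* Final horizontal reduction: add the VF partial sums together. */
static int reduce_lanes(const int acc[VF])
{
    int r = 0;
    for (size_t lane = 0; lane < VF; lane++)
        r += acc[lane];
    return r;
}

/* In-loop masking: every iteration, including the full ones, is masked. */
static int sum_masked_loop(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    int acc[VF] = {0};
    for (size_t i = 0; i < n; i += VF) {
        for (size_t lane = 0; lane < VF; lane++) {
            size_t iv = i + lane;          /* one lane of the vector IV */
            int m = iv < n;                /* mask = (IV < upper bound) */
            int f = m ? a[iv] : 0;         /* masked load: inactive lanes read no memory */
            int t = f + acc[lane];         /* t = f <op> r2 */
            acc[lane] = m ? t : acc[lane]; /* r1 = m ? t : r2 */
        }
    }
    return reduce_lanes(acc);
}

/* Masked epilogue: unmasked vector body, then one masked final iteration. */
static int sum_masked_epilogue(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    int acc[VF] = {0};
    size_t i = 0;
    for (; i + VF <= n; i += VF)           /* full vectors, no mask needed */
        for (size_t lane = 0; lane < VF; lane++)
            acc[lane] += a[i + lane];
    if (i < n) {                           /* at most one partial vector remains */
        for (size_t lane = 0; lane < VF; lane++) {
            int m = i + lane < n;          /* mask only for the tail */
            int f = m ? a[i + lane] : 0;
            acc[lane] = m ? f + acc[lane] : acc[lane];
        }
    }
    return reduce_lanes(acc);
}
```

Both variants return the same result as sum_scalar for any n. They differ only in where the mask is computed and applied: in every iteration of the single loop, or once after an unmasked main loop.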