On October 21, 2017 9:50:13 PM GMT+02:00, Denis Bakhvalov <dendib...@gmail.com> wrote:
>Hello Richard,
>Thank you. I achieved vectorization with vf = 16, using
>#pragma GCC optimize ("no-unroll-loops")
>__attribute__ ((__target__ ("sse4.2")))
>and options -march=core-avx2 -mprefer-avx128
>
>But now I have a question: Is it possible in gcc to have vectorization
>with vf < 16?

No, not at the moment.

Richard.

>On 20/10/2017, Richard Biener <richard.guent...@gmail.com> wrote:
>> On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov <dendib...@gmail.com>
>> wrote:
>>> Thank you for the reply!
>>>
>>> Regarding the last part of your message, this is also what clang will do
>>> when you are passing vf of 4 (with the pragma from my first message)
>>> for the loop operating on chars plus using SSE2. It will do meaningful
>>> work only for 4 chars per iteration (a[0], zero, zero, zero, a[1],
>>> zero, zero, zero, etc.).
>>>
>>> Please see example here:
>>> https://godbolt.org/g/3LAqZw
>>>
>>> Let's say that I know all possible trip counts for my inner loop.
>>> None of them exceeds 32. In the example above vf for this loop is 32.
>>> There is a runtime check, such that if the trip count does not exceed 32
>>> it will fall back to the scalar version.
>>>
>>> As long as the trip count is always lower than 32, it always chooses
>>> the scalar version at runtime.
>>> But theoretically, using SSE2 for trip count = 8 it can use half of
>>> an xmm register (8 chars) to do meaningful work.
>>>
>>> Is gcc vectorizer capable of doing this?
>>> If yes, can I somehow achieve this in gcc by tweaking the code or
>>> adding some pragma?
>>
>> The closest is to use -mprefer-avx128 so you get SSE rather than AVX
>> vector sizes. Eventually this option is among the valid target attributes
>> for #pragma GCC target
>>
>>> On 19/10/2017, Jakub Jelinek <ja...@redhat.com> wrote:
>>>> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>>>>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov <dendib...@gmail.com>
>>>>> wrote:
>>>>> > Hello!
>>>>> >
>>>>> > I have a hot inner loop which was vectorized by gcc, but I also want
>>>>> > compiler to unroll this loop by some factor.
>>>>> > It can be controlled in clang with this pragma:
>>>>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>>>>> > Please see example here:
>>>>> > https://godbolt.org/g/UJoUJn
>>>>> >
>>>>> > So I want to tell gcc something like this:
>>>>> > "I want you to vectorize the loop. After that I want you to unroll
>>>>> > this vectorized loop by some defined factor."
>>>>> >
>>>>> > I was playing with #pragma omp simd with the safelen clause, and
>>>>> > #pragma GCC optimize("unroll-loops") with no success. Compiler option
>>>>> > -fmax-unroll-times is not suitable for me, because it will affect
>>>>> > other parts of the code.
>>>>> >
>>>>> > Is it possible to achieve this somehow?
>>>>>
>>>>> No.
>>>>
>>>> #pragma omp simd has simdlen clause which is a hint on the preferable
>>>> vectorization factor, but the vectorizer doesn't use it so far;
>>>> probably it wouldn't be that hard to at least use that as the starting
>>>> factor if the target has multiple ones if it is one of those.
>>>> The vectorizer has some support for using wider vectorization factors
>>>> if there are mixed width types within the same loop, so perhaps
>>>> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
>>>> that hard.
>>>> What we don't have right now is support for using smaller
>>>> vectorization factors, which might be sometimes beneficial for -O2
>>>> vectorization of mixed width type loops. We always use the vf derived
>>>> from the smallest width type, say when using SSE2 and there is a char
>>>> type, we try to use vf of 16 and if there is also int type, do
>>>> operations on those in 4x as many instructions, while there is also an
>>>> option to use vf of 4 and for operations on char just do something
>>>> meaningful only in 1/4 of vector elements. The various x86 vector ISAs
>>>> have instructions to widen or narrow for conversions.
>>>>
>>>> In any case, no is the right answer right now, we don't have that
>>>> implemented.
>>>>
>>>> Jakub
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Denis.
>>
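
For reference, the setup discussed in this thread can be sketched roughly as below. This is only an illustration under stated assumptions: the function names add_chars and widen_add, their signatures, and the preprocessor guards are made up for the example and do not come from any of the messages. The first function applies the pragma and target attribute that Denis reported using; the second is a mixed-width loop (char and int) of the kind Jakub describes. The code does not prove which vectorization factor gcc picks; it just gives something concrete to compile and inspect.

```c
#include <assert.h>

/* Hypothetical example functions; names and signatures are assumptions. */

/* Per-function settings as reported in the thread: disable unrolling and
   restrict the target to SSE4.2 for this one function only.  The guards
   keep the file compilable with other compilers and on non-x86 targets. */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize ("no-unroll-loops")
#endif

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
__attribute__ ((__target__ ("sse4.2")))
#endif
void add_chars(unsigned char *restrict dst,
               const unsigned char *restrict a,
               const unsigned char *restrict b, int n)
{
    assert(n >= 0);
    /* Only char-sized elements: a 128-bit vector holds 16 of them. */
    for (int i = 0; i < n; i++)
        dst[i] = (unsigned char)(a[i] + b[i]);
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif

/* Mixed-width loop: char input, int output.  A 128-bit vector holds
   16 chars but only 4 ints.  According to Jakub's message, gcc derives
   the vf from the narrowest type (16 here) and performs the int
   operations with 4x as many instructions. */
void widen_add(int *restrict out, const unsigned char *restrict in, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] += in[i];
}
```

Compiling this with gcc -O3 -march=core-avx2 -mprefer-avx128 -fopt-info-vec is one way to see which loops were vectorized; the exact report text depends on the gcc version.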