Hello Richard,

Thank you. I achieved vectorization with vf = 16, using
#pragma GCC optimize ("no-unroll-loops")
__attribute__ ((__target__ ("sse4.2")))
and options -march=core-avx2 -mprefer-avx128

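For reference, a minimal sketch of how the settings above fit together in one file. The kernel add_chars and the macro KERNEL_TARGET are hypothetical names used only for illustration, not the loop from this thread, and the sketch does not try to show which of the settings is responsible for the vf of 16.

```c
#include <stddef.h>

/* Disable loop unrolling for the functions defined after this pragma. */
#pragma GCC optimize ("no-unroll-loops")

/* KERNEL_TARGET is a hypothetical helper macro: it applies the target
   attribute from the message on x86 and expands to nothing elsewhere. */
#if defined(__x86_64__) || defined(__i386__)
#define KERNEL_TARGET __attribute__((__target__("sse4.2")))
#else
#define KERNEL_TARGET
#endif

/* add_chars is a hypothetical char kernel. With 128-bit vectors a loop
   over 1-byte elements can hold 16 of them per vector register. */
KERNEL_TARGET
void add_chars(unsigned char *restrict dst,
               const unsigned char *restrict a,
               const unsigned char *restrict b,
               size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)(a[i] + b[i]);
}
```

Compiling with something like gcc -O3 -march=core-avx2 -mprefer-avx128 -fopt-info-vec -c file.c should report whether the loop was vectorized. The -O3 level and the file name are assumptions, and newer GCC releases spell the width option -mprefer-vector-width=128.
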
But now I have a question: is it possible in gcc to have vectorization
with vf < 16?

On 20/10/2017, Richard Biener <richard.guent...@gmail.com> wrote:
> On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov <dendib...@gmail.com>
> wrote:
>> Thank you for the reply!
>>
>> Regarding the last part of your message, this is also what clang will do
>> when you are passing vf of 4 (with the pragma from my first message)
>> for the loop operating on chars plus using SSE2. It will do meaningful
>> work only for 4 chars per iteration (a[0], zero, zero, zero, a[1],
>> zero, zero, zero, etc.).
>>
>> Please see example here:
>> https://godbolt.org/g/3LAqZw
>>
>> Let's say that I know all possible trip counts for my inner loop. They
>> all do not exceed 32. In the example above vf for this loop is 32.
>> There is a runtime check, such that if trip count does not exceed 32 it
>> will fall back to scalar version.
>>
>> As long as trip count is always lower than 32 - it always chooses
>> scalar version at runtime.
>> But theoretically, using SSE2 for trip count = 8 it can use half of
>> xmm register (8 chars) to do meaningful work.
>>
>> Is gcc vectorizer capable of doing this?
>> If yes, can I somehow achieve this in gcc by tweaking the code or
>> adding some pragma?
>
> The closest is to use -mprefer-avx128 so you get SSE rather than AVX
> vector sizes. Eventually this option is among the valid target attributes
> for #pragma GCC target
>
>> On 19/10/2017, Jakub Jelinek <ja...@redhat.com> wrote:
>>> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>>>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov <dendib...@gmail.com>
>>>> wrote:
>>>> > Hello!
>>>> >
>>>> > I have a hot inner loop which was vectorized by gcc, but I also want
>>>> > compiler to unroll this loop by some factor.
>>>> > It can be controlled in clang with this pragma:
>>>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>>>> > Please see example here:
>>>> > https://godbolt.org/g/UJoUJn
>>>> >
>>>> > So I want to tell gcc something like this:
>>>> > "I want you to vectorize the loop. After that I want you to unroll
>>>> > this vectorized loop by some defined factor."
>>>> >
>>>> > I was playing with #pragma omp simd with the safelen clause, and
>>>> > #pragma GCC optimize("unroll-loops") with no success. Compiler option
>>>> > -fmax-unroll-times is not suitable for me, because it will affect
>>>> > other parts of the code.
>>>> >
>>>> > Is it possible to achieve this somehow?
>>>>
>>>> No.
>>>
>>> #pragma omp simd has simdlen clause which is a hint on the preferable
>>> vectorization factor, but the vectorizer doesn't use it so far;
>>> probably it wouldn't be that hard to at least use that as the starting
>>> factor if the target has multiple ones if it is one of those.
>>> The vectorizer has some support for using wider vectorization factors
>>> if there are mixed width types within the same loop, so perhaps
>>> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
>>> that hard.
>>> What we don't have right now is support for using smaller
>>> vectorization factors, which might be sometimes beneficial for -O2
>>> vectorization of mixed width type loops. We always use the vf derived
>>> from the smallest width type, say when using SSE2 and there is a char
>>> type, we try to use vf of 16 and if there is also int type, do
>>> operations on those in 4x as many instructions, while there is also an
>>> option to use vf of 4 and for operations on char just do something
>>> meaningful only in 1/4 of vector elements. The various x86 vector ISAs
>>> have instructions to widen or narrow for conversions.
>>>
>>> In any case, no is the right answer right now, we don't have that
>>> implemented.
>>>
>>> Jakub
>>>
>>
>>
>> --
>> Best regards,
>> Denis.
>

--
Best regards,
Denis.