On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov <dendib...@gmail.com> wrote:
> Thank you for the reply!
>
> Regarding the last part of your message, this is also what clang will do
> when you pass a vf of 4 (with the pragma from my first message)
> for a loop operating on chars using SSE2. It will do meaningful
> work only for 4 chars per iteration (a[0], zero, zero, zero, a[1],
> zero, zero, zero, etc.).
>
> Please see example here:
> https://godbolt.org/g/3LAqZw
>
> Let's say that I know all possible trip counts for my inner loop. They
> all do not exceed 32. In the example above vf for this loop is 32.
> There is a runtime check, such that if the trip count does not exceed 32 it
> will fall back to the scalar version.
>
> As long as the trip count is always lower than 32, it always chooses
> the scalar version at runtime.
> But theoretically, using SSE2 for trip count = 8 it can use half of
> an xmm register (8 chars) to do meaningful work.
>
> Is the gcc vectorizer capable of doing this?
> If yes, can I somehow achieve this in gcc by tweaking the code or
> adding some pragma?

The closest is to use -mprefer-avx128 so you get SSE rather than AVX
vector sizes. This option may also be among the valid target attributes
for #pragma GCC target.

> On 19/10/2017, Jakub Jelinek <ja...@redhat.com> wrote:
>> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov <dendib...@gmail.com>
>>> wrote:
>>> > Hello!
>>> >
>>> > I have a hot inner loop which was vectorized by gcc, but I also want
>>> > the compiler to unroll this loop by some factor.
>>> > It can be controlled in clang with this pragma:
>>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>>> > Please see example here:
>>> > https://godbolt.org/g/UJoUJn
>>> >
>>> > So I want to tell gcc something like this:
>>> > "I want you to vectorize the loop. After that I want you to unroll
>>> > this vectorized loop by some defined factor."
>>> >
>>> > I was playing with #pragma omp simd with the safelen clause, and
>>> > #pragma GCC optimize("unroll-loops") with no success. Compiler option
>>> > -fmax-unroll-times is not suitable for me, because it will affect
>>> > other parts of the code.
>>> >
>>> > Is it possible to achieve this somehow?
>>>
>>> No.
>>
>> #pragma omp simd has a simdlen clause which is a hint on the preferable
>> vectorization factor, but the vectorizer doesn't use it so far;
>> probably it wouldn't be that hard to at least use that as the starting
>> factor if the target has multiple ones and it is one of those.
>> The vectorizer has some support for using wider vectorization factors
>> if there are mixed width types within the same loop, so perhaps
>> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
>> that hard.
>> What we don't have right now is support for using smaller
>> vectorization factors, which might be sometimes beneficial for -O2
>> vectorization of mixed width type loops. We always use the vf derived
>> from the smallest width type, say when using SSE2 and there is a char type,
>> we try to use vf of 16 and if there is also int type, do operations on
>> those in 4x as many instructions, while there is also an option to use
>> vf of 4 and for operations on char just do something meaningful only in 1/4
>> of vector elements. The various x86 vector ISAs have instructions to
>> widen or narrow for conversions.
>>
>> In any case, no is the right answer right now, we don't have that
>> implemented.
>>
>> Jakub
>>
>
>
> --
> Best regards,
> Denis.
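
For readers following the thread, here is a minimal sketch of what the simdlen hint discussed above looks like in source. The function name add_bytes and its signature are hypothetical, chosen only for illustration. As Jakub notes, the GCC vectorizer did not act on simdlen at the time of this thread, so the clause should be read as a request and not a guarantee. The sketch assumes compilation with something like `gcc -O2 -fopenmp-simd -mprefer-avx128`; without an OpenMP flag the pragma is simply ignored and the loop stays a plain scalar or auto-vectorized loop.

```c
#include <stddef.h>

/* Hypothetical kernel: element-wise add of two byte arrays.
 * simdlen(8) asks for 8 lanes per vector iteration (half of an xmm
 * register for chars). It is only a hint; per the thread, GCC's
 * vectorizer did not use it at the time, so the generated code may
 * still use the default vf of 16 (SSE2) or fall back to scalar. */
void add_bytes(unsigned char *restrict dst,
               const unsigned char *restrict a,
               const unsigned char *restrict b,
               size_t n)
{
    #pragma omp simd simdlen(8)
    for (size_t i = 0; i < n; ++i)
        dst[i] = (unsigned char)(a[i] + b[i]); /* wraps modulo 256 */
}
```

Calling add_bytes(dst, a, b, 8) corresponds to the trip-count-8 case from the question; whether it runs the vector or scalar path depends on the compiler version and flags, which is best checked in the generated assembly.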