Re: How to force gcc to vectorize the loop with particular vectorization width
Hello Richard,

Thank you. I achieved vectorization with vf = 16, using

#pragma GCC optimize ("no-unroll-loops")
__attribute__ ((__target__ ("sse4.2")))

and options -march=core-avx2 -mprefer-avx128

But now I have a question: is it possible in gcc to have vectorization
with vf < 16?

On 20/10/2017, Richard Biener wrote:
> On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov wrote:
>> Thank you for the reply!
>>
>> Regarding the last part of your message, this is also what clang will do
>> when you pass a vf of 4 (with the pragma from my first message)
>> for a loop operating on chars and using SSE2. It will do meaningful
>> work only for 4 chars per iteration (a[0], zero, zero, zero, a[1],
>> zero, zero, zero, etc.).
>>
>> Please see the example here:
>> https://godbolt.org/g/3LAqZw
>>
>> Let's say that I know all possible trip counts for my inner loop. None
>> of them exceeds 32. In the example above vf for this loop is 32.
>> There is a runtime check, such that if the trip count does not exceed 32
>> it will fall back to the scalar version.
>>
>> As long as the trip count is always lower than 32, it always chooses
>> the scalar version at runtime.
>> But theoretically, using SSE2 for trip count = 8, it can use half of an
>> xmm register (8 chars) to do meaningful work.
>>
>> Is the gcc vectorizer capable of doing this?
>> If yes, can I somehow achieve this in gcc by tweaking the code or
>> adding some pragma?
>
> The closest is to use -mprefer-avx128 so you get SSE rather than AVX
> vector sizes. Possibly this option is among the valid target attributes
> for #pragma GCC target.
>
>> On 19/10/2017, Jakub Jelinek wrote:
>>> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>>>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov wrote:
>>>>> Hello!
>>>>>
>>>>> I have a hot inner loop which was vectorized by gcc, but I also want
>>>>> the compiler to unroll this loop by some factor.
>>>>> It can be controlled in clang with this pragma:
>>>>> #pragma clang loop vectorize(enable) vectorize_width(8)
>>>>> Please see the example here:
>>>>> https://godbolt.org/g/UJoUJn
>>>>>
>>>>> So I want to tell gcc something like this:
>>>>> "I want you to vectorize the loop. After that I want you to unroll
>>>>> this vectorized loop by some defined factor."
>>>>>
>>>>> I was playing with #pragma omp simd with the safelen clause, and
>>>>> #pragma GCC optimize("unroll-loops") with no success. Compiler option
>>>>> -fmax-unroll-times is not suitable for me, because it will affect
>>>>> other parts of the code.
>>>>>
>>>>> Is it possible to achieve this somehow?
>>>>
>>>> No.
>>>
>>> #pragma omp simd has a simdlen clause which is a hint on the preferable
>>> vectorization factor, but the vectorizer doesn't use it so far;
>>> probably it wouldn't be that hard to at least use that as the starting
>>> factor if the target has multiple ones if it is one of those.
>>> The vectorizer has some support for using wider vectorization factors
>>> if there are mixed width types within the same loop, so perhaps
>>> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
>>> that hard.
>>> What we don't have right now is support for using smaller
>>> vectorization factors, which might be sometimes beneficial for -O2
>>> vectorization of mixed width type loops. We always use the vf derived
>>> from the smallest width type, say when using SSE2 and there is a char
>>> type, we try to use vf of 16 and if there is also an int type, do
>>> operations on those in 4x as many instructions, while there is also an
>>> option to use vf of 4 and for operations on char just do something
>>> meaningful only in 1/4 of vector elements. The various x86 vector ISAs
>>> have instructions to widen or narrow for conversions.
>>>
>>> In any case, no is the right answer right now, we don't have that
>>> implemented.
>>>
>>> Jakub
>>>
>>
>>
>> --
>> Best regards,
>> Denis.
>

--
Best regards,
Denis.
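For readers following along, here is a minimal sketch of the kind of setup described in this message. The function `add_chars` and its arguments are hypothetical and are not taken from the godbolt examples; only the pragma and the attribute come from the message. They are GCC- and x86-specific, so the sketch guards them. Whether GCC actually picks vf = 16 depends on the compiler version and on options such as -O3 -march=core-avx2 -mprefer-avx128, which the sketch does not verify.

```c
#include <stddef.h>

/* The pragma and attribute below are the ones quoted in the message.
   They only make sense for GCC on x86, hence the guard (an assumption
   of this sketch, not something the thread discusses). */
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__x86_64__) || defined(__i386__))
#define SKETCH_GCC_X86 1
#pragma GCC push_options
#pragma GCC optimize ("no-unroll-loops")
#define SSE42_TARGET __attribute__ ((__target__ ("sse4.2")))
#else
#define SSE42_TARGET
#endif

/* Hypothetical hot loop over chars. With 128-bit vectors, 16 chars fit
   in one xmm register, which is where vf = 16 comes from. */
SSE42_TARGET
void add_chars(unsigned char *restrict dst, const unsigned char *restrict a,
               const unsigned char *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (unsigned char)(a[i] + b[i]);
}

#ifdef SKETCH_GCC_X86
/* Restore the options so the pragma does not affect later functions. */
#pragma GCC pop_options
#endif
```

The push_options/pop_options pair keeps the optimize pragma from leaking into the rest of the translation unit, which is the concern raised earlier in the thread about global unrolling options.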
Re: atomic_thread_fence() semantics
On Fri, 2017-10-20 at 18:46 +0300, Alexander Monakov wrote:
> On Fri, 20 Oct 2017, Torvald Riegel wrote:
> > On Thu, 2017-10-19 at 15:31 +0300, Alexander Monakov wrote:
> > > On Thu, 19 Oct 2017, Andrew Haley wrote:
> > > > No, you did not. This looks like a bug. Please report it.
> > >
> > > This bug is fixed on trunk, so should work from gcc-8 onwards (PR 80640).
> >
> > The test case is invalid (I added some more detail as a comment on this
> > bug).
>
> Sorry, I was imprecise. To be clear, the issue I referred to above as the
> "bug [that was] fixed on trunk" is the issue Andrew Haley pointed out: when
> GCC transitioned from GIMPLE to RTL IR, empty RTL was emitted for the fence
> statement, losing its compile-time effect as a compiler memory barrier
> entirely.

What I tried to convey was that I think this can be a (part of a) valid
implementation on certain hardware, and when considering C11/C++11 or
more recent, which require programs to not have any data races (as
defined by these standards).
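To illustrate the data-race-free requirement mentioned here, below is a minimal C11 sketch of fence-based message passing. It is not the test case from the PR (which is not shown in this thread); the names `payload`, `ready`, `publish` and `try_consume` are hypothetical. The point is that the flag is an atomic object, so the fences have something to synchronize through and the program has no data race.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain, non-atomic data */
static atomic_int ready;   /* atomic flag, zero-initialized (static storage) */

/* Writer: the release fence orders the payload write before the
   relaxed store to the flag. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: if the relaxed load sees the flag set, the acquire fence
   makes the earlier payload write visible before it is read. */
bool try_consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

If `ready` were a plain int instead, concurrent calls from two threads would race on it, and the standards give such a program no defined behavior regardless of how the fence is compiled.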
Re: How to force gcc to vectorize the loop with particular vectorization width
On October 21, 2017 9:50:13 PM GMT+02:00, Denis Bakhvalov wrote:
>Hello Richard,
>Thank you. I achieved vectorization with vf = 16, using
>#pragma GCC optimize ("no-unroll-loops")
>__attribute__ ((__target__ ("sse4.2")))
>and options -march=core-avx2 -mprefer-avx128
>
>But now I have a question: is it possible in gcc to have vectorization
>with vf < 16?

No, not at the moment.

Richard.

>On 20/10/2017, Richard Biener wrote:
>> On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov wrote:
>>> Thank you for the reply!
>>>
>>> Regarding the last part of your message, this is also what clang will do
>>> when you pass a vf of 4 (with the pragma from my first message)
>>> for a loop operating on chars and using SSE2. It will do meaningful
>>> work only for 4 chars per iteration (a[0], zero, zero, zero, a[1],
>>> zero, zero, zero, etc.).
>>>
>>> Please see the example here:
>>> https://godbolt.org/g/3LAqZw
>>>
>>> Let's say that I know all possible trip counts for my inner loop. None
>>> of them exceeds 32. In the example above vf for this loop is 32.
>>> There is a runtime check, such that if the trip count does not exceed 32
>>> it will fall back to the scalar version.
>>>
>>> As long as the trip count is always lower than 32, it always chooses
>>> the scalar version at runtime.
>>> But theoretically, using SSE2 for trip count = 8, it can use half of an
>>> xmm register (8 chars) to do meaningful work.
>>>
>>> Is the gcc vectorizer capable of doing this?
>>> If yes, can I somehow achieve this in gcc by tweaking the code or
>>> adding some pragma?
>>
>> The closest is to use -mprefer-avx128 so you get SSE rather than AVX
>> vector sizes. Possibly this option is among the valid target attributes
>> for #pragma GCC target.
>>
>>> On 19/10/2017, Jakub Jelinek wrote:
>>>> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>>>>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov wrote:
>>>>>> Hello!
>>>>>>
>>>>>> I have a hot inner loop which was vectorized by gcc, but I also want
>>>>>> the compiler to unroll this loop by some factor.
>>>>>> It can be controlled in clang with this pragma:
>>>>>> #pragma clang loop vectorize(enable) vectorize_width(8)
>>>>>> Please see the example here:
>>>>>> https://godbolt.org/g/UJoUJn
>>>>>>
>>>>>> So I want to tell gcc something like this:
>>>>>> "I want you to vectorize the loop. After that I want you to unroll
>>>>>> this vectorized loop by some defined factor."
>>>>>>
>>>>>> I was playing with #pragma omp simd with the safelen clause, and
>>>>>> #pragma GCC optimize("unroll-loops") with no success. Compiler option
>>>>>> -fmax-unroll-times is not suitable for me, because it will affect
>>>>>> other parts of the code.
>>>>>>
>>>>>> Is it possible to achieve this somehow?
>>>>>
>>>>> No.
>>>>
>>>> #pragma omp simd has a simdlen clause which is a hint on the preferable
>>>> vectorization factor, but the vectorizer doesn't use it so far;
>>>> probably it wouldn't be that hard to at least use that as the starting
>>>> factor if the target has multiple ones if it is one of those.
>>>> The vectorizer has some support for using wider vectorization factors
>>>> if there are mixed width types within the same loop, so perhaps
>>>> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
>>>> that hard.
>>>> What we don't have right now is support for using smaller
>>>> vectorization factors, which might be sometimes beneficial for -O2
>>>> vectorization of mixed width type loops. We always use the vf derived
>>>> from the smallest width type, say when using SSE2 and there is a char
>>>> type, we try to use vf of 16 and if there is also an int type, do
>>>> operations on those in 4x as many instructions, while there is also an
>>>> option to use vf of 4 and for operations on char just do something
>>>> meaningful only in 1/4 of vector elements. The various x86 vector ISAs
>>>> have instructions to widen or narrow for conversions.
>>>>
>>>> In any case, no is the right answer right now, we don't have that
>>>> implemented.
>>>>
>>>> Jakub
>>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Denis.
>>
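The arithmetic behind Jakub's SSE2 example can be sketched in a few lines of C. This is only an illustration of the numbers he quotes (16 chars or 4 ints per 128-bit vector), not of how the GCC vectorizer is implemented; the function names and the 16-byte constant are assumptions made for the sketch.

```c
#include <stddef.h>

/* Assumed vector width for the sketch: one 128-bit SSE2 register. */
enum { SSE_VECTOR_BYTES = 16 };

/* Number of elements of the given size that fit in one vector. */
size_t lanes_per_vector(size_t elem_size)
{
    return SSE_VECTOR_BYTES / elem_size;
}

/* Number of vector operations needed per vectorized loop iteration
   to cover vf elements of the given size (rounded up). */
size_t vectors_per_iteration(size_t vf, size_t elem_size)
{
    size_t lanes = lanes_per_vector(elem_size);
    return (vf + lanes - 1) / lanes;
}
```

With vf = 16, a 1-byte char needs one vector operation per iteration and a 4-byte int needs four, which is the "4x as many instructions" case. With vf = 4, both need one, but the char operation uses only 4 of its 16 lanes, which is the "1/4 of vector elements" case.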