Re: How to force gcc to vectorize the loop with particular vectorization width

2017-10-21 Thread Denis Bakhvalov
Hello Richard,
Thank you. I achieved vectorization with vf = 16, using
#pragma GCC optimize ("no-unroll-loops")
__attribute__ ((__target__ ("sse4.2")))
and options -march=core-avx2 -mprefer-avx128

But now I have a question: Is it possible in gcc to have vectorization
with vf < 16?
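
For reference, a minimal sketch of how the pragma and the attribute above can be attached to a single function. The function name add_bytes, its signature, and the SSE42_TARGET macro are illustrative assumptions, not code from the original project; the preprocessor guards are only there so the sketch still compiles on other compilers and targets.

```c
#include <assert.h>
#include <stddef.h>

/* Apply the GCC-specific pieces only where GCC on x86 understands them. */
#if defined(__GNUC__) && !defined(__clang__) \
    && (defined(__x86_64__) || defined(__i386__))
#define SSE42_TARGET __attribute__ ((__target__ ("sse4.2")))
#pragma GCC push_options
#pragma GCC optimize ("no-unroll-loops")
#else
#define SSE42_TARGET
#endif

/* Hypothetical kernel: adds b[i] into a[i] for n chars.  */
SSE42_TARGET
void add_bytes (unsigned char *restrict a,
                const unsigned char *restrict b, size_t n)
{
  assert (a != NULL && b != NULL);
  for (size_t i = 0; i < n; i++)
    a[i] = (unsigned char) (a[i] + b[i]);
}

#if defined(__GNUC__) && !defined(__clang__) \
    && (defined(__x86_64__) || defined(__i386__))
#pragma GCC pop_options
#endif
```

Compiling with something like gcc -O3 -march=core-avx2 -mprefer-avx128 -fopt-info-vec should report whether the loop was vectorized; the exact output depends on the gcc version.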

On 20/10/2017, Richard Biener  wrote:
> On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov 
> wrote:
>> Thank you for the reply!
>>
>> Regarding the last part of your message, this is also what clang does
>> when you pass a vf of 4 (with the pragma from my first message)
>> for a loop operating on chars with SSE2. It does meaningful
>> work on only 4 chars per iteration (a[0], zero, zero, zero, a[1],
>> zero, zero, zero, etc.).
>>
>> Please see example here:
>> https://godbolt.org/g/3LAqZw
>>
>> Let's say that I know all possible trip counts for my inner loop, and
>> none of them exceeds 32. In the example above the vf for this loop is 32.
>> There is a runtime check, so that if the trip count does not exceed 32
>> the code falls back to the scalar version.
>>
>> As long as the trip count is always lower than 32, the scalar version
>> is always chosen at runtime.
>> But theoretically, using SSE2 for trip count = 8, it could use half of
>> an xmm register (8 chars) to do meaningful work.
>>
>> Is gcc vectorizer capable of doing this?
>> If yes, can I somehow achieve this in gcc by tweaking the code or
>> adding some pragma?
>
> The closest is to use -mprefer-avx128 so you get SSE rather than AVX
> vector sizes.  Possibly this option is also among the valid target
> attributes for #pragma GCC target.
>
>> On 19/10/2017, Jakub Jelinek  wrote:
>>> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>>>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov
>>>> wrote:
>>>> > Hello!
>>>> >
>>>> > I have a hot inner loop which was vectorized by gcc, but I also want
>>>> > the compiler to unroll this loop by some factor.
>>>> > It can be controlled in clang with this pragma:
>>>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>>>> > Please see example here:
>>>> > https://godbolt.org/g/UJoUJn
>>>> >
>>>> > So I want to tell gcc something like this:
>>>> > "I want you to vectorize the loop. After that I want you to unroll
>>>> > this vectorized loop by some defined factor."
>>>> >
>>>> > I was playing with #pragma omp simd with the safelen clause, and
>>>> > #pragma GCC optimize("unroll-loops") with no success. Compiler option
>>>> > -fmax-unroll-times is not suitable for me, because it will affect
>>>> > other parts of the code.
>>>> >
>>>> > Is it possible to achieve this somehow?
>>>>
>>>> No.
>>>
>>> #pragma omp simd has simdlen clause which is a hint on the preferable
>>> vectorization factor, but the vectorizer doesn't use it so far;
>>> probably it wouldn't be that hard to at least use that as the starting
>>> factor if the target has multiple ones if it is one of those.
>>> The vectorizer has some support for using wider vectorization factors
>>> if there are mixed width types within the same loop, so perhaps
>>> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
>>> that hard.
>>> What we don't have right now is support for using smaller
>>> vectorization factors, which might be sometimes beneficial for -O2
>>> vectorization of mixed width type loops.  We always use the vf derived
>>> from the smallest width type, say when using SSE2 and there is a char
>>> type, we try to use vf of 16 and if there is also int type, do
>>> operations on those in 4x as many instructions, while there is also an
>>> option to use vf of 4 and for operations on char just do something
>>> meaningful only in 1/4 of vector elements.  The various x86 vector ISAs
>>> have instructions to widen or narrow for conversions.
>>>
>>> In any case, no is the right answer right now, we don't have that
>>> implemented.
>>>
>>>   Jakub
>>>
>>
>>
>> --
>> Best regards,
>> Denis.
>


-- 
Best regards,
Denis.
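
The arithmetic behind Jakub's SSE2 example quoted above (vf of 16 for char, int operations in 4x as many instructions, or vf of 4 with only 1/4 of the char lanes used) can be sketched as follows. This is only an illustration of the numbers in that paragraph, not gcc's actual cost model; both helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Vectorization factor when it is derived from the narrowest element
   type in the loop (illustrative helper, not a gcc interface).  */
static size_t vf_from_narrowest (size_t vector_bytes, size_t narrowest_bytes)
{
  assert (narrowest_bytes > 0 && vector_bytes % narrowest_bytes == 0);
  return vector_bytes / narrowest_bytes;
}

/* Number of vector registers needed to hold vf elements of a given
   size, rounded up to whole vectors.  */
static size_t vectors_per_op (size_t vf, size_t elem_bytes,
                              size_t vector_bytes)
{
  return (vf * elem_bytes + vector_bytes - 1) / vector_bytes;
}
```

With 16-byte SSE2 vectors, vf_from_narrowest (16, 1) gives 16 for char, and vectors_per_op (16, 4, 16) gives 4 vectors per int operation. Choosing vf = 4 instead needs a single vector for the char operations, of which only 4 of the 16 lanes carry data.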


Re: atomic_thread_fence() semantics

2017-10-21 Thread Torvald Riegel
On Fri, 2017-10-20 at 18:46 +0300, Alexander Monakov wrote:
> On Fri, 20 Oct 2017, Torvald Riegel wrote:
> > On Thu, 2017-10-19 at 15:31 +0300, Alexander Monakov wrote:
> > > On Thu, 19 Oct 2017, Andrew Haley wrote:
> > > > No, you did not.  This looks like a bug.  Please report it.
> > > 
> > > This bug is fixed on trunk, so should work from gcc-8 onwards (PR 80640).
> > 
> > The test case is invalid (I added some more detail as a comment on this
> > bug).
> 
> Sorry, I was imprecise.  To be clear, the issue I referred to above as the
> "bug [that was] fixed on trunk" is the issue Andrew Haley pointed out: when
> GCC transitioned from GIMPLE to RTL IR, empty RTL was emitted for the fence
> statement, losing its compile-time effect as a compiler memory barrier 
> entirely.

What I tried to convey was that I think this can be a (part of a) valid
implementation on certain hardware, and when considering C11/C++11 or
more recent, which require programs to not have any data races (as
defined by these standards).
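
As a point of reference, a minimal sketch of a data-race-free use of atomic_thread_fence() in C11, where the fences are paired with atomic accesses to a flag. This is not the test case from PR 80640; the names payload, ready, publish and try_consume are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;          /* plain, non-atomic data */
static atomic_bool ready;    /* flag that publishes the data */

/* Writer side: store the data, then publish it.  The release fence
   orders the plain store before the relaxed store to the flag.  */
void publish (int value)
{
  payload = value;
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&ready, true, memory_order_relaxed);
}

/* Reader side: only touch the data after observing the flag.  The
   acquire fence orders the relaxed load before the plain load.  */
bool try_consume (int *out)
{
  assert (out != NULL);
  if (!atomic_load_explicit (&ready, memory_order_relaxed))
    return false;
  atomic_thread_fence (memory_order_acquire);
  *out = payload;
  return true;
}
```

Because payload is only read after the flag has been observed, the program has no data race in the C11 sense, which is the class of programs the standard's guarantees apply to.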




Re: How to force gcc to vectorize the loop with particular vectorization width

2017-10-21 Thread Richard Biener
On October 21, 2017 9:50:13 PM GMT+02:00, Denis Bakhvalov  
wrote:
>Hello Richard,
>Thank you. I achieved vectorization with vf = 16, using
>#pragma GCC optimize ("no-unroll-loops")
>__attribute__ ((__target__ ("sse4.2")))
>and options -march=core-avx2 -mprefer-avx128
>
>But now I have a question: Is it possible in gcc to have vectorization
>with vf < 16?

No, not at the moment. 

Richard. 
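
Since the vectorizer cannot be asked for a smaller vf, one possible workaround for the trip count = 8 case is to write the half-register operation by hand with SSE2 intrinsics. The sketch below is an assumption about what such code could look like, not something proposed in this thread; the helper name add8_bytes is hypothetical, and the scalar fallback is only there so it compiles on targets without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: adds 8 chars of b into a, using the low half
   of an xmm register when SSE2 is available.  */
static void add8_bytes (unsigned char *a, const unsigned char *b)
{
  assert (a != NULL && b != NULL);
#if defined(__SSE2__)
  /* Load 8 bytes into the low 64 bits; the upper half is zeroed.  */
  __m128i va = _mm_loadl_epi64 ((const __m128i *) a);
  __m128i vb = _mm_loadl_epi64 ((const __m128i *) b);
  /* Byte-wise wrapping add, then store only the low 64 bits.  */
  _mm_storel_epi64 ((__m128i *) a, _mm_add_epi8 (va, vb));
#else
  for (size_t i = 0; i < 8; i++)
    a[i] = (unsigned char) (a[i] + b[i]);
#endif
}
```

Only the low 8 bytes are loaded and stored, so the arrays do not need to be padded to a full 16 bytes.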
