Re: How to force gcc to vectorize the loop with particular vectorization width

2017-10-20 Thread Denis Bakhvalov
Thank you for the reply!

Regarding the last part of your message, this is also what clang does
when you pass a vf of 4 (with the pragma from my first message) for a
loop operating on chars while using SSE2. It does meaningful work for
only 4 chars per iteration (a[0], zero, zero, zero, a[1], zero, zero,
zero, etc.).

Please see example here:
https://godbolt.org/g/3LAqZw

Let's say that I know all possible trip counts for my inner loop, and
none of them exceeds 32. In the example above the vf for this loop is
32, and there is a runtime check such that if the trip count does not
reach 32, it falls back to the scalar version.

As long as the trip count is always lower than 32, it always chooses
the scalar version at runtime. But in theory, with SSE2 and a trip
count of 8, it could use half of an xmm register (8 chars) to do
meaningful work.

Is gcc vectorizer capable of doing this?
If yes, can I somehow achieve this in gcc by tweaking the code or
adding some pragma?
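For reference, a minimal sketch of the kind of loop under discussion (the loop body and function name are my illustration, not taken from the godbolt link; the pragma is the one quoted in the thread). With clang under SSE2, the requested width of 4 means only 4 of the 16 char lanes carry useful work per iteration:

```c
#include <stddef.h>

/* Hypothetical example loop. The pragma requests a vectorization
   factor of 4 even though 16 chars fit in an xmm register, so most
   lanes do no useful work. gcc ignores the clang-specific pragma. */
void inc_chars(char *a, size_t n)
{
#pragma clang loop vectorize(enable) vectorize_width(4)
    for (size_t i = 0; i < n; i++)
        a[i] += 1;
}
```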

On 19/10/2017, Jakub Jelinek  wrote:
> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov 
>> wrote:
>> > Hello!
>> >
>> > I have a hot inner loop which was vectorized by gcc, but I also want
>> > the compiler to unroll this loop by some factor.
>> > It can be controlled in clang with this pragma:
>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>> > Please see example here:
>> > https://godbolt.org/g/UJoUJn
>> >
>> > So I want to tell gcc something like this:
>> > "I want you to vectorize the loop. After that I want you to unroll
>> > this vectorized loop by some defined factor."
>> >
>> > I was playing with #pragma omp simd with the safelen clause, and
>> > #pragma GCC optimize("unroll-loops") with no success. Compiler option
>> > -fmax-unroll-times is not suitable for me, because it will affect
>> > other parts of the code.
>> >
>> > Is it possible to achieve this somehow?
>>
>> No.
>
> #pragma omp simd has a simdlen clause which is a hint on the preferable
> vectorization factor, but the vectorizer doesn't use it so far;
> probably it wouldn't be that hard to at least use it as the starting
> factor, if the target supports multiple factors and it is one of those.
> The vectorizer has some support for using wider vectorization factors
> if there are mixed width types within the same loop, so perhaps
> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
> that hard.
> What we don't have right now is support for using smaller
> vectorization factors, which might sometimes be beneficial for -O2
> vectorization of mixed width type loops.  We always use the vf derived
> from the smallest width type: say, when using SSE2 and there is a char
> type, we try to use a vf of 16, and if there is also an int type, we do
> the int operations in 4x as many instructions.  The alternative would be
> to use a vf of 4 and, for operations on char, do something meaningful in
> only 1/4 of the vector elements.  The various x86 vector ISAs have
> instructions to widen or narrow for conversions.
>
> In any case, no is the right answer right now, we don't have that
> implemented.
>
>   Jakub
>
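A minimal loop with the mixed widths Jakub describes (my illustration, not code from the thread): under SSE2 the vf is derived from the char type, so the int additions take several times as many instructions.

```c
#include <stddef.h>

/* Illustration of a mixed-width loop: char and int in one loop body.
   The vectorizer derives the vf from the narrowest type (char). */
void mix(const char *a, int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] += a[i];
}
```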


-- 
Best regards,
Denis.


Re: How to force gcc to vectorize the loop with particular vectorization width

2017-10-20 Thread Richard Biener
On Fri, Oct 20, 2017 at 12:13 PM, Denis Bakhvalov  wrote:
> Thank you for the reply!
>
> Regarding the last part of your message, this is also what clang does
> when you pass a vf of 4 (with the pragma from my first message) for a
> loop operating on chars while using SSE2. It does meaningful work for
> only 4 chars per iteration (a[0], zero, zero, zero, a[1], zero, zero,
> zero, etc.).
>
> Please see example here:
> https://godbolt.org/g/3LAqZw
>
> Let's say that I know all possible trip counts for my inner loop, and
> none of them exceeds 32. In the example above the vf for this loop is
> 32, and there is a runtime check such that if the trip count does not
> reach 32, it falls back to the scalar version.
>
> As long as the trip count is always lower than 32, it always chooses
> the scalar version at runtime. But in theory, with SSE2 and a trip
> count of 8, it could use half of an xmm register (8 chars) to do
> meaningful work.
>
> Is gcc vectorizer capable of doing this?
> If yes, can I somehow achieve this in gcc by tweaking the code or
> adding some pragma?

The closest is to use -mprefer-avx128 so you get SSE rather than AVX
vector sizes.  Possibly this option is also among the valid target
attributes for #pragma GCC target.
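To make this concrete, here is a hedged shell sketch that probes which spelling of the 128-bit-preference flag a given gcc accepts. -mprefer-avx128 is the spelling used in this thread; the -mprefer-vector-width=128 spelling is my addition for newer toolchains, where it replaces the older option. It prints "supported: none" on targets where neither applies.

```shell
# Probe which 128-bit vector-width flag the installed gcc accepts.
# -mprefer-avx128 is the spelling used in this thread; later GCC
# releases spell it -mprefer-vector-width=128 (assumption for newer
# toolchains). Prints "supported: none" if neither works, e.g. on
# non-x86 targets.
supported=none
for flag in -mprefer-avx128 -mprefer-vector-width=128; do
    if gcc "$flag" -x c -c -o /dev/null /dev/null 2>/dev/null; then
        supported=$flag
        break
    fi
done
echo "supported: $supported"
```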

> On 19/10/2017, Jakub Jelinek  wrote:
>> On Thu, Oct 19, 2017 at 10:38:28AM +0200, Richard Biener wrote:
>>> On Thu, Oct 19, 2017 at 9:22 AM, Denis Bakhvalov 
>>> wrote:
>>> > Hello!
>>> >
>>> > I have a hot inner loop which was vectorized by gcc, but I also want
>>> > the compiler to unroll this loop by some factor.
>>> > It can be controlled in clang with this pragma:
>>> > #pragma clang loop vectorize(enable) vectorize_width(8)
>>> > Please see example here:
>>> > https://godbolt.org/g/UJoUJn
>>> >
>>> > So I want to tell gcc something like this:
>>> > "I want you to vectorize the loop. After that I want you to unroll
>>> > this vectorized loop by some defined factor."
>>> >
>>> > I was playing with #pragma omp simd with the safelen clause, and
>>> > #pragma GCC optimize("unroll-loops") with no success. Compiler option
>>> > -fmax-unroll-times is not suitable for me, because it will affect
>>> > other parts of the code.
>>> >
>>> > Is it possible to achieve this somehow?
>>>
>>> No.
>>
>> #pragma omp simd has a simdlen clause which is a hint on the preferable
>> vectorization factor, but the vectorizer doesn't use it so far;
>> probably it wouldn't be that hard to at least use it as the starting
>> factor, if the target supports multiple factors and it is one of those.
>> The vectorizer has some support for using wider vectorization factors
>> if there are mixed width types within the same loop, so perhaps
>> supporting 2x/4x/8x etc. sizes of the normally chosen width might not be
>> that hard.
>> What we don't have right now is support for using smaller
>> vectorization factors, which might sometimes be beneficial for -O2
>> vectorization of mixed width type loops.  We always use the vf derived
>> from the smallest width type: say, when using SSE2 and there is a char
>> type, we try to use a vf of 16, and if there is also an int type, we do
>> the int operations in 4x as many instructions.  The alternative would be
>> to use a vf of 4 and, for operations on char, do something meaningful in
>> only 1/4 of the vector elements.  The various x86 vector ISAs have
>> instructions to widen or narrow for conversions.
>>
>> In any case, no is the right answer right now, we don't have that
>> implemented.
>>
>>   Jakub
>>
>
>
> --
> Best regards,
> Denis.


Have "gcc -v" print information about slow run-time checking?

2017-10-20 Thread Thomas Schwinge
Hi!

In , a user wondered about released vs.
trunk GCC execution times, not knowing that the latter has more run-time
checking enabled.  Once that got clarified, the discussion proceeded:

(In reply to petschy from comment #6)
> Would it be sensible to put an extra line to the output of 'gcc/g++ -v' if
> the slow checks are enabled, which just states this fact / warns about
> (possibly mentioning the use of --enable-checking=release at configure)?
> Future tickets like this might be avoided this way.

(In reply to Andrew Pinski from comment #7)
> We already output one if you use -ftime-report.

(In reply to Jonathan Wakely from comment #8)
> We get reports like this every few months, and nobody ever uses
> -ftime-report before filing a bug. I think something in the -v output would
> be useful.

We kind-of already report that by means of "--enable-checking=[...]" as
part of the "Configured with:" line:

$ gcc-6 -v 2>&1 | sed -n '/^Configured with: /s%.*--enable-checking=\([^ ]*\).*%\1%p'
release

$ build-gcc/gcc/xgcc -v 2>&1 | sed -n '/^Configured with: /s%.*--enable-checking=\([^ ]*\).*%\1%p'
yes,df,fold,extra,rtl

Though, that only gets displayed if "--enable-checking=[...]" has been
specified explicitly, I think?
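The extraction can be exercised without a particular gcc binary by feeding the sed script a canned "Configured with:" line (the configure arguments below are invented for illustration):

```shell
# Simulated "gcc -v" output line; the configure arguments are made up.
line='Configured with: ../configure --prefix=/usr --enable-checking=yes,df,fold,extra,rtl --enable-languages=c,c++'
# Same sed script as above: print the --enable-checking value, if any.
checking=$(printf '%s\n' "$line" \
    | sed -n '/^Configured with: /s%.*--enable-checking=\([^ ]*\).*%\1%p')
echo "$checking"
```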


If that is generally considered useful, should I look into adding a
separate "Run-time checking: [...]" line?  Should that just print the
configured checks, or also some "slow!" notice, as suggested?  The latter
only for the really slow checks?  Are we able to identify these
generally?


Also, from reading the documentation, I can't tell whether it's intended
that running a checking-enabled GCC with "-fno-checking" would indeed get
rid of *all* the checking's run-time overhead.  (Basically, whether all
"ENABLE_*_CHECKING" usage is guarded by "flag_checking", too?  Not yet
verified.)


Grüße
 Thomas


Re: atomic_thread_fence() semantics

2017-10-20 Thread Torvald Riegel
On Thu, 2017-10-19 at 13:58 +0200, Mattias Rönnblom wrote:
> Hi.
> 
> I have this code:
> 
> #include <stdatomic.h>
> 
> int ready;
> int message;
> 
> void send_x4711(int m) {
>  message = m*4711;
>  atomic_thread_fence(memory_order_release);
>  ready = 1;
> }
> 
> When I compile it with GCC 7.2 -O3 -std=c11 on x86_64 it produces the 
> following code:
> 
> send_x4711:
> .LFB0:
> .LVL0:
>  imul edi, edi, 4711
> .LVL1:
>  mov DWORD PTR ready[rip], 1
>  mov DWORD PTR message[rip], edi
>  ret
> 
> I expected the store to 'message' and 'ready' to be in program order.
> 
> Did I misunderstand the semantics of 
> atomic_thread_fence+memory_order_release?

Yes.  You must make your program data-race-free.  This is required by
C11.  No other thread can observe "ready" without a data race or other
synchronization, so the fence is a noop in this program snippet.
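To make the point concrete, here is a data-race-free variant of the snippet (my sketch, not code from the thread): with the flag itself atomic, the release fence has a subsequent atomic store to pair with an acquire on the reader side.

```c
#include <stdatomic.h>

int message;
_Atomic int ready;

/* Sketch of a data-race-free sender: the flag store is atomic, so the
   release fence can synchronize with an acquire fence/load on the
   reader side, and the store to "message" may not sink past it. */
void send_x4711(int m)
{
    message = m * 4711;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}
```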



Re: atomic_thread_fence() semantics

2017-10-20 Thread Torvald Riegel
On Thu, 2017-10-19 at 15:31 +0300, Alexander Monakov wrote:
> On Thu, 19 Oct 2017, Andrew Haley wrote:
> > On 19/10/17 12:58, Mattias Rönnblom wrote:
> > > Did I misunderstand the semantics of 
> > > atomic_thread_fence+memory_order_release?
> > 
> > No, you did not.  This looks like a bug.  Please report it.
> 
> This bug is fixed on trunk, so should work from gcc-8 onwards (PR 80640).

The test case is invalid (I added some more detail as a comment on this
bug).




Re: atomic_thread_fence() semantics

2017-10-20 Thread Torvald Riegel
On Thu, 2017-10-19 at 13:18 +0100, Andrew Haley wrote:
> On 19/10/17 13:10, Jonathan Wakely wrote:
> > There are no atomic operations on atomic objects here, so the fence
> > doesn't synchronize with anything.
> 
> Really?  This seems rather unhelpful, to say the least.
> 
> An atomic release operation X in thread A synchronizes-with an acquire
> fence F in thread B, if
>
>   - there exists an atomic read Y (with any memory order),
>   - Y reads the value written by X (or by the release sequence headed by X),
>   - Y is sequenced-before F in thread B.

You write that X is an _atomic_ release operation, but that would have
to be an atomic memory_order_release store.  Alternatively, it would
have to be an atomic memory_order_relaxed store sequenced after a
release fence.  There are only non-atomic stores in this example, so
reordering them before the release fence is not observable in a correct
program (i.e., a data-race-free one).
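The matching acquire side that the quoted rule describes might look like this sketch (the function name and the atomic type of the flag are my assumptions, mirroring the thread's example):

```c
#include <stdatomic.h>

int message;
_Atomic int ready;

/* Sketch of the acquire side: the relaxed load of "ready" is the read Y;
   if it observes the value written by the release side, the acquire
   fence sequenced after it establishes synchronizes-with, making the
   earlier non-atomic store to "message" visible. */
int try_receive(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed)) {
        atomic_thread_fence(memory_order_acquire);
        *out = message;
        return 1;
    }
    return 0;  /* not ready yet */
}
```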



Re: Have "gcc -v" print information about slow run-time checking?

2017-10-20 Thread Richard Biener
On Fri, Oct 20, 2017 at 12:40 PM, Thomas Schwinge
 wrote:
> Hi!
>
> In , a user wondered about released vs.
> trunk GCC execution times, not knowing that the latter has more run-time
> checking enabled.  Once that got clarified, the discussion proceeded:
>
> (In reply to petschy from comment #6)
>> Would it be sensible to put an extra line to the output of 'gcc/g++ -v' if
>> the slow checks are enabled, which just states this fact / warns about
>> (possibly mentioning the use of --enable-checking=release at configure)?
>> Future tickets like this might be avoided this way.
>
> (In reply to Andrew Pinski from comment #7)
>> We already output one if you use -ftime-report.
>
> (In reply to Jonathan Wakely from comment #8)
>> We get reports like this every few months, and nobody ever uses
>> -ftime-report before filing a bug. I think something in the -v output would
>> be useful.
>
> We kind-of already report that by means of "--enable-checking=[...]" as
> part of the "Configured with:" line:
>
> $ gcc-6 -v 2>&1 | sed -n '/^Configured with: /s%.*--enable-checking=\([^ ]*\).*%\1%p'
> release
>
> $ build-gcc/gcc/xgcc -v 2>&1 | sed -n '/^Configured with: /s%.*--enable-checking=\([^ ]*\).*%\1%p'
> yes,df,fold,extra,rtl
>
> Though, that only gets displayed if "--enable-checking=[...]" has been
> specified explicitly, I think?
>
>
> If that is generally considered useful, should I look into adding a
> separate "Run-time checking: [...]" line?  Should that just print the
> configured checks, or also some "slow!" notice, as suggested?  The latter
> only for the really slow checks?  Are we able to identify these
> generally?

All checking slows down the compiler.  I guess adding stuff to -v output might
break scripts out in the wild...  but yes, we could output the same boiler-plate
as with -ftime-report.

>
> Also, from reading the documentation, I can't tell whether it's intended
> that running a checking-enabled GCC with "-fno-checking" would indeed get
> rid of *all* the checking's run-time overhead.  (Basically, whether all
> "ENABLE_*_CHECKING" usage is guarded by "flag_checking", too?  Not yet
> verified.)

No, for example all the tree and rtl macro checking isn't guarded by
flag_checking, nor are gcc_checking_assert()s, etc.  The flag wasn't
intended to be used in the negative form, but to allow better debugging
of problems in a release-checking compiler: when you get a random ICE,
you can re-run with -fchecking and immediately pinpoint a more useful
place where things go wrong (you mostly get IL checking that way).

Richard.

>
> Grüße
>  Thomas


Re: atomic_thread_fence() semantics

2017-10-20 Thread Torvald Riegel
On Fri, 2017-10-20 at 12:47 +0200, Torvald Riegel wrote:
> On Thu, 2017-10-19 at 13:58 +0200, Mattias Rönnblom wrote:
> > Hi.
> > 
> > I have this code:
> > 
> > #include <stdatomic.h>
> > 
> > int ready;
> > int message;
> > 
> > void send_x4711(int m) {
> >  message = m*4711;
> >  atomic_thread_fence(memory_order_release);
> >  ready = 1;
> > }
> > 
> > When I compile it with GCC 7.2 -O3 -std=c11 on x86_64 it produces the 
> > following code:
> > 
> > send_x4711:
> > .LFB0:
> > .LVL0:
> >  imul edi, edi, 4711
> > .LVL1:
> >  mov DWORD PTR ready[rip], 1
> >  mov DWORD PTR message[rip], edi
> >  ret
> > 
> > I expected the store to 'message' and 'ready' to be in program order.
> > 
> > Did I misunderstand the semantics of 
> > atomic_thread_fence+memory_order_release?
> 
> Yes.  You must make your program data-race-free.  This is required by
> C11.  No other thread can observe "ready" without a data race or other
> synchronization, so the fence is a noop in this program snippet.

Just to avoid confusion:  the store to "message" must, *conceptually*,
not be reordered to after the release thread fence.  What that means
precisely for the implementation depends on whether the thread fence has
to emit a HW fence, and on whether all atomics are compiler barriers.  

For example, on x86's TSO model, release fences are implicit, so if
atomic stores are compiler barriers, one doesn't need to add a separate
compiler barrier; non-atomic accesses can be moved to before the release
fence, which means that in the example above the thread fence is
conceptually at the end of the function (which is fine).
If the store to "ready" were atomic (and a compiler barrier), the
release MO fence would conceptually sit right before that store:

    message = m*4711;
    atomic_thread_fence(memory_order_release);
    foo = 123;  // can move to before the fence, so can be reordered
                // freely wrt. the store to message
    atomic_store (ready, 1, memory_order_relaxed);

That depends on the relaxed atomic store being a compiler barrier, which
it doesn't necessarily have to be in a valid implementation.



Re: atomic_thread_fence() semantics

2017-10-20 Thread Alexander Monakov
On Fri, 20 Oct 2017, Torvald Riegel wrote:
> On Thu, 2017-10-19 at 15:31 +0300, Alexander Monakov wrote:
> > On Thu, 19 Oct 2017, Andrew Haley wrote:
> > > No, you did not.  This looks like a bug.  Please report it.
> > 
> > This bug is fixed on trunk, so should work from gcc-8 onwards (PR 80640).
> 
> The test case is invalid (I added some more detail as a comment on this
> bug).

Sorry, I was imprecise.  To be clear, the issue I referred to above as the
"bug [that was] fixed on trunk" is the issue Andrew Haley pointed out: when
GCC transitioned from GIMPLE to RTL IR, empty RTL was emitted for the fence
statement, losing its compile-time effect as a compiler memory barrier entirely.

I agree that the testcase in the opening message of this thread is not valid
in the sense that this reordering could not have changed the behavior of a
conforming program, but the optimization that GCC performed here was entirely
unintentional, not something the compiler is presently designed to do.

Alexander