On 26/12/14 22:49, Matt Godbolt wrote:
> On Fri, Dec 26, 2014 at 4:26 PM, Andrew Haley <a...@redhat.com> wrote:
>> On 26/12/14 20:32, Matt Godbolt wrote:
>>> Is there a reason why (in principle) the volatile increment can't be
>>> made into a single add? Clang and ICC both emit the same code for the
>>> volatile and non-volatile case.
>>
>> Yes.  Volatiles use the "as if" rule, where every memory access is as
>> written.  A volatile increment is defined as a load, an increment, and
>> a store.
> 
> That makes sense to me from a logical point of view. My
> understanding though is the volatile keyword was mainly used when
> working with memory-mapped devices, where memory loads and stores
> could not be elided. A single-instruction load-modify-write like
> "increment [addr]" adheres to these constraints even though it is a
> single instruction.  I realise my understanding could be wrong here!
> If not though, both clang and icc are taking a short-cut that may
> put them into a non-compliant state.

It's hard to be certain.  The language used by the standard is very
unhelpful: it requires all accesses to be as written, but does not
define exactly what constitutes an access.
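
To make the load/increment/store reading concrete, here is a minimal
C++ sketch.  The names counter and increment_as_written are made up for
illustration.  It spells out the three steps that a volatile increment
is taken to mean; whether a compiler may fuse them into a single
read-modify-write instruction is exactly the question the standard
leaves open.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter, standing in for any volatile object.
volatile std::uint64_t counter = 0;

// The "as written" reading of a volatile increment:
// one volatile load, one add, one volatile store.
void increment_as_written() {
    std::uint64_t tmp = counter;  // volatile load
    tmp = tmp + 1;                // ordinary arithmetic on a local
    counter = tmp;                // volatile store
}
```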

>> If you want single atomic increment, atomics are what you
>> should use.  If you want an increment to be written to memory, use a
>> store barrier after the increment.
> 
> Thanks. I realise I was unclear in my original email. I'm really
> looking for a way to say "do a non-lock-prefixed increment".

Why?

> Atomics are too strong and enforce a bus lock.  Doing a store
> barrier after the increment also appears heavy-handed: while I wish
> for eventual consistency with memory, I do not require it. I do
> however need the compiler to not move or elide my increment.

You could just use a compiler barrier: asm volatile("" ::: "memory");
But this is good only for x86 and a few others.  Everyone else needs a
real store barrier.
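
A sketch of the compiler-barrier approach, assuming GCC or Clang
extended asm.  The helper name compiler_barrier is an assumption, and
process_work is an empty stand-in.  The "memory" clobber is what makes
the empty asm act as a barrier: the compiler has to assume the asm may
read or write any memory, so it must store num_done before it and
cannot fold the increments into one add after the loop.  Note that a
concurrent reader of a plain uint64_t is still formally a data race
under the C++11 memory model.

```cpp
#include <cassert>
#include <cstdint>

std::uint64_t num_done = 0;

// Emits no instructions; only constrains the compiler (GCC/Clang syntax).
inline void compiler_barrier() { asm volatile("" ::: "memory"); }

void process_work() { /* stand-in for the expensive work */ }

void worker_thread(int num_work) {
    for (int i = 0; i < num_work; ++i) {
        process_work();
        num_done++;          // plain, non-atomic increment
        compiler_barrier();  // store must be emitted before this point
    }
}
```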

> At the moment I think the best I can do is to use an inline assembly
> version of the increment which prevents GCC from doing any
> optimisation upon it. That seems rather ugly though, and if anyone has
> any better suggestions I'd be very grateful.

Well, that's the problem: do you want a barrier or not?  With no
barrier there is no guarantee that the data will ever be written to
memory.  Do you only care about x86 processors?
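
For reference, the inline assembly version might look roughly like the
sketch below.  It assumes GCC or Clang on x86-64; the function name
increment_no_lock is invented here, and the fallback branch exists only
so the snippet builds on other targets.  The asm is a single incq on
memory with no lock prefix, so it is not atomic with respect to other
writers.

```cpp
#include <cassert>
#include <cstdint>

std::uint64_t num_done = 0;

inline void increment_no_lock(std::uint64_t &counter) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // One non-locked read-modify-write instruction.  "+m" marks the
    // operand as read and written in memory; volatile keeps the
    // compiler from deleting or merging the asm.
    asm volatile("incq %0" : "+m"(counter) : : "cc");
#else
    // Fallback for other targets: increment through a volatile lvalue.
    volatile std::uint64_t *p = &counter;
    *p = *p + 1;
#endif
}
```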

> To give a concrete example:
> 
> uint64_t num_done = 0;
> void process_work() { /* does something somewhat expensive */}
> void worker_thread(int num_work) {
>   for (int i = 0; i < num_work; ++i) {
>     process_work();
>     num_done++;  // ideally a relaxed atomic increment here
>   }
> }
> 
> void reporting_thread() {
>   while(true) {
>     sleep(60);
>     // ideally a relaxed read here; PRIu64 comes from <inttypes.h>
>     printf("worker has done %" PRIu64 "\n", num_done);
>   }
> }
> 
> 
> In the non-atomic case above, no locked instructions are used. Given
> enough information about what process_work() does, the compiler can
> realise that num_done can be added to outside of the loop (num_done +=
> num_work); which is the part I'd like to avoid.  By making the int
> atomic and using relaxed, I get this guarantee but at the cost of a
> "lock addl".

Ok, I get that, but not why.  If you care about a particular x86
instruction, you can use it in an inline asm.  I'm not at all sure what
you want, really.
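
One possible middle ground, sketched here with C++11 atomics under the
assumption that exactly one thread ever writes the counter: do the
relaxed load and the relaxed store separately rather than as one atomic
read-modify-write.  The function names are made up for illustration.
On x86-64, compilers typically turn the load/store pair into plain
moves and an add with no lock prefix, whereas fetch_add typically
becomes a locked instruction.  Current GCC and Clang do not merge
atomic stores across loop iterations, although the standard does not
strictly promise that.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<std::uint64_t> num_done{0};

// Atomic read-modify-write: correct with many writers, but usually a
// lock-prefixed instruction on x86.
void increment_rmw() {
    num_done.fetch_add(1, std::memory_order_relaxed);
}

// Separate relaxed load and store: the pair is NOT atomic as a whole,
// so this is only correct when a single thread does all the writing.
void increment_single_writer() {
    std::uint64_t v = num_done.load(std::memory_order_relaxed);
    num_done.store(v + 1, std::memory_order_relaxed);
}

// What the reporting thread would call.
std::uint64_t read_progress() {
    return num_done.load(std::memory_order_relaxed);
}
```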

Andrew.
