Re: volatile access optimization (C++ / x86_64)

Matt Godbolt Fri, 26 Dec 2014 14:49:52 -0800

On Fri, Dec 26, 2014 at 4:26 PM, Andrew Haley <a...@redhat.com> wrote:
> On 26/12/14 20:32, Matt Godbolt wrote:
>> Is there a reason why (in principle) the volatile increment can't be
>> made into a single add? Clang and ICC both emit the same code for the
>> volatile and non-volatile case.
>
> Yes.  Volatiles use the "as if" rule, where every memory access is as
> written.  A volatile increment is defined as a load, an increment, and
> a store.


That makes sense to me from a logical point of view. My understanding,
though, is that the volatile keyword was mainly intended for working
with memory-mapped devices, where memory loads and stores must not be
elided. A read-modify-write such as "increment [addr]" still performs
both the load and the store, so it adheres to these constraints even
though it is a single instruction. I realise my understanding could be
wrong here!  If it is not, though, both clang and icc are taking a
short-cut that may put them into a non-compliant state.

> If you want a single atomic increment, atomics are what you
> should use.  If you want an increment to be written to memory, use a
> store barrier after the increment.

Thanks. I realise I was unclear in my original email. I'm really
looking for a way to say "do a non-lock-prefixed increment". An atomic
increment is too strong: on x86 it compiles to a lock-prefixed
instruction.  Doing a store barrier after the increment also appears
heavy-handed: while I wish for eventual consistency with memory, I do
not require immediate visibility. I do however need the compiler to
not move or elide my increment.

At the moment I think the best I can do is to use an inline assembly
version of the increment which prevents GCC from doing any
optimisation upon it. That seems rather ugly though, and if anyone has
any better suggestions I'd be very grateful.
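
For illustration, the inline assembly approach mentioned above might
look roughly like the sketch below. It assumes GCC or Clang extended
asm on x86-64; the names counter, increment_in_memory and
count_with_asm are hypothetical and not from the original code. On
other targets the sketch falls back to a volatile load and store. Note
that this only constrains the compiler: the increment is still not
atomic with respect to other threads.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter, standing in for num_done in the example.
static std::uint64_t counter = 0;

// Increment `value` in memory without a lock prefix.
inline void increment_in_memory(std::uint64_t& value) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // "+m" declares a memory operand that is both read and written.
    // `asm volatile` keeps the compiler from deleting the statement or
    // folding several increments into one addition.
    asm volatile("incq %0" : "+m"(value) : : "cc");
#else
    // Assumed fallback for other targets: a volatile load and store.
    volatile std::uint64_t& v = value;
    v = v + 1;
#endif
}

// Resets the counter, increments it num_work times, returns the total.
std::uint64_t count_with_asm(int num_work) {
    assert(num_work >= 0);
    counter = 0;
    for (int i = 0; i < num_work; ++i) {
        increment_in_memory(counter);
    }
    return counter;
}
```

The "+m" constraint tells the compiler the memory is modified, so a
later ordinary read of counter in the same thread sees the updated
value.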

To give a concrete example:

#include <cinttypes>  // PRIu64
#include <cstdint>
#include <cstdio>
#include <unistd.h>   // sleep

uint64_t num_done = 0;
void process_work() { /* does something somewhat expensive */ }
void worker_thread(int num_work) {
  for (int i = 0; i < num_work; ++i) {
    process_work();
    num_done++;  // ideally a relaxed atomic increment here
  }
}

void reporting_thread() {
  while (true) {
    sleep(60);
    // num_done is 64 bits wide, so %d would be the wrong format specifier
    printf("worker has done %" PRIu64 "\n", num_done);  // ideally a relaxed read here
  }
}


In the non-atomic case above, no locked instructions are used. Given
enough information about what process_work() does, the compiler can
realise that num_done can be added to outside of the loop (num_done +=
num_work); which is the part I'd like to avoid.  By making the counter
atomic and using relaxed ordering, I get this guarantee but at the
cost of a "lock addl".

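One possible middle ground, sketched here under the assumption that
exactly one thread ever writes the counter, is to split the increment
into a relaxed atomic load followed by a relaxed atomic store instead
of a single fetch_add. On x86-64, GCC and Clang typically compile
relaxed loads and stores to plain mov instructions with no lock
prefix, and in practice they do not currently merge atomic accesses
across loop iterations, although the standard does not strictly forbid
that. The names num_done_relaxed, bump_single_writer, read_progress
and run_worker are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical atomic replacement for num_done.
std::atomic<std::uint64_t> num_done_relaxed{0};

// Single-writer increment: a relaxed load, then a relaxed store.
// This is NOT an atomic read-modify-write, so it is only correct
// when no other thread writes the counter.
inline void bump_single_writer() {
    const std::uint64_t current =
        num_done_relaxed.load(std::memory_order_relaxed);
    num_done_relaxed.store(current + 1, std::memory_order_relaxed);
}

// What a reporting thread would call: a relaxed read of the counter.
inline std::uint64_t read_progress() {
    return num_done_relaxed.load(std::memory_order_relaxed);
}

// Stand-in for the worker loop; returns the counter value afterwards.
std::uint64_t run_worker(int num_work) {
    assert(num_work >= 0);
    for (int i = 0; i < num_work; ++i) {
        // process_work() would run here.
        bump_single_writer();
    }
    return read_progress();
}
```

With more than one writer, increments could be lost, and fetch_add
(with its lock prefix on x86) would be needed again.
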
Thanks in advance for any ideas,

Matt
