On 26/12/14 22:49, Matt Godbolt wrote:
> On Fri, Dec 26, 2014 at 4:26 PM, Andrew Haley <a...@redhat.com> wrote:
>> On 26/12/14 20:32, Matt Godbolt wrote:
>>> Is there a reason why (in principle) the volatile increment can't be
>>> made into a single add? Clang and ICC both emit the same code for the
>>> volatile and non-volatile case.
>>
>> Yes. Volatiles use the "as if" rule, where every memory access is as
>> written. A volatile increment is defined as a load, an increment, and
>> a store.
>
> That makes sense to me from a logical point of view. My
> understanding though is the volatile keyword was mainly used when
> working with memory-mapped devices, where memory loads and stores
> could not be elided. A single-instruction load-modify-write like
> "increment [addr]" adheres to these constraints even though it is a
> single instruction. I realise my understanding could be wrong here!
> If not though, both clang and icc are taking a short-cut that may
> put them into a non-compliant state.

It's hard to be certain. The language used by the standard is very
unhelpful: it requires all accesses to be as written, but does not
define exactly what constitutes an access.

>> If you want a single atomic increment, atomics are what you
>> should use. If you want an increment to be written to memory, use a
>> store barrier after the increment.
>
> Thanks. I realise I was unclear in my original email. I'm really
> looking for a way to say "do a non-lock-prefixed increment".

Why?

> Atomics are too strong and enforce a bus lock. Doing a store
> barrier after the increment also appears heavy-handed: while I wish
> for eventual consistency with memory, I do not require it. I do
> however need the compiler to not move or elide my increment.

You could just use a compiler barrier:

    asm volatile("");

But this is good only for x86 and a few others. Everyone else needs a
real store barrier.

> At the moment I think the best I can do is to use an inline assembly
> version of the increment which prevents GCC from doing any
> optimisation upon it. That seems rather ugly though, and if anyone has
> any better suggestions I'd be very grateful.

Well, that's the problem: do you want a barrier or not? With no
barrier there is no guarantee that the data will ever be written to
memory. Do you only care about x86 processors?

> To give a concrete example:
>
> uint64_t num_done = 0;
> void process_work() { /* does something somewhat expensive */ }
> void worker_thread(int num_work) {
>   for (int i = 0; i < num_work; ++i) {
>     process_work();
>     num_done++;  // ideally a relaxed atomic increment here
>   }
> }
>
> void reporting_thread() {
>   while (true) {
>     sleep(60);
>     // ideally a relaxed read here
>     printf("worker has done %llu\n", (unsigned long long)num_done);
>   }
> }
>
>
> In the non-atomic case above, no locked instructions are used.
> Given enough information about what process_work() does, the compiler
> can realise that num_done can be added to outside of the loop
> (num_done += num_work); which is the part I'd like to avoid. By making
> the int atomic and using relaxed, I get this guarantee but at the cost
> of a "lock addl".

Ok, I get that, but not why. If you care about a particular x86
instruction, you can use it in an inline asm. I'm not at all sure what
you want, really.

Andrew.
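
For reference, here is a minimal sketch of the compiler-barrier approach
applied to the worker loop quoted above. It assumes GCC or Clang (the asm
syntax is an extension, not standard C). The COMPILER_BARRIER name and the
"memory" clobber are assumptions added for this sketch; the clobber is what
tells the compiler that memory may have been read or changed, so it cannot
hoist the increments out of the loop or fold them into one add. It emits no
instruction and is not a hardware store barrier, so the caveat about
non-x86 targets still applies.

```c
#include <stdint.h>

/* Hypothetical helper name. GCC/Clang extension: an empty asm statement
   that generates no instruction. The "memory" clobber is an assumption of
   this sketch; it makes the compiler treat memory as read and written
   here, so num_done must be stored before this point and reloaded after. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

uint64_t num_done = 0;

/* Stand-in for the real, somewhat expensive work. */
static void process_work(void) {}

void worker_thread(int num_work) {
    for (int i = 0; i < num_work; ++i) {
        process_work();
        num_done++;          /* plain increment, no lock prefix requested */
        COMPILER_BARRIER();  /* keeps one store per iteration in the output */
    }
}
```

On x86 this will typically compile to an ordinary, non-locked add to memory
per iteration, but nothing here guarantees a particular instruction, and the
reporting thread's read is still a data race as far as the language is
concerned.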