On Fri, Dec 26, 2014 at 5:19 PM, Andrew Haley <a...@redhat.com> wrote:
> On 26/12/14 22:49, Matt Godbolt wrote:
>> On Fri, Dec 26, 2014 at 4:26 PM, Andrew Haley <a...@redhat.com> wrote:
>>> On 26/12/14 20:32, Matt Godbolt wrote:
>> I realise my understanding could be wrong here!
>> If not though, both clang and icc are taking a short-cut that may
>> put them into a non-compliant state.
>
> It's hard to be certain.  The language used by the standard is very
> unhelpful: it requires all accesses to be as written, but does not
> define exactly what constitutes an access.

Thanks. My world is very x86-centric and so I find it hard to
understand why a single instruction's RMW is different from three
separate instructions; but I appreciate the standard is vague around
volatiles, and that atomics go some way towards more well-defined
semantics.

>> Thanks. I realise I was unclear in my original email. I'm really
>> looking for a way to say "do a non-lock-prefixed increment".
>
> Why?

Performance. The single-threaded writers do not need to use a lock
prefix: the atomicity of their read-add-write is guaranteed by my
knowing no other threads write to the value. Thus the bus lock they
take out unnecessarily slows down the instruction and potentially
causes extra coherency traffic. The order of stores (on x86) is
guaranteed and so, provided I take a relaxed view in the consumer,
there's not even a need for any other flush. The memory write will
necessarily "eventually" become visible to the reader. Within the
constraints of the architecture I'm working in, this is plenty for a
metric.

> You could just use a compiler barrier: asm volatile(""); But this is
> good only for x86 and a few others.

This may be all I need, but my worry is this will inhibit other valid
optimisations. I know that the "trick" used elsewhere as a barrier
(asm volatile("":::"memory");) has the effect of flushing enregistered
values to memory. Ideally this wouldn't be necessary. I'll be honest:
I don't know the semantics of an empty volatile asm(), but I'm not sure
how it could cause only the one write (metric++) to be emitted without
affecting other variables too.

> Everyone else needs a real store barrier.

This is certainly true if the writer needs to guarantee visibility to
other threads. But that's not the case for my use case.

> Well, that's the problem: do you want a barrier or not?  With no
> barrier there is no guarantee that the data will ever be written to
> memory.  Do you only care about x86 processors?

I appreciate your patience in understanding my case (given I'm not
explaining myself very well!) In this instance, yes, only x86
processors. I do not need an explicit ISA-level flush. I do need a
guarantee that the compiler cannot optimise the increment by
loop-invariant motion.

>> To give a concrete example:
[snip]
>> By making the int
>> atomic and using relaxed, I get this guarantee but at the cost of a
>> "lock addl".
>
> Ok, I get that, but not why.  If you care about a particular x86
> instruction, you can use it in an inline asm.  I'm not at all sure what
> you want, really.

I hope my other comments at least help to explain the why! It's not so
much about a particular instruction as about communicating to the
compiler that there's only one writer, and so the lock prefix is
unnecessary (for x86): the write of the read-modify-write will not race
with other writers (as none exist) and the write will eventually become
visible to other threads in strict memory order (as the x86
guarantees). This last stage I believe is consistent with a "relaxed"
model, with an optimisation that if no other writers exist, no bus lock
is required on the writer.

Again, thanks for the reply and the time taken thinking about the
issue, especially at this festive time of year!

Best regards, Matt
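
P.S. For concreteness, here is a minimal sketch of the single-writer
increment described above, written as a separate relaxed load and
relaxed store rather than a fetch_add. The names Metric, increment,
read, count_matches and self_check are hypothetical and not taken from
any earlier mail. The sketch assumes exactly one writer thread; with
more than one, increments could be lost. On x86, GCC and Clang
typically compile a relaxed store to a plain mov, so I would not expect
a lock prefix here, and current compilers do not in practice hoist
atomic stores out of loops, though I don't believe the standard
promises that.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical metric with exactly ONE writer thread (an assumption of
// this sketch; with two or more writers, increments could be lost).
struct Metric {
    std::atomic<std::uint64_t> value{0};

    // Writer: a separate relaxed load and relaxed store instead of
    // fetch_add, so no atomic read-modify-write is requested.
    void increment() {
        value.store(value.load(std::memory_order_relaxed) + 1,
                    std::memory_order_relaxed);
    }

    // Reader (any thread): relaxed load; may see a slightly stale count.
    std::uint64_t read() const {
        return value.load(std::memory_order_relaxed);
    }
};

// Hypothetical hot loop: bump the metric once per matching element.
std::uint64_t count_matches(const int* data, int n, int needle, Metric& m) {
    for (int i = 0; i < n; ++i) {
        if (data[i] == needle) m.increment();
    }
    return m.read();
}

// Single-threaded sanity check of the counting logic only; it says
// nothing about which instructions the compiler emits.
void self_check() {
    Metric m;
    const int data[] = {1, 2, 1, 3, 1};
    assert(count_matches(data, 5, 1, m) == 3);
}
```

Whether the generated code actually avoids the lock prefix is best
confirmed by inspecting the assembly for the compiler and flags in use.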