Richard Henderson wrote:

To keep all this in perspective, folks should remember that atomic
operations are *slow*.  Very very slow.  Orders of magnitude slower
than function calls.  Seriously.  Taking p4 as the extreme example,
one can expect a null function call in around 10 cycles, but a locked
memory operation to take 1000.  Usually things aren't that bad, but
I believe some poor design decisions were made for p4 here.  But even
on a platform without such problems you can expect a factor of 30
difference.

Apologies in advance if the following is not relevant...

Even on a P4, inlining may enable compiler optimizations. One case is
when the compiler can see that the return value of __sync_fetch_and_or
(for instance) isn't used. It can then emit a single wait-free "lock or"
instead of a "lock cmpxchg" loop (MSVC 8 does this for _InterlockedOr).
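
A minimal sketch of the two situations, assuming GCC's __sync_* builtins
on x86. The function names set_flag and test_and_set_flag are made up for
illustration, and the code a given compiler actually generates may differ
from what the comments suggest:

```c
#include <stdint.h>

/* Result discarded: the old value is never needed, so a compiler that
   can see this (for example after inlining) is free to emit a single
   "lock or" instruction. */
void set_flag(uint32_t *word, uint32_t mask)
{
    (void)__sync_fetch_and_or(word, mask);
}

/* Result used: the old value must be returned, and x86 has no
   instruction that ORs and fetches in one step, so this typically
   becomes a "lock cmpxchg" retry loop. */
int test_and_set_flag(uint32_t *word, uint32_t mask)
{
    uint32_t old = __sync_fetch_and_or(word, mask);
    return (old & mask) != 0;
}
```

If the builtin is hidden behind an out-of-line library call, the compiler
cannot tell the first case from the second and has to assume the result
is needed.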
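
Another case is when inlining results in a sequence of K adjacent
__sync_fetch_and_add( &x, 1 ) operations. These can legally be replaced
with a single __sync_fetch_and_add( &x, K ), because an execution in
which no other thread observes the intermediate values is always a
permitted one.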
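
To make the merging concrete, here is a small sketch for K = 3, assuming
GCC's __sync_fetch_and_add. Both function names are hypothetical. The
second function is the hand-written form of what a compiler could
legally produce from the first; it is not a claim that any particular
compiler does so:

```c
/* Three adjacent increments, as they might appear after inlining a
   small "add reference" helper three times. Returns the value seen
   just before the third increment. */
long bump_three_times(long *x)
{
    __sync_fetch_and_add(x, 1);
    __sync_fetch_and_add(x, 1);
    return __sync_fetch_and_add(x, 1);
}

/* The merged form: one atomic add of 3. The value before the third
   increment is reconstructed as (old value + 2). */
long bump_once_by_three(long *x)
{
    return __sync_fetch_and_add(x, 3) + 2;
}
```

Single-threaded, the two functions return the same value and leave x in
the same state; the merged form pays for one locked operation instead of
three.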

Currently the __sync_* intrinsics seem to be fully locked (that is, each
one acts as a full memory barrier), but if acquire/release/unordered
variants are added, other platforms may also suffer from lack of
inlining. On a PowerPC an unordered atomic increment is pretty much the
same speed as an ordinary increment (when there is no contention).
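
As an illustration of what an unordered variant could look like, the
sketch below uses the __atomic_* builtins that GCC gained in a later
release (4.7), well after this discussion. They are used here only as a
stand-in for the hypothetical variants mentioned above, and the function
names are made up:

```c
/* Full-barrier increment: the behaviour the __sync_* builtins provide. */
long inc_full(long *counter)
{
    return __sync_fetch_and_add(counter, 1);
}

/* Unordered increment: still atomic, but it imposes no ordering on
   surrounding memory accesses, so on a platform such as PowerPC no
   extra barrier instructions are needed around the update. */
long inc_relaxed(long *counter)
{
    return __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
}
```

With an operation this cheap, the overhead of an out-of-line function
call would be a much larger fraction of the total cost, which is why
lack of inlining would hurt more here.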


