Richard Henderson wrote:
> To keep all this in perspective, folks should remember that atomic
> operations are *slow*. Very very slow. Orders of magnitude slower
> than function calls. Seriously. Taking p4 as the extreme example,
> one can expect a null function call in around 10 cycles, but a locked
> memory operation to take 1000. Usually things aren't that bad, but
> I believe some poor design decisions were made for p4 here. But even
> on a platform without such problems you can expect a factor of 30
> difference.

Apologies in advance if the following is not relevant...
Even on a P4, inlining may enable compiler optimizations. One case is when
the compiler can see that the return value of __sync_fetch_and_or (for
instance) isn't used. In that case it can emit a wait-free "lock or" instead
of a "lock cmpxchg" loop (MSVC 8 does this for _InterlockedOr.)
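
To make the two cases concrete, here is a minimal sketch assuming GCC's
__sync builtins on x86. The function names are made up for illustration, and
the instruction selection described in the comments is what a compiler could
do once it sees the call site, not a guarantee about any particular release.

```c
#include <assert.h>

/* Result discarded: once the builtin is inlined, the compiler can see
   that the old value is dead and may emit a single "lock or". */
void set_flag(unsigned *word, unsigned mask)
{
    (void)__sync_fetch_and_or(word, mask);
}

/* Result used: the old value has to be recovered, which on x86
   typically means a "lock cmpxchg" retry loop. */
unsigned test_and_set_flag(unsigned *word, unsigned mask)
{
    return __sync_fetch_and_or(word, mask) & mask;
}

/* Hypothetical self-check: both forms have the same effect on memory. */
void check_flags(void)
{
    unsigned w = 0x1;

    set_flag(&w, 0x4);
    assert(w == 0x5);

    assert(test_and_set_flag(&w, 0x2) == 0);    /* bit was clear */
    assert(test_and_set_flag(&w, 0x2) == 0x2);  /* bit already set */
    assert(w == 0x7);
}
```

If __sync_fetch_and_or were an out-of-line library call, the callee could not
know that set_flag ignores the result, so it would always have to pay for the
loop.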
Another case is when inlining results in a sequence of K adjacent
__sync_fetch_and_add( &x, 1 ) operations. These can legally be replaced with
a single __sync_fetch_and_add( &x, K ).
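
A small sketch of that transformation for K = 3, again with made-up function
names. The replacement is legal because no other thread is guaranteed to
observe the intermediate values, so an execution where all three increments
land back to back is always a permitted outcome.

```c
#include <assert.h>

/* What the code might look like after inlining: three adjacent
   atomic increments whose results are not used. */
void bump_three_times(int *x)
{
    __sync_fetch_and_add(x, 1);
    __sync_fetch_and_add(x, 1);
    __sync_fetch_and_add(x, 1);
}

/* What a compiler that sees all three could emit instead. */
void bump_by_three(int *x)
{
    __sync_fetch_and_add(x, 3);
}

/* Hypothetical self-check: the final value is the same either way. */
void check_bump(void)
{
    int a = 10, b = 10;

    bump_three_times(&a);
    bump_by_three(&b);
    assert(a == 13);
    assert(b == 13);
}
```

With an opaque function call per increment, the compiler has no way to merge
them, and the cost is three locked operations instead of one.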
Currently the __sync_* intrinsics seem to be fully locked, but if
acquire/release/unordered variants are added, other platforms may also
suffer from lack of inlining. On a PowerPC an unordered atomic increment is
pretty much the same speed as an ordinary increment (when there is no
contention.)
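
For illustration only, here is what such variants could look like. The sketch
uses C11 <stdatomic.h> memory orders as a stand-in, which is an assumption on
my part: the __sync_* intrinsics discussed here take no ordering argument, and
the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Fully ordered increment, comparable to what __sync_fetch_and_add
   provides today: fences are required on weakly ordered machines. */
int next_id_ordered(atomic_int *counter)
{
    return atomic_fetch_add_explicit(counter, 1, memory_order_seq_cst);
}

/* Unordered ("relaxed") increment: still atomic, but with no ordering
   against surrounding loads and stores, so no fences are needed. */
int next_id_relaxed(atomic_int *counter)
{
    return atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}

/* Hypothetical self-check: single-threaded, both return the old value. */
void check_ids(void)
{
    atomic_int c;

    atomic_init(&c, 0);
    assert(next_id_ordered(&c) == 0);
    assert(next_id_relaxed(&c) == 1);
    assert(atomic_load(&c) == 2);
}
```

The cheaper the operation itself becomes, the larger the share of the total
cost that a non-inlined call represents.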