I have a multi-threaded C++ application that relies on fine-grained parallelism and makes extensive use of interlocked instructions and memory barriers for inter-thread synchronization and communication.
I currently use LFENCE/SFENCE/MFENCE instructions for memory barriers (on processors that have them, falling back to LOCK-OR otherwise). I would like to relax these barriers from LFENCE/SFENCE/MFENCE to LOCK-OR/no-op/LOCK-OR respectively, provided I can verify that neither the compiler-generated code nor the runtime library uses non-temporal SSE/3DNow instructions without properly bracketing them with memory fences.

The reasons for the desired relaxation are:

1) LOCK-OR is believed to take fewer cycles than the xFENCE instructions (this appears to be a common belief, though I have not benchmarked them myself yet).
2) More importantly, memory barriers are usually coupled with interlocked instructions that access the data used for inter-thread synchronization anyway. On x86/x64 the interlocked instruction operating on the primary datum already provides a memory barrier, so the extra xFENCE is redundant, except for the possibility of unbracketed non-temporal instructions in the CPU instruction stream.

The issue is significant because the application can make thousands of inter-thread transactions per second.

The question is whether there is any GCC/runtime policy on non-temporal SSE/3DNow instructions. Specifically, can an application expect that:

1) Compiler-generated code will not contain blocks of non-temporal instructions that are not bracketed by xFENCE on both sides?
2) Normal application code that may engage in inter-thread communication will not be embedded inside such blocks?
3) The run-time library will not use blocks of non-temporal instructions that are not bracketed by xFENCE on both sides?

Or is it a completely gray area?
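
For concreteness, here is a minimal sketch of the relaxed mapping I have in mind, assuming GCC (or Clang) inline assembly on x86/x64. All helper names (`full_barrier_lock_or`, `load_barrier_relaxed`, `store_barrier_relaxed`, `stream_fill`, `demo_handoff`) are hypothetical and exist only for illustration; they are not part of GCC or any library. The sketch also shows the bracketing I would expect from any code that does use non-temporal stores: it issues its own SFENCE before returning to code that relies on the relaxed barriers.

```cpp
#include <atomic>
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32, _mm_sfence
#endif

// Full barrier via a locked read-modify-write on the top of the stack.
// ORing 0 leaves the stored value unchanged; only the LOCK semantics matter.
inline void full_barrier_lock_or() {
#if defined(__x86_64__)
    __asm__ __volatile__("lock; orl $0, (%%rsp)" ::: "memory", "cc");
#elif defined(__i386__)
    __asm__ __volatile__("lock; orl $0, (%%esp)" ::: "memory", "cc");
#else
    // Not x86: fall back to a portable fence so the sketch still builds.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Relaxed replacement for LFENCE, following the mapping above (LOCK-OR).
inline void load_barrier_relaxed() { full_barrier_lock_or(); }

// Relaxed replacement for SFENCE: no CPU instruction, only a compiler
// barrier. ASSUMPTION: no unfenced non-temporal stores are in flight.
inline void store_barrier_relaxed() {
    __asm__ __volatile__("" ::: "memory");
}

// A block of non-temporal stores that fences itself before returning,
// i.e. the "bracketing" the relaxed barriers depend on.
inline void stream_fill(int* dst, int value, int n) {
    assert(dst != nullptr && n >= 0);
#if defined(__SSE2__)
    for (int i = 0; i < n; ++i) _mm_stream_si32(dst + i, value);
    _mm_sfence();  // make the streamed stores ordered before later stores
#else
    for (int i = 0; i < n; ++i) dst[i] = value;  // plain stores as fallback
#endif
}

// Publish data with an interlocked operation and no additional xFENCE.
// Single-threaded here, purely to show the call pattern.
inline int demo_handoff() {
    static int payload[4];
    static std::atomic<int> ready{0};
    stream_fill(payload, 42, 4);    // self-fenced non-temporal block
    store_barrier_relaxed();        // no-op at the CPU level
    ready.exchange(1);              // interlocked op on the primary datum
    load_barrier_relaxed();
    return ready.load() ? payload[3] : -1;
}
```

If some code path executed `_mm_stream_si32` (or an equivalent MOVNT instruction emitted by the compiler or the runtime library) without the trailing `_mm_sfence()`, the no-op `store_barrier_relaxed()` would be the place where this scheme could break, which is exactly why I am asking about the policy.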