> From: Lawrence Crowl [mailto:cr...@google.com]
>
> On 8/20/09, Boehm, Hans <hans.bo...@hp.com> wrote:
> > > -----Original Message-----
> > > From: Lawrence Crowl [mailto:cr...@google.com]
> > > The problem is that gcc does support 80386. It also supports other
> > > processors that have less-than-complete support for concurrency.
> > > Just in the x86 line, we get some additional capability in many
> > > new layers.
> > >
> > > 8086     LOCK XCHG
> > > 80486    CMPXCHG XADD
> > > Pentium  CMPXCHG8B
> > > SSE      SFENCE
> >
> > Aside to an interesting discussion:
> >
> > I believe the current conclusion is that SFENCE should be ignored,
> > except for library or compiler-generated code that uses
> > non-temporal/coalescing stores, which I believe are also a recent
> > addition. Normal stores are ordered anyway, so it's not needed.
> > Thus you are faced with a choice of either (a) implementing fences on
> > the assumption that ordinary code may contain non-temporal stores, or
> > (b) making sure that non-temporal stores are always surrounded by the
> > appropriate fences. This is really an important ABI issue, but it's
> > something that I believe no ABI currently specifies. Our conclusion
> > in earlier discussions among a different group of people was that (b)
> > made more sense, since non-temporal stores of various kinds seemed to
> > be largely confined to a few library routines.
>
> Hm. I would expect that given the C++0x memory model,
> compilers could be much more aggressive about using
> non-temporal stores, potentially improving performance
> substantially. That is, it may be better to accept a
> slightly less efficient ABI for today's compilers to gain a
> more efficient ABI for tomorrow's compilers.
>
> > It would be really nice if everyone somehow managed to agree on this.
> > Inconsistency here, probably even between Windows and Linux, seems
> > likely to result in really subtle bugs.
> >
> > Note that this also affects correctness of spinlock implementations,
> > not just atomics. A simple store to release a lock doesn't work if
> > the critical section may contain unfenced non-temporal stores.
>
> Yes, but the spinning acquire doesn't require the fence, only
> the release. So, is this additional instruction a
> performance problem?

I haven't looked at this terribly systematically. I do know that in Pentium 4 days, sfence was tremendously expensive (basically equivalent to mfence or cmpxchg, i.e. 100+ cycles), even in contexts in which it was a no-op. Thus ABI convention (a) roughly doubles the (already very high) cost of an uncontended spin-lock on a Pentium 4. I suspect that got better on later implementations, but I'm not sure by how much.
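
To make the two conventions concrete, here is a rough C++ sketch. It is an illustration only, not a measured or recommended implementation. It assumes an x86 target with SSE2 and the compiler intrinsics _mm_stream_si128, _mm_set1_epi32 and _mm_sfence; the names SpinLock, fill_nontemporal, nt_store_fence and demo are made up for this example. Under (b) the routine that issues non-temporal stores fences them itself, so the lock release stays a plain store. Under (a) every release has to pay for the sfence.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si128, _mm_set1_epi32 (SSE2)
#include <xmmintrin.h>  // _mm_sfence (SSE)
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0  // assumption: other targets fall back to plain stores
#endif

// Hypothetical helper: orders earlier non-temporal stores before later stores.
inline void nt_store_fence() {
#if HAVE_NT_STORES
    _mm_sfence();
#endif
}

// Convention (b): code that issues non-temporal stores fences them itself.
// dst must be 16-byte aligned and n must be a multiple of 4.
inline void fill_nontemporal(int* dst, std::size_t n, int value) {
#if HAVE_NT_STORES
    const __m128i v = _mm_set1_epi32(value);
    for (std::size_t i = 0; i < n; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
#else
    for (std::size_t i = 0; i < n; ++i) dst[i] = value;
#endif
    nt_store_fence();  // a single fence covers all the stores above
}

class SpinLock {
public:
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }
    void lock() {
        while (!try_lock()) { /* spin */ }
    }
    // Convention (b): the release is a plain store, with no extra fence.
    void unlock() { locked_.store(false, std::memory_order_release); }
    // Convention (a): the critical section may contain unfenced
    // non-temporal stores, so the release must fence them first.
    void unlock_with_sfence() {
        nt_store_fence();
        locked_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> locked_{false};
};

// Single-threaded walk through both release paths.
inline bool demo() {
    alignas(16) int buf[8] = {};
    SpinLock lock;
    lock.lock();
    fill_nontemporal(buf, 8, 7);  // fenced internally, per convention (b)
    lock.unlock();
    bool ok = lock.try_lock();    // lock was released, so this succeeds
    assert(ok);
    lock.unlock_with_sfence();    // what convention (a) would require
    for (int x : buf) ok = ok && (x == 7);
    return ok;
}
```

The only difference between the two release paths is the extra sfence, which is exactly the per-unlock cost in question.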

I think the only non-temporal stores on x86 are vector instructions. I would guess that for many applications neither these nor spin-lock times matter a lot, and for most of the rest, these vector instructions won't make up for the cost of doubling spin-lock execution times. If you do manage to automatically generate non-temporal stores at all, you will usually generate a bunch of them between potential synchronization operations, so that you can amortize the sfence. As I recall, we did look briefly during earlier discussions, and didn't find them used much even in hand-crafted libc code.

But this is all hand-waving and guessing. Certainly real measurements would be much better.

The most important issue of course is that we need to stick to one convention or the other. Currently a lot of code seems to assume that an x86 spin lock can be released with a simple store, so invalidating that would be tricky, especially since sfence was a fairly recent introduction.

Hans