RE: Implementing C++1x and C1x atomics (really an aside on SFENCE)

Boehm, Hans Wed, 09 Sep 2009 16:42:20 -0700

> From: Lawrence Crowl [mailto:[email protected]] 
> 
> On 8/20/09, Boehm, Hans <[email protected]> wrote:
> > > -----Original Message-----
> > > From: Lawrence Crowl [mailto:[email protected]]
> > >
> > > The problem is that gcc does support 80386.  It also supports other
> > > processors that have less-than-complete support for concurrency.
> > > Just in the x86 line, we get some additional capability in many
> > > new layers.
> > >
> > >   8086        LOCK XCHG
> > >   80486       CMPXCHG XADD
> > >   Pentium     CMPXCHG8B
> > >   SSE         SFENCE
> >
> > Aside to an interesting discussion:
> >
> > I believe the current conclusion is that SFENCE should be ignored, 
> > except for library or compiler-generated code that uses 
> > non-temporal/coalescing stores, which I believe are also a recent 
> > addition.  Normal stores are ordered anyway, so it's not needed.
> > Thus you are faced with a choice of either (a) implementing fences on
> > the assumption that ordinary code may contain non-temporal stores, or
> > (b) making sure that non-temporal stores are always surrounded by the
> > appropriate fences.  This is really an important ABI issue, but it's
> > something that I believe no ABI currently specifies.  Our conclusion
> > in earlier discussions among a different group of people was that (b)
> > made more sense, since non-temporal stores of various kinds seemed to
> > be largely confined to a few library routines.
> 
> Hm.  I would expect that given the C++0x memory model, compilers could
> be much more aggressive about using non-temporal stores, potentially
> improving performance substantially.  That is, it may be better to
> accept a slightly less efficient ABI for today's compilers to gain a
> more efficient ABI for tomorrow's compilers.
> 
> > It would be really nice if everyone somehow managed to agree on this.
> > Inconsistency here, probably even between Windows and Linux, seems
> > likely to result in really subtle bugs.
> >
> > Note that this also affects correctness of spinlock implementations,
> > not just atomics.  A simple store to release a lock doesn't work if
> > the critical section may contain unfenced non-temporal stores.
> 
> Yes, but the spinning acquire doesn't require the fence, only the
> release.  So, is this additional instruction a performance problem?
> 
I haven't looked at this terribly systematically.  I do know that in Pentium 4 
days, sfence was tremendously expensive (basically equivalent to mfence or 
cmpxchg, i.e. 100+ cycles), even in contexts in which it was a no-op.  Thus ABI 
convention (a) roughly doubles the (already very high) cost of an uncontended 
spin-lock on a Pentium 4.  I suspect that got better on later implementations, 
but I'm not sure by how much.


I think the only non-temporal stores on x86 are vector instructions.  I would 
guess that for many applications neither these nor spin-lock times matter a 
lot, and for most of the rest, these vector instructions won't make up for the 
cost of doubling spin-lock execution times.  If you do manage to automatically 
generate non-temporal stores at all, you will usually generate a bunch of them 
between potential synchronization operations, so that you can amortize the 
sfence.  As I recall, we did look briefly during earlier discussions, and 
didn't find them used much even in hand-crafted libc code.
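
To make the amortization point concrete, here is a minimal sketch, assuming 
an SSE2-capable x86 target and a compiler that provides <emmintrin.h>.  The 
helper name stream_fill is hypothetical and is not taken from any library.  
It issues a batch of non-temporal stores and then a single sfence, so the 
fence cost is paid once per batch rather than once per store.  On targets 
without SSE2 the sketch falls back to ordinary stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32; also provides _mm_sfence
#endif

// Hypothetical helper (illustration only): fill dst[0..n) with `value`
// using non-temporal stores, then fence once at the end of the batch.
inline void stream_fill(std::int32_t* dst, std::size_t n, std::int32_t value) {
    assert(dst != nullptr || n == 0);
#if defined(__SSE2__)
    for (std::size_t i = 0; i < n; ++i) {
        // Non-temporal (streaming) store: weakly ordered with respect to
        // later ordinary stores until a fence is executed.
        _mm_stream_si32(reinterpret_cast<int*>(dst + i), value);
    }
    // One sfence amortized over all n stores.  After this, a later
    // ordinary store (e.g. a lock release) cannot become visible first.
    _mm_sfence();
#else
    // Assumed fallback for non-SSE2 targets: ordinary stores, no fence.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = value;
    }
#endif
}
```

Under convention (b), a routine like this would be responsible for its own 
fence, and callers could then release a lock with a plain store.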

But this is all hand-waving and guessing.  Certainly real measurements would be 
much better.

The most important issue of course is that we need to stick to one convention 
or the other.  Currently a lot of code seems to assume that an x86 spin lock 
can be released with a simple store, so invalidating that would be tricky, 
especially since sfence was a fairly recent introduction.
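
As a rough illustration of what the two conventions would mean for a lock 
implementation, here is a sketch using C++0x-style std::atomic.  The SpinLock 
type and both unlock method names are hypothetical, and the sfence is only 
emitted when the compiler defines __SSE2__.  This is one possible reading of 
conventions (a) and (b), not a statement of what any existing ABI or library 
does.

```cpp
#include <atomic>
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>  // provides _mm_sfence via <xmmintrin.h>
#endif

// Hypothetical spin lock, for illustration only.
struct SpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        // Spinning acquire: needs no sfence under either convention.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy-wait
        }
    }

    // Convention (a): assume the critical section may contain unfenced
    // non-temporal stores, so every release pays for an sfence.
    void unlock_convention_a() {
        assert(locked.load(std::memory_order_relaxed));
#if defined(__SSE2__)
        _mm_sfence();
#endif
        locked.store(false, std::memory_order_release);
    }

    // Convention (b): code that issues non-temporal stores fences them
    // itself, so the release is a simple store (a plain mov on x86).
    void unlock_convention_b() {
        assert(locked.load(std::memory_order_relaxed));
        locked.store(false, std::memory_order_release);
    }
};
```

Existing code that releases a lock with a simple store corresponds to 
unlock_convention_b, which is only correct if every non-temporal store in the 
critical section has already been fenced.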

Hans
