On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug <f...@phlo.org> wrote:
> On Jun 23, 2011, at 17:42 , Robert Haas wrote:
>> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <f...@phlo.org> wrote:
>>> On Jun 12, 2011, at 23:39 , Robert Haas wrote:
>>>> So, the majority (60%) of the excess spinning appears to be due to
>>>> SInvalReadLock. A good chunk is due to ProcArrayLock (25%).
>>>
>>> Hm, sizeof(LWLock) is 24 on x86-64, making sizeof(LWLockPadded) 32.
>>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>>> so I guess that two adjacent LWLocks currently share one cache line.
>>>
>>> Currently, ProcArrayLock has index 4 while SInvalReadLock has
>>> index 5, so if I'm not mistaken, exactly the two locks you saw
>>> the largest contention on are on the same cache line...
>>>
>>> It might make sense to try and see if these numbers change if you
>>> either make LWLockPadded 64 bytes or arrange the locks differently...
>>
>> I fooled around with this a while back and saw no benefit. It's
>> possible a more careful test would turn up something, but I think the
>> only real way forward here is going to be to eliminate some of that
>> locking altogether.
>
> It seems hard to believe that there isn't some effect of two locks
> sharing a cache line. There are architectures (even some of the
> Intel architectures, I believe) where cache lines are 32 bytes,
> though - so measuring this certainly isn't easy. Also, I'm absolutely
> no expert on this, so it might very well be that I'm missing something
> fundamental.
>
>> SInvalReadLock is a good example of a lock which seems barely to be
>> necessary at all. Except on extraordinarily rare occasions, it is
>> only ever taken in share mode.
>
> This is the case we should optimize for, I think. Currently, there's
> contention even between two SHARE lockers of an LWLock because our
> LWLock implementation uses a spinlock underneath.
> I've googled around a bit, and found this:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
>
> It's a userspace rwlock implementation based on atomic instructions
> and futexes (for blocking), written by Linus Torvalds as a response
> to the following thread:
>
> http://lwn.net/Articles/284741/
>
> It seems to require only a single atomic increment instruction to
> acquire a share lock in the uncontested case, and a single
> compare-and-exchange instruction to acquire an exclusive lock (again
> in the uncontested case).
>
> The code doesn't seem to have a license attached, and seems to have
> been written over a weekend and never touched since, so we might
> not want (or be able) to use this directly. But it'd still be
> interesting to see if replacing our LWLocks with this implementation
> affects throughput.
>
>> Only SICleanupQueue() needs to lock
>> out readers, and in the (probably common) case where all backends are
>> reasonably close to being caught up, it wouldn't even matter if the
>> determination of minMsgNum were a little bit fuzzy. I've been
>> straining my brain trying to figure out whether there's some way to
>> rewrite this logic so that it can run concurrently with readers; then
>> we could get rid of SInvalReadLock altogether. I have a feeling if I
>> were 25% smarter I could do it, but I'm not.
>
> This sounds like something which could be done with RCU (Read-Copy-Update)
> in a lockless way. The idea is to copy the data structure (or parts
> of it) before updating it, and to "commit" the change afterwards by
> adjusting a single pointer to point to the new structure (kinda like
> we do atomic commits by an atomic update of one bit in the clog).
>
> The hard part is usually to figure out when it's OK to clean up or
> reuse the old version of the data structure - for that, you need to
> know that all previous readers have completed.
>
> Here's a rough description of how that could work for the invalidation
> queue.
> We'd have two queue structures in shared memory, each containing
> a version number. We'd also have a "current" pointer pointing to the
> most recent one. Each reader would publish the version number of the
> queue he's about to read (the "current" one) in shared memory
> (and issue a write barrier). SICleanupQueue() would check whether the
> non-current queue structure is unused by comparing its version to the
> published versions of all backends (after issuing a read barrier, and
> after taking a lock that prevents concurrent SICleanupQueue() runs, of
> course). If there still are users of that version, SICleanupQueue()
> out-waits them. Afterwards, it simply copies the current structure
> over the old one, does the necessary cleanups, increments the version
> number to be larger than that of the "current" structure,
> and *then* updates the "current" pointer.
>
>> So my fallback position is to try to find some way to optimize the
>> common case where there are no messages to be read. We're already
>> allowing readers and writers to run in parallel, so it must not be
>> important for a reader to see a message that a writer is in the
>> process of writing, but has not yet finished writing. I believe the
>> idea here is that sinval messages should be emitted before releasing
>> the related locks, so if we've successfully acquired a lock on an
>> object, then any pertinent sinval messages must already be completely
>> written. What I'm thinking we could do is just keep a global counter
>> indicating how many messages have been written, and each backend could
>> keep a counter indicating the highest-numbered message it has so far
>> read. When a message is written, the global counter is bumped. When
>> a backend goes to AcceptInvalidationMessages(), it compares the global
>> counter to its own counter, and if they're the same, then it doesn't
>> bother taking SInvalReadLock and just exits quickly.
>
> Sounds sensible.
> I do fear, however, that if we start to micro-optimize
> around every single LWLock we're in for quite a maintenance burden in
> the future. Even if each single modification is relatively simple, the
> complexity still adds up. So I still believe that it'd be better to
> start with optimizing LWLocks in general, and to only touch the LWLocks
> which are still hot spots afterwards.
>
>> There is an obvious (potential) memory-ordering problem here. If it's
>> possible for a backend doing AcceptInvalidationMessages() to see a
>> stale version of the counter, then it might fail to read messages from
>> the queue that it really ought to have seen. Protecting the global
>> counter with a spinlock defeats the purpose of doing any of this in
>> the first place, because the whole problem is too many people fighting
>> over the spinlock protecting SInvalReadLock. It should be sufficient,
>> though, for the writing backend to execute a memory-barrier operation
>> just after bumping the global counter and for the reading backend to
>> execute one just before. As it happens, LWLockRelease() is such a
>> barrier, since it must acquire and release a spinlock, so we should be
>> able to just bump the counter right before releasing the LWLock and
>> call it good. On the reading side, the immediately-preceding lock
>> acquisition seems like it ought to be sufficient - surely it can't be
>> possible to acquire a heavyweight lock on an object without seeing
>> memory updates that some other process made before releasing the same
>> heavyweight lock.
>
> If we start doing lockless algorithms, I really think we should add
> explicit barrier primitives.
> We could start out with something simple such as
>
> #define MemoryBarrierWrite \
>     do { slock_t l; S_INIT_LOCK(&l); S_UNLOCK(&l); } while (0)
> #define MemoryBarrierRead \
>     do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); } while (0)
> #define MemoryBarrierReadWrite \
>     do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0)
>
> But yeah, it seems that in the case of the sinval queue, the surrounding
> LWLocks actually provide the necessary barriers.
>
>> A possible objection here is that a 32-bit counter might wrap around,
>> giving a false negative, and a read from a 64-bit counter might not be
>> atomic. In practice, the chances of a problem here seem remote, but I
>> think we can guard against it easily enough. We can put each
>> backend's counter of messages read into shared memory, protected by a
>> per-backend lock (probably the same LWLock the fast relation lock
>> patch already introduces). When a backend compares its counter to the
>> global counter, it does so while holding this lock. When
>> SICleanupQueue() resets a backend, it grabs the lock and sets that
>> backend's counter to a special value (like 0) that we guarantee will
>> never match the global counter (which we can cause to wrap from 2^32-1
>> to 1). So then AcceptInvalidationMessages() - in the common case
>> where there are no invalidation messages to process - will only need
>> the per-backend LWLock rather than fighting over SInvalLock.
>
> OTOH, don't all architectures that are interesting targets of
> performance tuning provide atomic access to 64-bit values? Even i386
> has an 8-byte compare-and-exchange instruction, I think...
yup, 'lock cmpxchg8b'.

see http://www.niallryan.com/node/137 for a way to do it via the fpu
without a loop. the 'lock' is a bus lock.

merlin