On Jul 7, 2011, at 03:35 , Robert Haas wrote:
> On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <f...@phlo.org> wrote:
>>> On Jun 12, 2011, at 23:39 , Robert Haas wrote:
>>>> So, the majority (60%) of the excess spinning appears to be due to
>>>> SInvalReadLock. A good chunk are due to ProcArrayLock (25%).
>>>
>>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>>> so I guess that two adjacent LWLocks currently share one cache line.
>>>
>>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>>> index 5, so if I'm not mistaken exactly the two locks where you saw
>>> the largest contention on are on the same cache line...
>>>
>>> Might make sense to try and see if these numbers change if you
>>> either make LWLockPadded 64 bytes or arrange the locks differently...
>>
>> I fooled around with this a while back and saw no benefit. It's
>> possible a more careful test would turn up something, but I think the
>> only real way forward here is going to be to eliminate some of that
>> locking altogether.
>
> I also tried inserting an
> acquire-and-release cycle on MyProc->backendLock, which was just as
> good. That seems like a pretty encouraging result, since there appear
> to be several ways of reimplementing SIGetDataEntries() that would
> work with a per-backend lock rather than a global one.
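(For reference, the padding experiment suggested in the quote amounts to
roughly the following change to the LWLockPadded union in lwlock.c. This
is only a sketch; the union and LWLOCK_PADDED_SIZE are the ones from
lwlock.c, and the only change is bumping the padded size to the 64-byte
cache-line size mentioned above.)

/*
 * Pad each LWLock out to a full 64-byte cache line, so that two adjacent
 * locks - e.g. ProcArrayLock (index 4) and SInvalReadLock (index 5) - no
 * longer share a line.  Upstream this is
 *   #define LWLOCK_PADDED_SIZE  (sizeof(LWLock) <= 16 ? 16 : 32)
 */
#define LWLOCK_PADDED_SIZE  64

typedef union LWLockPadded
{
    LWLock      lock;           /* 24 bytes on x86-64 */
    char        pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;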
I did some research and found "TLRW: Return of the Read-Write Lock" by
D. Dice and N. Shavit. The PDF can be found here:
http://transact09.cs.washington.edu/19_paper.pdf

They describe a read-write lock called a "bytelock" with a set of "slots"
for shared lockers, where each shared locker only needs to increment its
slot, and thus doesn't interfere with other concurrent shared locking
attempts. The price is that, after incrementing its slot, a shared locker
must issue a fencing instruction and then verify that no writer has taken
the lock in the meantime. In their application, they distinguish between
"slotted" threads - those which own a designated slot - and "unslotted"
threads - those which operate on the normal shared counter.

I implemented that idea, but with a few modifications. First, I don't
distinguish between "slotted" and "unslotted" threads, but instead allow
multiple backends to share a slot. This means my implementation cannot use
a simple unlocked increment to update the slot, but instead uses a "locked
xadd". I also moved the slots out of the lock itself and into a separate
set of LWLockPart structures. Each such structure contains the counters
for one slot id and multiple locks. In effect, the result is an LWLock
with a partitioned shared counter. The partition a backend operates on for
shared locks is determined by its backend id.

I've added the implementation to the lock benchmarking tool at
https://github.com/fgp/lockbench
and have also pushed a patched version of postgres to
https://github.com/fgp/postgres/tree/lwlock_part
The number of shared counter partitions is currently 4, but can easily be
adjusted in lwlock.h. The code uses GCC's atomic fetch-and-add intrinsic
if available; otherwise it falls back to using a per-partition spin lock.
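To make the shared-acquire fast path a bit more concrete, here is a rough
sketch of the idea. This is not the actual code from the branch above: for
readability the per-partition counters live inside the lock struct rather
than in separate LWLockPart structures, the names are made up, and the
exclusive path, error handling and the usual queue/sleep logic are left
out.

#include <stdbool.h>

#define NUM_SHARED_PARTITIONS 4     /* adjustable, like in lwlock.h */

typedef struct PartitionedLWLock
{
    volatile char   exclusive;      /* # of exclusive holders (0 or 1) */
    /* shared-holder counts, split into per-backend partitions */
    volatile int    shared[NUM_SHARED_PARTITIONS];
} PartitionedLWLock;

/* The partition a backend operates on is derived from its backend id. */
#define SHARED_PARTITION(backend_id) ((backend_id) % NUM_SHARED_PARTITIONS)

/*
 * Shared-acquire fast path.  Returns true if the shared lock was taken;
 * false means a writer is around and the caller has to fall back to the
 * regular spinlock-protected slow path (queueing and sleeping).
 */
static bool
LWLockAcquireSharedFast(PartitionedLWLock *lock, int backend_id)
{
    int     part = SHARED_PARTITION(backend_id);

    /*
     * Announce ourselves by bumping our partition's counter.  Since
     * several backends may share a partition, this must be an atomic
     * ("locked xadd") increment, not a plain store.  The __sync builtin
     * also acts as the full fence the bytelock protocol requires between
     * the increment and the writer check below.
     */
    __sync_fetch_and_add(&lock->shared[part], 1);

    if (lock->exclusive)
    {
        /* A writer holds (or is acquiring) the lock - back out. */
        __sync_fetch_and_sub(&lock->shared[part], 1);
        return false;
    }

    return true;
}

/* Shared release: drop our count; a waiting writer will notice. */
static void
LWLockReleaseSharedFast(PartitionedLWLock *lock, int backend_id)
{
    __sync_fetch_and_sub(&lock->shared[SHARED_PARTITION(backend_id)], 1);
}

An exclusive locker would, still under the protection of the lock's
spinlock, set the exclusive flag and then wait for all partition counters
to drain to zero before proceeding.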
Benchmarking results look promising so far. This is the throughput I get,
in acquisition/release cycles per second, for the various LWLock
implementations on a 4-core machine. pg_lwlock_part,N,X is the partitioned
lock with N lock partitions, using LOCKED XADD if X == yes. The numbers
are normalized to the throughput in the single-worker case.

---------- Cycles / Second (Wall-Time) ---------------------
  1      2      4      8      16     worker
 wall   wall   wall   wall   wall    time

  1      1.9    3.9    3.9    3.9    none
  1      0.2    0.9    0.8    0.6    pg_lwlock_orig
  1      0.4    0.3    0.3    0.3    pg_lwlock_cas
  1      0.3    1.0    0.8    0.6    pg_lwlock_part,1,no
  1      0.4    0.4    0.4    0.4    pg_lwlock_part,1,yes
  1      2.0    0.4    0.4    0.3    pg_lwlock_part,2,no
  1      2.0    0.7    0.8    0.7    pg_lwlock_part,2,yes
  1      2.0    4.1    2.1    1.0    pg_lwlock_part,4,no
  1      2.1    3.8    1.8    2.2    pg_lwlock_part,4,yes

These numbers show that as long as the number of workers does not exceed
the number of lock partitions, throughput increases approximately
linearly. It drops off afterwards, but I think that's to be expected - the
test executes LWLockAcquire/Release in a tight loop, so once there's any
kind of contention (even just cache-line bouncing), performance drops.
This is also why the original LWLock beats CAS and the partitioned lock
with fewer than 4 partitions here - effectively, at least half of the
workers are sleeping at any given time, as the following user vs. wall
time numbers show.

---------- User-Time / Wall-Time ----------------------------
  1      2      4      8      16     worker

 1.0    1.9    3.9    3.9    3.9    none
 1.0    1.9    1.1    1.3    1.8    pg_lwlock_orig
 1.0    2.0    4.0    4.0    4.0    pg_lwlock_cas
 1.0    1.9    1.1    1.4    1.8    pg_lwlock_part,1,no
 1.0    2.0    4.0    3.9    4.0    pg_lwlock_part,1,yes
 1.0    2.0    3.6    3.8    3.6    pg_lwlock_part,2,no
 1.0    2.0    3.8    3.9    3.9    pg_lwlock_part,2,yes
 1.0    2.0    4.1    4.1    3.9    pg_lwlock_part,4,no
 1.0    2.0    4.0    3.9    3.9    pg_lwlock_part,4,yes

I lack the hardware to produce any meaningful benchmark results with
pgbench for this, so, again, it would be very cool if someone (Robert? ;-)
who has access to that kind of hardware could give the patched version
from github a spin with pgbench. The patched version currently uses 4
shared counter partitions, and should use LOCKED XADD if compiled with
GCC >= 4.1 (or ICC).

> I'm wondering if the problem may be
> not so much that we have continuous spinlock contention, but rather
> than every once in a while a process gets time-sliced out while it
> holds a spinlock. If it's an important spinlock (like the one
> protecting SInvalReadLock), the system will quickly evolve into a
> state where every single processor is doing nothing but trying to get
> that spinlock. Even after the hapless lock-holder gets to run again
> and lets go of the lock, you have a whole pile of other backends who
> are sitting there firing of lock xchgb in a tight loop, and they can
> only get it one at a time, so you have ferocious cache line contention
> until the backlog clears. Then things are OK again for a bit until
> the same thing happens to some other backend. This is just a theory,
> I might be totally wrong...

Hm, but one would expect most of the spinlock contenders to have given up
and gone to sleep by the time the interrupted lock holder resumes. If that
is not what happens, then maybe we should revisit the logic in
SpinLockAcquire()...

best regards,
Florian Pflug

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers