On Jul 7, 2011, at 03:35 , Robert Haas wrote:
> On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <f...@phlo.org> wrote:
>>> On Jun 12, 2011, at 23:39 , Robert Haas wrote:
>>>> So, the majority (60%) of the excess spinning appears to be due to
>>>> SInvalReadLock. A good chunk are due to ProcArrayLock (25%).
>>>
>>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>>> so I guess that two adjacent LWLocks currently share one cache line.
>>>
>>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>>> index 5, so if I'm not mistaken exactly the two locks where you saw
>>> the largest contention on are on the same cache line...
>>>
>>> Might make sense to try and see if these numbers change if you
>>> either make LWLockPadded 64 bytes or arrange the locks differently...
>>
>> I fooled around with this a while back and saw no benefit. It's
>> possible a more careful test would turn up something, but I think the
>> only real way forward here is going to be to eliminate some of that
>> locking altogether.
>
> I also tried inserting an
> acquire-and-release cycle on MyProc->backendLock, which was just as
> good. That seems like a pretty encouraging result, since there appear
> to be several ways of reimplementing SIGetDataEntries() that would
> work with a per-backend lock rather than a global one.
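(For reference, the padding experiment suggested in the quote amounts to
roughly the following change to the LWLockPadded union in lwlock.c. This
is only a sketch; the union and LWLOCK_PADDED_SIZE are the ones from
lwlock.c, and the only change is bumping the padded size to the 64-byte
cache-line size mentioned above.)

/*
 * Pad each LWLock out to a full 64-byte cache line, so that two adjacent
 * locks - e.g. ProcArrayLock (index 4) and SInvalReadLock (index 5) - no
 * longer share a line.  Upstream this is
 *   #define LWLOCK_PADDED_SIZE  (sizeof(LWLock) <= 16 ? 16 : 32)
 */
#define LWLOCK_PADDED_SIZE  64

typedef union LWLockPadded
{
    LWLock      lock;           /* 24 bytes on x86-64 */
    char        pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;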
I did some research and found "TLRW: Return of the Read-Write Lock" by
D. Dice and N. Shavit. The PDF can be found here:
http://transact09.cs.washington.edu/19_paper.pdf

They describe a read-write lock called a "bytelock" with a set of "slots"
for shared lockers, where each shared locker only needs to increment its
slot, and thus doesn't interfere with other concurrent shared locking
attempts. The price is that, after incrementing its slot, a shared locker
must issue a fencing instruction and then verify that no writer has taken
the lock in the meantime. In their application, they distinguish between
"slotted" threads - those which own a designated slot - and "unslotted"
threads - those which operate on the normal shared counter.

I implemented that idea, but with a few modifications. First, I don't
distinguish between "slotted" and "unslotted" threads, but instead allow
multiple backends to share a slot. This means my implementation cannot use
a simple unlocked increment to update the slot, but instead uses a "locked
xadd". I also moved the slots out of the lock itself and into a separate
set of LWLockPart structures. Each such structure contains the counters
for one slot id and multiple locks. In effect, the result is an LWLock
with a partitioned shared counter. The partition a backend operates on for
shared locks is determined by its backend id.

I've added the implementation to the lock benchmarking tool at
https://github.com/fgp/lockbench
and have also pushed a patched version of postgres to
https://github.com/fgp/postgres/tree/lwlock_part
The number of shared counter partitions is currently 4, but can easily be
adjusted in lwlock.h. The code uses GCC's atomic fetch-and-add intrinsic
if available; otherwise it falls back to using a per-partition spin lock.
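To make the shared-acquire fast path a bit more concrete, here is a rough
sketch of the idea. This is not the actual code from the branch above: for
readability the per-partition counters live inside the lock struct rather
than in separate LWLockPart structures, the names are made up, and the
exclusive path, error handling and the usual queue/sleep logic are left
out.

#include <stdbool.h>

#define NUM_SHARED_PARTITIONS 4     /* adjustable, like in lwlock.h */

typedef struct PartitionedLWLock
{
    volatile char   exclusive;      /* # of exclusive holders (0 or 1) */
    /* shared-holder counts, split into per-backend partitions */
    volatile int    shared[NUM_SHARED_PARTITIONS];
} PartitionedLWLock;

/* The partition a backend operates on is derived from its backend id. */
#define SHARED_PARTITION(backend_id) ((backend_id) % NUM_SHARED_PARTITIONS)

/*
 * Shared-acquire fast path.  Returns true if the shared lock was taken;
 * false means a writer is around and the caller has to fall back to the
 * regular spinlock-protected slow path (queueing and sleeping).
 */
static bool
LWLockAcquireSharedFast(PartitionedLWLock *lock, int backend_id)
{
    int     part = SHARED_PARTITION(backend_id);

    /*
     * Announce ourselves by bumping our partition's counter.  Since
     * several backends may share a partition, this must be an atomic
     * ("locked xadd") increment, not a plain store.  The __sync builtin
     * also acts as the full fence the bytelock protocol requires between
     * the increment and the writer check below.
     */
    __sync_fetch_and_add(&lock->shared[part], 1);

    if (lock->exclusive)
    {
        /* A writer holds (or is acquiring) the lock - back out. */
        __sync_fetch_and_sub(&lock->shared[part], 1);
        return false;
    }

    return true;
}

/* Shared release: drop our count; a waiting writer will notice. */
static void
LWLockReleaseSharedFast(PartitionedLWLock *lock, int backend_id)
{
    __sync_fetch_and_sub(&lock->shared[SHARED_PARTITION(backend_id)], 1);
}

An exclusive locker would, still under the protection of the lock's
spinlock, set the exclusive flag and then wait for all partition counters
to drain to zero before proceeding.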
Benchmarking results look promising so far. This is the throughput I get,
in acquisition/release cycles per second, for the various LWLock
implementations on a 4-core machine. pg_lwlock_part,N,X is the partitioned
lock with N lock partitions, using LOCKED XADD if X == yes. The numbers
are normalized to the throughput in the single-worker case.

---------- Cycles / Second (Wall-Time) ---------------------
  1      2      4      8      16     worker
 wall   wall   wall   wall   wall    time

  1      1.9    3.9    3.9    3.9    none
  1      0.2    0.9    0.8    0.6    pg_lwlock_orig
  1      0.4    0.3    0.3    0.3    pg_lwlock_cas
  1      0.3    1.0    0.8    0.6    pg_lwlock_part,1,no
  1      0.4    0.4    0.4    0.4    pg_lwlock_part,1,yes
  1      2.0    0.4    0.4    0.3    pg_lwlock_part,2,no
  1      2.0    0.7    0.8    0.7    pg_lwlock_part,2,yes
  1      2.0    4.1    2.1    1.0    pg_lwlock_part,4,no
  1      2.1    3.8    1.8    2.2    pg_lwlock_part,4,yes

These numbers show that as long as the number of workers does not exceed
the number of lock partitions, throughput increases approximately
linearly. It drops off afterwards, but I think that's to be expected - the
test executes LWLockAcquire/Release in a tight loop, so once there's any
kind of contention (even just cache-line bouncing), performance drops.
This is also why the original LWLock beats CAS and the partitioned lock
with fewer than 4 partitions here - effectively, at least half of the
workers are sleeping at any given time, as the following user vs. wall
time numbers show.

---------- User-Time / Wall-Time ----------------------------
  1      2      4      8      16     worker

 1.0    1.9    3.9    3.9    3.9    none
 1.0    1.9    1.1    1.3    1.8    pg_lwlock_orig
 1.0    2.0    4.0    4.0    4.0    pg_lwlock_cas
 1.0    1.9    1.1    1.4    1.8    pg_lwlock_part,1,no
 1.0    2.0    4.0    3.9    4.0    pg_lwlock_part,1,yes
 1.0    2.0    3.6    3.8    3.6    pg_lwlock_part,2,no
 1.0    2.0    3.8    3.9    3.9    pg_lwlock_part,2,yes
 1.0    2.0    4.1    4.1    3.9    pg_lwlock_part,4,no
 1.0    2.0    4.0    3.9    3.9    pg_lwlock_part,4,yes

I lack the hardware to produce any meaningful benchmark results with
pgbench for this, so, again, it would be very cool if someone (Robert? ;-)
who has access to that kind of hardware could give the patched version
from github a spin with pgbench. The patched version currently uses 4
shared counter partitions, and should use LOCKED XADD if compiled with
GCC >= 4.1 (or ICC).

> I'm wondering if the problem may be
> not so much that we have continuous spinlock contention, but rather
> than every once in a while a process gets time-sliced out while it
> holds a spinlock. If it's an important spinlock (like the one
> protecting SInvalReadLock), the system will quickly evolve into a
> state where every single processor is doing nothing but trying to get
> that spinlock. Even after the hapless lock-holder gets to run again
> and lets go of the lock, you have a whole pile of other backends who
> are sitting there firing of lock xchgb in a tight loop, and they can
> only get it one at a time, so you have ferocious cache line contention
> until the backlog clears. Then things are OK again for a bit until
> the same thing happens to some other backend. This is just a theory,
> I might be totally wrong...

Hm, but one would expect most of the spinlock contenders to have given up
and gone to sleep by the time the interrupted lock holder resumes. If that
is not what happens, then maybe we should revisit the logic in
SpinLockAcquire()...

best regards,
Florian Pflug

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers