Re: Lockless queue of waiters in LWLock

Andres Freund Fri, 04 Nov 2022 12:08:00 -0700

Hi,

On 2022-11-03 14:50:11 +0400, Pavel Borisov wrote:
> Or maybe there is another explanation for now small performance
> difference around 20 connections described in [0]?
> Thoughts?


Using xadd is quite a bit cheaper than cmpxchg, and now every lock release
uses a compare-exchange, I think.
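
To make the difference concrete, here is a minimal sketch of the two ways a
release can be written with C11 atomics. The demo_lock type and both function
names are made up for illustration, and the lock word (a plain count of
holders) is an assumption that does not match the real LWLock state layout. On
x86 the first variant typically compiles to a single lock xadd or lock sub,
the second to a lock cmpxchg in a retry loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock word: just a count of holders, not PostgreSQL's layout. */
typedef struct
{
    _Atomic uint32_t state;
} demo_lock;

/*
 * Release with one unconditional read-modify-write. It cannot fail, so
 * there is no retry loop. Returns the state after the release.
 */
static uint32_t
demo_release_fetch_sub(demo_lock *lock)
{
    return atomic_fetch_sub(&lock->state, 1) - 1;
}

/*
 * Release with a compare-exchange loop. If another thread changes the
 * state between the load and the exchange, the exchange fails, "old" is
 * refreshed with the current value, and we try again.
 */
static uint32_t
demo_release_cas(demo_lock *lock)
{
    uint32_t old = atomic_load(&lock->state);
    uint32_t desired;

    do
    {
        assert(old > 0);        /* releasing a lock nobody holds is a bug */
        desired = old - 1;
    } while (!atomic_compare_exchange_weak(&lock->state, &old, desired));

    return desired;
}
```

Both functions leave the lock word in the same state. The difference is only
in how many attempts the CAS variant may need when other threads are touching
the same word.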

In the past I had a more complicated version of LWLockAcquire which tried to
use an xadd to acquire locks. IIRC (and this is long enough ago that I might
not) that proved to be a benefit, but I was worried about the complexity. And
just getting in the version that didn't always use a spinlock was the higher
priority.

The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
lwlocks are slower than a spinlock (but obviously are better under contention
nonetheless).


I have a benchmark program that starts a thread for each physical core and
has each thread just increment an atomic counter.
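
This is not the actual benchmark program, just a rough sketch of the contended
case under some assumptions: pthreads, C11 atomics, and made-up names
(demo_method, demo_worker, demo_run_shared). Timing and pinning threads to
physical cores are left out, so it only shows the structure of the test.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_MAX_THREADS 64

typedef enum { DEMO_FETCH_ADD, DEMO_CAS } demo_method;

typedef struct
{
    _Atomic uint64_t *counter;  /* shared by all workers */
    uint64_t    iterations;
    demo_method method;
} demo_worker_arg;

static void *
demo_worker(void *p)
{
    demo_worker_arg *arg = p;

    for (uint64_t i = 0; i < arg->iterations; i++)
    {
        if (arg->method == DEMO_FETCH_ADD)
        {
            /* single unconditional read-modify-write */
            atomic_fetch_add(arg->counter, 1);
        }
        else
        {
            /* compare-exchange loop; "old" is refreshed on each failure */
            uint64_t old = atomic_load(arg->counter);

            while (!atomic_compare_exchange_weak(arg->counter, &old, old + 1))
                ;
        }
    }
    return NULL;
}

/* Run nthreads workers against one shared counter, return its final value. */
static uint64_t
demo_run_shared(int nthreads, uint64_t iterations, demo_method method)
{
    _Atomic uint64_t counter;
    pthread_t   threads[DEMO_MAX_THREADS];
    demo_worker_arg arg;

    assert(nthreads > 0 && nthreads <= DEMO_MAX_THREADS);
    atomic_init(&counter, 0);
    arg.counter = &counter;
    arg.iterations = iterations;
    arg.method = method;

    for (int i = 0; i < nthreads; i++)
    {
        int rc = pthread_create(&threads[i], NULL, demo_worker, &arg);

        assert(rc == 0);
        (void) rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);

    return atomic_load(&counter);
}
```

To get throughput numbers out of something like this, wrap the call to
demo_run_shared() in clock_gettime(CLOCK_MONOTONIC, ...) and divide the final
count by the elapsed time. Build with something like cc -O2 -pthread.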

On my dual Xeon Gold 5215 workstation:

cmpxchg:
32: throughput per thread: 0.55M/s, total: 11.02M/s
64: throughput per thread: 0.63M/s, total: 12.68M/s

lock add:
32: throughput per thread: 2.10M/s, total: 41.98M/s
64: throughput per thread: 2.12M/s, total: 42.40M/s

xadd:
32: throughput per thread: 2.10M/s, total: 41.91M/s
64: throughput per thread: 2.04M/s, total: 40.71M/s


And even when there's no contention, with every thread just updating its own
cacheline:

cmpxchg:
32: throughput per thread: 88.83M/s, total: 1776.51M/s
64: throughput per thread: 96.46M/s, total: 1929.11M/s

lock add:
32: throughput per thread: 166.07M/s, total: 3321.31M/s
64: throughput per thread: 165.86M/s, total: 3317.22M/s

add (no lock):
32: throughput per thread: 530.78M/s, total: 10615.62M/s
64: throughput per thread: 531.22M/s, total: 10624.35M/s

xadd:
32: throughput per thread: 165.88M/s, total: 3317.51M/s
64: throughput per thread: 165.93M/s, total: 3318.53M/s
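
For the uncontended runs, "its own cacheline" could be arranged roughly as in
the sketch below. The names (padded_counter, pad_worker, pad_run_private) are
invented for illustration, and the 64 byte cache line size is an assumption
that holds for common x86 parts but is not queried from the hardware.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define PAD_CACHELINE   64      /* assumed cache line size in bytes */
#define PAD_MAX_THREADS 64

/* The alignment pads each counter out to a full (assumed) cache line. */
typedef struct
{
    alignas(PAD_CACHELINE) _Atomic uint64_t value;
} padded_counter;

static padded_counter pad_slots[PAD_MAX_THREADS];

typedef struct
{
    padded_counter *slot;       /* private to one worker */
    uint64_t    iterations;
} pad_arg;

static void *
pad_worker(void *p)
{
    pad_arg    *arg = p;

    /* Still a locked read-modify-write, but no other thread touches the line. */
    for (uint64_t i = 0; i < arg->iterations; i++)
        atomic_fetch_add(&arg->slot->value, 1);
    return NULL;
}

/* Each worker increments only its own slot; returns the sum of all slots. */
static uint64_t
pad_run_private(int nthreads, uint64_t iterations)
{
    pthread_t   threads[PAD_MAX_THREADS];
    pad_arg     args[PAD_MAX_THREADS];
    uint64_t    total = 0;

    assert(nthreads > 0 && nthreads <= PAD_MAX_THREADS);

    for (int i = 0; i < nthreads; i++)
    {
        int rc;

        atomic_store(&pad_slots[i].value, 0);
        args[i].slot = &pad_slots[i];
        args[i].iterations = iterations;
        rc = pthread_create(&threads[i], NULL, pad_worker, &args[i]);
        assert(rc == 0);
        (void) rc;
    }
    for (int i = 0; i < nthreads; i++)
    {
        pthread_join(threads[i], NULL);
        total += atomic_load(&pad_slots[i].value);
    }
    return total;
}
```

Without the padding, adjacent counters would share a cache line and the
threads would contend on it even though they never touch the same variable.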


Greetings,

Andres Freund
