Hi,

On 2022-11-03 14:50:11 +0400, Pavel Borisov wrote:
> Or maybe there is another explanation for now small performance
> difference around 20 connections described in [0]?
> Thoughts?

Using xadd is quite a bit cheaper than cmpxchg, and now every lock release
uses a compare-exchange, I think.

In the past I had a more complicated version of LWLockAcquire which tried to
use an xadd to acquire locks. IIRC (and this is long enough ago that I might
not) that proved to be a benefit, but I was worried about the complexity. And
just getting in the version that didn't always use a spinlock was the higher
priority.

The use of cmpxchg vs lock inc/lock add/xadd is one of the major reasons why
lwlocks are slower than a spinlock (but obviously are better under contention
nonetheless).

I have a benchmark program that starts a thread for each physical core and
just increments a counter on an atomic value.

On my dual Xeon Gold 5215 workstation:

cmpxchg:
32: throughput per thread: 0.55M/s, total: 11.02M/s
64: throughput per thread: 0.63M/s, total: 12.68M/s

lock add:
32: throughput per thread: 2.10M/s, total: 41.98M/s
64: throughput per thread: 2.12M/s, total: 42.40M/s

xadd:
32: throughput per thread: 2.10M/s, total: 41.91M/s
64: throughput per thread: 2.04M/s, total: 40.71M/s

and even when there's no contention, every thread just updating its own
cacheline:

cmpxchg:
32: throughput per thread: 88.83M/s, total: 1776.51M/s
64: throughput per thread: 96.46M/s, total: 1929.11M/s

lock add:
32: throughput per thread: 166.07M/s, total: 3321.31M/s
64: throughput per thread: 165.86M/s, total: 3317.22M/s

add (no lock):
32: throughput per thread: 530.78M/s, total: 10615.62M/s
64: throughput per thread: 531.22M/s, total: 10624.35M/s

xadd:
32: throughput per thread: 165.88M/s, total: 3317.51M/s
64: throughput per thread: 165.93M/s, total: 3318.53M/s

Greetings,

Andres Freund
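
For illustration only, here is a minimal sketch of the two increment
strategies compared above. It is not the benchmark program that produced the
numbers in this mail. It assumes C11 atomics and pthreads, and every name in
it (inc_cmpxchg, inc_xadd, run_counter_bench, MAX_THREADS) is hypothetical.
Timing, the uncontended per-cacheline variant, and the plain "lock add" and
"add (no lock)" cases are left out. The point it shows is structural: the
compare-exchange increment needs a retry loop, while the fetch-add increment
is a single read-modify-write that cannot fail.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names throughout; a sketch, not the program quoted above. */
#define MAX_THREADS 64

typedef enum { INC_CMPXCHG, INC_XADD } inc_mode;
typedef struct { inc_mode mode; uint64_t iters; } worker_arg;

/* One counter shared by all threads, i.e. the contended case. */
static _Atomic uint64_t shared_counter;

/* Increment via compare-exchange: retries whenever another thread
 * changed the value between our read and our exchange attempt. */
static void inc_cmpxchg(_Atomic uint64_t *v)
{
    uint64_t old = atomic_load_explicit(v, memory_order_relaxed);

    /* On failure, 'old' is refreshed with the value currently stored. */
    while (!atomic_compare_exchange_weak(v, &old, old + 1))
        ;
}

/* Increment via fetch-add: one atomic read-modify-write with no retry
 * loop (compilers typically emit lock xadd or lock add on x86-64). */
static void inc_xadd(_Atomic uint64_t *v)
{
    atomic_fetch_add(v, 1);
}

static void *worker(void *p)
{
    const worker_arg *arg = p;

    for (uint64_t i = 0; i < arg->iters; i++) {
        if (arg->mode == INC_CMPXCHG)
            inc_cmpxchg(&shared_counter);
        else
            inc_xadd(&shared_counter);
    }
    return NULL;
}

/* Runs up to nthreads workers against the shared counter and returns the
 * final value, which equals (threads actually started) * iters. */
static uint64_t run_counter_bench(inc_mode mode, int nthreads, uint64_t iters)
{
    pthread_t tids[MAX_THREADS];
    worker_arg arg = { mode, iters };
    int started = 0;

    assert(nthreads >= 0 && nthreads <= MAX_THREADS);
    atomic_store(&shared_counter, 0);

    while (started < nthreads &&
           pthread_create(&tids[started], NULL, worker, &arg) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    return atomic_load(&shared_counter);
}
```

Wrapping calls such as run_counter_bench(INC_XADD, 4, 1000000) and
run_counter_bench(INC_CMPXCHG, 4, 1000000) in a timer would give a rough
throughput comparison between the two modes. The absolute numbers will depend
heavily on the hardware, the compiler, and the thread count.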