Hi,

On 2024-07-31 22:32:19 +1200, Thomas Munro wrote:
> > That old comment means that both SpinLockAcquire() and SpinLockRelease()
> > acted as full memory barriers, and looking at the implementations, that
> > was indeed so. With the new implementation, SpinLockAcquire() will have
> > "acquire semantics" and SpinLockRelease will have "release semantics".
> > That's very sensible, and I don't believe it will break anything, but
> > it's a change in semantics nevertheless.
>
> Yeah. It's interesting that our pg_atomic_clear_flag(f) is like
> standard atomic_flag_clear_explicit(f, memory_order_release), not like
> atomic_flag_clear(f) which is short for atomic_flag_clear_explicit(f,
> memory_order_seq_cst). Example spinlock code I've seen written in
> modern C or C++ therefore uses the _explicit variants, so it can get
> acquire/release, which is what people usually want from a lock-like
> thing. What's a good way to test the performance in PostgreSQL?
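(For illustration, the acquire/release pattern Thomas describes looks
roughly like this in plain C11 - a sketch only, not what s_lock.h
actually does:)

#include <stdatomic.h>

static atomic_flag lck = ATOMIC_FLAG_INIT;

static inline void
spin_lock(void)
{
    /* acquire: later loads/stores can't be reordered before the lock */
    while (atomic_flag_test_and_set_explicit(&lck, memory_order_acquire))
        ;   /* spin; real code would add a pause/backoff here */
}

static inline void
spin_unlock(void)
{
    /* release: earlier loads/stores can't be reordered past the unlock */
    atomic_flag_clear_explicit(&lck, memory_order_release);
}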
I've used

c=8;pgbench -n -Mprepared -c$c -j$c -P1 -T10 -f <(echo "SELECT pg_logical_emit_message(false, \:client_id::text, '1'), generate_series(1, 1000) OFFSET 1000;")

in the past. Because of NUM_XLOGINSERT_LOCKS = 8 this ends up with 8
backends doing tiny xlog insertions and heavily contending on
insertpos_lck. The generate_series() is necessary as otherwise the
context switch and executor startup overhead dominates.

> In a naive loop that just test-and-sets and clears a flag a billion times in
> a loop and does nothing else, I see 20-40% performance increase depending on
> architecture when comparing _seq_cst with _acquire/_release.

I'd expect the difference to be even bigger on concurrent workloads on
x86-64 - the added memory barrier during lock release really hurts. I
have a test program to play around with this, and in isolation the
throughput with a full barrier release is about 0.4x that of a plain
release on my older 2 socket workstation [1]. Of course it's not
trivial to hit "pure enough" cases in the real world.

On said workstation [1], with the above pgbench, I get ~1.95M
inserts/sec (1959 TPS * 1000) on HEAD and 1.80M inserts/sec after adding

#define S_UNLOCK(lock) __atomic_store_n(lock, 0, __ATOMIC_SEQ_CST)

If I change NUM_XLOGINSERT_LOCKS = 40 and use 40 clients, I get 1.03M
inserts/sec with the current code and 0.86M inserts/sec with
__ATOMIC_SEQ_CST.

Greetings,

Andres Freund

[1] 2x Xeon Gold 5215
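(For anyone who wants to reproduce the isolated comparison, here is a
sketch of that kind of microbenchmark - illustrative only, not the
actual test program mentioned above. Build with -O2 -pthread, once as
is for the acquire/release variant and once with
-DRELEASE_ORDER=__ATOMIC_SEQ_CST for the full-barrier release, and
compare the two under time:)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef RELEASE_ORDER
#define RELEASE_ORDER __ATOMIC_RELEASE  /* compare with __ATOMIC_SEQ_CST */
#endif

#define ITERS 10000000
#define MAX_THREADS 64

static char lck;

static void *
worker(void *arg)
{
    for (long i = 0; i < ITERS; i++)
    {
        /* acquire the "spinlock" */
        while (__atomic_test_and_set(&lck, __ATOMIC_ACQUIRE))
            ;   /* spin */
        /* release it, with either release or seq_cst semantics */
        __atomic_clear(&lck, RELEASE_ORDER);
    }
    return NULL;
}

int
main(int argc, char **argv)
{
    int         nthreads = argc > 1 ? atoi(argv[1]) : 8;
    pthread_t   threads[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        nthreads = 8;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    printf("%d threads x %d lock/unlock cycles\n", nthreads, ITERS);
    return 0;
}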