Hi,

On 2024-07-31 22:32:19 +1200, Thomas Munro wrote:
> > That old comment means that both SpinLockAcquire() and SpinLockRelease()
> > acted as full memory barriers, and looking at the implementations, that
> > was indeed so. With the new implementation, SpinLockAcquire() will have
> > "acquire semantics" and SpinLockRelease will have "release semantics".
> > That's very sensible, and I don't believe it will break anything, but
> > it's a change in semantics nevertheless.
>
> Yeah. It's interesting that our pg_atomic_clear_flag(f) is like
> standard atomic_flag_clear_explicit(f, memory_order_release), not like
> atomic_flag_clear(f) which is short for atomic_flag_clear_explicit(f,
> memory_order_seq_cst). Example spinlock code I've seen written in
> modern C or C++ therefore uses the _explicit variants, so it can get
> acquire/release, which is what people usually want from a lock-like
> thing. What's a good way to test the performance in PostgreSQL?
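(For illustration, the acquire/release pattern Thomas describes looks
roughly like this in plain C11 - a sketch only, not what s_lock.h
actually does:)

#include <stdatomic.h>

static atomic_flag lck = ATOMIC_FLAG_INIT;

static inline void
spin_lock(void)
{
    /* acquire: later loads/stores can't be reordered before the lock */
    while (atomic_flag_test_and_set_explicit(&lck, memory_order_acquire))
        ;   /* spin; real code would add a pause/backoff here */
}

static inline void
spin_unlock(void)
{
    /* release: earlier loads/stores can't be reordered past the unlock */
    atomic_flag_clear_explicit(&lck, memory_order_release);
}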
I've used

c=8;pgbench -n -Mprepared -c$c -j$c -P1 -T10 -f <(echo "SELECT pg_logical_emit_message(false, \:client_id::text, '1'), generate_series(1, 1000) OFFSET 1000;")

in the past. Because of NUM_XLOGINSERT_LOCKS = 8 this ends up with 8
backends doing tiny xlog insertions and heavily contending on
insertpos_lck. The generate_series() is necessary as otherwise the
context switch and executor startup overhead dominates.

> In a naive loop that just test-and-sets and clears a flag a billion times in
> a loop and does nothing else, I see 20-40% performance increase depending on
> architecture when comparing _seq_cst with _acquire/_release.

I'd expect the difference to be even bigger on concurrent workloads on
x86-64 - the added memory barrier during lock release really hurts. I
have a test program to play around with this, and in isolation the
throughput with a full barrier release is about 0.4x that of a plain
release on my older 2 socket workstation [1]. Of course it's not
trivial to hit "pure enough" cases in the real world.

On said workstation [1], with the above pgbench, I get ~1.95M
inserts/sec (1959 TPS * 1000) on HEAD and 1.80M inserts/sec after adding

#define S_UNLOCK(lock) __atomic_store_n(lock, 0, __ATOMIC_SEQ_CST)

If I change NUM_XLOGINSERT_LOCKS = 40 and use 40 clients, I get 1.03M
inserts/sec with the current code and 0.86M inserts/sec with
__ATOMIC_SEQ_CST.

Greetings,

Andres Freund

[1] 2x Xeon Gold 5215
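(For anyone who wants to reproduce the isolated comparison, here is a
sketch of that kind of microbenchmark - illustrative only, not the
actual test program mentioned above. Build with -O2 -pthread, once as
is for the acquire/release variant and once with
-DRELEASE_ORDER=__ATOMIC_SEQ_CST for the full-barrier release, and
compare the two under time:)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef RELEASE_ORDER
#define RELEASE_ORDER __ATOMIC_RELEASE  /* compare with __ATOMIC_SEQ_CST */
#endif

#define ITERS 10000000
#define MAX_THREADS 64

static char lck;

static void *
worker(void *arg)
{
    for (long i = 0; i < ITERS; i++)
    {
        /* acquire the "spinlock" */
        while (__atomic_test_and_set(&lck, __ATOMIC_ACQUIRE))
            ;   /* spin */
        /* release it, with either release or seq_cst semantics */
        __atomic_clear(&lck, RELEASE_ORDER);
    }
    return NULL;
}

int
main(int argc, char **argv)
{
    int         nthreads = argc > 1 ? atoi(argv[1]) : 8;
    pthread_t   threads[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        nthreads = 8;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    printf("%d threads x %d lock/unlock cycles\n", nthreads, ITERS);
    return 0;
}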