My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that showed an enormous amount of s_lock() time going to the __sync_lock_test_and_set() call in the AArch64 implementation of tas(). Upon closer inspection, I noticed that we don't implement a custom TAS_SPIN() for this architecture, so I quickly hacked together the attached patch and ran a couple of benchmarks that stressed the spinlock code. I found no discussion about TAS_SPIN() on ARM in the archives, but I did notice that the initial AArch64 support was added [0] before x86_64 started using a non-locking test [1].
These benchmarks are for a c8g.24xlarge running a select-only pgbench with 256 clients and pg_stat_statements.track_planning enabled. without the patch: 90.04% postgres [.] s_lock 1.07% pg_stat_statements.so [.] pgss_store 0.59% postgres [.] LWLockRelease 0.56% postgres [.] perform_spin_delay 0.31% [kernel] [k] arch_local_irq_enable | while (TAS_SPIN(lock)) | { | perform_spin_delay(&delayStatus); 0.12 |2c: -> bl perform_spin_delay | tas(): | return __sync_lock_test_and_set(lock, 1); 0.01 |30: swpa w20, w1, [x19] | s_lock(): 99.87 | add x0, sp, #0x28 | while (TAS_SPIN(lock)) 0.00 | ^ cbnz w1, 2c tps = 74135.100891 (without initial connection time) with the patch: 30.46% postgres [.] s_lock 5.88% postgres [.] perform_spin_delay 4.61% [kernel] [k] arch_local_irq_enable 3.31% [kernel] [k] next_uptodate_page 2.50% postgres [.] hash_search_with_hash_value | while (TAS_SPIN(lock)) | { | perform_spin_delay(&delayStatus); 0.63 |2c:+-->add x0, sp, #0x28 0.07 | |-> bl perform_spin_delay | |while (TAS_SPIN(lock)) 0.25 |34:| ldr w0, [x19] 65.19 | +--cbnz w0, 2c | tas(): | return __sync_lock_test_and_set(lock, 1); 0.00 | swpa w20, w0, [x19] | s_lock(): 33.82 | ^ cbnz w0, 2c tps = 549462.785554 (without initial connection time) [0] https://postgr.es/c/5c7603c [1] https://postgr.es/c/b03d196 -- nathan
>From a69b7d38c40f0fb49f714c49bb45c292e7c8587d Mon Sep 17 00:00:00 2001 From: Nathan Bossart <nat...@postgresql.org> Date: Tue, 22 Oct 2024 14:33:39 -0500 Subject: [PATCH v1 1/1] Use a non-locking initial test in TAS_SPIN on AArch64. --- src/include/storage/s_lock.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h index e94ed5f48b..07f343baf8 100644 --- a/src/include/storage/s_lock.h +++ b/src/include/storage/s_lock.h @@ -270,6 +270,12 @@ tas(volatile slock_t *lock) */ #if defined(__aarch64__) +/* + * On ARM64, it's a win to use a non-locking test before the TAS proper. It + * may be a win on 32-bit ARM, too, but nobody's tested it yet. + */ +#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock)) + #define SPIN_DELAY() spin_delay() static __inline__ void -- 2.39.5 (Apple Git-154)