My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that
showed an enormous amount of s_lock() time going to the
__sync_lock_test_and_set() call in the AArch64 implementation of tas().
Upon closer inspection, I noticed that we don't implement a custom
TAS_SPIN() for this architecture, so I quickly hacked together the attached
patch and ran a couple of benchmarks that stressed the spinlock code.  I
found no discussion about TAS_SPIN() on ARM in the archives, but I did
notice that the initial AArch64 support was added [0] before x86_64 started
using a non-locking test [1].

These benchmarks are for a c8g.24xlarge running a select-only pgbench with
256 clients and pg_stat_statements.track_planning enabled.

without the patch:

   90.04%  postgres                [.] s_lock
    1.07%  pg_stat_statements.so   [.] pgss_store
    0.59%  postgres                [.] LWLockRelease
    0.56%  postgres                [.] perform_spin_delay
    0.31%  [kernel]                [k] arch_local_irq_enable

           |    while (TAS_SPIN(lock))
           |    {
           |    perform_spin_delay(&delayStatus);
      0.12 |2c: -> bl   perform_spin_delay
           |    tas():
           |    return __sync_lock_test_and_set(lock, 1);
      0.01 |30:   swpa w20, w1, [x19]
           |    s_lock():
     99.87 |      add  x0, sp, #0x28
           |    while (TAS_SPIN(lock))
      0.00 |    ^ cbnz w1, 2c

    tps = 74135.100891 (without initial connection time)

with the patch:

   30.46%  postgres                [.] s_lock
    5.88%  postgres                [.] perform_spin_delay
    4.61%  [kernel]                [k] arch_local_irq_enable
    3.31%  [kernel]                [k] next_uptodate_page
    2.50%  postgres                [.] hash_search_with_hash_value

           |    while (TAS_SPIN(lock))
           |    {
           |    perform_spin_delay(&delayStatus);
      0.63 |2c:+-->add  x0, sp, #0x28
      0.07 |   |-> bl   perform_spin_delay
           |   |while (TAS_SPIN(lock))
      0.25 |34:|  ldr  w0, [x19]
     65.19 |   +--cbnz w0, 2c
           |    tas():
           |    return __sync_lock_test_and_set(lock, 1);
      0.00 |      swpa w20, w0, [x19]
           |    s_lock():
     33.82 |    ^ cbnz w0, 2c

    tps = 549462.785554 (without initial connection time)

[0] https://postgr.es/c/5c7603c
[1] https://postgr.es/c/b03d196

-- 
nathan
>From a69b7d38c40f0fb49f714c49bb45c292e7c8587d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nat...@postgresql.org>
Date: Tue, 22 Oct 2024 14:33:39 -0500
Subject: [PATCH v1 1/1] Use a non-locking initial test in TAS_SPIN on AArch64.

---
 src/include/storage/s_lock.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index e94ed5f48b..07f343baf8 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -270,6 +270,12 @@ tas(volatile slock_t *lock)
  */
 #if defined(__aarch64__)
 
+/*
+ * On ARM64, it's a win to use a non-locking test before the TAS proper.  It
+ * may be a win on 32-bit ARM, too, but nobody's tested it yet.
+ */
+#define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
+
 #define SPIN_DELAY() spin_delay()
 
 static __inline__ void
-- 
2.39.5 (Apple Git-154)

Reply via email to