On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> On 26/11/2020 06:30, Krunal Bauskar wrote:
> > Improving spin-lock implementation on ARM.
> > ------------------------------------------------------------
> >
> > * Spin-Lock is known to have a significant effect on performance
> >   with increasing scalability.
> >
> > * Existing Spin-Lock implementation for ARM is sub-optimal due to
> >   use of TAS (test-and-set).
> >
> > * TAS is implemented on ARM as load-store, so even if the lock is
> >   not free, the store operation will execute, replacing the value
> >   with the same value. This redundant operation (mainly the store)
> >   is costly.
> >
> > * CAS is implemented on ARM as load-check-store-check, which means
> >   that if the lock is not free, the check after the load makes the
> >   loop return early, thereby saving the costlier store
> >   operation. [1]
>
> Can you add some code comments to explain why CAS is cheaper than
> TAS on ARM?

1. As Alexey too pointed out in a follow-up email:

   CAS = load value -> check if the value is the expected one
         -> if yes: replace it (store the new value)
         -> else:   exit/break

   TAS = load value -> store the new value

This means each TAS turns into 2 operations, a LOAD and a STORE, and of
course the STORE is the costlier one in terms of latency. CAS avoids the
redundant store through its early check.

Let's look at some micro-benchmarking. I implemented a simple spin-loop
using both approaches and, as expected, with increasing scalability CAS
continues to out-perform TAS by a margin of up to 50%:

---- TAS ----
Running  128 parallelism. Elapsed time:  1.34271 s
Running  256 parallelism. Elapsed time:  3.6487  s
Running  512 parallelism. Elapsed time: 11.3725  s
Running 1024 parallelism. Elapsed time: 43.5376  s

---- CAS ----
Running  128 parallelism. Elapsed time:  1.00131 s
Running  256 parallelism. Elapsed time:  2.53202 s
Running  512 parallelism. Elapsed time:  7.66829 s
Running 1024 parallelism. Elapsed time: 22.6294  s

The same can also be observed from perf profiling:

TAS:
15.57 │44:   ldxr   w0, [x19]
83.93 │      stxr   w1, w21, [x19]

CAS:
81.29 │58: ↓ b.ne   cc
 ....
 9.86 │cc:   ldaxr  w0, [x22]
 8.84 │      cmp    w0, #0x0
      │    ↑ b.ne   58
      │      stxr   w1, w20, [x22]

*In TAS, the STORE is pretty costly.*

2. I have added the needed comment in the patch. Updated patch attached.

----------------------
Thanks for taking a look at this, and do let me know if any more info is
needed.

> Is there some official ARM documentation, like a programmer's reference
> manual or something like that, that would show a reference
> implementation of a spinlock on ARM? It would be good to refer to an
> authoritative source on this.
>
> - Heikki

--
Regards,
Krunal Bauskar
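The micro-benchmark source follows. Assuming it is saved as
spin_bench.cpp (the file name and exact flags are assumptions, not
something stated above), it should build and run along these lines:

    g++ -O2 -DTAS -pthread spin_bench.cpp -lz -o spin_tas
    g++ -O2 -DCAS -pthread spin_bench.cpp -lz -o spin_cas
    ./spin_tas 128
    ./spin_cas 128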
#include <iostream>
#include <thread>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <zlib.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

const size_t data_block_size = 64 * 1024;
char data_block[64 * 1024];
size_t k_iter = 100;
unsigned long global_crcval = 0;

/* ---------------- spin-lock ---------------- */
uint32_t lock_unit;

void lock()
{
    while (true)
    {
#if defined(CAS)
        uint32_t expected = 0;
        if (!__atomic_compare_exchange_n(&lock_unit, &expected, 1, false,
                                         __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE))
        {
#elif defined(TAS)
        if (__sync_lock_test_and_set(&lock_unit, 1))
        {
#else
#error "compile with -DCAS or -DTAS"
#endif
            /* Spin-and-retry, as the lock is held by some other thread. */
            __asm__ __volatile__("" ::: "memory");
            continue;
        }
        break;
    }
}

void unlock()
{
#ifdef CAS
    __atomic_store_n(&lock_unit, 0, __ATOMIC_RELEASE);
#else
    __sync_lock_release(&lock_unit);
#endif
}

/* ---------------- workload ---------------- */
void workload_execute()
{
    for (size_t i = 0; i < k_iter; ++i)
    {
        /* Each thread takes the lock -> executes the critical
         * section -> unlocks. */
        lock();
        memset(data_block, rand() % 255, data_block_size);
        unsigned long crcval = 0;
        /* zlib's crc32() returns the updated CRC value. */
        crcval = crc32(crcval, (const unsigned char *) data_block,
                       data_block_size);
        global_crcval += crcval;
        unlock();
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        std::cerr << "usage: <program> <number-of-threads/parallelism>"
                  << std::endl;
        return 1;
    }

    size_t num_of_threads = atol(argv[1]);
    std::thread *handles[num_of_threads];

    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_of_threads; ++i)
        handles[i] = new std::thread(workload_execute);
    for (size_t i = 0; i < num_of_threads; ++i)
        handles[i]->join();
    auto finish = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < num_of_threads; ++i)
        delete handles[i];

    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
}
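The per-instruction profiles quoted in the message above come from perf;
something along these lines should reproduce them (the exact invocation
is an assumption):

    perf record ./spin_tas 512
    perf annotate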
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6fb3..5110de37d4 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -321,7 +321,33 @@ tas(volatile slock_t *lock)
  * than other widths.
  */
 #if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC__SYNC_INT32_TAS
+
+#ifdef HAVE_GCC__ATOMIC_INT32_CAS
+/* Reuse the same flag to avoid re-declaring the default tas functions below. */
+#define HAS_TEST_AND_SET
+
+/*
+ * On ARM, compare-and-swap is load -> check if value == expected
+ *   -> if yes: store
+ *   -> if no:  break
+ * whereas swap is load -> store, i.e. the store always executes, even
+ * when it merely replaces the value with the value already there.
+ * The store is a write operation, and an atomic one at that, so it is
+ * costlier and better avoided; compare-and-swap achieves exactly that.
+ */
+#define TAS(lock) cas(lock)
+typedef int slock_t;
+
+static __inline__ int
+cas(volatile slock_t *lock)
+{
+	slock_t expected = 0;
+	return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
+			false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
+}
+
+#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE);
+#elif HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
 
 #define TAS(lock) tas(lock)
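To see the patched primitives in isolation, here is a minimal standalone
sketch (an illustration adapted from the diff above, not part of the
patch; file name and build command are assumptions) that inlines the new
definitions and exercises one lock/unlock cycle:

/*
 * cas_check.c -- exercise the cas()-based TAS()/S_UNLOCK() pair.
 * Build: gcc -O2 cas_check.c -o cas_check
 */
#include <stdio.h>

typedef int slock_t;

static __inline__ int
cas(volatile slock_t *lock)
{
	slock_t expected = 0;
	return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
			false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
}

#define TAS(lock) cas(lock)
#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE)

int
main(void)
{
	slock_t lock = 0;

	/* First attempt succeeds: TAS() returns 0 when the lock is taken. */
	printf("first TAS:  %d (expect 0: acquired)\n", TAS(&lock));

	/* Second attempt fails on the compare, without issuing a store. */
	printf("second TAS: %d (expect 1: busy)\n", TAS(&lock));

	S_UNLOCK(&lock);

	/* After the release the lock can be acquired again. */
	printf("third TAS:  %d (expect 0: reacquired)\n", TAS(&lock));
	return 0;
}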