On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> On 26/11/2020 06:30, Krunal Bauskar wrote:
> > Improving spin-lock implementation on ARM.
> > ------------------------------------------------------------
> >
> > * Spin-Lock is known to have a significant effect on performance
> >   with increasing scalability.
> >
> > * Existing Spin-Lock implementation for ARM is sub-optimal due to
> >   use of TAS (test-and-set).
> >
> > * TAS is implemented on ARM as load-store, so even if the lock is
> >   not free, the store operation will execute, replacing the value
> >   with the same value. This redundant operation (mainly the store)
> >   is costly.
> >
> > * CAS is implemented on ARM as load-check-store-check, which means
> >   that if the lock is not free, the check after the load makes the
> >   loop return early, thereby saving the costlier store
> >   operation. [1]
>
> Can you add some code comments to explain why CAS is cheaper than
> TAS on ARM?

1. As Alexey too pointed out in a follow-up email:

   CAS = load value -> check if the value is the expected one
         -> if yes: replace it (store the new value)
         -> else:   exit/break

   TAS = load value -> store the new value

This means each TAS turns into 2 operations, a LOAD and a STORE, and of
course the STORE is the costlier one in terms of latency. CAS avoids the
redundant store through its early check.

Let's look at some micro-benchmarking. I implemented a simple spin-loop
using both approaches and, as expected, with increasing scalability CAS
continues to out-perform TAS by a margin of up to 50%:

---- TAS ----
Running  128 parallelism. Elapsed time:  1.34271 s
Running  256 parallelism. Elapsed time:  3.6487  s
Running  512 parallelism. Elapsed time: 11.3725  s
Running 1024 parallelism. Elapsed time: 43.5376  s

---- CAS ----
Running  128 parallelism. Elapsed time:  1.00131 s
Running  256 parallelism. Elapsed time:  2.53202 s
Running  512 parallelism. Elapsed time:  7.66829 s
Running 1024 parallelism. Elapsed time: 22.6294  s

The same can also be observed from perf profiling:

TAS:
15.57 │44:   ldxr   w0, [x19]
83.93 │      stxr   w1, w21, [x19]

CAS:
81.29 │58: ↓ b.ne   cc
 ....
 9.86 │cc:   ldaxr  w0, [x22]
 8.84 │      cmp    w0, #0x0
      │    ↑ b.ne   58
      │      stxr   w1, w20, [x22]

*In TAS, the STORE is pretty costly.*

2. I have added the needed comment in the patch. Updated patch attached.

----------------------
Thanks for taking a look at this, and do let me know if any more info is
needed.

> Is there some official ARM documentation, like a programmer's reference
> manual or something like that, that would show a reference
> implementation of a spinlock on ARM? It would be good to refer to an
> authoritative source on this.
>
> - Heikki

--
Regards,
Krunal Bauskar
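The micro-benchmark source follows. Assuming it is saved as
spin_bench.cpp (the file name and exact flags are assumptions, not
something stated above), it should build and run along these lines:

    g++ -O2 -DTAS -pthread spin_bench.cpp -lz -o spin_tas
    g++ -O2 -DCAS -pthread spin_bench.cpp -lz -o spin_cas
    ./spin_tas 128
    ./spin_cas 128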
#include <iostream>
#include <thread>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <zlib.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

const size_t data_block_size = 64 * 1024;
char data_block[64 * 1024];
size_t k_iter = 100;
unsigned long global_crcval = 0;

/* ---------------- spin-lock ---------------- */
uint32_t lock_unit;

void lock()
{
    while (true)
    {
#if defined(CAS)
        uint32_t expected = 0;
        if (!__atomic_compare_exchange_n(&lock_unit, &expected, 1, false,
                                         __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE))
        {
#elif defined(TAS)
        if (__sync_lock_test_and_set(&lock_unit, 1))
        {
#else
#error "compile with -DCAS or -DTAS"
#endif
            /* Spin-and-retry, as the lock is held by some other thread. */
            __asm__ __volatile__("" ::: "memory");
            continue;
        }
        break;
    }
}

void unlock()
{
#ifdef CAS
    __atomic_store_n(&lock_unit, 0, __ATOMIC_RELEASE);
#else
    __sync_lock_release(&lock_unit);
#endif
}

/* ---------------- workload ---------------- */
void workload_execute()
{
    for (size_t i = 0; i < k_iter; ++i)
    {
        /* Each thread takes the lock -> executes the critical
         * section -> unlocks. */
        lock();
        memset(data_block, rand() % 255, data_block_size);
        unsigned long crcval = 0;
        /* zlib's crc32() returns the updated CRC value. */
        crcval = crc32(crcval, (const unsigned char *) data_block,
                       data_block_size);
        global_crcval += crcval;
        unlock();
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        std::cerr << "usage: <program> <number-of-threads/parallelism>"
                  << std::endl;
        return 1;
    }

    size_t num_of_threads = atol(argv[1]);
    std::thread *handles[num_of_threads];

    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_of_threads; ++i)
        handles[i] = new std::thread(workload_execute);
    for (size_t i = 0; i < num_of_threads; ++i)
        handles[i]->join();
    auto finish = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < num_of_threads; ++i)
        delete handles[i];

    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
}
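The per-instruction profiles quoted in the message above come from perf;
something along these lines should reproduce them (the exact invocation
is an assumption):

    perf record ./spin_tas 512
    perf annotate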
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6fb3..5110de37d4 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -321,7 +321,33 @@ tas(volatile slock_t *lock)
  * than other widths.
  */
 #if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC__SYNC_INT32_TAS
+
+#ifdef HAVE_GCC__ATOMIC_INT32_CAS
+/* Reuse the same flag to avoid re-declaring the default tas functions below. */
+#define HAS_TEST_AND_SET
+
+/*
+ * On ARM, compare-and-swap is load -> check if value == expected
+ *   -> if yes: store
+ *   -> if no:  break
+ * whereas swap is load -> store, i.e. the store always executes, even
+ * when it merely replaces the value with the value already there.
+ * The store is a write operation, and an atomic one at that, so it is
+ * costlier and better avoided; compare-and-swap achieves exactly that.
+ */
+#define TAS(lock) cas(lock)
+typedef int slock_t;
+
+static __inline__ int
+cas(volatile slock_t *lock)
+{
+	slock_t expected = 0;
+	return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
+			false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
+}
+
+#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE);
+#elif HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
 
 #define TAS(lock) tas(lock)
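To see the patched primitives in isolation, here is a minimal standalone
sketch (an illustration adapted from the diff above, not part of the
patch; file name and build command are assumptions) that inlines the new
definitions and exercises one lock/unlock cycle:

/*
 * cas_check.c -- exercise the cas()-based TAS()/S_UNLOCK() pair.
 * Build: gcc -O2 cas_check.c -o cas_check
 */
#include <stdio.h>

typedef int slock_t;

static __inline__ int
cas(volatile slock_t *lock)
{
	slock_t expected = 0;
	return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
			false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
}

#define TAS(lock) cas(lock)
#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE)

int
main(void)
{
	slock_t lock = 0;

	/* First attempt succeeds: TAS() returns 0 when the lock is taken. */
	printf("first TAS:  %d (expect 0: acquired)\n", TAS(&lock));

	/* Second attempt fails on the compare, without issuing a store. */
	printf("second TAS: %d (expect 1: busy)\n", TAS(&lock));

	S_UNLOCK(&lock);

	/* After the release the lock can be acquired again. */
	printf("third TAS:  %d (expect 0: reacquired)\n", TAS(&lock));
	return 0;
}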