Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for
Intel architectures. The goal is to slow down the spinlock tight loop
and thus prevent it from eating CPU and causing CPU starvation, so
that other processes get their fair share of CPU time. The Intel
documentation [1] clearly mentions this, along with other benefits of
PAUSE, such as lower power consumption and avoidance of memory-order
violations when exiting the loop.
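
(For reference, the x86 spin_delay() in s_lock.h is roughly the
following -- quoting from memory, so the surrounding comment text may
differ. "rep; nop" encodes the same bytes as PAUSE, so it degrades to a
plain nop on processors that don't implement the hint.)

#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{
	/* "rep; nop" is the encoding of the PAUSE instruction */
	__asm__ __volatile__(
		" rep; nop			\n");
}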

Similar to PAUSE, the ARM architecture has the YIELD instruction, which
is also clearly documented [2]. It explicitly says that it is a way to
hint to the CPU that it is being executed in a spinlock loop and that
this thread can be preempted.  But for ARM, we are not using any kind
of spin delay.

For PG spinlocks, the goal of both of these instructions is the same,
and both architectures recommend using them in spinlock loops.
Also, I found multiple places where YIELD is already used in similar
situations: the Linux kernel [3] and OpenJDK [4],[5].
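
If I recall the kernel source correctly, the arm64 cpu_relax() referenced
in [3] is essentially just this hint plus a memory clobber:

static inline void cpu_relax(void)
{
	asm volatile("yield" ::: "memory");
}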

Now, on ARM implementations that don't implement YIELD, it runs as a
no-op.  Unfortunately the ARM machine I have does not implement YIELD.
But recently there have been some ARM implementations that are
hyperthreaded, so they are expected to actually act on the YIELD,
although the docs do not explicitly say that YIELD has to be
implemented only by hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective
architectures, once with the instruction present and once with it
removed.  I didn't see any change in the TPS numbers; they were
more or less the same. For ARM, this was expected because my ARM
machine does not implement YIELD.

On my Intel Xeon machine with 8 cores, I also tried to test PAUSE
using a sample C program (attached spin.c). Here, many child processes
(many more than there are CPUs) wait in a tight loop for a shared
variable to become 0, while the parent process continuously increments
a sequence number for a fixed amount of time, after which it sets the
shared variable to 0. The children's tight loop calls PAUSE in each
iteration. What I hoped was that, because of PAUSE in the children,
the parent process would get a larger share of the CPU, so that in a
given time the sequence number would reach a higher value. I also
expected the CPU time spent by the child processes to drop, thanks to
PAUSE. Neither of these happened; there was no change.

Possibly this test case is not right. Probably the preemption occurs
only within the set of hyperthreads attached to a single core, and in
my test case the parent process is the only one that is ready to run.
Still, I have attached the program (spin.c) for the archives, in case
somebody with a YIELD-supporting ARM machine wants to use it to test
YIELD.

Nevertheless, because we have clear documentation that strongly
recommends using it, and because it is already used in similar
situations such as the Linux kernel and the JDK, I think we should
start using YIELD for spin_delay() on ARM.

Attached is the trivial patch (spin_delay_for_arm.patch). To start
with, it contains changes only for aarch64. I haven't yet added a
configure[.in] check to make sure that yield compiles successfully
(YIELD is present in the manuals from ARMv6 onwards); I thought I
would get some comments first, so the configure changes are not done
yet.
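
The configure check would essentially just need to verify that a
trivial test program like the following compiles and assembles (this is
only a sketch of what I have in mind, not part of the attached patch):

int
main(void)
{
	__asm__ __volatile__(" yield \n");
	return 0;
}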


[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html


--
Thanks,
-Amit Khandekar
Huawei Technologies
/*
 * Sample program to test the effect of the PAUSE/YIELD instruction in a highly
 * contended scenario.  The Intel and ARM docs recommend the use of PAUSE and
 * YIELD respectively, in spinlock tight loops.
 *
 * This program can be run with:
 * gcc -O3 -o spin spin.c -lrt ; ./spin [number_of_processes]
 * By default, 4 processes are spawned.
 *
 * Child processes wait in a tight loop for a shared variable to become 0,
 * while the parent process continuously increments a sequence number for a
 * fixed amount of time, after which it sets the shared variable to 0. The
 * child tight loop calls YIELD/PAUSE in each iteration.
 *
 * The intention is to create a number of processes much larger than the
 * number of available CPUs, so that the scheduler hopefully preempts the
 * children because of the PAUSE, and the main process gets a larger CPU
 * share and hence increments its sequence number more times. So the
 * expectation is that with PAUSE the program will end up with a much higher
 * sequence number than without PAUSE. Similarly, the child processes should
 * consume less CPU time with PAUSE than without it.
 *
 * Author: Amit Khandekar
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>        /* For mode constants */
#include <fcntl.h>
#include <float.h>
#include <time.h>
#include <signal.h>

#include <unistd.h>
#include <sys/types.h>

#define SIZE sizeof(int)
#define SHM_NAME "/shm"
#define RUN_DURATION 15

volatile char timer_exceeded = 0;

typedef void     (*sigfunc_type)(int);

static void pqsignal(int signo, sigfunc_type func);
static void handle_sig_alarm(int dummy);

static __inline__ void
spin_delay(void)
{
	/*
	 * Insert the architecture's spin-loop hint: YIELD on aarch64, PAUSE
	 * otherwise (the aarch64 branch lets the program build on ARM too).
	 */
#if defined(__aarch64__)
	__asm__ __volatile__(
		" yield			\n");
#else
	__asm__ __volatile__(
		" pause			\n");
#endif
}

int main(int argn, char *argv[])
{
	int	i, fd;
	int nprocs = 4;
	int childpid = 0;
	volatile void *shared_address;

	if (argn > 1)
		sscanf(argv[1], "%d", &nprocs);

	if ((fd = shm_open(SHM_NAME, O_RDWR | O_CREAT, S_IRWXU | S_IRWXG | S_IRWXO)) == -1)
	{
		perror("Could not create shared memory");
		return -1;
	}

	if (ftruncate(fd, SIZE) < 0)
	{
		perror("ftruncate failed");
		return -1;
	}

	shared_address = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (shared_address == MAP_FAILED)
	{
		perror("mmap failed");
		if (shm_unlink(SHM_NAME) != 0)
			perror("could not destroy shared memory");
		return -1;
	}
	close(fd);

	/* This will cause children to keep spinning until it is set back to 0 */
	*(volatile int *) shared_address = 1;

	/* Spawn children */
	for (i = 0; i < nprocs; i++)
	{
		if (fork() == 0)
		{
			childpid = getpid();
			break;
		}
	}

	if (childpid == 0) /* Am I a parent ? */
	{
		double dbl = -10000000; /* Some random initial value */

		/* For RUN_DURATION seconds, keep incrementing the sequence number */
		pqsignal(SIGALRM, handle_sig_alarm);
		alarm(RUN_DURATION);
		while (!timer_exceeded)
		{
			dbl += 1;
		}
		printf("Final sequence number: %g\n", dbl);

		/* Unblock the children  */
		*(volatile int *) shared_address = 0;

		if (shm_unlink(SHM_NAME) != 0)
		{
			perror("could not destroy shared memory");
			return -1;
		}
	}
	else /* I am a child */
	{
		clock_t cpu_time = clock();
		int num;
		volatile int *add = (int*) shared_address;

		/* Keep on spinning with delay, until parent unblocks me */
		do
		{
			spin_delay();
			num = *add;
		}
		while (num == 1);

		printf("pid: %d; cpu cycles by me: %ld\n",
			   childpid, (long) (clock() - cpu_time));

		/* We have come out of loop, that means parent set *shared_address to 0 */
	}

	return 0;
}

static void
pqsignal(int signo, sigfunc_type func)
{
	struct sigaction act, oact;

	act.sa_handler = func;
	sigemptyset(&act.sa_mask);
	act.sa_flags = SA_RESTART;
	if (sigaction(signo, &act, &oact) < 0)
	{
		perror("sigaction returned error");
		exit(-1);
	}
}

static void
handle_sig_alarm(int dummy)
{
	timer_exceeded = 1;
}

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6fb3..72fd0d2a6f 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -337,6 +337,31 @@ tas(volatile slock_t *lock)
 #define S_UNLOCK(lock) __sync_lock_release(lock)
 
 #endif	 /* HAVE_GCC__SYNC_INT32_TAS */
+
+/* Use YIELD only for aarch64, for now. */
+#if defined(__aarch64__) || defined(__aarch64)
+
+#define SPIN_DELAY() spin_delay()
+
+static __inline__ void
+spin_delay(void)
+{
+	/*
+	 * The ARM Architecture Manual recommends the use of the YIELD instruction:
+	 * "The YIELD instruction provides a hint that the task performed by a
+	 * thread is of low importance so that it could yield.  This mechanism can
+	 * be used to improve overall performance in a Symmetric Multithreading
+	 * (SMT) or Symmetric Multiprocessing (SMP) system.  Examples of when the
+	 * YIELD instruction might be used include a thread that is sitting in a
+	 * spin-lock, or where the arbitration priority of the snoop bit in an SMP
+	 * system is modified. The YIELD instruction permits binary compatibility
+	 * between SMT and SMP systems.  The YIELD instruction is a NOP hint
+	 * instruction."
+	 */
+	__asm__ __volatile__(
+		" yield			\n");
+}
+#endif	 /* __aarch64__ || __aarch64 */
 #endif	 /* __arm__ || __arm || __aarch64__ || __aarch64 */
 
 
