"Blake, Geoff" <blakg...@amazon.com> writes:
> Hope everything is well going into the new year.  I'd like to pick this
> discussion back up and get your thoughts on the patch, given the data I
> posted two weeks prior.  Is there more data that would be helpful?  A
> different setup?  Data on older versions of PostgreSQL, to ascertain
> whether it makes more sense on versions predating the large rework of
> the snapshot algorithm, which exhibited quite a bit of synchronization
> contention?

I spent some time working on this.  I don't have a lot of faith in
pgbench as a measurement testbed for spinlock contention, because over
the years we've done a good job of getting rid of that in our main
code paths (both the specific change you mention, and many others).
After casting around a bit and thinking about writing a bespoke test
framework, I landed on the idea of adding some intentional spinlock
contention to src/test/modules/test_shm_mq, which is a prefab test
framework for passing data among multiple worker processes.  The
attached quick-hack patch makes it grab and release a spinlock once
per passed message.  I'd initially expected that this would show only
marginal changes, because you'd hope that a spinlock acquisition would
be reasonably cheap compared to shm_mq_receive plus shm_mq_send.
Turns out not though.

The proposed test case is

(1) patch test_shm_mq as below

(2) time this query:

SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, n);

for various values of "n" up to about the number of cores you have.
(You'll probably need to bump up max_worker_processes.)
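To spell that out, the whole sequence is roughly as follows (the four
arguments to test_shm_mq_pipelined are queue size, message payload,
repeat count, and worker count; 16 for max_worker_processes is
arbitrary, any value covering your largest "n" will do):

# in postgresql.conf, then restart:
max_worker_processes = 16

-- then in psql:
CREATE EXTENSION test_shm_mq;
\timing on
SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);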

For context, on my Intel-based main workstation (8-core Xeon W-2245),
the time to do this with stock test_shm_mq stays fairly level no
matter how many workers are involved.  Reporting best-of-3 runs:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1386.413 ms (00:01.386)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 1302.503 ms (00:01.303)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 1373.121 ms (00:01.373)

However, after applying the contention patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1346.362 ms (00:01.346)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3313.490 ms (00:03.313)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 7660.329 ms (00:07.660)

So it seems that (a) this is a plausible model for code that has
unoptimized spinlock contention, and (b) the effects are large
enough that you needn't fret too much about measurement noise.

I tried this out on a handy Apple M1 mini, which I concede
is not big iron but it's pretty decent aarch64 hardware.
With current HEAD's spinlock code, I get (again best-of-3):

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1630.255 ms (00:01.630)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3495.066 ms (00:03.495)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19541.929 ms (00:19.542)

With your spin-delay patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1643.524 ms (00:01.644)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3404.625 ms (00:03.405)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19260.721 ms (00:19.261)

So I don't see a lot of reason to think your patch changes anything.
Maybe on something with more cores?

For grins I also tried this same test with the use-CAS-for-TAS patch
that was being discussed November before last, and it didn't
really show up as any improvement either:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1608.642 ms (00:01.609)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3396.564 ms (00:03.397)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 20092.683 ms (00:20.093)

Maybe that's a little better in the uncontended (single-worker)
case, but it's worse at the high end.
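(For reference, the gist of that patch, as a rough sketch rather than
the actual submitted code, is to implement the test-and-set primitive
with a compare-and-swap, so that an acquisition attempt that finds the
lock already held performs no store and hence doesn't steal the cache
line from the current holder.  slock_t here is a stand-in; the real
type is per-platform.)

typedef int slock_t;

static inline int
tas(volatile slock_t *lock)
{
	/* CAS only writes when it sees the lock free; returns nonzero
	 * ("failed") if the lock was already held. */
	return __sync_val_compare_and_swap(lock, 0, 1) != 0;
}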

I'm really curious to hear if this measurement method shows
any interesting improvements on your hardware.

                        regards, tom lane

diff --git a/src/test/modules/test_shm_mq/setup.c b/src/test/modules/test_shm_mq/setup.c
index e05e97c6de..d535dbe408 100644
--- a/src/test/modules/test_shm_mq/setup.c
+++ b/src/test/modules/test_shm_mq/setup.c
@@ -135,6 +135,7 @@ setup_dynamic_shared_memory(int64 queue_size, int nworkers,
 	hdr->workers_total = nworkers;
 	hdr->workers_attached = 0;
 	hdr->workers_ready = 0;
+	hdr->messages_xfred = 0;
 	shm_toc_insert(toc, 0, hdr);
 
 	/* Set up one message queue per worker, plus one. */
diff --git a/src/test/modules/test_shm_mq/test_shm_mq.h b/src/test/modules/test_shm_mq/test_shm_mq.h
index a666121834..ff35211811 100644
--- a/src/test/modules/test_shm_mq/test_shm_mq.h
+++ b/src/test/modules/test_shm_mq/test_shm_mq.h
@@ -32,6 +32,7 @@ typedef struct
 	int			workers_total;
 	int			workers_attached;
 	int			workers_ready;
+	uint64		messages_xfred;
 } test_shm_mq_header;
 
 /* Set up dynamic shared memory and background workers for test run. */
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 9b037b98fe..5aceb8ca73 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -31,7 +31,8 @@
 static void attach_to_queues(dsm_segment *seg, shm_toc *toc,
 							 int myworkernumber, shm_mq_handle **inqhp,
 							 shm_mq_handle **outqhp);
-static void copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh);
+static void copy_messages(volatile test_shm_mq_header *hdr,
+						  shm_mq_handle *inqh, shm_mq_handle *outqh);
 
 /*
  * Background worker entrypoint.
@@ -131,7 +132,7 @@ test_shm_mq_main(Datum main_arg)
 	SetLatch(&registrant->procLatch);
 
 	/* Do the work. */
-	copy_messages(inqh, outqh);
+	copy_messages(hdr, inqh, outqh);
 
 	/*
 	 * We're done.  For cleanliness, explicitly detach from the shared memory
@@ -173,7 +174,8 @@ attach_to_queues(dsm_segment *seg, shm_toc *toc, int myworkernumber,
  * after this point is cleanup.
  */
 static void
-copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
+copy_messages(volatile test_shm_mq_header *hdr,
+			  shm_mq_handle *inqh, shm_mq_handle *outqh)
 {
 	Size		len;
 	void	   *data;
@@ -189,7 +191,12 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
 		if (res != SHM_MQ_SUCCESS)
 			break;
 
-		/* Send it back out. */
+		/* Create some contention on the mutex. */
+		SpinLockAcquire(&hdr->mutex);
+		++hdr->messages_xfred;
+		SpinLockRelease(&hdr->mutex);
+
+		/* Send the message back out. */
 		res = shm_mq_send(outqh, len, data, false, true);
 		if (res != SHM_MQ_SUCCESS)
 			break;
