On Fri, Dec 13, 2024 at 11:00 AM Nathan Bossart
<nathandboss...@gmail.com> wrote:
> On Fri, Dec 13, 2024 at 02:29:53AM +1300, Thomas Munro wrote:
> > Here's an experimental patch to try changing that policy.  It improves
> > the connection times on my small computer with your test, but I doubt
> > I'm seeing the real issue.  But in theory, assuming a backlog of
> > connections and workers to start, I think each server loop should be
> > able to accept and fork one client backend, and fork up to 100
> > (MAX_BGWORKERS_TO_LAUNCH) background workers.
>
> Thanks for the quick response!  I'm taking a look at the patch...

After sleeping on the let-a-hundred-workers-fork logic, I dug out
aa1351f1eec4 (2017) and read a couple of interesting things:

    One could also question the wisdom of using an O(N^2) processing
    method if the system is intended to support so many bgworkers.

So that's about the linear search for workers to queue up for starting,
in BackgroundWorkerStateChange().  Hmm.  I suppose we should only really
process PM signals when we see WL_LATCH_SET; otherwise we'd run
BackgroundWorkerStateChange() twice as often as before, now that the 0001
patch feeds you WL_LATCH_SET and WL_SOCKET_ACCEPT at the same time under
your backlog-of-all-kinds-of-events workload.  Maybe it doesn't matter
much, but it'd be an unintended behavioural change, so here is a patch.
You could call it a revert of 1/4 of 239b1753.  I wondered whether that
might have problematic side effects for other PM signal kinds, but I've
looked and can't see any.  And:

    There is talk of rewriting the postmaster to use a WaitEventSet to
    avoid the signal-response-delay problem, but I'd argue that this change
    should be kept even after that happens (if it ever does).

It did happen, and we did keep that.  Someone might want to revisit that
policy, which gives workers massive preference over clients: the 100 is
more of a cap, and in practice it means "start all pending workers", a
bit like how child exit means "reap all dead children", while we accept
only one connection per listening socket per server loop.  But no one has
complained in half a decade, until, if I've intuited correctly, this
report of what I assume is unbounded socket event starvation.  On the
other hand, you only said that v16 was worse than v15, not that v15 was
actually good!  The workload could easily have been suffering from
undiagnosed and unwanted massive bias on v15 already.  And starting a lot
of workers at once might be an even more serious issue if your system is
very slow at forking, or if it doesn't even have fork() and PostgreSQL
doesn't have threads.
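
To make that asymmetry concrete, here's a throwaway model of the current
policy (plain C, not PostgreSQL code, and the backlog sizes are invented
just for illustration): per server loop iteration we accept one connection
per ready listening socket, but start up to MAX_BGWORKERS_TO_LAUNCH
pending workers.

    #include <stdio.h>

    #define MAX_BGWORKERS_TO_LAUNCH 100  /* per-loop cap, as in maybe_start_bgworkers() */

    int
    main(void)
    {
        int pending_clients = 1000;   /* hypothetical connection backlog */
        int pending_workers = 1000;   /* hypothetical bgworker start backlog */
        int nsockets = 1;             /* listening sockets */

        for (int loop = 1; pending_clients > 0 || pending_workers > 0; loop++)
        {
            /* one accept per ready listening socket per loop */
            int accepted = pending_clients < nsockets ? pending_clients : nsockets;
            /* start all pending workers, up to the per-loop cap */
            int started = pending_workers < MAX_BGWORKERS_TO_LAUNCH ?
                pending_workers : MAX_BGWORKERS_TO_LAUNCH;

            pending_clients -= accepted;
            pending_workers -= started;

            if (started > 0 && pending_workers == 0)
                printf("worker backlog drained after %d loops\n", loop);
            if (accepted > 0 && pending_clients == 0)
                printf("client backlog drained after %d loops\n", loop);
        }

        return 0;
    }

With those made-up numbers the worker backlog drains in 10 iterations
while the client backlog takes 1000, which is the bias I mean.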

But like 04a09ee9, for me this is not really about a resource
allocation policy that we could debate and perhaps adjust separately.
The main question is whether complete starvation (DOS) is possible, or
whether the postmaster should always try to service all of its
conceptual work queues within bounded delays, no matter what is
happening on the others.  I think it should.

The 0001 patch is unchanged; the 0002 patch sketches out a response to the
observation a couple of paragraphs above.
From 1d87696494834087c3391ae172088a8f7f5b5799 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 12 Dec 2024 23:46:50 +1300
Subject: [PATCH v2 1/2] Latches shouldn't starve socket event delivery.

If a WaitEventSetWait() caller wants multiple events, an already-set
latch shouldn't prevent other events from being reported at the same
time.  It should certainly prevent sleeping, but that just requires
using a zero timeout when polling the kernel.

The main caller of WaitEventSetWait() with nevents > 1 is the
postmaster.  Even if its latch is frequently set, we don't want socket
accept processing to be starved.

XXX experimental

Reported-by: Nathan Bossart <nathandboss...@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLOcxUa6m7UinPN1gZXFyr92L8btG_pGTHPiWY2YbRw2w%40mail.gmail.com
---
 src/backend/storage/ipc/latch.c | 35 ++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f19ce5e7aff..67ace0964d7 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1452,9 +1452,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		int			rc;
 
 		/*
-		 * Check if the latch is set already. If so, leave the loop
-		 * immediately, avoid blocking again. We don't attempt to report any
-		 * other events that might also be satisfied.
+		 * Check if the latch is set already first.  If so, we either exit
+		 * immediately or ask the kernel for any more events available right
+		 * now without waiting, depending on how many events the caller wants.
 		 *
 		 * If someone sets the latch between this and the
 		 * WaitEventSetWaitBlock() below, the setter will write a byte to the
@@ -1499,7 +1499,16 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 			/* could have been set above */
 			set->latch->maybe_sleeping = false;
 
-			break;
+			if (returned_events == nevents)
+				break;	/* output buffer full already */
+
+			/*
+			 * Even though we already have an event, we'll poll just once with
+			 * zero timeout to see what non-latch events we can fit into the
+			 * output buffer at the same time.
+			 */
+			cur_timeout = 0;
+			timeout = 0;
 		}
 
 		/*
@@ -1508,18 +1517,16 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		 * to retry, everything >= 1 is the number of returned events.
 		 */
 		rc = WaitEventSetWaitBlock(set, cur_timeout,
-								   occurred_events, nevents);
+								   occurred_events, nevents - returned_events);
 
-		if (set->latch)
-		{
-			Assert(set->latch->maybe_sleeping);
+		if (set->latch &&
+			set->latch->maybe_sleeping)
 			set->latch->maybe_sleeping = false;
-		}
 
 		if (rc == -1)
 			break;				/* timeout occurred */
 		else
-			returned_events = rc;
+			returned_events += rc;
 
 		/* If we're not done, update cur_timeout for next iteration */
 		if (returned_events == 0 && timeout >= 0)
@@ -1607,7 +1614,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* Drain the signalfd. */
 			drain();
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1766,7 +1773,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		if (cur_event->events == WL_LATCH_SET &&
 			cur_kqueue_event->filter == EVFILT_SIGNAL)
 		{
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -1891,7 +1898,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			/* There's data in the self-pipe, clear it. */
 			drain();
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
@@ -2107,7 +2114,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			if (!ResetEvent(set->handles[cur_event->pos + 1]))
 				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
 
-			if (set->latch && set->latch->is_set)
+			if (set->latch && set->latch->maybe_sleeping && set->latch->is_set)
 			{
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_LATCH_SET;
-- 
2.47.0

From bf0e1069cb6898f24a5607bdf1d0291f1f351be1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 13 Dec 2024 01:11:01 +1300
Subject: [PATCH v2 2/2] Adjust pending_pm_pmsignal processing priority.

Commit 239b1753 gave high priority to shutdown requests, child exit
notifications, reload requests and PM signals received from backends.

Undo that last case.  While the others have good reasons for
queue-jumping (supremacy of administrative commands, ordering
requirement of reload-connect sequences, and desire to release child
process resources ASAP), PM signals do not.  Let's go back to processing
PM signals only when a WL_LATCH_SET event is received, and thus only
once per server loop.

It didn't make much difference before, because in the common case only
WL_LATCH_SET was reported and we'd very likely run the inner loop that
scans events[] just once.  Now that we've adjusted WaitEventSetWait()
to return WL_LATCH_SET and potentially WL_SOCKET_ACCEPT at the same time
when the latch was already set, let's reconsider that.  We don't want to
run BackgroundWorkerStateChange() and similar PM-signal-processing code
twice as often as before in high load scenarios.

XXX experimental

Reported-by: Nathan Bossart <nathandboss...@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGLOcxUa6m7UinPN1gZXFyr92L8btG_pGTHPiWY2YbRw2w%40mail.gmail.com
---
 src/backend/postmaster/postmaster.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f0f9c66487c..11e7a938f9f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1657,7 +1657,8 @@ ServerLoop(void)
 			 * didn't see WL_LATCH_SET.  This gives high priority to shutdown
 			 * and reload requests where the latch happens to appear later in
 			 * events[] or will be reported by a later call to
-			 * WaitEventSetWait().
+			 * WaitEventSetWait().  We also want to release child resources as
+			 * soon as possible, before accepting sockets.
 			 */
 			if (pending_pm_shutdown_request)
 				process_pm_shutdown_request();
@@ -1665,8 +1666,17 @@ ServerLoop(void)
 				process_pm_reload_request();
 			if (pending_pm_child_exit)
 				process_pm_child_exit();
-			if (pending_pm_pmsignal)
-				process_pm_pmsignal();
+
+			/*
+			 * When receiving high frequency PM signals, we only want to
+			 * process them once per server loop, not once per event of any
+			 * kind.  They are neither very high priority, nor causally linked
+			 * to socket events that might arrive out of order, so we defer
+			 * processing until we see WL_LATCH_SET.
+			 */
+			if (events[i].events & WL_LATCH_SET)
+				if (pending_pm_pmsignal)
+					process_pm_pmsignal();
 
 			if (events[i].events & WL_SOCKET_ACCEPT)
 			{
-- 
2.47.0
