Re: connection establishment versus parallel workers

Thomas Munro Sun, 19 Jan 2025 21:34:49 -0800

On Tue, Jan 14, 2025 at 9:42 AM Thomas Munro <[email protected]> wrote:
> On Tue, Jan 14, 2025 at 8:50 AM Nathan Bossart <[email protected]> 
> wrote:
> > I gave these a closer look, and I still feel that they are both
> > straightforward and reasonable.  IIUC the main open question is whether
> > this might cause problems for other PM signal kinds.  Like you, I don't see
> > anything immediately obvious there, but I'll admit I'm not terribly
> > familiar with the precise characteristics of postmaster signals.  In any
> > case, 0001 feels pretty safe to me.
>
> Cool.  Thanks.  I'll think about what else could be affected by that
> change as you say, and if nothing jumps out I'll go ahead and commit
> them, back to 16.


I pushed 0001, addressing the main problem.

I think 0002 described and addressed a real phenomenon but only when
you have multiple sockets with non-empty listen queues.  If we fixed
the real underlying problems it wouldn't be an issue.  I decided to
unsee that for now.

> I have done a lot more study of this problem and was about to write in
> with some more patches to propose for master only.  Basically that
> "100" is destroying performance in this workload, which at least on my
> machine hardly gets any parallelism at all, and only in sporadic
> bursts.  You can argue that we aren't designed for high frequency
> short-lived workers (we'll have to reuse workers in some way to be
> good at that), but I don't think it has to fail as badly as it does
> today.  It falls off a cliff instead of plateauing: we are so busy
> forking that we don't get around to reaping children, so all our slots
> are (artificially) used up most of the time, and the queries that do
> manage to nab one then sit on their hands for a long time at query
> end.  "1" gets much smoother results, but as prophesied in aa1351f1,
> the complexity is terrible, possibly even O(n^3) in places depending
> on how you count: there are many places that scan the whole worker
> list, and one that even scans it again for each item, and that is for
> each thing that starts.  IOW we have to fix the complexity
> fundamentally.  I have a WIP patch that adds a couple of work queues,
> so that the postmaster never has to consider anything more than the
> head of a queue in various places.  More soon...

Here's the WIP code I have up with for that so far.

Remaining opportunities not attempted:
1.  When a child exits, we could use a hash table to find it by pid.
2.  When looking for a bgworker slot that is not in use, we could do
something better than linear search.

From 3a323e9eda8df187961ba5efc2b0d63af9520cfe Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Tue, 31 Dec 2024 16:37:38 +1300
Subject: [PATCH 1/3] Remove BackgroundWorkerStateChange()'s outer loop.

Previously, if a backend had modified a background worker slot and sent
PMSIGNAL_BACKGROUND_WORKER_CHANGE, then the postmaster would scan all
slots to find what changed.  Add a circular change queue.  The
requesting backend inserts the index of the changed slot into the queue,
and the postmaster consumes it from there directly.  The values are
range-checked so that the postmaster is no less robust against rogue
backends trashing memory.

Discussion: https://postgr.es/m/Z1n5UpAiGDmFcMmd%40nathan
---
 src/backend/postmaster/bgworker.c | 203 +++++++++++++++++++++++-------
 1 file changed, 160 insertions(+), 43 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index b288915cec8..0de0422bb53 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -96,9 +96,26 @@ typedef struct BackgroundWorkerArray
 	int			total_slots;
 	uint32		parallel_register_count;
 	uint32		parallel_terminate_count;
+
+	/*
+	 * A circular queue of slot change notifications.  Backends write slot
+	 * numbers into this queue at the "head" position to tell the postmaster
+	 * that a slot is now being used, or should now be terminated.
+	 */
+	int		   *change_queue;
+	uint32		change_queue_tail;	/* postmaster advances (consumer) */
+	uint32		change_queue_head;	/* backends advance (producers) */
+
 	BackgroundWorkerSlot slot[FLEXIBLE_ARRAY_MEMBER];
 } BackgroundWorkerArray;
 
+/*
+ * Each slot can have at most two unconsumed events produced by a backend:
+ * when in_use is set, and when terminate is set.  We need one extra element
+ * because tail==head means empty.
+ */
+#define BGWORKER_CHANGE_QUEUE_SIZE (max_worker_processes * 2 + 1)
+
 struct BackgroundWorkerHandle
 {
 	int			slot;
@@ -151,6 +168,7 @@ BackgroundWorkerShmemSize(void)
 	size = offsetof(BackgroundWorkerArray, slot);
 	size = add_size(size, mul_size(max_worker_processes,
 								   sizeof(BackgroundWorkerSlot)));
+	size = add_size(size, mul_size(BGWORKER_CHANGE_QUEUE_SIZE, sizeof(int)));
 
 	return size;
 }
@@ -175,6 +193,15 @@ BackgroundWorkerShmemInit(void)
 		BackgroundWorkerData->parallel_register_count = 0;
 		BackgroundWorkerData->parallel_terminate_count = 0;
 
+		/*
+		 * Set up the change notification queue.  The queue itself is after
+		 * slots[] in memory.
+		 */
+		BackgroundWorkerData->change_queue = (int *)
+			&BackgroundWorkerData->slot[max_worker_processes];
+		BackgroundWorkerData->change_queue_head = 0;
+		BackgroundWorkerData->change_queue_tail = 0;
+
 		/*
 		 * Copy contents of worker list into shared memory.  Record the shared
 		 * memory slot assigned to each worker.  This ensures a 1-to-1
@@ -195,6 +222,10 @@ BackgroundWorkerShmemInit(void)
 			rw->rw_shmem_slot = slotno;
 			rw->rw_worker.bgw_notify_pid = 0;	/* might be reinit after crash */
 			memcpy(&slot->worker, &rw->rw_worker, sizeof(BackgroundWorker));
+
+			/* Enqueue change notifications for all populated slots. */
+			BackgroundWorkerData->change_queue[BackgroundWorkerData->change_queue_head++] = slotno;
+
 			++slotno;
 		}
 
@@ -208,6 +239,7 @@ BackgroundWorkerShmemInit(void)
 			slot->in_use = false;
 			++slotno;
 		}
+
 	}
 	else
 		Assert(found);
@@ -235,7 +267,7 @@ FindRegisteredWorkerBySlotNumber(int slotno)
 }
 
 /*
- * Notice changes to shared memory made by other backends.
+ * Process state change notifications enqueued by backends.
  * Accept new worker requests only if allow_new_workers is true.
  *
  * This code runs in the postmaster, so we must be very careful not to assume
@@ -245,41 +277,80 @@ FindRegisteredWorkerBySlotNumber(int slotno)
 void
 BackgroundWorkerStateChange(bool allow_new_workers)
 {
-	int			slotno;
+	uint32		tail;
+	uint32		head;
+
+	Assert(!IsUnderPostmaster);
 
 	/*
-	 * The total number of slots stored in shared memory should match our
-	 * notion of max_worker_processes.  If it does not, something is very
-	 * wrong.  Further down, we always refer to this value as
-	 * max_worker_processes, in case shared memory gets corrupted while we're
-	 * looping.
+	 * The queue head might be advanced concurrently, but in that case we'll
+	 * receive another PMSIGNAL_BACKGROUND_WORKER_CHANGE notification.  So we
+	 * take one snapshot of the current values, and process only that range.
+	 * The signal handler is assumed to have been a memory barrier, so we see
+	 * the new head.  The tail is only written by the postmaster.
 	 */
-	if (max_worker_processes != BackgroundWorkerData->total_slots)
+	tail = BackgroundWorkerData->change_queue_tail;
+	head = BackgroundWorkerData->change_queue_head;
+
+	/* Sanity check queue range. */
+	if (tail >= BGWORKER_CHANGE_QUEUE_SIZE ||
+		head >= BGWORKER_CHANGE_QUEUE_SIZE)
 	{
+		/* This corruption isn't recoverable, so just squawk. */
 		ereport(LOG,
-				(errmsg("inconsistent background worker state (\"max_worker_processes\"=%d, total slots=%d)",
-						max_worker_processes,
-						BackgroundWorkerData->total_slots)));
+				(errmsg("background worker change queue range out of bounds (tail=%u, head=%u, size=%d)",
+						tail, head, BGWORKER_CHANGE_QUEUE_SIZE)));
 		return;
 	}
 
 	/*
-	 * Iterate through slots, looking for newly-registered workers or workers
-	 * who must die.
+	 * Make sure that we can load change_queue[] values and referenced slot[]
+	 * values contents that were stored before head was stored.  Pairs with
+	 * EnqueueBackgroundWorkerStateChange().
 	 */
-	for (slotno = 0; slotno < max_worker_processes; ++slotno)
+	pg_read_barrier();
+
+	/*
+	 * Consume from slot change queue, looking for newly-registered workers or
+	 * workers who must die.
+	 */
+	while (tail != head)
 	{
-		BackgroundWorkerSlot *slot = &BackgroundWorkerData->slot[slotno];
+		int			slotno = BackgroundWorkerData->change_queue[tail];
+		BackgroundWorkerSlot *slot;
 		RegisteredBgWorker *rw;
 
-		if (!slot->in_use)
+		/*
+		 * Advance tail.  It doesn't really need to be in shared memory
+		 * because we reserved space to make overflow impossible, but we put
+		 * it there so that EnqueueBackgroundWorkerStateChange() can make an
+		 * assertion check.
+		 */
+		if (++tail == BGWORKER_CHANGE_QUEUE_SIZE)
+			tail = 0;
+		BackgroundWorkerData->change_queue_tail = tail;
+
+		/* Sanity check slot index. */
+		if (slotno < 0 || slotno >= max_worker_processes)
+		{
+			ereport(LOG,
+					(errmsg("background worker change queue contains out of range slot number %d",
+							slotno)));
 			continue;
+		}
 
 		/*
-		 * Make sure we don't see the in_use flag before the updated slot
-		 * contents.
+		 * We only expect to receive change notifications for slots that are
+		 * in use.
 		 */
-		pg_read_barrier();
+		slot = &BackgroundWorkerData->slot[slotno];
+		if (!slot->in_use)
+		{
+			ereport(LOG,
+					(errmsg("received background worker change request for slot %d but it is not in use",
+							slotno)));
+			continue;
+		}
 
 		/* See whether we already know about this worker. */
 		rw = FindRegisteredWorkerBySlotNumber(slotno);
@@ -312,6 +383,21 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 		if (!allow_new_workers)
 			slot->terminate = true;
 
+		/*
+		 * Try to allocate space for the registration data in the registered
+		 * workers list, or log and treat as terminated immediately.
+		 */
+		rw = MemoryContextAllocExtended(PostmasterContext,
+										sizeof(RegisteredBgWorker),
+										MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
+		if (rw == NULL)
+		{
+			ereport(LOG,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of memory")));
+			slot->terminate = true;
+		}
+
 		/*
 		 * If the worker is marked for termination, we don't need to add it to
 		 * the registered workers list; we can just free the slot. However, if
@@ -339,21 +425,10 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 			if (notify_pid != 0)
 				kill(notify_pid, SIGUSR1);
 
-			continue;
-		}
+			if (rw)
+				pfree(rw);
 
-		/*
-		 * Copy the registration data into the registered workers list.
-		 */
-		rw = MemoryContextAllocExtended(PostmasterContext,
-										sizeof(RegisteredBgWorker),
-										MCXT_ALLOC_NO_OOM | MCXT_ALLOC_ZERO);
-		if (rw == NULL)
-		{
-			ereport(LOG,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
-			return;
+			continue;
 		}
 
 		/*
@@ -415,6 +490,48 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 	}
 }
 
+/*
+ * Enqueue a change notification for a background worker slot.  The caller is
+ * reponsible for sending PMSIGNAL_BACKGROUND_WORKER_CHANGE, and serializing
+ * insertions with BackgroundWorkerLock.  The memory barriers pair with
+ * BackgroundWorkerStateChange() in the postmaster, which consumes from the
+ * queue without taking the lock.
+ */
+static inline void
+EnqueueBackgroundWorkerStateChange(int slotno)
+{
+	uint32		head;
+
+	Assert(IsUnderPostmaster);
+	Assert(LWLockHeldByMe(BackgroundWorkerLock));
+
+	/* Make sure slot contents are stored before queue contents. */
+	pg_write_barrier();
+
+	/* Write the slot number at head position in the queue. */
+	head = BackgroundWorkerData->change_queue_head;
+	Assert(head < BGWORKER_CHANGE_QUEUE_SIZE);
+	BackgroundWorkerData->change_queue[head] = slotno;
+
+	/* Make sure the queue entry is stored before the new head. */
+	pg_write_barrier();
+	if (++head == BGWORKER_CHANGE_QUEUE_SIZE)
+		head = 0;
+	BackgroundWorkerData->change_queue_head = head;
+
+	/*
+	 * It is impossible for the head to crash into the tail, because we
+	 * reserved space for the is_use and terminate state changes, and no other
+	 * changes are possible until the slot is recycled by the postmaster. This
+	 * barrier pairs with the memory barrier issued by the postmaster before
+	 * it clears in_use.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	pg_read_barrier();
+	Assert(head != BackgroundWorkerData->change_queue_tail);
+#endif
+}
+
 /*
  * Forget about a background worker that's no longer needed.
  *
@@ -1104,14 +1221,8 @@ RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
 			generation = slot->generation;
 			if (parallel)
 				BackgroundWorkerData->parallel_register_count++;
-
-			/*
-			 * Make sure postmaster doesn't see the slot as in use before it
-			 * sees the new contents.
-			 */
-			pg_write_barrier();
-
 			slot->in_use = true;
+			EnqueueBackgroundWorkerStateChange(slotno);
 			success = true;
 			break;
 		}
@@ -1301,16 +1412,22 @@ TerminateBackgroundWorker(BackgroundWorkerHandle *handle)
 	Assert(handle->slot < max_worker_processes);
 	slot = &BackgroundWorkerData->slot[handle->slot];
 
-	/* Set terminate flag in shared memory, unless slot has been reused. */
+	/*
+	 * Set terminate flag in shared memory, unless slot has been reused, or a
+	 * backend has already set terminate.  The latter check avoids overflowing
+	 * the chagne queue in the unlikely event of many calls to terminate the
+	 * same worker (see BGWORKER_CHANGE_QUEUE_SIZE).
+	 */
 	LWLockAcquire(BackgroundWorkerLock, LW_EXCLUSIVE);
-	if (handle->generation == slot->generation)
+	if (handle->generation == slot->generation && !slot->terminate)
 	{
 		slot->terminate = true;
+		EnqueueBackgroundWorkerStateChange(handle->slot);
 		signal_postmaster = true;
 	}
 	LWLockRelease(BackgroundWorkerLock);
 
-	/* Make sure the postmaster notices the change to shared memory. */
+	/* Tell postmaster to process state changes */
 	if (signal_postmaster)
 		SendPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE);
 }
-- 
2.47.1

From 5b41df0db817970ca2278a6354cc0790cd565a0c Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Tue, 31 Dec 2024 17:44:54 +1300
Subject: [PATCH 2/3] Remove BackgroundWorkerStateChange()'s inner loop.

Previously, when processing a slot that might have changed, the
postmaster would scan the list of registered background workers looking
for the one using that slot.

Maintain a table in postmaster private memory that can be used to find
entries by slot number.

Discussion: https://postgr.es/m/Z1n5UpAiGDmFcMmd%40nathan
---
 src/backend/postmaster/bgworker.c | 43 ++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 0de0422bb53..3e326e48540 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -39,6 +39,13 @@
  */
 dlist_head	BackgroundWorkerList = DLIST_STATIC_INIT(BackgroundWorkerList);
 
+/*
+ * An array of registered background workers indexed by slot number.
+ * BackgroundWorkerList entries are also in this table for fast lookup, and
+ * must be kept in sync.
+ */
+static RegisteredBgWorker **BackgroundWorkerTable;
+
 /*
  * BackgroundWorkerSlots exist in shared memory and can be accessed (via
  * the BackgroundWorkerArray) by both the postmaster and by regular backends.
@@ -202,6 +209,19 @@ BackgroundWorkerShmemInit(void)
 		BackgroundWorkerData->change_queue_head = 0;
 		BackgroundWorkerData->change_queue_tail = 0;
 
+		/*
+		 * Allocate an array to let us find RegisteredBgWorker objects by slot
+		 * number.
+		 *
+		 * XXX Where else could this allocation happen?
+		 */
+		if (PostmasterContext && !BackgroundWorkerTable)
+			BackgroundWorkerTable =
+				MemoryContextAllocExtended(PostmasterContext,
+										   sizeof(BackgroundWorkerTable[0]) *
+										   max_worker_processes,
+										   MCXT_ALLOC_ZERO);
+
 		/*
 		 * Copy contents of worker list into shared memory.  Record the shared
 		 * memory slot assigned to each worker.  This ensures a 1-to-1
@@ -223,6 +243,9 @@ BackgroundWorkerShmemInit(void)
 			rw->rw_worker.bgw_notify_pid = 0;	/* might be reinit after crash */
 			memcpy(&slot->worker, &rw->rw_worker, sizeof(BackgroundWorker));
 
+			/* Add to slot-number lookup table. */
+			BackgroundWorkerTable[rw->rw_shmem_slot] = rw;
+
 			/* Enqueue change notifications for all populated slots. */
 			BackgroundWorkerData->change_queue[BackgroundWorkerData->change_queue_head++] = slotno;
 
@@ -252,18 +275,11 @@ BackgroundWorkerShmemInit(void)
 static RegisteredBgWorker *
 FindRegisteredWorkerBySlotNumber(int slotno)
 {
-	dlist_iter	iter;
+	RegisteredBgWorker *rw = BackgroundWorkerTable[slotno];
 
-	dlist_foreach(iter, &BackgroundWorkerList)
-	{
-		RegisteredBgWorker *rw;
-
-		rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
-		if (rw->rw_shmem_slot == slotno)
-			return rw;
-	}
+	Assert(!rw || rw->rw_shmem_slot == slotno);
 
-	return NULL;
+	return rw;
 }
 
 /*
@@ -487,6 +503,9 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 								 rw->rw_worker.bgw_name)));
 
 		dlist_push_head(&BackgroundWorkerList, &rw->rw_lnode);
+
+		/* Add to slot-number lookup table. */
+		BackgroundWorkerTable[rw->rw_shmem_slot] = rw;
 	}
 }
 
@@ -551,6 +570,10 @@ ForgetBackgroundWorker(RegisteredBgWorker *rw)
 	slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
 	Assert(slot->in_use);
 
+	/* Remove from slot-number lookup table. */
+	Assert(BackgroundWorkerTable[rw->rw_shmem_slot] == rw);
+	BackgroundWorkerTable[rw->rw_shmem_slot] = NULL;
+
 	/*
 	 * We need a memory barrier here to make sure that the update of
 	 * parallel_terminate_count completes before the store to in_use.
-- 
2.47.1

From 40f77c2e714aa6adf39952cb9fecfcc19330c7b8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Thu, 9 Jan 2025 10:37:05 +1300
Subject: [PATCH 3/3] Remove loops over BackgroundWorkerList.

In maybe_start_bgworkers(), the postmaster would loop over
BackgroundWorkerList, possibly starting up to 100 workers.  In
DetermineSleepTime(), it would loop again to compute a rarely needed
nap time.

The reason for starting so many workers in each server loop was to try
to mitigate the O(n^2) or worse logic caused by the data structures used
to track workers (see commit aa1351f1).  Unfortunately it also delays
processing of new socket connections and child exit events, and also
create artificial shortages of backend and worker slots.

Address the book-keeping complexity directly, by introducing three new
queues organized in processing order.  Non-running workers sit in one of
them, according to their state:

  (1) ready to start now,
  (2) waiting for a postmaster state change, or
  (3) waiting for a future time to be restarted.

Now the postmaster's main loop only has to handle the head item in the
start and time delay queues, or, more commonly, see that they are empty.

XXX Work in progress!

Discussion: https://postgr.es/m/Z1n5UpAiGDmFcMmd%40nathan
---
 src/backend/postmaster/bgworker.c           |  45 ++-
 src/backend/postmaster/postmaster.c         | 343 +++++++++-----------
 src/include/postmaster/bgworker_internals.h |   8 +-
 3 files changed, 198 insertions(+), 198 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 3e326e48540..b06db0fb152 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -39,6 +39,19 @@
  */
 dlist_head	BackgroundWorkerList = DLIST_STATIC_INIT(BackgroundWorkerList);
 
+/*
+ * A background worker that is not running is also in one of these queues
+ * waiting to start, using rw_queue_node.  These queues represent workers that
+ * should be started immediately, after a future postmaster state is reached,
+ * or after a specific time in the future, respectively.
+ */
+dlist_head	BackgroundWorkerStartQueue =
+DLIST_STATIC_INIT(BackgroundWorkerStartQueue);
+dlist_head	BackgroundWorkerWaitStateQueue =
+DLIST_STATIC_INIT(BackgroundWorkerWaitStateQueue);
+dlist_head	BackgroundWorkerWaitTimeQueue =
+DLIST_STATIC_INIT(BackgroundWorkerWaitTimeQueue);
+
 /*
  * An array of registered background workers indexed by slot number.
  * BackgroundWorkerList entries are also in this table for fast lookup, and
@@ -246,8 +259,11 @@ BackgroundWorkerShmemInit(void)
 			/* Add to slot-number lookup table. */
 			BackgroundWorkerTable[rw->rw_shmem_slot] = rw;
 
-			/* Enqueue change notifications for all populated slots. */
-			BackgroundWorkerData->change_queue[BackgroundWorkerData->change_queue_head++] = slotno;
+			/*
+			 * Schedule to start.  It'll be requeued if it wants to wait for a
+			 * different state.
+			 */
+			dlist_push_tail(&BackgroundWorkerStartQueue, &rw->rw_queue_node);
 
 			++slotno;
 		}
@@ -493,7 +509,6 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 
 		/* Initialize postmaster bookkeeping. */
 		rw->rw_pid = 0;
-		rw->rw_crashed_at = 0;
 		rw->rw_shmem_slot = slotno;
 		rw->rw_terminate = false;
 
@@ -502,10 +517,14 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 				(errmsg_internal("registering background worker \"%s\"",
 								 rw->rw_worker.bgw_name)));
 
+		/* Put it on the list of all workers. */
 		dlist_push_head(&BackgroundWorkerList, &rw->rw_lnode);
 
 		/* Add to slot-number lookup table. */
 		BackgroundWorkerTable[rw->rw_shmem_slot] = rw;
+
+		/* Add to the start queue. */
+		dlist_push_tail(&BackgroundWorkerStartQueue, &rw->rw_queue_node);
 	}
 }
 
@@ -588,6 +607,8 @@ ForgetBackgroundWorker(RegisteredBgWorker *rw)
 			(errmsg_internal("unregistering background worker \"%s\"",
 							 rw->rw_worker.bgw_name)));
 
+	if (!dlist_node_is_detached(&rw->rw_queue_node))
+		dlist_delete_thoroughly(&rw->rw_queue_node);
 	dlist_delete(&rw->rw_lnode);
 	pfree(rw);
 }
@@ -726,6 +747,9 @@ ResetBackgroundWorkerCrashTimes(void)
 
 		rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
 
+		/* We waited for them all to exit, so they should have no pid. */
+		Assert(rw->rw_pid == 0);
+
 		if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
 		{
 			/*
@@ -748,18 +772,20 @@ ResetBackgroundWorkerCrashTimes(void)
 			 */
 			Assert((rw->rw_worker.bgw_flags & BGWORKER_CLASS_PARALLEL) == 0);
 
-			/*
-			 * Allow this worker to be restarted immediately after we finish
-			 * resetting.
-			 */
-			rw->rw_crashed_at = 0;
-
 			/*
 			 * If there was anyone waiting for it, they're history.
 			 */
 			rw->rw_worker.bgw_notify_pid = 0;
 		}
 	}
+
+	/* Remove everything from queues. */
+	while (!dlist_is_empty(&BackgroundWorkerStartQueue))
+		dlist_delete_thoroughly(dlist_head_node(&BackgroundWorkerStartQueue));
+	while (!dlist_is_empty(&BackgroundWorkerWaitStateQueue))
+		dlist_delete_thoroughly(dlist_head_node(&BackgroundWorkerWaitStateQueue));
+	while (!dlist_is_empty(&BackgroundWorkerWaitTimeQueue))
+		dlist_delete_thoroughly(dlist_head_node(&BackgroundWorkerWaitTimeQueue));
 }
 
 /*
@@ -1165,7 +1191,6 @@ RegisterBackgroundWorker(BackgroundWorker *worker)
 
 	rw->rw_worker = *worker;
 	rw->rw_pid = 0;
-	rw->rw_crashed_at = 0;
 	rw->rw_terminate = false;
 
 	dlist_push_head(&BackgroundWorkerList, &rw->rw_lnode);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f605..1041c4b3686 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -371,10 +371,6 @@ static bool avlauncher_needs_signal = false;
 /* received START_WALRECEIVER signal */
 static bool WalReceiverRequested = false;
 
-/* set when there's a worker that needs to be started up */
-static bool StartWorkerNeeded = true;
-static bool HaveCrashedWorker = false;
-
 /* set when signals arrive */
 static volatile sig_atomic_t pending_pm_pmsignal;
 static volatile sig_atomic_t pending_pm_child_exit;
@@ -429,13 +425,16 @@ static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
 static int	CountChildren(BackendTypeMask targetMask);
 static void LaunchMissingBackgroundProcesses(void);
-static void maybe_start_bgworkers(void);
+static void maybe_start_bgworker(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static PMChild *StartChildProcess(BackendType type);
 static void StartSysLogger(void);
 static void StartAutovacuumWorker(void);
 static bool StartBackgroundWorker(RegisteredBgWorker *rw);
 static void InitPostmasterDeathWatchHandle(void);
+static void bgworker_requeue_on_state_change(void);
+static int	bgworker_requeue_on_time(void);
+static void bgworker_schedule_restart(RegisteredBgWorker *rw);
 
 #ifdef WIN32
 #define WNOHANG 0				/* ignored, so any integer value will do */
@@ -1371,9 +1370,6 @@ PostmasterMain(int argc, char *argv[])
 	StartupStatus = STARTUP_RUNNING;
 	UpdatePMState(PM_STARTUP);
 
-	/* Some workers may be scheduled to start now */
-	maybe_start_bgworkers();
-
 	status = ServerLoop();
 
 	/*
@@ -1520,14 +1516,10 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup = 0;
+	int			sleep_ms;
 
-	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
-	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	/* If we're in a shutdown sequence, we ignore bgworkers altogether. */
+	if (Shutdown > NoShutdown)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1543,54 +1535,18 @@ DetermineSleepTime(void)
 			return 60 * 1000;
 	}
 
-	if (StartWorkerNeeded)
-		return 0;
-
-	if (HaveCrashedWorker)
-	{
-		dlist_mutable_iter iter;
-
-		/*
-		 * When there are crashed bgworkers, we sleep just long enough that
-		 * they are restarted when they request to be.  Scan the list to
-		 * determine the minimum of all wakeup times according to most recent
-		 * crash time and requested restart interval.
-		 */
-		dlist_foreach_modify(iter, &BackgroundWorkerList)
-		{
-			RegisteredBgWorker *rw;
-			TimestampTz this_wakeup;
-
-			rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
-
-			if (rw->rw_crashed_at == 0)
-				continue;
-
-			if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART
-				|| rw->rw_terminate)
-			{
-				ForgetBackgroundWorker(rw);
-				continue;
-			}
-
-			this_wakeup = TimestampTzPlusMilliseconds(rw->rw_crashed_at,
-													  1000L * rw->rw_worker.bgw_restart_time);
-			if (next_wakeup == 0 || this_wakeup < next_wakeup)
-				next_wakeup = this_wakeup;
-		}
-	}
-
-	if (next_wakeup != 0)
-	{
-		int			ms;
+	/*
+	 * Check if any deferred workers should be moved to the start queue.  This
+	 * also tells us how long until the next one is ready, or INT_MAX if there
+	 * are none.
+	 */
+	sleep_ms = bgworker_requeue_on_time();
 
-		/* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
-		ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-												   next_wakeup);
-		return Min(60 * 1000, ms);
-	}
+	/* Don't sleep at all if there is a worker ready to be started. */
+	if (!dlist_is_empty(&BackgroundWorkerStartQueue))
+		return 0;
 
-	return 60 * 1000;
+	return Min(60 * 1000, sleep_ms);
 }
 
 /*
@@ -2313,13 +2269,6 @@ process_pm_child_exit(void)
 			UpdatePMState(PM_RUN);
 			connsAllowed = true;
 
-			/*
-			 * At the next iteration of the postmaster's main loop, we will
-			 * crank up the background tasks like the autovacuum launcher and
-			 * background workers that were not started earlier already.
-			 */
-			StartWorkerNeeded = true;
-
 			/* at this point we are really open for business */
 			ereport(LOG,
 					(errmsg("database system is ready to accept connections")));
@@ -2615,6 +2564,8 @@ CleanupBackend(PMChild *bp,
 
 	if (crashed)
 	{
+		if (rw)
+			rw->rw_pid = 0;
 		HandleChildCrash(bp_pid, exitstatus, procname);
 		return;
 	}
@@ -2635,19 +2586,18 @@ CleanupBackend(PMChild *bp,
 	 */
 	if (bp_bkend_type == B_BG_WORKER)
 	{
-		if (!EXIT_STATUS_0(exitstatus))
-		{
-			/* Record timestamp, so we know when to restart the worker. */
-			rw->rw_crashed_at = GetCurrentTimestamp();
-		}
-		else
-		{
-			/* Zero exit status means terminate */
-			rw->rw_crashed_at = 0;
+		rw->rw_pid = 0;
+
+		/*
+		 * Zero exist status means terminate.  Otherwise we schedule a delayed
+		 * restart.  Note that ReportBackgroundWorkerExit() will free the
+		 * worker if appropriate.
+		 */
+		if (EXIT_STATUS_0(exitstatus))
 			rw->rw_terminate = true;
-		}
+		else
+			bgworker_schedule_restart(rw);
 
-		rw->rw_pid = 0;
 		ReportBackgroundWorkerExit(rw); /* report child death */
 
 		if (!logged)
@@ -2656,9 +2606,6 @@ CleanupBackend(PMChild *bp,
 						 procname, bp_pid, exitstatus);
 			logged = true;
 		}
-
-		/* have it be restarted */
-		HaveCrashedWorker = true;
 	}
 
 	if (!logged)
@@ -3144,8 +3091,7 @@ pmstate_name(PMState state)
 }
 
 /*
- * Simple wrapper for updating pmState. The main reason to have this wrapper
- * is that it makes it easy to log all state transitions.
+ * Update pmState.
  */
 static void
 UpdatePMState(PMState newState)
@@ -3153,6 +3099,9 @@ UpdatePMState(PMState newState)
 	elog(DEBUG1, "updating PMState from %s to %s",
 		 pmstate_name(pmState), pmstate_name(newState));
 	pmState = newState;
+
+	/* Some bgworkers might need to move to the start queue. */
+	bgworker_requeue_on_state_change();
 }
 
 /*
@@ -3265,8 +3214,7 @@ LaunchMissingBackgroundProcesses(void)
 		WalSummarizerPMChild = StartChildProcess(B_WAL_SUMMARIZER);
 
 	/* Get other worker processes running, if needed */
-	if (StartWorkerNeeded || HaveCrashedWorker)
-		maybe_start_bgworkers();
+	maybe_start_bgworker();
 }
 
 /*
@@ -3621,9 +3569,6 @@ process_pm_pmsignal(void)
 
 		UpdatePMState(PM_HOT_STANDBY);
 		connsAllowed = true;
-
-		/* Some workers may be scheduled to start now */
-		StartWorkerNeeded = true;
 	}
 
 	/* Process background worker state changes. */
@@ -3631,7 +3576,6 @@ process_pm_pmsignal(void)
 	{
 		/* Accept new worker requests only if not stopping. */
 		BackgroundWorkerStateChange(pmState < PM_STOP_BACKENDS);
-		StartWorkerNeeded = true;
 	}
 
 	/* Tell syslogger to rotate logfile if requested */
@@ -3954,7 +3898,8 @@ StartBackgroundWorker(RegisteredBgWorker *rw)
 		ereport(LOG,
 				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
 				 errmsg("no slot available for new background worker process")));
-		rw->rw_crashed_at = GetCurrentTimestamp();
+		bgworker_schedule_restart(rw);
+		ReportBackgroundWorkerExit(rw);
 		return false;
 	}
 	bn->rw = rw;
@@ -3975,8 +3920,8 @@ StartBackgroundWorker(RegisteredBgWorker *rw)
 		/* undo what AssignPostmasterChildSlot did */
 		ReleasePostmasterChildSlot(bn);
 
-		/* mark entry as crashed, so we'll try again later */
-		rw->rw_crashed_at = GetCurrentTimestamp();
+		bgworker_schedule_restart(rw);
+		ReportBackgroundWorkerExit(rw);
 		return false;
 	}
 
@@ -4026,48 +3971,128 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 }
 
 /*
- * If the time is right, start background worker(s).
- *
- * As a side effect, the bgworker control variables are set or reset
- * depending on whether more workers may need to be started.
- *
- * We limit the number of workers started per call, to avoid consuming the
- * postmaster's attention for too long when many such requests are pending.
- * As long as StartWorkerNeeded is true, ServerLoop will not block and will
- * call this function again after dealing with any other issues.
+ * Schedule deferred workers to start if the state they are waiting for has
+ * arrived.  This should be called whenever pmState moves to states for which
+ * bgworker_should_start_now() can return true.
  */
 static void
-maybe_start_bgworkers(void)
+bgworker_requeue_on_state_change(void)
 {
-#define MAX_BGWORKERS_TO_LAUNCH 100
-	int			num_launched = 0;
-	TimestampTz now = 0;
 	dlist_mutable_iter iter;
 
+	dlist_foreach_modify(iter, &BackgroundWorkerWaitStateQueue)
+	{
+		RegisteredBgWorker *rw;
+
+		rw = dlist_container(RegisteredBgWorker, rw_queue_node, iter.cur);
+		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
+		{
+			dlist_delete(&rw->rw_queue_node);
+			dlist_push_tail(&BackgroundWorkerStartQueue, &rw->rw_queue_node);
+		}
+	}
+}
+
+/*
+ * Schedule workers to start if their start time has arrived.  Returns the
+ * number of milliseconds until the next time-delayed worker wants to start,
+ * or INT_MAX if there are none.
+ */
+static int
+bgworker_requeue_on_time(void)
+{
+	dlist_head *queue = &BackgroundWorkerWaitTimeQueue;
+
+	if (!dlist_is_empty(queue))
+	{
+		TimestampTz now = GetCurrentTimestamp();
+
+		do
+		{
+			RegisteredBgWorker *rw;
+
+			/* If the head item in the queue needs more time, we are done. */
+			rw = dlist_head_element(RegisteredBgWorker, rw_queue_node, queue);
+			if (rw->rw_restart_at > now)
+				return TimestampDifferenceMilliseconds(now, rw->rw_restart_at);
+
+			/* Move this one to the start queue, and go around again. */
+			dlist_delete(&rw->rw_queue_node);
+			dlist_push_tail(&BackgroundWorkerStartQueue, &rw->rw_queue_node);
+		} while (!dlist_is_empty(queue));
+	}
+
+	return INT_MAX;
+}
+
+/*
+ * After a background worker exits unsuccessfully or fails to start, this
+ * function can be used to schedule a restart attempt after "bgw_restart_time"
+ * seconds.
+ */
+static void
+bgworker_schedule_restart(RegisteredBgWorker *rw)
+{
+	dlist_head *queue = &BackgroundWorkerWaitTimeQueue;
+	RegisteredBgWorker *other = NULL;
+
+	if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
+		return;
+
+	/* Compute restart time. */
+	rw->rw_restart_at = TimestampTzPlusSeconds(GetCurrentTimestamp(),
+											   rw->rw_worker.bgw_restart_time);
+
 	/*
-	 * During crash recovery, we have no need to be called until the state
-	 * transition out of recovery.
+	 * Find insertion point in the time-order queue.  Start the search at the
+	 * tail end, since it has the highest time so far and there is a good
+	 * chance we have a higher time.
+	 *
+	 * XXX We could use a binary heap instead, but this should be very
+	 * infrequent and the list should usually be empty, so it doesn't really
+	 * matter.  The more important thing is that bgworker_requeue_on_time() is
+	 * O(1), since it is called frequently.
 	 */
-	if (FatalError)
+	if (!dlist_is_empty(queue))
 	{
-		StartWorkerNeeded = false;
-		HaveCrashedWorker = false;
-		return;
+		other = dlist_tail_element(RegisteredBgWorker,
+								   rw_queue_node,
+								   queue);
+		while (other && other->rw_restart_at > rw->rw_restart_at)
+		{
+			if (!dlist_has_prev(queue, &other->rw_queue_node))
+				other = NULL;
+			else
+				other =
+					dlist_container(RegisteredBgWorker,
+									rw_queue_node,
+									dlist_prev_node(queue,
+													&other->rw_queue_node));
+		}
 	}
 
-	/* Don't need to be called again unless we find a reason for it below */
-	StartWorkerNeeded = false;
-	HaveCrashedWorker = false;
+	if (other)
+		dlist_insert_after(&other->rw_queue_node, &rw->rw_queue_node);
+	else
+		dlist_push_head(queue, &rw->rw_queue_node);
+}
 
-	dlist_foreach_modify(iter, &BackgroundWorkerList)
+/*
+ * Start at most one background worker from BackgroundWorkerStartList.
+ */
+static void
+maybe_start_bgworker(void)
+{
+	while (!dlist_is_empty(&BackgroundWorkerStartQueue))
 	{
 		RegisteredBgWorker *rw;
 
-		rw = dlist_container(RegisteredBgWorker, rw_lnode, iter.cur);
+		/* pop thoroughly (so we can tell we're not on any list) */
+		rw = dlist_head_element(RegisteredBgWorker, rw_queue_node,
+								&BackgroundWorkerStartQueue);
+		dlist_delete_thoroughly(&rw->rw_queue_node);
 
-		/* ignore if already running */
-		if (rw->rw_pid != 0)
-			continue;
+		Assert(rw->rw_pid == 0);
 
 		/* if marked for death, clean up and remove from list */
 		if (rw->rw_terminate)
@@ -4077,76 +4102,22 @@ maybe_start_bgworkers(void)
 		}
 
 		/*
-		 * If this worker has crashed previously, maybe it needs to be
-		 * restarted (unless on registration it specified it doesn't want to
-		 * be restarted at all).  Check how long ago did a crash last happen.
-		 * If the last crash is too recent, don't start it right away; let it
-		 * be restarted once enough time has passed.
+		 * If this worker is waiting for a different postmaster state, then
+		 * requeue it.
 		 */
-		if (rw->rw_crashed_at != 0)
+		if (!bgworker_should_start_now(rw->rw_worker.bgw_start_time))
 		{
-			if (rw->rw_worker.bgw_restart_time == BGW_NEVER_RESTART)
-			{
-				int			notify_pid;
-
-				notify_pid = rw->rw_worker.bgw_notify_pid;
-
-				ForgetBackgroundWorker(rw);
-
-				/* Report worker is gone now. */
-				if (notify_pid != 0)
-					kill(notify_pid, SIGUSR1);
-
-				continue;
-			}
-
-			/* read system time only when needed */
-			if (now == 0)
-				now = GetCurrentTimestamp();
-
-			if (!TimestampDifferenceExceeds(rw->rw_crashed_at, now,
-											rw->rw_worker.bgw_restart_time * 1000))
-			{
-				/* Set flag to remember that we have workers to start later */
-				HaveCrashedWorker = true;
-				continue;
-			}
+			dlist_push_tail(&BackgroundWorkerWaitStateQueue,
+							&rw->rw_queue_node);
+			continue;
 		}
 
-		if (bgworker_should_start_now(rw->rw_worker.bgw_start_time))
-		{
-			/* reset crash time before trying to start worker */
-			rw->rw_crashed_at = 0;
-
-			/*
-			 * Try to start the worker.
-			 *
-			 * On failure, give up processing workers for now, but set
-			 * StartWorkerNeeded so we'll come back here on the next iteration
-			 * of ServerLoop to try again.  (We don't want to wait, because
-			 * there might be additional ready-to-run workers.)  We could set
-			 * HaveCrashedWorker as well, since this worker is now marked
-			 * crashed, but there's no need because the next run of this
-			 * function will do that.
-			 */
-			if (!StartBackgroundWorker(rw))
-			{
-				StartWorkerNeeded = true;
-				return;
-			}
-
-			/*
-			 * If we've launched as many workers as allowed, quit, but have
-			 * ServerLoop call us again to look for additional ready-to-run
-			 * workers.  There might not be any, but we'll find out the next
-			 * time we run.
-			 */
-			if (++num_launched >= MAX_BGWORKERS_TO_LAUNCH)
-			{
-				StartWorkerNeeded = true;
-				return;
-			}
-		}
+		/*
+		 * Only try to start one worker, and then return control so that other
+		 * kinds of postmaster work can be processed.
+		 */
+		StartBackgroundWorker(rw);
+		break;
 	}
 }
 
diff --git a/src/include/postmaster/bgworker_internals.h b/src/include/postmaster/bgworker_internals.h
index 092b1610663..c58daff10de 100644
--- a/src/include/postmaster/bgworker_internals.h
+++ b/src/include/postmaster/bgworker_internals.h
@@ -33,13 +33,17 @@ typedef struct RegisteredBgWorker
 {
 	BackgroundWorker rw_worker; /* its registry entry */
 	pid_t		rw_pid;			/* 0 if not running */
-	TimestampTz rw_crashed_at;	/* if not 0, time it last crashed */
 	int			rw_shmem_slot;
 	bool		rw_terminate;
-	dlist_node	rw_lnode;		/* list link */
+	dlist_node	rw_lnode;		/* node for list of all workers */
+	dlist_node	rw_queue_node;	/* node for start/wait queues */
+	TimestampTz rw_restart_at;	/* deferred start time after failure */
 } RegisteredBgWorker;
 
 extern PGDLLIMPORT dlist_head BackgroundWorkerList;
+extern PGDLLIMPORT dlist_head BackgroundWorkerStartQueue;
+extern PGDLLIMPORT dlist_head BackgroundWorkerWaitStateQueue;
+extern PGDLLIMPORT dlist_head BackgroundWorkerWaitTimeQueue;
 
 extern Size BackgroundWorkerShmemSize(void);
 extern void BackgroundWorkerShmemInit(void);
-- 
2.47.1

Re: connection establishment versus parallel workers

Reply via email to