Over in [1] we have a report of a postmaster shutdown that seems to
have occurred because some client logic was overaggressively spawning
connection requests, causing the postmaster's child-process arrays to
be temporarily full, and then some parallel query tried to launch a
new bgworker process.  The postmaster's bgworker-spawning logic lacks
any check for the arrays being full, so when AssignPostmasterChildSlot
failed to find a free slot, kaboom!

The attached proposed patch fixes this by making bgworker spawning
include a canAcceptConnections() test.  That's perhaps overkill, since
we really just need to check the CountChildren() total; but I judged
that going through that function and having it decide what to test or
not test was a better design than duplicating the CountChildren() test
elsewhere.

I'd first imagined also replacing the one-size-fits-all check

    if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
        result = CAC_TOOMANY;

with something like

    switch (backend_type)
    {
        case BACKEND_TYPE_NORMAL:
            if (CountChildren(backend_type) >= 2 * MaxConnections)
                result = CAC_TOOMANY;
            break;
        case BACKEND_TYPE_AUTOVAC:
            if (CountChildren(backend_type) >= 2 * autovacuum_max_workers)
                result = CAC_TOOMANY;
            break;
        ...
    }

so as to subdivide the pool of child-process slots and prevent client
requests from consuming slots meant for background processes.  But on
closer examination that's not really worth the trouble, because this
pool is already considerably bigger than MaxBackends; so even if we
prevented a failure here we could still have bgworker startup failure
later on when it tries to acquire a PGPROC.

Barring objections, I'll apply and back-patch this soon.

                        regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CADCf-WNZk_9680Q0YjfBzuiR0Oe8LzvDs2Ts3_tq6Tv1e8raQQ%40mail.gmail.com

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eb9e022..7947c96 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -409,7 +409,7 @@ static void SendNegotiateProtocolVersion(List *unrecognized_protocol_options);
 static void processCancelRequest(Port *port, void *pkt);
 static int	initMasks(fd_set *rmask);
 static void report_fork_failure_to_client(Port *port, int errnum);
-static CAC_state canAcceptConnections(void);
+static CAC_state canAcceptConnections(int backend_type);
 static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
@@ -2401,13 +2401,15 @@ processCancelRequest(Port *port, void *pkt)
  * canAcceptConnections --- check to see if database state allows connections.
  */
 static CAC_state
-canAcceptConnections(void)
+canAcceptConnections(int backend_type)
 {
 	CAC_state	result = CAC_OK;
 
 	/*
 	 * Can't start backends when in startup/shutdown/inconsistent recovery
-	 * state.
+	 * state.  We treat autovac workers the same as user backends for this
+	 * purpose.  However, bgworkers are excluded from this test; we expect
+	 * bgworker_should_start_now() decided whether the DB state allows them.
 	 *
 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
 	 * allowed so that a superuser can end online backup mode); we return
@@ -2415,7 +2417,8 @@ canAcceptConnections(void)
 	 * that neither CAC_OK nor CAC_WAITBACKUP can safely be returned until we
 	 * have checked for too many children.
 	 */
-	if (pmState != PM_RUN)
+	if (pmState != PM_RUN &&
+		backend_type != BACKEND_TYPE_BGWORKER)
 	{
 		if (pmState == PM_WAIT_BACKUP)
 			result = CAC_WAITBACKUP;	/* allow superusers only */
@@ -2435,9 +2438,9 @@ canAcceptConnections(void)
 	/*
 	 * Don't start too many children.
 	 *
-	 * We allow more connections than we can have backends here because some
+	 * We allow more connections here than we can have backends because some
 	 * might still be authenticating; they might fail auth, or some existing
-	 * backend might exit before the auth cycle is completed. The exact
+	 * backend might exit before the auth cycle is completed.  The exact
 	 * MaxBackends limit is enforced when a new backend tries to join the
 	 * shared-inval backend array.
 	 *
@@ -4114,7 +4117,7 @@ BackendStartup(Port *port)
 	bn->cancel_key = MyCancelKey;
 
 	/* Pass down canAcceptConnections state */
-	port->canAcceptConnections = canAcceptConnections();
+	port->canAcceptConnections = canAcceptConnections(BACKEND_TYPE_NORMAL);
 	bn->dead_end = (port->canAcceptConnections != CAC_OK &&
 					port->canAcceptConnections != CAC_WAITBACKUP);
 
@@ -5486,7 +5489,7 @@ StartAutovacuumWorker(void)
 	 * we have to check to avoid race-condition problems during DB state
 	 * changes.
 	 */
-	if (canAcceptConnections() == CAC_OK)
+	if (canAcceptConnections(BACKEND_TYPE_AUTOVAC) == CAC_OK)
 	{
 		/*
 		 * Compute the cancel key that will be assigned to this session. We
@@ -5863,6 +5866,19 @@ assign_backendlist_entry(RegisteredBgWorker *rw)
 	Backend    *bn;
 
 	/*
+	 * Check that database state allows another connection.  Currently the
+	 * only possible failure is CAC_TOOMANY, so we just log an error message
+	 * based on that rather than checking the error code precisely.
+	 */
+	if (canAcceptConnections(BACKEND_TYPE_BGWORKER) != CAC_OK)
+	{
+		ereport(LOG,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("no slot available for new worker process")));
+		return false;
+	}
+
+	/*
 	 * Compute the cancel key that will be assigned to this session. We
 	 * probably don't need cancel keys for background workers, but we'd better
 	 * have something random in the field to prevent unfriendly people from

Reply via email to