Re: Refactoring postmaster's code to cleanup after child exit

Heikki Linnakangas Fri, 06 Sep 2024 06:14:04 -0700

On 04/09/2024 17:35, Andres Freund wrote:

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:

 From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Mon, 12 Aug 2024 10:59:04 +0300
Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children


And replace postmaster.c's own "backend type" codes with BackendType


Hm - it seems a bit odd to open-code this when we actually have a "table
driven configuration" available?  Why isn't the type a field in
child_process_kind?

Sorry, I didn't understand this. What exactly would you add tochild_process_kind? Where would you use it?

+/*
+ * MaxLivePostmasterChildren
+ *
+ * This reports the number postmaster child processes that can be active.  It
+ * includes all children except for dead_end children.  This allows the array
+ * in shared memory (PMChildFlags) to have a fixed maximum size.
+ */
+int
+MaxLivePostmasterChildren(void)
+{
+       int                     n = 0;
+
+       /* We know exactly how mamy worker and aux processes can be active */
+       n += autovacuum_max_workers;
+       n += max_worker_processes;
+       n += NUM_AUXILIARY_PROCS;
+
+       /*
+        * We allow more connections here than we can have backends because some
+        * might still be authenticating; they might fail auth, or some existing
+        * backend might exit before the auth cycle is completed.  The exact
+        * MaxBackends limit is enforced when a new backend tries to join the
+        * shared-inval backend array.
+        */
+       n += 2 * (MaxConnections + max_wal_senders);
+
+       return n;
+}


I wonder if we could instead maintain at least some of this in
child_process_kinds? Manually listing different types of processes in
different places doesn't seem particularly sustainable.

Hmm, you mean adding "max this kind of children" field tochild_process_kinds? Perhaps.

+/*
+ * Initialize at postmaster startup
+ */
+void
+InitPostmasterChildSlots(void)
+{
+       int                     num_pmchild_slots;
+       int                     slotno;
+       PMChild    *slots;
+
+       dlist_init(&freeBackendList);
+       dlist_init(&freeAutoVacWorkerList);
+       dlist_init(&freeBgWorkerList);
+       dlist_init(&freeAuxList);
+       dlist_init(&ActiveChildList);
+
+       num_pmchild_slots = MaxLivePostmasterChildren();
+
+       slots = palloc(num_pmchild_slots * sizeof(PMChild));
+
+       slotno = 0;
+       for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++)
+       {
+               init_slot(&slots[slotno], slotno, &freeBackendList);
+               slotno++;
+       }
+       for (int i = 0; i < autovacuum_max_workers; i++)
+       {
+               init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList);
+               slotno++;
+       }
+       for (int i = 0; i < max_worker_processes; i++)
+       {
+               init_slot(&slots[slotno], slotno, &freeBgWorkerList);
+               slotno++;
+       }
+       for (int i = 0; i < NUM_AUXILIARY_PROCS; i++)
+       {
+               init_slot(&slots[slotno], slotno, &freeAuxList);
+               slotno++;
+       }
+       Assert(slotno == num_pmchild_slots);
+}


Along the same vein - could we generalize this into one array of different
slot types and then loop over that to initialize / acquire the slots?


Makes sense.

+/* Return the appropriate free-list for the given backend type */
+static dlist_head *
+GetFreeList(BackendType btype)
+{
+       switch (btype)
+       {
+               case B_BACKEND:
+               case B_BG_WORKER:
+               case B_WAL_SENDER:
+               case B_SLOTSYNC_WORKER:
+                       return &freeBackendList;


Maybe a daft question - but why are all of these in the same list? Sure,
they're essentially all full backends, but they're covered by different GUCs?

No reason. No particular reason they should *not* share the same listeither though.

+                       /*
+                        * Auxiliary processes.  There can be only one of each 
of these
+                        * running at a time.
+                        */
+               case B_AUTOVAC_LAUNCHER:
+               case B_ARCHIVER:
+               case B_BG_WRITER:
+               case B_CHECKPOINTER:
+               case B_STARTUP:
+               case B_WAL_RECEIVER:
+               case B_WAL_SUMMARIZER:
+               case B_WAL_WRITER:
+                       return &freeAuxList;
+
+                       /*
+                        * Logger is not connected to shared memory, and does 
not have a
+                        * PGPROC entry, but we still allocate a child slot for 
it.
+                        */


Tangential: Why do we need a freelist for these and why do we choose a random
pgproc for these instead of assigning one statically?

Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
of buffers that can't be done in-place, like writes when checksums are
enabled). The varying proc numbers make that harder than it'd have to be...

Yeah, we can make these fixed.Currently, the # of slots reserved for auxprocesses is sized by NUM_AUXILIARY_PROCS, which is one smaller than thenumber of different aux proces kinds:

/*
 * We set aside some extra PGPROC structures for auxiliary processes,
 * ie things that aren't full-fledged backends but need shmem access.
 *
 * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
 * run during normal operation.  Startup process and WAL receiver also consume
 * 2 slots, but WAL writer is launched only after startup has exited, so we
 * only need 6 slots.
 */
#define NUM_AUXILIARY_PROCS             6


For PMChildSlot numbers, we could certainly just allocate one more slot.

It would probably make sense for PGPROCs too, even though PGPROC is amuch larger struct.

+PMChild *
+AssignPostmasterChildSlot(BackendType btype)
+{
+       dlist_head *freelist;
+       PMChild    *pmchild;
+
+       freelist = GetFreeList(btype);
+
+       if (dlist_is_empty(freelist))
+               return NULL;
+
+       pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
+       pmchild->pid = 0;
+       pmchild->bkend_type = btype;
+       pmchild->rw = NULL;
+       pmchild->bgworker_notify = true;
+
+       /*
+        * pmchild->child_slot for each entry was initialized when the array of
+        * slots was allocated.
+        */
+
+       dlist_push_head(&ActiveChildList, &pmchild->elem);
+
+       ReservePostmasterChildSlot(pmchild->child_slot);
+
+       /* FIXME: find a more elegant way to pass this */
+       MyPMChildSlot = pmchild->child_slot;


What if we assigned one offset for each process and assigned its ID here and
also used that for its ProcNumber - that way we wouldn't need to manage
freelists in two places.

It's currently possible to have up to 2 * max_connections backends inthe authentication phase. We would have to change that behaviour, ormake the PGPROC array 2x larger.

It might well be worth it, I don't know how sensible the currentbehaviour is. But I'd like to punt that to later patch, to keep thescope of this patch set reasonable. It's pretty straightforward to dolater on top of this if we want to.

+PMChild *
+FindPostmasterChildByPid(int pid)
+{
+       dlist_iter      iter;
+
+       dlist_foreach(iter, &ActiveChildList)
+       {
+               PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
+
+               if (bp->pid == pid)
+                       return bp;
+       }
+       return NULL;
+}


It's not new, but it's quite sad that postmaster's process exit handling is
effectively O(Backends^2)...

It would be straightforward to turn ActiveChildList into a hash table.But I'd like to leave that to a followup patch too.

        /*
         * We're ready to rock and roll...
         */
-       StartupPID = StartChildProcess(B_STARTUP);
-       Assert(StartupPID != 0);
+       StartupPMChild = StartChildProcess(B_STARTUP);
+       Assert(StartupPMChild != NULL);


This (not new) assertion is ... odd.


Yeah, it's an assertion because StartChildProcess has this:

                /*
                 * fork failure is fatal during startup, but there's no need to 
choke
                 * immediately if starting other child types fails.
                 */
                if (type == B_STARTUP)
                        ExitPostmaster(1);

@@ -1961,26 +1915,6 @@ process_pm_reload_request(void)
                                (errmsg("received SIGHUP, reloading configuration 
files")));
                ProcessConfigFile(PGC_SIGHUP);
                SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 << 
B_DEAD_END_BACKEND));
-               if (StartupPID != 0)
-                       signal_child(StartupPID, SIGHUP);
-               if (BgWriterPID != 0)
-                       signal_child(BgWriterPID, SIGHUP);
-               if (CheckpointerPID != 0)
-                       signal_child(CheckpointerPID, SIGHUP);
-               if (WalWriterPID != 0)
-                       signal_child(WalWriterPID, SIGHUP);
-               if (WalReceiverPID != 0)
-                       signal_child(WalReceiverPID, SIGHUP);
-               if (WalSummarizerPID != 0)
-                       signal_child(WalSummarizerPID, SIGHUP);
-               if (AutoVacPID != 0)
-                       signal_child(AutoVacPID, SIGHUP);
-               if (PgArchPID != 0)
-                       signal_child(PgArchPID, SIGHUP);
-               if (SysLoggerPID != 0)
-                       signal_child(SysLoggerPID, SIGHUP);
-               if (SlotSyncWorkerPID != 0)
-                       signal_child(SlotSyncWorkerPID, SIGHUP);

/* Reload authentication config files too */

                if (!load_hba())


For a moment I wondered why this change was part of this commit - but I guess
we didn't have any of these in an array/list before this change...


Correct.

@@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
                }

/* Was it the system logger? If so, try to start a new one */

-               if (pid == SysLoggerPID)
+               if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
                {
-                       SysLoggerPID = 0;
                        /* for safety's sake, launch new logger *first* */
-                       SysLoggerPID = SysLogger_Start();
+                       SysLoggerPMChild->pid = SysLogger_Start();
+                       if (SysLoggerPMChild->pid == 0)
+                       {
+                               FreePostmasterChildSlot(SysLoggerPMChild);
+                               SysLoggerPMChild = NULL;
+                       }
                        if (!EXIT_STATUS_0(exitstatus))
                                LogChildExit(LOG, _("system logger process"),


Seems a bit weird to have one place with a different memory lifetime handling
than other places. Why don't we just do this the same way as in other places
but continue to defer the logging until after we tried to start the new
logger?


Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart?

I'm a little scared of changing the existing logic. We don't have amechanism for deferring logging, so we would have to invent that, or thelogs would just accumulate in the pipe until syslogger starts up.There's some code between here and LaunchMissingBackgroundProcesses(),so might postmaster get blocked between writing to the syslogger pipe,before having restarted it?


If forking the syslogger process fails, that can happen anyway, though.

Might be worth having a test ensuring that loggers restart OK.


Yeah..

        /* Construct a process name for log message */
+
+       /*
+        * FIXME: use GetBackendTypeDesc here? How does the localization of that
+        * work?
+        */
        if (bp->bkend_type == B_DEAD_END_BACKEND)
        {
                procname = _("dead end backend");


Might be worth having a version of GetBackendTypeDesc() that returns a
translated string?


Constructing the string for background workers is a little more complicated:

    snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""),
             bp->rw->rw_worker.bgw_type);

We could still do that for background workers and use the transalatedvariant of GetBackendTypeDesc() for everything else though.

@@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char 
*procname)
        {
                dlist_iter      iter;

- dlist_foreach(iter, &BackendList)

+               dlist_foreach(iter, &ActiveChildList)
                {
-                       Backend    *bp = dlist_container(Backend, elem, 
iter.cur);
+                       PMChild    *bp = dlist_container(PMChild, elem, 
iter.cur);
+
+                       /* We do NOT restart the syslogger */
+                       if (bp == SysLoggerPMChild)
+                               continue;


That comment seems a bit misleading - we do restart syslogger, we just don't
do it here, no?  I realize it's an old comment, but it still seems like it's
worth fixing given that you touch all the code here...

No, we really do not restart the syslogger. This code runs when*another* process has crashed unexpectedly. We kill all other processes,reinitialize shared memory and restart, but the old syslogger keepsrunning through all that.

I'll add a note on that to InitPostmasterChildSlots(), as it's a bitsurprising.

@@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
<snip>
-               if (WalSummarizerPID != 0)
-                       signal_child(WalSummarizerPID, SIGTERM);
-               if (SlotSyncWorkerPID != 0)
-                       signal_child(SlotSyncWorkerPID, SIGTERM);
+               targetMask |= (1 << B_STARTUP);
+               targetMask |= (1 << B_WAL_RECEIVER);
+
+               targetMask |= (1 << B_WAL_SUMMARIZER);
+               targetMask |= (1 << B_SLOTSYNC_WORKER);
                /* checkpointer, archiver, stats, and syslogger may continue 
for now */

+ SignalSomeChildren(SIGTERM, targetMask);

+
                /* Now transition to PM_WAIT_BACKENDS state to wait for them to 
die */
                pmState = PM_WAIT_BACKENDS;
<snip>


It's likely the right thing to not do as one patch, but IMO this really wants
to be a state table. Perhaps as part of child_process_kinds, perhaps separate
from that.

Yeah. I've tried to refactor this into a table before, but didn't comeup with anything that I was happy with. I also feel there must be abetter way to organize this, but not sure what exactly. I hope that willbecome more apparent after these other changes.

@@ -3130,8 +3047,21 @@ static void
  LaunchMissingBackgroundProcesses(void)
  {
        /* Syslogger is active in all states */
-       if (SysLoggerPID == 0 && Logging_collector)
-               SysLoggerPID = SysLogger_Start();
+       if (SysLoggerPMChild == NULL && Logging_collector)
+       {
+               SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
+               if (!SysLoggerPMChild)
+                       elog(LOG, "no postmaster child slot available for 
syslogger");


How could this elog() be reached? Seems something seriously would have gone
wrong to get here - in which case a LOG that might not even be visible (due to
logger not working) doesn't seem like the right response.


I'll turn it into an assertion or PANIC.

@@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask)
  static void
  TerminateChildren(int signal)
  {


The comment for TerminateChildren() says "except syslogger and dead_end
backends." - aren't you including the latter here?

The comment is adjusted inv4-0004-Introduce-a-separate-BackendType-for-dead-end-chi.patch. Beforethat, SignalChildren() does ignore dead-end children.


Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)

Re: Refactoring postmaster's code to cleanup after child exit

Reply via email to