On 04/09/2024 17:35, Andres Freund wrote:
On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Mon, 12 Aug 2024 10:59:04 +0300
Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children
And replace postmaster.c's own "backend type" codes with BackendType
Hm - it seems a bit odd to open-code this when we actually have a "table
driven configuration" available? Why isn't the type a field in
child_process_kind?
Sorry, I didn't understand this. What exactly would you add to
child_process_kind? Where would you use it?
+/*
+ * MaxLivePostmasterChildren
+ *
+ * This reports the number postmaster child processes that can be active. It
+ * includes all children except for dead_end children. This allows the array
+ * in shared memory (PMChildFlags) to have a fixed maximum size.
+ */
+int
+MaxLivePostmasterChildren(void)
+{
+ int n = 0;
+
+ /* We know exactly how mamy worker and aux processes can be active */
+ n += autovacuum_max_workers;
+ n += max_worker_processes;
+ n += NUM_AUXILIARY_PROCS;
+
+ /*
+ * We allow more connections here than we can have backends because some
+ * might still be authenticating; they might fail auth, or some existing
+ * backend might exit before the auth cycle is completed. The exact
+ * MaxBackends limit is enforced when a new backend tries to join the
+ * shared-inval backend array.
+ */
+ n += 2 * (MaxConnections + max_wal_senders);
+
+ return n;
+}
I wonder if we could instead maintain at least some of this in
child_process_kinds? Manually listing different types of processes in
different places doesn't seem particularly sustainable.
Hmm, you mean adding "max this kind of children" field to
child_process_kinds? Perhaps.
+/*
+ * Initialize at postmaster startup
+ */
+void
+InitPostmasterChildSlots(void)
+{
+ int num_pmchild_slots;
+ int slotno;
+ PMChild *slots;
+
+ dlist_init(&freeBackendList);
+ dlist_init(&freeAutoVacWorkerList);
+ dlist_init(&freeBgWorkerList);
+ dlist_init(&freeAuxList);
+ dlist_init(&ActiveChildList);
+
+ num_pmchild_slots = MaxLivePostmasterChildren();
+
+ slots = palloc(num_pmchild_slots * sizeof(PMChild));
+
+ slotno = 0;
+ for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++)
+ {
+ init_slot(&slots[slotno], slotno, &freeBackendList);
+ slotno++;
+ }
+ for (int i = 0; i < autovacuum_max_workers; i++)
+ {
+ init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList);
+ slotno++;
+ }
+ for (int i = 0; i < max_worker_processes; i++)
+ {
+ init_slot(&slots[slotno], slotno, &freeBgWorkerList);
+ slotno++;
+ }
+ for (int i = 0; i < NUM_AUXILIARY_PROCS; i++)
+ {
+ init_slot(&slots[slotno], slotno, &freeAuxList);
+ slotno++;
+ }
+ Assert(slotno == num_pmchild_slots);
+}
Along the same vein - could we generalize this into one array of different
slot types and then loop over that to initialize / acquire the slots?
Makes sense.
+/* Return the appropriate free-list for the given backend type */
+static dlist_head *
+GetFreeList(BackendType btype)
+{
+ switch (btype)
+ {
+ case B_BACKEND:
+ case B_BG_WORKER:
+ case B_WAL_SENDER:
+ case B_SLOTSYNC_WORKER:
+ return &freeBackendList;
Maybe a daft question - but why are all of these in the same list? Sure,
they're essentially all full backends, but they're covered by different GUCs?
No reason. No particular reason they should *not* share the same list
either though.
+ /*
+ * Auxiliary processes. There can be only one of each
of these
+ * running at a time.
+ */
+ case B_AUTOVAC_LAUNCHER:
+ case B_ARCHIVER:
+ case B_BG_WRITER:
+ case B_CHECKPOINTER:
+ case B_STARTUP:
+ case B_WAL_RECEIVER:
+ case B_WAL_SUMMARIZER:
+ case B_WAL_WRITER:
+ return &freeAuxList;
+
+ /*
+ * Logger is not connected to shared memory, and does
not have a
+ * PGPROC entry, but we still allocate a child slot for
it.
+ */
Tangential: Why do we need a freelist for these and why do we choose a random
pgproc for these instead of assigning one statically?
Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
of buffers that can't be done in-place, like writes when checksums are
enabled). The varying proc numbers make that harder than it'd have to be...
Yeah, we can make these fixed.Currently, the # of slots reserved for aux
processes is sized by NUM_AUXILIARY_PROCS, which is one smaller than the
number of different aux proces kinds:
/*
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
* Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
* run during normal operation. Startup process and WAL receiver also consume
* 2 slots, but WAL writer is launched only after startup has exited, so we
* only need 6 slots.
*/
#define NUM_AUXILIARY_PROCS 6
For PMChildSlot numbers, we could certainly just allocate one more slot.
It would probably make sense for PGPROCs too, even though PGPROC is a
much larger struct.
+PMChild *
+AssignPostmasterChildSlot(BackendType btype)
+{
+ dlist_head *freelist;
+ PMChild *pmchild;
+
+ freelist = GetFreeList(btype);
+
+ if (dlist_is_empty(freelist))
+ return NULL;
+
+ pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
+ pmchild->pid = 0;
+ pmchild->bkend_type = btype;
+ pmchild->rw = NULL;
+ pmchild->bgworker_notify = true;
+
+ /*
+ * pmchild->child_slot for each entry was initialized when the array of
+ * slots was allocated.
+ */
+
+ dlist_push_head(&ActiveChildList, &pmchild->elem);
+
+ ReservePostmasterChildSlot(pmchild->child_slot);
+
+ /* FIXME: find a more elegant way to pass this */
+ MyPMChildSlot = pmchild->child_slot;
What if we assigned one offset for each process and assigned its ID here and
also used that for its ProcNumber - that way we wouldn't need to manage
freelists in two places.
It's currently possible to have up to 2 * max_connections backends in
the authentication phase. We would have to change that behaviour, or
make the PGPROC array 2x larger.
It might well be worth it, I don't know how sensible the current
behaviour is. But I'd like to punt that to later patch, to keep the
scope of this patch set reasonable. It's pretty straightforward to do
later on top of this if we want to.
+PMChild *
+FindPostmasterChildByPid(int pid)
+{
+ dlist_iter iter;
+
+ dlist_foreach(iter, &ActiveChildList)
+ {
+ PMChild *bp = dlist_container(PMChild, elem, iter.cur);
+
+ if (bp->pid == pid)
+ return bp;
+ }
+ return NULL;
+}
It's not new, but it's quite sad that postmaster's process exit handling is
effectively O(Backends^2)...
It would be straightforward to turn ActiveChildList into a hash table.
But I'd like to leave that to a followup patch too.
/*
* We're ready to rock and roll...
*/
- StartupPID = StartChildProcess(B_STARTUP);
- Assert(StartupPID != 0);
+ StartupPMChild = StartChildProcess(B_STARTUP);
+ Assert(StartupPMChild != NULL);
This (not new) assertion is ... odd.
Yeah, it's an assertion because StartChildProcess has this:
/*
* fork failure is fatal during startup, but there's no need to
choke
* immediately if starting other child types fails.
*/
if (type == B_STARTUP)
ExitPostmaster(1);
@@ -1961,26 +1915,6 @@ process_pm_reload_request(void)
(errmsg("received SIGHUP, reloading configuration
files")));
ProcessConfigFile(PGC_SIGHUP);
SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 <<
B_DEAD_END_BACKEND));
- if (StartupPID != 0)
- signal_child(StartupPID, SIGHUP);
- if (BgWriterPID != 0)
- signal_child(BgWriterPID, SIGHUP);
- if (CheckpointerPID != 0)
- signal_child(CheckpointerPID, SIGHUP);
- if (WalWriterPID != 0)
- signal_child(WalWriterPID, SIGHUP);
- if (WalReceiverPID != 0)
- signal_child(WalReceiverPID, SIGHUP);
- if (WalSummarizerPID != 0)
- signal_child(WalSummarizerPID, SIGHUP);
- if (AutoVacPID != 0)
- signal_child(AutoVacPID, SIGHUP);
- if (PgArchPID != 0)
- signal_child(PgArchPID, SIGHUP);
- if (SysLoggerPID != 0)
- signal_child(SysLoggerPID, SIGHUP);
- if (SlotSyncWorkerPID != 0)
- signal_child(SlotSyncWorkerPID, SIGHUP);
/* Reload authentication config files too */
if (!load_hba())
For a moment I wondered why this change was part of this commit - but I guess
we didn't have any of these in an array/list before this change...
Correct.
@@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
}
/* Was it the system logger? If so, try to start a new one */
- if (pid == SysLoggerPID)
+ if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
{
- SysLoggerPID = 0;
/* for safety's sake, launch new logger *first* */
- SysLoggerPID = SysLogger_Start();
+ SysLoggerPMChild->pid = SysLogger_Start();
+ if (SysLoggerPMChild->pid == 0)
+ {
+ FreePostmasterChildSlot(SysLoggerPMChild);
+ SysLoggerPMChild = NULL;
+ }
if (!EXIT_STATUS_0(exitstatus))
LogChildExit(LOG, _("system logger process"),
Seems a bit weird to have one place with a different memory lifetime handling
than other places. Why don't we just do this the same way as in other places
but continue to defer the logging until after we tried to start the new
logger?
Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart?
I'm a little scared of changing the existing logic. We don't have a
mechanism for deferring logging, so we would have to invent that, or the
logs would just accumulate in the pipe until syslogger starts up.
There's some code between here and LaunchMissingBackgroundProcesses(),
so might postmaster get blocked between writing to the syslogger pipe,
before having restarted it?
If forking the syslogger process fails, that can happen anyway, though.
Might be worth having a test ensuring that loggers restart OK.
Yeah..
/* Construct a process name for log message */
+
+ /*
+ * FIXME: use GetBackendTypeDesc here? How does the localization of that
+ * work?
+ */
if (bp->bkend_type == B_DEAD_END_BACKEND)
{
procname = _("dead end backend");
Might be worth having a version of GetBackendTypeDesc() that returns a
translated string?
Constructing the string for background workers is a little more complicated:
snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""),
bp->rw->rw_worker.bgw_type);
We could still do that for background workers and use the transalated
variant of GetBackendTypeDesc() for everything else though.
@@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char
*procname)
{
dlist_iter iter;
- dlist_foreach(iter, &BackendList)
+ dlist_foreach(iter, &ActiveChildList)
{
- Backend *bp = dlist_container(Backend, elem,
iter.cur);
+ PMChild *bp = dlist_container(PMChild, elem,
iter.cur);
+
+ /* We do NOT restart the syslogger */
+ if (bp == SysLoggerPMChild)
+ continue;
That comment seems a bit misleading - we do restart syslogger, we just don't
do it here, no? I realize it's an old comment, but it still seems like it's
worth fixing given that you touch all the code here...
No, we really do not restart the syslogger. This code runs when
*another* process has crashed unexpectedly. We kill all other processes,
reinitialize shared memory and restart, but the old syslogger keeps
running through all that.
I'll add a note on that to InitPostmasterChildSlots(), as it's a bit
surprising.
@@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
<snip>
- if (WalSummarizerPID != 0)
- signal_child(WalSummarizerPID, SIGTERM);
- if (SlotSyncWorkerPID != 0)
- signal_child(SlotSyncWorkerPID, SIGTERM);
+ targetMask |= (1 << B_STARTUP);
+ targetMask |= (1 << B_WAL_RECEIVER);
+
+ targetMask |= (1 << B_WAL_SUMMARIZER);
+ targetMask |= (1 << B_SLOTSYNC_WORKER);
/* checkpointer, archiver, stats, and syslogger may continue
for now */
+ SignalSomeChildren(SIGTERM, targetMask);
+
/* Now transition to PM_WAIT_BACKENDS state to wait for them to
die */
pmState = PM_WAIT_BACKENDS;
<snip>
It's likely the right thing to not do as one patch, but IMO this really wants
to be a state table. Perhaps as part of child_process_kinds, perhaps separate
from that.
Yeah. I've tried to refactor this into a table before, but didn't come
up with anything that I was happy with. I also feel there must be a
better way to organize this, but not sure what exactly. I hope that will
become more apparent after these other changes.
@@ -3130,8 +3047,21 @@ static void
LaunchMissingBackgroundProcesses(void)
{
/* Syslogger is active in all states */
- if (SysLoggerPID == 0 && Logging_collector)
- SysLoggerPID = SysLogger_Start();
+ if (SysLoggerPMChild == NULL && Logging_collector)
+ {
+ SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
+ if (!SysLoggerPMChild)
+ elog(LOG, "no postmaster child slot available for
syslogger");
How could this elog() be reached? Seems something seriously would have gone
wrong to get here - in which case a LOG that might not even be visible (due to
logger not working) doesn't seem like the right response.
I'll turn it into an assertion or PANIC.
@@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask)
static void
TerminateChildren(int signal)
{
The comment for TerminateChildren() says "except syslogger and dead_end
backends." - aren't you including the latter here?
The comment is adjusted in
v4-0004-Introduce-a-separate-BackendType-for-dead-end-chi.patch. Before
that, SignalChildren() does ignore dead-end children.
Thanks for the review!
--
Heikki Linnakangas
Neon (https://neon.tech)