Hi Andres,
Thanks for sharing these detailed questions and insights. Below are some
comments and additional thoughts that might help, particularly regarding
bgworkers.
On 09/12/2024 22:42, Andres Freund wrote:
> Hi,
>
> For AIO, when using the worker process based "AIO", we need to manage a set of
> workers. These need to be started by postmaster, obviously.
>
> I have a few questions about the design details. They seemed mostly
> independent of the larger AIO thread, so I started this separately.
>
> 1) Where to start/stop workers after a config reload?
>
> The GUC that controls how many workers are allowed is PGC_SIGHUP. Therefore we
> need to adjust the number of workers after a config reload.
>
> Right now this is implemented (I think by Thomas, but it could have been me)
> in a GUC assign hook. But that seems rather iffy.
>
> For example, I think this means that, as we're in the middle of the GUC reload
> process, a child forked at that moment may have some updated GUCs, but not
> others. But even besides that, it "feels wrong", in a way I can't fully
> articulate.
>
> However, it's not exactly obvious *where* workers should be started / stopped
> instead.
>
> One approach would be to put the adjustment in PostmasterStateMachine(), but
> currently a config reload request doesn't trigger an invocation of
> PostmasterStateMachine().
>
> We have a somewhat similar case for bgworkers; for those we have a dedicated
> call in ServerLoop(). I guess we could just copy that, but it doesn't seem
> exactly pretty.
I am unsure why background workers are managed directly in ServerLoop();
wouldn't it help to have a dedicated "background worker launcher" instead?
> 2) New state machine phase for AIO shutdown?
>
> During non-crash shutdown, we currently stop archiver and walsender last
> (except dead-end children, but those don't matter in this context). But at
> least walsender might use AIO, so we need to shut down AIO after walsender.
>
> We already have PM_SHUTDOWN, PM_SHUTDOWN_2. I didn't want to introduce
> PM_SHUTDOWN_3. For now there's a new PM_SHUTDOWN_IO.
>
> Is that the right approach?
>
> But somehow the number of shutdown phases seems to be getting out of control;
> besides PM_SHUTDOWN* we also have PM_STOP_BACKENDS, PM_WAIT_BACKENDS,
> PM_WAIT_DEAD_END and PM_NO_CHILDREN...
For startup there is StartupStatusEnum; it serves a distinct purpose, but
maybe a ShutdownSequence/ShutdownStatus could help here: it would just
outline the order of operations across the shutdown states. That would be
closer to a chain of responsibility than to the existing state machine...
> 3) Do we need a new PM* phase at startup?
>
> Because the startup process - obviously - can perform IO, IO workers need to
> be started before the startup process is started.
>
> We already do this with checkpointer and bgwriter. For those we don't have a
> new phase, we just do so before forking the startup process. That means that
> the startup process might need to wait for checkpointer to finish starting up,
> but that seems fine to me. The same seems true for IO workers.
>
> Therefore I don't see a need for a new phase?
So starting them would belong to PM_INIT? But that leads to the next question:
> 4) Resetting pmState during crash-restart?
>
> For the IO worker handling we need to check the phase we currently are in to
> know whether to start new workers (so we don't start more workers while
> shutting down).
>
> That's easy during normal startup, the state is PM_INIT. But for a
> crash-restart (bottom of PostmasterStateMachine()), pmState will be
> PM_NO_CHILDREN - in which we would *NOT* want to start new IO workers normally
> (if e.g. a SIGHUP is processed while in that state). But we do need to start
> IO workers at that point in the crash-restart cycle.
>
> The AIO patchset has dealt with that by moving pmState = PM_STARTUP to a bit
> earlier. But that seems a *bit* weird, even though I don't see any negative
> consequences right now.
>
> Would it be better to reset pmState to PM_INIT? That doesn't seem quite right
> either, it's described as "postmaster starting".
The comments in postmaster.c say that crash recovery is like a shutdown
followed by a startup (not by an init). If I understood correctly, IO workers
will need to be restarted in that context, so they must come after PM_INIT ...
> 5) btmask_all_except2() not used anymore, btmask_all_except3() needed
>
> The one place using btmask_all_except2() now also needs to allow IO workers,
> and thus btmask_all_except3(). Which triggers warnings about
> btmask_all_except2() being unused.
>
> We could remove btmask_all_except2(), ifdef it out or create a more generic
> version that works with arbitrary numbers of arguments.
>
> Something like
>
> static inline BackendTypeMask
> btmask_all_except_n(int nargs, BackendType *t)
> {
>     BackendTypeMask mask = BTYPE_MASK_ALL;
>
>     for (int i = 0; i < nargs; i++)
>         mask = btmask_del(mask, t[i]);
>     return mask;
> }
>
> #define btmask_all_except(...) \
>     btmask_all_except_n( \
>         lengthof(((BackendType[]){__VA_ARGS__})), \
>         (BackendType[]){__VA_ARGS__} \
>     )
>
> should do the trick, I think.
> 6) Variable numbered aux processes
>
> The IO workers are aux processes - that seemed to be the most
> appropriate. However, if "real" AIO is configured (PGC_POSTMASTER), no IO
> workers can be started.
>
> The only change the patch had to make for that is:
>
> -#define NUM_AUXILIARY_PROCS 6
> +#define MAX_IO_WORKERS 32
> +#define NUM_AUXILIARY_PROCS (6 + MAX_IO_WORKERS)
>
> Which means that we'll reserve some resources for IO workers even if no IO
> workers can be started. Do we consider that a problem? There's some precedent
> - we reserve space for archiver even if not configured, despite archive_mode
> being PGC_POSTMASTER - but of course that's a different scale.
I have not checked the history; I have read some details on this topic, but
it is not yet clear to me **why** this is hardcoded. If there is a technical
or historical reason for the current design, it would be useful to revisit it
in this context. What prevents the number of such processes from being
dynamically sized today?
---
Cédric Villemain +33 6 20 30 22 52
https://www.Data-Bene.io
PostgreSQL Support, Expertise, Training, R&D