Hi, On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote: > On Tue, Jan 14, 2025 at 12:58:44AM -0500, Andres Freund wrote: > > The current code has the weird behaviour of going through PM_WAIT_BACKENDS. > > I > > left it like that for now. In fact more paths do so now, because we did so > > for > > PM_WAIT_XLOG_SHUTDOWN but *didn't* do so for PM_WAIT_XLOG_ARCHIVAL, which > > seems rather weird. > > > > I suspect it'd be better to switch directly to PM_DEAD_END and have > > HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's > > left > > for another time. > > That would probably make more sense, but would be more invasive and would > need careful testing. Worth a XXX comment in the code?
There is, although I guess it could be reformulated a bit. > ======== Patch 0003: > > It replaces the take_action boolean flag with an early return. > > One comment: > > === 1 > > + TerminateChildren(send_abort_for_crash ? SIGABRT : SIGQUIT); > > if (Shutdown != ImmediateShutdown) > FatalError = true; > > I think we can avoid this remaining extra check and set FatalError to true > unconditionally (as we already know that Shutdown != ImmediateShutdown as per > the first check in the function). > > ======== Patch 0004: > > One comment: > > === 2 > > As per comment === 1 above, then we could remove: > > if (Shutdown != ImmediateShutdown) > FatalError = true; > > from HandleFatalError(). And that would be up to the HandleFatalError() caller > to set FatalError to true (prior calling HandleFatalError()). > > HandleFatalError becomes a pure "transition to error state" function then. > Does that make sense to you? I don't see what we'd gain from moving FatalError = true separate from HandleFatalError? All it'll mean is more copied code? > === 4 > > + * XXX: Is it worth inventing a different PMQUIT value > + * that signals that the cluster is in a bad state, > + * without a process having crashed? > */ > > That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such. For now I think I'll leave it as such, as we'd need to have a new message in quickdie(). While the primary message for PMQUIT_FOR_CRASH message would differ, the rest seems like it'd largely be duplicated: case PMQUIT_FOR_CRASH: /* A crash-and-restart cycle is in progress */ ereport(WARNING_CLIENT_ONLY, (errcode(ERRCODE_CRASH_SHUTDOWN), errmsg("terminating connection because of crash of another server process"), errdetail("The postmaster has commanded this server process to roll back" " the current transaction and exit, because another" " server process exited abnormally and possibly corrupted" " shared memory."), errhint("In a moment you should be able to reconnect to the" " database and repeat your command."))); break; That seems like a separate change. Particularly because I'm just working on this as part of some nasty three layer deep dependency chain - aio is pretty big on its own... > > I suspect it'd be better to switch directly to PM_DEAD_END and have > > HandleFatalError always do ConfigurePostmasterWaitSet(false), but that's > > left > > for another time. > > Before doing so, what do you think about: > > 1. keeping FatalError = true; as suggeted above > 2. put back the UpdatePMState(PM_WAIT_DEAD_END) (prior to the > HandleFatalError( > PMQUIT_FOR_CRASH, false) call I don't like that, what's the point of having something like HandleFatalError if we duplicate more code than it saves to each place? I don't think we need to either, we can just do that in the relevant case statement in HandleFatalError(). That works, I went back and forth several times :) > 3. put the ConfigurePostmasterWaitSet(false) in the PM_WAIT_DEAD_END switch in > HandleFatalError()? That one would make sense to me. > But that would also need to call TerminateChildren() (a second time) again > after > ConfigurePostmasterWaitSet() which is not ideal though (unless we move the > TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in > the first call). Why would it need to? > === 8 > > + * Now that everyone important is gone > > s/everyone important is/walsenders and archiver are also gone/? I'd rather get away from these lists of specific processes - they end up being noisy in the git history because the lines get updated to add/remove items (or updates to them get missed). Greetings, Andres Freund