On Tue, Jan 17, 2023 at 08:44:27PM -0800, Nathan Bossart wrote: > Thanks. Here's a rebased version of the last patch.
Thanks for the rebase. The final state of the documentation is as follows: 51. Archive and Recovery Modules 51.1. Archive Module Initialization Functions 51.2. Archive Module Callbacks 51.3. Recovery Module Initialization Functions 51.4. Recovery Module Callbacks I am not completely sure whether this grouping is the best thing to do. Wouldn't it be better to move that into two different sub-sections instead? One layout suggestion: 51. WAL Modules 51.1. Archive Modules 51.1.1. Archive Module Initialization Functions 51.1.2. Archive Module Callbacks 51.2. Recovery Modules 51.2.1. Recovery Module Initialization Functions 51.2.2. Recovery Module Callbacks Putting both of them in the same section sounds like a good idea per the symmetry that one would likely have between the code paths of archiving and recovery, so as they share some common knowledge. This kinds of comes back to the previous suggestion to rename basic_archive to something like wal_modules, that covers both archiving and recovery. I does not seem like this would overlap with RMGRs, which is much more internal anyway. (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("must specify restore_command when standby mode is not enabled"))); + errmsg("must specify restore_command or a restore_library that defines " + "a restore callback when standby mode is not enabled"))); This is long. Shouldn't this be split into an errdetail() to list all the options at hand? - if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0') - ereport(ERROR, - (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("both archive_command and archive_library set"), - errdetail("Only one of archive_command, archive_library may be set."))); + CheckMutuallyExclusiveGUCs(XLogArchiveLibrary, "archive_library", + XLogArchiveCommand, "archive_command"); The introduction of this routine could be a patch on its own, as it impacts the archiving path. + CheckMutuallyExclusiveGUCs(restoreLibrary, "restore_library", + archiveCleanupCommand, "archive_cleanup_command"); + if (strcmp(prevRestoreLibrary, restoreLibrary) != 0 || + strcmp(prevArchiveCleanupCommand, archiveCleanupCommand) != 0) + { + call_recovery_module_shutdown_cb(0, (Datum) 0); + LoadRecoveryCallbacks(); + } + + pfree(prevRestoreLibrary); + pfree(prevArchiveCleanupCommand); Hm.. The callers of CheckMutuallyExclusiveGUCs() with the new ERROR paths they introduce need a close lookup. As far as I can see this concerns four areas depending on the three restore commands (restore_command and recovery_end command for startup, archive_cleanup_command for the checkpointer): - Startup process initialization, as of validateRecoveryParameters() where the postmaster GUCs for the recovery target are evaluated. This one is an early stage which is fine. - Startup reloading, as of StartupRereadConfig(). This code could involve a WAL receiver restart depending on a change in the slot change or in primary_conninfo, and force at the same time an ERROR because of conflicting recovery library and command configuration. This one should be safe because pendingWalRcvRestart would just be considered later on by the startup process itself while waiting for WAL to become available. Still this could deserve a comment? Even if there is a misconfiguration, a reload on a standby would enforce a FATAL in the startup process, taking down the whole server. - Checkpointer initialization, as of CheckpointerMain(). A configuration failure in this code path, aka server startup, causes the server to loop infinitely on FATAL with the misconfiguration showing up all the time.. This is a problem. - Last comes the checkpointer GUC reloading, as of HandleCheckpointerInterrupts(), with a second problem. This introduces a failure path where ConfigReloadPending is processed at the same time as ShutdownRequestPending based on the way it is coded, interacting with what would be a normal shutdown in some cases? And actually, if you enforce a misconfiguration on reload, the checkpointer reports an error but it does not enforce a process restart, hence it keeps around the new, incorrect, configuration while waiting for a new checkpoint to happen once restore_library and archive_cleanup_command are set. This could lead to surprises, IMO. Upgrading to a FATAL in this code path triggers an infinite loop, like the startup path. If the archive_cleanup_command ballback of a restore library triggers a FATAL, it seems to me that it would continuously trigger a server restart, actually. Perhaps that's something to document, in comparison to the safe fallbacks of the shell command where we don't force an ERROR to give priority to the stability of the checkpointer. -- Michael
signature.asc
Description: PGP signature