Hi,

A colleague debugged an issue where their postgres was occasionally crash-restarting under load.

The cause turned out to be a relatively complex archive_command, in which a bash subshell pipeline could fail in some rare circumstances. It wasn't at all obvious why that'd cause a crash though - the archive command handles the error.

The issue turns out to be that postgres was in a container, with pid namespaces enabled. Because postgres was run directly in the container, without a parent process inside, it becomes pid 1. Which mostly works without a problem. Until, as was the case here with the archive command, a sub-sub process exits while it still has a child. Then that child gets re-parented to postmaster (as init).

Such a child is likely to exit with something other than 0 or 1. As the pid won't match anything in reaper(), we'll go to CleanupBackend(). Where any exit status but 0/1 will unconditionally trigger a restart:

	if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
	{
		HandleChildCrash(pid, exitstatus, _("server process"));
		return;
	}

This kind of thing is pretty hard to debug, because it's not easy to even figure out what the "crashing" pid belonged to.

I wonder if we should work a bit harder to try to identify whether an exiting process was a "server process" before identifying it as such?

And perhaps we ought to warn about postgres running as "init" unless we make that robust?

Greetings,

Andres Freund
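
PS: For anyone who wants to see the re-parenting without setting up a container, here is a minimal sketch. It is not postgres code, and it rests on a few assumptions: it is Linux-only, it uses prctl(PR_SET_CHILD_SUBREAPER) as a stand-in for actually being pid 1, and reap_orphan() is a made-up helper name. The intermediate process plays the role of the subshell that exits while its own child is still running.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not postgres code: returns the exit code of an
 * orphaned grandchild as collected by the calling process, or -1 on error.
 */
static int
reap_orphan(int orphan_exit_code)
{
	int		gate[2];
	int		status;
	int		result = -1;
	pid_t	child;
	pid_t	pid;

	/*
	 * Assumption: Linux. Orphaned descendants get re-parented to us, much
	 * as they would be to pid 1 inside a pid namespace.
	 */
	if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0)
		return -1;
	if (pipe(gate) != 0)
		return -1;

	child = fork();
	if (child < 0)
	{
		close(gate[0]);
		close(gate[1]);
		return -1;
	}
	if (child == 0)
	{
		/* intermediate process, think: a subshell of archive_command */
		if (fork() == 0)
		{
			char	c;

			close(gate[1]);
			/* EOF arrives only once the intermediate process is gone */
			if (read(gate[0], &c, 1) < 0)
				_exit(127);
			_exit(orphan_exit_code);
		}
		_exit(0);			/* exit without waiting for our own child */
	}

	close(gate[0]);
	close(gate[1]);

	/* wait() now also hands us a pid we never forked ourselves */
	while ((pid = wait(&status)) > 0)
	{
		if (pid != child && WIFEXITED(status))
			result = WEXITSTATUS(status);
	}
	return result;
}
```

Calling reap_orphan(2) hands the caller an exit code of 2 for a pid it has no record of, which is the situation where the check quoted above treats the exit as a crashed "server process".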