Hello,

While developing a patch and running regression tests I noticed that the postmaster could fail to shut down right after crash restart. It could get stuck in the PM_WAIT_BACKENDS state forever.
As far as I understand, the problem occurs when a shutdown signal is received before getting PMSIGNAL_RECOVERY_STARTED from the startup process. In that case the FatalError flag is not cleared, and the postmaster is stuck in PM_WAIT_BACKENDS waiting for the checkpointer, which ignores SIGTERM.
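To make the suspected ordering problem concrete, here is a rough toy model of it in C. It is not the postmaster code: every `toy_`/`TOY_` name is made up for illustration, the rule that the recovery-started signal is acted on only while no shutdown is pending is an assumption, and so is the "normal" branch in which the checkpointer is asked to exit explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy states, only loosely named after the real PMState values. */
typedef enum
{
	TOY_PM_STARTUP,
	TOY_PM_RECOVERY,
	TOY_PM_WAIT_BACKENDS,
	TOY_PM_DONE
} ToyPMState;

typedef struct
{
	ToyPMState	state;
	bool		fatal_error;		/* set during crash restart */
	bool		shutdown_requested;
	bool		checkpointer_alive;
} ToyPostmaster;

/* The startup process reports PMSIGNAL_RECOVERY_STARTED. */
static void
toy_recovery_started(ToyPostmaster *pm)
{
	/* Assumption: acted on only in startup state with no shutdown pending. */
	if (pm->state == TOY_PM_STARTUP && !pm->shutdown_requested)
	{
		pm->fatal_error = false;
		pm->state = TOY_PM_RECOVERY;
	}
}

/* A shutdown request: children get SIGTERM, which the checkpointer ignores. */
static void
toy_shutdown_request(ToyPostmaster *pm)
{
	assert(pm->state == TOY_PM_STARTUP || pm->state == TOY_PM_RECOVERY);
	pm->shutdown_requested = true;
	pm->state = TOY_PM_WAIT_BACKENDS;
}

/* One step of the toy state machine while waiting for children. */
static void
toy_advance(ToyPostmaster *pm)
{
	if (pm->state != TOY_PM_WAIT_BACKENDS)
		return;

	if (pm->fatal_error)
	{
		/* Waits for the checkpointer to exit by itself; it never does. */
		if (!pm->checkpointer_alive)
			pm->state = TOY_PM_DONE;
	}
	else
	{
		/* Assumed normal path: the checkpointer is told to finish and exit. */
		pm->checkpointer_alive = false;
		pm->state = TOY_PM_DONE;
	}
}

/* Returns true if the toy postmaster reaches TOY_PM_DONE. */
bool
toy_shutdown_completes(bool shutdown_arrives_first)
{
	ToyPostmaster pm = {TOY_PM_STARTUP, true, false, true};

	if (shutdown_arrives_first)
	{
		toy_shutdown_request(&pm);
		toy_recovery_started(&pm);	/* ignored, fatal_error stays set */
	}
	else
	{
		toy_recovery_started(&pm);
		toy_shutdown_request(&pm);
	}
	toy_advance(&pm);
	return pm.state == TOY_PM_DONE;
}
```

In this model `toy_shutdown_completes(false)` returns true, while `toy_shutdown_completes(true)` returns false, which corresponds to the hang described above.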
To easily reproduce the problem I added pg_usleep in xlogrecovery.c just before SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED). See the patch attached.
Then I ran a script that simulates a crash and does pg_ctl stop:

$ ./init.sh
[...]
$ ./stop-after-crash.sh
waiting for server to start.... done
server started
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down

Some processes are still alive:

$ ps uf -C postgres
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sergey    279874  0.0  0.0 222816 28560 ?        Ss   14:25   0:00 /home/sergey/pgwork/devel/install/bin/postgres -D data
sergey    279887  0.0  0.0 222772  5664 ?        Ss   14:25   0:00  \_ postgres: io worker 0
sergey    279888  0.0  0.0 222772  5664 ?        Ss   14:25   0:00  \_ postgres: io worker 1
sergey    279889  0.0  0.0 222772  5664 ?        Ss   14:25   0:00  \_ postgres: io worker 2
sergey    279891  0.0  0.0 222884  8480 ?        Ss   14:25   0:00  \_ postgres: checkpointer
Here is an excerpt from the debug log:

postmaster[279874] LOG: all server processes terminated; reinitializing
startup[279890] LOG: database system was interrupted; last known up at 2025-04-24 14:25:58 MSK
startup[279890] LOG: database system was not properly shut down; automatic recovery in progress
postmaster[279874] DEBUG: postmaster received shutdown request signal
postmaster[279874] LOG: received fast shutdown request
postmaster[279874] DEBUG: updating PMState from PM_STARTUP to PM_STOP_BACKENDS
postmaster[279874] DEBUG: sending signal 15/SIGTERM to background writer process with pid 279892
postmaster[279874] DEBUG: sending signal 15/SIGTERM to checkpointer process with pid 279891
postmaster[279874] DEBUG: sending signal 15/SIGTERM to startup process with pid 279890
postmaster[279874] DEBUG: sending signal 15/SIGTERM to io worker process with pid 279889
postmaster[279874] DEBUG: sending signal 15/SIGTERM to io worker process with pid 279888
postmaster[279874] DEBUG: sending signal 15/SIGTERM to io worker process with pid 279887
postmaster[279874] DEBUG: updating PMState from PM_STOP_BACKENDS to PM_WAIT_BACKENDS
startup[279890] LOG: invalid record length at 0/175A4D8: expected at least 24, got 0
postmaster[279874] DEBUG: postmaster received pmsignal signal
startup[279890] LOG: redo is not required
checkpointer[279891] LOG: checkpoint starting: end-of-recovery immediate wait
checkpointer[279891] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.007 s, sync=0.002 s, total=0.026 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/175A4D8, redo lsn=0/175A4D8
startup[279890] DEBUG: exit(0)
postmaster[279874] DEBUG: updating PMState from PM_WAIT_BACKENDS to PM_WAIT_BACKENDS
checkpointer[279891] DEBUG: checkpoint skipped because system is idle
checkpointer[279891] DEBUG: checkpoint skipped because system is idle

I don't know how to fix this, but thought it's worth reporting.

Best regards,

-- 
Sergey Shinderuk		https://postgrespro.com/
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..19ee8b09685 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1696,7 +1696,10 @@ PerformWalRecovery(void)
 	 * archiver if necessary.
 	 */
 	if (IsUnderPostmaster)
+	{
+		pg_usleep(3000000);
 		SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+	}
 
 	/*
 	 * Allow read-only connections immediately if we're consistent already.
init.sh
Description: application/shellscript
stop-after-crash.sh
Description: application/shellscript
logfile.gz
Description: application/gzip