Hello,

While developing a patch and running regression tests, I noticed that the postmaster could fail to shut down right after a crash restart. It could get stuck in the PM_WAIT_BACKENDS state forever.

As far as I understand, the problem occurs when a shutdown signal is received before getting PMSIGNAL_RECOVERY_STARTED from the startup process. In that case the FatalError flag is not cleared, and the postmaster is stuck in PM_WAIT_BACKENDS waiting for the checkpointer, which ignores SIGTERM.

To reproduce the problem easily, I added a pg_usleep() call in xlogrecovery.c just before SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED); see the attached patch.

Then I ran a script that simulates a crash and then runs pg_ctl stop:

$ ./init.sh
[...]

$ ./stop-after-crash.sh
waiting for server to start.... done
server started
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down


Some processes are still alive:

$ ps uf -C postgres
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sergey    279874  0.0  0.0 222816 28560 ?        Ss   14:25   0:00 /home/sergey/pgwork/devel/install/bin/postgres -D data
sergey    279887  0.0  0.0 222772  5664 ?        Ss   14:25   0:00  \_ postgres: io worker 0
sergey    279888  0.0  0.0 222772  5664 ?        Ss   14:25   0:00  \_ postgres: io worker 1
sergey    279889  0.0  0.0 222772  5664 ?        Ss   14:25   0:00  \_ postgres: io worker 2
sergey    279891  0.0  0.0 222884  8480 ?        Ss   14:25   0:00  \_ postgres: checkpointer


Here is an excerpt from the debug log:

postmaster[279874] LOG:  all server processes terminated; reinitializing
startup[279890] LOG:  database system was interrupted; last known up at 2025-04-24 14:25:58 MSK
startup[279890] LOG:  database system was not properly shut down; automatic recovery in progress

postmaster[279874] DEBUG:  postmaster received shutdown request signal
postmaster[279874] LOG:  received fast shutdown request
postmaster[279874] DEBUG:  updating PMState from PM_STARTUP to PM_STOP_BACKENDS
postmaster[279874] DEBUG:  sending signal 15/SIGTERM to background writer process with pid 279892
postmaster[279874] DEBUG:  sending signal 15/SIGTERM to checkpointer process with pid 279891
postmaster[279874] DEBUG:  sending signal 15/SIGTERM to startup process with pid 279890
postmaster[279874] DEBUG:  sending signal 15/SIGTERM to io worker process with pid 279889
postmaster[279874] DEBUG:  sending signal 15/SIGTERM to io worker process with pid 279888
postmaster[279874] DEBUG:  sending signal 15/SIGTERM to io worker process with pid 279887
postmaster[279874] DEBUG:  updating PMState from PM_STOP_BACKENDS to PM_WAIT_BACKENDS

startup[279890] LOG:  invalid record length at 0/175A4D8: expected at least 24, got 0
postmaster[279874] DEBUG:  postmaster received pmsignal signal
startup[279890] LOG:  redo is not required

checkpointer[279891] LOG:  checkpoint starting: end-of-recovery immediate wait
checkpointer[279891] LOG:  checkpoint complete: wrote 0 buffers (0.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.007 s, sync=0.002 s, total=0.026 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/175A4D8, redo lsn=0/175A4D8

startup[279890] DEBUG:  exit(0)
postmaster[279874] DEBUG:  updating PMState from PM_WAIT_BACKENDS to PM_WAIT_BACKENDS

checkpointer[279891] DEBUG:  checkpoint skipped because system is idle
checkpointer[279891] DEBUG:  checkpoint skipped because system is idle


I don't know how to fix this, but I thought it was worth reporting.

Best regards,

--
Sergey Shinderuk                https://postgrespro.com/
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..19ee8b09685 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1696,7 +1696,10 @@ PerformWalRecovery(void)
 	 * archiver if necessary.
 	 */
 	if (IsUnderPostmaster)
+	{
+		pg_usleep(3000000);
 		SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+	}
 
 	/*
 	 * Allow read-only connections immediately if we're consistent already.

Attachment: init.sh
Description: application/shellscript

Attachment: stop-after-crash.sh
Description: application/shellscript

Attachment: logfile.gz
Description: application/gzip
