Hello hackers,
While studying bug #18158, I've come to the conclusion that the existing
testing infrastructure is unable to detect abnormal situations. of some
kind.
Just a simple example:
With Assert(0) injected in walsender (say:
sed "s/WalSndDone(send_data)/Assert(0)/" -i src/backend/replication/walsender.c
), I observe the following:
$ make -s check -C src/test/recovery PROVE_TESTS="t/012*"
t/012_subtransactions.pl .. ok
All tests successful.
(The same with meson:
1/1 postgresql:recovery / recovery/012_subtransactions OK 3.24s
12 subtests passed)
But:
$ grep TRAP src/test/recovery/tmp_check/log/*
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
2528, PID: 373729
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
2528, PID: 373750
src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
2528, PID: 373821
src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed Assert("0"), File: "walsender.c", Line:
2528, PID: 373786
src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains:
...
2023-12-09 03:23:20.210 UTC [375933] LOG: server process (PID 375975) was
terminated by signal 6: Aborted
2023-12-09 03:23:20.210 UTC [375933] DETAIL: Failed process was running:
START_REPLICATION 0/3000000 TIMELINE 3
2023-12-09 03:23:20.210 UTC [375933] LOG: terminating any other active server
processes
2023-12-09 03:23:20.210 UTC [375933] LOG: abnormal database system shutdown
2023-12-09 03:23:20.211 UTC [375933] LOG: database system is shut down
...
So the shutdown definitely was considered abnormal, but we missed that.
With log_min_messages = DEBUG3, I can see in the log:
2023-12-09 03:26:50.145 UTC [377844] LOG: abnormal database system shutdown
2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 0 before_shmem_exit
callbacks to make
2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 5 on_shmem_exit
callbacks to make
2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up orphaned dynamic
shared memory with ID 2898643884
2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up dynamic shared memory
control segment with ID 3446966170
2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(1): 2 callbacks to make
2023-12-09 03:26:50.146 UTC [377844] LOG: database system is shut down
2023-12-09 03:26:50.146 UTC [377844] DEBUG: exit(1)
2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(-1): 0 callbacks to make
The postmaster process exits with exit code 1, but pg_ctl can't get the
code and just reports that stop was completed successfully.
Should this be improved? And if yes, how it can be done?
Maybe postmaster shouldn't remove it's postmaster.pid when it exits
abnormally to let pg_ctl know of it?
Best regards,
Alexander