On Sat, Dec 9, 2023 at 9:30 AM Alexander Lakhin <exclus...@gmail.com> wrote: > > Hello hackers, > > While studying bug #18158, I've come to the conclusion that the existing > testing infrastructure is unable to detect abnormal situations. of some > kind. > > Just a simple example: > With Assert(0) injected in walsender (say: > sed "s/WalSndDone(send_data)/Assert(0)/" -i > src/backend/replication/walsender.c > ), I observe the following: > $ make -s check -C src/test/recovery PROVE_TESTS="t/012*" > > t/012_subtransactions.pl .. ok > All tests successful. > > (The same with meson: > 1/1 postgresql:recovery / recovery/012_subtransactions OK > 3.24s 12 subtests passed) > > But: > $ grep TRAP src/test/recovery/tmp_check/log/* > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373729 > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373750 > src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373821 > src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed > Assert("0"), File: "walsender.c", Line: > 2528, PID: 373786 > > src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains: > ... > 2023-12-09 03:23:20.210 UTC [375933] LOG: server process (PID 375975) was > terminated by signal 6: Aborted > 2023-12-09 03:23:20.210 UTC [375933] DETAIL: Failed process was running: > START_REPLICATION 0/3000000 TIMELINE 3 > 2023-12-09 03:23:20.210 UTC [375933] LOG: terminating any other active > server processes > 2023-12-09 03:23:20.210 UTC [375933] LOG: abnormal database system shutdown > 2023-12-09 03:23:20.211 UTC [375933] LOG: database system is shut down > ... > > So the shutdown definitely was considered abnormal, but we missed that. > > With log_min_messages = DEBUG3, I can see in the log: > 2023-12-09 03:26:50.145 UTC [377844] LOG: abnormal database system shutdown > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 0 > before_shmem_exit callbacks to make > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: shmem_exit(1): 5 on_shmem_exit > callbacks to make > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up orphaned dynamic > shared memory with ID 2898643884 > 2023-12-09 03:26:50.145 UTC [377844] DEBUG: cleaning up dynamic shared > memory control segment with ID 3446966170 > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(1): 2 callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] LOG: database system is shut down > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: exit(1) > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 > before_shmem_exit callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: shmem_exit(-1): 0 on_shmem_exit > callbacks to make > 2023-12-09 03:26:50.146 UTC [377844] DEBUG: proc_exit(-1): 0 callbacks to > make > > The postmaster process exits with exit code 1, but pg_ctl can't get the > code and just reports that stop was completed successfully. >
For what it's worth, there is another thread which stated the similar problem: https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us > Should this be improved? And if yes, how it can be done? > Maybe postmaster shouldn't remove it's postmaster.pid when it exits > abnormally to let pg_ctl know of it? > thanks Shveta