How abnormal server shutdown could be detected by tests?

Alexander Lakhin Fri, 08 Dec 2023 20:00:29 -0800

Hello hackers,

While studying bug #18158, I've come to the conclusion that the existing
testing infrastructure is unable to detect abnormal situations. of some
kind.


Just a simple example:
With Assert(0) injected in walsender (say:
sed "s/WalSndDone(send_data)/Assert(0)/" -i src/backend/replication/walsender.c
), I observe the following:
$ make -s check -C src/test/recovery PROVE_TESTS="t/012*"

t/012_subtransactions.pl .. ok
All tests successful.

(The same with meson:
1/1 postgresql:recovery / recovery/012_subtransactions OK                3.24s  
 12 subtests passed)

But:
$ grep TRAP src/test/recovery/tmp_check/log/*

src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:2528, PID: 373729src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:2528, PID: 373750src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed Assert("0"), File: "walsender.c", Line:2528, PID: 373821src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed Assert("0"), File: "walsender.c", Line:2528, PID: 373786


src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains:
...
2023-12-09 03:23:20.210 UTC [375933] LOG:  server process (PID 375975) was 
terminated by signal 6: Aborted
2023-12-09 03:23:20.210 UTC [375933] DETAIL:  Failed process was running: 
START_REPLICATION 0/3000000 TIMELINE 3
2023-12-09 03:23:20.210 UTC [375933] LOG:  terminating any other active server 
processes
2023-12-09 03:23:20.210 UTC [375933] LOG:  abnormal database system shutdown
2023-12-09 03:23:20.211 UTC [375933] LOG:  database system is shut down
...

So the shutdown definitely was considered abnormal, but we missed that.

With log_min_messages = DEBUG3, I can see in the log:
2023-12-09 03:26:50.145 UTC [377844] LOG:  abnormal database system shutdown
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 0 before_shmem_exit 
callbacks to make
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 5 on_shmem_exit 
callbacks to make
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up orphaned dynamic 
shared memory with ID 2898643884
2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up dynamic shared memory 
control segment with ID 3446966170
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(1): 2 callbacks to make
2023-12-09 03:26:50.146 UTC [377844] LOG:  database system is shut down
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  exit(1)
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 
before_shmem_exit callbacks to make
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 on_shmem_exit 
callbacks to make
2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(-1): 0 callbacks to make

The postmaster process exits with exit code 1, but pg_ctl can't get the
code and just reports that stop was completed successfully.

Should this be improved? And if yes, how it can be done?
Maybe postmaster shouldn't remove it's postmaster.pid when it exits
abnormally to let pg_ctl know of it?

Best regards,
Alexander

How abnormal server shutdown could be detected by tests?

Reply via email to