On Sat, Dec 9, 2023 at 9:30 AM Alexander Lakhin <exclus...@gmail.com> wrote:
>
> Hello hackers,
>
> While studying bug #18158, I've come to the conclusion that the existing
> testing infrastructure is unable to detect abnormal situations. of some
> kind.
>
> Just a simple example:
> With Assert(0) injected in walsender (say:
> sed "s/WalSndDone(send_data)/Assert(0)/" -i 
> src/backend/replication/walsender.c
> ), I observe the following:
> $ make -s check -C src/test/recovery PROVE_TESTS="t/012*"
>
> t/012_subtransactions.pl .. ok
> All tests successful.
>
> (The same with meson:
> 1/1 postgresql:recovery / recovery/012_subtransactions OK                
> 3.24s   12 subtests passed)
>
> But:
> $ grep TRAP src/test/recovery/tmp_check/log/*
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed 
> Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373729
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed 
> Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373750
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log:TRAP: failed 
> Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373821
> src/test/recovery/tmp_check/log/012_subtransactions_standby.log:TRAP: failed 
> Assert("0"), File: "walsender.c", Line:
> 2528, PID: 373786
>
> src/test/recovery/tmp_check/log/012_subtransactions_primary.log contains:
> ...
> 2023-12-09 03:23:20.210 UTC [375933] LOG:  server process (PID 375975) was 
> terminated by signal 6: Aborted
> 2023-12-09 03:23:20.210 UTC [375933] DETAIL:  Failed process was running: 
> START_REPLICATION 0/3000000 TIMELINE 3
> 2023-12-09 03:23:20.210 UTC [375933] LOG:  terminating any other active 
> server processes
> 2023-12-09 03:23:20.210 UTC [375933] LOG:  abnormal database system shutdown
> 2023-12-09 03:23:20.211 UTC [375933] LOG:  database system is shut down
> ...
>
> So the shutdown definitely was considered abnormal, but we missed that.
>
> With log_min_messages = DEBUG3, I can see in the log:
> 2023-12-09 03:26:50.145 UTC [377844] LOG:  abnormal database system shutdown
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 0 
> before_shmem_exit callbacks to make
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG:  shmem_exit(1): 5 on_shmem_exit 
> callbacks to make
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up orphaned dynamic 
> shared memory with ID 2898643884
> 2023-12-09 03:26:50.145 UTC [377844] DEBUG:  cleaning up dynamic shared 
> memory control segment with ID 3446966170
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(1): 2 callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] LOG:  database system is shut down
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  exit(1)
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 
> before_shmem_exit callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  shmem_exit(-1): 0 on_shmem_exit 
> callbacks to make
> 2023-12-09 03:26:50.146 UTC [377844] DEBUG:  proc_exit(-1): 0 callbacks to 
> make
>
> The postmaster process exits with exit code 1, but pg_ctl can't get the
> code and just reports that stop was completed successfully.
>

For what it's worth, there is another thread which stated the similar problem:
https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us

> Should this be improved? And if yes, how it can be done?
> Maybe postmaster shouldn't remove it's postmaster.pid when it exits
> abnormally to let pg_ctl know of it?
>


thanks
Shveta


Reply via email to