On Thu, Aug 26, 2021 at 2:45 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> I think it’s possible that the order in which the processes write their
> disconnection logs and the order in which they set walsender's pid to 0
> are mismatched. We set walsender's pid to 0 in WalSndKill(), which is
> called as an on_shmem_exit callback. Once the pid is 0,
> pg_stat_replication doesn't show the entry. On the other hand,
> disconnection logs are written by log_disconnections(), which is called
> as an on_proc_exit callback. That is, the following sequence could
> happen:
>
> 1. the second walsender (pid = 16475) raises an error as the slot is
>    already active (held by the first walsender).
> 2. the first walsender (pid = 16336) clears its pid in shared memory.
> 3. the polling query checks the walsender’s pid, and returns true
>    (since there is only the second walsender now).
> 4. the second walsender clears its pid in shared memory.
> 5. the second walsender writes its disconnection log.
> 6. the first walsender writes its disconnection log.

I agree with this. I'm attaching a patch on HEAD that modifies this
particular script to also consider the state of the walsender.

regards,
Ajin Cherian
Fujitsu Australia
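
For illustration, steps 1 to 3 of the sequence above can be modelled with a small Python sketch. This is a simplified model, not the TAP test or the attached patch. It assumes that pg_stat_replication can be represented as a list of rows, that the poll succeeds only when the query yields a single true value, and that the failing second walsender is still in the 'startup' state. The function names are hypothetical.

```python
# Simplified model of the race; not PostgreSQL code.
OLD_PID, NEW_PID = 16336, 16475  # pids taken from the sequence above


def poll_pid_only(rows, oldpid):
    """Model of a poll that only compares pids.

    Assumption: the poll succeeds when the query returns exactly one
    row and that row's pid differs from oldpid.
    """
    return [r["pid"] != oldpid for r in rows] == [True]


def poll_pid_and_state(rows, oldpid):
    """Same poll, but restricted to walsenders that are streaming."""
    streaming = [r for r in rows if r["state"] == "streaming"]
    return [r["pid"] != oldpid for r in streaming] == [True]


# After step 1: both walsenders are visible. The 'startup' state of the
# second walsender is an assumption made for this sketch.
rows_after_step1 = [
    {"pid": OLD_PID, "state": "streaming"},
    {"pid": NEW_PID, "state": "startup"},
]

# Step 2: the first walsender clears its pid, so its row disappears.
rows_after_step2 = [r for r in rows_after_step1 if r["pid"] != OLD_PID]

# Step 3: the pid-only poll now succeeds although the only visible
# walsender is the one that is about to exit with an error.
pid_only_result = poll_pid_only(rows_after_step2, OLD_PID)

# A state-aware poll keeps waiting in the same situation.
state_aware_result = poll_pid_and_state(rows_after_step2, OLD_PID)
```

In this model the pid-only check passes at step 3 while the state-aware check does not, which is the kind of false positive that checking the walsender state is meant to avoid.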
v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch
Description: Binary data