Re: Race condition in InvalidateObsoleteReplicationSlots()

Tom Lane Sun, 20 Jun 2021 11:37:36 -0700

I wrote:
> I studied this failure a bit more, and I think the test itself has
> a race condition.  It's doing
>
> # freeze walsender and walreceiver. Slot will still be active, but walreceiver
> # won't get anything anymore.
> kill 'STOP', $senderpid, $receiverpid;
> $logstart = get_log_size($node_primary3);
> advance_wal($node_primary3, 4);
> ok(find_in_log($node_primary3, "to release replication slot", $logstart),
>       "walreceiver termination logged");


Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond.  But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart.  I think you
need to capture that value before the kill not after.

I also suggest that it wouldn't be a bad idea to make the
find_in_log check more specific, by including the expected PID
and perhaps the expected slot name in the string.  The full
message in primary3's log looks like

2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG:  terminating process 
3548959 to release replication slot "rep3"

and I don't understand why we wouldn't match on the whole
message text.  (I think doing so will also reveal that what
we are looking for here is the walsender pid, not the walreceiver
pid, and thus that the description in the ok() call is backwards.
Or maybe we do want to check the walreceiver side, in which case
we are searching the wrong postmaster's log?)

                        regards, tom lane

Re: Race condition in InvalidateObsoleteReplicationSlots()

Reply via email to