I wrote: > I studied this failure a bit more, and I think the test itself has > a race condition. It's doing > > # freeze walsender and walreceiver. Slot will still be active, but walreceiver > # won't get anything anymore. > kill 'STOP', $senderpid, $receiverpid; > $logstart = get_log_size($node_primary3); > advance_wal($node_primary3, 4); > ok(find_in_log($node_primary3, "to release replication slot", $logstart), > "walreceiver termination logged");
Actually ... isn't there a second race, in the opposite direction? IIUC, the point of this is that once we force some WAL to be sent to the frozen sender/receiver, they'll be killed for failure to respond. But the advance_wal call is not the only possible cause of that; a background autovacuum for example could emit some WAL. So I fear it's possible for the 'to release replication slot' message to come out before we capture $logstart. I think you need to capture that value before the kill not after. I also suggest that it wouldn't be a bad idea to make the find_in_log check more specific, by including the expected PID and perhaps the expected slot name in the string. The full message in primary3's log looks like 2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG: terminating process 3548959 to release replication slot "rep3" and I don't understand why we wouldn't match on the whole message text. (I think doing so will also reveal that what we are looking for here is the walsender pid, not the walreceiver pid, and thus that the description in the ok() call is backwards. Or maybe we do want to check the walreceiver side, in which case we are searching the wrong postmaster's log?) regards, tom lane