I wrote:
> This looks to me like it's probably a low-probability timing problem in
> the test script itself.
> ...
> We could maybe hack around that with a short "sleep" in this script, but
> a less cruddy solution would be to change PostgresNode::reload so that
> it somehow waits for the server to process the reload signal.
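For concreteness, one hypothetical way to do that --- at the test-script
level rather than inside PostgresNode::reload itself --- would be to poll
after the reload until a fresh connection reports the new value, which
shows the postmaster has re-read the config file.  A rough sketch, using
hot_standby_feedback = on as the example change and the node names from
001_stream_rep.pl:

    # Hypothetical sketch: reload standby_2, then wait until a new
    # connection actually sees the changed setting, proving the SIGHUP
    # has been processed.
    $node_standby_2->reload;
    $node_standby_2->poll_query_until('postgres',
        "SELECT setting = 'on' FROM pg_settings WHERE name = 'hot_standby_feedback'")
      or die "timed out waiting for standby_2 to absorb the reload";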
On further reflection, I doubt that would help.  The issue isn't whether
the standby_2 postmaster has processed the SIGHUP, but whether its
walreceiver process has sent a feedback message.  The loop in
WalReceiverMain does things in this order:

1. Check for SIGTERM, exit if seen.
2. Check for SIGHUP, process that (and possibly send a feedback message)
   if seen.
3. Read and process streaming data till no more is available.

We've seen just a tiny number of failures at this point in the test ---
scraping the buildfarm database, I find these:

   sysname    |      snapshot       |     stage     |                                       l
--------------+---------------------+---------------+-------------------------------------------------------------------------------
 longfin      | 2017-09-26 14:39:05 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
 flaviventris | 2018-04-07 17:02:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
 serinus      | 2018-05-18 20:14:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 155.
 idiacanthus  | 2018-06-23 12:46:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
 francolin    | 2018-06-29 13:30:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
 flaviventris | 2018-07-12 07:36:01 | recoveryCheck | Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.
(6 rows)

So it's clearly a very low-probability race condition.  My hypothesis is
that the SIGHUP arrives, but the walreceiver process is already in the
read-and-process-data part of its loop, and the timing is such that it
manages to stay there until we issue the standby_2 shutdown command.
When it finally notices that it has no more data available, it iterates
the outermost loop, and then shuts down without ever having sent a
feedback message.  The "replay_check()" call in the test script doesn't
ensure any synchronization here, since it only verifies that WAL data has
been applied, not that any feedback has happened.

I can reproduce the failure pretty reliably with a hack like the one
attached, which makes it unlikely that the walreceiver will send a
feedback message instantly when it gets the SIGHUP.

So the issue boils down to this: the test script is, effectively,
assuming that it's guaranteed that the walreceiver will send a feedback
message before it shuts down; but there is no such guarantee.

Is this a bug in the test script, or a bug in the walreceiver logic?
I can see the value of having such a guarantee, but I also think it would
be nigh impossible to make it a bulletproof guarantee.  We could perhaps
add "XLogWalRcvSendHSFeedback(true)" somewhere closer to process exit,
but that might add more failure modes than it removes.  Or we could
change the test script to wait for feedback before it issues the
shutdown, but then I think the test is a bit pointless.

Comments?

			regards, tom lane
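PS: to make the "wait for feedback before it issues the shutdown" option
concrete, I think it would amount to something like the rough sketch
below, inserted just before the standby_2 shutdown.  This reuses the node
and slot names from 001_stream_rep.pl, and assumes the cascading setup
where standby_2 streams from standby_1 using $slotname_2 (per the
attached test hunk), so that "feedback has arrived" means the slot's xmin
on standby_1 has been set:

    # Hypothetical sketch: don't shut standby_2 down until its
    # walreceiver's feedback has actually shown up in the slot it
    # streams from on standby_1.
    $node_standby_1->poll_query_until('postgres',
        "SELECT xmin IS NOT NULL FROM pg_replication_slots
         WHERE slot_name = '$slotname_2'")
      or die "timed out waiting for feedback from standby_2";
    $node_standby_2->stop;

But as I said above, once the script waits for that explicitly, I'm not
sure the test proves much anymore.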
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 987bb84..d7bb1b4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -447,7 +447,7 @@ WalReceiverMain(void)
 			{
 				got_SIGHUP = false;
 				ProcessConfigFile(PGC_SIGHUP);
-				XLogWalRcvSendHSFeedback(true);
+				XLogWalRcvSendHSFeedback(false);
 			}
 
 			/* See if we can read data immediately */
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index a0d3e8f..8f8dbc5 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -145,7 +145,7 @@ is( $node_standby_1->psql(
 $node_standby_2->append_conf('recovery.conf',
 	"primary_slot_name = $slotname_2");
 $node_standby_2->append_conf('postgresql.conf',
-	"wal_receiver_status_interval = 1");
+	"wal_receiver_status_interval = 10");
 $node_standby_2->restart;
 
 # Fetch xmin columns from slot's pg_replication_slots row, after waiting for