On Sun, Apr 05, 2020 at 11:36:49PM -0700, Noah Misch wrote:
> Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is
> important for promptly updating pg_stat_replication.  When caught up, we
> should impose that logic before every sleep.  The one-line fix is to sleep in
> WalSndLoop() only when pq_is_send_pending(), not when caught up.

This seems to have made the following race condition easier to hit:
https://www.postgresql.org/message-id/flat/20200206074552.GB3326097%40rfd.leadboat.com
https://www.postgresql.org/message-id/flat/21519.1585272409%40sss.pgh.pa.us

Now it happened eight times in three days, all on BSD machines:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2018%3A30%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2018%3A45%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2020%3A30%3A26
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2021%3A45%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2010%3A45%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-04-13%2016%3A00%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2018%3A45%3A34
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2023%3A45%3A22

While I don't think that indicates anything wrong with the fix for $SUBJECT,
creating more buildfarm noise is itself bad.  I am inclined to revert the fix
after a week.  Not immediately, in case it uncovers lower-probability bugs.
I'd then re-commit it after one of those threads fixes the other bug.  Would
anyone like to argue for a revert earlier, later, or not at all?


There was a novel buildfarm failure, in 010_logical_decoding_timelines.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2008%3A35%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2017%3A15%3A01

Most-relevant lines of the test script:

        $node_master->safe_psql('postgres',
                "INSERT INTO decoding(blah) VALUES ('afterbb');");
        $node_master->safe_psql('postgres', 'CHECKPOINT');
        $node_master->stop('immediate');

The failure suggested the INSERT was not replicated before the immediate stop.
I can reproduce that consistently, before or after the fix for $SUBJECT, by
modifying walsender to delay 0.2s before sending WAL:

--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
 #include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -2781,2 +2782,5 @@ retry:
 
+       PG_SETMASK(&BlockSig);
+       pg_usleep(200 * 1000);
+       PG_SETMASK(&UnBlockSig);
        pq_putmessage_noblock('d', output_message.data, output_message.len);

I will shortly push a fix adding a wait_for_catchup to the test.  I don't know
if/how fixing $SUBJECT made this 010_logical_decoding_timelines.pl race
condition easier to hit.


Reply via email to