On Thu, May 13, 2021 at 3:12 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> I noticed $SUBJECT in curculio.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-05-11%2018%3A30%3A23
>
> The failure test report:
> t/020_messages.pl                (Wstat: 7424 Tests: 1 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan.  You planned 5 tests but ran 1.
> Files=21, Tests=149, 79 wallclock secs ( 0.03 usr  0.03 sys + 22.36
> cusr  9.47 csys = 31.89 CPU)
> Result: FAIL
> Makefile:21: recipe for target 'check' failed
> gmake: *** [check] Error 1
>
> As far as I can understand the failure happens due to error:
> 2021-05-11 20:41:38.356 CEST [31177:4] 020_messages.pl ERROR:
> replication slot "tap_sub" is active for PID 7354
> 2021-05-11 20:41:38.356 CEST [31177:5] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 1), encode(substr(data, 11, 8), 'escape')
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> OFFSET 1 LIMIT 1
>
> As per my initial analysis, this happens because in the test after
> creating a subscription, we are not waiting for the subscriber to
> catch up. I think there is a narrow window where it is possible that
> after creating a subscription, the apply worker starts replication
> even after we have done Alter Subscription .. DISABLE. This is also
> visible in buildfarm logs:
>
> 2021-05-11 20:41:38.345 CEST [4046:9] 020_messages.pl STATEMENT:
> SELECT get_byte(data, 0)
> FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
> 'proto_version', '1',
> 'publication_names', 'tap_pub',
> 'messages', 'true')
> ...
>
> 2021-05-11 20:41:38.354 CEST [7354:6] tap_sub LOG:  received
> replication command: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0
> (proto_version '2', publication_names '"tap_pub"')
>
> Notice that we have checked the messages before receiving the start
> replication command from the subscriber. Now, if we would have waited
> for the subscription to catch up then our check in pg_stat_replication
> wouldn't have given a false positive. I have noticed that we do have
> such wait in other subscription tests but somehow missed it here.

I agree with your analysis, apart from this issue there is one other
small problem. Currently after disabling subscription
pg_stat_replication is used to verify if the walsender is exited. The
steps of the walsender exit includes cleaning up of walsender and then
releasing a replication slot. There will be random test failure if
pg_logical_slot_peek_binary_changes is called in this narrow time
window. I think it is better to fix it by checking the active column
in pg_replication_slot instead of pg_stat_replication which is more
reliable.
Attached patch which has the fixes for the same.
Thoughts?

Regards,
Vignesh
From 5cbfa5614db18925ca92c5e2c7310ff6520b8397 Mon Sep 17 00:00:00 2001
From: vignesh <vignes...@gmail.com>
Date: Thu, 13 May 2021 16:00:43 +0530
Subject: [PATCH v1] Messages test failure fix.

This test was failing because there was no wait for catch up after
creating subscription. After disabling subscription pg_stat_replication
was used to verify if the walsender is exited. The steps of walsender
exit includes cleaning up of walsendr and then releasing replication
slot. There will be random test failure if pg_logical_slot_peek_binary_changes
is called in the narrow time window. Fixed it by checking
active column in pg_replication_slot instead of pg_stat_replication
which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index e5d48ec8a0..7b385b48e3 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -34,12 +34,14 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
 );
 
+$node_publisher->wait_for_catchup('tap_sub');
+
 # Ensure a transactional logical decoding message shows up on the slot
 $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -94,7 +96,7 @@ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres', "INSERT INTO tab_test VALUES (1)");
 
@@ -122,7 +124,7 @@ $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 # Ensure a non-transactional logical decoding message shows up on the slot when
 # it is emitted within an aborted transaction. The message won't emit until
-- 
2.25.1

Reply via email to