Hi, The 021_twophase test has failed on Kestrel at [1] with the following error: # Failed test 'should be no prepared transactions on subscriber' # at /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/021_twophase.pl line 438. # got: '1' # expected: '0' # Looks like you failed 1 test of 30.
This failure is caused by a prepared transaction that was not properly committed due to replication lag on one of the subscriptions. The test involves two subscriptions: tap_sub and tap_sub_copy. After committing the prepared transaction 'mygid', the test only waits for tap_sub_copy to catch up: node_publisher->wait_for_catchup($appname_copy); However, tap_sub is dropped before ensuring it has replayed the commit of 'mygid' prepared transaction, leading to a leftover prepared transaction on the subscriber: $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub"); When the test later checks for the number of prepared transactions, it fails because tap_sub had not finished applying the commit: # at line 438 # got: '1' # expected: '0' This issue can be consistently reproduced by injecting a delay (e.g., 3 seconds) in tap_sub's walsender while decoding the commit of 'mygid'. A patch to demonstrate this behavior is provided at 021_two_phase_test_failure_reproduce.patch. The test can be fixed by explicitly waiting for both subscriptions to catch up before dropping either. A patch implementing this fix is attached. Thanks Amit for the offline discussion and sharing your thoughts on the same. [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2025-05-22%2021%3A19%3A22 Regards, Vignesh
From 4f75511e9866ef29f79789e5be0a31f0ee5786e7 Mon Sep 17 00:00:00 2001 From: Vignesh <vignes...@gmail.com> Date: Fri, 23 May 2025 20:40:40 +0530 Subject: [PATCH] Fix random 021_twophase test failure This issue occurred because the test dropped the subscription tap_sub before confirming it had replayed the commit for the previously prepared transaction 'mygid'. The test only waited for the second subscription, tap_sub_copy, to catch up. As a result, tap_sub had not yet applied the commit, leaving the prepared transaction uncommitted on the subscriber. --- src/test/subscription/t/021_twophase.pl | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/test/subscription/t/021_twophase.pl b/src/test/subscription/t/021_twophase.pl index 61c427aed21..b8e4242d1f1 100644 --- a/src/test/subscription/t/021_twophase.pl +++ b/src/test/subscription/t/021_twophase.pl @@ -373,7 +373,14 @@ $result = $node_publisher->safe_psql('postgres', "SELECT count(*) FROM tab_copy;"); is($result, qq(6), 'publisher inserted data'); +# Wait for both subscribers to catchup $node_publisher->wait_for_catchup($appname_copy); +$node_publisher->wait_for_catchup($appname); + +# Make sure there are no prepared transactions on the subscriber +$result = $node_subscriber->safe_psql('postgres', + "SELECT count(*) FROM pg_prepared_xacts;"); +is($result, qq(0), 'should be no prepared transactions on subscriber'); $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_copy;"); -- 2.43.0
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index cc03f0706e9..981a8e45003 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -38,6 +38,7 @@ #include "replication/message.h" #include "replication/reorderbuffer.h" #include "replication/snapbuild.h" +#include "replication/slot.h" #include "storage/standbydefs.h" /* individual record(group)'s handlers */ @@ -682,6 +683,10 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, RepOriginId origin_id = XLogRecGetOrigin(buf->record); int i; + if (parsed->twophase_xid && strcmp(parsed->twophase_gid, "mygid") == 0 && + strcmp(NameStr(MyReplicationSlot->data.name), "tap_sub") == 0) + sleep(3); + if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN) { origin_lsn = parsed->origin_lsn;