On Fri, 2 May 2025 at 06:30, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> vignesh C <vignes...@gmail.com> writes:
> > I agree with your analysis. I was able to reproduce the issue by
> > delaying the invalidation of the subscription until the walsender
> > finished decoding the INSERT operation following the ALTER
> > SUBSCRIPTION through a debugger and using the lsn from the pg_waldump
> > of the INSERT after the ALTER SUBSCRIPTION.
>
> Can you be a little more specific about how you reproduced this?
> I tried inserting sleep() calls in various likely-looking spots
> and could not get a failure that way.

Test Steps:
1) Set up logical replication:
Create a publication on the publisher
Create a subscription on the subscriber
2) Create the following table on the publisher:
CREATE TABLE tab_3 (a int);
3) Create the same table on the subscriber:
CREATE TABLE tab_3 (a int);
4) On the subscriber, alter the subscription to refer to a
non-existent publication:
ALTER SUBSCRIPTION sub1 SET PUBLICATION tap_pub_3;
5) Insert data on the publisher:
INSERT INTO tab_3 VALUES (1);

As expected, the publisher logs the following warning in normal case:
2025-05-02 08:56:45.350 IST [516197] WARNING:  skipped loading
publication: tap_pub_3
2025-05-02 08:56:45.350 IST [516197] DETAIL:   The publication does
not exist at this point in the WAL.
2025-05-02 08:56:45.350 IST [516197] HINT:     Create the publication
if it does not exist.

To simulate a delay in subscription invalidation, I modified the
maybe_reread_subscription() function as follows:
diff --git a/src/backend/replication/logical/worker.c
b/src/backend/replication/logical/worker.c
index 4151a4b2a96..0831784aca3 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3970,6 +3970,10 @@ maybe_reread_subscription(void)
     MemoryContext oldctx;
     Subscription *newsub;
     bool            started_tx = false;
+    bool            test = true;
+
+    if (test)
+        return;

This change delays the subscription invalidation logic, preventing the
apply worker from detecting the subscription change immediately.

With the patch applied, repeat steps 1–5.
Using pg_waldump, identify the LSN of the insert:
rmgr: Heap        len (rec/tot):     59/    59, tx: 756, lsn:
0/01711848, prev 0/01711810, desc: INSERT+INIT off: 1
rmgr: Transaction len (rec/tot):     46/    46, tx: 756, lsn:
0/01711888, prev 0/01711848, desc: COMMIT 2025-05-02 09:06:09.400926
IST

Check the confirmed flush LSN from the walsender via gdb by attaching
it to the walsender process
(gdb) p *MyReplicationSlot
...
confirmed_flush = 24241928
(gdb) p /x 24241928
$4 = 0x171e708

Now attach to the apply worker, set a breakpoint at
maybe_reread_subscription, and continue execution. Once control
reaches the function, set test = false. Now it will identify that
subscription is invalidated and restart the apply worker.

As the walsender has already confirmed_flush position after the
insert, causing the newly started apply worker to miss the inserted
row entirely. This leads to the CI failure. This issue can arise when
the walsender advances more quickly than the apply worker is able to
detect and react to the subscription change.

I could not find a simpler way to reproduce this.

Regards,
Vignesh
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4151a4b2a96..0831784aca3 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3970,6 +3970,10 @@ maybe_reread_subscription(void)
 	MemoryContext oldctx;
 	Subscription *newsub;
 	bool		started_tx = false;
+	bool		test = true;
+
+	if (test)
+		return;
 
 	/* When cache state is valid there is nothing to do here. */
 	if (MySubscriptionValid)

Reply via email to