Hi,
Is this an expected behavior?

A race between subscriber LSN feedback and publisher subscription change
processing allows the walsender to restart decoding past relevant WAL
records, bypassing the updated subscription rules for those records.


On Wed, 30 Apr 2025 at 17:41, Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > On Wed, Apr 30, 2025 at 11:22 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> > >
> > > Xuneng Zhou pointed out on Discord that the test case added by
> > > 7c99dc587 has caused repeated failures in CI --- though oddly,
> > > it's not failed in the buildfarm so far as I can find.  The
> > > failures look like
> > >
> > > timed out waiting for match: (?^:WARNING: ( [A-Z0-9]+:)? skipped
> loading publication: tap_pub_3) at
> /tmp/cirrus-ci-build/src/test/subscription/t/024_add_drop_pub.pl line 103.
> > >
> >
> > I analyzed the relevant publisher-side CI Logs [1]:
> > ...
> > 2025-04-19 08:24:14.096 UTC [21961][client backend]
> > [024_add_drop_pub.pl][7/4:0] LOG:  statement: INSERT INTO tab_3
> > values(1)
> > 2025-04-19 08:24:14.098 UTC [21961][client backend]
> > [024_add_drop_pub.pl][:0] LOG:  disconnection: session time:
> > 0:00:00.003 user=postgres database=postgres host=[local]
> > 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][30/0:0] LOG:
> > released logical replication slot "tap_sub"
> > 2025-04-19 08:24:14.108 UTC [21797][walsender] [tap_sub][:0] LOG:
> > disconnection: session time: 0:00:00.329 user=postgres
> > database=postgres host=[local]
> > 2025-04-19 08:24:14.127 UTC [21979][not initialized] [[unknown]][:0]
> > LOG:  connection received: host=[local]
> > 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> > LOG:  connection authenticated: user="postgres" method=trust
> >
> (/tmp/cirrus-ci-build/build/testrun/subscription/024_add_drop_pub/data/t_024_add_drop_pub_publisher_data/pgdata/pg_hba.conf:117)
> > 2025-04-19 08:24:14.128 UTC [21979][walsender] [[unknown]][23/5:0]
> > LOG:  replication connection authorized: user=postgres
> > application_name=tap_sub
> > 2025-04-19 08:24:14.129 UTC [21979][walsender] [tap_sub][23/6:0] LOG:
> > statement: SELECT pg_catalog.set_config('search_path', '', false);
> > 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> > received replication command: IDENTIFY_SYSTEM
> > 2025-04-19 08:24:14.130 UTC [21979][walsender] [tap_sub][23/0:0]
> > STATEMENT:  IDENTIFY_SYSTEM
> > 2025-04-19 08:24:14.131 UTC [21979][walsender] [tap_sub][23/0:0] LOG:
> > received replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> > 0/0 (proto_version '4', streaming 'parallel', origin 'any',
> > publication_names '"tap_pub_
> > ...
> >
> > This shows that walsender restarts after the "INSERT INTO tab_3
> > values(1)" is processed by the previous walsender ("released logical
> > replication slot "tap_sub"" is after "INSERT INTO tab_3 values(1)").
> > So, it is possible that the old apply worker has sent the confirmation
> > of WAL received location after the Insert (due to keep_alive message
> > handling). So, after the restart, the new walsender will start
> > processing WAL after the INSERT and wait for the skipped message LOG
> > timed out.
> >
> > Considering the above theory is correct, after "ALTER SUBSCRIPTION
> > tap_sub SET PUBLICATION tap_pub_3", we should wait for the new
> > walsender to restart. We are already doing the same for a similar case
> > in 001_rep_changes.pl (See "# check that change of connection string
> > and/or publication list causes restart of subscription workers. We
> > check the state along with application_name to ensure that the
> > walsender is (re)started.).
> >
> > Unfortunately, I will be away for the rest of the week. In the
> > meantime, if you or someone else is able to reproduce and fix it, then
> > good; otherwise, I'll take care of it after I return.
>
> I agree with your analysis. I was able to reproduce the issue by
> delaying the invalidation of the subscription until the walsender
> finished decoding the INSERT operation following the ALTER
> SUBSCRIPTION through a debugger and using the lsn from the pg_waldump
> of the INSERT after the ALTER SUBSCRIPTION.  In this scenario, the
> confirmed_flush_lsn ends up pointing to a location after the INSERT.
> When the invalidation is eventually received and the apply
> worker/walsender is restarted, the restarted walsender begins decoding
> from that LSN—after the INSERT—which means the "skipped loading
> publication" warning is never triggered, causing the test to fail.
>
> Attached is a patch that ensures the walsender process is properly
> restarted after ALTER SUBSCRIPTION, preventing this race condition.
>
> Regards,
> Vignesh
>

Reply via email to