On Mon, Jun 8, 2026 at 4:31 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi Kuroda-san,
>
> On Mon, Jun 8, 2026 at 12:25 PM Hayato Kuroda (Fujitsu)
> <[email protected]> wrote:
> >
> > Hi Alexander, Bertrand, Xuneng,
> >
> > Thanks for seeing the failure. Our team also recognized but could not find 
> > the reason.
> >
> > > Yeah, it looks like there is a race condition here. I think we should 
> > > check if
> > > the insertion timeline has already been set (like the walsummarizer is 
> > > doing).
> >
> > Sorry for stupid question; I tried to reproduce the failure but could not, 
> > see attached.
> >
> > IIUC, the issue can happen if the walsender must read the WAL record 
> > generated
> > after the promotion but the timeline could not be updated.
>
> I think the race is that the logical walsender is trying to read the
> pre-promotion wal from the slot's restart_lsn.
>
> 1) In 035_standby_logical_decoding.pl, the active slot /
> pg_recvlogical process is started before:
> INSERT rows 1..4   # before promotion
> promote standby
> INSERT rows 5..7   # after promotion
>
> 2) Alexander’s reproducer sleeps immediately after:
> ReplicationSlotAcquire(cmd->slotname, true, true);
> pg_usleep(200000);
>
> So with the delay:
> slot is active but walsender has not yet create decoding context and
> called xlogbeginread();
>
> 3) Then the test continues and promotes the standby while the
> walsender still sleeping, therefore promotion could happen before the
> walsender starts its actual wal read.
>
> 4)  The logical walsender is trying to read the pre-promotion wal from
> the slot's restart_lsn. The failing log implies this:
> Streaming transactions committing after 0/06487A20,
> reading WAL from 0/064879E8.
>
> This line is emitted from:
> errdetail("Streaming transactions committing after %X/%08X, reading
> WAL from %X/%08X.",
>           slot->data.confirmed_flush,
>           slot->data.restart_lsn)
>
> Function StartLogicalReplication() initiates the xlogreader at the restart 
> lsn:
> XLogBeginRead(logical_decoding_ctx->reader,
>               MyReplicationSlot->data.restart_lsn);
>
> Because the slot was created and acquired before promotion, its
> restart_lsn is from the standby-era state which points to wal required
> for decoding the pre-promotion stream. In the test, the first expected
> output is rows 1..4, inserted before promotion.
>
> The failing output was empty:
> got: ''
> expected:
> BEGIN
> rows 1..4
> COMMIT
> BEGIN
> rows 5..7
> COMMIT
>
> Empty output suggests that the walsender died before emitting even the
> first pre-promotion transaction.
>
> 5) The walsender try to read wal but failed
> Failed run:
>  LOG:  !!!WalSndSegmentOpen| nextSegNo: 6,
> path: pg_wal/000000010000000000000006
> ERROR:  requested WAL segment
> 000000010000000000000006 has already been removed
>
> Passed run
> LOG:  !!!WalSndSegmentOpen| nextSegNo: 6,
> path: pg_wal/000000020000000000000006
>
> As the log suggested, the walsender of the successful run readed the
> wal from timeline 2 but failed to do so for timeline1.
> This comment in xlogutils.is almost exactly the issue:
>
> We care about timelines ... reading xlog generated prior to a promotion
> ...
> The server copied the segment to the new timeline ... there's no guarantee
> the old segment will still exist. It may have been deleted or renamed with
> a .partial suffix
>
> So the need for tli1 does not always imply that the wal was generated
> after promotion. It could be bytes from before the switchpoint, but
> they now be available through the new tli segment file.
>
> 6) XLogReadDetermineTimeline() needs the current system-wide TLI.
>
> The current code in does roughly:
>
>          am_cascading_walsender = RecoveryInProgress();
>
>          if (am_cascading_walsender)
>                  GetXLogReplayRecPtr(&currTLI);
>          else
>                  currTLI = GetWALInsertionTimeLine();
>
> During promotion, we set the wal insertion timeline before marking
> recovery as fully done. So there is a window where:
>
> - InsertTimeLineID is already the new timeline, e.g. tli 2.
> - RecoveryInProgress() can still return true in another backend.
> - A logical walsender that started before promotion can wake up and
> choose its read timeline using the old recovery-state test.
>
> In the race, replay tli 1 is chosen even though the promotion has
> created the new tli 2 wal segment copy. Then WalSndSegmentOpen() tries
> to open 000000010000000000000006, which may already be gone or
> renamed, while the correct file is 000000020000000000000006.
>
> 7) Bertrand's proposal to fix this
> Recognizing the intermediate promotion state -- if recovery still
> shows progress but the insertion tli is already advanced, use that
> insertion tli for the timeline to be chosen.
>
> > However, I think logical_read_xlog_page() is called after the new WAL 
> > records
> > are generated, i.e., am_cascading_walsender has already been false at that 
> > time.
> > So not sure where is the race?
>
> This function is not only called after new post-promotion records are
> generated. When START_REPLICATION begins, CreateDecodingContext() /
> XLogBeginRead() starts from the slot's restart_lsn, which can be well
> before promotion.
>

I tweaked the reproducer based on the theory outlined above. The main
changes from the original reproducer are:

1) blocks at logical-walsender-after-slot-acquire in walsender.c,
before the decoding context is created and before the reader starts
from restart_lsn, matching the delay set by Alexander

2) Forces the first read to occur during promotion. It inserts rows
1..4, waits for replay, starts promotion with pg_promote(false), holds
startup at startup-logical-decoding-status-change-end-of-recovery,
then wakes the walsender.

The test failed on HEAD and passed after applying the patch by Bertrand.

-- 
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Attachment: v1-0001-Reproducer-for-logical-decoding-choosing-the-wron.patch
Description: Binary data

Reply via email to