On Sat, May 3, 2025 at 10:23 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> BF member mule just showed what seems an identical failure [1]:
>
> 2025-05-03 17:46:20.088 CEST [24308:3] LOG:  database system is ready to accept read-only connections
> 2025-05-03 17:46:20.091 CEST [24321:1] LOG:  slot sync worker started
> 2025-05-03 17:46:20.100 CEST [24322:1] LOG:  started streaming WAL from primary at 0/6000000 on timeline 1
> 2025-05-03 17:46:20.117 CEST [24321:2] LOG:  starting logical decoding for slot "lsub1_slot"
> 2025-05-03 17:46:20.117 CEST [24321:3] DETAIL:  Streaming transactions committing after 0/60049C8, reading WAL from 0/5000048.
> 2025-05-03 17:46:20.117 CEST [24321:4] LOG:  logical decoding found consistent point at 0/5000048
> 2025-05-03 17:46:20.117 CEST [24321:5] DETAIL:  There are no running transactions.
> TRAP: failed Assert("slot->data.two_phase_at <= slot->data.confirmed_flush"), File: "slotsync.c", Line: 311, PID: 24321
> postgres: standby1: slotsync worker (ExceptionalCondition+0x59)[0x556c3b8cb3e9]
> postgres: standby1: slotsync worker (+0x4cb100)[0x556c3b706100]
> postgres: standby1: slotsync worker (+0x4cba4c)[0x556c3b706a4c]
> postgres: standby1: slotsync worker (ReplSlotSyncWorkerMain+0x258)[0x556c3b707598]
> postgres: standby1: slotsync worker (postmaster_child_launch+0x102)[0x556c3b6d6962]
> postgres: standby1: slotsync worker (+0x49daea)[0x556c3b6d8aea]
> postgres: standby1: slotsync worker (+0x49fb3d)[0x556c3b6dab3d]
> postgres: standby1: slotsync worker (PostmasterMain+0xd3f)[0x556c3b6dc04f]
> postgres: standby1: slotsync worker (main+0x1ca)[0x556c3b3bee1a]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fe08da4824a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fe08da48305]
> postgres: standby1: slotsync worker (_start+0x21)[0x556c3b3bf421]
> 2025-05-03 17:46:20.337 CEST [24308:4] LOG:  slot sync worker process (PID 24321) was terminated by signal 6: Aborted
>
While I cannot be entirely certain of my analysis, I believe the root
cause might be related to a backward movement of the confirmed_flush
LSN. The following scenario seems possible:

1. The walsender enables two_phase and sets two_phase_at (which should
be the same as confirmed_flush at that point).
2. The slot's confirmed_flush then moves backward for some reason.
3. The slotsync worker retrieves the remote slot information and
enables two_phase for the local slot; since two_phase_at is now ahead
of confirmed_flush, the assertion fails.

If I recall correctly, we previously discussed[1] how a slot's
confirmed_flush LSN could move backward depending on the feedback
messages received from the downstream system. Based on the following
primary server's logs, it appears possible that the subscriber sent
feedback LSNs older than the slot's confirmed_flush, consequently
causing confirmed_flush to move backward. However, this is currently
only a hypothesis, as I haven't yet been able to reproduce and verify
the scenario.

Regards,

[1] https://www.postgresql.org/message-id/85fff40e-148b-4e86-b921-b4b846289132%40vondra.me

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
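P.S. To make the suspected sequence concrete, below is a minimal,
self-contained C sketch (not actual PostgreSQL code; the struct and
field layout are simplified stand-ins, and the LSN values are taken
from the log quoted above). Built with assertions enabled, it aborts
with SIGABRT, mirroring the slotsync.c failure:

/*
 * Simplified stand-in illustrating the three-step scenario: two_phase_at
 * is captured from confirmed_flush, confirmed_flush then moves backward,
 * and the invariant two_phase_at <= confirmed_flush no longer holds.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

typedef struct
{
	XLogRecPtr	confirmed_flush;	/* last flush position confirmed by the downstream */
	XLogRecPtr	two_phase_at;		/* position at which two_phase was enabled */
} SlotData;

int
main(void)
{
	SlotData	slot;

	/* 1. walsender enables two_phase; two_phase_at is set to confirmed_flush */
	slot.confirmed_flush = 0x60049C8;	/* 0/60049C8 in the log above */
	slot.two_phase_at = slot.confirmed_flush;

	/* 2. a stale feedback message moves confirmed_flush backward */
	slot.confirmed_flush = 0x5000048;	/* 0/5000048 in the log above */

	/*
	 * 3. the slotsync worker copies these values for the local slot and
	 * checks the invariant; with the values above this assertion fails,
	 * as in the buildfarm report.
	 */
	assert(slot.two_phase_at <= slot.confirmed_flush);

	printf("invariant held\n");
	return 0;
}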