On Sat, May 3, 2025 at 10:23 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> BF member mule just showed what seems an identical failure [1]:
>
> 2025-05-03 17:46:20.088 CEST [24308:3] LOG:  database system is ready to accept read-only connections
> 2025-05-03 17:46:20.091 CEST [24321:1] LOG:  slot sync worker started
> 2025-05-03 17:46:20.100 CEST [24322:1] LOG:  started streaming WAL from primary at 0/6000000 on timeline 1
> 2025-05-03 17:46:20.117 CEST [24321:2] LOG:  starting logical decoding for slot "lsub1_slot"
> 2025-05-03 17:46:20.117 CEST [24321:3] DETAIL:  Streaming transactions committing after 0/60049C8, reading WAL from 0/5000048.
> 2025-05-03 17:46:20.117 CEST [24321:4] LOG:  logical decoding found consistent point at 0/5000048
> 2025-05-03 17:46:20.117 CEST [24321:5] DETAIL:  There are no running transactions.
> TRAP: failed Assert("slot->data.two_phase_at <= slot->data.confirmed_flush"), File: "slotsync.c", Line: 311, PID: 24321
> postgres: standby1: slotsync worker (ExceptionalCondition+0x59)[0x556c3b8cb3e9]
> postgres: standby1: slotsync worker (+0x4cb100)[0x556c3b706100]
> postgres: standby1: slotsync worker (+0x4cba4c)[0x556c3b706a4c]
> postgres: standby1: slotsync worker (ReplSlotSyncWorkerMain+0x258)[0x556c3b707598]
> postgres: standby1: slotsync worker (postmaster_child_launch+0x102)[0x556c3b6d6962]
> postgres: standby1: slotsync worker (+0x49daea)[0x556c3b6d8aea]
> postgres: standby1: slotsync worker (+0x49fb3d)[0x556c3b6dab3d]
> postgres: standby1: slotsync worker (PostmasterMain+0xd3f)[0x556c3b6dc04f]
> postgres: standby1: slotsync worker (main+0x1ca)[0x556c3b3bee1a]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fe08da4824a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fe08da48305]
> postgres: standby1: slotsync worker (_start+0x21)[0x556c3b3bf421]
> 2025-05-03 17:46:20.337 CEST [24308:4] LOG:  slot sync worker process (PID 24321) was terminated by signal 6: Aborted
>
While I cannot be entirely certain of my analysis, I believe the root
cause might be related to a backward movement of the confirmed_flush
LSN. The following scenario seems possible:

1. The walsender enables two_phase and sets two_phase_at (which should
be the same as confirmed_flush at that point).
2. The slot's confirmed_flush then moves backward for some reason.
3. The slotsync worker retrieves the remote slot information and
enables two_phase for the local slot; since two_phase_at is now ahead
of confirmed_flush, the assertion fails.

If I recall correctly, we previously discussed[1] how a slot's
confirmed_flush LSN could move backward depending on the feedback
messages received from the downstream system. Based on the following
primary server's logs, it appears possible that the subscriber sent
feedback LSNs older than the slot's confirmed_flush, consequently
causing confirmed_flush to move backward. However, this is currently
only a hypothesis, as I haven't yet been able to reproduce and verify
the scenario.

Regards,

[1] https://www.postgresql.org/message-id/85fff40e-148b-4e86-b921-b4b846289132%40vondra.me

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
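P.S. To make the suspected sequence concrete, below is a minimal,
self-contained C sketch (not actual PostgreSQL code; the struct and
field layout are simplified stand-ins, and the LSN values are taken
from the log quoted above). Built with assertions enabled, it aborts
with SIGABRT, mirroring the slotsync.c failure:

/*
 * Simplified stand-in illustrating the three-step scenario: two_phase_at
 * is captured from confirmed_flush, confirmed_flush then moves backward,
 * and the invariant two_phase_at <= confirmed_flush no longer holds.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

typedef struct
{
	XLogRecPtr	confirmed_flush;	/* last flush position confirmed by the downstream */
	XLogRecPtr	two_phase_at;		/* position at which two_phase was enabled */
} SlotData;

int
main(void)
{
	SlotData	slot;

	/* 1. walsender enables two_phase; two_phase_at is set to confirmed_flush */
	slot.confirmed_flush = 0x60049C8;	/* 0/60049C8 in the log above */
	slot.two_phase_at = slot.confirmed_flush;

	/* 2. a stale feedback message moves confirmed_flush backward */
	slot.confirmed_flush = 0x5000048;	/* 0/5000048 in the log above */

	/*
	 * 3. the slotsync worker copies these values for the local slot and
	 * checks the invariant; with the values above this assertion fails,
	 * as in the buildfarm report.
	 */
	assert(slot.two_phase_at <= slot.confirmed_flush);

	printf("invariant held\n");
	return 0;
}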