Hello, 

For some unknown reason (probably a very big transaction at the source), we 
experienced a logical decoding breakdown due to a timeout on the subscriber 
side (either wal_receiver_timeout or a connection dropped by network equipment 
due to inactivity). 

The problem is that, after this failure, the wal_receiver process stops. When 
the wal_sender is ready to send some data, it finds the connection broken and 
exits. 
A new wal_sender process is then created and restarts from the beginning 
(restart LSN). This is an endless loop. 

Checking the network connection between wal_sender and wal_receiver, we found 
that no traffic occurs for hours. 

We first increased wal_receiver_timeout to 12h and still got a disconnection 
on the receiver side: 

2024-10-17 16:31:58.645 GMT [1356203:2] user=,db=,app=,client= ERROR: 
terminating logical replication worker due to timeout 
2024-10-17 16:31:58.648 GMT [849296:212] user=,db=,app=,client= LOG: background 
worker "logical replication worker" (PID 1356203) exited with exit code 1 

We then set this parameter to 0, but were then disconnected by the network 
(note the slight difference in the message): 

2024-10-21 11:45:42.867 GMT [1697787:2] user=,db=,app=,client= ERROR: could not 
receive data from WAL stream: could not receive data from server: Connection 
timed out 
2024-10-21 11:45:42.869 GMT [849296:40860] user=,db=,app=,client= LOG: 
background worker "logical replication worker" (PID 1697787) exited with exit 
code 1 

The message is generated in the libpqrcv_receive function 
(replication/libpqwalreceiver/libpqwalreceiver.c), which calls 
pqsecure_raw_read (interfaces/libpq/fe-secure.c). 

The trailing "Connection timed out" is the errno translation from the recv 
system call: 

ETIMEDOUT Connection timed out (POSIX.1-2001) 

When those timeouts occurred, the sender was still busy deleting files from 
data/pg_replslot/bdcpb21_sene, which had accumulated more than 6 million small 
".spill" files. 
It seems this very long pause happens at the cleanup stage, where PG is blindly 
trying to delete those files. 
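
To see which transactions own the spill files, one can count them per xid. The following is only a minimal sketch: the function name count_spill_files is made up for illustration, and the file name pattern is inferred from the strace output in this report, not taken from the documentation.

```python
import os
import re
from collections import Counter

# File name pattern as seen in the strace output of this report:
#   xid-<xid>-lsn-<hi>-<lo>.spill   (hi/lo are upper-case hex)
SPILL_RE = re.compile(r"^xid-(\d+)-lsn-([0-9A-F]+)-([0-9A-F]+)\.spill$")


def count_spill_files(slot_dir):
    """Return a Counter mapping xid -> number of .spill files in slot_dir."""
    counts = Counter()
    # os.scandir streams directory entries, which matters with millions of files.
    with os.scandir(slot_dir) as entries:
        for entry in entries:
            match = SPILL_RE.match(entry.name)
            if match:
                counts[int(match.group(1))] += 1
    return counts
```

For example, count_spill_files("pg_replslot/bdcpb21_sene").most_common(10), run from the data directory, would list the ten xids with the most spill files.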

strace on the wal sender shows tons of calls like: 

unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-0.spill") = -1 
ENOENT (No such file or directory) 
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-1000000.spill") = -1 
ENOENT (No such file or directory) 
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-2000000.spill") = -1 
ENOENT (No such file or directory) 
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-3000000.spill") = -1 
ENOENT (No such file or directory) 
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-4000000.spill") = -1 
ENOENT (No such file or directory) 
unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-5000000.spill") = -1 
ENOENT (No such file or directory) 
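
The step of 1000000 (hex) between consecutive names equals 16 MB, the default WAL segment size. This is consistent with a cleanup that does not list the directory but instead builds one candidate file name per WAL segment between the transaction's first and final LSN and unlinks each one, tolerating ENOENT. A minimal Python sketch of that enumeration, as we understand it from reading reorderbuffer.c (this is not the server code; the function name spill_candidates and the 16 MB default are assumptions for illustration):

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default; 0x1000000 bytes


def spill_candidates(slot, xid, first_lsn, final_lsn, seg_size=WAL_SEGMENT_SIZE):
    """Yield one candidate spill path per WAL segment in [first_lsn, final_lsn]."""
    first_seg = first_lsn // seg_size
    last_seg = final_lsn // seg_size
    for seg in range(first_seg, last_seg + 1):
        start = seg * seg_size          # LSN of the first byte of the segment
        hi = start >> 32                # upper 32 bits of the LSN
        lo = start & 0xFFFFFFFF         # lower 32 bits of the LSN
        yield f"pg_replslot/{slot}/xid-{xid}-lsn-{hi:X}-{lo:X}.spill"
```

Under these assumptions, each transaction costs one unlink attempt per WAL segment it spans, whether or not a spill file exists for that segment, which would match the long runs of ENOENT results above.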

This occurs in ReorderBufferRestoreCleanup 
(backend/replication/logical/reorderbuffer.c). 
The call stack suggests it is reached from DecodeCommit or DecodeAbort 
(backend/replication/logical/decode.c): 

unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-88000000.spill") = -1 
ENOENT (No such file or directory) 
> /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7] 
> /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d) 
> [0x769e3d] 
> /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <=== 
> replication/logical/reorderbuffer.c:1480 (but this static function is only 
> used within this module ...) 
> /usr/pgsql-15/bin/postgres(xact_decode+0x1e7) [0x75f217] <=== 
> replication/logical/decode.c:175 
> /usr/pgsql-15/bin/postgres(LogicalDecodingProcessRecord+0x73) [0x75eee3] <=== 
> replication/logical/decode.c:90, calls rmgr.rm_decode(ctx, &buf), one of the 
> 6 resource manager methods 
> /usr/pgsql-15/bin/postgres(XLogSendLogical+0x4e) [0x78294e] 
> /usr/pgsql-15/bin/postgres(WalSndLoop+0x151) [0x785121] 
> /usr/pgsql-15/bin/postgres(exec_replication_command+0xcba) [0x785f4a] 
> /usr/pgsql-15/bin/postgres(PostgresMain+0xfa8) [0x7d0588] 
> /usr/pgsql-15/bin/postgres(ServerLoop+0xa8a) [0x493b97] 
> /usr/pgsql-15/bin/postgres(PostmasterMain+0xe6c) [0x74d66c] 
> /usr/pgsql-15/bin/postgres(main+0x1c5) [0x494a05] 
> /usr/lib64/libc-2.17.so(__libc_start_main+0xf4) [0x22554] 
> /usr/pgsql-15/bin/postgres(_start+0x28) [0x494fb8] 

We did not find any option other than dropping the subscription and creating a 
new one to stop that loop (thus losing transactions). 

The publisher is PostgreSQL 15.6 
The subscriber is PostgreSQL 14.5 

Thanks 
