Hi, On 2022-02-17 16:34:34 -0800, Andres Freund wrote: > I've done this for a couple hundred iterations, under load, subsequently with > additional assertions, without being able to reproduce.
Playing around with this further I did get into one interesting state: I started psql with replication=1, created a slot SELECT pg_drop_replication_slot('t');SELECT pg_create_physical_replication_slot('t', true); which was at 0/200002C8 at this point START_REPLICATION SLOT "t" 0/200002C8 TIMELINE 1 then did SELECT pg_switch_wal();checkpoint; in another session, which tried to invalidate that slot because of max_slot_wal_keep_size=1MB At this point the psql "walsender" is *not* terminated, because the fatal message is waiting behind all the WAL data sent. Which also means that the slot isn't yet released. At this point checkpointer is stuck, because it is waiting for the connection to end. Quitting the "psql walsender" or terminating it again resolves the issue. Not sure how this could be related, but it seems interesting. "psql walsender": (gdb) bt #0 0x00007faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x00007faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1) at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465 #2 0x00007faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297) at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411 #3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:298 #4 0x00007faf4b71aecb in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380 #5 0x00007faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326 #6 0x00007faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 "SFATAL", len=112) at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507 #7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301 #8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253 #9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541 #10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts") at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592 #11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192 #12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404 #13 0x00007faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at /home/andres/src/postgresql/src/backend/replication/walsender.c:834 #14 0x00007faf4b89136f in exec_replication_command (cmd_string=0x7faf4c2073c0 "START_REPLICATION SLOT \"t\" 0/1B000000 TIMELINE 1\n;") at /home/andres/src/postgresql/src/backend/replication/walsender.c:1771 #15 0x00007faf4b909842 in PostgresMain (dbname=0x7faf4c22fce8 "", username=0x7faf4c22fcc8 "andres") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4494 #16 0x00007faf4b830e7e in BackendRun (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4593 #17 0x00007faf4b83075f in BackendStartup (port=0x7faf4c22da50) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4321 #18 0x00007faf4b82c55b in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1801 #19 0x00007faf4b82bd71 in PostmasterMain (argc=49, argv=0x7faf4c1ff1e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473 checkpointer: (gdb) bt #0 0x00007f0d4ee1fec6 in epoll_wait (epfd=10, events=0x7f0d51489b78, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x00007f0d5017dd5c in WaitEventSetWaitBlock (set=0x7f0d51489b18, cur_timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1) at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465 #2 0x00007f0d5017dbe3 in WaitEventSetWait (set=0x7f0d51489b18, timeout=-1, occurred_events=0x7fffe7c5f410, nevents=1, wait_event_info=134217772) at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411 #3 0x00007f0d5017cec4 in WaitLatch (latch=0x7f0d4c27db04, wakeEvents=33, timeout=-1, wait_event_info=134217772) at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:472 #4 0x00007f0d5018fe34 in ConditionVariableTimedSleep (cv=0x7f0d4c342268, timeout=-1, wait_event_info=134217772) at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:163 #5 0x00007f0d5018fd6c in ConditionVariableSleep (cv=0x7f0d4c342268, wait_event_info=134217772) at /home/andres/src/postgresql/src/backend/storage/lmgr/condition_variable.c:100 #6 0x00007f0d50132525 in InvalidatePossiblyObsoleteSlot (s=0x7f0d4c342188, oldestLSN=553648128, invalidated=0x7fffe7c5f577) at /home/andres/src/postgresql/src/backend/replication/slot.c:1290 #7 0x00007f0d501326af in InvalidateObsoleteReplicationSlots (oldestSegno=33) at /home/andres/src/postgresql/src/backend/replication/slot.c:1359 #8 0x00007f0d4fd84072 in CreateCheckPoint (flags=108) at /home/andres/src/postgresql/src/backend/access/transam/xlog.c:6620 #9 0x00007f0d500cd278 in CheckpointerMain () at /home/andres/src/postgresql/src/backend/postmaster/checkpointer.c:445 #10 0x00007f0d500cad7a in AuxiliaryProcessMain (auxtype=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/auxprocess.c:153 #11 0x00007f0d500e034c in StartChildProcess (type=CheckpointerProcess) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5531 #12 0x00007f0d500dacf1 in PostmasterMain (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1458 #13 0x00007f0d4ffcfc19 in main (argc=49, argv=0x7f0d514891e0) at /home/andres/src/postgresql/src/backend/main/main.c:202 Greetings, Andres Freund