On 2021-Sep-27, Michael Paquier wrote: > I got again a failure today, so I have used this occasion to check that > when the checkpoint gets stuck the WAL sender process getting SIGCONT > is still around, waiting for a write to happen: > * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP > frame #0: 0x00007fff20320c4a libsystem_kernel.dylib`kevent + 10 > frame #1: 0x000000010fe50a43 > postgres`WaitEventSetWaitBlock(set=0x00007f884d80a690, cur_timeout=-1, > occurred_events=0x00007ffee0395fd0, nevents=1) at latch.c:1601:7 > frame #2: 0x000000010fe4ffd0 > postgres`WaitEventSetWait(set=0x00007f884d80a690, timeout=-1, > occurred_events=0x00007ffee0395fd0, nevents=1, wait_event_info=100663297) at > latch.c:1396:8 > frame #3: 0x000000010fc586c4 > postgres`secure_write(port=0x00007f883eb04080, ptr=0x00007f885006a040, > len=122694) at be-secure.c:298:3 > frame #4: 0x000000010fc66d81 postgres`internal_flush at pqcomm.c:1352:7 > frame #5: 0x000000010fc665b9 postgres`internal_putbytes(s="E, len=1) at > pqcomm.c:1298:8 > frame #6: 0x000000010fc66be3 postgres`socket_putmessage(msgtype='E', > s="SFATAL", len=112) at pqcomm.c:1479:6 > frame #7: 0x000000010fc67318 > postgres`pq_endmessage(buf=0x00007ffee0396118) at pqformat.c:301:9 > frame #8: 0x00000001100a469f > postgres`send_message_to_frontend(edata=0x000000011030d640) at elog.c:3431:3 > frame #9: 0x00000001100a066d postgres`EmitErrorReport at elog.c:1546:3 > frame #10: 0x000000011009ff99 postgres`errfinish(filename="postgres.c", > lineno=3193, funcname="ProcessInterrupts") at elog.c:597:2 > * frame #11: 0x000000010fe8e2f5 postgres`ProcessInterrupts at > postgres.c:3191:4 > frame #12: 0x000000010fe0bbe5 > postgres`WalSndLoop(send_data=(postgres`XLogSendPhysical at > walsender.c:2550)) at walsender.c:2285:3 > frame #13: 0x000000010fe0754f > postgres`StartReplication(cmd=0x00007f881d808790) at walsender.c:738:3 > frame #14: 0x000000010fe06149 > postgres`exec_replication_command(cmd_string="START_REPLICATION SLOT \"rep3\" > 0/700000 TIMELINE 1") at walsender.c:1652:6 > frame #15: 0x000000010fe91eb8 postgres`PostgresMain(dbname="", > username="mpaquier") at postgres.c:4493:12
Ah, so the problem here is that the walsender is not exiting. That also causes the checkpointer to hang waiting for it. I wonder if this is related to the problem reported in https://www.postgresql.org/message-id/adce2c09-3bfc-4666-997a-c21991cb1eb1.mengjuan.cmj%40alibaba-inc.com A patch was proposed on that thread on September 22nd, can to try with that and see if this problem still reproduces? -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).