Hi, On 2022-02-18 18:15:21 -0500, Tom Lane wrote: > Andres Freund <and...@anarazel.de> writes: > > On 2022-02-17 21:55:21 -0800, Andres Freund wrote: > >> Isn't it pretty bonkers that we allow error processing to get stuck behind > >> network traffic, *before* we have have released resources (locks etc)? > > It's more or less intentional, per elog.c: > > /* > * This flush is normally not necessary, since postgres.c will flush out > * waiting data when control returns to the main loop. But it seems best > * to leave it here, so that the client has some clue what happened if the > * backend dies before getting back to the main loop ... error/notice > * messages should not be a performance-critical path anyway, so an extra > * flush won't hurt much ... > */ > pq_flush(); > > Perhaps it'd be sensible to do this only in debugging (ie Assert) > builds?
That seems not great, because it pretty clearly can lead to hangs, which is problematic in tests too. What about using pq_flush_if_writable()? In nearly all situations that'd still push the failure to the client. We'd also need to add pq_endmessage_noblock(), because the pq_endmessage() obviously tries to send (as in the backtrace upthread) if the output buffer is large enough, which it often will be in walsender. I guess we could try to flush in a blocking manner sometime later in the shutdown sequence, after we've released resources? But I'm doubtful it's a good idea, we don't really want to block waiting to exit when e.g. the network connection is dead without the TCP stack knowing. Hm. There already is code trying to short-circuit sending errors to the client if a backend gets terminated. Introduced in commit 2ddb9149d14de9a2e7ac9ec6accf3ad442702b24 Author: Tom Lane <t...@sss.pgh.pa.us> Date: 2018-10-19 21:39:21 -0400 Server-side fix for delayed NOTIFY and SIGTERM processing. and predecessors. If ProcessClientWriteInterrupt() sees ProcDiePending, we'll stop trying to send stuff to the client if writes block. However, this doesn't work here, because we've already unset ProcDiePending: #7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301 #8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253 #9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541 #10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts") at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592 #11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192 #12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404 #13 0x00007faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at /home/andres/src/postgresql/src/backend/replication/walsender.c:834 Before ProcessInterrupts() FATALs due to a SIGTERM, it resets ProcDiePending. This seems not optimal. We can't just leave ProcDiePending set, otherwise we'll probably end up throwing more errors during the shutdown sequence. But it seems we need something similar to proc_exit_inprogress, except set earlier? And then take that into account in ProcessClientWriteInterrupt()? Greetings, Andres Freund