Hi, On 2022-02-18 18:49:14 -0500, Tom Lane wrote: > Andres Freund <and...@anarazel.de> writes: > > We'd also need to add pq_endmessage_noblock(), because the pq_endmessage() > > obviously tries to send (as in the backtrace upthread) if the output buffer > > is > > large enough, which it often will be in walsender. > > I don't see that as "obvious". If we're there, we do not have an > error situation.
The problem is that due walsender using pq_putmessage_noblock(), the output buffer is often going to be too full for a plain pq_endmessage() to not send out data. Because walsender will have an output buffer > PQ_SEND_BUFFER_SIZE a lot of the time, errors will commonly be thrown with the output buffer full. That leads the pq_endmessage() in send_message_to_frontend() to block sending the error message and the preceding WAL data. Leading to backtraces like: > #0 0x00007faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, > maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 > #1 0x00007faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, > cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1) > at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465 > #2 0x00007faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, > occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297) > at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411 > #3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, > ptr=0x7faf4c2f1210, len=21470) at > /home/andres/src/postgresql/src/backend/libpq/be-secure.c:298 > #4 0x00007faf4b71aecb in internal_flush () at > /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380 > #5 0x00007faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) > at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326 > #6 0x00007faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 > "SFATAL", len=112) > at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507 > #7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at > /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301 > #8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 > <errordata>) at > /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253 > #9 0x00007faf4bab8aa0 in EmitErrorReport () at > /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541 > #10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", > lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts") > at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592 > #11 0x00007faf4b907e73 in ProcessInterrupts () at > /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192 > #12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 > <XLogSendPhysical>) at > /home/andres/src/postgresql/src/backend/replication/walsender.c:2404 > > I guess we could try to flush in a blocking manner sometime later in the > > shutdown sequence, after we've released resources? But I'm doubtful it's a > > good idea, we don't really want to block waiting to exit when e.g. the > > network > > connection is dead without the TCP stack knowing. > > I think you are trying to move in the direction of possibly exiting > without ever sending at all, which does NOT seem like an improvement. I was just talking about doing another blocking pq_flush(), in addition to the pq_flush_if_writable() earlier. That'd be trying harder to send out data, not less hard... Greetings, Andres Freund