Hi,

On 2022-02-18 14:42:48 -0800, Andres Freund wrote:
> On 2022-02-17 21:55:21 -0800, Andres Freund wrote:
> > Isn't it pretty bonkers that we allow error processing to get stuck behind
> > network traffic, *before* we have released resources (locks etc)?
>
> This is particularly likely to be a problem for walsenders, because they often
> have a large output buffer filled, because walsender uses
> pq_putmessage_noblock() to send WAL data. Which obviously can be large.
>
> In the stacktrace upthread you can see:
> #3  0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50,
>     ptr=0x7faf4c2f1210, len=21470) at
>     /home/andres/src/postgresql/src/backend/libpq/be-secure.c:29
>
> which certainly is more than in most other cases of error messages being
> sent. And it obviously might not be the first to have gone out.
>
> > I wonder if we should try to send, but do it in a nonblocking way.
>
> I think we should probably do so at least during FATAL error processing. But
> also consider doing so for ERROR, because not releasing resources after
> getting cancelled / terminated is pretty nasty imo.
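To make that last idea concrete, here's a rough sketch (hypothetical, not an
actual patch) of a best-effort flush that FATAL processing could do before we
start releasing locks and other resources. pq_flush_if_writable() is the
existing nonblocking flush in src/backend/libpq/pqcomm.c; the helper name and
its call site are made up for illustration:

/*
 * Hypothetical sketch only: try to push out any pending error message
 * without blocking.  Unlike pq_flush(), pq_flush_if_writable() returns
 * immediately if the socket would block.
 */
static void
flush_error_message_nonblocking(void)
{
	/*
	 * Best effort: if the peer isn't draining its socket, give up on
	 * delivering the message rather than letting error processing get
	 * stuck behind network traffic while still holding resources.
	 */
	(void) pq_flush_if_writable();
}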
Is it possible that what we're seeing is a deadlock, with both walsender and
the pg_basebackup child trying to send data, but neither receiving? That would
require that the basebackup child process somehow didn't exit with its parent,
though, and I don't really see how that could happen.

I'm running out of ideas for how to try to reproduce this. I think we need
additional debug output to get more information out of the buildfarm.

I'm thinking of adding log_min_messages=DEBUG2 to primary3 and passing
--verbose to pg_basebackup in $node_primary3->backup(...). It might also be
worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
ReplicationSlotCleanup() and InvalidateObsoleteReplicationSlots() (a rough
sketch of what I mean is below the sign-off).

Greetings,

Andres Freund
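PS: For concreteness, the kind of DEBUG2 instrumentation I have in mind would
look roughly like this (sketch only: the slot loop variable "s" in
ReplicationSlotCleanup() in src/backend/replication/slot.c is real, but the
messages and exact placement are made up):

/* e.g. at entry to ReplicationSlotShmemExit() */
elog(DEBUG2, "replication slot shmem exit");

/* and e.g. inside the loop over slots in ReplicationSlotCleanup(), so
 * the log shows how far cleanup of temporary slots got before a hang */
elog(DEBUG2, "dropping temporary replication slot \"%s\"",
	 NameStr(s->data.name));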