On Sat, Apr 27, 2019 at 5:57 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> I have spent a fair amount of time trying to replicate these failures
> locally, with little success.  I now think that the most promising theory
> is Munro's idea in [1] that the walreceiver is hanging up during its
> unsafe attempt to do ereport(FATAL) from inside a signal handler.  It's
> extremely plausible that that could result in a deadlock inside libc's
> malloc/free, or some similar place.  Moreover, if that's what's causing
> it, then the windows for trouble are fixed by the length of time that
> malloc might hold internal locks, which fits with the results I've gotten
> that inserting delays in various promising-looking places doesn't do a
> thing towards making this reproducible.

For Greenplum (based on 9.4, but the current master code looks the same),
we have recently hit deadlocks in the walreceiver many times in CI, which
I believe confirms the above finding.  The signal arrives while the
process is inside malloc(), and the handler's exit path then calls free()
and blocks on the same allocator lock:

#0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f0637ee72bd in _int_free (av=0x7f063822bb20 <main_arena>, p=0x26bb3b0, have_lock=0) at malloc.c:3962
#2  0x00007f0637eeb53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3  0x00007f0636629464 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#4  0x00007f0636630720 in ?? () from /usr/lib/x86_64-linux-gnu/libgnutls.so.30
#5  0x00007f063b5cede7 in _dl_fini () at dl-fini.c:235
#6  0x00007f0637ea0ff8 in __run_exit_handlers (status=1, listp=0x7f063822b5f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#7  0x00007f0637ea1045 in __GI_exit (status=<optimized out>) at exit.c:104
#8  0x00000000008c72c7 in proc_exit ()
#9  0x0000000000a75867 in errfinish ()
#10 0x000000000089ea53 in ProcessWalRcvInterrupts ()
#11 0x000000000089eac5 in WalRcvShutdownHandler ()
#12 <signal handler called>
#13 _int_malloc (av=av@entry=0x7f063822bb20 <main_arena>, bytes=bytes@entry=16384) at malloc.c:3802
#14 0x00007f0637eeb184 in __GI___libc_malloc (bytes=16384) at malloc.c:2913
#15 0x00000000007754c3 in makeEmptyPGconn ()
#16 0x0000000000779686 in PQconnectStart ()
#17 0x0000000000779b8b in PQconnectdb ()
#18 0x00000000008aae52 in libpqrcv_connect ()
#19 0x000000000089f735 in WalReceiverMain ()
#20 0x00000000005c5eab in AuxiliaryProcessMain ()
#21 0x00000000004cd5f1 in ServerLoop ()
#22 0x000000000086fb18 in PostmasterMain ()
#23 0x00000000004d2e28 in main ()

ImmediateInterruptOK was removed for regular backends but not for the
walreceiver, and having the walreceiver perform elog(FATAL) inside a
signal handler is dangerous.
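
To illustrate the safer shape, here is a minimal standalone sketch of the usual alternative: the handler only sets a flag, and ordinary code checks that flag at points where no allocator lock can be held. This is not the PostgreSQL or Greenplum code. All names in it (shutdown_requested, shutdown_handler, install_shutdown_handler, shutdown_pending, run_until_shutdown) are hypothetical, and the malloc/free loop merely stands in for work such as the PQconnectdb() call in the trace above.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag; sig_atomic_t is the type C guarantees can be
 * written safely from a signal handler. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler does nothing except record the request.  It does not
 * call exit(), malloc(), free(), or any error-reporting routine. */
static void
shutdown_handler(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Install the handler for SIGTERM.  Returns 0 on success, -1 on error. */
int
install_shutdown_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = shutdown_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Checked from normal (non-handler) code only. */
int
shutdown_pending(void)
{
    return shutdown_requested != 0;
}

/* Stand-in for the worker loop (assumption: each step allocates, as
 * the connection setup does in the backtrace).  The flag is tested
 * between steps, so acting on it can never interrupt malloc()/free().
 * Returns the number of steps completed before shutdown was seen. */
int
run_until_shutdown(int max_steps)
{
    int steps = 0;

    assert(max_steps >= 0);
    while (steps < max_steps)
    {
        void *buf;

        if (shutdown_pending())
            break;              /* safe point: no libc lock is held */

        buf = malloc(16384);
        if (buf == NULL)
            break;
        free(buf);
        steps++;
    }
    return steps;
}
```

A caller would run install_shutdown_handler() once at startup and, when run_until_shutdown() returns early, do its error reporting and exit from that normal context rather than from the handler. The cost is that shutdown is only noticed at the next check, so long blocking calls need their own way to be interrupted.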