Hi Tomas,

On Wed, 1 May 2019 at 02:28, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> I see there's an ongoing discussion about race conditions in walreceiver
> blocking shutdown, so let me start a new thread about a race condition in
> walsender during shutdown.
>
> The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
> waiting for walsender processes to catch-up and terminate indefinitely.

I can confirm this; during the past couple of years we have observed such a
problem a few times, and it is really annoying.

> The reason for that is pretty simple - the walsenders are doing logical
> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
> But per XLogSendLogical() this only happens if
>
>     if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>     {
>         WalSndCaughtUp = true;
>         ...
>     }

After a couple of days of investigating and debugging I came to a slightly
different conclusion: WalSndCaughtUp is set to true in my case.

Since I am mostly a Python person, I decided to use psycopg2 for my
investigation. I took the example from
http://initd.org/psycopg/docs/advanced.html#logical-replication-quick-start
as a starting point, created a slot and started the script. I wasn't running
any transactions, so DemoConsumer.__call__ was never executed and
cursor.send_feedback(flush_lsn=msg.data_start) was never called either.
Basically, the only thing the psycopg2 internals were doing was periodically
sending keepalive messages or replying to keepalives sent by postgres. In the
postgres debug log they are visible as:

2019-05-01 12:58:32.785 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0

If you try to do a fast shutdown of postgres while the script is running, it
will never finish, and in the postgres log you will get an indefinite stream
of messages:

2019-05-01 13:00:02.880 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.880 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0

Actually, the same problem happens even when the consumer script does receive
some messages, just not very frequently; it is only a bit harder to reproduce.
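For completeness, here is a minimal sketch of the script I was running. It is
essentially the quick-start example linked above; the slot name "test" and
dbname "postgres" come from my test setup, and test_decoding as the output
plugin is an assumption on my side:

import psycopg2
import psycopg2.extras

conn = psycopg2.connect('dbname=postgres',
                        connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()
try:
    cur.start_replication(slot_name='test', decode=True)
except psycopg2.ProgrammingError:
    cur.create_replication_slot('test', output_plugin='test_decoding')
    cur.start_replication(slot_name='test', decode=True)


class DemoConsumer(object):
    def __call__(self, msg):
        print(msg.payload)
        # Only reached when a decoded message arrives; with no transactions
        # running this never executes, so no flush position is ever reported.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)


# consume_stream() handles the periodic keepalive traffic internally (in the
# C part of psycopg2), replying with write/flush/apply = 0/0.
cur.consume_stream(DemoConsumer())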
If you attach to the walsender with gdb, you'll see the following picture:

(gdb) bt
#0  0x00007fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08, len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
#1  0x000055cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:318
#2  0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:265
#3  0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
#4  0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
#5  0x000055cb93dac30b in send_message_to_frontend (edata=0x55cb942b4380 <errordata>) at elog.c:3317
#6  0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
#7  0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
#8  0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3 "sending replication keepalive") at elog.c:1376
#9  0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at walsender.c:3358
#10 0x000055cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2872
#11 0x000055cb93bc9155 in WalSndLoop (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2194
#12 0x000055cb93bc7b11 in StartLogicalReplication (cmd=0x55cb95931cc0) at walsender.c:1109
#13 0x000055cb93bc83d6 in exec_replication_command (cmd_string=0x55cb958b2360 "START_REPLICATION SLOT \"test\" LOGICAL 0/00000000") at walsender.c:1541
#14 0x000055cb93c31653 in PostgresMain (argc=1, argv=0x55cb958deb68, dbname=0x55cb958deb48 "postgres", username=0x55cb958deb28 "akukushkin") at postgres.c:4178
#15 0x000055cb93b95185 in BackendRun (port=0x55cb958d71e0) at postmaster.c:4361
#16 0x000055cb93b94824 in BackendStartup (port=0x55cb958d71e0) at postmaster.c:4033
#17 0x000055cb93b90ccd in ServerLoop () at postmaster.c:1706
#18 0x000055cb93b90463 in PostmasterMain (argc=3, argv=0x55cb958ac710) at postmaster.c:1379
#19 0x000055cb93abb08e in main (argc=3, argv=0x55cb958ac710) at main.c:228

(gdb) f 10
#10 0x000055cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2872
2872            WalSndKeepalive(true);
(gdb) p WalSndCaughtUp
$1 = true
(gdb) p *MyWalSnd
$2 = {pid = 21845, state = WALSNDSTATE_STREAMING, sentPtr = 23586168, needreload = false,
  write = 0, flush = 23586112, apply = 0, writeLag = -1, flushLag = -1, applyLag = -1,
  mutex = 0 '\000', latch = 0x7fd66096b594, sync_standby_priority = 0}

As you can see, the value of WalSndCaughtUp is indeed set to true! The
shutdown never finishes because the value of sentPtr is higher than the
values of MyWalSnd->flush and MyWalSnd->write:

(gdb) l 2858
2853        /*
2854         * To figure out whether all WAL has successfully been replicated, check
2855         * flush location if valid, write otherwise. Tools like pg_receivewal will
2856         * usually (unless in synchronous mode) return an invalid flush location.
2857         */
2858        replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
2859            MyWalSnd->write : MyWalSnd->flush;
2860
2861        if (WalSndCaughtUp && sentPtr == replicatedPtr &&
2862            !pq_is_send_pending())
2863        {
2864            /* Inform the standby that XLOG streaming is done */
2865            EndCommand("COPY 0", DestRemote);
2866            pq_flush();
2867
2868            proc_exit(0);
2869        }

What is more interesting, if one uses pg_recvlogical, it is not possible to
reproduce the issue. That is because pg_recvlogical responds to keepalive
messages with a flush location equal to the walEnd it received in the
keepalive. As a next logical step I tried to do the same in Python with
psycopg2.
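For context, the receive loop I ended up with looks roughly like the
asynchronous example from the psycopg2 documentation; this is a sketch shown
before my change (the slot "test", dbname "postgres" and the 10-second
keepalive interval are just the values from my test setup), and the
adjustment described below applies to it:

from datetime import datetime
from select import select

import psycopg2
import psycopg2.extras

conn = psycopg2.connect('dbname=postgres',
                        connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()
cur.start_replication(slot_name='test', decode=True)  # slot is assumed to exist


def consume(msg):
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)


keepalive_interval = 10.0  # seconds between feedback messages
while True:
    msg = cur.read_message()
    if msg:
        consume(msg)
    else:
        now = datetime.now()
        timeout = keepalive_interval - (now - cur.io_timestamp).total_seconds()
        try:
            sel = select([cur], [], [], max(0, timeout))
            if not any(sel):
                cur.send_feedback()  # timed out, send keepalive message
        except InterruptedError:
            pass  # recalculate timeout and continue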
Unfortunately, the keepalive handling of consume_stream() is hidden in
psycopg2's C code and it is not possible to change it without recompiling
psycopg2, but with the asynchronous interface shown above there is a way out:
http://initd.org/psycopg/docs/extras.html#psycopg2.extras.ReplicationCursor.wal_end
I just had to make one minor adjustment:

     try:
         sel = select([cur], [], [], max(0, timeout))
         if not any(sel):
-            cur.send_feedback()  # timed out, send keepalive message
+            cur.send_feedback(flush_lsn=cur.wal_end)  # timed out, send keepalive message
     except InterruptedError:
         pass  # recalculate timeout and continue

wal_end is a property of the cursor object, and it is updated not only for
every message received but also from keepalive messages. Basically, such a
little change made the Python example behave very similarly to pg_recvlogical.

All of the above probably looks like a brain dump, but I don't think it
conflicts with Tomas's findings; it rather complements them. I am very glad
that I now know how to mitigate the problem on the client side, but IMHO it
is also very important to fix the server behavior, if that is at all possible.

Regards,
--
Alexander Kukushkin