10.01.2022 23:52, Thomas Munro wrote: >> I'm yet to find out whether the other >> WaitLatchOrSocket' users (e. g. postgres_fdw) can suffer from the >> disconnected socket state, but this approach definitely works for >> walreceiver. > I see where you're going: there might be safe call sequences and > unsafe call sequences, and maybe walreceiver is asking for trouble by > double-polling. I'm not sure about that; I got the impression > recently that it's possible to get FD_CLOSE while you still have > buffered data to read, so then the next recv() will return > 0 and > then we don't have any state left anywhere to remember that we saw > FD_CLOSE, even if you're careful to poll and read in the ideal > sequence. I could be wrong, and it would be nice if there is an easy > fix along those lines... The documentation around FD_CLOSE is > unclear. I had no strong opinion regarding unsafe sequence, though initially I suspected that exactly the second libpqrcv_PQgetResult call could cause the issue. But after digging into WaitLatchOrSocket I'd inclined to put the fix deeper to satisfy all possible callers. At the other hand, I've shared Tom's concerns regarding other clients, that can stuck on WaitForMultipleObjects() just as walreceiver does, and hoped that only walreceiver suffer from a graceful server socket closing. So to get these doubts cleared, I've made a simple test for postgres_fdw (please look at the attachment; you can put it into contrib/postgres_fdw/t and run `vcregress taptest contrib\postgres_fdw`). This test shows for me: === ... t/001_disconnection.pl .. # 12:13:39.481084 executing query... # 12:13:43.245277 result: 0 # 0|0
# 12:13:43.246342 executing query... # 12:13:46.525924 result: 0 # 0|0 # 12:13:46.527097 executing query... # 12:13:47.745176 result: 3 # # psql:<stdin>:1: WARNING: no connection to the server # psql:<stdin>:1: ERROR: FATAL: terminating connection due to administrator co mmand # server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. # CONTEXT: remote SQL command: FETCH 100 FROM c1 # 12:13:47.794612 executing query... # 12:13:51.073318 result: 0 # 0|0 # 12:13:51.074347 executing query... === With the simple logging added to connection.c: /* Sleep until there's something to do */ elog(LOG, "pgfdw_get_result before WaitLatchOrSocket"); wc = WaitLatchOrSocket(MyLatch, WL_LATCH_SET | WL_SOCKET_READABLE | WL_EXIT_ON_PM_DEATH, PQsocket(conn), -1L, PG_WAIT_EXTENSION); elog(LOG, "pgfdw_get_result after WaitLatchOrSocket"); I see in 001_disconnection_local.log: ... 2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|LOG: pgfdw_get_result after WaitLatchOrSocket 2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|STATEMENT: SELECT * FROM large WHERE a = fx2(a) 2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|LOG: pgfdw_get_result before WaitLatchOrSocket 2022-01-11 15:13:52.875 MSK|Administrator|postgres|61dd747f.5e4|STATEMENT: SELECT * FROM large WHERE a = fx2(a) 2022-01-11 15:14:36.976 MSK|||61dd74ac.840|DEBUG: autovacuum: processing database "postgres" 2022-01-11 15:14:51.088 MSK|Administrator|postgres|61dd747f.5e4|LOG: pgfdw_get_result after WaitLatchOrSocket 2022-01-11 15:14:51.088 MSK|Administrator|postgres|61dd747f.5e4|STATEMENT: SELECT * FROM large WHERE a = fx2(a) 2022-01-11 15:14:51.089 MSK|Administrator|postgres|61dd747f.5e4|LOG: pgfdw_get_result before WaitLatchOrSocket 2022-01-11 15:14:51.089 MSK|Administrator|postgres|61dd747f.5e4|STATEMENT: SELECT * FROM large WHERE a = fx2(a) 2022-01-11 15:15:37.006 MSK|||61dd74e9.9e8|DEBUG: autovacuum: processing database "postgres" 2022-01-11 15:16:37.116 MSK|||61dd7525.ad0|DEBUG: autovacuum: processing database "postgres" 2022-01-11 15:17:37.225 MSK|||61dd7561.6a0|DEBUG: autovacuum: processing database "postgres" 2022-01-11 15:18:36.916 MSK|||61dd7470.704|LOG: checkpoint starting: time ... 2022-01-11 15:36:38.225 MSK|||61dd79d6.2a0|DEBUG: autovacuum: processing database "postgres" ... So here we get similar hanging on WaitLatchOrSocket(). Just to make sure that it's indeed the same issue, I've removed socket shutdown&close and the test executed to the end (several times). Argh. Best regards, Alexander
001_disconnection.pl
Description: Perl program