On Tue, Jan 11, 2022 at 6:00 AM Alexander Lakhin <exclus...@gmail.com> wrote: > 10.01.2022 12:40, Thomas Munro wrote: > > This is super quick-and-dirty code (and doesn't handle some errors or > > socket changes correctly), but does it detect the closed socket?
> Yes, it fixes the behaviour and makes the 002_standby test pass (100 of > 100 iterations). Thanks for testing. That result does seem to confirm the hypothesis that FD_CLOSE is reported only once for the socket on graceful shutdown (that is, it's edge-triggered and incidentally you won't get FD_READ), so you need to keep track of it carefully. Incidentally, another observation is that your WSAPoll() test appears to be returning POLLHUP where at least Linux, FreeBSD and Solaris would not: a socket that is only half shut down (the primary shut down its end gracefully, but walreceiver did not), so I suspect Windows' POLLHUP might have POLLRDHUP semantics. > I'm yet to find out whether the other > WaitLatchOrSocket' users (e. g. postgres_fdw) can suffer from the > disconnected socket state, but this approach definitely works for > walreceiver. I see where you're going: there might be safe call sequences and unsafe call sequences, and maybe walreceiver is asking for trouble by double-polling. I'm not sure about that; I got the impression recently that it's possible to get FD_CLOSE while you still have buffered data to read, so then the next recv() will return > 0 and then we don't have any state left anywhere to remember that we saw FD_CLOSE, even if you're careful to poll and read in the ideal sequence. I could be wrong, and it would be nice if there is an easy fix along those lines... The documentation around FD_CLOSE is unclear. I do plan to make a higher quality patch like the one I showed (material from earlier unfinished work[1] that needs a bit more infrastructure), but to me that's new feature/efficiency work, not something we'd want to back-patch. Hmm, one thing I'm still unclear on: did this problem really start with 6051857fc/ed52c3707? My initial email in this thread lists similar failures going back further, doesn't it? (And what's tern doing mixed up in this mess?) [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJPaygh-6WHEd0FnH89GrkTpVyN_ew9ckv3%2BnwjmLcSeg%40mail.gmail.com#aa33ec3e7ad85499f35dd1434a139c3f