On 2020/11/24 5:52, Alvaro Herrera wrote:
Hello,

Chloe Dives reported that sometimes a walsender would become stuck during shutdown and *not* shutdown, thus preventing postmaster from completing the shutdown cycle. This has been observed to cause the servers to remain in such state for several hours.

After a lengthy investigation and thanks to a handy reproducer by Chris Wilson, we found that the problem is that WalSndDone wants to avoid shutting down until everything has been sent and acknowledged; but this test is coded in a way that ignores the possibility that we have never received anything from the other end. In that case, both MyWalSnd->flush and MyWalSnd->write are InvalidRecPtr, so the condition in WalSndDone to terminate the loop is never fulfilled. So the walsender is looping forever and never terminates, blocking shutdown of the whole instance.

The attached patch fixes the problem by testing for the problematic condition.

Apparently this problem has existed forever. Fujii-san almost patched for it in 5c6d9fc4b2b8 (2014!), but missed it by a zillionth of an inch.
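For readers following the thread, the stuck condition described above can be modelled roughly as below. This is only a simplified sketch, not the walsender source and not the attached patch: `RecPtr`, `INVALID_REC_PTR`, `SenderState`, `done_original` and `done_with_guard` are hypothetical stand-ins, the invalid position is assumed to be 0, and the policy chosen in the guard (stop waiting when the peer has never reported anything) is an assumption about what "testing for the problematic condition" could look like.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the WAL position type and its "invalid" value (assumed 0). */
typedef uint64_t RecPtr;
#define INVALID_REC_PTR ((RecPtr) 0)

/* Hypothetical, reduced view of the walsender state discussed above. */
typedef struct
{
    RecPtr sent;   /* last position sent to the other end */
    RecPtr write;  /* last write position reported back, or invalid */
    RecPtr flush;  /* last flush position reported back, or invalid */
} SenderState;

/*
 * Termination test as described in the message: use flush if it is valid,
 * otherwise write, and require it to match what has been sent.
 * If no feedback was ever received, both are invalid and this never
 * becomes true once anything has been sent.
 */
bool
done_original(const SenderState *s)
{
    RecPtr replicated = (s->flush == INVALID_REC_PTR) ? s->write : s->flush;

    return s->sent == replicated;
}

/*
 * Variant with an explicit test for the problematic case (assumed policy):
 * if the other end never reported a position, do not wait for one.
 */
bool
done_with_guard(const SenderState *s)
{
    if (s->write == INVALID_REC_PTR && s->flush == INVALID_REC_PTR)
        return true;

    return done_original(s);
}
```

With a state such as `{ .sent = 1000, .write = INVALID_REC_PTR, .flush = INVALID_REC_PTR }`, `done_original` stays false on every iteration, which corresponds to the endless loop reported here, while `done_with_guard` lets the sender exit.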
Thanks for working on this!

Could you tell me the discussion thread where Chloe Dives reported the issue? Sorry, I could not find it. I'd like to see the procedure to reproduce the issue.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION