On Thu, Aug 26, 2021 at 2:55 PM Daniel Gustafsson <dan...@yesql.se> wrote:
>
> When using pg_basebackup with WAL streaming (-X stream), we have observed on a
> number of times in production that the streaming child exited prematurely (to
> no fault of the code it seems, most likely due to network middleboxes), which
> cause the backup to fail but only after it has run to completion.  On long
> running backups this can consume a lot of time before it’s noticed.

Hm.

> By trapping the failure of the streaming process we can instead exit early to
> allow the user to fix and/or restart the process.
>
> The attached adds a SIGCHLD handler for Unix, and catch the returnvalue from
> the Windows thread, in order to break out early from the main loop.  It still
> needs a test, and proper testing on Windows, but early feedback on the
> approach would be appreciated.

Here are some comments on the patch:

1) Do we need the volatile keyword here so that the value of the variable is
always read from memory?

+static volatile sig_atomic_t bgchild_exited = false;

2) Do we need #ifndef WIN32 ... #endif around the sigchld_handler function
definition?

3) I'm not sure whether the new value of bgchild_exited set in the child
thread will be visible to the main thread on Windows. In theory, though, I
understand that the memory is shared between the main thread of the process
and the child thread.

    #ifdef WIN32
        /*
         * In order to signal the main thread of an ungraceful exit we
         * set the flag used on Unix to signal SIGCHLD.
         */
        bgchild_exited = true;
    #endif

4) How about "set the same flag that we use on Unix to signal SIGCHLD."
instead of "* set the flag used on Unix to signal SIGCHLD."?

5) How about "background WAL receiver terminated unexpectedly" instead of
"log streamer child terminated unexpectedly"? This would be in sync with the
existing message "starting background WAL receiver". "log streamer" is the
term used internally in the code; users don't know it by that name.

6) How about reporting the exit code (like the postmaster's reaper function
does) instead of just a message saying that the process terminated
unexpectedly? It would be useful to know why the process exited. For Windows,
we can use GetExitCodeThread (I'm referring to the code around waitpid in
pg_basebackup), and for Unix we can use waitpid.

Regards,
Bharath Rupireddy.
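
For illustration of comments 1) and 6), here is a minimal standalone sketch
for Unix. It is not taken from the patch. The names bgchild_exited and
sigchld_handler follow the patch, but their bodies here are assumptions, and
describe_exit and run_demo are hypothetical helpers invented for this example.
The message text is only illustrative. The sketch sets a volatile sig_atomic_t
flag from a SIGCHLD handler, polls it from a stand-in for the main loop, and
then uses waitpid to find out how the child ended.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Written by the signal handler, polled by the main loop. */
static volatile sig_atomic_t bgchild_exited = 0;

/* Only sets the flag; nothing else here needs to be async-signal-safe. */
static void
sigchld_handler(int signo)
{
    (void) signo;
    bgchild_exited = 1;
}

/* Hypothetical helper: turn a waitpid() status into a readable message. */
static void
describe_exit(int status, char *buf, size_t buflen)
{
    if (WIFEXITED(status))
        snprintf(buf, buflen,
                 "background WAL receiver exited with exit code %d",
                 WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(buf, buflen,
                 "background WAL receiver was terminated by signal %d",
                 WTERMSIG(status));
    else
        snprintf(buf, buflen,
                 "background WAL receiver exited with unrecognized status %d",
                 status);
}

/*
 * Hypothetical driver: fork a child that exits immediately with
 * child_exit_code, wait for the flag, then reap the child.
 * Returns the child's exit code, or -1 on error or abnormal exit.
 */
static int
run_demo(int child_exit_code, char *buf, size_t buflen)
{
    struct sigaction sa;
    struct timespec ts = {0, 10 * 1000 * 1000};    /* 10 ms */
    pid_t       pid;
    int         status;

    bgchild_exited = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(child_exit_code);     /* child: simulate a premature exit */

    /* Stand-in for the main backup loop: leave as soon as the flag is set. */
    while (!bgchild_exited)
        nanosleep(&ts, NULL);

    /* Reap the child, retrying if a signal interrupts waitpid(). */
    while (waitpid(pid, &status, 0) < 0)
    {
        if (errno != EINTR)
            return -1;
    }

    describe_exit(status, buf, buflen);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_demo(3, buf, sizeof(buf)) with a sufficiently large buf should
return 3 and leave an exit-code message in buf. On Windows the equivalent
information would presumably come from GetExitCodeThread, which this sketch
does not cover.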