I just noticed something very interesting: in a couple of recent buildfarm runs with this failure, the pg_stat_activity printout no longer shows the extra walsender:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2017%3A50%3A10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-03-23%2011%3A00%3A05

This is just two of the 33 such failures in the past ten days, so maybe it's not surprising that we didn't see it already. (I got bored before looking back further than that.)

What this suggests to me is that maybe the extra walsender is indeed not blocked on anything, but is just taking its time about exiting. In these two runs, as well as in all the non-failing runs, it had enough time to do so.

I suggest that we add a couple-of-seconds sleep in front of the query that collects walsender PIDs, and maybe a couple more seconds before the pg_stat_activity probe in the failure path, and see if the behavior changes at all. That should be enough to confirm or disprove this idea pretty quickly.

If it is right, a permanent fix could be to wait for the basebackup's walsender to disappear from node_primary3's pg_stat_activity before we start the one for node_standby3.

			regards, tom lane