On 5/18/21 11:03 PM, Michael Paquier wrote:
>
>> 3. Once the subscriber1 postmaster has exited, the TAP
>> test will eventually time out, and then this happens:
>>
>> [.. logs ..]
>>
>> That is, because we failed to shut down subscriber1, the
>> test script neglects to shut down subscriber2, and now
>> things just sit indefinitely. So that's a robustness
>> problem in the TAP infrastructure, rather than a bug in
>> PG proper; but I still say it's a bug that needs fixing.
>
> This one comes down to teardown_node() that uses system_or_bail(),
> leaving things unfinished. I guess that we could be more aggressive
> and ignore failures if we have a non-zero error code and that not all
> the tests have passed within the END block of PostgresNode.pm.

Yeah, this area needs substantial improvement. I have seen similar sorts
of nasty hangs, where the script is waiting forever for some process
that hasn't got the shutdown message. At the very least we probably need
some way of making sure the END handler doesn't abort early. Maybe
PostgresNode::stop() needs a mode that handles failure more gracefully.
Maybe it needs to try shutting down all the nodes and only call
BAIL_OUT after trying all of them and getting a failure. But that might
still leave us work to do on failures occurring pre-END.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
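
To illustrate the "try all the nodes, then bail" idea, here is a minimal
sketch. It is Python rather than the Perl used in PostgresNode.pm, and
stop_all(), ShutdownError and the stop callback are hypothetical names
made up for the example, not existing PostgresNode functions. The point
is only the control flow: a failure on one node is recorded rather than
raised, so the remaining nodes still get their shutdown attempt.

```python
class ShutdownError(Exception):
    """Raised once, after every node has had a shutdown attempt."""


def stop_all(nodes, stop):
    """Try to stop every node; report failures only at the end.

    `nodes` is any iterable of node handles and `stop` is a callable
    that stops one node and raises on failure. Both are assumptions
    for this sketch, not part of any real test framework.
    """
    failures = []
    for node in nodes:
        try:
            stop(node)
        except Exception as exc:
            # Remember the failure but keep going, so one stuck node
            # cannot prevent the others from being shut down.
            failures.append((node, exc))

    if failures:
        names = ", ".join(str(node) for node, _ in failures)
        # Rough analogue of calling BAIL_OUT only after trying all nodes.
        raise ShutdownError(f"failed to stop: {names}")
```

With nodes named publisher, subscriber1 and subscriber2, and a stop
callback that fails only for subscriber1, subscriber2 is still stopped
before the error is raised. As noted above, this does nothing for
failures that happen before the END handler runs.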