Recently, buildfarm member curculio has started to show a semi-repeatable failure in src/test/recovery/t/013_crash_restart.pl:
# aborting wait: program died
# stream contents: >>psql:<stdin>:8: no connection to the server
# psql:<stdin>:8: connection to server was lost
# <<
# pattern searched for: (?^m:server closed the connection unexpectedly)

#   Failed test 'psql query died successfully after SIGKILL'
#   at t/013_crash_restart.pl line 198.

The message this test is looking for is what libpq reports upon getting
EOF or ECONNRESET from a socket read attempt.  The message it's actually
seeing is what libpq reports if it notices that the PGconn is *already*
in CONNECTION_BAD state when it's trying to send a new query.

I have no idea why we're seeing this in only one buildfarm member, and
only for the past week or so, since it doesn't appear that any related
code has changed for months.  (Perhaps something changed about
curculio's host?)

I do see a plausibly related change back in October: commit 4247db625
added PQconsumeInput() calls at points where psql has already emitted a
query result but has not yet gone idle.

So my rough theory about what's happening is that psql loses the CPU
immediately after printing the query result containing
"in-progress-before-sigquit", which allows the test script to proceed,
and doesn't get it back until after the connected backend commits
hara-kiri.  Then PQconsumeInput() tries to read the socket, gets EOF or
ECONNRESET, and sets an error message (that's ignored) along with
setting state = CONNECTION_BAD.  Then, when psql tries to send the next
query, we get the observed message instead of what the test is
expecting.

I'm inclined to write this off as an inevitable race condition and just
change the test script to accept either message as a successful result.
I think that 4247db625 made such races more likely, but I don't believe
the race was impossible before it.

Another idea is to change libpq so that both of these cases emit
identical messages, but I don't really feel that that'd be an
improvement.  Also, since 4247db625 was back-patched, we'd have to
back-patch the message change as well, which I like even less.  People
might be relying on seeing one message spelling or the other in some
situations.

In any case, it'd be nice to have an explanation of why this just
started failing --- that fact makes it even harder to credit that
4247db625 was the trigger.

Thoughts?

			regards, tom lane
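
For concreteness, a minimal sketch of the "accept either message" idea,
assuming the test keeps whatever pattern-match check produced the
"pattern searched for" diagnostic above and only the regex is relaxed
(the variable name and sample strings below are illustrative only, not
an actual patch):

    use strict;
    use warnings;

    # Illustrative only: treat either libpq message as successful
    # detection of the killed backend.
    my $query_died_pattern =
      qr/server closed the connection unexpectedly|connection to server was lost/m;

    # Both observed spellings satisfy the relaxed check:
    for my $msg (
        'server closed the connection unexpectedly',
        'psql:<stdin>:8: connection to server was lost')
    {
        print(($msg =~ $query_died_pattern ? 'match' : 'NO match'), ": $msg\n");
    }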