Recently, buildfarm member curculio has started to show a semi-repeatable failure in src/test/recovery/t/013_crash_restart.pl:
# aborting wait: program died
# stream contents: >>psql:<stdin>:8: no connection to the server
# psql:<stdin>:8: connection to server was lost
# <<
# pattern searched for: (?^m:server closed the connection unexpectedly)

#   Failed test 'psql query died successfully after SIGKILL'
#   at t/013_crash_restart.pl line 198.

The message this test is looking for is what libpq reports upon getting
EOF or ECONNRESET from a socket read attempt.  The message it's actually
seeing is what libpq reports if it notices that the PGconn is *already*
in CONNECTION_BAD state when it's trying to send a new query.

I have no idea why we're seeing this in only one buildfarm member, and
only for the past week or so, since it doesn't appear that any related
code has changed for months.  (Perhaps something changed about
curculio's host?)

I do see a plausibly related change back in October: commit 4247db625
added PQconsumeInput() calls at points where psql has already emitted a
query result but has not yet gone idle.

So my rough theory about what's happening is that psql loses the CPU
immediately after printing the query result containing
"in-progress-before-sigquit", which allows the test script to proceed,
and doesn't get it back until after the connected backend commits
hara-kiri.  Then PQconsumeInput() tries to read the socket, gets EOF or
ECONNRESET, and sets an error message (that's ignored) along with
setting state = CONNECTION_BAD.  Then, when psql tries to send the next
query, we get the observed message instead of what the test is
expecting.

I'm inclined to write this off as an inevitable race condition and just
change the test script to accept either message as a successful result.
I think that 4247db625 made such races more likely, but I don't believe
the race was impossible before it.

Another idea is to change libpq so that both of these cases emit
identical messages, but I don't really feel that that'd be an
improvement.  Also, since 4247db625 was back-patched, we'd have to
back-patch the message change as well, which I like even less.  People
might be relying on seeing one message spelling or the other in some
situations.

In any case, it'd be nice to have an explanation of why this just
started failing --- that fact makes it even harder to credit that
4247db625 was the trigger.

Thoughts?

			regards, tom lane
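
For concreteness, a minimal sketch of the "accept either message" idea,
assuming the test keeps whatever pattern-match check produced the
"pattern searched for" diagnostic above and only the regex is relaxed
(the variable name and sample strings below are illustrative only, not
an actual patch):

    use strict;
    use warnings;

    # Illustrative only: treat either libpq message as successful
    # detection of the killed backend.
    my $query_died_pattern =
      qr/server closed the connection unexpectedly|connection to server was lost/m;

    # Both observed spellings satisfy the relaxed check:
    for my $msg (
        'server closed the connection unexpectedly',
        'psql:<stdin>:8: connection to server was lost')
    {
        print(($msg =~ $query_died_pattern ? 'match' : 'NO match'), ": $msg\n");
    }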