Inadequate failure reporting from poll_query_until

Tom Lane Mon, 15 Oct 2018 13:19:52 -0700

I just spent some time puzzling over why we've seen so many recoveryCheck
failures in the buildfarm recently --- excluding failures with known
causes, such as the pg_verify_checksums business, that's basically all
we've seen for the last week.  I am suspicious that there's some
timing-dependent problems in WAL streaming startup and shutdown, but
can't prove anything yet.  However, one case catches my eye:


https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2018-10-15%2016%3A01%3A59

This is the sum total of the visible logged information:

    #
    not ok 2 - asterisk in synchronous_standby_names
    
    #   Failed test 'asterisk in synchronous_standby_names'
    #   at t/007_sync_rep.pl line 26.

The failing test is evidently this one:

    test_sync_state(
            $node_master, qq(standby1|1|sync
    standby2|1|potential
    standby3|1|potential),
            'asterisk in synchronous_standby_names',
            '*');

and when you drill down, you find that PostgresNode.pm's poll_query_until
must have timed out, and that blank line there is as much as it felt like
telling us about it.  Evidently what we were getting back from the server
did not match the expected output, but good luck guessing what it was.

I think we ought to fix poll_query_until so that upon failure, you get
a much more complete and intelligible report, along the lines of

poll_query_until timed out executing this query:
$query
expecting this output:
$expected
last actual query output:
$stdout
with stderr:
$stderr

Objections?  Thoughts about better formatting?

                        regards, tom lane

Inadequate failure reporting from poll_query_until

Reply via email to