I just spent some time puzzling over why we've seen so many recoveryCheck failures in the buildfarm recently --- excluding failures with known causes, such as the pg_verify_checksums business, that's basically all we've seen for the last week. I am suspicious that there's some timing-dependent problems in WAL streaming startup and shutdown, but can't prove anything yet. However, one case catches my eye:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2018-10-15%2016%3A01%3A59 This is the sum total of the visible logged information: # not ok 2 - asterisk in synchronous_standby_names # Failed test 'asterisk in synchronous_standby_names' # at t/007_sync_rep.pl line 26. The failing test is evidently this one: test_sync_state( $node_master, qq(standby1|1|sync standby2|1|potential standby3|1|potential), 'asterisk in synchronous_standby_names', '*'); and when you drill down, you find that PostgresNode.pm's poll_query_until must have timed out, and that blank line there is as much as it felt like telling us about it. Evidently what we were getting back from the server did not match the expected output, but good luck guessing what it was. I think we ought to fix poll_query_until so that upon failure, you get a much more complete and intelligible report, along the lines of poll_query_until timed out executing this query: $query expecting this output: $expected last actual query output: $stdout with stderr: $stderr Objections? Thoughts about better formatting? regards, tom lane