Hi,

On 2024-03-20 17:41:47 -0700, Andres Freund wrote:
> There's a lot of other animals on the same machine, however it's rarely fully
> loaded (with either CPU or IO).
>
> I don't think the test just being slow is the issue here, e.g. in the last
> failing iteration
>
> [...]
>
> I suspect we have some more fundamental instability at our hands, there have
> been failures like this going back a while, and on various machines.

I'm somewhat confused by the timestamps in the log:

[22:07:50.263](223.929s) ok 2 - regression tests pass
...
[22:14:02.051](371.788s) # poll_query_until timed out executing this query:

I read this as 371.788s having passed between the messages. Which of course is
much higher than PostgreSQL::Test::Utils::timeout_default=180

Ah. The way that poll_query_until() implements timeouts seems decidedly
suboptimal. If a psql invocation, including query processing, takes any
appreciable amount of time, poll_query_until() waits much longer than it
should, because it very naively determines a number of waits ahead of time:

	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
	my $attempts = 0;

	while ($attempts < $max_attempts)
	{
...
		# Wait 0.1 second before retrying.
		usleep(100_000);

		$attempts++;
	}

Ick.

What's worse is that if the query takes too long, the timeout afaict never
takes effect.

Greetings,

Andres Freund
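
PS: To make the problem concrete: with timeout_default=180 the loop allows
10 * 180 = 1800 attempts, and only the 0.1s sleeps are budgeted. If each psql
round trip also took roughly 0.1s, the total would come out around
1800 * 0.2s = 360s, which is in the neighbourhood of the 371.788s above. A
minimal sketch of a deadline-based loop follows. It is not the actual
PostgreSQL::Test::Cluster code: poll_until_deadline() and its $check callback
are hypothetical names used only for illustration, and it assumes wall-clock
time from Time::HiRes is good enough for a test timeout.

```perl
use strict;
use warnings;
use Time::HiRes qw(time usleep);

# Hypothetical helper, not part of PostgreSQL's test modules.
# Calls $check repeatedly until it returns true or until $timeout
# seconds of wall-clock time have elapsed since the first call.
# Returns 1 on success, 0 on timeout.
sub poll_until_deadline
{
	my ($check, $timeout) = @_;

	# Fix the deadline once, up front, instead of counting attempts.
	my $deadline = time() + $timeout;

	while (1)
	{
		return 1 if $check->();

		# Time spent inside $check counts against the deadline too.
		last if time() >= $deadline;

		# Wait 0.1 second before retrying.
		usleep(100_000);
	}

	return 0;
}
```

With this shape the total wait is bounded by the timeout plus the duration of
one check, however slow the individual checks are. It still cannot interrupt
a single check that never returns; that would need a timeout on the psql
invocation itself.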