Re: tests against running server occasionally fail, postgres_fdw & tenk1

Andres Freund Thu, 08 Dec 2022 16:36:23 -0800

Hi,

On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> commit 3f0e786ccbf
> Author: Andres Freund <and...@anarazel.de>
> Date:   2022-12-07 12:13:35 -0800
> 
>     meson: Add 'running' test setup, as a replacement for installcheck
> 
> CI tests the pg_regress/isolationtester tests that support doing so against a
> running server.
> 
> 
> Unfortunately cfbot shows that that doesn't work entirely reliably.
> 
> The most frequent case is postgres_fdw, which somewhat regularly fails with a
> regression.diff like this:
> 
> diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out	2022-12-08 20:35:24.772888000 +0000
> +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out	2022-12-08 20:43:38.199450000 +0000
> @@ -9911,8 +9911,7 @@
>       WHERE application_name = 'fdw_retry_check';
>   pg_terminate_backend
>  ----------------------
> - t
> -(1 row)
> +(0 rows)
> 
>  -- This query should detect the broken connection when starting new remote
>  -- transaction, reestablish new connection, and then succeed.
> 
> 
> See e.g.
> https://cirrus-ci.com/task/5925540020879360
> https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs
> https://api.cirrus-ci.com/v1/artifact/task/5925540020879360/testrun/build/testrun/runningcheck.log
> 
> 
> The following comment in the test provides a hint about what might be happening:
> 
> -- If debug_discard_caches is active, it results in
> -- dropping remote connections after every transaction, making it
> -- impossible to test termination meaningfully.  So turn that off
> -- for this test.
> SET debug_discard_caches = 0;
> 
> 
> I guess that a cache reset message arrives and leads to the connection being
> terminated, so that by the time pg_terminate_backend() runs there is no
> 'fdw_retry_check' backend left to terminate. Unfortunately that's hard to
> verify right now, as the relevant log messages are emitted at DEBUG3, which is
> quite verbose, so enabling it for all tests would be painful.
> 
> I *think* I have seen this failure locally at least once, when running the
> test normally.
> 
> 
> I'll try to reproduce this locally for a bit. If I can't, the only other idea
> I have for debugging this is to change log_min_messages in that section of the
> postgres_fdw test to DEBUG3.
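A minimal sketch of that narrower logging change follows. It assumes the edit goes directly into contrib/postgres_fdw/sql/postgres_fdw.sql and that the test session runs as a superuser, because log_min_messages can only be changed by superusers. The placeholder comments stand in for the existing statements of the termination check and are not real test content. The postgres_fdw_get_connections() call (available since PostgreSQL 14) is an added diagnostic, not something the current test does. It reports whether the local backend still holds a cached loopback connection just before the terminate step. Adding it would also require updating the expected output.

```sql
-- Hypothetical debugging change: raise server log verbosity only around
-- the connection-termination check, so postgres_fdw's DEBUG3 messages about
-- discarding cached connections reach the server log without making every
-- other test noisy. Requires superuser.
SET log_min_messages = debug3;

-- ... existing statements that set application_name 'fdw_retry_check'
-- ... and establish the remote connection (placeholder) ...

-- Added diagnostic (assumption): list connections this backend still has
-- cached. If a cache reset already dropped the loopback connection, it will
-- be missing here, which would explain pg_terminate_backend() finding 0 rows.
SELECT server_name, valid FROM postgres_fdw_get_connections() ORDER BY 1;

-- ... existing pg_terminate_backend() check (placeholder) ...

RESET log_min_messages;
```

Keeping the SET/RESET pair inside the test file limits the extra log volume to this one section, which avoids the cost of running every test at DEBUG3.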


Oh. I tried to reproduce the issue, without success so far, but eventually my
test loop got stuck in something I reported previously and forgot about since:
https://www.postgresql.org/message-id/20220925232237.p6uskba2dw6fnwj2%40awork3.anarazel.de

I wonder if the timing on the FreeBSD CI task works out to hitting a "smaller
version" of the problem that eventually resolves itself, which then leads to a
sinval reset getting sent, causing the failure we observe.

Greetings,

Andres Freund
