Thanks for the analysis!
On 05/03/2025 02:41, Andres Freund wrote:
On 2025-03-04 09:48:45 +1300, Thomas Munro wrote:
On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <[email protected]> wrote:
On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote:
[1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256
A possibly related test failure is:
https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259
They look different to me:
Thanks for checking!
(1) In Jelte's report for 003_start_stop, we see:
[03:40:42.347](2.022s) ok 1 - dead-end connection 0
[03:40:42.348](0.001s) ok 2 - dead-end connection 1
[03:40:42.349](0.001s) ok 3 - dead-end connection 2
... up to 20 ...
We see in the server log that the first one had pid 31660 and slot 1,
but it never said "SSLRequest rejected" like all the others. The Perl
code must have received "N", or it would not have moved onto opening
the next connection. Then just as it is about to try to connect,
hoping to see "too many clients", that one exits, the "invalid_role"
connection gets slot 1 instead of the dead-end backend we expected:
03:40:42.375 UTC [31645][postmaster] DEBUG: client backend (PID
31660) exited with exit code 0 <-- goodbye, silent backend
03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
connection received: host=[local]
03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
SSLRequest rejected
03:40:42.511 UTC [31645][postmaster] DEBUG: assigned pm child slot 1
for backend <-- we get silent backend's slot
03:40:42.512 UTC [31645][postmaster] DEBUG: forked new client
backend, pid=31690 socket=7
03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG:
connection received: host=[local]
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
connection authenticated: user="invalid_user" method=trust
(/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117)
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
connection authorized: user=invalid_user database=postgres
application_name=003_start_stop.pl
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL:
role "invalid_user" does not exist
Good catch.
The pattern is similar in other failed tests. There has been a spate of errors
on postgres/postgres' CI, which made me look at this again.
I see what's going on here. Before the loop that opens all the
connections with the SSLRequests, the test does this:
if (!$node->raw_connect_works())
{
plan skip_all => "this test requires working raw_connect()";
}
That's the connection you see in the log without the SSLRequest.
When the test fails, what happens is that the backend handling that
connection is slow. It lingers around, even though the perl client has
already closed the socket. The backend finally exits _just_ when the
test has opened all the "dead end" connections, and checks that the
final connection fails with the "FATAL: sorry, too many clients
already" error. Because the backend exits at just the right moment, it
frees up a connection slot which is then used by that final connection.
The test fails, because it expects all the connection slots to be in use
and for that final connection to launch a dead-end backend.
The 002_connection_limits.pl test has potential for a similar issue. I
added a "sleep(1)" to proc_exit to simulate slow backend exit, and the
test started failing, because the "safe_psql" calls used in the
initialization was still consuming a connection slot.
I've pushed a fix to both tests, by restarting the server after the
initialization steps. That's a little heavy-weight, but it's a simple
way to ensure that all the backends are gone.
- Heikki