Hi,

See e.g. https://cirrus-ci.com/task/4682373060100096
2022-10-01 15:15:21.849 UTC [41962][postmaster] LOG:  could not bind IPv4 
address "127.0.0.1": Address already in use
2022-10-01 15:15:21.849 UTC [41962][postmaster] HINT:  Is another postmaster 
already running on port 57003? If not, wait a few seconds and retry.
2022-10-01 15:15:21.849 UTC [41962][postmaster] WARNING:  could not create 
listen socket for "127.0.0.1"

I downloaded all test logs and grepping for 57003 shows the problem:

build/testrun/ldap/001_auth/log/regress_log_001_auth
3:# Checking port 57003
4:# Found port 57003
22:# Running: /usr/sbin/slapd -f 
/tmp/cirrus-ci-build/build/testrun/ldap/001_auth/data/slapd.conf -h 
ldap://localhost:57002 ldaps://localhost:57003

build/testrun/ldap/001_auth/log/001_auth_node.log
253:2022-10-01 15:15:25.103 UTC [42574][client backend] [[unknown]][3/1:0] 
DETAIL:  Connection matched pg_hba.conf line 1: "local all all ldap 
ldapurl="ldaps://localhost:57003/dc=example,dc=net??sub?(uid=$username)" 
ldaptls=1"

build/testrun/ssl/001_ssltests/log/regress_log_001_ssltests
2:# Checking port 57003
3:# Found port 57003
8:Connection string: port=57003 host=/tmp/1k5yhaWLQ1

build/testrun/ssl/001_ssltests/log/001_ssltests_primary.log
2:2022-10-01 15:15:20.668 UTC [41740][postmaster] LOG:  listening on Unix 
socket "/tmp/1k5yhaWLQ1/.s.PGSQL.57003"
58:2022-10-01 15:15:21.849 UTC [41962][postmaster] HINT:  Is another postmaster 
already running on port 57003? If not, wait a few seconds and retry.


I.e. we chose the same port for slapd as part of ldap's 001_auth.pl as for the
postgres instance of ssl's 001_ssltests.pl.


I don't think get_free_port() has any chance of being reliable as is. It's
fundamentally racy just among concurrently running tests, without even
considering things external to the tests (given it's using the range of ports
auto-assigned for client tcp ports...).


The current code is from 803466b6ffa, which said:
    This isn't 100% bulletproof, since
    conceivably something else on the machine could grab the port between
    the time we check and the time we actually start the server.  But that's
    a pretty short window, so in practice this should be good enough.

but I've seen this fail a couple times, so I suspect it's unfortunately not
good enough.


I can see a few potential ways of improving the situation:

1) Improve the port we start the search for a unique port from. We use the
   following bit right now:

        # Tracking of last port value assigned to accelerate free port lookup.
        $last_port_assigned = int(rand() * 16384) + 49152;

   It *might* be less likely that we hit conflicts if we start the search on a
   port based on the pid, rather than rand(). We e.g., could start at
   something like (($$ * 16) % 16384 + 49152), giving a decent likelihood that
   each test has 16 free ports.

   Perhaps also worth to increase the range of ports searched?


2) Use a lockfile containing a pid to protect the choice of a port within a
   build directory. Before accepting a port get_free_port() would check if the
   a lockfile exists for the port and if so, if the test using it is still
   alive.  That will protect against racyness between multiple tests inside a
   build directory, but won't protect against races between multiple builds
   running concurrently on a machine (e.g. on a buildfarm host)


3) We could generate unique port ranges for each test and thus remove the
   chance of conflicts within a builddir and substantially reduce the
   likelihood of conflicts between build directories (as the conflict would be
   between two tests in different build directories, rather than all tests).

   This would be easy to do in the meson build, but nontrivial in autoconf
   afaics.


4) Add retries to the tests that need get_free_port(). That seems hard to get
   right, because every single restart of postgres (or some other service that
   we used get_free_port()) provides the potential for a new conflict.


Greetings,

Andres Freund


Reply via email to