On Mon, Sep 04, 2023 at 03:18:40PM +1200, Thomas Munro wrote:
> Somehow these tests have recently become unstable and have failed a few times:
> 
> https://github.com/postgres/postgres/commits/REL_15_STABLE
> 
> The failures are like:
> 
> [22:32:26.722] # Failed test 'pgbench simple update stdout
> /(?^:builtin: simple update)/'
> [22:32:26.722] # at t/001_pgbench_with_server.pl line 119.
> [22:32:26.722] # 'pgbench (15.4)
> [22:32:26.722] # '
> [22:32:26.722] # doesn't match '(?^:builtin: simple update)'

Fun.  That's a test of "pgbench -C".  The test harness isn't reporting
pgbench's stderr, so I hacked things to get that and the actual file
descriptor values being assigned.  The failure case gets "pgbench: error: too
many client connections for select()" in stderr, from this pgbench.c function:

static void
add_socket_to_set(socket_set *sa, int fd, int idx)
{
        if (fd < 0 || fd >= FD_SETSIZE)
        {
                /*
                 * Doing a hard exit here is a bit grotty, but it doesn't seem 
worth
                 * complicating the API to make it less grotty.
                 */
                pg_fatal("too many client connections for select()");
        }
        FD_SET(fd, &sa->fds);
        if (fd > sa->maxfd)
                sa->maxfd = fd;
}

The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
of windows containers" (2023-02), long before the failures began.  I'll guess
some 2023-08 Windows update or reconfiguration altered file descriptor
assignment, hence the onset of failures.  In my tests of v16, the highest file
descriptor was 948.  I could make v16 fail by changing --client=5 to
--client=90 in the test.  With the attached patch and --client=90, v16 peaked
at file descriptor 2040.

Thanks,
nm

P.S. Later, we should change test code so the pgbench stderr can't grow an
extra line without that line appearing in test logs.
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    Don't spuriously report FD_SETSIZE exhaustion on Windows.
    
    Starting on 2023-08-03, this intermittently terminated a "pgbench -C"
    test in CI.  It could affect a high-client-count "pgbench" without "-C".
    While parallel reindexdb and vacuumdb reach the same problematic check,
    sufficient client count and/or connection turnover is less plausible for
    them.  Given the lack of examples from the buildfarm or from manual
    builds, reproducing this must entail rare operating system
    configurations.  Also correct the associated error message, which was
    wrong for non-Windows and did not comply with the style guide.
    Back-patch to v12, where the pgbench check first appeared.  While v11
    vacuumdb has the problematic check, reaching it with typical vacuumdb
    usage is implausible.
    
    Reviewed by FIXME.
    
    Discussion: 
https://postgr.es/m/CA+hUKG+JwvTNdcyJTriy9BbtzF1veSRQ=9m_zkfn9_lqe7k...@mail.gmail.com

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e3919395..a1f566c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7837,14 +7837,22 @@ clear_socket_set(socket_set *sa)
 static void
 add_socket_to_set(socket_set *sa, int fd, int idx)
 {
+       /* See connect_slot() for background on this code. */
+#ifdef WIN32
+       if (sa->fds.fd_count + 1 >= FD_SETSIZE)
+       {
+               pg_log_error("too many concurrent database clients for this 
platform: %d",
+                                        sa->fds.fd_count + 1);
+               exit(1);
+       }
+#else
        if (fd < 0 || fd >= FD_SETSIZE)
        {
-               /*
-                * Doing a hard exit here is a bit grotty, but it doesn't seem 
worth
-                * complicating the API to make it less grotty.
-                */
-               pg_fatal("too many client connections for select()");
+               pg_log_error("socket file descriptor out of range: %d", fd);
+               pg_log_error_hint("Try fewer concurrent database clients.");
+               exit(1);
        }
+#endif
        FD_SET(fd, &sa->fds);
        if (fd > sa->maxfd)
                sa->maxfd = fd;
diff --git a/src/fe_utils/parallel_slot.c b/src/fe_utils/parallel_slot.c
index aca238b..2ee4b6d 100644
--- a/src/fe_utils/parallel_slot.c
+++ b/src/fe_utils/parallel_slot.c
@@ -295,8 +295,40 @@ connect_slot(ParallelSlotArray *sa, int slotno, const char 
*dbname)
        slot->connection = connectDatabase(sa->cparams, sa->progname, sa->echo, 
false, true);
        sa->cparams->override_dbname = old_override;
 
-       if (PQsocket(slot->connection) >= FD_SETSIZE)
-               pg_fatal("too many jobs for this platform");
+       /*
+        * POSIX defines FD_SETSIZE as the highest file descriptor acceptable to
+        * FD_SET() and allied macros.  Windows defines it as a ceiling on the
+        * count of file descriptors in the set, not a ceiling on the value of
+        * each file descriptor; see
+        * 
https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-select
+        * and
+        * 
https://learn.microsoft.com/en-us/windows/win32/api/winsock/ns-winsock-fd_set.
+        * We can't ignore that, because Windows starts file descriptors at a
+        * higher value, delays reuse, and skips values.  With less than ten
+        * concurrent file descriptors, opened and closed rapidly, one can reach
+        * file descriptor 1024.
+        *
+        * Doing a hard exit here is a bit grotty, but it doesn't seem worth
+        * complicating the API to make it less grotty.
+        */
+#ifdef WIN32
+       if (slotno >= FD_SETSIZE)
+       {
+               pg_log_error("too many jobs for this platform: %d", slotno);
+               exit(1);
+       }
+#else
+       {
+               int                     fd = PQsocket(slot->connection);
+
+               if (fd >= FD_SETSIZE)
+               {
+                       pg_log_error("socket file descriptor out of range: %d", 
fd);
+                       pg_log_error_hint("Try fewer jobs.");
+                       exit(1);
+               }
+       }
+#endif
 
        /* Setup the connection using the supplied command, if any. */
        if (sa->initcmd)

Reply via email to