On Mon, Oct 09, 2023 at 04:22:52PM +1300, Thomas Munro wrote:
> On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <n...@leadboat.com> wrote:
> > The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in
> > the attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE
> > differently.
> >
> > The first associated failure was commit dea12a1 (2023-08-03); as a doc
> > commit, it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows
> > VMs instead of windows containers" (2023-02), long before the failures
> > began.  I'll guess some 2023-08 Windows update or reconfiguration altered
> > file descriptor assignment, hence the onset of failures.  In my tests of
> > v16, the highest file descriptor was 948.  I could make v16 fail by
> > changing --client=5 to --client=90 in the test.  With the attached patch
> > and --client=90, v16 peaked at file descriptor 2040.
>
> Ohhh.  Thanks for figuring that out.  I'd never noticed that quirk.  I
> didn't/can't test it but the patch looks reasonable after reading the
> referenced docs.
For what it's worth, I did all that testing through CI, using hacks like
https://cirrus-ci.com/task/5352974499708928 to get the relevant information.
I didn't bother trying non-CI, since buildfarm animals aren't failing.

> Maybe instead of just "out of range" I'd say "out of range for select()"
> since otherwise it might seem a little baffling -- what range?

I was trying to follow this from the style guide:

    Avoid mentioning called function names, either; instead say what the
    code was trying to do:

    BAD:    open() failed: %m
    BETTER: could not open file %s: %m

I didn't think of any phrasing that clearly explained things without the
reader consulting the code.  I considered these:

  "socket file descriptor out of range: %d" [what range?]
  "socket file descriptor out of range for select(): %d" [style guide violation]
  "socket file descriptor out of range for checking whether ready for reading: %d" [?]
  "socket file descriptor out of range for synchronous I/O multiplexing: %d" [term from POSIX]

> Random water cooler speculation about future ideas: I wonder
> whether/when we can eventually ditch select() and use WSAPoll() for
> this on Windows, which is supposed to be like poll().  There's a
> comment explaining that we prefer select() because it has a higher
> resolution sleep than poll() (us vs ms), so we wouldn't want to use
> poll() on Unixen, but we know that Windows doesn't even use high
> resolution timers for any user space APIs anyway, so that's just not a
> concern on that platform.  The usual reason nobody ever uses WSAPoll()
> is that the Curl guys told[1] everyone it's terrible in a way that
> would quite specifically break our usage.  But I wonder, because the
> documentation now says "As of Windows 10 version 2004, when a TCP
> socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
> indicated", it *sounds* like it might have been fixed ~3 years ago?
> One idea would be to hide it inside WaitEventSet, and let WaitEventSet
> be used in front end code (we couldn't use the WaitForMultipleObjects()
> version because it's hard-limited to 64 events, but in front end code
> we don't need latches and other stuff, so we could have a sockets-only
> version with WSAPoll()).  The idea of using WaitEventSet for pgbench on
> Unix was already mentioned a couple of times by others for general
> scalability reasons -- that way we could finish up using a better
> kernel interface on all supported platforms.  We'd probably want to add
> high resolution time-out support (I already posted a patch for that
> somewhere), or we'd be back to 1ms timeouts.
>
> [1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/

Interesting.  I have no current knowledge to add there.