On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <thara...@gmail.com> wrote:
> Bichir's been stuck for the past month and is unable to run regression tests
> since 6a2a70a02018d6362f9841cc2f499cc45405e86b.

Hrmph.  That's "Use signalfd(2) for epoll latches."  I had a similar
report from an illumos user (but it was intermittent).  I have never
seen such a failure on Linux.  My first guess is that these two
systems, both of which do Linux system call emulation, have
implemented subtly different semantics, and something is going wrong
like this: a SIGUSR1 arrives to tell you some important news about a
procsignal, and the signal handler calls SetLatch(MyLatch), which does
kill(MyProcPid, SIGURG), but due to some internal race that fails to
wake up the epoll_wait() you are sleeping in, even though its set
contains the signalfd that should receive the signal and report it by
becoming readable.  Or something like that.  But I haven't been able
to verify that theory because I don't have any of those computers.

If it is indeed something like that and not a bug in my code, then I
was thinking that the main tool available to deal with it would be to
set WAIT_USE_POLL in the relevant template file, so that we don't use
the combination of epoll + signalfd on illumos, but then WSL1 throws a
spanner in the works because AFAIK it's masquerading as Ubuntu,
running PostgreSQL from an Ubuntu package with a freaky kernel.  Hmm.

> It is interesting that that commit's a month old and probably no other client
> has complained since, but diving in, I can see that it's been unable to even
> start regression tests after that commit went in.

Oh, well at least it's easily reproducible then, that's something!

> Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for
> Linux inside Windows 10 - and so isn't really production use-case. The only
> run that actually got submitted to Buildfarm was from a few days back when I
> killed it after a long wait - see [1].
>
> Since yesterday, I have another run that's again stuck on CREATE DATABASE
> (see outputs below) and although pstack not working may be a limitation of
> the architecture / installation (unsure), a trace shows it is stuck at poll.

That's actually the client. I guess there is also a backend process stuck somewhere in epoll_wait()?
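
One way to check the theory above on an affected machine would be a small standalone test along these lines. This is only a sketch under stated assumptions, not code from PostgreSQL: the function name self_signal_wakes_epoll, the REPRO_MAIN macro, and the use of SIGALRM as a stand-in for SIGUSR1 are all made up for illustration. It blocks SIGURG, registers a signalfd for it in an epoll set, and then has a signal handler call kill(getpid(), SIGURG) while the process is sleeping in epoll_wait(), which is roughly the sequence described above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/time.h>
#include <unistd.h>

/* Stand-in for the SIGUSR1 handler: it only signals our own process. */
static void
on_alarm(int signo)
{
    int save_errno = errno;

    (void) signo;
    kill(getpid(), SIGURG);     /* kill() is async-signal-safe */
    errno = save_errno;
}

/*
 * Hypothetical helper.  Returns 1 if the self-directed SIGURG made the
 * signalfd readable and woke epoll_wait(), 0 if epoll_wait() timed out,
 * and -1 on a setup error.
 */
int
self_signal_wakes_epoll(int timeout_ms)
{
    sigset_t    mask;
    struct sigaction sa;
    struct itimerval timer;
    struct epoll_event ev;
    struct epoll_event out;
    struct signalfd_siginfo info;
    int         sfd, epfd, rc, result = -1;

    /* Block SIGURG so that it is only reported through the signalfd. */
    sigemptyset(&mask);
    sigaddset(&mask, SIGURG);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;
    if ((sfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC)) < 0)
        return -1;
    if ((epfd = epoll_create1(EPOLL_CLOEXEC)) < 0)
    {
        close(sfd);
        return -1;
    }

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = sfd;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);

    /* One-shot timer: SIGALRM fires 50 ms from now, while we sleep. */
    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_usec = 50000;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev) == 0 &&
        sigaction(SIGALRM, &sa, NULL) == 0 &&
        setitimer(ITIMER_REAL, &timer, NULL) == 0)
    {
        /* The handler interrupts the first wait with EINTR, so retry. */
        do
            rc = epoll_wait(epfd, &out, 1, timeout_ms);
        while (rc < 0 && errno == EINTR);

        if (rc == 0)
            result = 0;
        else if (rc == 1 &&
                 read(sfd, &info, sizeof(info)) == (ssize_t) sizeof(info) &&
                 info.ssi_signo == (unsigned int) SIGURG)
            result = 1;
    }

    close(epfd);
    close(sfd);
    return result;
}

#ifdef REPRO_MAIN
int
main(void)
{
    printf("result: %d\n", self_signal_wakes_epoll(2000));
    return 0;
}
#endif
```

Built with something like cc -DREPRO_MAIN, a result of 0 on WSL1 or illumos would point towards the wakeup being lost in the kernel emulation, while a result of 1 would only show that this simple single-process case works and would not rule out a narrower race.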