On 4/7/21 2:16 AM, Thomas Munro wrote:
> On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <thara...@gmail.com> wrote:
>> Bichir's been stuck for the past month and is unable to run regression tests
>> since 6a2a70a02018d6362f9841cc2f499cc45405e86b.
>
> Hrmph.  That's "Use signalfd(2) for epoll latches."  I had a similar
> report from an illumos user (but it was intermittent).  I have never
> seen such a failure on Linux.  My first guess is that these two
> systems that are doing Linux system call emulation have implemented
> subtly different semantics, and something is going wrong like this: a
> SIGUSR1 arrives to tell you some important news about a procsignal and
> the signal handler calls SetLatch(MyLatch) which does kill(MyProcPid,
> SIGURG), but somehow that fails to wake up the epoll_wait() you are
> sleeping in which contains the signalfd that should receive the signal
> and report it by being readable, due to some internal race.  Or
> something like that.  But I haven't been able to verify that theory
> because I don't have any of those computers.  If it is indeed
> something like that and not a bug in my code, then I was thinking that
> the main tool available to deal with it would be to set WAIT_USE_POLL
> in the relevant template file, so that we don't use the combination of
> epoll + signalfd on illumos, but then WSL1 throws a spanner in the
> works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
> from an Ubuntu package with a freaky kernel.  Hmm.
>
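For anyone who wants to poke at the kernel side of that theory directly, the wakeup path described above can be approximated in a standalone C function. This is only a minimal sketch, not PostgreSQL's latch code: the name sigurg_wakes_epoll and the no-op handler are hypothetical, and it assumes a single-threaded process on a Linux-compatible kernel. It blocks SIGURG, registers a signalfd for it in an epoll set, sends SIGURG to itself the way SetLatch() is described as doing, and reports whether epoll_wait() woke up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/* No-op handler, so SIGURG does not keep its default "ignore" disposition. */
static void
noop_handler(int signo)
{
    (void) signo;
}

/*
 * Hypothetical helper, not PostgreSQL code.  Returns 1 if a self-sent
 * SIGURG wakes epoll_wait() through a signalfd, 0 if epoll_wait() times
 * out, and -1 on any system call error.
 */
int
sigurg_wakes_epoll(int timeout_ms)
{
    sigset_t    mask, oldmask;
    struct sigaction sa, oldsa;
    struct epoll_event ev = {0}, out = {0};
    struct signalfd_siginfo info;
    int         sfd = -1, epfd = -1, result = -1, n;

    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGURG, &sa, &oldsa) < 0)
        return -1;

    /* Block SIGURG so it stays pending and is reported via the signalfd. */
    sigemptyset(&mask);
    sigaddset(&mask, SIGURG);
    if (sigprocmask(SIG_BLOCK, &mask, &oldmask) < 0)
    {
        sigaction(SIGURG, &oldsa, NULL);
        return -1;
    }

    sfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
    if (sfd < 0)
        goto done;
    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        goto done;

    ev.events = EPOLLIN;
    ev.data.fd = sfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev) < 0)
        goto done;

    /* The step the theory is about: signal ourselves, then sleep in epoll. */
    if (kill(getpid(), SIGURG) < 0)
        goto done;

    n = epoll_wait(epfd, &out, 1, timeout_ms);
    if (n == 0)
        result = 0;             /* timed out: the wakeup was lost */
    else if (n == 1)
    {
        assert(out.data.fd == sfd);
        if (read(sfd, &info, sizeof(info)) == (ssize_t) sizeof(info) &&
            info.ssi_signo == (unsigned int) SIGURG)
            result = 1;         /* signalfd became readable as expected */
    }

done:
    if (epfd >= 0)
        close(epfd);
    if (sfd >= 0)
        close(sfd);
    /* Restore the mask first so a still-pending SIGURG hits the no-op handler. */
    sigprocmask(SIG_SETMASK, &oldmask, NULL);
    sigaction(SIGURG, &oldsa, NULL);
    return result;
}
```

Calling sigurg_wakes_epoll(1000) in a loop from a small main() and watching for a 0 return would be one way to look for a lost wakeup. A single pass proves little if the failure is an intermittent race, and this single-process test does not reproduce the SIGUSR1 handler path, so a clean run would not rule the theory out.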
To test that theory, the OP could just add

    CPPFLAGS => '-DWAIT_USE_POLL',

to the config_env stanza of his animal's config.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com