Hi,

On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:
> We've been assuming that all the "timedout" failures on BF member
> fruitcrow were due to some wonkiness in the GNU/Hurd platform.
> I got suspicious about that though after noticing that there are
> a small number of such failures on other animals, eg [1][2][3].
> In each case, the failure message claims it waited a good long
> time, which is at variance with the actually observed runtime.
> For instance [1] says "timed out after 14400 secs", but the
> actual total test runtime is only 01:24:28 according to the
> summary at the top of the page.
>
> Looking into the buildfarm client, I realized that it's assuming that
> "sleep($wait_time)" is sufficient to wait for $wait_time seconds.
> However, the Perl docs point out that sleep() can be interrupted by a
> signal. So now I'm suspicious that many of these failures are caused
> by a stray signal waking up the wait_timeout thread prematurely.
> GNU/Hurd might just be more prone to that than other platforms.

That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures really happen because it gets stuck
endlessly in the test_shm_mq test (it is always that one) and only the
test timeout kicks it out. I've run that test manually quite a lot, and
either it finishes in 10-15 seconds or (presumably) never.

This is not really easy to see in the public buildfarm logs (at least I
can't find it at a quick glance), but I've routinely checked the log
timestamps of the runs, and they really take one hour (wait_timeout) in
the case of a hang.

> I propose the attached patch to the BF client to try to make this
> more robust.

Looks sensible, though I wonder whether something should be logged in
case we get woken up early so that we can gather some evidence for this?


Michael
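
PS: To make the logging idea a bit more concrete, here is a rough sketch
of what I have in mind. It is not taken from the proposed patch or from
the buildfarm client; the function name sleep_until_deadline, the
callback argument and the 0.01 second slack are all made up for
illustration, and using Time::HiRes is just an assumption to get
sub-second resolution. It keeps sleeping until the deadline has really
passed and reports every premature wake-up:

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper (not the buildfarm client's actual code): wait for
# $wait_time seconds even if sleep() is cut short by a signal, and call
# $on_early_wake with the remaining time for each premature wake-up.
sub sleep_until_deadline
{
    my ($wait_time, $on_early_wake) = @_;
    my $deadline    = time() + $wait_time;
    my $early_wakes = 0;

    while ((my $remaining = $deadline - time()) > 0)
    {
        sleep($remaining);    # may return early if a signal arrives

        my $left = $deadline - time();
        if ($left > 0.01)     # small slack for clock granularity (assumed)
        {
            $early_wakes++;
            $on_early_wake->($left) if $on_early_wake;
        }
    }
    return $early_wakes;
}

# Usage: wait briefly and log any early wake-up to stderr.
my $count = sleep_until_deadline(
    0.2,
    sub { printf STDERR "woken early with %.2f secs left\n", $_[0] });
printf STDERR "early wake-ups: %d\n", $count;
```

Something along these lines would give us a count (and timestamps, if
logged) of early wake-ups per run, which should be enough to tell the
stray-signal cases apart from real hangs.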
