On 2024-Jul-16, Alexander Lakhin wrote:

> I've managed to reproduce this issue in my Cygwin environment by running
> the postgres_fdw test in a loop (10 iterations are enough to get the
> described effect). And what I'm seeing is that a query-cancelling backend
> is stuck inside pgfdw_xact_callback() -> pgfdw_abort_cleanup() ->
> pgfdw_cancel_query() -> pgfdw_cancel_query_begin() -> libpqsrv_cancel() ->
> WaitLatchOrSocket() -> WaitEventSetWait() -> WaitEventSetWaitBlock() ->
> poll().
>
> The timeout value (approximately 30 seconds), which is passed to poll(),
> is effectively ignored by this call - the waiting lasts for unlimited time.

Ugh.  I tried to follow what's going on in that cygwin code, but I gave
up pretty quickly.  It depends on a mutex, but I didn't see the mutex
being defined or initialized anywhere.

> So it looks like a Cygwin bug, but maybe something should be done on our side
> too, at least to prevent such lorikeet failures.

I don't know what else we can do other than remove the test.

Maybe we can disable this test specifically on Cygwin.  We could do that
by creating a postgres_fdw_cancel.sql file, with the current output for
all platforms, and a "SELECT version() ~ 'cygwin' AS skip_test" query,
as we do for encoding tests and such.

-- 
Álvaro Herrera        Breisgau, Deutschland - https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)