On 2024-Mar-13, Jelte Fennema-Nio wrote: > I agree it's probably a timing issue. The cancel being received after > the query is done seems very unlikely, since the query takes 180 > seconds (assuming PG_TEST_TIMEOUT_DEFAULT is not lowered for these > animals). I think it's more likely that the cancel request arrives too > early, and thus being ignored because no query is running yet. The > test already had logic to wait until the query backend was in the > "active" state, before sending a cancel to solve that issue. But my > guess is that that somehow isn't enough. > > Sadly I'm having a hard time reliably reproducing this race condition > locally. So it's hard to be sure what is happening here. Attached is a > patch with a wild guess as to what the issue might be (i.e. seeing an > outdated "active" state and thus passing the check even though the > query is not running yet)
I tried leaving the original running in my laptop to see if I could reproduce it, but got no hits ... and we didn't get any other failures apart from the three ones already reported ... so it's not terribly high probability. Anyway I pushed your patch now since the theory seems plausible; let's see if we still get the issue to reproduce. If it does, we could make the script more verbose to hunt for further clues. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Here's a general engineering tip: if the non-fun part is too complex for you to figure out, that might indicate the fun part is too ambitious." (John Naylor) https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com