Michael Paquier <mich...@paquier.xyz> writes:
> On Mon, May 24, 2021 at 02:07:12PM -0400, Alvaro Herrera wrote:
>> Maybe we can change the "cancel" query to something like
>> SELECT pg_cancel_backend(pid), somehow_wait_for_detach_to_terminate() FROM d3_pid;
>> ... where maybe that function can check the "state" column in s3's
>> pg_stat_activity row?  I'll give that a try.
> Couldn't you achieve that with a small PL function in a way similar to
> what 32a9c0b did, except that you track a different state in
> pg_stat_activity for this PID?

The number of subsequent fixes to 32a9c0b seems to argue against using
that as a model :-(

The experiments I did awhile ago are coming back to me now.  I tried a
number of variations on this same theme, and none of them closed the gap
entirely.  The fundamental problem is that it's possible for backend A to
complete its transaction, for backend B (which is the isolationtester's
monitoring session) to observe that A has completed its transaction, for
B to report that fact to the isolationtester, and for that report to
arrive at the isolationtester *before A's query result does*.  You need
some bad luck for that to happen, like A losing the CPU right before it
flushes its output buffer to the client, but I was able to demonstrate it
fairly repeatably.  (IIRC, the reason I was looking into this was that
the clobber-cache-always buildfarm critters were showing such failures
somewhat regularly.)

It doesn't really matter whether B's observation technique involves locks
(as now), or the pgstat activity table, or something else.  Conceivably,
if we used the activity data, we could have A postpone updating its state
to "idle" until after it has flushed its output buffer to the client.
But that would likely break things for other use-cases.  Moreover, it
still guarantees nothing, because we are still at the mercy of the kernel
as to when it will choose to deliver network packets.

So a completely bulletproof interlock seems out of reach.  Maybe
something like what Alvaro is thinking of will get the failure rate down
to an acceptable level for most developers.  A simple "pg_sleep(1)" might
have about the same effect for much less work, though.  I do agree we
need to do something.
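To make the window concrete, here is a small Python sketch of the timing
hazard (this is purely illustrative, not PostgreSQL code; the thread
names and the delay constant are invented): the monitoring session's
report can reach the tester's single inbox before the finished backend's
own query result does.

```python
import queue
import threading
import time

# The isolationtester's single inbox: it sees events in arrival order.
inbox = queue.Queue()

def backend_a():
    # Backend A commits its transaction, then "loses the CPU" just
    # before flushing its query result to the client (simulated delay).
    time.sleep(0.05)
    inbox.put("A: query result")

def backend_b():
    # Backend B (the monitoring session) observes A's commit right away
    # -- via locks, pg_stat_activity, whatever -- and reports it.
    inbox.put("B: A has finished")

a = threading.Thread(target=backend_a)
b = threading.Thread(target=backend_b)
a.start(); b.start()
a.join(); b.join()

arrival_order = [inbox.get(), inbox.get()]
print(arrival_order)  # B's report precedes A's own result
```

No observation technique on B's side can close this window, because the
reordering happens after both events have already been generated.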
I've had two or three failures in those test cases in the past few days
in my own manual check-world runs, which is orders of magnitude worse
than the previous reliability.

			regards, tom lane