I wrote:
> I've spent the day fooling around with a re-implementation of
> isolationtester that waits for all its controlled sessions to quiesce
> (either wait for client input, or block on a lock held by another
> session) before moving on to the next step. That was not a feasible
> approach before we had the wait_event infrastructure, but it's
> seeming like it might be workable now. Still have a few issues to
> sort out though ...

I wasted a good deal of time on this idea, and eventually concluded
that it's a dead end, because there is an unremovable race condition.
Namely, that even if the isolationtester's observer backend has
observed that test session X has quiesced according to its
wait_event_info, it is possible for the report of that fact to arrive
at the isolationtester client process before test session X's output
does.

It's quite obvious how that might happen if the isolationtester is on
a different machine than the PG server --- just imagine a dropped
packet in X's output that has to be retransmitted. You might think it
shouldn't happen within a single machine, but I'm seeing results that
I cannot explain any other way (on an 8-core RHEL8 box). It appears
to not be particularly rare, either.

> Andres Freund <and...@anarazel.de> writes:
>> ISTM the issue at hand isn't so much that X expects something to be
>> printed by Y before it terminates, but that it expects the next step to
>> not be executed before Y unlocks. If I understand the wrong output
>> correctly, what happens is that "controller_print_speculative_locks" is
>> executed, even though s1 hasn't yet acquired the next lock.

The problem as I'm now understanding it is that
insert-conflict-specconflict.spec issues multiple commands in sequence
to its "controller" session, and expects that NOTICE outputs from a
different test session will arrive at a determinate point in that
sequence. In practice that's not guaranteed, because (a) the other
test session might not send the NOTICE soon enough --- as my modified
specfile proves --- and (b) even if the NOTICE is timely sent, the
kernel will not guarantee timely receipt. We could fix (a) by
introducing some explicit interlock between the controller and test
sessions, but (b) is a killer.

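To make point (b) concrete, here is a toy model in Python. It is only a
sketch and does not touch isolationtester or PostgreSQL: the function
name arrival_order, the event labels, and the delay numbers are all
made up for illustration. It assumes each connection delivers its
message after its own independent delay, so the order in which the
client sees messages need not match the order in which the server sent
them.

```python
def arrival_order(events):
    """Return event labels in the order a client would receive them.

    events: list of (send_time, channel_delay, label) tuples, where each
    event travels on its own connection with its own delay. All values
    are hypothetical and exist only for this illustration.
    """
    arrivals = sorted(
        (send_time + delay, seq, label)
        for seq, (send_time, delay, label) in enumerate(events)
    )
    return [label for _, _, label in arrivals]


# Server-side order: X emits its NOTICE at t=0, and the observer
# reports "X has quiesced" at t=1.
normal = arrival_order([
    (0, 1, "NOTICE from X"),
    (1, 1, "observer: X quiesced"),
])

# Same server-side order, but X's packet is delayed (say, retransmitted).
delayed = arrival_order([
    (0, 5, "NOTICE from X"),
    (1, 1, "observer: X quiesced"),
])

print(normal)
print(delayed)
```

In the second case the client learns that X has quiesced before it has
seen X's NOTICE, although the server sent them in the opposite order.
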
I think what we have to do to salvage this test is to get rid of the
use of NOTICE outputs, and instead have the test functions insert log
records into some table, which we can inspect after the fact to verify
that things happened as we expect.

			regards, tom lane