> On 07 Apr 2018, at 00:23, Andres Freund <and...@anarazel.de> wrote:
> On 2018-04-06 02:28:17 +0200, Daniel Gustafsson wrote: >> Applying this makes the _cancel test pass, moving the failure instead to the >> following _enable test (which matches what coypu and mylodon are seeing). > > FWIW, I'm somewhat annoyed that I'm now spending time debugging this to > get the buildfarm green again. Sorry about that, I’m a bit slow due to various $family situations at the moment. > I'm fairly certain that the bug here is a simple race condition in the > test (not the main code!): I wonder if it may perhaps be a case of both? > It's > exceedingly unsurprising that a 'pg_sleep(1)' is not a reliable way to > make sure that a process has finished exiting. Then followup tests fail > because the process is still running I can reproduce the error when building with --disable-atomics, and it seems that all the failing members either do that, lack atomic.h, lack atomics or a combination. checksumhelper cancellation use pg_atomic_unlocked_test_flag() to test if it’s running when asked to abort, something which seems unsafe to do in semaphore simulation as it always returns true. If I for debugging synthesize a flag test with testset/clear, the tests pass green (with the upstream patch for pg_atomic_test_set_flag_impl() applied). Cancelling with semaphore sim is thus doomed to never work IIUC. Or it’s a red herring. As Magnus mentioned upstream, rewriting to not use an atomic flag is probably the best option, once the current failure is understood. > really? Let's just force the test take at least 6s purely from > sleeping? The test needs continuous reading in a session to try and trigger any bugs in read access on the cluster during checksumming, is there a good way to do that in the isolationtester? I have failed to find a good way to repeat a step like that, but I might be missing something. cheers ./daniel