Hi, On 2023-01-30 06:26:02 +1300, Thomas Munro wrote: > On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra > <tomas.von...@enterprisedb.com> wrote: > > So I did that - same configure options as the buildfarm client, and a > > 'make check' (with only tests up to the 'join' suite, because that's > > where it got stuck before). And it took only ~15 runs (~1h) to hit this > > again on dikkop. > > That's good news.
Indeed. As annoying as it is, it might be worth reproing it once or twice more, just to have a feeling for how long we need to run to have confidence in a fix. > I was talking to Andres on IM about this yesterday and he pointed out > a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to > tell the signal handler to write to the self-pipe) and then reads > latch->is_set with neither compiler nor memory barrier, which doesn't > seem right because we might see a value of latch->is_set from before > "waiting" was true, and yet the signal handler might also have run > while "waiting" was false so the self-pipe doesn't save us, despite > the length of the comment about that. Can you reproduce it with this > change? > > --- a/src/backend/storage/ipc/latch.c > +++ b/src/backend/storage/ipc/latch.c > @@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout, > * ordering, so that we cannot miss seeing is_set if a > notificat > ion > * has already been queued. > */ > + pg_memory_barrier(); > if (set->latch && set->latch->is_set) > { > occurred_events->fd = PGINVALID_SOCKET; I think we need a barrier in SetLatch(), after is_set = true. We have that in some of the newer branches (due to the maybe_sleeping logic), but not in the older branches. Greetings, Andres Freund