Thomas Munro <thomas.mu...@enterprisedb.com> writes: > On Sun, Jan 7, 2018 at 10:00 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> I began with the intention of making no non-cosmetic changes, but then >> I started to wonder why ConditionVariablePrepareToSleep bothers with a >> !proclist_contains test, when the calling process surely ought not be >> in the list -- or any other list -- since it wasn't prepared to sleep. >> And that led me down a different rabbit hole ending in the conclusion >> that proclist.h could stand some improvements too.
> +1 >> The second attached patch is the cosmetic changes I want to make in >> condition_variable.c/.h. > +1 I pushed these, with an addition to ConditionVariablePrepareToSleep's comment emphasizing that you're supposed to call it before the loop test; this because some desultory digging found that one of the existing callers is violating that rule. replorigin_drop() in replication/logical/origin.c has cv = &state->origin_cv; LWLockRelease(ReplicationOriginLock); ConditionVariablePrepareToSleep(cv); ConditionVariableSleep(cv, WAIT_EVENT_REPLICATION_ORIGIN_DROP); ConditionVariableCancelSleep(); goto restart; and unless I'm much mistaken, that is flat broken. Somebody could change the ReplicationState's acquired_by before we reach ConditionVariablePrepareToSleep, in which case we won't get any signal for that and will just sleep here indefinitely. It would still be broken if we removed the PrepareToSleep call, but at least it'd be a busy-wait not a hang. Not sure about a convenient fix. One idea is to move the PrepareToSleep call inside the hold on ReplicationOriginLock, which would fix things only if the various signallers of origin_cv did the signalling while holding that lock, which they do not. It's not clear to me whether making them do so inside that lock would be problematic for performance. We can't just move the prepare/cancel sleep calls to around this whole loop, because we do not know outside the loop which CV is to be waited on. So another idea is to get rid of that and have just one CV for all ReplicationStates. Or (and I'm sure Robert sees this coming), if we applied my proposed patch to let ConditionVariableSleep auto-switch to a different CV, then we could fix this code by simply removing the PrepareToSleep and moving the ConditionVariableCancelSleep call to below the loop. This would work even in the perhaps-unlikely case where the slot as identified by roident changed positions, though you'd get an additional trip through the loop (a/k/a test of the exit condition) if that happened. >> Another loose end that I'm seeing here is that while a process waiting on >> a condition variable will respond to a cancel or die interrupt, it will >> not notice postmaster death. > Yeah. I'll respond to that separately to keep it from getting confused with this replorigin_drop() bug. regards, tom lane