Hi hackers,

While debugging a build farm assertion failure after commit 18042840, and with the assumption that the problem is timing/scheduling sensitive, I tried hammering the problem workload on a few different machines and noticed that my slow 2-core test machine fairly regularly got into a live lock state for tens to millions of milliseconds at a time when there were 3+ active processes, in here:

int
ConditionVariableBroadcast(ConditionVariable *cv)
{
    int         nwoken = 0;

    /*
     * Let's just do this the dumbest way possible.  We could try to dequeue
     * all the sleepers at once to save spinlock cycles, but it's a bit hard
     * to get that right in the face of possible sleep cancelations, and we
     * don't want to loop holding the mutex.
     */
    while (ConditionVariableSignal(cv))
        ++nwoken;

    return nwoken;
}

The problem is that another backend can be woken up, determine that it would like to wait for the condition variable again, and then get itself added to the back of the wait queue *before the above loop has finished*, so this interprocess ping-pong isn't guaranteed to terminate. It seems that we'll need something slightly smarter than the above to avoid that.

I don't currently suspect this phenomenon of being responsible for the problem I'm hunting, even though it occurs on the only machine I've been able to reproduce my real problem on. AFAICT the problem described in this email should deliver arbitrary numbers of spurious wake-ups wasting arbitrary CPU time but cause no harm that would affect program correctness. So I didn't try to write a patch to fix that just yet. I think we should probably back patch a fix when we have one though, because it could bite Parallel Index Scan in REL_10_STABLE.

--
Thomas Munro
http://www.enterprisedb.com
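
To illustrate the re-queueing behaviour described above, here is a toy single-process model of the wait queue. It is only a sketch under stated assumptions: WaitQueue, enqueue, signal_one, broadcast_naive and broadcast_snapshot are hypothetical names invented for this example, not PostgreSQL APIs, and the model assumes the worst case where every woken process immediately goes back to sleep on the same condition variable. The snapshot variant is just one conceivable way to bound the loop (wake only as many sleepers as were queued on entry), not a proposed patch.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAP 8

/* Toy FIFO of process ids; stands in for the condition variable's wait list. */
typedef struct
{
    int ids[QUEUE_CAP];
    int head;
    int len;
} WaitQueue;

/* Add a process to the back of the wait queue. */
void
enqueue(WaitQueue *q, int id)
{
    assert(q->len < QUEUE_CAP);
    q->ids[(q->head + q->len) % QUEUE_CAP] = id;
    q->len++;
}

/* Wake the process at the head of the queue; false if nobody is waiting. */
bool
signal_one(WaitQueue *q)
{
    int id;

    if (q->len == 0)
        return false;
    id = q->ids[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->len--;

    /* Assumed worst case: the woken process immediately waits again. */
    enqueue(q, id);
    return true;
}

/* Same shape as the loop in question, with a cap so the demo can stop. */
int
broadcast_naive(WaitQueue *q, int max_iterations)
{
    int nwoken = 0;

    while (nwoken < max_iterations && signal_one(q))
        ++nwoken;
    return nwoken;
}

/* Hypothetical alternative: wake only those who were waiting on entry. */
int
broadcast_snapshot(WaitQueue *q)
{
    int nwaiting = q->len;
    int nwoken = 0;

    while (nwoken < nwaiting && signal_one(q))
        ++nwoken;
    return nwoken;
}
```

With three processes queued, broadcast_naive only stops when it reaches whatever cap it is given, because the queue never drains, while broadcast_snapshot wakes each of the three once and returns. A real implementation would also have to cope with sleepers cancelling their sleep mid-broadcast, which this model ignores.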