While trying to make sense of Adam Sjøgren's problem [1], I found myself staring at ReplicationSlotsComputeRequiredXmin() in slot.c. It seems to me that that is very shaky code, on two different grounds:
1. Sometimes it's called with ProcArrayLock already held exclusively. This means that any delay in acquiring the ReplicationSlotControlLock translates directly into a hold on ProcArrayLock; in other words, every acquisition of the ReplicationSlotControlLock is just as bad for concurrency as an acquisition of ProcArrayLock. While I didn't see any places that were doing really obviously slow things while holding ReplicationSlotControlLock, this is disturbing. Do we really need it to be like that? 2. On the other hand, the code is *releasing* the ReplicationSlotControlLock before it calls ProcArraySetReplicationSlotXmin, and that seems like a flat out concurrency bug. How can we be sure that the values we're storing into the shared xmin fields aren't stale by the time we acquire the ProcArrayLock (in the case where we didn't hold it already)? I'm concerned that in the worst case this code could make the shared xmin fields go backwards. Both of these issues could be solved, I think, if we got rid of the provision for calling with ProcArrayLock already held and moved the ProcArraySetReplicationSlotXmin call inside the hold of ReplicationSlotControlLock. But maybe I'm missing something about why that would be worse. regards, tom lane [1] https://www.postgresql.org/message-id/flat/87364kdsim.fsf%40tullinup.koldfront.dk