On Sat, Aug 22, 2020 at 8:33 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapil...@gmail.com> writes: > > On Thu, Aug 20, 2020 at 10:28 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > >> 2. On the other hand, the code is *releasing* the > >> ReplicationSlotControlLock before it calls > >> ProcArraySetReplicationSlotXmin, and that seems like a flat out > >> concurrency bug. > > > It is not clear to me how those values can go backward. > > After releasing ReplicationSlotControlLock, that code is holding no > lock at all (in the already_locked=false case I'm concerned about). > Thus the scenario to consider is: > > 1. Walsender A runs ReplicationSlotsComputeRequiredXmin, computes > some perfectly-valid xmins, releases ReplicationSlotControlLock, > amd then gets swapped out to Timbuktu. > > 2. Time passes and the "true values" of those xmins advance thanks > to other walsender activity. > > 3. Walsender B runs ReplicationSlotsComputeRequiredXmin, computes > some perfectly-valid xmins, and successfully stores them in the > procarray. > > 4. Walsender A returns from never-never land, and stores its now > quite stale results in the procarray, causing the globally visible > xmins to go backwards from where they were after step 3. > > I see no mechanism in the code that prevents this scenario. > On reflection I'm not even very sure that the code change > I'm suggesting would prevent it. It'd prevent walsenders > from entering or exiting until we've updated the procarray, > but there's nothing to stop the furthest-back walsender from > advancing its values. >
I think we can prevent that if we allow ProcArraySetReplicationSlotXmin to update the shared values only when new xmin values follows the shared values. I am not very sure if it is safe but I am not able to think of a problem with it. The other way could be to always acquire ProcArrayLock before calling ReplicationSlotsComputeRequiredXmin or before acquiring ReplicationSlotControlLock but that seems too restrictive. -- With Regards, Amit Kapila.