On Fri, Jun 12, 2026 at 6:54 PM Amit Kapila <[email protected]> wrote: > > On Fri, Jun 12, 2026 at 8:22 AM Xuneng Zhou <[email protected]> wrote: > > > > The issues look real to me and could be dealt with patch v1 partially. > > > > On Thu, Jun 11, 2026 at 9:19 PM Fujii Masao <[email protected]> wrote: > > > > > > On Thu, Jun 11, 2026 at 8:18 PM Amit Kapila <[email protected]> > > > wrote: > > > > 1. Stale name read in local_sync_slot_required(): The reused cell > > > > holds a different name. local_sync_slot_required() might return false > > > > (drop needed). But then the in_use && synced spinlock check sees > > > > synced = false and skips the actual drop. The wrong decision is > > > > caught. > > > > > > Yes, we could skip the actual drop. But then wouldn't we still emit > > > the log message "dropped replication slot ..." even though no slot was > > > actually dropped? > > > > With v1, we won't emit the log message unless the log is factually > > dropped. However it did not prevent the stale read in > > local_sync_slot_required(). > > > > > > 2. Wrong database OID read at line 551: The reused cell holds OID_B > > > > from the new slot. We lock OID_B, then at lines 563–565 we see synced > > > > = false, skip the drop, and unlock OID_B at line 579. Since no drop > > > > occurred, the cell is still the same non-synced slot, so the lock and > > > > unlock see the same OID_B. Symmetric — no lock leak. > > > > > > What happens if the slot for OID_B is dropped after we lock > > > OID_B, and then a new slot for OID_C reuses the same array entry? In > > > that case, wouldn't the later unlock read OID_C from > > > local_slot->data.database even though the lock was originally taken on > > > OID_B? > > > > V1 stops doing the venerable second read of local_slot->data.database. > > So if the copied value was already stale and points to OID_B, v1 is at > > least symmetric: > > > > read OID_B once > > lock OID_B > > cell reused as OID_C > > unlock OID_B > > > > But v1 seems not to fully solve issue 1. > > > > It can still do this: > > > > cell already reused before slot_database is copied > > v1 copies OID_B from replacement slot > > locks OID_B > > recheck sees synced=false > > skips drop > > unlocks OID_B > > > > That is still a stale read and possibly a wasted/wrong database lock, > > but it doesn't leak the lock, unlocks the wrong object, logs a false > > drop, or drops the replacement slot. > > > > In an off-list chat with Zhijie, we kinda thought that holding the > > lock of a wrong db for a brief time doesn't seem to harm a lot. The > > concurrent dropping-db operation leads to this issue seems rare in > > practice. He stated that the deletion of the slot seems unavoidable > > because we have to acquire the database lock after releasing the > > replication slot lock to avoid the deadlock with the startup/drop db > > operation. Therefore, he prefered keeping the design simple and > > avoiding the fatal issue over doing a broader refactoring work. > > > > +1. I also think this change is not worth it.
I am also OK with the scope of change made by patch 1. > > don't have a strong opinion on this. Still attaching the refactoring > > patch to do some clean-up in case someone thinks it's worthwhile. > > > > I feel even if there is an argument to do such a refactoring, it can > be done separately. Makes sense to me. -- Regards, Xuneng Zhou HighGo Software Co., Ltd.
