On Wed, Mar 24, 2021 at 7:06 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Tue, Mar 23, 2021 at 10:54 PM Andres Freund <and...@anarazel.de> wrote:
> >
> > On 2021-03-23 23:37:14 +0900, Masahiko Sawada wrote:
> > >
> > > > > Maybe we can compare the slot name in the
> > > > > received message to the name in the element of replSlotStats. If they
> > > > > don’t match, we swap entries in replSlotStats to synchronize the index
> > > > > of the replication slot in ReplicationSlotCtl->replication_slots and
> > > > > replSlotStats. If we cannot find the entry in replSlotStats that has
> > > > > the name in the received message, it probably means either it's a new
> > > > > slot or the previous create message is dropped, we can create the new
> > > > > stats for the slot. Is that what you mean, Andres?
> >
> > That doesn't seem great. Slot names are imo a poor identifier for
> > something happening asynchronously. The stats collector regularly
> > doesn't process incoming messages for periods of time because it is busy
> > writing out the stats file. That's also when messages to it are most
> > likely to be dropped (likely because the incoming buffer is full).
> >
>
> Leaving aside restart case, without some sort of such sanity checking,
> if both drop (of old slot) and create (of new slot) messages are lost
> then we will start accumulating stats in old slots. However, if only
> one of them is lost then there won't be any such problem.
>
> > Perhaps we could have RestoreSlotFromDisk() send something to the stats
> > collector ensuring the mapping makes sense?
> >
>
> Say if we send just the index location of each slot then probably we
> can setup replSlotStats. Now say before the restart if one of the drop
> messages was missed (by stats collector) and that happens to be at
> some middle location, then we would end up restoring some already
> dropped slot, leaving some of the still required ones. However, if
> there is some sanity identifier like name along with the index, then I
> think that would have worked for such a case.

Couldn't such messages be lost as well? Since any message can be lost
over UDP, I don't think we can rely on a single message. Instead, I
think we need to loosely synchronize the indexes, on the assumption
that the indexes in replSlotStats and
ReplicationSlotCtl->replication_slots may be out of sync.

>
> I think it would have been easier if we would have some OID type of
> identifier for each slot. But, without that may be index location of
> ReplicationSlotCtl->replication_slots and slotname combination can
> reduce the chances of slot stats go wrong quite less even if not zero.
> If not name, do we have anything else in a slot that can be used for
> some sort of sanity checking?

I don't see any useful information in a slot for sanity checking.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
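
For illustration, the loose synchronization discussed above (compare the
slot name carried in a message against the entry at the given index, swap
entries on a mismatch, create new stats if the name is unknown) could look
roughly like the sketch below. This is not code from pgstat.c: SlotStats,
slot_stats, find_by_name() and sync_slot_entry() are hypothetical names,
MAX_SLOTS and NAME_LEN are stand-ins for the real limits, and overwriting
a mismatched entry when the name is not found is a simplifying assumption
(it treats the old occupant as a slot whose drop message was lost).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SLOTS 8   /* hypothetical stand-in for max_replication_slots */
#define NAME_LEN  64  /* hypothetical stand-in for NAMEDATALEN */

/* Hypothetical, simplified per-slot stats entry kept by the collector. */
typedef struct SlotStats
{
    bool in_use;
    char name[NAME_LEN];
    long spill_txns;
} SlotStats;

SlotStats slot_stats[MAX_SLOTS];

/* Return the index of the in-use entry with this name, or -1 if none. */
int
find_by_name(const char *name)
{
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (slot_stats[i].in_use && strcmp(slot_stats[i].name, name) == 0)
            return i;
    }
    return -1;
}

/*
 * Called for every incoming message carrying (idx, name). Repairs the
 * array if the entry at idx does not match, then returns that entry.
 */
SlotStats *
sync_slot_entry(int idx, const char *name)
{
    SlotStats *dst;
    int        found;

    assert(idx >= 0 && idx < MAX_SLOTS);
    dst = &slot_stats[idx];

    /* Common case: index and name already agree. */
    if (dst->in_use && strcmp(dst->name, name) == 0)
        return dst;

    found = find_by_name(name);
    if (found >= 0)
    {
        /* Stats exist but sit at the wrong index: swap the two entries. */
        SlotStats tmp = *dst;

        *dst = slot_stats[found];
        slot_stats[found] = tmp;
        return dst;
    }

    /*
     * Unknown name: a new slot, or its create message was lost.
     * Assumption: whatever occupied idx is stale, so start from zero.
     */
    memset(dst, 0, sizeof(*dst));
    dst->in_use = true;
    strncpy(dst->name, name, NAME_LEN - 1);
    return dst;
}
```

Because the check runs on every message rather than on a single
create/drop message, a lost message only delays the repair until the next
message for that slot arrives. It does not address the concern that a
dropped and re-created slot can reuse the same name.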