On Thu, Aug 31, 2023 at 12:10 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > + > > + /* > > + * We won't ensure that the slot is persisted after the confirmed_flush > > + * LSN is updated as that could lead to frequent writes. However, we > > need > > + * to ensure that we do persist the slots at the time of shutdown whose > > + * confirmed_flush LSN is changed since we last saved the slot to disk. > > + * This will help in avoiding retreat of the confirmed_flush LSN after > > + * restart. This variable is used to track the last saved > > confirmed_flush > > + * LSN value. > > + */ > > > > This comment makes more sense in SaveSlotToPath() than here. We may decide > > to > > use last_saved_confirmed_flush for something else in future. > > > > I have kept it here because it contains some information that is not > specific to SaveSlotToPath. So, it seems easier to follow the whole > theory if we keep it at the central place in the structure and then > add the reference wherever required but I am fine if you and others > feel strongly about moving this to SaveSlotToPath().
Saving slot to disk happens only in SaveSlotToPath, so except the last sentence rest of the comment makes sense in SaveSlotToPath(). > > > > This function assumes that the subscriber will receive and confirm WAL upto > > checkpoint location and publisher's WAL sender will update it in the slot. > > Where is the code to ensure that? Does the WAL sender process wait for > > checkpoint > > LSN to be confirmed when shutting down? > > > > Note, that we need to compare if all the WAL before the > shutdown_checkpoint WAL record is sent. Before the clean shutdown, we > do ensure that all the pending WAL is confirmed back. See the use of > WalSndDone() in WalSndLoop(). Ok. Thanks for pointing that out to me. > > > I > > think we should shut down subscriber, restart publisher and then make this > > check based on the contents of the replication slot instead of server log. > > Shutting down subscriber will ensure that the subscriber won't send any new > > confirmed flush location to the publisher after restart. > > > > But if we shutdown the subscriber before the publisher there is no > guarantee that the publisher has sent all outstanding logs up to the > shutdown checkpoint record (i.e., the latest record). Such a guarantee > can only be there if we do a clean shutdown of the publisher before > the subscriber. So the sequence is shutdown publisher node, shutdown subscriber node, start publisher node and carry out the checks. > > > All the places which call ReplicationSlotSave() mark the slot as dirty. All > > the places where SaveSlotToPath() is called, the slot is marked dirty except > > when calling from CheckPointReplicationSlots(). So I am wondering whether we > > should be marking the slot dirty in CheckPointReplicationSlots() and avoid > > passing down is_shutdown flag to SaveSlotToPath(). > > > > I feel that will add another spinlock acquire/release pair without > much benefit. Sure, it may not be performance-sensitive but still > adding another pair of lock/release doesn't seem like a better idea. We call ReplicatioinSlotMarkDirty() followed by ReplicationSlotSave() at all the places, even those which are more frequent than this. So I think it's better to stick to that protocol rather than adding a new flag. -- Best Wishes, Ashutosh Bapat