On Monday, December 11, 2023 5:31 PM shveta malik <shveta.ma...@gmail.com> wrote: > > On Thu, Dec 7, 2023 at 1:33 PM Peter Smith <smithpb2...@gmail.com> > wrote: > > > > Hi. > > > > Here are my review comments for patch v43-0002. > > > > > ====== > > src/backend/access/transam/xlogrecovery.c > > > > 13. > > + /* > > + * Shutdown the slot sync workers to prevent potential conflicts between > > + * user processes and slotsync workers after a promotion. Additionally, > > + * drop any slots that have initiated but not yet completed the sync > > + * process. > > + */ > > + ShutDownSlotSync(); > > + slotsync_drop_initiated_slots(); > > + > > > > Is this where maybe the 'sync_state' should also be updated for > > everything so you are not left with confusion about different states > > on a node that is no longer a standby node? > > > > yes, this is the place. But this needs more thought as it may cause > too much disk activity during promotion. so let me analyze and come > back.
Per off-list discussion with Amit. I think it's fine to keep both READY and NONE on a primary. Because even if we update the sync_state from READY to NONE on promotion, it doesn't reduce the complexity for the handling of READY and NONE state. And it's not straightforward to choose the right place to update sync_state, here is the analysis: (related steps on promotion) 1 (patch) shutdown slotsync worker 2 (patch) drop 'i' state slots. 3 remove standby.signal and recovery.signal 4 switch to a new timeline and write the timeline history file 5 set SharedRecoveryState = RECOVERY_STATE_DONE which means RecoveryInProgress() will return false. We could not update the sync_state before step 3 because if the update fails after updating some of slots' state, then the server will be shutdown leaving some 'NONE' state slots. After restarting, the server is still a standby so the slot sync worker will fail to sync these 'NONE' state slots. We also could not update it after step 3 and before step 4. Because if any ERROR when updating, then after restarting the server, although the server will become a master(as standby.signal is removed), but it can still be made as a active standby by creating a standby.signal file because the timeline has not been switched. And in this case, the slot sync worker will also fail to sync these 'NONE' state slots. Updating the sync_state after step 4 and before step 5 is OK, but still It doesn't simplify the handling for both READY and NONE state slots. Therefore, I think we can retain the READY state slots after promotion as they can provide information about the slot's origin. I added some comments around slotsync cleanup codes (in FinishWalRecovery) to mentioned the reason. Best Regards, Hou zj