On Wed, Dec 23, 2020 at 9:07 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Tue, Dec 22, 2020 at 4:58 PM Peter Smith <smithpb2...@gmail.com> wrote: > > > > On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit.kapil...@gmail.com> > > wrote: > > > > > > On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2...@gmail.com> wrote: > > > > > > > > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapil...@gmail.com> > > > > wrote: > > > > > > > > > Few other comments: > > > > > ================== > > > > > > > > Thanks for your feedback. > > > > > > > > > 1. > > > > > * FIXME 3 - Crashed tablesync workers may also have remaining slots > > > > > because I don't think > > > > > + * such workers are even iterated by this loop, and nobody else is > > > > > removing them. > > > > > + */ > > > > > + if (slotname) > > > > > + { > > > > > > > > > > The above FIXME is not clear to me. Actually, the crashed workers > > > > > should restart, finish their work, and drop the slots. So not sure > > > > > what exactly this FIXME refers to? > > > > > > > > Yes, normally if the tablesync can complete it should behave like that. > > > > But I think there are other scenarios where it may be unable to > > > > clean-up after itself. For example: > > > > > > > > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert > > > > handled by tablesync can give a PK violation which also will crash > > > > again and again for each re-launched/replacement tablesync worker. > > > > This can be reproduced in the debugger. If the DropSubscription > > > > doesn't clean-up the tablesync's slot then nobody will. > > > > > > > > ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to > > > > ensure that the launcher doesn't restart new worker during dropping > > > > the subscription". > > > > > > > > > > Yeah, I have also read that comment but do you know how it is > > > preventing relaunch? How does the subscription lock help? > > > > Hmmm. I did see there is a matching lock in get_subscription_list of > > launcher.c, which may be what that code comment was referring to. But > > I also am currently unsure how this lock prevents anybody (e.g. > > process_syncing_tables_for_apply) from executing another > > logicalrep_worker_launch. > > > > process_syncing_tables_for_apply will be called by the apply worker > and we are stopping the apply worker. So, after that launcher won't > start a new apply worker because of get_subscription_list() and if the > apply worker is not started then it won't be able to start tablesync > worker. So, we need the handling of crashed tablesync workers here > such that we need to drop any new sync slots.
Yes, in the v6 patch code this was a problem in need of handling. But since the v7 patch the DropSubscription code is now using a separate GetSubscriptionNotReadyRelations loop to handle the cleanup of potentially leftover slots from crashed tablesync workers (i.e. workers that never got to a READY state). Kind Regards, Peter Smith. Fujitsu Australia