Re: Single transaction in the tablesync worker?

Peter Smith Tue, 29 Dec 2020 22:15:49 -0800

On Wed, Dec 23, 2020 at 9:07 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Tue, Dec 22, 2020 at 4:58 PM Peter Smith <smithpb2...@gmail.com> wrote:
> >
> > On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit.kapil...@gmail.com> 
> > wrote:
> > >
> > > On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2...@gmail.com> wrote:
> > > >
> > > > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapil...@gmail.com> 
> > > > wrote:
> > > >
> > > > > Few other comments:
> > > > > ==================
> > > >
> > > > Thanks for your feedback.
> > > >
> > > > > 1.
> > > > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > > > because I don't think
> > > > > + * such workers are even iterated by this loop, and nobody else is
> > > > > removing them.
> > > > > + */
> > > > > + if (slotname)
> > > > > + {
> > > > >
> > > > > The above FIXME is not clear to me. Actually, the crashed workers
> > > > > should restart, finish their work, and drop the slots. So not sure
> > > > > what exactly this FIXME refers to?
> > > >
> > > > Yes, normally if the tablesync can complete it should behave like that.
> > > > But I think there are other scenarios where it may be unable to
> > > > clean-up after itself. For example:
> > > >
> > > > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > > > handled by tablesync can give a PK violation which also will crash
> > > > again and again for each re-launched/replacement tablesync worker.
> > > > This can be reproduced in the debugger. If the DropSubscription
> > > > doesn't clean-up the tablesync's slot then nobody will.
> > > >
> > > > ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
> > > > ensure that the launcher doesn't restart new worker during dropping
> > > > the subscription".
> > > >
> > >
> > > Yeah, I have also read that comment but do you know how it is
> > > preventing relaunch? How does the subscription lock help?
> >
> > Hmmm. I did see there is a matching lock in get_subscription_list of
> > launcher.c, which may be what that code comment was referring to. But
> > I also am currently unsure how this lock prevents anybody (e.g.
> > process_syncing_tables_for_apply) from executing another
> > logicalrep_worker_launch.
> >
>
> process_syncing_tables_for_apply will be called by the apply worker
> and we are stopping the apply worker. So, after that launcher won't
> start a new apply worker because of get_subscription_list() and if the
> apply worker is not started then it won't be able to start tablesync
> worker. So, we need the handling of crashed tablesync workers here
> such that we need to drop any new sync slots.


Yes, in the v6 patch code this was a problem in need of handling. But
since the v7 patch the DropSubscription code is now using a separate
GetSubscriptionNotReadyRelations loop to handle the cleanup of
potentially leftover slots from crashed tablesync workers (i.e.
workers that never got to a READY state).

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Single transaction in the tablesync worker?

Reply via email to