On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> Few other comments:
> ==================

Thanks for your feedback.

> 1.
> * FIXME 3 - Crashed tablesync workers may also have remaining slots
> because I don't think
> + * such workers are even iterated by this loop, and nobody else is
> removing them.
> + */
> + if (slotname)
> + {
>
> The above FIXME is not clear to me. Actually, the crashed workers
> should restart, finish their work, and drop the slots. So not sure
> what exactly this FIXME refers to?

Yes, normally if the tablesync can complete then it should behave like
that. But I think there are other scenarios where it may be unable to
clean up after itself. For example:

i) Maybe the crashed tablesync worker cannot finish at all. e.g. a row
insert handled by the tablesync can give a PK violation, which will also
crash each re-launched/replacement tablesync worker, again and again.
This can be reproduced in the debugger. If DropSubscription doesn't
clean up the tablesync's slot then nobody will.

ii) Also, the DROP SUBSCRIPTION code has locking (see the code comment)
"to ensure that the launcher doesn't restart new worker during dropping
the subscription". So executing DROP SUBSCRIPTION will prevent a newly
crashed tablesync from re-launching, which means it won't be able to
take care of its own slot. If DropSubscription doesn't clean up that
tablesync's slot then nobody will.

>
> 2.
> DropSubscription()
> {
> ..
> ReplicationSlotDropAtPubNode(
> + NULL,
> + conninfo, /* use conninfo to make a new connection. */
> + subname,
> + syncslotname);
> ..
> }
>
> With the above call, it will form a connection with the publisher and
> drop the required slots. I think we need to save the connection info
> so that we don't need to connect/disconnect for each slot to be
> dropped. Later in this function, we again connect and drop the apply
> worker slot. I think we should connect just once drop the apply and
> table sync slots if any.

OK. IIUC this is a suggestion for more efficient connection usage,
rather than an actual bug, right? I have added this suggestion to my
TODO list. (An untested sketch of that idea, for illustration only, is
at [1] below.)

>
> 3.
> ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> *conninfo, char *subname, char *slotname)
> {
> ..
> + PG_TRY();
> ..
> + PG_CATCH();
> + {
> + /* NOP. Just gobble any ERROR. */
> + }
> + PG_END_TRY();
>
> Why are we suppressing the error instead of handling it the error in
> the same way as we do while dropping the apply worker slot in
> DropSubscription?

This function is common - it is also called from the tablesync
finish_sync_worker. But in the finish_sync_worker case I wanted to
avoid throwing an ERROR, which would cause the tablesync to crash and
relaunch (and crash/relaunch/repeat...), when all it was trying to do
in the first place was just clean up and exit the process. Perhaps the
error suppression should be conditional, depending on where this
function is called from? (An untested sketch of that idea, for
illustration only, is at [2] below.)

>
> 4.
> @@ -139,6 +141,28 @@ finish_sync_worker(void)
>   get_rel_name(MyLogicalRepWorker->relid))));
>   CommitTransactionCommand();
>
> + /*
> + * Cleanup the tablesync slot.
> + */
> + {
> + extern void ReplicationSlotDropAtPubNode(
> + WalReceiverConn *wrconn_given, char *conninfo, char *subname, char
> *slotname);
>
> This is not how we export functions at other places?

Fixed in the latest v5 patch -
https://www.postgresql.org/message-id/CAHut%2BPvmDJ_EO11_up%3D_cRbOjhdWCMG-n7kF-mdRhjtCHcjHRA%40mail.gmail.com

----
Kind Regards,
Peter Smith.
Fujitsu Australia.
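
[1] For illustration only - an untested sketch (not from the v5 patch) of
the connect-just-once idea for DropSubscription. It assumes that passing a
non-NULL wrconn_given to ReplicationSlotDropAtPubNode means "reuse this
existing connection", and the sync_slot_names list variable is invented
here:

	char	   *err = NULL;
	WalReceiverConn *wrconn;
	ListCell   *lc;

	load_file("libpqwalreceiver", false);

	/* Connect to the publisher just once. */
	wrconn = walrcv_connect(conninfo, true, subname, &err);
	if (wrconn == NULL)
		ereport(ERROR,
				(errmsg("could not connect to the publisher: %s", err)));

	/* Drop all the tablesync slots over the same connection. */
	foreach(lc, sync_slot_names)
		ReplicationSlotDropAtPubNode(wrconn, NULL, subname,
									 (char *) lfirst(lc));

	/* Drop the apply worker slot over the same connection too. */
	ReplicationSlotDropAtPubNode(wrconn, NULL, subname, slotname);

	walrcv_disconnect(wrconn);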
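
[2] For illustration only - an untested sketch of one way the ERROR
suppression could be made conditional on the caller. The extra
"report_error" parameter is invented here; it is not part of the v5 patch:

void
ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char *conninfo,
							 char *subname, char *slotname, bool report_error)
{
	...
	PG_TRY();
	{
		/* Drop the slot on the publisher, as the patch does now. */
		...
	}
	PG_CATCH();
	{
		if (report_error)
			PG_RE_THROW();	/* e.g. the DropSubscription caller wants the ERROR */

		/* Otherwise just gobble the ERROR, for the finish_sync_worker caller. */
	}
	PG_END_TRY();
}

DropSubscription would pass report_error = true and finish_sync_worker
would pass report_error = false, so only the tablesync cleanup path keeps
the "gobble any ERROR" behaviour.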