On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> Few other comments:
> ==================

Thanks for your feedback.

> 1.
> * FIXME 3 - Crashed tablesync workers may also have remaining slots
> because I don't think
> + * such workers are even iterated by this loop, and nobody else is
> removing them.
> + */
> + if (slotname)
> + {
>
> The above FIXME is not clear to me. Actually, the crashed workers
> should restart, finish their work, and drop the slots. So not sure
> what exactly this FIXME refers to?

Yes, normally if the tablesync can complete then it should behave like
that. But I think there are other scenarios where it may be unable to
clean up after itself. For example:

i) Maybe the crashed tablesync worker cannot finish at all. e.g. a row
insert handled by the tablesync can give a PK violation, which will also
crash each re-launched/replacement tablesync worker, again and again.
This can be reproduced in the debugger. If DropSubscription doesn't
clean up the tablesync's slot then nobody will.

ii) Also, the DROP SUBSCRIPTION code has locking (see the code comment)
"to ensure that the launcher doesn't restart new worker during dropping
the subscription". So executing DROP SUBSCRIPTION will prevent a newly
crashed tablesync from re-launching, which means it won't be able to
take care of its own slot. If DropSubscription doesn't clean up that
tablesync's slot then nobody will.

>
> 2.
> DropSubscription()
> {
> ..
> ReplicationSlotDropAtPubNode(
> + NULL,
> + conninfo, /* use conninfo to make a new connection. */
> + subname,
> + syncslotname);
> ..
> }
>
> With the above call, it will form a connection with the publisher and
> drop the required slots. I think we need to save the connection info
> so that we don't need to connect/disconnect for each slot to be
> dropped. Later in this function, we again connect and drop the apply
> worker slot. I think we should connect just once drop the apply and
> table sync slots if any.

OK. IIUC this is a suggestion for more efficient connection usage,
rather than an actual bug, right? I have added this suggestion to my
TODO list. (An untested sketch of that idea, for illustration only, is
at [1] below.)

>
> 3.
> ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> *conninfo, char *subname, char *slotname)
> {
> ..
> + PG_TRY();
> ..
> + PG_CATCH();
> + {
> + /* NOP. Just gobble any ERROR. */
> + }
> + PG_END_TRY();
>
> Why are we suppressing the error instead of handling it the error in
> the same way as we do while dropping the apply worker slot in
> DropSubscription?

This function is common - it is also called from the tablesync
finish_sync_worker. But in the finish_sync_worker case I wanted to
avoid throwing an ERROR, which would cause the tablesync to crash and
relaunch (and crash/relaunch/repeat...), when all it was trying to do
in the first place was just clean up and exit the process. Perhaps the
error suppression should be conditional, depending on where this
function is called from? (An untested sketch of that idea, for
illustration only, is at [2] below.)

>
> 4.
> @@ -139,6 +141,28 @@ finish_sync_worker(void)
>   get_rel_name(MyLogicalRepWorker->relid))));
>   CommitTransactionCommand();
>
> + /*
> + * Cleanup the tablesync slot.
> + */
> + {
> + extern void ReplicationSlotDropAtPubNode(
> + WalReceiverConn *wrconn_given, char *conninfo, char *subname, char
> *slotname);
>
> This is not how we export functions at other places?

Fixed in the latest v5 patch -
https://www.postgresql.org/message-id/CAHut%2BPvmDJ_EO11_up%3D_cRbOjhdWCMG-n7kF-mdRhjtCHcjHRA%40mail.gmail.com

----
Kind Regards,
Peter Smith.
Fujitsu Australia.
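
[1] For illustration only - an untested sketch (not from the v5 patch) of
the connect-just-once idea for DropSubscription. It assumes that passing a
non-NULL wrconn_given to ReplicationSlotDropAtPubNode means "reuse this
existing connection", and the sync_slot_names list variable is invented
here:

	char	   *err = NULL;
	WalReceiverConn *wrconn;
	ListCell   *lc;

	load_file("libpqwalreceiver", false);

	/* Connect to the publisher just once. */
	wrconn = walrcv_connect(conninfo, true, subname, &err);
	if (wrconn == NULL)
		ereport(ERROR,
				(errmsg("could not connect to the publisher: %s", err)));

	/* Drop all the tablesync slots over the same connection. */
	foreach(lc, sync_slot_names)
		ReplicationSlotDropAtPubNode(wrconn, NULL, subname,
									 (char *) lfirst(lc));

	/* Drop the apply worker slot over the same connection too. */
	ReplicationSlotDropAtPubNode(wrconn, NULL, subname, slotname);

	walrcv_disconnect(wrconn);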
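
[2] For illustration only - an untested sketch of one way the ERROR
suppression could be made conditional on the caller. The extra
"report_error" parameter is invented here; it is not part of the v5 patch:

void
ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char *conninfo,
							 char *subname, char *slotname, bool report_error)
{
	...
	PG_TRY();
	{
		/* Drop the slot on the publisher, as the patch does now. */
		...
	}
	PG_CATCH();
	{
		if (report_error)
			PG_RE_THROW();	/* e.g. the DropSubscription caller wants the ERROR */

		/* Otherwise just gobble the ERROR, for the finish_sync_worker caller. */
	}
	PG_END_TRY();
}

DropSubscription would pass report_error = true and finish_sync_worker
would pass report_error = false, so only the tablesync cleanup path keeps
the "gobble any ERROR" behaviour.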