On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > ...
> Thanks. I have fixed one of the issues reported by me earlier [1] > wherein the tablesync worker can repeatedly fail if after dropping the > slot there is an error while updating the SYNCDONE state in the > database. I have moved the drop of the slot just before commit of the > transaction where we are marking the state as SYNCDONE. Additionally, > I have removed unnecessary includes in tablesync.c, updated the docs > for Alter Subscription, and updated the comments at various places in > the patch. I have also updated the commit message this time. > Below are my feedback comments for V17 (nothing functional) ~~ 1. V27 Commit message: For the initial table data synchronization in logical replication, we use a single transaction to copy the entire table and then synchronizes the position in the stream with the main apply worker. Typo: "synchronizes" -> "synchronize" ~~ 2. @@ -48,6 +48,23 @@ ALTER SUBSCRIPTION <replaceable class="parameter">name</replaceable> RENAME TO < (Currently, all subscription owners must be superusers, so the owner checks will be bypassed in practice. But this might change in the future.) </para> + + <para> + When refreshing a publication we remove the relations that are no longer + part of the publication and we also remove the tablesync slots if there are + any. It is necessary to remove tablesync slots so that the resources + allocated for the subscription on the remote host are released. If due to + network breakdown or some other error, we are not able to remove the slots, + we give WARNING and the user needs to manually remove such slots later as + otherwise, they will continue to reserve WAL and might eventually cause + the disk to fill up. See also <xref linkend="logical-replication-subscription-slot"/>. + </para> I think the content is good, but the 1st-person wording seemed strange. e.g. "we are not able to remove the slots, we give WARNING and the user needs..." Maybe it should be like: "... PostgreSQL is unable to remove the slots, so a WARNING is reported. The user needs... " ~~ 3. @@ -566,107 +569,197 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data) ... + * XXX If there is a network break down while dropping the "network break down" -> "network breakdown" ~~ 4. @@ -872,7 +970,48 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos) (errmsg("could not connect to the publisher: %s", err))); ... + * XXX We could also instead try to drop the slot, last time we failed + * but for that, we might need to clean up the copy state as it might + * be in the middle of fetching the rows. Also, if there is a network + * break down then it wouldn't have succeeded so trying it next time + * seems like a better bet. "network break down" -> "network breakdown" ~~ 5. @@ -269,26 +313,47 @@ invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue) ... + + /* + * Cleanup the tablesync slot. + * + * This has to be done after updating the state because otherwise if + * there is an error while doing the database operations we won't be + * able to rollback dropped slot. + */ + ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid, + MyLogicalRepWorker->relid, + syncslotname); + + ReplicationSlotDropAtPubNode(wrconn, syncslotname, false /* missing_ok */); + Should this comment also describe why the missing_ok is false for this case? ---- Kind Regards, Peter Smith. Fujitsu Australia