On Tue, Feb 2, 2021 at 8:29 AM Peter Smith <smithpb2...@gmail.com> wrote: > > On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > I have updated the patch to display WARNING for each of the tablesync > > slots during DropSubscription. As discussed, I have moved the drop > > slot related code towards the end in AlterSubscription_refresh. Apart > > from this, I have fixed one more issue in tablesync code where in > > after catching the exception we were not clearing the transaction > > state on the publisher, see changes in LogicalRepSyncTableStart. I > > have also fixed other comments raised by you. > > Here are some additional feedback comments about the v24 patch: > > ~~ > > ReportSlotConnectionError: > > 1,2,3,4. > + foreach(lc, rstates) > + { > + SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc); > + Oid relid = rstate->relid; > + > + /* Only cleanup resources of tablesync workers */ > + if (!OidIsValid(relid)) > + continue; > + > + /* > + * Caller needs to ensure that we have appropriate locks so that > + * relstate doesn't change underneath us. > + */ > + if (rstate->state != SUBREL_STATE_SYNCDONE) > + { > + char syncslotname[NAMEDATALEN] = { 0 }; > + > + ReplicationSlotNameForTablesync(subid, relid, syncslotname); > + elog(WARNING, "could not drop tablesync replication slot \"%s\"", > + syncslotname); > + > + } > + } > > 1. I wonder if "rstates" would be better named something like > "not_ready_rstates", otherwise it is not apparent what states are in > this list >
I don't know if that would be better and it is used in the same way in the existing code. I find the current naming succinct. > 2. The comment "/* Only cleanup resources of tablesync workers */" is > not quite correct because there is no cleanup happening here. Maybe > change to: > if (!OidIsValid(relid)) > continue; /* not a tablesync worker */ > Aren't we trying to cleanup the tablesync slots here? So, I don't see the comment as irrelevant. > 3. Maybe the "appropriate locks" comment can say what locks are the > "appropriate" ones? > > 4. Spurious blank line after the elog? > Will fix both the above. > ~~ > > AlterSubscription_refresh: > > 5. > + /* > + * Drop the tablesync slot. This has to be at the end because > otherwise if there > + * is an error while doing the database operations we won't be able to > rollback > + * dropped slot. > + */ > > Maybe "Drop the tablesync slot." should say "Drop the tablesync slots > associated with removed tables." > makes sense, will fix. > ~~ > > DropSubscription: > > 6. > + /* > + * Cleanup of tablesync replication origins. > + * > + * Any READY-state relations would already have dealt with clean-ups. > + * > + * Note that the state can't change because we have already stopped both > + * the apply and tablesync workers and they can't restart because of > + * exclusive lock on the subscription. > + */ > + rstates = GetSubscriptionNotReadyRelations(subid); > + foreach(lc, rstates) > > I wonder if "rstates" would be better named as "not_ready_rstates", > because it is used in several places where not READY is assumed. > Same response as above for similar comment. > 7. > + { > + if (!slotname) > + { > + /* be tidy */ > + list_free(rstates); > + return; > + } > + else > + { > + ReportSlotConnectionError(rstates, subid, slotname, err); > + } > + > + } > > Spurious blank line above? > Will fix. > 8. > The new logic of calling the ReportSlotConnectionError seems to be > expecting that the user has encountered some connection error, and > *after* that they have assigned slot_name = NONE as a workaround. In > this scenario the code looks ok since names of any dangling tablesync > slots were being logged at the time of the error. > > But I am wondering what about where the user might have set slot_name > = NONE *before* the connection is broken. In this scenario, there is > no ERROR, so if there are still (is it possible?) dangling tablesync > slots, their names are never getting logged at all. So how can the > user know what to delete? > It has been mentioned in docs that the user is responsible for cleaning that up manually in such a case. The patch has also described how the names are generated so that can help user to remove those. + These table synchronization slots have generated names: + <quote><literal>pg_%u_sync_%u</literal></quote> (parameters: Subscription + <parameter>oid</parameter>, Table <parameter>relid</parameter>) I think if the user changes slot_name associated with the subscription, it would be his responsibility to clean up the previously associated slot. This is currently the case with the main subscription slot as well. I think it won't be advisable for the user to change slot_name unless under some rare cases where the system might be stuck like the one for which we are giving WARNING and providing a hint for setting the slot_name to NONE. > ~~ > > > Additionally, I have > > removed the test because it was creating the same name slot as the > > tablesync worker and tablesync worker removed the same due to new > > logic in LogicalRepSyncStart. Earlier, it was not failing because of > > the bug in that code which I have fixed in the attached. > > Wasn't causing a tablesync slot clash and seeing if it could recover > the point of that test? Why not just keep, and modify the test to make > it work again? > We can do that but my other worry was that we might want to reserve the names for slots that start with pg_. > Isn't it still valuable because at least it would > execute the code through the PG_CATCH which otherwise may not get > executed by any other test? > It is valuable but IIRC there was a test (in subscription/004_sync.pl) where PK violation happens during copy which will lead to the coverage of code in CATCH. > > > > I wonder whether we should restrict creating slots with prefix pg_ > > because we are internally creating slots with those names? I think > > this was a problem previously also. We already prohibit it for few > > other objects like origins, schema, etc., see the usage of > > IsReservedName. > > > > Yes, we could restrict the create slot API if you really wanted to. > But, IMO it is implausible that a user could "accidentally" clash with > the internal tablesync slot name, so in practice maybe this change > would not help much but it might make it more difficult to test some > scenarios. > Isn't the same true for origins? -- With Regards, Amit Kapila.