Here are my feedback comments for the V29 patch. ====
FILE: logical-replication.sgml + slots have generated names: <quote><literal>pg_%u_sync_%u_%llu</literal></quote> + (parameters: Subscription <parameter>oid</parameter>, + Table <parameter>relid</parameter>, system identifier<parameter>sysid</parameter>) + </para> 1. There is a missing space before the sysid parameter. ===== FILE: subscriptioncmds.c + * SUBREL_STATE_FINISHEDCOPY. The apply worker can also + * concurrently try to drop the origin and by this time the + * origin might be already removed. For these reasons, + * passing missing_ok = true from here. + */ + snprintf(originname, sizeof(originname), "pg_%u_%u", sub->oid, relid); + replorigin_drop_by_name(originname, true, false); + } 2. Don't really need to say "from here". (same comment applies multiple places, in this file and in tablesync.c) 3. Previously the tablesync origin name format was encapsulated in a common function. IMO it was cleaner/safer how it was before, instead of the same "pg_%u_%u" cut/paste and scattered in many places. (same comment applies multiple places, in this file and in tablesync.c) 4. Calls like replorigin_drop_by_name(originname, true, false); make it unnecessarily hard to read code when the boolean params are neither named as variables nor commented. I noticed on another thread [et0205] there was an idea that having no name/comments is fine because anyway it is not difficult to figure out when using a "modern IDE", but since my review tools are only "vi" and "meld" I beg to differ with that justification. (same comment applies multiple places, in this file and in tablesync.c) [et0205] https://www.postgresql.org/message-id/c1d9833f-eeeb-40d5-89ba-87674e1b7ba3%40www.fastmail.com ===== FILE: tablesync.c 5. Previously there was a function tablesync_replorigin_drop which was encapsulating the tablesync origin name formatting. I thought that was better than the V29 code which now has the same formatting scattered over many places. (same comment applies for worker_internal.h) + * Determine the tablesync slot name. + * + * The name must not exceed NAMEDATALEN - 1 because of remote node constraints + * on slot name length. We do append system_identifier to avoid slot_name + * collision with subscriptions in other clusters. With current scheme + * pg_%u_sync_%u_UINT64_FORMAT (3 + 10 + 6 + 10 + 20 + '\0'), the maximum + * length of slot_name will be 50. + * + * The returned slot name is either: + * - stored in the supplied buffer (syncslotname), or + * - palloc'ed in current memory context (if syncslotname = NULL). + * + * Note: We don't use the subscription slot name as part of tablesync slot name + * because we are responsible for cleaning up these slots and it could become + * impossible to recalculate what name to cleanup if the subscription slot name + * had changed. + */ +char * +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char syncslotname[NAMEDATALEN]) +{ + if (syncslotname) + sprintf(syncslotname, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid, + GetSystemIdentifier()); + else + syncslotname = psprintf("pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid, + GetSystemIdentifier()); + + return syncslotname; +} 6. "We do append" --> "We append" "With current scheme" -> "With the current scheme" 7. Maybe consider to just assign GetSystemIdentifier() to a static instead of calling that function for every slot? static uint64 sysid = GetSystemIdentifier(); IIUC the sysid value is never going to change for a process, right? ------ Kind Regards, Peter Smith. Fujitsu Australia On Mon, Feb 8, 2021 at 9:59 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Fri, Feb 5, 2021 at 8:40 PM Petr Jelinek > <petr.jeli...@enterprisedb.com> wrote: > > > > Hi, > > > > We had a bit high-level discussion about this patches with Amit > > off-list, so I decided to also take a look at the actual code. > > > > Thanks for the discussion and a follow-up review. > > > My main concern originally was the potential for left-over slots on > > publisher, but I think the state now is relatively okay, with couple of > > corner cases that are documented and don't seem much worse than the main > > slot. > > > > I wonder if we should mention the max_slot_wal_keep_size GUC in the > > table sync docs though. > > > > I have added the reference of this in Alter Subscription where we > mentioned the risk of leftover slots. Let me know if you have > something else in mind? > > > Another thing that might need documentation is that the the visibility > > of changes done by table sync is not anymore isolated in that table > > contents will show intermediate progress to other backends, rather than > > switching from nothing to state consistent with rest of replication. > > > > Agreed and updated the docs accordingly. > > > > > Some minor comments about code: > > > > > + else if (res->status == WALRCV_ERROR && missing_ok) > > > + { > > > + /* WARNING. Error, but missing_ok = true. */ > > > + ereport(WARNING, > > > > I wonder if we need to add error code to the WalRcvExecResult and check > > for the appropriate ones here. Because this can for example return error > > because of timeout, not because slot is missing. > > > > I think there are both pros and cons of distinguishing the error > ("slot doesnot exist" from others). The benefit is if there a network > glitch then the user can probably retry the commands Alter/Drop and it > will be successful next time. OTOH, say the network is broken for a > long time and the user wants to proceed but there won't be any way to > proceed for Alter Subscription ... Refresh or Drop Command. So by > giving WARNING at least we can provide a way to proceed and then they > can drop such slots later. We have mentioned this in docs as well. I > think we can go either way here, let me know what do you think is a > better way? > > > Not sure if it matters > > for current callers though (but then maybe don't call the param > > missign_ok?). > > > > Sure, if we decide not to change the behavior as suggested by you then > this makes sense. > > > > > > +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char > > > syncslotname[NAMEDATALEN]) > > > +{ > > > + if (syncslotname) > > > + sprintf(syncslotname, "pg_%u_sync_%u", suboid, relid); > > > + else > > > + syncslotname = psprintf("pg_%u_sync_%u", suboid, relid); > > > + > > > + return syncslotname; > > > +} > > > > Given that we are now explicitly dropping slots, what happens here if we > > have 2 different downstreams that happen to get same suboid and reloid, > > will one of the drop the slot of the other one? Previously with the > > cleanup being left to temp slot we'd at maximum got error when creating > > it but with the new logic in LogicalRepSyncTableStart it feels like we > > could get into situation where 2 downstreams are fighting over slot no? > > > > As discussed, added system_identifier to distinguish subscriptions > between different clusters. > > Apart from fixing the above comment, I have integrated it with the new > replorigin_drop_by_name() API being discussed in the thread [1] and > posted that patch just for ease. I have also integrated Osumi-San's > test case patch with minor modifications. > > [1] - > https://www.postgresql.org/message-id/CAA4eK1L7mLhY%3DwyCB0qsEGUpfzWfncDSS9_0a4Co%2BN0GUyNGNQ%40mail.gmail.com > > -- > With Regards, > Amit Kapila.