On Thu, Jan 28, 2021 at 9:37 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Thu, Jan 28, 2021 at 12:32 PM Peter Smith <smithpb2...@gmail.com> wrote:
> >
> > On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > > On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapil...@gmail.com>
> > > wrote:
> > > >
> > > > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2...@gmail.com>
> > > > wrote:
> > > > >
> > > > > PSA the v18 patch for the Tablesync Solution1.
> > > >
> > > > 7. Have you tested with the new patch the scenario where we crash
> > > > after FINISHEDCOPY and before SYNCDONE: is it able to pick up the
> > > > replication using the new temporary slot? Here, we need to test the
> > > > case where during the catchup phase we have received a few commits
> > > > and then the tablesync worker crashes/errors out. Basically, check
> > > > whether the replication is continued from the same point.
> > > >
> > >
> > > I have tested this and it didn't work; see the below example.
> > >
> > > Publisher-side
> > > ================
> > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> > >
> > > BEGIN;
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
> > > COMMIT;
> > >
> > > CREATE PUBLICATION mypublication FOR TABLE mytbl1;
> > >
> > > Subscriber-side
> > > ================
> > > - Have a while(1) loop in LogicalRepSyncTableStart so that the
> > > tablesync worker stops.
> > >
> > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> > >
> > > CREATE SUBSCRIPTION mysub
> > > CONNECTION 'host=localhost port=5432 dbname=postgres'
> > > PUBLICATION mypublication;
> > >
> > > During debug, stop after we mark the FINISHEDCOPY state.
> > >
> > > Publisher-side
> > > ================
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 4);
> > >
> > > Subscriber-side
> > > ================
> > > - Have a breakpoint in apply_dispatch.
> > > - Continue in the debugger.
> > > - After we replay the first commit (which will be for values (1, 3)),
> > > note down the origin position in apply_handle_commit_internal and
> > > somehow error out. I have forced the debugger to jump to the last
> > > line in apply_dispatch where the error is raised.
> > > - After the error, the tablesync worker is restarted again and it
> > > starts from the position noted in the previous step.
> > > - It exits without replaying the WAL for (1, 4).
> > >
> > > So, on the subscriber side, you will see 3 records; the fourth is
> > > missing. Now, if you insert more records on the publisher, it will
> > > still replay those, but the fourth one has gone missing.
> >
...
>
> > > At this point, I can't think of any way to fix this problem except
> > > for going back to the previous approach of permanent slots, but let
> > > me know if you have any ideas to salvage this approach?
> > >
> >
> > OK. The latest patch [v21] now restores the permanent slot (and slot
> > cleanup) approach as it was implemented in an earlier version [v17].
> > Please note that this change also re-introduces some potential slot
> > cleanup problems for some race scenarios.
> >
>
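Just as a side note for anyone repeating the reproduction steps above
(and the similar ones below): the "while(1) loop" debug aid is nothing
fancier than a temporary busy-wait that parks the worker process until a
debugger is attached and the flag is cleared. A minimal sketch, assuming
that is how it was done (this is not code from any patch version):

/*
 * Hypothetical debug-only helper; drop a call to it into the function
 * you want to pause (e.g. at the top of LogicalRepSyncTableStart).
 */
static void
debug_wait_for_attach(void)
{
	/* volatile so the debugger's "set var wait_for_debugger = 0" works */
	volatile int wait_for_debugger = 1;

	while (wait_for_debugger)
		;	/* park here until the flag is cleared from the debugger */
}
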
> I am able to reproduce the race condition where the slot/origin will
> remain on the publisher node even when the corresponding subscription
> is dropped. Basically, if we error out in the 'catchup' phase in the
> tablesync worker then either it will restart and clean up the
> slot/origin, or, if in the meantime we have dropped the subscription
> and stopped the apply worker, then the slot and origin will probably
> be left dangling on the publisher.
>
> I have used exactly the same test procedure as was used to expose the
> problem with the temporary slots, with some minor changes as mentioned
> below:
>
> Subscriber-side
> ================
> - Have a while(1) loop in LogicalRepSyncTableStart so that the
> tablesync worker stops.
> - Have a while(1) loop in wait_for_relation_state_change so that we
> can control the apply worker via the debugger at the right time.
>
> Subscriber-side
> ================
> - Have a breakpoint in apply_dispatch.
> - Continue in the debugger.
> - After we replay the first commit, somehow error out. I have forced
> the debugger to jump to the last line in apply_dispatch where the
> error is raised.
> - Now, the tablesync worker won't restart because the apply worker is
> looping in wait_for_relation_state_change.
> - Execute DropSubscription;
> - We can allow the apply worker to continue by skipping the while(1),
> and it will exit because DropSubscription would have sent a terminate
> signal.
>
> After the above steps, check the publisher (select * from
> pg_replication_slots) and you will find the dangling tablesync slot.
>
> I think to solve the above problem we should drop the tablesync
> slot/origin at Drop/Alter Subscription time, and additionally we need
> to ensure that the apply worker doesn't let tablesync workers restart
> (or that they must not do any work that accesses the slot, because the
> slots are dropped) once we have stopped them. To ensure that, I think
> we need to make the following changes:
>
> 1. Take AccessExclusiveLock on pg_subscription_rel during Alter
> (before calling RemoveSubscriptionRel) and don't release it till
> transaction end (do table_close with NoLock), similar to
> DropSubscription.
> 2. Take a share lock (AccessShareLock) in GetSubscriptionRelState (it
> gets called from LogicalRepSyncTableStart); we can release this lock
> at the end of that function. This will ensure that even if the
> tablesync worker is restarted, it will be blocked until the
> transaction performing the Alter commits.
> 3. Make the Alter command not run in a transaction block so that we
> don't keep locks for a long time, and also because of the slot-related
> stuff, similar to DropSubscription.
>
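Just to confirm I am reading points 1 and 2 the same way, the locking
shape being proposed is roughly like the sketch below. This is only an
illustration with the real work elided; the include list and the exact
call sites are my assumptions, not code lifted from any patch version.

#include "postgres.h"

#include "access/table.h"
#include "catalog/pg_subscription_rel.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"

/*
 * Point 1 (ALTER SUBSCRIPTION side): lock pg_subscription_rel with
 * AccessExclusiveLock and keep it until the transaction ends.
 */
static void
alter_subscription_locking_sketch(void)
{
	Relation	rel;

	rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

	/* ... RemoveSubscriptionRel() and tablesync slot cleanup go here ... */

	/* NoLock: the AccessExclusiveLock is held until the ALTER commits */
	table_close(rel, NoLock);
}

/*
 * Point 2 (tablesync worker side): take AccessShareLock for the relstate
 * lookup, so a restarted tablesync worker blocks behind the ALTER above.
 */
static void
tablesync_relstate_locking_sketch(void)
{
	Relation	rel;

	rel = table_open(SubscriptionRelRelationId, AccessShareLock);

	/* ... the GetSubscriptionRelState() catalog lookup goes here ... */

	table_close(rel, AccessShareLock);
}
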
OK. The latest patch [v22] changes the code as suggested above.

> Few comments on v21:
> ===================
> 1.
> DropSubscription()
> {
> ..
> - /* Clean up dependencies */
> + /* Clean up dependencies. */
> deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
> ..
> }
>
> The above change seems unnecessary w.r.t. the current patch.
>

OK. Modified in patch [v22].

> 2.
> DropSubscription()
> {
> ..
> /*
> - * If there is no slot associated with the subscription, we can finish
> - * here.
> + * If there is a slot associated with the subscription, then drop the
> + * replication slot at the publisher node using the replication
> + * connection.
> */
> - if (!slotname)
> + if (slotname)
> {
> - table_close(rel, NoLock);
> - return;
> ..
> }
>
> What is the reason for this change? Can't we keep the check in its
> existing form?
>

I think the above comment is no longer applicable in the latest patch
[v22]. The early exit for a null slotname is not desirable anymore; we
still need to process all the tablesync slots/origins regardless of
whether the subscription's own slot name is set (see the PS below for a
rough sketch of the resulting shape).

----
[v22] https://www.postgresql.org/message-id/CAHut%2BPtrAVrtjc8srASTeUhbJtviw0Up-bzFSc14Ss%3DmAMxz9g%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
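PS. For anyone skimming comment #2, the resulting DropSubscription flow
is roughly the shape sketched below. The helper names are invented
purely for this illustration; they are not the function names used in
the patch.

/* Hypothetical helpers, named only for this illustration. */
static void
drop_tablesync_slots_and_origins(void)
{
	/*
	 * Would walk the subscription's relations and drop each tablesync
	 * slot/origin at the publisher.
	 */
}

static void
drop_subscription_own_slot(const char *slotname)
{
	/* would drop the subscription's own replication slot at the publisher */
	(void) slotname;
}

static void
drop_subscription_cleanup_sketch(const char *slotname)
{
	/*
	 * The tablesync slots/origins must be cleaned up regardless of
	 * whether the subscription has its own slot name, so this step is
	 * unconditional (hence no early return when slotname is null).
	 */
	drop_tablesync_slots_and_origins();

	/* Only the subscription's own slot depends on slotname being set. */
	if (slotname)
		drop_subscription_own_slot(slotname);
}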