On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2...@gmail.com> wrote: > > > > > > PSA the v18 patch for the Tablesync Solution1. > > > > 7. Have you tested with the new patch the scenario where we crash > > after FINISHEDCOPY and before SYNCDONE, is it able to pick up the > > replication using the new temporary slot? Here, we need to test the > > case where during the catchup phase we have received few commits and > > then the tablesync worker is crashed/errored out? Basically, check if > > the replication is continued from the same point? > > > > I have tested this and it didn't work, see the below example. > > Publisher-side > ================ > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120)); > > BEGIN; > INSERT INTO mytbl1(somedata, text) VALUES (1, 1); > INSERT INTO mytbl1(somedata, text) VALUES (1, 2); > COMMIT; > > CREATE PUBLICATION mypublication FOR TABLE mytbl1; > > Subscriber-side > ================ > - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync > worker stops. > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120)); > > > CREATE SUBSCRIPTION mysub > CONNECTION 'host=localhost port=5432 dbname=postgres' > PUBLICATION mypublication; > > During debug, stop after we mark FINISHEDCOPY state. > > Publisher-side > ================ > INSERT INTO mytbl1(somedata, text) VALUES (1, 3); > INSERT INTO mytbl1(somedata, text) VALUES (1, 4); > > > Subscriber-side > ================ > - Have a breakpoint in apply_dispatch > - continue in debugger; > - After we replay first commit (which will be for values(1,3), note > down the origin position in apply_handle_commit_internal and somehow > error out. I have forced the debugger to set to the last line in > apply_dispatch where the error is raised. > - After the error, again the tablesync worker is restarted and it > starts from the position noted in the previous step > - It exits without replaying the WAL for (1,4) > > So, on the subscriber-side, you will see 3 records. Fourth is missing. > Now, if you insert more records on the publisher, it will anyway > replay those but the fourth one got missing. > > The temporary slots didn't seem to work because we created again the > new temporary slot after the crash and ask it to start decoding from > the point we noted in origin_lsn. The publisher didn’t hold the > required WAL as our slot was temporary so it started sending from some > later point. We retain WAL based on the slots restart_lsn position and > wal_keep_size. For our case, the positions of the slots will matter > and as we have created temporary slots, there is no way for a > publisher to save that WAL. > > In this particular case, even if the WAL would have been there we only > pass the start_decoding_at position but didn’t pass restart_lsn, so it > picked a random location (current insert position in WAL) which is > ahead of start_decoding_at point so it never sent the required fourth > record. Now, I don’t think it will work even if somehow sent the > correct restart_lsn because of what I wrote earlier that there is no > guarantee that the earlier WAL would have been saved. > > At this point, I can't think of any way to fix this problem except for > going back to the previous approach of permanent slots but let me know > if you have any ideas to salvage this approach? >
OK. The latest patch [v21] now restores the permanent slot (and slot cleanup) approach as it was implemented in an earlier version [v17]. Please note that this change also re-introduces some potential slot cleanup problems for some race scenarios. These will be addressed by future patches. ---- [v17] https://www.postgresql.org/message-id/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com [v21] https://www.postgresql.org/message-id/CAHut%2BPvzHRRA_5O0R8KZCb1tVe1mBVPxFtmttXJnmuOmAegoWA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia