On Wed, Jan 6, 2021 at 4:04 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Tue, Jan 5, 2021 at 3:32 PM Peter Smith <smithpb2...@gmail.com> wrote: > > > > On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > > > > 5. Is it possible to write a testcase where we fail (say due to pk > > > violation or some other error) after the initial copy is done, then > > > remove the conflicting row and allow a copy to be completed? If we > > > already have any such test then it is fine. > > > > > > > Causing a PK violation during the initial copy is not a problem to > > test, but doing it after the initial copy is difficult. I have done > > exactly this test scenario before but I thought it cannot be > > automated. E.g. To cause an PK violation error somewhere between > > COPYDONE and SYNDONE means that the offending insert (the one which > > tablesync will fail to replicate) has to be sent while the tablesync > > is in CATCHUP mode. But AFAIK that can only be achieved using the > > debugger to get the timing right. > > > > Yeah, I am also not able to think of any way to automate such a test. > I was thinking about what could go wrong if we error out in that > stage. The only thing that could be problematic is if we somehow make > the slot and replication origin used during copy dangling. I think if > tablesync is restarted after error then we will clean up those which > will be normally the case but what if the tablesync worker is not > started again? I think the only possibility of tablesync worker not > started again is if during Alter Subscription ... Refresh Publication, > we remove the corresponding subscription rel (see > AlterSubscription_refresh, I guess it could happen if one has dropped > the relation from publication). I haven't tested this with your patch > but if such a possibility exists then we need to think of cleaning up > slot and origin when we remove subscription rel. What do you think? >
I think it makes sense. If there can be a race between the tablesync re-launching (after error), and the AlterSubscription_refresh removing some table’s relid from the subscription then there could be lurking slot/origin tablesync resources (of the removed table) which a subsequent DROP SUBSCRIPTION cannot discover. I will think more about how/if it is possible to make this happen. Anyway, I suppose I ought to refactor/isolate some of the tablesync cleanup code in case it needs to be commonly called from DropSubscription and/or from AlterSubscription_refresh. ---- Kind Regards, Peter Smith. Fujitsu Australia.