On Tue, May 25, 2021 at 7:21 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > > On Tue, May 25, 2021 at 1:44 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > > > On Tue, May 25, 2021 at 2:49 PM Bharath Rupireddy > > <bharath.rupireddyforpostg...@gmail.com> wrote: > > > > > > On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada <sawada.m...@gmail.com> > > > wrote: > > > > > > > > Hi all, > > > > > > > > If a logical replication worker cannot apply the change on the > > > > subscriber for some reason (e.g., missing table or violating a > > > > constraint, etc.), logical replication stops until the problem is > > > > resolved. Ideally, we resolve the problem on the subscriber (e.g., by > > > > creating the missing table or removing the conflicting data, etc.) but > > > > occasionally a problem cannot be fixed and it may be necessary to skip > > > > the entire transaction in question. Currently, we have two ways to > > > > skip transactions: advancing the LSN of the replication origin on the > > > > subscriber and advancing the LSN of the replication slot on the > > > > publisher. But both ways might not be able to skip exactly one > > > > transaction in question and end up skipping other transactions too. > > > > > > Does it mean pg_replication_origin_advance() can't skip exactly one > > > txn? I'm not familiar with the function or never used it though, I was > > > just searching for "how to skip a single txn in postgres" and ended up > > > in [1]. Could you please give some more details on scenarios when we > > > can't skip exactly one txn? Is there any other way to advance the LSN, > > > something like directly updating the pg_replication_slots catalog? > > > > Sorry, it's not impossible. Although the user mistakenly skips more > > than one transaction by specifying a wrong LSN it's always possible to > > skip an exact one transaction. > > IIUC, if the user specifies the "correct LSN", then it's possible to > skip exact txn for which the sync workers are unable to apply changes, > right? > > How can the user get the LSN (which we call "correct LSN")? Is it from > pg_replication_slots? Or some other way? > > If the user somehow can get the "correct LSN", can't the exact txn be > skipped using it with any of the existing ways, either using > pg_replication_origin_advance or any other ways?
One possible way I know is to copy the logical replication slot used by the subscriber and peek at the changes to identify the correct LSN (maybe there is another handy way though) . For example, suppose that two transactions insert tuples as follows on the publisher: TX-A: BEGIN; TX-A: INSERT INTO test VALUES (1); TX-B: BEGIN; TX-B: INSERT INTO test VALUES (10); TX-B: COMMIT; TX-A: INSERT INTO test VALUES (2); TX-A: COMMIT; And suppose further that the insertion with value = 10 (by TX-A) cannot be applied only on the subscriber due to unique constraint violation. If we copy the slot by pg_copy_logical_replication_slot('test_sub', 'copy_slot', true, 'test_decoding') , we can peek at those changes with LSN as follows: =# select * from pg_logical_slot_peek_changes('copy', null, null) order by lsn; lsn | xid | data -----------+-----+------------------------------------------ 0/1911548 | 736 | BEGIN 736 0/1911548 | 736 | table public.hoge: INSERT: c[integer]:1 0/1911588 | 737 | BEGIN 737 0/1911588 | 737 | table public.hoge: INSERT: c[integer]:10 0/19115F8 | 737 | COMMIT 737 0/1911630 | 736 | table public.hoge: INSERT: c[integer]:2 0/19116A0 | 736 | COMMIT 736 (7 rows) In this case, '0/19115F8' is the correct LSN to specify. We can advance the replication origin to ' 0/19115F8' by pg_replication_origin_advance() so that logical replication streams transactions committed after ' 0/19115F8'. After the logical replication restarting, it skips the transaction with xid = 737 but replicates the transaction with xid = 736. > If there's no way to get the "correct LSN", then why can't we just > print that LSN in the error context and/or in the new statistics view > for logical replication workers, so that any of the existing ways can > be used to skip exactly one txn? I think specifying XID to the subscription is more understandable for users. > > IIUC, the feature proposed here guards against the users specifying > wrong LSN. If I'm right, what is the guarantee that users don't > specify the wrong txn id? Why can't we tell the users when a wrong LSN > is specified that "currently, an apply worker is failing to apply the > LSN XXXX, and you specified LSN YYYY, are you sure this is > intentional?" With the initial idea, specifying the correct XID is the user's responsibility. If they specify an old XID, the worker invalids it and raises a warning to tell "the worker invalidated the specified XID as it's too old". As the second idea, if we store the last failed XID somewhere (e.g., a system catalog), the user can just specify to skip that transaction. That is, instead of specifying the XID they could do something like "ALTER SUBSCRIPTION test_sub RESOLVE CONFLICT BY SKIP". Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/