Re: Skipping logical replication transactions on subscriber side

Masahiko Sawada Sun, 23 Jan 2022 22:55:29 -0800

On Mon, Jan 24, 2022 at 1:49 PM David G. Johnston
<david.g.johns...@gmail.com> wrote:
>
> On Sun, Jan 23, 2022 at 8:35 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>>
>> > I really dislike the user experience this provides, and given it is new in 
>> > v15 (and right now this table seems to exist solely to support this 
>> > feature) changing this seems within the realm of possibility. I have to 
>> > imagine these workers have a sense of local state that would just be "no 
>> > errors, no need to touch pg_stat_subscription_workers at the end of this 
>> > transaction's commit".  It would save a local state of the error_xid and 
>> > if a successfully committed transaction has that xid it would clear the 
>> > error.  The skip code path would also check for and see the matching xid 
>> > value and clear the error.  Even if the local state thing doesn't work, 
>> > one catalog lookup per transaction seems like potentially reasonable 
>> > overhead to incur here.
>> >
>>
>> Are you telling to update the catalog to save error_xid when an error
>> occurs? If so, that has many challenges like we are not supposed to
>> perform any such operations when the transaction is in an error state.
>> We have discussed this and other ideas in the beginning. I don't find
>> any of your arguments convincing to change the basic approach here but
>> I would like to see what others think on this matter?
>>
>
> Then how does the table get updated to that state in the first place since it 
> doesn't know the error details until there is an error?


I think your idea is based on storing error information including XID
is stored in the system catalog. I think that the reasons why we use
the stats collector to store error information including
last_error_xid are (1) as Amit mentioned, it would have many
challenges if updating the catalog when the transaction is in an error
state, and (2) we can store more information such as error messages,
action, etc. other than XID so that users can identify that the
reported error is a conflict error but not other types of error such
as OOM error. For these reasons to me, it makes sense to store
subscribers' error information by using the stats collector.

When it comes to reporting a message to the stats collector, we need
to note that it's not guaranteed that all messages arrive at the stats
collector. Therefore, last_error_xid doesn't not necessarily get
updated after the worker reports an error. Similarly, the same is true
for clearing subskipxid. I agree that it's useful if
pg_subscription.subskipxid is automatically set when executing ALTER
SUBSCRIPTION SKIP but it might not work in some cases because of this
restriction.

There is another idea of storing error XID on shmem (e.g., in
ReplicationState) in addition to reporting error details to the stats
collector and using the XID when skipping the transaction, but I'm not
sure whether it's a reliable way.

Anyway, even if subskipxid is automatically set when ALTER
SUBSCRIPTION SKIP, I think we need to provide a way to clear it as the
current patch does (setting NONE) just in case.

>
> In any case, clearing out the entries in the table would not happen while it 
> is applying the replication stream, in an error state or otherwise.
>
> in = while streaming
> out = not streaming
>
> 1(in). replication stream is working
> 2(in). replication stream fails; capture error information
> 3(in->out). stop replication stream; perform rollback on xid
> 4(out). update pg_stat_subscription_worker to report the failure, including 
> xid of the transaction
> 5(out). wait for the user to manually restart the replication stream

Do you mean that there always is user intervention after error so the
replication stream can resume?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Skipping logical replication transactions on subscriber side

Reply via email to