On Mon, Jan 24, 2022 at 1:49 PM David G. Johnston <david.g.johns...@gmail.com> wrote: > > On Sun, Jan 23, 2022 at 8:35 PM Amit Kapila <amit.kapil...@gmail.com> wrote: >> >> > I really dislike the user experience this provides, and given it is new in >> > v15 (and right now this table seems to exist solely to support this >> > feature) changing this seems within the realm of possibility. I have to >> > imagine these workers have a sense of local state that would just be "no >> > errors, no need to touch pg_stat_subscription_workers at the end of this >> > transaction's commit". It would save a local state of the error_xid and >> > if a successfully committed transaction has that xid it would clear the >> > error. The skip code path would also check for and see the matching xid >> > value and clear the error. Even if the local state thing doesn't work, >> > one catalog lookup per transaction seems like potentially reasonable >> > overhead to incur here. >> > >> >> Are you telling to update the catalog to save error_xid when an error >> occurs? If so, that has many challenges like we are not supposed to >> perform any such operations when the transaction is in an error state. >> We have discussed this and other ideas in the beginning. I don't find >> any of your arguments convincing to change the basic approach here but >> I would like to see what others think on this matter? >> > > Then how does the table get updated to that state in the first place since it > doesn't know the error details until there is an error?
I think your idea is based on storing error information including XID is stored in the system catalog. I think that the reasons why we use the stats collector to store error information including last_error_xid are (1) as Amit mentioned, it would have many challenges if updating the catalog when the transaction is in an error state, and (2) we can store more information such as error messages, action, etc. other than XID so that users can identify that the reported error is a conflict error but not other types of error such as OOM error. For these reasons to me, it makes sense to store subscribers' error information by using the stats collector. When it comes to reporting a message to the stats collector, we need to note that it's not guaranteed that all messages arrive at the stats collector. Therefore, last_error_xid doesn't not necessarily get updated after the worker reports an error. Similarly, the same is true for clearing subskipxid. I agree that it's useful if pg_subscription.subskipxid is automatically set when executing ALTER SUBSCRIPTION SKIP but it might not work in some cases because of this restriction. There is another idea of storing error XID on shmem (e.g., in ReplicationState) in addition to reporting error details to the stats collector and using the XID when skipping the transaction, but I'm not sure whether it's a reliable way. Anyway, even if subskipxid is automatically set when ALTER SUBSCRIPTION SKIP, I think we need to provide a way to clear it as the current patch does (setting NONE) just in case. > > In any case, clearing out the entries in the table would not happen while it > is applying the replication stream, in an error state or otherwise. > > in = while streaming > out = not streaming > > 1(in). replication stream is working > 2(in). replication stream fails; capture error information > 3(in->out). stop replication stream; perform rollback on xid > 4(out). update pg_stat_subscription_worker to report the failure, including > xid of the transaction > 5(out). wait for the user to manually restart the replication stream Do you mean that there always is user intervention after error so the replication stream can resume? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/