On Sun, Jan 23, 2022 at 11:55 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Mon, Jan 24, 2022 at 1:49 PM David G. Johnston > <david.g.johns...@gmail.com> wrote: > > > > On Sun, Jan 23, 2022 at 8:35 PM Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> > >> > I really dislike the user experience this provides, and given it is > new in v15 (and right now this table seems to exist solely to support this > feature) changing this seems within the realm of possibility. I have to > imagine these workers have a sense of local state that would just be "no > errors, no need to touch pg_stat_subscription_workers at the end of this > transaction's commit". It would save a local state of the error_xid and if > a successfully committed transaction has that xid it would clear the > error. The skip code path would also check for and see the matching xid > value and clear the error. Even if the local state thing doesn't work, one > catalog lookup per transaction seems like potentially reasonable overhead > to incur here. > >> > > >> > >> Are you telling to update the catalog to save error_xid when an error > >> occurs? If so, that has many challenges like we are not supposed to > >> perform any such operations when the transaction is in an error state. > >> We have discussed this and other ideas in the beginning. I don't find > >> any of your arguments convincing to change the basic approach here but > >> I would like to see what others think on this matter? > >> > > > > Then how does the table get updated to that state in the first place > since it doesn't know the error details until there is an error? > > I think your idea is based on storing error information including XID > is stored in the system catalog. I think that the reasons why we use > the stats collector I noticed this dynamic while skimming the patch (and also pondering why the new worker table was not in a catalog chapter) but am only now fully beginning to appreciate its impact on this discussion. > to store error information including last_error_xid are (1) as Amit mentioned, it would have many > challenges if updating the catalog when the transaction is in an error > state, and I'm going on faith right now that this is a problem. But from my prior outline I hope you can see why I find it surprising. Don't try to update a catalog while in an error state. Get out of the error state first. e.g., A transient "holding pattern" would seem to work. Upon a server restart the transient state would be forgotten, it would attempt to reapply the wal, would see the same error, and would then go back into the transient holding pattern. I do intend to read the other discussion on this particular topic so a detailed rebuttal, if warranted, can be withheld. > (2) we can store more information such as error messages, > action, etc. other than XID so that users can identify that the > reported error is a conflict error but not other types of error such > as OOM error. I mentioned only XID because of the focus on SKIP. The other data already present in that table is ok. Whether we use a catalog or the stats collector seems irrelevant. If anything the catalog makes more sense - calling an error message a statistic is a bit of a reach. >Similarly, the same is true >for clearing subskipxid. I agree that it's useful if >pg_subscription.subskipxid is automatically set when executing ALTER >SUBSCRIPTION SKIP but it might not work in some cases because of this >restriction. For these reasons to me, it makes sense to store >subscribers' error information by using the stats collector. I'm confused - pg_subscription is a catalog, not a stat view. Why is it affected? I don't see how point 2 prevents using a system catalog. I accept point 1 as true but will need to read some of the prior discussion to really understand it. When it comes to reporting a message to the stats collector, we need > to note that it's not guaranteed that all messages arrive at the stats > collector. Therefore, last_error_xid doesn't not necessarily get > updated after the worker reports an error. You'll forgive me for not considering this due to its apparent lack of mention in the documentation [*] and it's arguable classification as a POLA violation. [*] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION What I do read there seems compatible with the desired user experience. 500ms lag, idle transaction oriented, reset upon unclean shutdown, and consumers seeing a stable transactional view: none of these seem like show-stoppers. Anyway, even if subskipxid is automatically set when ALTER > SUBSCRIPTION SKIP, I think we need to provide a way to clear it as the > current patch does (setting NONE) just in case. > With my suggestion of requiring a matching xid the whole option for skip_xid = { xid | NONE } remains. > 5(out). wait for the user to manually restart the replication stream > > Do you mean that there always is user intervention after error so the > replication stream can resume? > That is my working assumption. It doesn't seem like the system would auto-resume without a DBA doing something (I'll attribute a server crash to the DBA for convenience). Apparently I need to read more about how the system works today to understand how this varies from and integrates with today's user experience. That said, at present my two dislikes: 1) ALTER SYSTEM SKIP accepts any xid value (I need to consider further the timing of when this resets to zero) 2) pg_stat_subscription_worker.last_error_* fields remain populated even while the system is in a normal operating state. are preventing me from preferring this patch over the status quo (yes, I know the 2nd point is about a committed feature). Regardless of how far off I may be regarding our technical ability to change them to a more (IMO) user-friendly design. David J.