Much of the discussion above seems to revolve around where to store the error information, and how much information is needed for it to be useful.
As a summary, the 5 alternatives I have seen mentioned are:

#1. Store some simple message in pg_subscription ("I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went wrong" [Mark-1]). Storing the error message was also seen as a convenience for writing TAP tests ("I originally ran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker log file" [Mark-2]). It can also sometimes provide a simple clue for the error (e.g. a PK violation for table TBL), but the user will still have to look elsewhere for the details needed to resolve the error. So while this implementation seems good for simple scenarios, it appears to have been shot down because in the non-trivial scenarios the error message holds insufficient or wrong information. A DETAIL field could have been added to give more information, but that would maybe bloat the catalog ("I have not yet included the DETAIL field because I didn't want to bloat the catalog." [Mark-3]).

#2. Similarly, another idea was to use the other existing catalog table pg_subscription_rel. This could have the same problems ("It won't be sufficient to store information in either pg_subscription_rel or pg_susbscription." [Amit-1]).

#3. There was another suggestion to use the Stats Collector to hold the error message [Amit-2]. For me, this felt like blurring the distinction between "stats tracking/metrics" and "logs" too much. ERROR logs must be flushed, whereas for stats (IIUC) there is no guarantee that everything you need to see will be present. Indeed, Amit wrote "But in this case, if the stats collector missed updating the information, the user may have to manually update the subscription and let the error happen again to see it." [Amit-3]. Asking the user to cause the same error again, just in case it was not captured the first time, seems too strange to me.

#4. The next idea was to have an entirely new catalog for holding the subscription error information. I feel that storing/duplicating lots of error information in another table is a bridge too far. What about the risks of storing incorrect or insufficient information? What is the advantage of duplicating errors over just referring to the log files for the ERROR details?

#5. Document that the user should refer to the logs. All ERROR details are already in the logs, and this seems to me the intuitive place to look for them. Searching for specific errors programmatically becomes difficult (is this really a problem other than for complex TAP tests?), but there is no risk of missing or insufficient information being captured in the log files ("but still there will be some information related to ERROR which we wanted the user to see unless we ask them to refer to logs for that." [Amit-4]).

---

My preferred alternative is #5. ERRORs are logged in the log file, so there is nothing really for this patch to do in this regard (except documentation); there is no risk of missing any information, no ambiguity from duplicated errors, and it is the intuitive place the user would look. So I felt the current best combination is just this:

a) A tri-state indicating the state of the subscription, e.g. something like "enabled" ('e') / "disabled" ('d') / "auto-disabled" ('a') [Amit-5]

b) For "auto-disabled", the PG docs would be updated to tell the user to check the logs to resolve the problem before re-enabling the subscription (see the sketch below).
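To illustrate how (a) and (b) might combine, here is a minimal sketch of the user workflow, assuming (hypothetically) that pg_subscription exposed some char column "substate" holding those tri-state values. Only the ALTER SUBSCRIPTION part is existing syntax; the column name and values are just my assumptions for the example.

    -- Hypothetical: "substate" is an assumed column name for the proposed
    -- tri-state ('e' = enabled, 'd' = disabled, 'a' = auto-disabled).
    -- Find any subscriptions that were auto-disabled:
    SELECT subname FROM pg_subscription WHERE substate = 'a';

    -- Per (b), the user then checks the logs, fixes the reported
    -- problem, and re-enables (this part is existing syntax):
    ALTER SUBSCRIPTION mysub ENABLE;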
//////////

IMO it is not exactly clear to me what the main goal of this patch is. Because of this, I feel you can't really judge whether this new option is actually useful or not, except in hindsight. It seems like whatever you implement can be made to look good or bad just by citing different test scenarios. e.g.

* Is the goal mainly to help automated (TAP) testing? In that case, maybe you do want to store the error message somewhere other than the log files. But I still wonder if the results would be unpredictable anyway - e.g. if there are multiple tables all with errors, then which error you see as the cause of the auto-disable depends on the tablesync order of execution, right? And if it is not predictable, maybe it is less useful.

* Is the goal to prevent some *unattended* SUBSCRIPTION from going bad at some point in the future and then going into a relaunch loop for days/weeks, causing 1000's of errors without the user noticing? In that case this patch seems quite useful, but for this goal maybe you don't want to be checking the tablesync workers at all, and should only be checking the apply worker like your original v1 patch did.

* Is the goal just to be a convenient way to disable the subscription during the CREATE SUBSCRIPTION phase, so that the user can make corrections in peace without the workers re-launching and making more error logs? Here the patch is helpful, but only for simple scenarios like 1 faulty table. Imagine there are 10 tables (all with PK violations at the DATASYNC copy): you will encounter the errors one at a time and have to re-enable the subscription 10 times, after fixing each error in turn. So in this scenario the new option might be more of a hindrance than a help, because it would be easier for the user to just do "ALTER SUBSCRIPTION sub DISABLE" manually and fix all the problems in one sitting before re-enabling.

* etc.

//////////

Finally, here is one last (crazy?) thought-bubble, just for consideration. I might be wrong, but my gut feeling is that the Stats Collector is intended more for "tracking" and "metrics" than for holding duplicates of logged error messages. At the same time, I felt that disabling an entire subscription due to a single rogue error might sometimes be overkill. But I wonder if there is a way to combine those two ideas, so that the Stats Collector gets a new counter tracking the number of worker re-launches that have occurred, while a subscription option gives a threshold above which the subscription would be disabled. e.g. (a sketch of the user-facing syntax follows this list)

  "disable_on_error_threshold=0"   the default; relaunch forever
  "disable_on_error_threshold=1"   disable upon the first error encountered (this is how your patch behaves now, I think)
  "disable_on_error_threshold=500" disable if the re-launch errors go unattended and happen 500 times
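To make the thought-bubble concrete, the user-facing part might look something like the sketch below. To be clear, "disable_on_error_threshold" is only a hypothetical option name; nothing like it exists in any posted patch, and only the general CREATE SUBSCRIPTION ... WITH (...) shape is existing syntax.

    -- Hypothetical sketch only; "disable_on_error_threshold" is not an
    -- existing subscription option.
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=... dbname=...'
        PUBLICATION mypub
        WITH (disable_on_error_threshold = 500);

    -- i.e. tolerate up to 500 unattended worker re-launch errors before
    -- the subscription is automatically disabled.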
------
[Mark-1] https://www.postgresql.org/message-id/A539C848-670E-454F-B31C-82D3CBE9F5AC%40enterprisedb.com
[Mark-2] https://www.postgresql.org/message-id/DB35438F-9356-4841-89A0-412709EBD3AB%40enterprisedb.com
[Mark-3] https://www.postgresql.org/message-id/DE7E13B7-DC76-416A-A98F-3BC3F80E6BE9%40enterprisedb.com
[Amit-1] https://www.postgresql.org/message-id/CAA4eK1K_JFSFrAkr_fgp3VX6hTSmjK%3DwNs4Tw8rUWHGp0%2BBsaw%40mail.gmail.com
[Amit-2] https://www.postgresql.org/message-id/CAA4eK1%2BNoRbYSH1J08zi4OJ_EUMcjmxTwnmwVqZ6e_xzS0D6VA%40mail.gmail.com
[Amit-3] https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com
[Amit-4] https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com
[Amit-5] https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia