On Fri, Jun 18, 2021 at 6:18 AM Mark Dilger
<mark.dil...@enterprisedb.com> wrote:
>
> Hackers,
>
> Logical replication apply workers for a subscription can easily get stuck in 
> an infinite loop of attempting to apply a change, triggering an error (such 
> as a constraint violation), exiting with an error written to the subscription 
> worker log, and restarting.
>
> As things currently stand, only superusers can create subscriptions.  Ongoing 
> work to delegate superuser tasks to non-superusers creates the potential for 
> even more errors to be triggered, specifically, errors where the apply worker 
> does not have permission to make changes to the target table.
>
> The attached patch makes it possible to create a subscription using a new 
> subscription_parameter, "disable_on_error", such that rather than going into 
> an infinite loop, the apply worker will catch errors and automatically 
> disable the subscription, breaking the loop.  The new parameter defaults to 
> false.  When false, the PG_TRY/PG_CATCH overhead is avoided, so for 
> subscriptions which do not use the feature, there shouldn't be any change.  
> Users can manually clear the error after fixing the underlying issue with an 
> ALTER SUBSCRIPTION .. ENABLE command.
>
> In addition to helping on production systems, this makes writing TAP tests 
> involving error conditions simpler.  I originally ran into the motivation to 
> write this patch when frustrated that TAP tests needed to parse the apply 
> worker log file to determine whether permission failures were occurring and 
> what they were.  It was also obnoxiously easy to have a test get stuck 
> waiting for a permanently stuck subscription to catch up.  This helps with 
> both issues.
>
> I don't think this is quite ready for commit, but I'd like feedback if folks 
> like this idea or want to suggest design changes.

I tried your patch.

It applied OK (albeit with whitespace warnings).

The code build and TAP tests are all OK.

Below are a few comments and observations.

COMMENTS
========

(1) PG Docs catalogs.sgml

Documented new column "suberrmsg" but did not document the other new
columns ("disable_on_error", "disabled_by_error")?

------

(2) New column "disabled_by_error".

I wondered if there was actually any need for this column. Isn't the
same information conveyed by just having "subenabled" = false, at same
time as as non-empty "suberrmsg"? This would remove any confusion for
having 2 booleans which both indicate disabled.

------

(3) New columns "disabled_by_error", "disabled_on_error".

All other columns of the pg_subscription have a "sub" prefix.

------

(4) errhint member used?

@@ -91,12 +100,16 @@ typedef struct Subscription
  char    *name; /* Name of the subscription */
  Oid owner; /* Oid of the subscription owner */
  bool enabled; /* Indicates if the subscription is enabled */
+ bool disable_on_error; /* Whether errors automatically disable */
+ bool disabled_by_error; /* Whether an error has disabled */
  bool binary; /* Indicates if the subscription wants data in
  * binary format */
  bool stream; /* Allow streaming in-progress transactions. */
  char    *conninfo; /* Connection string to the publisher */
  char    *slotname; /* Name of the replication slot */
  char    *synccommit; /* Synchronous commit setting for worker */
+ char    *errmsg; /* Message from error which disabled */
+ char    *errhint; /* Hint from error which disabled */
  List    *publications; /* List of publication names to subscribe to */
 } Subscription;

I did not find any code using that newly added member "errhint".

------

(5) dump.c

i. No mention of new columns "disabled_on_error" and
"disabled_by_error". Is that right?

ii. Shouldn't the code for the "suberrmsg" be qualified with some PG
version number checks?

------

(6) Patch only handles errors only from the Apply worker.

Tablesync can give similar errors (e.g. PK violation during DATASYNC
phase) which will trigger re-launch forever regardless of the setting
of "disabled_on_error".
(confirmed by observations below)

------

(7) TAP test code

+$node_subscriber->init(allows_streaming => 'logical');

AFAIK that "logical" configuration is not necessary for the subscriber side. So,

$node_subscriber->init();

////////////


Some Experiments/Observations
==============================

In general, I found this functionality is useful and it works as
advertised by your patch comment.

======

Test: Display pg_subscription with the new columns
Observation: As expected. But some new colnames are not prefixed like
their peers.

test_sub=# \pset x
Expanded display is on.
test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+--------------------------------------------------------
oid               | 16394
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | t
disable_on_error  | t
disabled_by_error | f
subbinary         | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         |
subpublications   | {tap_pub}

======

Test: Cause a PK violation during normal Apply replication (when
"disabled_on_error=true")
Observation: Apply worker stops. Subscription is disabled. Error
message is in the catalog.

2021-06-18 15:12:45.905 AEST [25904] LOG:  edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:12:45.905 AEST [25904] LOG:  logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:12:45.905 AEST [25904] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:12:45.905 AEST [25904] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:12:45.908 AEST [19924] LOG:  background worker "logical
replication worker" (PID 25904) exited with exit code 1

test_sub=# select * from pg_subscription;
-[ RECORD 1 
]-----+---------------------------------------------------------------
oid               | 16394
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | f
disable_on_error  | t
disabled_by_error | t
subbinary         | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         | duplicate key value violates unique constraint
"test_tab_pkey"
subpublications   | {tap_pub}

======

Test: Try to enable subscription (without fixing the PK violation problem).
Observation. OK. It just stops again

test_sub=# alter subscription tap_sub enable;
ALTER SUBSCRIPTION
test_sub=# 2021-06-18 15:17:18.067 AEST [10228] LOG:  logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:17:18.078 AEST [10228] LOG:  edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:17:18.078 AEST [10228] LOG:  logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:17:18.078 AEST [10228] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:17:18.078 AEST [10228] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:17:18.079 AEST [19924] LOG:  background worker "logical
replication worker" (PID 10228) exited with exit code 1

======

Test: Manually disable the subscription (which had previously already
been disabled due to error)
Observation: OK. The suberrmsg gets reset to an empty string.

alter subscription tap_sub disable;

=====

Test: Turn off the disable_on_error
Observation: As expected, now the Apply worker goes into re-launch
forever loop every time it hits PK violation

test_sub=# alter subscription tap_sub set (disable_on_error=false);
ALTER SUBSCRIPTION

...

======

Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
(when disable_on_error = true)
Observation: This patch changes nothing for this case. The Tablesyn
re-launchs in a forever loop the same as current functionality.

test_sub=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH
(disable_on_error=false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
test_sub=# 2021-06-18 15:38:19.547 AEST [18205] LOG:  logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:38:19.557 AEST [18207] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:19.610 AEST [18207] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:19.610 AEST [18207] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:19.610 AEST [18207] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:19.611 AEST [19924] LOG:  background worker "logical
replication worker" (PID 18207) exited with exit code 1
2021-06-18 15:38:24.634 AEST [18369] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:24.689 AEST [18369] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:24.689 AEST [18369] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:24.689 AEST [18369] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:24.690 AEST [19924] LOG:  background worker "logical
replication worker" (PID 18369) exited with exit code 1
2021-06-18 15:38:29.701 AEST [18521] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:29.765 AEST [18521] ERROR:  duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:29.765 AEST [18521] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:29.765 AEST [18521] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:29.766 AEST [19924] LOG:  background worker "logical
replication worker" (PID 18521) exited with exit code 1
etc...


-[ RECORD 1 ]-----+--------------------------------------------------------
oid               | 16399
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | t
disable_on_error  | f
disabled_by_error | f
subbinary         | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         |
subpublications   | {tap_pub}


------
Kind Regards,
Peter Smith.
Fujitsu Australia


Reply via email to