On Fri, Jun 18, 2021 at 6:18 AM Mark Dilger <mark.dil...@enterprisedb.com> wrote:
>
> Hackers,
>
> Logical replication apply workers for a subscription can easily get stuck
> in an infinite loop of attempting to apply a change, triggering an error
> (such as a constraint violation), exiting with an error written to the
> subscription worker log, and restarting.
>
> As things currently stand, only superusers can create subscriptions.
> Ongoing work to delegate superuser tasks to non-superusers creates the
> potential for even more errors to be triggered, specifically, errors
> where the apply worker does not have permission to make changes to the
> target table.
>
> The attached patch makes it possible to create a subscription using a new
> subscription parameter, "disable_on_error", such that rather than going
> into an infinite loop, the apply worker will catch errors and
> automatically disable the subscription, breaking the loop. The new
> parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is
> avoided, so for subscriptions which do not use the feature, there
> shouldn't be any change. Users can manually clear the error after fixing
> the underlying issue with an ALTER SUBSCRIPTION ... ENABLE command.
>
> In addition to helping on production systems, this makes writing TAP
> tests involving error conditions simpler. I originally ran into the
> motivation to write this patch when frustrated that TAP tests needed to
> parse the apply worker log file to determine whether permission failures
> were occurring and what they were. It was also obnoxiously easy to have a
> test get stuck waiting for a permanently stuck subscription to catch up.
> This helps with both issues.
>
> I don't think this is quite ready for commit, but I'd like feedback on
> whether folks like this idea or want to suggest design changes.
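For context, the intended usage as I understand it from your patch description is sketched below (the connection string, publication, and subscription names are just placeholders taken from my test setup):

```sql
-- Create a subscription that disables itself instead of looping on error.
-- "disable_on_error" is the new parameter added by this patch.
CREATE SUBSCRIPTION tap_sub
    CONNECTION 'host=localhost dbname=test_pub'
    PUBLICATION tap_pub
    WITH (disable_on_error = true);

-- If the apply worker later hits an error (e.g. a constraint violation),
-- the subscription is disabled automatically and the error message is
-- recorded in the catalog. After fixing the underlying problem, the user
-- re-enables the subscription manually:
ALTER SUBSCRIPTION tap_sub ENABLE;
```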
I tried your patch. It applied OK (albeit with whitespace warnings). The code builds and the TAP tests all pass.

Below are a few comments and observations.

COMMENTS
========

(1) PG Docs

catalogs.sgml documents the new column "suberrmsg" but does not document the other new columns ("disable_on_error", "disabled_by_error"). Should it?

------

(2) New column "disabled_by_error"

I wondered if there was actually any need for this column. Isn't the same information conveyed by just having "subenabled" = false at the same time as a non-empty "suberrmsg"? This would remove any confusion from having two booleans which both indicate disabled.

------

(3) New columns "disabled_by_error", "disable_on_error"

All other columns of pg_subscription have a "sub" prefix.

------

(4) errhint member used?

@@ -91,12 +100,16 @@ typedef struct Subscription
 	char	   *name;			/* Name of the subscription */
 	Oid			owner;			/* Oid of the subscription owner */
 	bool		enabled;		/* Indicates if the subscription is enabled */
+	bool		disable_on_error;	/* Whether errors automatically disable */
+	bool		disabled_by_error;	/* Whether an error has disabled */
 	bool		binary;			/* Indicates if the subscription wants data in
 								 * binary format */
 	bool		stream;			/* Allow streaming in-progress transactions. */
 	char	   *conninfo;		/* Connection string to the publisher */
 	char	   *slotname;		/* Name of the replication slot */
 	char	   *synccommit;		/* Synchronous commit setting for worker */
+	char	   *errmsg;			/* Message from error which disabled */
+	char	   *errhint;		/* Hint from error which disabled */
 	List	   *publications;	/* List of publication names to subscribe to */
 } Subscription;

I did not find any code using that newly added member "errhint".

------

(5) pg_dump.c

i. No mention of the new columns "disable_on_error" and "disabled_by_error". Is that right?

ii. Shouldn't the code for "suberrmsg" be qualified with some PG version number checks?

------

(6) The patch handles errors only from the apply worker.

Tablesync can give similar errors (e.g.
a PK violation during the DATASYNC phase), which will trigger re-launches forever regardless of the setting of "disable_on_error". (Confirmed by the observations below.)

------

(7) TAP test code

+$node_subscriber->init(allows_streaming => 'logical');

AFAIK that "logical" configuration is not necessary for the subscriber side, so this could just be:

$node_subscriber->init();

////////////

Some Experiments/Observations
==============================

In general, I found this functionality useful, and it works as advertised by your patch comment.

======

Test: Display pg_subscription with the new columns.
Observation: As expected, but some new column names are not prefixed like their peers.

test_sub=# \pset x
Expanded display is on.
test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+--------------------------------------------------------
oid               | 16394
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | t
disable_on_error  | t
disabled_by_error | f
subbinary        | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         |
subpublications   | {tap_pub}

======

Test: Cause a PK violation during normal apply replication (when "disable_on_error" = true).
Observation: Apply worker stops. Subscription is disabled. Error message is in the catalog.

2021-06-18 15:12:45.905 AEST [25904] LOG:  edata is true for subscription 'tap_sub': message = "duplicate key value violates unique constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:12:45.905 AEST [25904] LOG:  logical replication apply worker for subscription "tap_sub" will stop because the subscription was disabled due to error
2021-06-18 15:12:45.905 AEST [25904] ERROR:  duplicate key value violates unique constraint "test_tab_pkey"
2021-06-18 15:12:45.905 AEST [25904] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:12:45.908 AEST [19924] LOG:  background worker "logical replication worker" (PID 25904) exited with exit code 1

test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+---------------------------------------------------------------
oid               | 16394
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | f
disable_on_error  | t
disabled_by_error | t
subbinary         | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         | duplicate key value violates unique constraint "test_tab_pkey"
subpublications   | {tap_pub}

======

Test: Try to enable the subscription (without fixing the PK violation problem).
Observation: OK. It just stops again.

test_sub=# alter subscription tap_sub enable;
ALTER SUBSCRIPTION
test_sub=# 2021-06-18 15:17:18.067 AEST [10228] LOG:  logical replication apply worker for subscription "tap_sub" has started
2021-06-18 15:17:18.078 AEST [10228] LOG:  edata is true for subscription 'tap_sub': message = "duplicate key value violates unique constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:17:18.078 AEST [10228] LOG:  logical replication apply worker for subscription "tap_sub" will stop because the subscription was disabled due to error
2021-06-18 15:17:18.078 AEST [10228] ERROR:  duplicate key value violates unique constraint "test_tab_pkey"
2021-06-18 15:17:18.078 AEST [10228] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:17:18.079 AEST [19924] LOG:  background worker "logical replication worker" (PID 10228) exited with exit code 1

======

Test: Manually disable the subscription (which had previously already been disabled due to error).
Observation: OK. The suberrmsg gets reset to an empty string.
alter subscription tap_sub disable;

=====

Test: Turn off disable_on_error.
Observation: As expected, the apply worker now goes into a re-launch-forever loop every time it hits the PK violation.

test_sub=# alter subscription tap_sub set (disable_on_error=false);
ALTER SUBSCRIPTION
...

======

Test: Cause a PK violation in the tablesync copy (DATASYNC) phase (when disable_on_error = true).
Observation: This patch changes nothing for this case. The tablesync worker re-launches in a forever loop, the same as the current functionality.

test_sub=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH (disable_on_error=false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
test_sub=# 2021-06-18 15:38:19.547 AEST [18205] LOG:  logical replication apply worker for subscription "tap_sub" has started
2021-06-18 15:38:19.557 AEST [18207] LOG:  logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started
2021-06-18 15:38:19.610 AEST [18207] ERROR:  duplicate key value violates unique constraint "test_tab_pkey"
2021-06-18 15:38:19.610 AEST [18207] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:19.610 AEST [18207] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:19.611 AEST [19924] LOG:  background worker "logical replication worker" (PID 18207) exited with exit code 1
2021-06-18 15:38:24.634 AEST [18369] LOG:  logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started
2021-06-18 15:38:24.689 AEST [18369] ERROR:  duplicate key value violates unique constraint "test_tab_pkey"
2021-06-18 15:38:24.689 AEST [18369] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:24.689 AEST [18369] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:24.690 AEST [19924] LOG:  background worker "logical replication worker" (PID 18369) exited with exit code 1
2021-06-18 15:38:29.701 AEST [18521] LOG:  logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started
2021-06-18 15:38:29.765 AEST [18521] ERROR:  duplicate key value violates unique constraint "test_tab_pkey"
2021-06-18 15:38:29.765 AEST [18521] DETAIL:  Key (a)=(1) already exists.
2021-06-18 15:38:29.765 AEST [18521] CONTEXT:  COPY test_tab, line 1
2021-06-18 15:38:29.766 AEST [19924] LOG:  background worker "logical replication worker" (PID 18521) exited with exit code 1
etc...

-[ RECORD 1 ]-----+--------------------------------------------------------
oid               | 16399
subdbid           | 16384
subname           | tap_sub
subowner          | 10
subenabled        | t
disable_on_error  | f
disabled_by_error | f
subbinary         | f
substream         | f
subconninfo       | host=localhost dbname=test_pub application_name=tap_sub
subslotname       | tap_sub
subsynccommit     | off
suberrmsg         |
subpublications   | {tap_pub}

------

Kind Regards,
Peter Smith.
Fujitsu Australia