On Wed, Jan 17, 2024 at 7:15 PM Nisha Moond <nisha.moond...@gmail.com> wrote:
>
> >
> > ~~
> >
> > BTW, while experimenting with the bad connection ALTER I also tried
> > setting 'disable_on_error' like below:
> >
> > ALTER SUBSCRIPTION sub4 SET (disable_on_error);
> > ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';
> >
> > ...but here the subscription did not become DISABLED as I expected it
> > would do on the next connection error iteration. It remains enabled
> > and just continues to loop relaunch/ERROR indefinitely same as before.
> >
> > That looks like it may be a bug. Thoughts?
> >
> Ideally, if the already running apply worker in
> "LogicalRepApplyLoop()" has any exception/error it will be handled and
> the subscription will be disabled if 'disable_on_error' is set -
>
> start_apply(XLogRecPtr origin_startpos)
> {
>     PG_TRY();
>     {
>         LogicalRepApplyLoop(origin_startpos);
>     }
>     PG_CATCH();
>     {
>         if (MySubscription->disableonerr)
>             DisableSubscriptionAndExit();
> ...
>
> What is happening in this case is that the control reaches the function -
> run_apply_worker() -> start_apply() -> LogicalRepApplyLoop ->
> maybe_reread_subscription()
> ...
> /*
>  * Exit if any parameter that affects the remote connection was changed.
>  * The launcher will start a new worker but note that the parallel apply
>  * worker won't restart if the streaming option's value is changed from
>  * 'parallel' to any other value or the server decides not to stream the
>  * in-progress transaction.
>  */
> if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
> ...
>
> and it sees a change in the parameter and calls apply_worker_exit().
> This will exit the current process, without throwing an exception to
> the caller and the postmaster will try to restart the apply worker.
> The new apply worker, before reaching the start_apply() [where we
> handle exception], will hit the code to establish the connection to
> the publisher -
>
> ApplyWorkerMain() -> run_apply_worker() -
> ...
> LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
>                                         true /* replication */ ,
>                                         true,
>                                         must_use_password,
>                                         MySubscription->name, &err);
>
> if (LogRepWorkerWalRcvConn == NULL)
>     ereport(ERROR,
>             (errcode(ERRCODE_CONNECTION_FAILURE),
>              errmsg("could not connect to the publisher: %s", err)));
> ...
> and due to the bad connection string in the subscription, it will error out.
> [28680] ERROR: could not connect to the publisher: invalid port number: "-1"
> [3196] LOG: background worker "logical replication apply worker" (PID
> 28680) exited with exit code 1
>
> Now, the postmaster keeps trying to restart the apply worker and it
> will keep failing until the connection string is corrected or the
> subscription is disabled manually.
>
> I think this is a bug that needs to be handled in run_apply_worker()
> when disable_on_error is set.
> IMO, this bug-fix discussion deserves a separate thread. Thoughts?

Hi Nisha,

Thanks for your analysis -- it is the same as my understanding. As
suggested, I have created a new thread for any further discussion
related to this 'disable_on_error' topic [1].

======
[1] https://www.postgresql.org/message-id/flat/CAHut%2BPuEsekA3e7ThwzWr%2BUs4x%3DLzkF7DSrED1UsZTUqNrhCUQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
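
P.S. For anyone finding this in the archives, the control flow described
in the quoted analysis can be modelled with a small standalone C sketch.
This is not PostgreSQL code. ToySubscription, toy_apply_loop(),
toy_run_worker() and toy_launcher() are hypothetical names, plain
setjmp()/longjmp() stands in for PG_TRY()/PG_CATCH(), and an early
return stands in for the ereport(ERROR) raised when the connection
fails. It only illustrates why an error raised before the handler is
installed never reaches the 'disable_on_error' branch.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a subscription; NOT the PostgreSQL catalog struct. */
typedef struct ToySubscription
{
    bool enabled;          /* models the subscription's enabled state */
    bool disable_on_error; /* models the 'disable_on_error' option */
    bool conninfo_ok;      /* false models a bad string like 'port = -1' */
} ToySubscription;

typedef enum
{
    WORKER_OK,           /* apply loop finished without error */
    WORKER_DISABLED_SUB, /* error was caught and the subscription disabled */
    WORKER_ERROR_EXIT    /* worker exited with an error, nothing disabled */
} ToyResult;

/* Jump target for the toy "PG_CATCH" block. */
static jmp_buf toy_catch;

/* Toy apply loop: on failure, jump to the handler set up by the caller. */
void toy_apply_loop(bool fails)
{
    if (fails)
        longjmp(toy_catch, 1);
}

/* Toy worker: connect first, then run the apply loop under a handler. */
ToyResult toy_run_worker(ToySubscription *sub, bool apply_fails)
{
    assert(sub != NULL);

    /*
     * Step 1: connect. No handler is installed yet, so a failure here
     * leaves the worker without ever consulting disable_on_error.
     */
    if (!sub->conninfo_ok)
        return WORKER_ERROR_EXIT;

    /* Step 2: the apply loop runs inside the toy try/catch. */
    if (setjmp(toy_catch) == 0)
    {
        toy_apply_loop(apply_fails);
        return WORKER_OK;
    }
    else
    {
        /* Only errors raised inside the apply loop land here. */
        if (sub->disable_on_error)
        {
            sub->enabled = false;
            return WORKER_DISABLED_SUB;
        }
        return WORKER_ERROR_EXIT;
    }
}

/* Toy launcher: relaunch the worker while the subscription is enabled. */
int toy_launcher(ToySubscription *sub, bool apply_fails, int max_launches)
{
    int launches = 0;

    while (sub->enabled && launches < max_launches)
    {
        launches++;
        if (toy_run_worker(sub, apply_fails) == WORKER_OK)
            break;
    }
    return launches;
}
```

With conninfo_ok = false and disable_on_error = true, toy_launcher()
uses up every launch it is allowed and the subscription stays enabled,
which matches the relaunch/ERROR loop seen above. With a good
connection and a failing apply loop, the first launch disables the
subscription and the loop stops.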