On Tue, 31 Dec 2024 at 02:48, Peter Smith <smithpb2...@gmail.com> wrote:
>
> On Thu, Dec 26, 2024 at 1:37 AM vignesh C <vignes...@gmail.com> wrote:
> >
> > Hi,
> >
> > Currently, we restart the table synchronization worker after the
> > duration specified by wal_retrieve_retry_interval following the last
> > failure. While this behavior is documented for apply workers, it is
> > not mentioned for table synchronization workers. I believe this detail
> > should be included in the documentation for table synchronization
> > workers as well. Attached is a patch to address this omission.
> >
> > Regards,
> > Vignesh
>
> Hi Vignesh,
>
> Here are some review comments for your v1 patch.
>
> +1 to enhance the documentation.
>
> ======
>
> 1.
> <para>
> In logical replication, this parameter also limits how often a failing
> - replication apply worker will be respawned.
> + replication apply worker, and table synchronization worker will be
> + respawned.
> </para>
>
> /, and/or/
>
> SUGGESTION
> In logical replication, this parameter also limits how often a failing
> replication apply worker or table synchronization worker will be
> respawned.
Modified

> ======
>
> 2.
> I think the reader might never be aware of any of this (throttled
> relaunch) behaviour unless they accidentally stumble across the docs
> for this GUC, so IMO this information should be mentioned elsewhere --
> wherever the tablesync worker errors are documented. But, TBH, I can't
> find anywhere in the PostgreSQL docs where it even mentions
> re-launching failed tablesync workers!
>
> Anyway, I think it might be good to include such information in some
> suitable place (maybe in the CREATE SUBSCRIPTION notes? or maybe in
> Chapter 29?) to say something like...
>
> SUGGESTION:
> In practice, if a table synchronization worker fails during logical
> replication, the apply worker detects the failure and attempts to
> respawn the table synchronization worker to continue the
> synchronization process. This behaviour ensures that transient errors
> do not permanently disrupt the replication setup. See also
> wal_retrieve_retry_interval.

Yes, adding it to the logical replication "Initial Snapshot" section
seemed more appropriate to me. The attached v2 patch has the changes
for the same.

Regards,
Vignesh
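To observe the throttled relaunch described above on a subscriber, a
minimal sketch (not part of the patch; the 1s value is only for
demonstration, the default is 5s):

    -- Shorten the relaunch throttle; wal_retrieve_retry_interval is
    -- reloadable, so no restart is needed.
    ALTER SYSTEM SET wal_retrieve_retry_interval = '1s';
    SELECT pg_reload_conf();

    -- Per-table sync state: a repeatedly failing tablesync worker stays
    -- in state 'd' (data is being copied) and is respawned no sooner
    -- than wal_retrieve_retry_interval after the previous failure.
    SELECT srrelid::regclass AS tbl, srsubstate
      FROM pg_subscription_rel;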
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbdd6ce574..b58c7f25f7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5094,7 +5094,8 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
        </para>
        <para>
         In logical replication, this parameter also limits how often a failing
-        replication apply worker will be respawned.
+        replication apply worker or table synchronization worker will be
+        respawned.
        </para>
       </listitem>
      </varlistentry>
diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 8290cd1a08..925e0dd101 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -1993,18 +1993,17 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
   <title>Initial Snapshot</title>
   <para>
    The initial data in existing subscribed tables are snapshotted and
-   copied in a parallel instance of a special kind of apply process.
-   This process will create its own replication slot and copy the existing
-   data. As soon as the copy is finished the table contents will become
-   visible to other backends. Once existing data is copied, the worker
-   enters synchronization mode, which ensures that the table is brought
-   up to a synchronized state with the main apply process by streaming
-   any changes that happened during the initial data copy using standard
-   logical replication. During this synchronization phase, the changes
-   are applied and committed in the same order as they happened on the
-   publisher. Once synchronization is done, control of the
-   replication of the table is given back to the main apply process where
-   replication continues as normal.
+   copied in a parallel instance of a special kind of table synchronization
+   worker process. This process will create its own replication slot and copy
+   the existing data. As soon as the copy is finished the table contents will
+   become visible to other backends. Once existing data is copied, the worker
+   enters synchronization mode, which ensures that the table is brought up to
+   a synchronized state with the main apply process by streaming any changes
+   that happened during the initial data copy using standard logical
+   replication. During this synchronization phase, the changes are applied
+   and committed in the same order as they happened on the publisher. Once
+   synchronization is done, control of the replication of the table is given
+   back to the main apply process where replication continues as normal.
   </para>
   <note>
    <para>
@@ -2015,6 +2014,15 @@ CONTEXT:  processing remote data for replication origin "pg_16395" during "INSER
     when copying the existing table data.
    </para>
   </note>
+  <note>
+   <para>
+    If a table synchronization worker fails during copy, the apply worker
+    detects the failure and respawns the table synchronization worker to
+    continue the synchronization process. This behaviour ensures that
+    transient errors do not permanently disrupt the replication setup. See
+    also <link linkend="guc-wal-retrieve-retry-interval"><varname>wal_retrieve_retry_interval</varname></link>.
+   </para>
+  </note>
  </sect2>

 </sect1>
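For reference, the failure-and-respawn loop that the new <note>
describes can be reproduced with a deliberately conflicting initial
copy (a sketch, not from the patch; the table, publication, and
subscription names and the connection string are illustrative):

    -- Publisher
    CREATE TABLE t (id int PRIMARY KEY);
    INSERT INTO t VALUES (1);
    CREATE PUBLICATION pub FOR TABLE t;

    -- Subscriber: the pre-existing duplicate row makes the initial COPY
    -- hit a unique violation, so the tablesync worker exits and the
    -- apply worker respawns it after each wal_retrieve_retry_interval
    -- until the conflicting row is removed.
    CREATE TABLE t (id int PRIMARY KEY);
    INSERT INTO t VALUES (1);
    CREATE SUBSCRIPTION sub
      CONNECTION 'host=localhost port=5432 dbname=postgres'
      PUBLICATION pub;

With this setup the subscriber log should show the table
synchronization worker repeatedly starting and failing with a
unique-violation error; deleting the conflicting row lets the copy
complete and the table reach the ready state.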