On Tuesday, January 16, 2024 9:27 AM Peter Smith <smithpb2...@gmail.com> wrote: > > Here are some review comments for patch v61-0002
Thanks for the comments. > > ====== > doc/src/sgml/logical-replication.sgml > > 1. > + <sect2 id="logical-replication-failover-examples"> > + <title>Examples: logical replication failover</title> > > The current documentation structure (after the patch is applied) looks > like this: > > 30.1. Publication > 30.2. Subscription > 30.2.1. Replication Slot Management > 30.2.2. Examples: Set Up Logical Replication > 30.2.3. Examples: Deferred Replication Slot Creation > 30.2.4. Examples: logical replication failover > > I don't think it is ideal. > > Firstly, I think this new section is not just "Examples:"; it is more > like instructions for steps to check if a successful failover is > possible. IMO call it something like "Logical Replication Failover" or > "Replication Slot Failover". > > Secondly, I don't think this new section strictly belongs underneath > the "Subscription" section anymore because IMO it is just as much > about the promotion of the publications. Now that you are adding this > new (2nd) section about slots, I think the whole structure of this > document should be changed like below: > > SUGGESTION #1 (make a new section 30.3 just for slot-related topics) > > 30.1. Publication > 30.2. Subscription > 30.2.1. Examples: Set Up Logical Replication > 30.3. Logical Replication Slots > 30.3.1. Replication Slot Management > 30.3.2. Examples: Deferred Replication Slot Creation > 30.3.3. Logical Replication Failover > > ~ > > SUGGESTION #2 (keep the existing structure, but give the failover its > own new section 30.3) > > 30.1. Publication > 30.2. Subscription > 30.2.1. Replication Slot Management > 30.2.2. Examples: Set Up Logical Replication > 30.2.3. Examples: Deferred Replication Slot Creation > 30.3 Logical Replication Failover I used this version for now as I am sure about changing other section. > > ~ > > SUGGESTION #2a (and maybe later you can extract some of the failover > examples further) > > 30.1. Publication > 30.2. Subscription > 30.2.1. Replication Slot Management > 30.2.2. Examples: Set Up Logical Replication > 30.2.3. Examples: Deferred Replication Slot Creation > 30.3 Logical Replication Failover > 30.3.1. Examples: Checking if failover ready > > ~~~ > > 2. > + <para> > + In a logical replication setup, if the publisher server is also the > primary > + server of the streaming replication, the logical slots on the > primary server > + can be synchronized to the standby server by specifying > <literal>failover = true</literal> > + when creating the subscription. Enabling failover ensures a seamless > + transition of the subscription to the promoted standby, allowing it to > + subscribe to the new primary server without any data loss. > + </para> > > I was initially confused by the wording. How about like below: > > SUGGESTION > When the publisher server is the primary server of a streaming > replication, the logical slots on that primary server can be > synchronized to the standby server by specifying <literal>failover = > true</literal> when creating subscriptions for those publications. > Enabling failover ensures a seamless transition of those subscriptions > after the standby is promoted. They can continue subscribing to > publications now on the new primary server without any data loss. Changed as suggested. > > ~~~ > > 3. > + <para> > + However, the replication slots are copied asynchronously, which > means it's necessary > + to confirm that replication slots have been synced to the standby server > + before the failover happens. Additionally, to ensure a successful > failover, > + the standby server must not lag behind the subscriber. To confirm > + that the standby server is ready for failover, follow these steps: > + </para> > > Minor rewording > > SUGGESTION > Because the slot synchronization logic copies asynchronously, it is > necessary to confirm that replication slots have been synced to the > standby server before the failover happens. Furthermore, to ensure a > successful failover, the standby server must not be lagging behind the > subscriber. To confirm that the standby server is indeed ready for > failover, follow these 2 steps: Changed as suggested. > > ~~~ > > 4. > The instructions said "follow these steps", so the next parts should > be rendered as 2 "steps" (using <procedure> markup?) > > SUGGESTION (show as steps 1,2 and also some minor rewording of the > step heading) > > 1. Confirm that all the necessary logical replication slots have been > synced to the standby server. > 2. Confirm that the standby server is not lagging behind the subscribers. > Changed as suggested. > ~~~ > > 5. > + <para> > + Check if all the necessary logical replication slots have been synced to > + the standby server. > + </para> > > SUGGESTION > Confirm that all the necessary logical replication slots have been > synced to the standby server. > Changed as suggested. > ~~~ > > 6. > + <listitem> > + <para> > + On logical subscriber, fetch the slot names that should be synced to > the > + standby that we plan to promote. > > SUGGESTION > Firstly, on the subscriber node, use the following SQL to identify the > slot names that should be... > Changed as suggested. > ~~~ > > 7. > +<programlisting> > +test_sub=# SELECT > + array_agg(slotname) AS slots > + FROM > + (( > + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || > '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname > + FROM pg_control_system() ctl, pg_subscription_rel r, > pg_subscription s > + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND > s.subfailover > + ) UNION ( > + SELECT oid AS subid, subslotname as slotname > + FROM pg_subscription > + WHERE subfailover > + )); > > 7a > Maybe this ought to include "pg_catalog" schemas? After searching other query examples, I think most of them don’t add this for either function or system table. So, I didn’t add this. > > ~ > > 7b. > For consistency, maybe it is better to use a table alias "FROM > pg_subscription s" in the UNION also Added. > > ~~~ > > 8. > + <listitem> > + <para> > + Check that the logical replication slots exist on the standby server. > > SUGGESTION > Next, check that the logical replication slots identified above exist > on the standby server. Changed as suggested. > > ~~~ > > 9. > +<programlisting> > +test_standby=# SELECT bool_and(synced AND NOT temporary AND > conflict_reason IS NULL) AS failover_ready > + FROM pg_replication_slots > + WHERE slot_name in ('slots'); > + failover_ready > +---------------- > + t > > 9a. > Maybe this ought to include "pg_catalog" schemas? Same as above. > > ~ > > 9b. > IIUC that 'slots' reference is supposed to be those names that were > found in the prior step. If so, then that point needs to be made > clear, and anyway in this case 'slots' is not compatible with the > 'sub' name returned by your first SQL. Changed as suggested. > > ~~~ > > 10. > + <listitem> > + <para> > + Query the last replayed WAL on the logical subscriber. > > SUGGESTION > Firstly, on the subscriber node check the last replayed WAL. > Changed as suggested. > ~~~ > > 11. > +<programlisting> > +test_sub=# SELECT > + MAX(remote_lsn) AS remote_lsn_on_subscriber > + FROM > + (( > + SELECT (CASE WHEN r.srsubstate = 'f' THEN > pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' || > r.srrelid), false) > + WHEN r.srsubstate = 's' THEN r.srsublsn > END) as remote_lsn > + FROM pg_subscription_rel r, pg_subscription s > + WHERE r.srsubstate IN ('f', 's') AND s.oid = r.srsubid > AND s.subfailover > + ) UNION ( > + SELECT pg_replication_origin_progress(CONCAT('pg_' || > s.oid), false) AS remote_lsn > + FROM pg_subscription s > + WHERE subfailover > + )); > > 11a. > Maybe this ought to include "pg_catalog" schemas? Same as above. > > ~ > > 11b. > /WHERE subfailover/WHERE s.subfailover/ > > ~~~ > > 12. > + <listitem> > + <para> > + On the standby server, check that the last-received WAL location > + is ahead of the replayed WAL location on the subscriber. > > SUGGESTION > Next, on the standby server check that the last-received WAL location > is ahead of the replayed WAL location on the subscriber identified > above. > Changed as suggested. > ~~~ > > 13. > +</programlisting></para> > + </listitem> > + <listitem> > + <para> > + On the standby server, check that the last-received WAL location > + is ahead of the replayed WAL location on the subscriber. > +<programlisting> > +test_standby=# SELECT pg_last_wal_receive_lsn() >= > 'remote_lsn_on_subscriber'::pg_lsn AS failover_ready; > + failover_ready > +---------------- > + t > > IIUC the 'remote_lsn_on_subscriber' is supposed to represent the > substitution of the value found in the subscriber server. In this > example maybe it would be: > SELECT pg_last_wal_receive_lsn() >= '0/3000388'::pg_lsn AS failover_ready; > > maybe that point can be made more clearly. I have changed it to use the actual LSN got in last step. > > ~~~ > > 14. > + <para> > + If the result (failover_ready) of both above steps is true, it means it > is > + okay to subscribe to the standby server. > + </para> > > 14a. > failover_ready should be rendered as literal. Added. > > ~ > > 14b. > Does this say what you intended, or did you mean something more like > "the standby can be promoted and existing subscriptions will be able > to continue without data loss" I used the later part of your suggestion as I think promotion depends not only on logical replication part. Best Regards, Hou zj