On Tue, Sep 17, 2019 at 9:41 PM Michael Paquier <mich...@paquier.xyz> wrote:

> On Tue, Sep 17, 2019 at 08:38:18AM -0400, James Coleman wrote:
> > I don't agree that that's a valid equivalency. I myself spent a lot of
> > time trying to understand how this could possibly be true a while
> > back, and even looked at source code to be certain. I've asked other
> > people and found the same confusion.
> >
> > As I read it the 2nd sentence doesn't actually tell you the
> > differences; it makes a quick attempt at summarizing *how* the first
> > sentence is true, but if the first sentence isn't accurate, then it's
> > hard to read the 2nd one as helping.
>
> Well, then it comes back to the part where I am used to the existing
> docs :)
>
> > If you'd prefer something less detailed at that point in
> > the docs, then something along the lines of "results in a data
> > directory state which can then be safely replayed from the source" or
> > some such.
>
> Actually this is a good suggestion, and could replace the first
> sentence of this paragraph.
>
> > The docs shouldn't be correct just for someone who already understands
> > the intricacies. And the end user shouldn't have to read the "how it
> > works" (which incidentally is kinda hidden at the bottom underneath
> > the CLI args -- perhaps we could move that?) to extrapolate things in
> > the primary documentation.
>
> Perhaps.  This doc page is not that long either.
>

I'd set this aside for quite a while, but I was looking at it again this
afternoon, and I've come to see your concern about the opening paragraphs
remaining relatively simple. To that end I believe I've come up with a
patch that's a good compromise: retaining that simplicity and being more
clear and accurate at the same time.

In the first paragraph I've updated it to refer to both "successful rewind
and subsequent WAL replay" and the result I describe as being equivalent to
the result of a base backup, since that's more technically correct anyway
(the current text could be read as implying a full out copy of the data
directory, but that's not really true just as it isn't with pg_basebackup).

I've added the information about how the backup label control file is
written, and updated the How It Works steps to refer to that separately
from restart.

Additionally, the How It Works section is updated to include WAL segments and new
relation files in the list of files copied wholesale, since that was
previously stated but somewhat contradicted there.
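
As a rough illustration of the three categories the updated How It Works
text describes (copy only changed blocks, copy in full, omit), here is a
small Python sketch. It is not pg_rewind's actual implementation: the
classify() helper, the category names, and the relation-file pattern are
all assumptions made for the example; only the omitted directory and file
names are taken from the docs.

```python
import re

# Directories whose contents are omitted, as listed in the How It Works text.
OMITTED_DIRS = {
    "pg_dynshmem", "pg_notify", "pg_replslot", "pg_serial",
    "pg_snapshots", "pg_stat_tmp", "pg_subtrans",
}

# Individual files that are omitted, as listed in the How It Works text.
OMITTED_FILES = {
    "backup_label", "tablespace_map", "pg_internal.init",
    "postmaster.opts", "postmaster.pid",
}

# Assumption: a deliberately simplified pattern for relation files such as
# base/16384/2619 or global/1262.1; the real tool is more thorough.
RELATION_FILE = re.compile(r"^(base/\d+|global)/\d+(\.\d+)?$")


def classify(path, exists_on_target):
    """Return how a source file (path relative to the data directory)
    would be treated in this simplified model."""
    parts = path.split("/")
    # Anything beginning with pgsql_tmp is skipped, file or directory.
    if any(part.startswith("pgsql_tmp") for part in parts):
        return "omit"
    if parts[-1] in OMITTED_FILES:
        return "omit"
    # Only the contents of these directories are skipped.
    if len(parts) > 1 and parts[0] in OMITTED_DIRS:
        return "omit"
    # Existing relation files: only the changed blocks are copied.
    if RELATION_FILE.match(path) and exists_on_target:
        return "changed-blocks-only"
    # New relation files, WAL segments, pg_xact, configuration files, etc.
    return "copy-whole"
```

For example, classify("base/16384/2619", True) falls into the
changed-blocks category, while the same path with exists_on_target=False
(a relation file created on the source after the divergence) is copied in
full, just like a WAL segment or postgresql.conf.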

I realized I didn't previously add this to the CF; since it's not a new
patch I've added it to the current CF, but if this is incorrect please let
me know.

Thanks,
James
From 592ba15c35bb16e55b0bb0a7e7bdbb6dd4e08a0b Mon Sep 17 00:00:00 2001
From: James Coleman <jtc...@gmail.com>
Date: Sun, 8 Mar 2020 16:39:45 -0400
Subject: [PATCH v3] Improve pg_rewind explanation and warnings

The pg_rewind docs currently assert that the state of the target's
data directory after rewind is equivalent to the source's data
directory. But that isn't quite true both because the base state is
further back in time and because the target's data directory will
include the current state on the source of any copied blocks.
Additionally the state isn't equal to a copy of the source data
directory; it's equivalent to a base backup of the source.

The How It Works section now:
- Includes details about how the backup label file is created.
- Is updated to include WAL segments and new relation files in the
  list of files copied wholesale from the source.

Finally, document clearly the state of the cluster after the operation
and also the operation sequencing dangers caused by copying
configuration files from the source.
---
 doc/src/sgml/ref/pg_rewind.sgml | 87 ++++++++++++++++++++-------------
 1 file changed, 54 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 42d29edd4e..bc6f0009cc 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -48,14 +48,16 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</application> over taking a new base backup, or
-   tools like <application>rsync</application>, is that <application>pg_rewind</application> does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind and subsequent WAL replay, the target data
+   directory is equivalent to a base backup of the source data directory. While
+   only changed blocks from existing relation files are copied, all other files
+   are copied in full, including new relation files, configuration files, and WAL
+   segments. The advantage of <application>pg_rewind</application> over taking a
+   new base backup, or tools like <application>rsync</application>, is that
+   <application>pg_rewind</application> does not require comparing or copying
+   unchanged relation blocks in the cluster. As such, the rewind operation is
+   significantly faster than other approaches when the database is large and
+   only a small fraction of blocks differ between the clusters.
   </para>
 
   <para>
@@ -77,16 +79,18 @@ PostgreSQL documentation
   </para>
 
   <para>
-   When the target server is started for the first time after running
-   <application>pg_rewind</application>, it will go into recovery mode and replay all
-   WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   <application>pg_rewind</application> was run, and therefore could not be copied by the
-   <application>pg_rewind</application> session, it must be made available when the
-   target server is started. This can be done by creating a
-   <filename>recovery.signal</filename> file in the target data directory
-   and configuring suitable <xref linkend="guc-restore-command"/>
-   in <filename>postgresql.conf</filename>.
+   After running <application>pg_rewind</application> the data directory is
+   not immediately in a consistent state. However,
+   <application>pg_rewind</application> configures the control file so that when
+   the target server is started again it will enter recovery mode and replay all
+   WAL generated in the source server after the point of divergence. If some of
+   the WAL was no longer available in the source server when
+   <application>pg_rewind</application> was run, and therefore could not be
+   copied by the <application>pg_rewind</application> session, it must be made
+   available when the target server is started. This can be done by creating a
+   <filename>recovery.signal</filename> file in the target data directory and
+   configuring suitable <xref linkend="guc-restore-command"/> in
+   <filename>postgresql.conf</filename>.
   </para>
 
   <para>
@@ -105,6 +109,15 @@ PostgreSQL documentation
     recovered.  In such a case, taking a new fresh backup is recommended.
    </para>
 
+   <para>
+    Because <application>pg_rewind</application> copies configuration files
+    entirely from the source, correcting recovery configuration options before
+    restarting the server is necessary if you intend to re-introduce the target
+    as a replica of the source. If you restart the server after the rewind
+    operation has finished but without configuring recovery, the target will
+    again diverge from the primary.
+   </para>
+
    <para>
     <application>pg_rewind</application> will fail immediately if it finds
     files it cannot write directly to.  This can happen for example when
@@ -326,34 +339,42 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
       Copy all those changed blocks from the source cluster to
       the target cluster, either using direct file system access
       (<option>--source-pgdata</option>) or SQL (<option>--source-server</option>).
+      The relation files are now in their state as of the last checkpoint
+      completed prior to the point at which the WAL timelines of the source and
+      target diverged, plus the current state on the source of any blocks
+      changed on the target after that divergence.
      </para>
     </step>
     <step>
      <para>
-      Copy all other files such as <filename>pg_xact</filename> and
-      configuration files from the source cluster to the target cluster
-      (everything except the relation files). Similarly to base backups,
-      the contents of the directories <filename>pg_dynshmem/</filename>,
+      Copy all other files, including new relation files, WAL segments,
+      <filename>pg_xact</filename>, and configuration files from the source
+      cluster to the target cluster. Similarly to base backups, the contents
+      of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
       <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
-      <filename>pg_subtrans/</filename> are omitted from the data copied
-      from the source cluster. Any file or directory beginning with
-      <filename>pgsql_tmp</filename> is omitted, as well as are
+      <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
+      are omitted from the data copied from the source cluster. The files
       <filename>backup_label</filename>,
       <filename>tablespace_map</filename>,
       <filename>pg_internal.init</filename>,
-      <filename>postmaster.opts</filename> and
-      <filename>postmaster.pid</filename>.
+      <filename>postmaster.opts</filename>, and
+      <filename>postmaster.pid</filename>, as well as any file or directory
+      beginning with <filename>pgsql_tmp</filename>, are omitted.
+     </para>
+    </step>
+    <step>
+     <para>
+      Create a backup label file to begin WAL replay at the checkpoint created
+      at failover, and set a minimum consistency LSN: the value of
+      <literal>pg_current_wal_insert_lsn()</literal> when using a live source,
+      or the last checkpoint LSN when using a stopped source.
      </para>
     </step>
     <step>
      <para>
-      Apply the WAL from the source cluster, starting from the checkpoint
-      created at failover. (Strictly speaking, <application>pg_rewind</application>
-      doesn't apply the WAL, it just creates a backup label file that
-      makes <productname>PostgreSQL</productname> start by replaying all WAL from
-      that checkpoint forward.)
+      On restart, <productname>PostgreSQL</productname> replays the required WAL,
+      resulting in a consistent data directory state.
      </para>
     </step>
    </procedure>

base-commit: 691e8b2e1889d61df47ae76601fa9db6cbac6f1c
-- 
2.17.1
