Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

Nikolay Samokhvalov Mon, 10 Jul 2023 14:37:51 -0700

On Mon, Jul 10, 2023 at 2:02 PM Michael Banck <[email protected]> wrote:


> Thanks for that!
>

Thanks for the fast review.


>
> > I agree with Andrey – without it, we don't have any good way to upgrade
> > large clusters in short time. Default rsync mode (without "--size-only")
> > takes a lot of time too, if the load is heavy.
> >
> > With these adjustments, can "rsync --size-only" remain in the docs as the
> > *fast* and safe method to upgrade standbys, or there are still some
> > concerns related to corruption risks?
>
> I hope somebody can answer that definitively, but I read Stephen's mail
> to indicate that this procedure should be safe in principle (if you know
> what you are doing).
>

right, this is my understanding too


> > +++ b/doc/src/sgml/ref/pgupgrade.sgml
> > @@ -380,22 +380,28 @@ NET STOP postgresql-&majorversion;
> >      </para>
> >
> >      <para>
> > -     Streaming replication and log-shipping standby servers can
> > +     Streaming replication and log-shipping standby servers must
> >       remain running until a later step.
> >      </para>
> >     </step>
> >
> > -   <step>
> > +   <step id="pgupgrade-step-prepare-standbys">
> >
> >      <para>
> > -     If you are upgrading standby servers using methods outlined in
> section <xref
> > -     linkend="pgupgrade-step-replicas"/>, verify that the old standby
> > -     servers are caught up by running
> <application>pg_controldata</application>
> > -     against the old primary and standby clusters.  Verify that the
> > -     <quote>Latest checkpoint location</quote> values match in all
> clusters.
> > -     (There will be a mismatch if old standby servers were shut down
> > -     before the old primary or if the old standby servers are still
> running.)
> > +     If you are upgrading standby servers using methods outlined in
> > +     <xref linkend="pgupgrade-step-replicas"/>,
>
> You dropped the "section" before the xref, I think that should be kept
> around.
>

Seems to be a problem in discussing source code that looks quite different
than the final result.

In the result – the docs – we currently have "section Step 9", looking
weird. I still think it's good to remove it. We also have "in Step 17
below" (without the extra word "section") in a different place on the same
page.


>
> > +                                                ensure that they were
> running when
> > +     you shut down the primaries in the previous step, so all the
> latest changes
>
> You talk of primaries in plural here, that is a bit weird for PostgreSQL
> documentation.
>

The same docs already discuss two primaries ("8. Stop both primaries"), but
I agree it might look confusing if you read only a part of the doc, jumping
into middle of it, like I do all the time when using the docs in "check the
reference" style.

I agree with this comment, but it tells me we need even more improvements
of this doc, beyond my original goal – e.g., I don't like section 8 saying
"Make sure both database servers", it should be "both primaries".


>
> > +     and the shutdown checkpoint record were received. You can verify
> this by running
> > +     <application>pg_controldata</application> against the old primary
> and standby
> > +     clusters. The <quote>Latest checkpoint location</quote> values
> must match in all
> > +     nodes. A mismatch might occur if old standby servers were shut
> down before
> > +     the old primary. To fix a mismatch, start all old servers and
> return to the
> > +     previous step; proceeding with mismatched
> > +     <quote>Latest checkpoint location</quote> may lead to standby
> corruption.
> > +    </para>
> > +
> > +    <para>
> >       Also, make sure <varname>wal_level</varname> is not set to
> >       <literal>minimal</literal> in the
> <filename>postgresql.conf</filename> file on the
> >       new primary cluster.
> > @@ -497,7 +503,6 @@ pg_upgrade.exe
> >       linkend="warm-standby"/>) standby servers, you can follow these
> steps to
> >       quickly upgrade them.  You will not be running
> <application>pg_upgrade</application> on
> >       the standby servers, but rather <application>rsync</application>
> on the primary.
> > -     Do not start any servers yet.
> >      </para>
> >
> >      <para>
> > @@ -508,6 +513,15 @@ pg_upgrade.exe
> >       is running.
> >      </para>
> >
> > +    <para>
> > +     Before running rsync, to avoid standby corruption, it is absolutely
> > +     critical to ensure that both primaries are shut down and standbys
> > +     have received the last changes (see <xref
> linkend="pgupgrade-step-prepare-standbys"/>).
>
> I think this should be something like "ensure both that the primary is
> shut down and that the standbys have received all the changes".
>

Well, we have two primary servers – old and new. I tried to clarify it in
the new version.


>
> > +     Standbys can be running at this point or fully stopped.
>
> "or be fully stopped." I think.
>
> > +                                                             If they
> > +     are still running, you can stop, upgrade, and start them one by
> one; this
> > +     can be useful to keep the cluster open for read-only transactions.
> > +    </para>
>
> Maybe this is clear from the context, but "upgrade" in the above should
> maybe more explicitly refer to the rsync method or else people might
> think one can run pg_upgrade on them after all?
>

Maybe. It will require changes in other parts of this doc.
Thinking (here:
https://gitlab.com/postgres/postgres/-/merge_requests/18/diffs)

Meanwhile, attached is v2

thanks for the comments

From: Nikolay Samokhvalov <[email protected]>
Date: Mon, 10 Jul 2023 20:07:18 +0000
Subject: [PATCH] Refine instructions for major upgrades on standby servers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Recipe involving pg_upgrade -k and rsync --size-only can be dangerous
– clarify what to do to avoid standby corruption.

Discussion: https://www.postgresql.org/message-id/flat/CAM527d8heqkjG5VrvjU3Xjsqxg41ufUyabD9QZccdAxnpbRH-Q%40mail.gmail.com
---
 doc/src/sgml/ref/pgupgrade.sgml | 39 +++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7816b4c6859..41242bb4200 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -361,10 +361,10 @@ make prefix=/usr/local/pgsql.new install
    </step>
 
    <step>
-    <title>Stop both servers</title>
+    <title>Stop both primary servers</title>
 
     <para>
-     Make sure both database servers are stopped using, on Unix, e.g.:
+     Make sure both primary servers are stopped using, on Unix, e.g.:
 
 <programlisting>
 pg_ctl -D /opt/PostgreSQL/9.6 stop
@@ -380,22 +380,28 @@ NET STOP postgresql-&majorversion;
     </para>
 
     <para>
-     Streaming replication and log-shipping standby servers can
+     Streaming replication and log-shipping standby servers must
      remain running until a later step.
     </para>
    </step>
 
-   <step>
+   <step id="pgupgrade-step-prepare-standbys">
     <title>Prepare for standby server upgrades</title>
 
     <para>
-     If you are upgrading standby servers using methods outlined in section <xref
-     linkend="pgupgrade-step-replicas"/>, verify that the old standby
-     servers are caught up by running <application>pg_controldata</application>
-     against the old primary and standby clusters.  Verify that the
-     <quote>Latest checkpoint location</quote> values match in all clusters.
-     (There will be a mismatch if old standby servers were shut down
-     before the old primary or if the old standby servers are still running.)
+     If you are upgrading standby servers using methods outlined in 
+     <xref linkend="pgupgrade-step-replicas"/>, ensure that they were running when 
+     you shut down the primaries in the previous step, so all the latest changes 
+     and the shutdown checkpoint record were received. You can verify this by running 
+     <application>pg_controldata</application> against the old primary and standby 
+     clusters. The <quote>Latest checkpoint location</quote> values must match in all 
+     nodes. A mismatch might occur if old standby servers were shut down before 
+     the old primary. To fix a mismatch, start all old servers and return to the 
+     previous step; proceeding with mismatched 
+     <quote>Latest checkpoint location</quote> may lead to standby corruption.
+    </para>
+
+    <para>
      Also, make sure <varname>wal_level</varname> is not set to
      <literal>minimal</literal> in the <filename>postgresql.conf</filename> file on the
      new primary cluster.
@@ -497,7 +503,6 @@ pg_upgrade.exe
      linkend="warm-standby"/>) standby servers, you can follow these steps to
      quickly upgrade them.  You will not be running <application>pg_upgrade</application> on
      the standby servers, but rather <application>rsync</application> on the primary.
-     Do not start any servers yet.
     </para>
 
     <para>
@@ -508,6 +513,16 @@ pg_upgrade.exe
      is running.
     </para>
 
+    <para>
+     Before running rsync, to avoid standby corruption, it is absolutely
+     critical to ensure that both old and new primary servers are shut down
+     and standby servers have received the last changes as discussed in
+     <xref linkend="pgupgrade-step-prepare-standbys"/>. 
+     Standby servers can be running at this point or be fully stopped. If they 
+     are still running. You can stop, upgrade, and start them one by one; this
+     can be useful to keep the cluster open for read-only transactions.
+    </para>
+
     <procedure>
 
      <step>
--

Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

Reply via email to