On Mon, Jul 10, 2023 at 2:02 PM Michael Banck <mba...@gmx.net> wrote:
> Thanks for that! > Thanks for the fast review. > > > I agree with Andrey – without it, we don't have any good way to upgrade > > large clusters in short time. Default rsync mode (without "--size-only") > > takes a lot of time too, if the load is heavy. > > > > With these adjustments, can "rsync --size-only" remain in the docs as the > > *fast* and safe method to upgrade standbys, or there are still some > > concerns related to corruption risks? > > I hope somebody can answer that definitively, but I read Stephen's mail > to indicate that this procedure should be safe in principle (if you know > what you are doing). > right, this is my understanding too > > +++ b/doc/src/sgml/ref/pgupgrade.sgml > > @@ -380,22 +380,28 @@ NET STOP postgresql-&majorversion; > > </para> > > > > <para> > > - Streaming replication and log-shipping standby servers can > > + Streaming replication and log-shipping standby servers must > > remain running until a later step. > > </para> > > </step> > > > > - <step> > > + <step id="pgupgrade-step-prepare-standbys"> > > > > <para> > > - If you are upgrading standby servers using methods outlined in > section <xref > > - linkend="pgupgrade-step-replicas"/>, verify that the old standby > > - servers are caught up by running > <application>pg_controldata</application> > > - against the old primary and standby clusters. Verify that the > > - <quote>Latest checkpoint location</quote> values match in all > clusters. > > - (There will be a mismatch if old standby servers were shut down > > - before the old primary or if the old standby servers are still > running.) > > + If you are upgrading standby servers using methods outlined in > > + <xref linkend="pgupgrade-step-replicas"/>, > > You dropped the "section" before the xref, I think that should be kept > around. > Seems to be a problem in discussing source code that looks quite different than the final result. In the result – the docs – we currently have "section Step 9", looking weird. I still think it's good to remove it. We also have "in Step 17 below" (without the extra word "section") in a different place on the same page. > > > + ensure that they were > running when > > + you shut down the primaries in the previous step, so all the > latest changes > > You talk of primaries in plural here, that is a bit weird for PostgreSQL > documentation. > The same docs already discuss two primaries ("8. Stop both primaries"), but I agree it might look confusing if you read only a part of the doc, jumping into middle of it, like I do all the time when using the docs in "check the reference" style. I agree with this comment, but it tells me we need even more improvements of this doc, beyond my original goal – e.g., I don't like section 8 saying "Make sure both database servers", it should be "both primaries". > > > + and the shutdown checkpoint record were received. You can verify > this by running > > + <application>pg_controldata</application> against the old primary > and standby > > + clusters. The <quote>Latest checkpoint location</quote> values > must match in all > > + nodes. A mismatch might occur if old standby servers were shut > down before > > + the old primary. To fix a mismatch, start all old servers and > return to the > > + previous step; proceeding with mismatched > > + <quote>Latest checkpoint location</quote> may lead to standby > corruption. > > + </para> > > + > > + <para> > > Also, make sure <varname>wal_level</varname> is not set to > > <literal>minimal</literal> in the > <filename>postgresql.conf</filename> file on the > > new primary cluster. > > @@ -497,7 +503,6 @@ pg_upgrade.exe > > linkend="warm-standby"/>) standby servers, you can follow these > steps to > > quickly upgrade them. You will not be running > <application>pg_upgrade</application> on > > the standby servers, but rather <application>rsync</application> > on the primary. > > - Do not start any servers yet. > > </para> > > > > <para> > > @@ -508,6 +513,15 @@ pg_upgrade.exe > > is running. > > </para> > > > > + <para> > > + Before running rsync, to avoid standby corruption, it is absolutely > > + critical to ensure that both primaries are shut down and standbys > > + have received the last changes (see <xref > linkend="pgupgrade-step-prepare-standbys"/>). > > I think this should be something like "ensure both that the primary is > shut down and that the standbys have received all the changes". > Well, we have two primary servers – old and new. I tried to clarify it in the new version. > > > + Standbys can be running at this point or fully stopped. > > "or be fully stopped." I think. > > > + If they > > + are still running, you can stop, upgrade, and start them one by > one; this > > + can be useful to keep the cluster open for read-only transactions. > > + </para> > > Maybe this is clear from the context, but "upgrade" in the above should > maybe more explicitly refer to the rsync method or else people might > think one can run pg_upgrade on them after all? > Maybe. It will require changes in other parts of this doc. Thinking (here: https://gitlab.com/postgres/postgres/-/merge_requests/18/diffs) Meanwhile, attached is v2 thanks for the comments
From: Nikolay Samokhvalov <nik@postgres.ai> Date: Mon, 10 Jul 2023 20:07:18 +0000 Subject: [PATCH] Refine instructions for major upgrades on standby servers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Recipe involving pg_upgrade -k and rsync --size-only can be dangerous – clarify what to do to avoid standby corruption. Discussion: https://www.postgresql.org/message-id/flat/CAM527d8heqkjG5VrvjU3Xjsqxg41ufUyabD9QZccdAxnpbRH-Q%40mail.gmail.com --- doc/src/sgml/ref/pgupgrade.sgml | 39 +++++++++++++++++++++++---------- 1 file changed, 27 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml index 7816b4c6859..41242bb4200 100644 --- a/doc/src/sgml/ref/pgupgrade.sgml +++ b/doc/src/sgml/ref/pgupgrade.sgml @@ -361,10 +361,10 @@ make prefix=/usr/local/pgsql.new install </step> <step> - <title>Stop both servers</title> + <title>Stop both primary servers</title> <para> - Make sure both database servers are stopped using, on Unix, e.g.: + Make sure both primary servers are stopped using, on Unix, e.g.: <programlisting> pg_ctl -D /opt/PostgreSQL/9.6 stop @@ -380,22 +380,28 @@ NET STOP postgresql-&majorversion; </para> <para> - Streaming replication and log-shipping standby servers can + Streaming replication and log-shipping standby servers must remain running until a later step. </para> </step> - <step> + <step id="pgupgrade-step-prepare-standbys"> <title>Prepare for standby server upgrades</title> <para> - If you are upgrading standby servers using methods outlined in section <xref - linkend="pgupgrade-step-replicas"/>, verify that the old standby - servers are caught up by running <application>pg_controldata</application> - against the old primary and standby clusters. Verify that the - <quote>Latest checkpoint location</quote> values match in all clusters. - (There will be a mismatch if old standby servers were shut down - before the old primary or if the old standby servers are still running.) + If you are upgrading standby servers using methods outlined in + <xref linkend="pgupgrade-step-replicas"/>, ensure that they were running when + you shut down the primaries in the previous step, so all the latest changes + and the shutdown checkpoint record were received. You can verify this by running + <application>pg_controldata</application> against the old primary and standby + clusters. The <quote>Latest checkpoint location</quote> values must match in all + nodes. A mismatch might occur if old standby servers were shut down before + the old primary. To fix a mismatch, start all old servers and return to the + previous step; proceeding with mismatched + <quote>Latest checkpoint location</quote> may lead to standby corruption. + </para> + + <para> Also, make sure <varname>wal_level</varname> is not set to <literal>minimal</literal> in the <filename>postgresql.conf</filename> file on the new primary cluster. @@ -497,7 +503,6 @@ pg_upgrade.exe linkend="warm-standby"/>) standby servers, you can follow these steps to quickly upgrade them. You will not be running <application>pg_upgrade</application> on the standby servers, but rather <application>rsync</application> on the primary. - Do not start any servers yet. </para> <para> @@ -508,6 +513,16 @@ pg_upgrade.exe is running. </para> + <para> + Before running rsync, to avoid standby corruption, it is absolutely + critical to ensure that both old and new primary servers are shut down + and standby servers have received the last changes as discussed in + <xref linkend="pgupgrade-step-prepare-standbys"/>. + Standby servers can be running at this point or be fully stopped. If they + are still running. You can stop, upgrade, and start them one by one; this + can be useful to keep the cluster open for read-only transactions. + </para> + <procedure> <step> --