On Mon, May 1, 2023 at 2:30 AM Peter Geoghegan <p...@bowt.ie> wrote: > > On Sat, Apr 29, 2023 at 7:30 PM John Naylor > <john.nay...@enterprisedb.com> wrote: > > How about > > > > -HINT: To avoid a database shutdown, [...] > > +HINT: To prevent XID exhaustion, [...] > > > > ...and "MXID", respectively? We could explain in the docs that vacuum and read-only queries still work "when XIDs have been exhausted", etc. > > I think that that particular wording works in this example -- we *are* > avoiding XID exhaustion. But it still doesn't really address my > concern -- at least not on its own. I think that we need a term for > xidStopLimit mode (and perhaps multiStopLimit) itself. This is a > discrete state/mode that is associated with a specific mechanism.
Well, since you have a placeholder "xidStopLimit mode" in your other patch, I'll confine my response to there. Inventing "modes" seems like an awkward thing to backpatch, not to mention moving the goalposts. My modest goal here is quite limited: to stop lying to our users about "not accepting commands", and change tragically awful advice into sensible advice. Here's my new idea: -HINT: To avoid a database shutdown, [...] +HINT: To prevent XID generation failure, [...] Actually, I like "allocation" better, but the v8 patch now has "generation" simply because one MXID message already has "generate" and I did it that way before thinking too hard. I'd be okay with either one as long as it's consistent. > > (I should probably also add in the commit message that the "shutdown" in the message was carried over to MXIDs when they arrived also in 2005). Done > > > Separately, there is a need to update a couple of other places to use > > > this new terminology. The documentation for vacuum_sailsafe_age and > > > vacuum_multixact_failsafe_age refer to "system-wide transaction ID > > > wraparound failure", which seems less than ideal, even without your > > > patch. > > > > Right, I'll have a look. Looking now, I'm even less inclined to invent new terminology in back branches. > As you know, there is a more general problem with the use of the term > "wraparound" in the docs, and in the system itself (in places like > pg_stat_activity). Even the very basic terminology in this area is > needlessly scary. Terms like "VACUUM (to prevent wraparound)" are > uncomfortably close to "a race against time to avoid data corruption". > The system isn't ever supposed to corrupt data, even if misconfigured > (unless the misconfiguration is very low-level, such as "fsync=off"). > Users should be able to take that much for granted. Granted. Whatever form your rewrite ends up in, it could make a lot of sense to then backpatch a few localized corrections. I wouldn't even object to including a few substitutions of s/wraparound failure/allocation failure/ where appropriate. Let's see how that shakes out first. > > I think the docs would do well to have ordered steps for recovering from both XID and MXID exhaustion. > > I had planned to address this with my ongoing work on the "Routine > Vacuuming" docs, but I think that you're right about the necessity of > addressing it as part of this patch. 0003 is now a quick-and-dirty attempt at that, only in the docs. The MXID part is mostly copy-pasted from the XID part, just to get something together. I'd like to abbreviate that somehow. -- John Naylor EDB: http://www.enterprisedb.com
From 469d0e7123a16386e300a85a6bb08109e283b65c Mon Sep 17 00:00:00 2001 From: John Naylor <john.nay...@postgresql.org> Date: Sat, 29 Apr 2023 14:23:50 +0700 Subject: [PATCH v8 2/3] Stop telling users to run VACUUM in a single-user mode Single-user mode is almost always the worst thing to reach for in a VACUUM emergency: * Restarting in single user mode requires a shutdown checkpoint * The user interface is completely different, and awful * The buffer cache is completely cold * The checkpointer, background writer and WAL writer are not running * Without checkpoints WAL segments can not be rotated and reused * Replication is not running, so after VACUUM is done and database is started in normal mode, there is a huge backlog to replicate * pg_stat_progress_vacuum is not available so there is no indication of when the command will complete * VACUUM VERBOSE doesn't work - there is no output from single-user mode vacuum, with or without VERBOSE If that weren't enough, it's also unsafe because the wraparound limits are not enforced. It is by no means impossible to corrupt the database by mistake, such as by a user running VACUUM FULL because it sounds better. As mentioned in commit XXXXXXXXX, the system is perfectly capable of accepting commands when reaching xidStopLimit. Most VACUUM operations will work normally, with one exception: A new XID is required when truncating the relation if wal_level is above "minimal". As of v14 the failsafe mechanism disables truncation some time before reaching xidStopLimit, so this is not an issue in practice. By remaining in multi-user mode, users still have read-only access to their database, they can use parallelism, they can use command line utilities like vacuumdb, and they can remotely access the database using normal clients. The only reason to restart in single-user mode is to DROP or TRUNCATE a table, when it is suspected that doing that would be faster than vacuuming. Also add an explicit note warning against using single-user mode. Backpatch to v14, which is the first version with the VACUUM failsafe. XXX We should consider v12-13 with "VACUUM (TRUNCATE off, INDEX_CLEANUP off);", but it's not yet convenient to get to xidStopLimit before v14. XXX We could consider v11 with careful instructions about redoing VACUUMs in single-user mode when truncation is necessary. Aleksander Alekseev, with some adjustments by me Reviewed by Peter Geoghegan Reported at various times by (at least) Hannu Krosing, Robert Haas, and Andres Freund Discussion: https://postgr.es/m/caj7c6tm2d277u2wh8x78kg8ph3tduqebv3_jcjqakyqfhcf...@mail.gmail.com Discussion: https://postgr.es/m/CA%2BTgmob1QCMJrHwRBK8HZtGsr%2B6cJANRQw2mEgJ9e%3DD%2Bz7cOsw%40mail.gmail.com Discussion: https://postgr.es/m/CAMT0RQTmRj_Egtmre6fbiMA9E2hM3BsLULiV8W00stwa3URvzA%40mail.gmail.com Discussion: https://postgr.es/m/CA%2BTgmoYPfofQmRtUan%3DA3aWE9wFsJaOFr%2BW_ys2pPkNPr-2FZw%40mail.gmail.com --- doc/src/sgml/maintenance.sgml | 32 +++++++++++++++++++++-------- src/backend/access/transam/varsup.c | 4 ++-- 2 files changed, 25 insertions(+), 11 deletions(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 48d43cb339..116d6187cd 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -669,22 +669,36 @@ HINT: To prevent XID generation failure, execute a database-wide VACUUM in that <programlisting> ERROR: database is not accepting commands that generate new XIDs to avoid wraparound data loss in database "mydb" -HINT: Stop the postmaster and vacuum that database in single-user mode. +HINT: Execute a database-wide VACUUM in that database. </programlisting> In this condition any transactions already in progress can continue, but only read-only transactions can be started. Operations that modify database records or truncate relations will fail. - - The three-million-transaction safety margin exists to let the - administrator recover without data loss, by manually executing the - required <command>VACUUM</command> commands. However - the only way to do this is to stop the server and start the server in single-user - mode to execute <command>VACUUM</command>. See the - <xref linkend="app-postgres"/> reference page for details about using - single-user mode. + The <command>VACUUM</command> command can still be run normally to recover. + <!-- v12 and v13 need VACUUM (TRUNCATE off, INDEX_CLEANUP off); --> + <!-- v11 maybe mention that it will often succed, but will need S-U if truncation happens --> </para> + <note> + <para> + In earlier versions it was required to stop the postmaster and + <command>VACUUM</command> the database in a single-user mode. There is no + need to use single-user mode anymore, and in fact it's strongly + discouraged: It increases downtime, makes monitoring impossible, + disables replication, bypasses safeguards against wraparound, etc. + </para> + + <para> + The only reason to use single-user mode in an emergency is to e.g. + <command>TRUNCATE</command> or <command>DROP</command> unneeded tables + to avoid needing to <command>VACUUM</command> them. + The three-million-transaction safety margin exists to let the + administrator do this. See the <xref linkend="app-postgres"/> + reference page for details about using single-user mode. + </para> + </note> + <sect3 id="vacuum-for-multixact-wraparound"> <title>Multixacts and Wraparound</title> diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c index 672e6f0196..0d6c4925d5 100644 --- a/src/backend/access/transam/varsup.c +++ b/src/backend/access/transam/varsup.c @@ -128,14 +128,14 @@ GetNewTransactionId(bool isSubXact) (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database \"%s\"", oldest_datname), - errhint("Stop the postmaster and vacuum that database in single-user mode.\n" + errhint("Execute a database-wide VACUUM in that database.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); else ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database with OID %u", oldest_datoid), - errhint("Stop the postmaster and vacuum that database in single-user mode.\n" + errhint("Execute a database-wide VACUUM in that database.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); } else if (TransactionIdFollowsOrEquals(xid, xidWarnLimit)) -- 2.39.2
From 313019305d78a82b816fcfcbd79dc5c570dd31af Mon Sep 17 00:00:00 2001 From: John Naylor <john.nay...@postgresql.org> Date: Fri, 28 Apr 2023 16:08:33 +0700 Subject: [PATCH v8 1/3] Correct outdated docs and messages regarding XID limits Previously, when approaching xidStopLimit or xidWrapLimit, log messages would warn against a "database shutdown", and when it reached those limits claimed that it "is not accepting commands". This language originated in commit 60b2444cc when the xidStopLimit was added in 2005. At that time, even a trivial SELECT would have failed. Commit 295e63983d in 2007 introduced virtual transaction IDs, which allowed actual XIDs to be allocated lazily when it is necessary to do so, such as when modifying database records. Since then, the behavior at these limits is merely to refuse to allocate new XIDs, so read-only queries can continue to be initiated. The "database shutdown" message was also copied to the message warning for multiWarnLimit when it was added. This has been wrong for a very long time, so backpatch to all supported branches. Aleksander Alekseev, with some editing by me Reviewed by Pavel Borisov and Peter Geoghegan Discussion: https://postgr.es/m/caj7c6tm2d277u2wh8x78kg8ph3tduqebv3_jcjqakyqfhcf...@mail.gmail.com --- doc/src/sgml/maintenance.sgml | 22 ++++++++++++---------- src/backend/access/transam/multixact.c | 4 ++-- src/backend/access/transam/varsup.c | 12 ++++++------ 3 files changed, 20 insertions(+), 18 deletions(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 9cf9d030a8..48d43cb339 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -656,7 +656,7 @@ SELECT datname, age(datfrozenxid) FROM pg_database; <programlisting> WARNING: database "mydb" must be vacuumed within 39985967 transactions -HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database. +HINT: To prevent XID generation failure, execute a database-wide VACUUM in that database. </programlisting> (A manual <command>VACUUM</command> should fix the problem, as suggested by the @@ -664,23 +664,25 @@ HINT: To avoid a database shutdown, execute a database-wide VACUUM in that data superuser, else it will fail to process system catalogs and thus not be able to advance the database's <structfield>datfrozenxid</structfield>.) If these warnings are - ignored, the system will shut down and refuse to start any new - transactions once there are fewer than three million transactions left - until wraparound: + ignored, the system will refuse to generate new XIDs once there are + fewer than three million transactions left until wraparound: <programlisting> -ERROR: database is not accepting commands to avoid wraparound data loss in database "mydb" +ERROR: database is not accepting commands that generate new XIDs to avoid wraparound data loss in database "mydb" HINT: Stop the postmaster and vacuum that database in single-user mode. </programlisting> + In this condition any transactions already in progress can continue, + but only read-only transactions can be started. Operations that + modify database records or truncate relations will fail. + The three-million-transaction safety margin exists to let the administrator recover without data loss, by manually executing the - required <command>VACUUM</command> commands. However, since the system will not - execute commands once it has gone into the safety shutdown mode, + required <command>VACUUM</command> commands. However the only way to do this is to stop the server and start the server in single-user - mode to execute <command>VACUUM</command>. The shutdown mode is not enforced - in single-user mode. See the <xref linkend="app-postgres"/> reference - page for details about using single-user mode. + mode to execute <command>VACUUM</command>. See the + <xref linkend="app-postgres"/> reference page for details about using + single-user mode. </para> <sect3 id="vacuum-for-multixact-wraparound"> diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index fe6698d5ff..ca0e038c36 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2335,7 +2335,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid, multiWrapLimit - curMulti, oldest_datname, multiWrapLimit - curMulti), - errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n" + errhint("To prevent MultiXactId generation failure, execute a database-wide VACUUM in that database.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); else ereport(WARNING, @@ -2344,7 +2344,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid, multiWrapLimit - curMulti, oldest_datoid, multiWrapLimit - curMulti), - errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n" + errhint("To prevent MultiXactId generation failure, execute a database-wide VACUUM in that database.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); } } diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c index 334adac09e..672e6f0196 100644 --- a/src/backend/access/transam/varsup.c +++ b/src/backend/access/transam/varsup.c @@ -126,14 +126,14 @@ GetNewTransactionId(bool isSubXact) if (oldest_datname) ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), - errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"", + errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database \"%s\"", oldest_datname), errhint("Stop the postmaster and vacuum that database in single-user mode.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); else ereport(ERROR, (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), - errmsg("database is not accepting commands to avoid wraparound data loss in database with OID %u", + errmsg("database is not accepting commands that generate new XIDs to avoid wraparound data loss in database with OID %u", oldest_datoid), errhint("Stop the postmaster and vacuum that database in single-user mode.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); @@ -148,14 +148,14 @@ GetNewTransactionId(bool isSubXact) (errmsg("database \"%s\" must be vacuumed within %u transactions", oldest_datname, xidWrapLimit - xid), - errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n" + errhint("To prevent XID generation failure, execute a database-wide VACUUM in that database.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); else ereport(WARNING, (errmsg("database with OID %u must be vacuumed within %u transactions", oldest_datoid, xidWrapLimit - xid), - errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n" + errhint("To prevent XID generation failure, execute a database-wide VACUUM in that database.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); } @@ -463,14 +463,14 @@ SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid) (errmsg("database \"%s\" must be vacuumed within %u transactions", oldest_datname, xidWrapLimit - curXid), - errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n" + errhint("To prevent entering read-only mode, execute a database-wide VACUUM in that database.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); else ereport(WARNING, (errmsg("database with OID %u must be vacuumed within %u transactions", oldest_datoid, xidWrapLimit - curXid), - errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n" + errhint("To prevent entering read-only mode, execute a database-wide VACUUM in that database.\n" "You might also need to commit or roll back old prepared transactions, or drop stale replication slots."))); } } -- 2.39.2
From ef5448479cea756c349ab2b46138477250cbac8e Mon Sep 17 00:00:00 2001 From: John Naylor <john.nay...@postgresql.org> Date: Mon, 1 May 2023 19:07:37 +0700 Subject: [PATCH v8 3/3] Rough draft of complete steps to recover from (M)XID generation failure --- doc/src/sgml/maintenance.sgml | 53 +++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 116d6187cd..8b9f34074b 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -678,6 +678,20 @@ HINT: Execute a database-wide VACUUM in that database. The <command>VACUUM</command> command can still be run normally to recover. <!-- v12 and v13 need VACUUM (TRUNCATE off, INDEX_CLEANUP off); --> <!-- v11 maybe mention that it will often succed, but will need S-U if truncation happens --> + + However, it is first necessary to remove obstacles to advancing the XID horizon: + + <orderedlist> + <listitem> + <simpara>Commit or rollback any prepared transactions</simpara> + </listitem> + <listitem> + <simpara>Terminate any sessions that might have open transactions</simpara> + </listitem> + <listitem> + <simpara>Drop any old replication slots</simpara> + </listitem> + </orderedlist> </para> <note> @@ -763,6 +777,45 @@ HINT: Execute a database-wide VACUUM in that database. have the oldest multixact-age. Both of these kinds of aggressive scans will occur even if autovacuum is nominally disabled. </para> + +<!-- WIP: Lot's of stuff identical to the XID case, can we abbreviate? --> + <para> + If for some reason autovacuum fails to clear old MXIDs from a table, the + system will begin to emit warning messages like this when the database's + oldest XIDs reach forty million transactions from the wraparound point: + +<programlisting> +WARNING: database "mydb" must be vacuumed within 39985967 transactions +HINT: To prevent MultiXactId generation failure, execute a database-wide VACUUM in that database. +</programlisting> + +<!-- WIP: not sure about system catalogs --> + (A manual <command>VACUUM</command> should fix the problem, as suggested by the + hint; but note that the <command>VACUUM</command> must be performed by a + superuser, else it will fail to process system catalogs and thus not + be able to advance the database's <structfield>datfrozenxid</structfield>.) + If these warnings are + ignored, the system will refuse to generate new MXIDs once there are + fewer than three million left until wraparound: + +<programlisting> +ERROR: database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss in database "mydb" +HINT: Execute a database-wide VACUUM in that database. +</programlisting> + </para> + + <para> + To restore normal operation, it is first necessary to remove obstacles to advancing the MXID horizon: + <orderedlist> + <listitem> + <simpara>Commit or rollback each prepared transaction that might appear in a multixact</simpara> + </listitem> + <listitem> + <simpara>Resolve each transaction that might appear in a multixact</simpara> + </listitem> + </orderedlist> + </para> + </sect3> </sect2> -- 2.39.2