Re: pg_upgrade check for invalid databases
On Fri, 11 Oct 2024 at 04:01, Daniel Gustafsson wrote: > > > On 7 Oct 2024, at 22:04, Nathan Bossart wrote: > > > > On Mon, Oct 07, 2024 at 03:37:35PM -0400, Bruce Momjian wrote: > >> On Tue, Oct 1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote: > >>> Correct, sorry for being unclear. The consistency argument would be to > >>> expand > >>> pg_upgrade to report all invalid databases rather than just the first > >>> found; > >>> attempting to fix problems would be a new behavior. > >> > >> Yes, historically pg_upgrade will fail if it finds anything unusual, > >> mostly because what it does normally is already scary enough. If users > >> what pg_upgrade to do cleanups, it would be enabled by a separate flag, > >> or even a new command-line app. > > > > While I suspect it's rare that someone CTRL-C's out of an accidental DROP > > DATABASE and then runs pg_upgrade before trying to recover the data, I > > agree with the principle of having pg_upgrade fail by default for things > > like this. If we did add a new flag, the new invalid database report that > > Daniel mentions could say something like "try again with > > --skip-invalid-databases to have pg_upgrade automatically drop invalid > > databases." > > If we are teaching pg_upgrade to handle errors, either by skipping or by > fixing, then I believe this is the right way to go about it. A successful run > should probably also create a report of the databases which were skipped. In v2 I've made changes to the patch incorporating the suggestions here: * Default behaviour is to just fail with a report of all invalid databases * A new option --skip-invalid-databases will then skip the checks, and would not transfer any invalid database to the new cluster. A warning with a report file will then follow after a successful run. Dropping invalid databases in the old cluster will make invalid databases unrecoverable, so I opted for a skip over invalid databases approach that would leave invalid databases in the old cluster. Apart from a missing --skip-invalid-databases test, does this attempt look OK? From e51f581ddc1158a9fb2840dfc04863618a390887 Mon Sep 17 00:00:00 2001 From: Thomas Krennwallner Date: Fri, 20 Sep 2024 17:33:05 -0400 Subject: [PATCH] pg_upgrade: Add check for invalid databases. Currently, pg_upgrade fails to connect to the first invalid database it encounters in get_loadable_libraries() and then aborts. While we print a fatal message with a hint what should be done, the output is terse and does not report further invalid databases present in an installation. Instead of just exiting on first error, we collect all pg_database.datconnlimit values in get_db_infos() and order invalid databases after valid ones in cluster.dbarr. This allows for skipping over invalid databases in all checks with --skip-invalid-databases, a new option where we do not transfer such databases from the old cluster. Unless we use --skip-invalid-databases, check_for_invalid_dbs() collects all invalid databases in a report file invalid_databases.txt. After a successful pg_upgrade run with skipped invalid databases, create_invalid_databases_report() will collect them in a report file skipped_invalid_databases.txt for the completion banner output. --- doc/src/sgml/ref/pgupgrade.sgml| 19 src/bin/pg_upgrade/check.c | 131 +++-- src/bin/pg_upgrade/info.c | 20 +++- src/bin/pg_upgrade/option.c| 7 ++ src/bin/pg_upgrade/pg_upgrade.c| 18 +++- src/bin/pg_upgrade/pg_upgrade.h| 8 +- src/bin/pg_upgrade/t/002_pg_upgrade.pl | 2 +- 7 files changed, 187 insertions(+), 18 deletions(-) diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml index 4777381dac..a9a6b62e7a 100644 --- a/doc/src/sgml/ref/pgupgrade.sgml +++ b/doc/src/sgml/ref/pgupgrade.sgml @@ -253,6 +253,25 @@ PostgreSQL documentation + + --skip-invalid-databases + + +If a cluster contains invalid databases, the default behaviour of +pg_upgrade is to fail with a report of all +such databases found in an installation. Invalid databases, which cannot +be connected to anymore, are a consequence of interrupted +DROP DATABASE and marked as corrupted databases in +pg_database.datconnlimit. + + +When using option --skip-invalid-databases, +pg_upgrade instead will skip all invalid +databases and do not transfer them over to the new cluster. + + + + --sync-method=method diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index 12735a4268..e372feff1a 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg
pg_upgrade check for invalid databases
Hi, if a cluster contains invalid databases that we cannot connect to anymore, pg_upgrade would currently fail when trying to connect to the first encountered invalid database with Performing Consistency Checks - Checking cluster versions ok connection failure: connection to server on socket "/tmp/.s.PGSQL.50432" failed: FATAL: cannot connect to invalid database "foo" HINT: Use DROP DATABASE to drop invalid databases. Failure, exiting If there is more than one invalid database, we need to run pg_upgrade more than once (unless the user inspects pg_database). I attached two small patches for PG 17 and PG 18 (can be easily backported to all previous versions upon request). Instead of just failing to connect with an error, we collect all invalid databases in a report file invalid_databases.txt: Performing Consistency Checks - Checking cluster versions ok Checking for invalid databasesfatal Your installation contains invalid databases as a consequence of interrupted DROP DATABASE. They are now marked as corrupted databases that cannot be connected to anymore. Consider removing them using DROP DATABASE ...; A list of invalid databases is in the file: /usr/local/pgsql/data/18/pg_upgrade_output.d/20240929T200559.707/invalid_databases.txt Failure, exiting Any thoughts on the proposed patches? From c44f723c033573fa5da72721595bc135a00e3cbf Mon Sep 17 00:00:00 2001 From: Thomas Krennwallner Date: Fri, 20 Sep 2024 17:33:05 -0400 Subject: [PATCH] pg_upgrade: Add check for invalid databases. Currently, pg_upgrade fails to connect to the first invalid database it encounters in get_loadable_libraries() and then aborts. While we print a fatal message with a hint what should be done, the output is terse and does not report further invalid databases present in an installation. Instead of just exiting on first error, we collect all pg_database.datconnlimit values in get_db_infos(), ignore invalid databases in process_slot(), and then perform check_for_invalid_dbs(), which collects all invalid databases in a report file invalid_databases.txt. --- src/bin/pg_upgrade/check.c | 62 ++ src/bin/pg_upgrade/info.c | 5 ++- src/bin/pg_upgrade/pg_upgrade.h| 2 + src/bin/pg_upgrade/t/002_pg_upgrade.pl | 2 +- src/bin/pg_upgrade/task.c | 32 +++-- 5 files changed, 98 insertions(+), 5 deletions(-) diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index 12735a4268..b5b0f7cd6f 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -12,6 +12,7 @@ #include "catalog/pg_authid_d.h" #include "catalog/pg_class_d.h" #include "catalog/pg_collation.h" +#include "catalog/pg_database.h" #include "fe_utils/string_utils.h" #include "mb/pg_wchar.h" #include "pg_upgrade.h" @@ -19,6 +20,7 @@ static void check_new_cluster_is_empty(void); static void check_is_install_user(ClusterInfo *cluster); static void check_proper_datallowconn(ClusterInfo *cluster); +static void check_for_invalid_dbs(ClusterInfo *cluster); static void check_for_prepared_transactions(ClusterInfo *cluster); static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster); static void check_for_user_defined_postfix_ops(ClusterInfo *cluster); @@ -598,6 +600,13 @@ check_and_dump_old_cluster(void) */ get_db_rel_and_slot_infos(&old_cluster); + /* + * Verify that the old cluster does not reference invalid databases. + * Connections to such databases are not allowed and break the following + * checks. + */ + check_for_invalid_dbs(&old_cluster); + init_tablespaces(); get_loadable_libraries(); @@ -691,6 +700,13 @@ check_new_cluster(void) { get_db_rel_and_slot_infos(&new_cluster); + /* + * Verify that the new cluster does not reference invalid databases. + * Connections to such databases are not allowed and break the following + * checks. + */ + check_for_invalid_dbs(&new_cluster); + check_new_cluster_is_empty(); check_loadable_libraries(); @@ -1087,6 +1103,52 @@ check_is_install_user(ClusterInfo *cluster) check_ok(); } +/* + * check_for_invalid_dbs + * + * Ensure that all databases are valid as connections to invalid databases + * are not allowed. + */ +static void +check_for_invalid_dbs(ClusterInfo *cluster) +{ + int dbnum; + FILE *script = NULL; + char output_path[MAXPGPATH]; + + prep_status("Checking for invalid databases"); + + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "invalid_databases.txt"); + + for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++) + { + DbInfo *active_db = &cluster->dbarr.dbs[dbnum]; + + if (active_db-&g
Re: pg_upgrade check for invalid databases
On 30/09/2024 17.29, Daniel Gustafsson wrote: On 30 Sep 2024, at 16:55, Tom Lane wrote: TBH I'm not finding anything very much wrong with the current behavior... this has to be a rare situation, do we need to add debatable behavior to make it easier? One argument would be to make the checks consistent, pg_upgrade generally tries to report all the offending entries to help the user when fixing the source database. Not sure if it's a strong enough argument for carrying code which really shouldn't see much use though. In general, I agree that this situation should be rare for deliberate DROP DATABASE interrupted in interactive sessions. Unfortunately, for (popular) tools that perform automatic "temporary database" cleanup, we could recently see an increase in invalid databases. The additional check for pg_upgrade was made necessary due to several unrelated customers having invalid databases that stem from left-over Prisma Migrate "shadow databases" [1]. We could not reproduce this Prisma Migrate issue yet, as those migrations happened some time ago. Maybe this bug really stems from a much older Prisma Migrate version and we only see the fallout now. This is still a TODO item. But it appears that this tool can get interrupted "at the wrong time" while it is deleting temporary databases (probably a manual Ctrl-C), and clients are unaware that this can then leave behind invalid databases. Those temporary databases do not cause any harm as they are not used anymore. But eventually, PG installations will be upgraded to the next major version, and it is only then when those invalid databases resurface after pg_upgrade fails to run the checks. Long story short: interactive DROP DATABASE interrupts are rare (they do exist, but customers are usually aware). Automation tools on the other hand may run DROP DATABASE and when they get interrupted at the wrong time they will then produce several left-over invalid databases. pg_upgrade will then fail to run the checks. [1] https://www.prisma.io/docs/orm/prisma-migrate/understanding-prisma-migrate/shadow-database