Re: pg_upgrade check for invalid databases

2024-10-13 Thread Thomas Krennwallner
On Fri, 11 Oct 2024 at 04:01, Daniel Gustafsson  wrote:
>
> > On 7 Oct 2024, at 22:04, Nathan Bossart  wrote:
> >
> > On Mon, Oct 07, 2024 at 03:37:35PM -0400, Bruce Momjian wrote:
> >> On Tue, Oct  1, 2024 at 09:28:54AM +0200, Daniel Gustafsson wrote:
> >>> Correct, sorry for being unclear.  The consistency argument would be to 
> >>> expand
> >>> pg_upgrade to report all invalid databases rather than just the first 
> >>> found;
> >>> attempting to fix problems would be a new behavior.
> >>
> >> Yes, historically pg_upgrade will fail if it finds anything unusual,
> >> mostly because what it does normally is already scary enough.  If users
> >> what pg_upgrade to do cleanups, it would be enabled by a separate flag,
> >> or even a new command-line app.
> >
> > While I suspect it's rare that someone CTRL-C's out of an accidental DROP
> > DATABASE and then runs pg_upgrade before trying to recover the data, I
> > agree with the principle of having pg_upgrade fail by default for things
> > like this.  If we did add a new flag, the new invalid database report that
> > Daniel mentions could say something like "try again with
> > --skip-invalid-databases to have pg_upgrade automatically drop invalid
> > databases."
>
> If we are teaching pg_upgrade to handle errors, either by skipping or by
> fixing, then I believe this is the right way to go about it.  A successful run
> should probably also create a report of the databases which were skipped.

In v2 I've made changes to the patch incorporating the suggestions here:

* Default behaviour is to just fail with a report of all invalid databases

* A new option --skip-invalid-databases will then skip the checks, and
would not transfer any invalid database to the new cluster. A warning
with a report file will then follow after a successful run.

Dropping invalid databases in the old cluster will make invalid
databases unrecoverable, so I opted for a skip over invalid databases
approach that would leave invalid databases in the old cluster.

Apart from a missing --skip-invalid-databases test, does this attempt look OK?
From e51f581ddc1158a9fb2840dfc04863618a390887 Mon Sep 17 00:00:00 2001
From: Thomas Krennwallner 
Date: Fri, 20 Sep 2024 17:33:05 -0400
Subject: [PATCH] pg_upgrade: Add check for invalid databases.

Currently, pg_upgrade fails to connect to the first invalid database
it encounters in get_loadable_libraries() and then aborts.  While we
print a fatal message with a hint what should be done, the output is
terse and does not report further invalid databases present in an
installation.

Instead of just exiting on first error, we collect all
pg_database.datconnlimit values in get_db_infos() and order invalid
databases after valid ones in cluster.dbarr. This allows for skipping
over invalid databases in all checks with --skip-invalid-databases, a
new option where we do not transfer such databases from the old
cluster.

Unless we use --skip-invalid-databases, check_for_invalid_dbs()
collects all invalid databases in a report file invalid_databases.txt.
After a successful pg_upgrade run with skipped invalid databases,
create_invalid_databases_report() will collect them in a report file
skipped_invalid_databases.txt for the completion banner output.
---
 doc/src/sgml/ref/pgupgrade.sgml|  19 
 src/bin/pg_upgrade/check.c | 131 +++--
 src/bin/pg_upgrade/info.c  |  20 +++-
 src/bin/pg_upgrade/option.c|   7 ++
 src/bin/pg_upgrade/pg_upgrade.c|  18 +++-
 src/bin/pg_upgrade/pg_upgrade.h|   8 +-
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |   2 +-
 7 files changed, 187 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 4777381dac..a9a6b62e7a 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -253,6 +253,25 @@ PostgreSQL documentation
   
  
 
+ 
+  --skip-invalid-databases
+  
+   
+If a cluster contains invalid databases, the default behaviour of
+pg_upgrade is to fail with a report of all
+such databases found in an installation. Invalid databases, which cannot
+be connected to anymore, are a consequence of interrupted
+DROP DATABASE and marked as corrupted databases in
+pg_database.datconnlimit.
+   
+   
+When using option --skip-invalid-databases,
+pg_upgrade instead will skip all invalid
+databases and do not transfer them over to the new cluster.
+   
+  
+ 
+
  
   --sync-method=method
   
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 12735a4268..e372feff1a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg

pg_upgrade check for invalid databases

2024-09-29 Thread Thomas Krennwallner

Hi,

if a cluster contains invalid databases that we cannot connect to 
anymore, pg_upgrade would currently fail when trying to connect to the 
first encountered invalid database with



Performing Consistency Checks
-
Checking cluster versions ok

connection failure: connection to server on socket "/tmp/.s.PGSQL.50432" 
failed: FATAL:  cannot connect to invalid database "foo"

HINT:  Use DROP DATABASE to drop invalid databases.

Failure, exiting


If there is more than one invalid database, we need to run pg_upgrade 
more than once (unless the user inspects pg_database).


I attached two small patches for PG 17 and PG 18 (can be easily 
backported to all previous versions upon request).  Instead of just 
failing to connect with an error, we collect all invalid databases in a 
report file invalid_databases.txt:



Performing Consistency Checks
-
Checking cluster versions ok
Checking for invalid databasesfatal

Your installation contains invalid databases as a consequence of
interrupted DROP DATABASE.  They are now marked as corrupted databases
that cannot be connected to anymore.  Consider removing them using
DROP DATABASE ...;
A list of invalid databases is in the file:

/usr/local/pgsql/data/18/pg_upgrade_output.d/20240929T200559.707/invalid_databases.txt
Failure, exiting


Any thoughts on the proposed patches?
From c44f723c033573fa5da72721595bc135a00e3cbf Mon Sep 17 00:00:00 2001
From: Thomas Krennwallner 
Date: Fri, 20 Sep 2024 17:33:05 -0400
Subject: [PATCH] pg_upgrade: Add check for invalid databases.

Currently, pg_upgrade fails to connect to the first invalid database
it encounters in get_loadable_libraries() and then aborts.  While we
print a fatal message with a hint what should be done, the output is
terse and does not report further invalid databases present in an
installation.

Instead of just exiting on first error, we collect all
pg_database.datconnlimit values in get_db_infos(), ignore invalid
databases in process_slot(), and then perform check_for_invalid_dbs(),
which collects all invalid databases in a report file
invalid_databases.txt.
---
 src/bin/pg_upgrade/check.c | 62 ++
 src/bin/pg_upgrade/info.c  |  5 ++-
 src/bin/pg_upgrade/pg_upgrade.h|  2 +
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |  2 +-
 src/bin/pg_upgrade/task.c  | 32 +++--
 5 files changed, 98 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 12735a4268..b5b0f7cd6f 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -12,6 +12,7 @@
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h"
 #include "catalog/pg_collation.h"
+#include "catalog/pg_database.h"
 #include "fe_utils/string_utils.h"
 #include "mb/pg_wchar.h"
 #include "pg_upgrade.h"
@@ -19,6 +20,7 @@
 static void check_new_cluster_is_empty(void);
 static void check_is_install_user(ClusterInfo *cluster);
 static void check_proper_datallowconn(ClusterInfo *cluster);
+static void check_for_invalid_dbs(ClusterInfo *cluster);
 static void check_for_prepared_transactions(ClusterInfo *cluster);
 static void check_for_isn_and_int8_passing_mismatch(ClusterInfo *cluster);
 static void check_for_user_defined_postfix_ops(ClusterInfo *cluster);
@@ -598,6 +600,13 @@ check_and_dump_old_cluster(void)
 	 */
 	get_db_rel_and_slot_infos(&old_cluster);
 
+	/*
+	 * Verify that the old cluster does not reference invalid databases.
+	 * Connections to such databases are not allowed and break the following
+	 * checks.
+	 */
+	check_for_invalid_dbs(&old_cluster);
+
 	init_tablespaces();
 
 	get_loadable_libraries();
@@ -691,6 +700,13 @@ check_new_cluster(void)
 {
 	get_db_rel_and_slot_infos(&new_cluster);
 
+	/*
+	 * Verify that the new cluster does not reference invalid databases.
+	 * Connections to such databases are not allowed and break the following
+	 * checks.
+	 */
+	check_for_invalid_dbs(&new_cluster);
+
 	check_new_cluster_is_empty();
 
 	check_loadable_libraries();
@@ -1087,6 +1103,52 @@ check_is_install_user(ClusterInfo *cluster)
 	check_ok();
 }
 
+/*
+ *	check_for_invalid_dbs
+ *
+ *	Ensure that all databases are valid as connections to invalid databases
+ *	are not allowed.
+ */
+static void
+check_for_invalid_dbs(ClusterInfo *cluster)
+{
+	int			dbnum;
+	FILE	   *script = NULL;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for invalid databases");
+
+	snprintf(output_path, sizeof(output_path), "%s/%s",
+			 log_opts.basedir,
+			 "invalid_databases.txt");
+
+	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+	{
+		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+
+		if (active_db-&g

Re: pg_upgrade check for invalid databases

2024-09-30 Thread Thomas Krennwallner

On 30/09/2024 17.29, Daniel Gustafsson wrote:

On 30 Sep 2024, at 16:55, Tom Lane  wrote:



TBH I'm not finding anything very much wrong with the current
behavior... this has to be a rare situation, do we need to add
debatable behavior to make it easier?


One argument would be to make the checks consistent, pg_upgrade generally tries
to report all the offending entries to help the user when fixing the source
database.  Not sure if it's a strong enough argument for carrying code which
really shouldn't see much use though.
In general, I agree that this situation should be rare for deliberate 
DROP DATABASE interrupted in interactive sessions.


Unfortunately, for (popular) tools that perform automatic "temporary 
database" cleanup, we could recently see an increase in invalid databases.


The additional check for pg_upgrade was made necessary due to several 
unrelated customers having invalid databases that stem from left-over 
Prisma Migrate "shadow databases" [1]. We could not reproduce this 
Prisma Migrate issue yet, as those migrations happened some time ago. 
Maybe this bug really stems from a much older Prisma Migrate version and 
we only see the fallout now. This is still a TODO item.


But it appears that this tool can get interrupted "at the wrong time" 
while it is deleting temporary databases (probably a manual Ctrl-C), and 
clients are unaware that this can then leave behind invalid databases.


Those temporary databases do not cause any harm as they are not used 
anymore. But eventually, PG installations will be upgraded to the next 
major version, and it is only then when those invalid databases 
resurface after pg_upgrade fails to run the checks.


Long story short: interactive DROP DATABASE interrupts are rare (they do 
exist, but customers are usually aware). Automation tools on the other 
hand may run DROP DATABASE and when they get interrupted at the wrong 
time they will then produce several left-over invalid databases. 
pg_upgrade will then fail to run the checks.


[1] 
https://www.prisma.io/docs/orm/prisma-migrate/understanding-prisma-migrate/shadow-database