Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Thomas Munro Fri, 22 Apr 2022 00:38:29 -0700

On Wed, Apr 6, 2022 at 5:07 AM Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > > The checkpointer never takes heavyweight locks, so the opportunity
> > > you're describing can't arise.
> >
> > <thinks harder>  Hmm, oh, you probably meant the buffer interlocking
> > in SyncOneBuffer().  It's true that my most recent patch throws away
> > more requests than it could, by doing the level check at the end of
> > the loop over all buffers instead of adding some kind of
> > DropPendingWritebacks() in the barrier handler.  I guess I could find
> > a way to improve that, basically checking the level more often instead
> > of at the end, but I don't know if it's worth it; we're still throwing
> > out an arbitrary percentage of writeback requests.
>
> Doesn't every backend have its own set of pending writebacks?
> BufferAlloc() calls
> ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)?

Yeah. Alright, for my exaggeration above, s/can't arise/probably
won't often arise/, since when regular backends do this it's for
random dirty buffers, not necessarily stuff they hold a relation lock
on.

I think you are right that if we find ourselves calling smgrwrite() on
another buffer for that relfilenode a bit later, then it's safe to
"unzonk" the relation. In the worst case, IssuePendingWritebacks()
might finish up calling sync_file_range() on a file that we originally
wrote to for generation 1 of that relfilenode, even though we now have
a file descriptor for generation 2 of that relfilenode that was
reopened by a later smgrwrite(). That would be acceptable, because a
bogus sync_file_range() call would be harmless. The key point here is
that we absolutely must avoid re-opening files without careful
interlocking, because that could lead to later *writes* for generation
2 going to the defunct generation 1 file that we opened just as it was
being unlinked.

(For completeness: we know of another way for smgrwrite() to write to
the wrong generation of a recycled relfilenode, but that's hopefully
very unlikely and will hopefully be addressed by adding more bits and
killing off OID-wraparound[1]. In this thread we're concerned only
with these weird "explicit OID reycling" cases we're trying to fix
with PROCSIGNAL_BARRIER_SMGRRELEASE sledgehammer-based cache
invalidation.)

My new attempt, attached, is closer to what Andres proposed, except at
the level of md.c segment objects instead of level of SMgrRelation
objects. This avoids the dangling SMgrRelation pointer problem
discussed earlier, and is also a little like your "zonk" idea, except
instead of a zonk flag, the SMgrRelation object is reset to a state
where it knows nothing at all except its relfilenode. Any access
through smgrXXX() functions *except* smgrwriteback() will rebuild the
state, just as you would clear the hypothetical zonk flag.

So, to summarise the new patch that I'm attaching to this email as 0001:

1. When PROCSIGNAL_BARRIER_SMGRRELEASE is handled, we call
smgrreleaseall(), which calls smgrrelease() on all SMgrRelation
objects, telling them to close all their files.

2. All SMgrRelation objects are still valid, and any pointers to them
that were acquired before a CFI() and then used in an smgrXXX() call
afterwards will still work (the example I know of being
RelationCopyStorage()[2]).

3. mdwriteback() internally uses a new "behavior" flag
EXTENSION_DONT_OPEN when getting its hands on the internal segment
object, which says that we should just give up immediately if we don't
already have the file open (concretely: if
PROCSIGNAL_BARRIER_SMGRRELEASE came along between our recent
smgrwrite() call and the writeback machinery's smgrwriteback() call,
it'll do nothing at all).

Someone might say that it's weird that smgrwriteback() has that
behaviour internally, and we might eventually want to make it more
explicit by adding a "behavior" argument to the function itself, so
that it's the caller that controls it. It didn't seem worth it for
now though; the whole thing is a hint to the operating system anyway.

However it seems that I have something wrong, because CI is failing on
Windows; I ran out of time for looking into that today, but wanted to
post what I have so far since I know we have an open item or two to
close here ASAP...

Patches 0002-0004 are Andres's, with minor tweaks:

v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch:

It seems useless to even keep the macro USE_BARRIER_SMGRRELEASE if
we're going to define it always, so I removed it. I guess it's useful
to be able to disable that logic easily to see that the assertion in
the other patch fails, but you can do that by commenting out a line in
ProcessBarrierSmgrRelease().

I'm still slightly confused about whether we need an *extra* global
barrier in DROP TABLESPACE, not just if destroy_tablespace_directory()
failed.

v3-0003-WIP-test-for-file-reuse-dangers-around-database-a.patch

Was missing:

-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pg_prewarm

v3-0004-WIP-AssertFileNotDeleted-fd.patch

+ * XXX: Figure out which operating systems this works on.

Do we want this to be wrapped in some kind of macros that vanishes
into thin air unless you enable it with USE_UNLINKED_FILE_CHECKING or
something? Then we wouldn't care so much if it doesn't work on some
unusual system. Bikeshed mode: I would prefer "not unlinked", since
that has a more specific meaning than "not deleted".

[1]
https://www.postgresql.org/message-id/flat/CA%2BTgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg%40mail.gmail.com#1070c79256f2330ec52f063cdbe2add0
[2]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBfaZ2puQDYV6h2oSWO2QW21_JOXZpha65gWRcmGNCZA%40mail.gmail.com#5deeb3d8adae4daf0da1f09e509eef56

From 5eb4efbbe9155417755a635e44fcfda81ecb6cd2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sat, 2 Apr 2022 17:05:39 +1300
Subject: [PATCH v3 1/4] Rethink PROCSIGNAL_BARRIER_SMGRRELEASE.

With sufficiently bad luck, it was possible for IssuePendingWritebacks()
to reopen relation files just after we'd processed
PROCSIGNAL_BARRIER_SMGRRELEASE but before the file was unlinked by some
other backend.

While commit 4eb21763 initially appeared to have fixed the spate of test
failures we'd seen on Windows, Andres hammered it with some extra
instrumentation to look out for writes to unlinked files and found this
race.

Defend against that by closing md.c's segments, instead of closing the
fd.c's descriptors, and then teaching smgrwriteback() not to open files
that aren't already open.  This means that if
PROCSIGNAL_BARRIER_SMGRRELEASE comes between smgrwrite() and
smgrwriteback(), the writeback is silently skipped, but we don't risk
caching a descriptor to a file that is unlinked.

Reported-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/20220209220004.kb3dgtn2x2k2gtdm%40alap3.anarazel.de
---
 src/backend/storage/smgr/md.c   | 22 +++++++++-------
 src/backend/storage/smgr/smgr.c | 45 +++++++++++++++++++++++++++------
 src/include/storage/md.h        |  1 -
 src/include/storage/smgr.h      |  2 ++
 4 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 286dd3f755..41fa73087b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -116,6 +116,8 @@ static MemoryContext MdCxt;		/* context for all MdfdVec objects */
  * mdnblocks().
  */
 #define EXTENSION_DONT_CHECK_SIZE	(1 << 4)
+/* don't try to open a segment, if not already open */
+#define EXTENSION_DONT_OPEN			(1 << 5)
 
 
 /* local routines */
@@ -551,12 +553,6 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 	}
 }
 
-void
-mdrelease(void)
-{
-	closeAllVfds();
-}
-
 /*
  *	mdprefetch() -- Initiate asynchronous read of the specified block of a relation
  */
@@ -605,11 +601,14 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum,
 					segnum_end;
 
 		v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ ,
-						 EXTENSION_RETURN_NULL);
+						 EXTENSION_DONT_OPEN);
 
 		/*
 		 * We might be flushing buffers of already removed relations, that's
-		 * ok, just ignore that case.
+		 * ok, just ignore that case.  If the segment file wasn't open already
+		 * (ie from a recent mdwrite()), then we don't want to re-open it, to
+		 * avoid a race with PROCSIGNAL_BARRIER_SMGRRELEASE that might leave us
+		 * with a descriptor to a file that is about to be unlinked.
 		 */
 		if (!v)
 			return;
@@ -1202,7 +1201,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 	/* some way to handle non-existent segments needs to be specified */
 	Assert(behavior &
-		   (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL));
+		   (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL |
+			EXTENSION_DONT_OPEN));
 
 	targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
 
@@ -1213,6 +1213,10 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 		return v;
 	}
 
+	/* The caller only wants the segment if we already had it open. */
+	if (behavior & EXTENSION_DONT_OPEN)
+		return NULL;
+
 	/*
 	 * The target segment is not yet open. Iterate over all the segments
 	 * between the last opened and the target segment. This way missing
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 2c7a2b2857..9cbfaa86cb 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -41,7 +41,6 @@ typedef struct f_smgr
 {
 	void		(*smgr_init) (void);	/* may be NULL */
 	void		(*smgr_shutdown) (void);	/* may be NULL */
-	void		(*smgr_release) (void); /* may be NULL */
 	void		(*smgr_open) (SMgrRelation reln);
 	void		(*smgr_close) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
@@ -70,7 +69,6 @@ static const f_smgr smgrsw[] = {
 	{
 		.smgr_init = mdinit,
 		.smgr_shutdown = NULL,
-		.smgr_release = mdrelease,
 		.smgr_open = mdopen,
 		.smgr_close = mdclose,
 		.smgr_create = mdcreate,
@@ -281,6 +279,42 @@ smgrclose(SMgrRelation reln)
 		*owner = NULL;
 }
 
+/*
+ *	smgrrelease() -- Release all resources used by this object.
+ *
+ *	The object remains valid.
+ */
+void
+smgrrelease(SMgrRelation reln)
+{
+	for (int i = 0; i <= MAX_FORKNUM; ++i)
+	{
+		smgrsw[reln->smgr_which].smgr_close(reln, i);
+		reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
+	}
+}
+
+/*
+ *	smgrreleaseall() -- Release resources used by all objects.
+ *
+ *	This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ */
+void
+smgrreleaseall(void)
+{
+	HASH_SEQ_STATUS status;
+	SMgrRelation reln;
+
+	/* Nothing to do if hashtable not set up */
+	if (SMgrRelationHash == NULL)
+		return;
+
+	hash_seq_init(&status, SMgrRelationHash);
+
+	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
+		smgrrelease(reln);
+}
+
 /*
  *	smgrcloseall() -- Close all existing SMgrRelation objects.
  */
@@ -698,11 +732,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	for (int i = 0; i < NSmgr; i++)
-	{
-		if (smgrsw[i].smgr_release)
-			smgrsw[i].smgr_release();
-	}
-
+	smgrreleaseall();
 	return true;
 }
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 6e46d8d96a..ffffa40db7 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,7 +23,6 @@
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
 extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdrelease(void);
 extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8e3ef92cda..6b63c60fbd 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,6 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrrelease(SMgrRelation reln);
+extern void smgrreleaseall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
-- 
2.35.1

From 207a44d8baa210a1b6baa07db8867a226200edc4 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Mon, 21 Feb 2022 16:42:16 -0800
Subject: [PATCH v3 2/4] Fix old-fd issues using global barriers everywhere.

Commit 4eb21763 (and later improvement XXX) introduce a way to force
every backend to closed all relation files, to fix a long-standing
Windows-only problems.

This commit extends that behavior to all operating systems, to handle
a totaly different class of problem: the reuse of relfilenodes in
scenarios that have no other kind of cache invalidation to prevent
file descriptor mix-ups.

In all releases, the problem could happen when you moved a database to
another tablespace and then back again.  In 15, since commit aa010514,
it could happen when using CREATE DATABASE with a user-supplied OID,
particularly as part of pg_upgrade.

Author: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/20220209220004.kb3dgtn2x2k2gtdm%40alap3.anarazel.de
---
 src/backend/commands/dbcommands.c |  6 ------
 src/backend/commands/tablespace.c | 11 ++++-------
 src/include/pg_config_manual.h    | 11 -----------
 3 files changed, 4 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 1bbecfeddf..1171d2b91a 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1683,10 +1683,8 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 */
 	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
 
-#if defined(USE_BARRIER_SMGRRELEASE)
 	/* Close all smgr fds in all backends. */
 	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
-#endif
 
 	/*
 	 * Remove all tablespace subdirs belonging to the database.
@@ -1936,10 +1934,8 @@ movedb(const char *dbname, const char *tblspcname)
 	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
 					  | CHECKPOINT_FLUSH_ALL);
 
-#if defined(USE_BARRIER_SMGRRELEASE)
 	/* Close all smgr fds in all backends. */
 	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
-#endif
 
 	/*
 	 * Now drop all buffers holding data of the target database; they should
@@ -3098,10 +3094,8 @@ dbase_redo(XLogReaderState *record)
 		/* Clean out the xlog relcache too */
 		XLogDropDatabase(xlrec->db_id);
 
-#if defined(USE_BARRIER_SMGRRELEASE)
 		/* Close all sgmr fds in all backends. */
 		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
-#endif
 
 		for (i = 0; i < xlrec->ntablespaces; i++)
 		{
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 40514ab550..822d65287e 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -548,11 +548,10 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 		 * use a global barrier to ask all backends to close all files, and
 		 * wait until they're finished.
 		 */
-#if defined(USE_BARRIER_SMGRRELEASE)
 		LWLockRelease(TablespaceCreateLock);
 		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
 		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
-#endif
+
 		/* And now try again. */
 		if (!destroy_tablespace_directories(tablespaceoid, false))
 		{
@@ -1574,6 +1573,9 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		/* Close all smgr fds in all backends. */
+		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
@@ -1591,11 +1593,6 @@ tblspc_redo(XLogReaderState *record)
 		 */
 		if (!destroy_tablespace_directories(xlrec->ts_id, true))
 		{
-#if defined(USE_BARRIER_SMGRRELEASE)
-			/* Close all smgr fds in all backends. */
-			WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
-#endif
-
 			ResolveRecoveryConflictWithTablespace(xlrec->ts_id);
 
 			/*
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 84ce5a4a5d..8d2e3e3a57 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -152,17 +152,6 @@
 #define EXEC_BACKEND
 #endif
 
-/*
- * If USE_BARRIER_SMGRRELEASE is defined, certain code paths that unlink
- * directories will ask other backends to close all smgr file descriptors.
- * This is enabled on Windows, because otherwise unlinked but still open files
- * can prevent rmdir(containing_directory) from succeeding.  On other
- * platforms, it can be defined to exercise those code paths.
- */
-#if defined(WIN32)
-#define USE_BARRIER_SMGRRELEASE
-#endif
-
 /*
  * Define this if your operating system supports link()
  */
-- 
2.35.1

From 665cdeb733d61c8703f1b44e380b8b2aa6d5a859 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Mon, 21 Feb 2022 15:44:02 -0800
Subject: [PATCH v3 3/4] WIP: test for file reuse dangers around database and
 tablespace commands.

---
 src/test/recovery/Makefile                   |   2 +-
 src/test/recovery/t/032_relfilenode_reuse.pl | 233 +++++++++++++++++++
 2 files changed, 234 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/032_relfilenode_reuse.pl

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index da5b9ff397..c47eee273b 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-------------------------------------------------------------------------
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pg_prewarm
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/032_relfilenode_reuse.pl b/src/test/recovery/t/032_relfilenode_reuse.pl
new file mode 100644
index 0000000000..22d8e85614
--- /dev/null
+++ b/src/test/recovery/t/032_relfilenode_reuse.pl
@@ -0,0 +1,233 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use File::Basename;
+
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', q[
+allow_in_place_tablespaces = true
+log_connections=on
+# to avoid "repairing" corruption
+full_page_writes=off
+log_min_messages=debug2
+autovacuum_naptime=1s
+shared_buffers=1MB
+]);
+$node_primary->start;
+
+
+# Create streaming standby linking to primary
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# To avoid hanging while expecting some specific input from a psql
+# instance being driven by us, add a timeout high enough that it
+# should never trigger even on very slow machines, unless something
+# is really wrong.
+my $psql_timeout = IPC::Run::timer(300);
+
+my %psql_primary = (stdin => '', stdout => '', stderr => '');
+$psql_primary{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_primary->connstr('postgres') ],
+	'<',
+	\$psql_primary{stdin},
+	'>',
+	\$psql_primary{stdout},
+	'2>',
+	\$psql_primary{stderr},
+	$psql_timeout);
+
+my %psql_standby = ('stdin' => '', 'stdout' => '', 'stderr' => '');
+$psql_standby{run} = IPC::Run::start(
+	[ 'psql', '-XA', '-f', '-', '-d', $node_standby->connstr('postgres') ],
+	'<',
+	\$psql_standby{stdin},
+	'>',
+	\$psql_standby{stdout},
+	'2>',
+	\$psql_standby{stderr},
+	$psql_timeout);
+
+
+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.
+
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db_template OID = 50000;");
+$node_primary->safe_psql('conflict_db_template', q[
+    CREATE TABLE large(id serial primary key, dataa text, datab text);
+    INSERT INTO large(dataa, datab) SELECT g.i::text, 1 FROM generate_series(1, 4000) g(i);]);
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;");
+
+$node_primary->safe_psql('postgres', q[
+    CREATE EXTENSION pg_prewarm;
+    CREATE TABLE replace_sb(data text);
+    INSERT INTO replace_sb(data) SELECT random()::text FROM generate_series(1, 15000);]);
+
+# Use longrunning transactions, so that AtEOXact_SMgr doesn't close files
+send_query_and_wait(
+	\%psql_primary,
+	q[BEGIN;],
+	qr/BEGIN/m);
+send_query_and_wait(
+	\%psql_standby,
+	q[BEGIN;],
+	qr/BEGIN/m);
+
+# Cause lots of dirty rows in shared_buffers
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 1;");
+
+# Now do a bunch of work in another database. That will end up needing to
+# write back dirty data from the previous step, opening the relevant file
+# descriptors
+cause_eviction(\%psql_primary, \%psql_standby);
+
+# drop and recreate database
+$node_primary->safe_psql('postgres', "DROP DATABASE conflict_db;");
+$node_primary->safe_psql('postgres', "CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;");
+
+verify($node_primary, $node_standby, 1,
+	   "initial contents as expected");
+
+# Again cause lots of dirty rows in shared_buffers, but use a different update
+# value so we can check everything is OK
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 2;");
+
+# Again cause a lot of IO. That'll again write back dirty data, but uses (XXX
+# adjust after bugfix) the already opened file descriptor.
+# FIXME
+cause_eviction(\%psql_primary, \%psql_standby);
+
+verify($node_primary, $node_standby, 2,
+	   "update to reused relfilenode (due to DB oid conflict) is not lost");
+
+
+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+	   "restored contents as expected");
+
+# Test for old filehandles after moving a database in / out of tablespace
+$node_primary->safe_psql('postgres', q[CREATE TABLESPACE test_tablespace LOCATION '']);
+
+# cause dirty buffers
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 4;");
+# cause files to be opened in backend in other database
+cause_eviction(\%psql_primary, \%psql_standby);
+
+# move database back / forth
+$node_primary->safe_psql('postgres', 'ALTER DATABASE conflict_db SET TABLESPACE test_tablespace');
+$node_primary->safe_psql('postgres', 'ALTER DATABASE conflict_db SET TABLESPACE pg_default');
+
+# cause dirty buffers
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 5;");
+cause_eviction(\%psql_primary, \%psql_standby);
+
+verify($node_primary, $node_standby, 5,
+	   "post move contents as expected");
+
+$node_primary->safe_psql('postgres', 'ALTER DATABASE conflict_db SET TABLESPACE test_tablespace');
+
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");
+cause_eviction(\%psql_primary, \%psql_standby);
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 8;");
+$node_primary->safe_psql('postgres', 'DROP DATABASE conflict_db');
+$node_primary->safe_psql('postgres', 'DROP TABLESPACE test_tablespace');
+
+$node_primary->safe_psql('postgres', 'REINDEX TABLE pg_database');
+
+
+# explicitly shut down psql instances gracefully - to avoid hangs
+# or worse on windows
+$psql_primary{stdin} .= "\\q\n";
+$psql_primary{run}->finish;
+$psql_standby{stdin} .= "\\q\n";
+$psql_standby{run}->finish;
+
+$node_primary->stop();
+$node_standby->stop();
+
+# Make sure that there weren't crashes during shutdown
+
+command_like([ 'pg_controldata', $node_primary->data_dir ],
+	qr/Database cluster state:\s+shut down\n/, 'primary shut down ok');
+command_like([ 'pg_controldata', $node_standby->data_dir ],
+	qr/Database cluster state:\s+shut down in recovery\n/, 'standby shut down ok');
+done_testing();
+
+sub verify
+{
+	my ($primary, $standby, $counter, $message) = @_;
+
+	my $query = "SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10";
+	is($primary->safe_psql('conflict_db', $query),
+	   "$counter|4000",
+	   "primary: $message");
+
+	$primary->wait_for_catchup($standby);
+	is($standby->safe_psql('conflict_db', $query),
+	   "$counter|4000",
+	   "standby: $message");
+}
+
+sub cause_eviction
+{
+	my ($psql_primary, $psql_standby) = @_;
+
+	send_query_and_wait(
+		$psql_primary,
+		q[SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;],
+		qr/warmed_buffers/m);
+
+	send_query_and_wait(
+		$psql_standby,
+		q[SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;],
+		qr/warmed_buffers/m);
+}
+
+# Send query, wait until string matches
+sub send_query_and_wait
+{
+	my ($psql, $query, $untl) = @_;
+	my $ret;
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1)
+	{
+		last if $$psql{stdout} =~ /$untl/;
+
+		if ($psql_timeout->is_expired)
+		{
+			BAIL_OUT("aborting wait: program timed out\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable())
+		{
+			BAIL_OUT("aborting wait: program died\n"
+				  . "stream contents: >>$$psql{stdout}<<\n"
+				  . "pattern searched for: $untl\n");
+			return 0;
+		}
+		$$psql{run}->pump();
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
-- 
2.35.1

From a12c061e612a898c30afeac247155f4fa4150930 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Mon, 21 Feb 2022 15:41:23 -0800
Subject: [PATCH v3 4/4] WIP: AssertFileNotDeleted(fd).

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/transam/slru.c     |  2 +
 src/backend/access/transam/xlog.c     |  2 +
 src/backend/replication/walreceiver.c |  2 +
 src/backend/storage/file/fd.c         | 83 +++++++++++++++++++++++++++
 src/include/storage/fd.h              |  1 +
 5 files changed, 90 insertions(+)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index b65cb49d7f..654da64128 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -871,6 +871,8 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruWriteAll fdata)
 		}
 	}
 
+	AssertFileNotDeleted(fd);
+
 	errno = 0;
 	pgstat_report_wait_start(WAIT_EVENT_SLRU_WRITE);
 	if (pg_pwrite(fd, shared->page_buffer[slotno], BLCKSZ, offset) != BLCKSZ)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5eabd32cf6..8d0ea6596e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2188,6 +2188,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				if (track_wal_io_timing)
 					INSTR_TIME_SET_CURRENT(start);
 
+				AssertFileNotDeleted(openLogFile);
+
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 3c9411e221..37d92c1aba 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -912,6 +912,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
 		else
 			segbytes = nbytes;
 
+		AssertFileNotDeleted(recvFile);
+
 		/* OK to write the logs */
 		errno = 0;
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 24704b6a02..1fc811ec1b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -93,6 +93,7 @@
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/pg_prng.h"
+#include "common/string.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/pg_iovec.h"
@@ -2073,6 +2074,8 @@ FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return returnCode;
 
+	AssertFileNotDeleted(VfdCache[file].fd);
+
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
 							   POSIX_FADV_WILLNEED);
@@ -2103,6 +2106,11 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 	if (returnCode < 0)
 		return;
 
+	/*
+	 * XXX: can't assert non-use of fd right now,
+	 * ScheduleBufferTagForWriteback can end up writing at a later time.
+	 */
+
 	pgstat_report_wait_start(wait_event_info);
 	pg_flush_data(VfdCache[file].fd, offset, nbytes);
 	pgstat_report_wait_end();
@@ -2128,6 +2136,8 @@ FileRead(File file, char *buffer, int amount, off_t offset,
 
 	vfdP = &VfdCache[file];
 
+	AssertFileNotDeleted(vfdP->fd);
+
 retry:
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
@@ -2184,6 +2194,8 @@ FileWrite(File file, char *buffer, int amount, off_t offset,
 
 	vfdP = &VfdCache[file];
 
+	AssertFileNotDeleted(vfdP->fd);
+
 	/*
 	 * If enforcing temp_file_limit and it's a temp file, check to see if the
 	 * write would overrun temp_file_limit, and throw error if so.  Note: it's
@@ -2276,6 +2288,8 @@ FileSync(File file, uint32 wait_event_info)
 	if (returnCode < 0)
 		return returnCode;
 
+	AssertFileNotDeleted(VfdCache[file].fd);
+
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = pg_fsync(VfdCache[file].fd);
 	pgstat_report_wait_end();
@@ -2297,6 +2311,8 @@ FileSize(File file)
 			return (off_t) -1;
 	}
 
+	AssertFileNotDeleted(VfdCache[file].fd);
+
 	return lseek(VfdCache[file].fd, 0, SEEK_END);
 }
 
@@ -2314,6 +2330,8 @@ FileTruncate(File file, off_t offset, uint32 wait_event_info)
 	if (returnCode < 0)
 		return returnCode;
 
+	AssertFileNotDeleted(VfdCache[file].fd);
+
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = ftruncate(VfdCache[file].fd, offset);
 	pgstat_report_wait_end();
@@ -3828,6 +3846,71 @@ data_sync_elevel(int elevel)
 	return data_sync_retry ? elevel : PANIC;
 }
 
+void
+AssertFileNotDeleted(int fd)
+{
+	struct stat statbuf;
+	int			ret;
+	char		deleted_filename[MAXPGPATH];
+	bool		have_filename = false;
+
+	/*
+	 * fstat shouldn't fail, so it seems ok to error out, even if it's
+	 * just a debugging aid.
+	 *
+	 * XXX: Figure out which operating systems this works on.
+	 */
+	ret = fstat(fd, &statbuf);
+	if (ret != 0)
+		elog(ERROR, "fstat failed: %m");
+
+	/*
+	 * On several operating systems st_nlink == 0 indicates that the file has
+	 * been deleted. On some OS/filesystem combinations a deleted file may
+	 * still show up with nlink > 0, but nlink == 0 shouldn't be returned
+	 * spuriously. Hardlinks obviously can prevent this from working, but we
+	 * don't expect any, so that's fine.
+	 */
+	if (statbuf.st_nlink > 0)
+		return;
+
+#if defined(__linux__)
+	{
+		char        path[MAXPGPATH];
+		const char *const deleted_suffix = " (deleted)";
+
+		/*
+		 * On linux we can figure out what the file name
+		 */
+		sprintf(path, "/proc/self/fd/%d", fd);
+		ret = readlink(path, deleted_filename, sizeof(deleted_filename) - 1);
+
+		// FIXME: Tolerate most errors here
+		if (ret == -1)
+			elog(PANIC, "readlink failed: %m");
+
+		/* readlink doesn't null terminate */
+		deleted_filename[ret] = 0;
+		have_filename = true;
+
+		/* chop off the " (deleted)" */
+		if (pg_str_endswith(deleted_filename, deleted_suffix))
+		{
+			Size		len = strlen(deleted_filename);
+
+			deleted_filename[len - strlen(deleted_suffix)] = 0;
+		}
+	}
+#endif
+
+	if (have_filename)
+		elog(PANIC, "file descriptor %d for file %s is of a deleted file",
+			 fd, deleted_filename);
+	else
+		elog(PANIC, "file descriptor %d is of a deleted file",
+			 fd);
+}
+
 /*
  * A convenience wrapper for pg_pwritev() that retries on partial write.  If an
  * error is returned, it is unspecified how much has been written.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 69549b000f..31513a965c 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -191,6 +191,7 @@ extern int	durable_rename_excl(const char *oldfile, const char *newfile, int log
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
+extern void AssertFileNotDeleted(int fd);
 /* Filename components */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
 #define PG_TEMP_FILE_PREFIX "pgsql_tmp"
-- 
2.35.1

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Reply via email to