I spent some more time digging into this, experimenting with different approaches. Came up with pretty significant changes; see below:

On 18/09/2023 18:19, Robert Haas wrote:
I think that if you believe 0001 to be correct you should go ahead and
commit it sooner rather than later. If you're wrong and something
weird starts happening we'll then have a chance to notice that before
too much other stuff gets changed on top of this and confuses the
matter.

+1

On Wed, Aug 23, 2023 at 12:55 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
Right, let's one find one good place.  I think smgropen() would be best.

I think it would be a good idea to give this comment a bit more oomph.
In lieu of this:

+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transactional backends.

I would leave the existing "This does not attempt to actually open the
underlying files." comment as a separate comment, and add something
like this as a new paragraph:

In versions of PostgreSQL prior to 17, this function returned an
object with no defined lifetime. Now, however, the object remains
valid for the lifetime of the transaction, up to the point where
AtEOXact_SMgr() is called, making it much easier for callers to know
for how long they can hold on to a pointer to the returned object. If
this function is called outside of a transaction, the object remains
valid until smgrdestroy() or smgrdestroyall() is called. Background
processes that use smgr but not transactions typically do this once
per checkpoint cycle.

+1

Apart from that, the main thing that is bothering me is that the
justification for redefining smgrclose() to do what smgrrelease() now
does isn't really spelled out anywhere. You mentioned some reasons and
Heikki mentioned the benefit to extensions, but I feel like somebody
should be able to understand the reasoning clearly from reading the
commit message or the comments in the patch, rather than having to
visit the mailing list discussion, and I'm not sure we're there yet. I
feel like I understood why we were doing this and was convinced it was
a good idea at some point, but now the reasoning has gone out of my
head and I can't recover it. If somebody does smgropen() .. stuff ...
smgrclose() as in heapam_relation_copy_data() or index_copy_data(),
this change has the effect of making the SmgrRelation remain valid
until eoxact whereas before it would have been destroyed instantly. Is
that what we want? Presumably yes, or this patch wouldn't be shaped
like this, but I don't know why we want that...

Fair. I tried to address that by adding an overview comment at top of smgr.c, explaining how this stuff work. I hope that helps.

Another thing that seems a bit puzzling is how this is intended to, or
does, interact with the ownership mechanism. Intuitively, I feel like
a Relation owns an SMgrRelation *because* the SMgrRelation has no
defined lifetime. But I think that's not quite right. I guess the
ownership mechanism doesn't guarantee anything about the lifetime of
the object, just that the pointer in the Relation won't hang around
any longer than the object to which it's pointing. So does that mean
that we're free to redefine the object lifetime to be pretty much
anything we want and that mechanism doesn't really care? Huh,
interesting.

Yeah that owner mechanism is weird. It guarantees that the pointer to the SMgrRelation is cleared when the SMgrRelation is destroyed. But it also prevents the SMgrRelation from being destroyed at end of transaction. That's how it is in 'master' too.

But with this patch, we don't normally call smgrdestroy() on an SMgrRelation that came from the relation cache. We do call smgrdestroyall() in the aux processes, but they don't have a relcache. So the real effect of setting the owner now is to prevent the SMgrRelation from being destroyed at end of transaction; the mechanism of clearing the pointer is unused.

I found two exceptions to that, though, by adding some extra assertions and running the regression tests:

1. The smgrdestroyall() in a single-user backend in RequestCheckpoint(). It destroys SMgrRelations belonging to relcache entries, and the owner mechanism clears the pointers from the relcache. I think smgrcloseall(), or doing nothing, would actually be more appropriate there.

2. A funny case with foreign tables: ANALYZE on a foreign table calls visibilitymap_count(). A foreign table has no visibility map so it returns 0, but before doing so it calls RelationGetSmgr on the foreign table, which has 0/0/0 rellocator. That creates an SMgrRelation for 0/0/0, and sets the foreign table's relcache entry as its owner. If you then call ANALYZE on another foreign table, it also calls RelationGetSmgr with 0/0/0 rellocator, returning the same SMgrRelation entry, and changes its owner to the new relcache entry. That doesn't make much sense and it's pretty accidental that it works at all, so attached is a patch to avoid calling visibilitymap_count() on foreign tables.

I propose that we replace the single owner with a "pin counter". One SMgrRelation can have more than one pin on it, and the guarantee is that as long as the pin counter is non-zero, the SMgrRelation cannot be destroyed and the pointer remains valid. We don't really need the capability for more than one pin at the moment (the regression tests pass with an extra assertion that pincount <= 1 after fixing the foreign table issue), but it seems more straightforward than tracking an owner.

Here's another reason to do that: I noticed this at the end of swap_relation_files():

        /*
         * Close both relcache entries' smgr links.  We need this kluge because
         * both links will be invalidated during upcoming 
CommandCounterIncrement.
         * Whichever of the rels is the second to be cleared will have a 
dangling
         * reference to the other's smgr entry.  Rather than trying to avoid 
this
         * by ordering operations just so, it's easiest to close the links 
first.
         * (Fortunately, since one of the entries is local in our transaction,
         * it's sufficient to clear out our own relcache this way; the problem
         * cannot arise for other backends when they see our update on the
         * non-transient relation.)
         *
         * Caution: the placement of this step interacts with the decision to
         * handle toast rels by recursion.  When we are trying to rebuild 
pg_class
         * itself, the smgr close on pg_class must happen after all accesses in
         * this function.
         */
        RelationCloseSmgrByOid(r1);
        RelationCloseSmgrByOid(r2);

If RelationCloseSmgrByOid() merely closes the underlying file descriptor but doesn't destroy the SMgrRelation object - as it does with these patches - I think we reintroduce the dangling reference problem that the comment mentions. But if we allow the same SMgrRelation to be pinned by two different relcache entries, the problem goes away and we can remove that kluge.

I think we're missing test coverage for that though. I commented out those calls in 'master' and ran the regression tests, but got no failures. I don't fully understand the problem anyway. Or does it not exist anymore? Is there a moment where we have two relcache entries pointing to the same SMgrRelation? I don't see it. In any case, with a pin counter mechanism, I believe it would be fine.


Summary of the changes to the attached main patch:

* Added an overview comment at top of smgr.c

* Added the comment Robert suggested to smgropen()

* Replaced the single owner with a pin count and smgrpin() / smgrunpin() functions. smgrdestroyall() now only destroys unpinned entries

* Removed that kluge from swap_relation_files(). It should not be needed anymore with the pin counter.

* Changed a few places in bufmgr.c where we called RelationGetSmgr() on every smgr call to keep the SMgrRelation in a local variable. That was not safe before, but is now. I don't think we should go on a spree to change all callers - RelationGetSmgr() is still cheap - but in these few places it seems worthwhile.

* I kept the separate smgrclose() and smgrrelease() functions. I know I suggested to just change smgrclose() to do what smgrrelease() did, but on second thoughts keeping them separate seems nicer. However, sgmgrclose() just calls smgrrelease() now, so the distinction is just pro forma. The idea is that if you call smgrclose(), you hint that you don't need that SMgrRelation pointer anymore, although there might be other pointers to the same object and they stay valid.

--
Heikki Linnakangas
Neon (https://neon.tech)
From 23530bcdd36ba7dafbaa7aaf29890dc528d07fad Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 23 Aug 2023 14:38:38 +1200
Subject: [PATCH 1/3] Remove some obsolete smgrcloseall() calls.

Before the advent of PROCSIGNAL_BARRIER_SMGRRELEASE, we didn't have a
comprehensive way to deal with Windows file handles that get in the way
of unlinking directories.  We had smgrcloseall() calls in a few places
to try to mitigate.

It's still a good idea for bgwriter and checkpointer to do that once per
checkpoint so they don't accumulate unbounded SMgrRelation objects, but
there is no longer any reason to close them at other random places such
as the error path, and the explanation as given in the comments is now
obsolete.

Discussion: https://postgr.es/m/message-id/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
---
 src/backend/postmaster/bgwriter.c     | 7 -------
 src/backend/postmaster/checkpointer.c | 7 -------
 src/backend/postmaster/walwriter.c    | 7 -------
 3 files changed, 21 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d02dc17b9c1..e14c88ac060 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -197,13 +197,6 @@ BackgroundWriterMain(void)
 		 */
 		pg_usleep(1000000L);
 
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
-
 		/* Report wait end here, when there is no further possibility of wait */
 		pgstat_report_wait_end();
 	}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 42c807d3528..89ab97bdbc7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -301,13 +301,6 @@ CheckpointerMain(void)
 		 * fast as we can.
 		 */
 		pg_usleep(1000000L);
-
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
 	}
 
 	/* We can now handle ereport(ERROR) */
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 48bc92205b5..8789ffb01d0 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -189,13 +189,6 @@ WalWriterMain(void)
 		 * fast as we can.
 		 */
 		pg_usleep(1000000L);
-
-		/*
-		 * Close all open files after any error.  This is helpful on Windows,
-		 * where holding deleted files open causes various strange errors.
-		 * It's not clear we need it elsewhere, but shouldn't hurt.
-		 */
-		smgrcloseall();
 	}
 
 	/* We can now handle ereport(ERROR) */
-- 
2.39.2

From 7892cb9642286b81cdb45526f8d23117ed892578 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Wed, 29 Nov 2023 14:07:52 +0200
Subject: [PATCH 2/3] Don't try to open visibilitymap when analyzing a foreign
 table.

Also add an assertion in smgropen() that the relnumber is valid.
---
 src/backend/commands/analyze.c  | 5 ++++-
 src/backend/storage/smgr/smgr.c | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e264ffdcf28..1f4a9516814 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -634,7 +634,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	{
 		BlockNumber relallvisible;
 
-		visibilitymap_count(onerel, &relallvisible, NULL);
+		if (RELKIND_HAS_STORAGE(onerel->rd_rel->relkind))
+			visibilitymap_count(onerel, &relallvisible, NULL);
+		else
+			relallvisible = 0;
 
 		/* Update pg_class for table relation */
 		vac_update_relstats(onerel,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c3..4c552649336 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -153,6 +153,8 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 	SMgrRelation reln;
 	bool		found;
 
+	Assert(RelFileNumberIsValid(rlocator.relNumber));
+
 	if (SMgrRelationHash == NULL)
 	{
 		/* First time through: initialize the hash table */
-- 
2.39.2

From 0c62df4fa2be891d61e380b354e6df6c7f3c5aff Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 10 Aug 2023 18:16:31 +1200
Subject: [PATCH 3/3] Give SMgrRelation pointers a well-defined lifetime.

After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.

Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.

smgrclose() is now just an alias for smgrrelease(). It closes files
and forgets all state except the rlocator, but keeps the SMgrRelation
object valid.

A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.

The short version:

 * smgrclose() is now just an alias for smgrrelease(). It releases
   resources, but doesn't destroy until EOX
 * smgrdestroy() now frees memory, and should rarely be used.

Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped.  Such code would
normally hold a lock on the relation.

This also replaces the "ownership" mechanism of SMgrRelations with a
pin counter.  An SMgrRelation can now be "pinned", which prevents it
from being destroyed at end of transaction.  There can be multiple pins
on the same SMgrRelation.  In practice, the pin mechanism is only used
by the relcache, so there cannot be more than one pin on the same
SMgrRelation.  Except with swap_relation_files XXX

Author: Thomas Munro, Heikki Linnakangas
Reviewed-by: Robert Haas <robertmh...@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c    |  15 +-
 src/backend/commands/cluster.c            |  19 --
 src/backend/postmaster/bgwriter.c         |   8 +-
 src/backend/postmaster/checkpointer.c     |  15 +-
 src/backend/storage/buffer/bufmgr.c       |  32 +--
 src/backend/storage/freespace/freespace.c |   9 +-
 src/backend/storage/smgr/smgr.c           | 234 ++++++++++++----------
 src/backend/utils/cache/inval.c           |   2 +-
 src/backend/utils/cache/relcache.c        |  21 +-
 src/include/storage/smgr.h                |  36 ++--
 src/include/utils/rel.h                   |  18 +-
 src/include/utils/relcache.h              |   2 -
 12 files changed, 184 insertions(+), 227 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205d..e7d55dd9b03 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -616,7 +616,11 @@ CreateFakeRelcacheEntry(RelFileLocator rlocator)
 	rel->rd_lockInfo.lockRelId.dbId = rlocator.dbOid;
 	rel->rd_lockInfo.lockRelId.relId = rlocator.relNumber;
 
-	rel->rd_smgr = NULL;
+	/*
+	 * Set up a non-pinned SMgrRelation reference, so that we don't need to
+	 * worry about unpinning it on error.
+	 */
+	rel->rd_smgr = smgropen(rlocator, InvalidBackendId);
 
 	return rel;
 }
@@ -627,9 +631,6 @@ CreateFakeRelcacheEntry(RelFileLocator rlocator)
 void
 FreeFakeRelcacheEntry(Relation fakerel)
 {
-	/* make sure the fakerel is not referenced by the SmgrRelation anymore */
-	if (fakerel->rd_smgr != NULL)
-		smgrclearowner(&fakerel->rd_smgr, fakerel->rd_smgr);
 	pfree(fakerel);
 }
 
@@ -656,10 +657,10 @@ XLogDropDatabase(Oid dbid)
 	/*
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
-	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * that it's not worth introducing a variant of smgrdestroy for just this
+	 * purpose.
 	 */
-	smgrcloseall();
+	smgrdestroyall();
 
 	forget_invalid_pages_db(dbid);
 }
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a3bef6ac34f..7113362eb36 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1422,25 +1422,6 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 	heap_freetuple(reltup2);
 
 	table_close(relRelation, RowExclusiveLock);
-
-	/*
-	 * Close both relcache entries' smgr links.  We need this kluge because
-	 * both links will be invalidated during upcoming CommandCounterIncrement.
-	 * Whichever of the rels is the second to be cleared will have a dangling
-	 * reference to the other's smgr entry.  Rather than trying to avoid this
-	 * by ordering operations just so, it's easiest to close the links first.
-	 * (Fortunately, since one of the entries is local in our transaction,
-	 * it's sufficient to clear out our own relcache this way; the problem
-	 * cannot arise for other backends when they see our update on the
-	 * non-transient relation.)
-	 *
-	 * Caution: the placement of this step interacts with the decision to
-	 * handle toast rels by recursion.  When we are trying to rebuild pg_class
-	 * itself, the smgr close on pg_class must happen after all accesses in
-	 * this function.
-	 */
-	RelationCloseSmgrByOid(r1);
-	RelationCloseSmgrByOid(r2);
 }
 
 /*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e14c88ac060..ffdcbe2d902 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -239,10 +239,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 89ab97bdbc7..46c3f034d5d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -442,10 +442,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -928,11 +930,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7c67d504cd..9beeab9d04e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -934,10 +934,6 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	{
 		LockRelationForExtension(bmr.rel, ExclusiveLock);
 
-		/* could have been closed while waiting for lock */
-		if (bmr.rel)
-			bmr.smgr = RelationGetSmgr(bmr.rel);
-
 		/* recheck, fork might have been created concurrently */
 		if (!smgrexists(bmr.smgr, fork))
 			smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY);
@@ -1897,11 +1893,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	 * we get the lock.
 	 */
 	if (!(flags & EB_SKIP_EXTENSION_LOCK))
-	{
 		LockRelationForExtension(bmr.rel, ExclusiveLock);
-		if (bmr.rel)
-			bmr.smgr = RelationGetSmgr(bmr.rel);
-	}
 
 	/*
 	 * If requested, invalidate size cache, so that smgrnblocks asks the
@@ -4155,6 +4147,7 @@ FlushRelationBuffers(Relation rel)
 {
 	int			i;
 	BufferDesc *bufHdr;
+	SMgrRelation srel = RelationGetSmgr(rel);
 
 	if (RelationUsesLocalBuffers(rel))
 	{
@@ -4183,7 +4176,7 @@ FlushRelationBuffers(Relation rel)
 
 				io_start = pgstat_prepare_io_time();
 
-				smgrwrite(RelationGetSmgr(rel),
+				smgrwrite(srel,
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
 						  localpage,
@@ -4229,7 +4222,7 @@ FlushRelationBuffers(Relation rel)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, RelationGetSmgr(rel), IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+			FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr);
 		}
@@ -4442,13 +4435,17 @@ void
 CreateAndCopyRelationData(RelFileLocator src_rlocator,
 						  RelFileLocator dst_rlocator, bool permanent)
 {
-	RelFileLocatorBackend rlocator;
 	char		relpersistence;
+	SMgrRelation src_rel;
+	SMgrRelation dst_rel;
 
 	/* Set the relpersistence. */
 	relpersistence = permanent ?
 		RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED;
 
+	src_rel = smgropen(src_rlocator, InvalidBackendId);
+	dst_rel = smgropen(dst_rlocator, InvalidBackendId);
+
 	/*
 	 * Create and copy all forks of the relation.  During create database we
 	 * have a separate cleanup mechanism which deletes complete database
@@ -4465,9 +4462,9 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 	for (ForkNumber forkNum = MAIN_FORKNUM + 1;
 		 forkNum <= MAX_FORKNUM; forkNum++)
 	{
-		if (smgrexists(smgropen(src_rlocator, InvalidBackendId), forkNum))
+		if (smgrexists(src_rel, forkNum))
 		{
-			smgrcreate(smgropen(dst_rlocator, InvalidBackendId), forkNum, false);
+			smgrcreate(dst_rel, forkNum, false);
 
 			/*
 			 * WAL log creation if the relation is persistent, or this is the
@@ -4481,15 +4478,6 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
 										   permanent);
 		}
 	}
-
-	/* close source and destination smgr if exists. */
-	rlocator.backend = InvalidBackendId;
-
-	rlocator.locator = src_rlocator;
-	smgrcloserellocator(rlocator);
-
-	rlocator.locator = dst_rlocator;
-	smgrcloserellocator(rlocator);
 }
 
 /* ---------------------------------------------------------------------
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index fb9440ff72f..71b386137f0 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -532,14 +532,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 {
 	BlockNumber blkno = fsm_logical_to_physical(addr);
 	Buffer		buf;
-	SMgrRelation reln;
-
-	/*
-	 * Caution: re-using this smgr pointer could fail if the relcache entry
-	 * gets closed.  It's safe as long as we only do smgr-level operations
-	 * between here and the last use of the pointer.
-	 */
-	reln = RelationGetSmgr(rel);
+	SMgrRelation reln = RelationGetSmgr(rel);
 
 	/*
 	 * If we haven't cached the size of the FSM yet, check it first.  Also
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4c552649336..419fba0b2b9 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -3,8 +3,42 @@
  * smgr.c
  *	  public interface routines to storage manager switch.
  *
- *	  All file system operations in POSTGRES dispatch through these
- *	  routines.
+ * All file system operations on relations dispatch through these routines.
+ * An SMgrRelation represents physical on-disk relation files that are open
+ * for reading and writing.
+ *
+ * When a relation is first accessed through the relation cache, the
+ * corresponding SMgrRelation entry is opened by calling smgropen(), and the
+ * reference is stored in the relation cache entry.
+ *
+ * Accesses that don't go through the relation cache open the SMgrRelation
+ * directly.  That includes flushing buffers from the buffer cache, as well as
+ * all accesses in auxiliary processes like the checkpointer or the WAL redo
+ * in the startup process.
+ *
+ * Operations like CREATE, DROP, ALTER TABLE also hold SMgrRelation references
+ * independent of the relation cache.  They need to prepare the physical files
+ * before updating the relation cache.
+ *
+ * There is a hash table that holds all the SMgrRelation entries in the
+ * backend.  If you call smgropen() twice for the same rel locator, you get a
+ * reference to the same SMgrRelation. The reference is valid until the end of
+ * transaction.  This makes repeated access to the same relation efficient,
+ * and allows caching things like the relation size in the SMgrRelation entry.
+ *
+ * At end of transaction, all SMgrRelation entries that haven't been pinned
+ * are removed.  An SMgrRelation can hold kernel file system descriptors for
+ * the underlying files, and we'd like to close those reasonably soon if the
+ * file gets deleted.  The SMgrRelations references held by the relcache are
+ * pinned to prevent them from being closed.
+ *
+ * There is another mechanism to close file descriptors early:
+ * PROCSIGNAL_BARRIER_SMGRRELEASE.  It is a request to immediately close all
+ * file descriptors.  Upon receiveing that signal, the backend closes all file
+ * descriptors held open by SMgrRelations, but because it can happen in the
+ * middle of a transaction, we cannot destroy the SMgrRelation objects
+ * themselves, as there could pointers to them in active use.  See
+ * smgrrelease() and smgrreleaseall().
  *
  * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -93,14 +127,15 @@ static const int NSmgr = lengthof(smgrsw);
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
- * In addition, "unowned" SMgrRelation objects are chained together in a list.
+ * In addition, "unpinned" SMgrRelation objects are chained together in a list.
  */
 static HTAB *SMgrRelationHash = NULL;
 
-static dlist_head unowned_relns;
+static dlist_head unpinned_relns;
 
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
+static void smgrdestroy(SMgrRelation reln);
 
 
 /*
@@ -144,7 +179,16 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * In versions of PostgreSQL prior to 17, this function returned an object
+ * with no defined lifetime.  Now, however, the object remains valid for the
+ * lifetime of the transaction, up to the point where AtEOXact_SMgr() is
+ * called, making it much easier for callers to know for how long they can
+ * hold on to a pointer to the returned object.  If this function is called
+ * outside of a transaction, the object remains valid until smgrdestroy() or
+ * smgrdestroyall() is called.  Background processes that use smgr but not
+ * transactions typically do this once per checkpoint cycle.
+ *
+ * This does not attempt to actually open the underlying files.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -164,7 +208,7 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		ctl.entrysize = sizeof(SMgrRelationData);
 		SMgrRelationHash = hash_create("smgr relation table", 400,
 									   &ctl, HASH_ELEM | HASH_BLOBS);
-		dlist_init(&unowned_relns);
+		dlist_init(&unpinned_relns);
 	}
 
 	/* Look up or create an entry */
@@ -178,7 +222,6 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 	if (!found)
 	{
 		/* hash_search already filled in the lookup key */
-		reln->smgr_owner = NULL;
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
@@ -187,102 +230,61 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		/* implementation-specific initialization */
 		smgrsw[reln->smgr_which].smgr_open(reln);
 
-		/* it has no owner yet */
-		dlist_push_tail(&unowned_relns, &reln->node);
+		/* it is not pinned yet */
+		reln->pincount = 0;
+		dlist_push_tail(&unpinned_relns, &reln->node);
 	}
 
 	return reln;
 }
 
 /*
- * smgrsetowner() -- Establish a long-lived reference to an SMgrRelation object
- *
- * There can be only one owner at a time; this is sufficient since currently
- * the only such owners exist in the relcache.
+ * smgrpin() -- Prevent an SMgrRelation object from being destroyed at end of
+ *				of transaction
  */
 void
-smgrsetowner(SMgrRelation *owner, SMgrRelation reln)
+smgrpin(SMgrRelation reln)
 {
-	/* We don't support "disowning" an SMgrRelation here, use smgrclearowner */
-	Assert(owner != NULL);
-
-	/*
-	 * First, unhook any old owner.  (Normally there shouldn't be any, but it
-	 * seems possible that this can happen during swap_relation_files()
-	 * depending on the order of processing.  It's ok to close the old
-	 * relcache entry early in that case.)
-	 *
-	 * If there isn't an old owner, then the reln should be in the unowned
-	 * list, and we need to remove it.
-	 */
-	if (reln->smgr_owner)
-		*(reln->smgr_owner) = NULL;
-	else
+	if (reln->pincount == 0)
 		dlist_delete(&reln->node);
-
-	/* Now establish the ownership relationship. */
-	reln->smgr_owner = owner;
-	*owner = reln;
+	reln->pincount++;
 }
 
 /*
- * smgrclearowner() -- Remove long-lived reference to an SMgrRelation object
- *					   if one exists
+ * smgrunpin() -- Allow an SMgrRelation object to be desroyed at end of
+ *				  transaction
+ *
+ * The object remains valid, but if there are no other pins on it, it is moved
+ * to the unpinned list where it will be destroyed by AtEOXact_SMgr().
  */
 void
-smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
-{
-	/* Do nothing if the SMgrRelation object is not owned by the owner */
-	if (reln->smgr_owner != owner)
-		return;
-
-	/* unset the owner's reference */
-	*owner = NULL;
-
-	/* unset our reference to the owner */
-	reln->smgr_owner = NULL;
-
-	/* add to list of unowned relations */
-	dlist_push_tail(&unowned_relns, &reln->node);
-}
-
-/*
- * smgrexists() -- Does the underlying file for a fork exist?
- */
-bool
-smgrexists(SMgrRelation reln, ForkNumber forknum)
+smgrunpin(SMgrRelation reln)
 {
-	return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
+	Assert(reln->pincount > 0);
+	reln->pincount--;
+	if (reln->pincount == 0)
+		dlist_push_tail(&unpinned_relns, &reln->node);
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
-void
-smgrclose(SMgrRelation reln)
+static void
+smgrdestroy(SMgrRelation reln)
 {
-	SMgrRelation *owner;
 	ForkNumber	forknum;
 
+	Assert(reln->pincount == 0);
+
 	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 		smgrsw[reln->smgr_which].smgr_close(reln, forknum);
 
-	owner = reln->smgr_owner;
-
-	if (!owner)
-		dlist_delete(&reln->node);
+	dlist_delete(&reln->node);
 
 	if (hash_search(SMgrRelationHash,
 					&(reln->smgr_rlocator),
 					HASH_REMOVE, NULL) == NULL)
 		elog(ERROR, "SMgrRelation hashtable corrupted");
-
-	/*
-	 * Unhook the owner pointer, if any.  We do this last since in the remote
-	 * possibility of failure above, the SMgrRelation object will still exist.
-	 */
-	if (owner)
-		*owner = NULL;
 }
 
 /*
@@ -302,31 +304,51 @@ smgrrelease(SMgrRelation reln)
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
+ * smgrclose() -- Close an SMgrRelation object.
  *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * The SMgrRelation reference should not be used after this call.  However,
+ * because we don't keep track of the references returned by smgropen(), we
+ * don't know if there are other references still pointing to the same object,
+ * so we cannot remove the SMgrRelation object yet.  Therefore, this is just a
+ * synonym for smgrrelease() at the moment.
  */
 void
-smgrreleaseall(void)
+smgrclose(SMgrRelation reln)
 {
-	HASH_SEQ_STATUS status;
-	SMgrRelation reln;
+	smgrrelease(reln);
+}
 
-	/* Nothing to do if hashtable not set up */
-	if (SMgrRelationHash == NULL)
-		return;
+/*
+ * smgrdestroyall() -- Release resources used by all unpinned objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those pinned with smgrpin().
+ */
+void
+smgrdestroyall(void)
+{
+	dlist_mutable_iter iter;
 
-	hash_seq_init(&status, SMgrRelationHash);
+	/*
+	 * Zap all unpinned SMgrRelations.  We rely on smgrdestroy() to remove
+	 * each one from the list.
+	 */
+	dlist_foreach_modify(iter, &unpinned_relns)
+	{
+		SMgrRelation rel = dlist_container(SMgrRelationData, node,
+										   iter.cur);
 
-	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrdestroy(rel);
+	}
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrreleaseall() -- Release resources used by all objects.
+ *
+ * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
  */
 void
-smgrcloseall(void)
+smgrreleaseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -338,19 +360,21 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+	{
+		smgrrelease(reln);
+	}
 }
 
 /*
- * smgrcloserellocator() -- Close SMgrRelation object for given RelFileLocator,
- *							if one exists.
+ * smgrreleaserellocator() -- Release resources for given RelFileLocator,
+ *							if it's open.
  *
- * This has the same effects as smgrclose(smgropen(rlocator)), but it avoids
+ * This has the same effects as smgrrelease(smgropen(rlocator)), but it avoids
  * uselessly creating a hashtable entry only to drop it again when no
  * such entry exists already.
  */
 void
-smgrcloserellocator(RelFileLocatorBackend rlocator)
+smgrreleaserellocator(RelFileLocatorBackend rlocator)
 {
 	SMgrRelation reln;
 
@@ -362,7 +386,16 @@ smgrcloserellocator(RelFileLocatorBackend rlocator)
 									  &rlocator,
 									  HASH_FIND, NULL);
 	if (reln != NULL)
-		smgrclose(reln);
+		smgrrelease(reln);
+}
+
+/*
+ * smgrexists() -- Does the underlying file for a fork exist?
+ */
+bool
+smgrexists(SMgrRelation reln, ForkNumber forknum)
+{
+	return smgrsw[reln->smgr_which].smgr_exists(reln, forknum);
 }
 
 /*
@@ -729,7 +762,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -740,21 +774,7 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
 void
 AtEOXact_SMgr(void)
 {
-	dlist_mutable_iter iter;
-
-	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
-	 * one from the list.
-	 */
-	dlist_foreach_modify(iter, &unowned_relns)
-	{
-		SMgrRelation rel = dlist_container(SMgrRelationData, node,
-										   iter.cur);
-
-		Assert(rel->smgr_owner == NULL);
-
-		smgrclose(rel);
-	}
+	smgrdestroyall();
 }
 
 /*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index a041d7d6047..edaa56194ff 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -756,7 +756,7 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
 
 		rlocator.locator = msg->sm.rlocator;
 		rlocator.backend = (msg->sm.backend_hi << 16) | (int) msg->sm.backend_lo;
-		smgrcloserellocator(rlocator);
+		smgrreleaserellocator(rlocator);
 	}
 	else if (msg->id == SHAREDINVALRELMAP_ID)
 	{
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index b3faccbefe5..b4c7dacdc9e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3044,7 +3044,7 @@ RelationCacheInvalidate(bool debug_discard)
 	 * start to rebuild entries, since that may involve catalog fetches which
 	 * will re-open catalog files.
 	 */
-	smgrcloseall();
+	smgrdestroyall();
 
 	/* Phase 2: rebuild the items found to need rebuild in phase 1 */
 	foreach(l, rebuildFirstList)
@@ -3066,25 +3066,6 @@ RelationCacheInvalidate(bool debug_discard)
 			in_progress_list[i].invalidated = true;
 }
 
-/*
- * RelationCloseSmgrByOid - close a relcache entry's smgr link
- *
- * Needed in some cases where we are changing a relation's physical mapping.
- * The link will be automatically reopened on next use.
- */
-void
-RelationCloseSmgrByOid(Oid relationId)
-{
-	Relation	relation;
-
-	RelationIdCacheLookup(relationId, relation);
-
-	if (!PointerIsValid(relation))
-		return;					/* not in cache, nothing to do */
-
-	RelationCloseSmgr(relation);
-}
-
 static void
 RememberToFreeTupleDescAtEOX(TupleDesc td)
 {
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aabac..edccc4c89bb 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -21,29 +21,21 @@
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
- * by smgropen(), and destroyed by smgrclose().  Note that neither of these
- * operations imply I/O, they just create or destroy a hashtable entry.
- * (But smgrclose() may release associated resources, such as OS-level file
+ * by smgropen(), and destroyed by smgrdestroy().  Note that neither of these
+ * operations imply I/O, they just create or destroy a hashtable entry.  (But
+ * smgrdestroy() may release associated resources, such as OS-level file
  * descriptors.)
  *
- * An SMgrRelation may have an "owner", which is just a pointer to it from
- * somewhere else; smgr.c will clear this pointer if the SMgrRelation is
- * closed.  We use this to avoid dangling pointers from relcache to smgr
- * without having to make the smgr explicitly aware of relcache.  There
- * can't be more than one "owner" pointer per SMgrRelation, but that's
- * all we need.
- *
- * SMgrRelations that do not have an "owner" are considered to be transient,
- * and are deleted at end of transaction.
+ * An SMgrRelation may be "pinned", to prevent it from being destroyed while
+ * it's in use.  We use this to prevent pointers relcache to smgr from being
+ * invalidated.  SMgrRelations that are not pinned are deleted at end of
+ * transaction.
  */
 typedef struct SMgrRelationData
 {
 	/* rlocator is the hashtable lookup key, so it must be first! */
 	RelFileLocatorBackend smgr_rlocator;	/* relation physical identifier */
 
-	/* pointer to owning pointer, or NULL if none */
-	struct SMgrRelationData **smgr_owner;
-
 	/*
 	 * The following fields are reset to InvalidBlockNumber upon a cache flush
 	 * event, and hold the last known size for each fork.  This information is
@@ -68,7 +60,11 @@ typedef struct SMgrRelationData
 	int			md_num_open_segs[MAX_FORKNUM + 1];
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
-	/* if unowned, list link in list of all unowned SMgrRelations */
+	/*
+	 * Pinning support.  If unpinned (pincount == 0), 'node' is a list link in
+	 * list of all unpinned SMgrRelations.
+	 */
+	int			pincount;
 	dlist_node	node;
 } SMgrRelationData;
 
@@ -80,13 +76,13 @@ typedef SMgrRelationData *SMgrRelation;
 extern void smgrinit(void);
 extern SMgrRelation smgropen(RelFileLocator rlocator, BackendId backend);
 extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
-extern void smgrsetowner(SMgrRelation *owner, SMgrRelation reln);
-extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
+extern void smgrpin(SMgrRelation reln);
+extern void smgrunpin(SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
-extern void smgrcloseall(void);
-extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
+extern void smgrdestroyall(void);
 extern void smgrrelease(SMgrRelation reln);
 extern void smgrreleaseall(void);
+extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 0ad613c4b88..5ab7d54d897 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,18 +561,15 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
 {
 	if (unlikely(rel->rd_smgr == NULL))
-		smgrsetowner(&(rel->rd_smgr), smgropen(rel->rd_locator, rel->rd_backend));
+	{
+		rel->rd_smgr = smgropen(rel->rd_locator, rel->rd_backend);
+		smgrpin(rel->rd_smgr);
+	}
 	return rel->rd_smgr;
 }
 
@@ -584,10 +581,11 @@ static inline void
 RelationCloseSmgr(Relation relation)
 {
 	if (relation->rd_smgr != NULL)
+	{
+		smgrunpin(relation->rd_smgr);
 		smgrclose(relation->rd_smgr);
-
-	/* smgrclose should unhook from owner pointer */
-	Assert(relation->rd_smgr == NULL);
+		relation->rd_smgr = NULL;
+	}
 }
 #endif							/* !FRONTEND */
 
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index f04b2477e34..c1c31baa429 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -129,8 +129,6 @@ extern void RelationCacheInvalidateEntry(Oid relationId);
 
 extern void RelationCacheInvalidate(bool debug_discard);
 
-extern void RelationCloseSmgrByOid(Oid relationId);
-
 #ifdef USE_ASSERT_CHECKING
 extern void AssertPendingSyncs_RelationCache(void);
 #else
-- 
2.39.2

Reply via email to