On Fri, Mar 5, 2021 at 12:12 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Fri, Feb 26, 2021 at 7:08 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > Back in 2016, Robert Haas proposed to replace I/O locks with condition
> > variables[1].  Condition variables went in and have found lots of
> > uses, but this patch to replace a bunch of LWLocks and some busy
> > looping did not.  Since then, it has been tested quite a lot as part
> > of the AIO project[2], which currently depends on it.  That's why I'm
> > interested in following up now.  I asked Robert if he planned to
> > re-propose it and he said I should go for it, so... here I go.
>
> I removed a redundant (Size) cast, fixed the wait event name and
> category (WAIT_EVENT_BUFFILE_XXX is for buffile.c stuff, not bufmgr.c
> stuff, and this is really an IPC wait, not an IO wait despite the
> name), updated documentation and pgindented.

More review and some proposed changes:

The old I/O lock array was the only user of struct
LWLockMinimallyPadded, added in commit 6150a1b08a9, and it seems kinda
strange to leave it in the tree with no user.  Of course it's remotely
possible there are extensions using it (know of any?).  In the
attached, I've ripped that out, along with the associated commentary, because it's
fun to delete dead code.  Objections?

Since the whole reason for that out-of-line array in the first place
was to keep BufferDesc inside one cache line, and since it is in fact
possible to put a new condition variable into BufferDesc without
exceeding 64 bytes on a 64-bit x86 box (rough arithmetic below), perhaps we should just do that
instead?  I haven't yet considered other architectures or potential
member orders. It's also possible that some other project already had
designs on those BufferDesc bytes.  This drops quite a few lines from
the tree, including the comment about how nice it'd be to be able to
put the lock in BufferDesc.
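
For concreteness, here's my back-of-the-envelope arithmetic for the
member order in the attached, assuming a 64-bit x86 build without
LOCK_DEBUG (1-byte slock_t, 4-byte ints); these numbers come from
reading the struct definitions, not from measuring offsets:

    /* Rough layout sketch, not measured -- 64-bit x86, no LOCK_DEBUG. */
    typedef struct BufferDesc
    {
        BufferTag   tag;              /* 20: RelFileNode + fork + block */
        int         buf_id;           /*  4 */
        pg_atomic_uint32 state;       /*  4 */
        ConditionVariable io_cv;      /* 12: slock_t (padded) + proclist_head */
        int         wait_backend_pid; /*  4 */
        int         freeNext;         /*  4 */
        LWLock      content_lock;     /* 16: tranche + state + waiters */
    } BufferDesc;                     /* = 64 bytes, exactly one cache line */

A platform with a wider slock_t or pg_atomic_uint32 could push the
total over 64, which is the part I haven't checked yet.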

I wonder if we should try to preserve the user experience a little
harder, for the benefit of people who have monitoring queries that look
for this condition.  Instead of inventing a new wait_event value, let's
just keep showing "BufferIO" in that column, so that (for example) a
pg_stat_activity query filtering on wait_event = 'BufferIO' keeps
finding these waits.  In other words, the change is that
wait_event_type changes from "LWLock" to "IPC", which is a pretty good
summary of this patch.  Done in the attached.  Does this make sense?

Please see attached, which gets us to:  8 files changed, 30
insertions(+), 113 deletions(-)

PS: An idea I thought about while studying this patch is that we
should be able to make signaling an empty condition variable
free/cheap (no spinlock acquisition or other extra memory
barrier-containing operation); I'll write about that separately.
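
To sketch the idea with the real proclist primitives (the wrapper name
is made up, and the hard part -- proving the unlocked emptiness read
can't miss a waiter that must be woken -- is exactly what needs writing
up):

    /*
     * Hypothetical fast path: skip the spinlock when the wait list is
     * empty.  Safety depends on the ordering between the caller's
     * prior write of the awaited condition and the waiter's insertion
     * into cv->wakeup under the spinlock; that analysis is the open
     * part, not settled by this sketch.
     */
    static inline void
    ConditionVariableBroadcastIfWaiters(ConditionVariable *cv)
    {
        if (proclist_is_empty(&cv->wakeup))    /* unlocked read */
            return;
        ConditionVariableBroadcast(cv);
    }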
From b6c4e8919dcca0bbf97ccd9489d6876e9890e4b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 26 Feb 2021 18:01:28 +1300
Subject: [PATCH v3] Replace buffer I/O locks with condition variables.

1.  Backends waiting for buffer I/O are now interruptible.

2.  If something goes wrong in a backend that is currently performing
I/O, other backends no longer wake up until that backend reaches
AbortBufferIO() and broadcasts on the CV.  Previously, any waiters would
wake up (because the I/O lock was automatically released) and then
busy-loop until AbortBufferIO() cleared BM_IO_IN_PROGRESS.

3.  Remove LWLockMinimallyPadded, as it had no other user.

Author: Robert Haas <robertmh...@gmail.com>
Reviewed-by: Thomas Munro <thomas.mu...@gmail.com>
Reviewed-by: Tom Lane <t...@sss.pgh.pa.us> (earlier version, 2016)
Discussion: https://postgr.es/m/CA%2BhUKGJ8nBFrjLuCTuqKN0pd2PQOwj9b_jnsiGFFMDvUxahj_A%40mail.gmail.com
Discussion: https://postgr.es/m/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=c56...@mail.gmail.com
---
 doc/src/sgml/monitoring.sgml          |  8 ++--
 src/backend/postmaster/pgstat.c       |  3 ++
 src/backend/storage/buffer/buf_init.c | 28 ++----------
 src/backend/storage/buffer/bufmgr.c   | 62 +++++++--------------------
 src/backend/storage/lmgr/lwlock.c     |  5 +--
 src/include/pgstat.h                  |  1 +
 src/include/storage/buf_internals.h   |  7 +--
 src/include/storage/lwlock.h          | 29 -------------
 8 files changed, 30 insertions(+), 113 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..ef3177f0e0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1581,6 +1581,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Waiting for the page number needed to continue a parallel B-tree
        scan to become available.</entry>
      </row>
+     <row>
+      <entry><literal>BufferIO</literal></entry>
+      <entry>Waiting for buffer I/O to complete.</entry>
+     </row>
      <row>
       <entry><literal>CheckpointDone</literal></entry>
       <entry>Waiting for a checkpoint to complete.</entry>
@@ -1871,10 +1875,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>BufferContent</literal></entry>
       <entry>Waiting to access a data page in memory.</entry>
      </row>
-     <row>
-      <entry><literal>BufferIO</literal></entry>
-      <entry>Waiting for I/O on a data page.</entry>
-     </row>
      <row>
       <entry><literal>BufferMapping</literal></entry>
       <entry>Waiting to associate a data block with a buffer in the buffer
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..f050a1ddb1 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4011,6 +4011,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_BTREE_PAGE:
 			event_name = "BtreePage";
 			break;
+		case WAIT_EVENT_BUFFER_IO:
+			event_name = "BufferIO";
+			break;
 		case WAIT_EVENT_CHECKPOINT_DONE:
 			event_name = "CheckpointDone";
 			break;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e9e4f35bb5..f815462ec0 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -19,7 +19,6 @@
 
 BufferDescPadded *BufferDescriptors;
 char	   *BufferBlocks;
-LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
@@ -68,7 +67,6 @@ InitBufferPool(void)
 {
 	bool		foundBufs,
 				foundDescs,
-				foundIOLocks,
 				foundBufCkpt;
 
 	/* Align descriptors to a cacheline boundary. */
@@ -81,12 +79,6 @@ InitBufferPool(void)
 		ShmemInitStruct("Buffer Blocks",
 						NBuffers * (Size) BLCKSZ, &foundBufs);
 
-	/* Align lwlocks to cacheline boundary */
-	BufferIOLWLockArray = (LWLockMinimallyPadded *)
-		ShmemInitStruct("Buffer IO Locks",
-						NBuffers * (Size) sizeof(LWLockMinimallyPadded),
-						&foundIOLocks);
-
 	/*
 	 * The array used to sort to-be-checkpointed buffer ids is located in
 	 * shared memory, to avoid having to allocate significant amounts of
@@ -98,10 +90,10 @@ InitBufferPool(void)
 		ShmemInitStruct("Checkpoint BufferIds",
 						NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
 
-	if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt)
+	if (foundDescs || foundBufs || foundBufCkpt)
 	{
 		/* should find all of these, or none of them */
-		Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt);
+		Assert(foundDescs && foundBufs && foundBufCkpt);
 		/* note: this path is only taken in EXEC_BACKEND case */
 	}
 	else
@@ -131,8 +123,7 @@ InitBufferPool(void)
 			LWLockInitialize(BufferDescriptorGetContentLock(buf),
 							 LWTRANCHE_BUFFER_CONTENT);
 
-			LWLockInitialize(BufferDescriptorGetIOLock(buf),
-							 LWTRANCHE_BUFFER_IO);
+			ConditionVariableInit(&buf->io_cv);
 		}
 
 		/* Correct last entry of linked list */
@@ -169,19 +160,6 @@ BufferShmemSize(void)
 	/* size of stuff controlled by freelist.c */
 	size = add_size(size, StrategyShmemSize());
 
-	/*
-	 * It would be nice to include the I/O locks in the BufferDesc, but that
-	 * would increase the size of a BufferDesc to more than one cache line,
-	 * and benchmarking has shown that keeping every BufferDesc aligned on a
-	 * cache line boundary is important for performance.  So, instead, the
-	 * array of I/O locks is allocated in a separate tranche.  Because those
-	 * locks are not highly contended, we lay out the array with minimal
-	 * padding.
-	 */
-	size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
-	/* to allow aligning the above */
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-
 	/* size of checkpoint sort array in bufmgr.c */
 	size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..cb533f66cd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1352,9 +1352,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to get the io_in_progress
-	 * lock.  If StartBufferIO returns false, then someone else managed to
-	 * read it before we did, so there's nothing left for BufferAlloc() to do.
+	 * Buffer contents are currently invalid.  Try to obtain the right to
+	 * start I/O.  If StartBufferIO returns false, then someone else managed
+	 * to read it before we did, so there's nothing left for BufferAlloc() to
+	 * do.
 	 */
 	if (StartBufferIO(buf, true))
 		*foundPtr = false;
@@ -1777,9 +1778,8 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
 		 */
 		VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);
 
-		/* I'd better not still hold any locks on the buffer */
+		/* I'd better not still hold the buffer content lock */
 		Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
-		Assert(!LWLockHeldByMe(BufferDescriptorGetIOLock(buf)));
 
 		/*
 		 * Decrement the shared reference count.
@@ -2742,9 +2742,9 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 	uint32		buf_state;
 
 	/*
-	 * Acquire the buffer's io_in_progress lock.  If StartBufferIO returns
-	 * false, then someone else flushed the buffer before we could, so we need
-	 * not do anything.
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
 	 */
 	if (!StartBufferIO(buf, false))
 		return;
@@ -2800,7 +2800,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 	/*
 	 * Now it's safe to write buffer to disk. Note that no one else should
 	 * have been able to write it while we were busy with log flushing because
-	 * we have the io_in_progress lock.
+	 * only one process at a time can set the BM_IO_IN_PROGRESS bit.
 	 */
 	bufBlock = BufHdrGetBlock(buf);
 
@@ -2835,7 +2835,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
-	 * end the io_in_progress state.
+	 * end the BM_IO_IN_PROGRESS state.
 	 */
 	TerminateBufferIO(buf, true, 0);
 
@@ -4271,7 +4271,7 @@ IsBufferCleanupOK(Buffer buffer)
  *	Functions for buffer I/O handling
  *
  *	Note: We assume that nested buffer I/O never occurs.
- *	i.e at most one io_in_progress lock is held per proc.
+ *	i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
  *
  *	Also note that these are used only for shared buffers, not local ones.
  */
@@ -4282,13 +4282,7 @@ IsBufferCleanupOK(Buffer buffer)
 static void
 WaitIO(BufferDesc *buf)
 {
-	/*
-	 * Changed to wait until there's no IO - Inoue 01/13/2000
-	 *
-	 * Note this is *necessary* because an error abort in the process doing
-	 * I/O could release the io_in_progress_lock prematurely. See
-	 * AbortBufferIO.
-	 */
+	ConditionVariablePrepareToSleep(&buf->io_cv);
 	for (;;)
 	{
 		uint32		buf_state;
@@ -4303,9 +4297,9 @@ WaitIO(BufferDesc *buf)
 
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
-		LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_SHARED);
-		LWLockRelease(BufferDescriptorGetIOLock(buf));
+		ConditionVariableSleep(&buf->io_cv, WAIT_EVENT_BUFFER_IO);
 	}
+	ConditionVariableCancelSleep();
 }
 
 /*
@@ -4317,7 +4311,7 @@ WaitIO(BufferDesc *buf)
  * In some scenarios there are race conditions in which multiple backends
  * could attempt the same I/O operation concurrently.  If someone else
  * has already started I/O on this buffer then we will block on the
- * io_in_progress lock until he's done.
+ * I/O condition variable until he's done.
  *
  * Input operations are only attempted on buffers that are not BM_VALID,
  * and output operations only on buffers that are BM_VALID and BM_DIRTY,
@@ -4335,25 +4329,11 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 
 	for (;;)
 	{
-		/*
-		 * Grab the io_in_progress lock so that other processes can wait for
-		 * me to finish the I/O.
-		 */
-		LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
-
 		buf_state = LockBufHdr(buf);
 
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
-
-		/*
-		 * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
-		 * lock isn't held is if the process doing the I/O is recovering from
-		 * an error (see AbortBufferIO).  If that's the case, we must wait for
-		 * him to get unwedged.
-		 */
 		UnlockBufHdr(buf, buf_state);
-		LWLockRelease(BufferDescriptorGetIOLock(buf));
 		WaitIO(buf);
 	}
 
@@ -4363,7 +4343,6 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 	{
 		/* someone else already did the I/O */
 		UnlockBufHdr(buf, buf_state);
-		LWLockRelease(BufferDescriptorGetIOLock(buf));
 		return false;
 	}
 
@@ -4381,7 +4360,6 @@ StartBufferIO(BufferDesc *buf, bool forInput)
  *	(Assumptions)
  *	My process is executing IO for the buffer
  *	BM_IO_IN_PROGRESS bit is set for the buffer
- *	We hold the buffer's io_in_progress lock
  *	The buffer is Pinned
  *
  * If clear_dirty is true and BM_JUST_DIRTIED is not set, we clear the
@@ -4413,7 +4391,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
 
 	InProgressBuf = NULL;
 
-	LWLockRelease(BufferDescriptorGetIOLock(buf));
+	ConditionVariableBroadcast(&buf->io_cv);
 }
 
 /*
@@ -4434,14 +4412,6 @@ AbortBufferIO(void)
 	{
 		uint32		buf_state;
 
-		/*
-		 * Since LWLockReleaseAll has already been called, we're not holding
-		 * the buffer's io_in_progress_lock. We have to re-acquire it so that
-		 * we can use TerminateBufferIO. Anyone who's executing WaitIO on the
-		 * buffer will be in a busy spin until we succeed in doing this.
-		 */
-		LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
-
 		buf_state = LockBufHdr(buf);
 		Assert(buf_state & BM_IO_IN_PROGRESS);
 		if (IsForInput)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8cb6a6f042..adf19aba75 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -146,8 +146,6 @@ static const char *const BuiltinTrancheNames[] = {
 	"WALInsert",
 	/* LWTRANCHE_BUFFER_CONTENT: */
 	"BufferContent",
-	/* LWTRANCHE_BUFFER_IO: */
-	"BufferIO",
 	/* LWTRANCHE_REPLICATION_ORIGIN_STATE: */
 	"ReplicationOriginState",
 	/* LWTRANCHE_REPLICATION_SLOT_IO: */
@@ -469,8 +467,7 @@ CreateLWLocks(void)
 	StaticAssertStmt(LW_VAL_EXCLUSIVE > (uint32) MAX_BACKENDS,
 					 "MAX_BACKENDS too big for lwlock.c");
 
-	StaticAssertStmt(sizeof(LWLock) <= LWLOCK_MINIMAL_SIZE &&
-					 sizeof(LWLock) <= LWLOCK_PADDED_SIZE,
+	StaticAssertStmt(sizeof(LWLock) <= LWLOCK_PADDED_SIZE,
 					 "Miscalculated LWLock padding");
 
 	if (!IsUnderPostmaster)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..1176ed60a5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -961,6 +961,7 @@ typedef enum
 	WAIT_EVENT_BGWORKER_SHUTDOWN,
 	WAIT_EVENT_BGWORKER_STARTUP,
 	WAIT_EVENT_BTREE_PAGE,
+	WAIT_EVENT_BUFFER_IO,
 	WAIT_EVENT_CHECKPOINT_DONE,
 	WAIT_EVENT_CHECKPOINT_START,
 	WAIT_EVENT_EXECUTE_GATHER,
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f6b5782965..d8a5da5ad9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -18,6 +18,7 @@
 #include "port/atomics.h"
 #include "storage/buf.h"
 #include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
 #include "storage/latch.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
@@ -181,10 +182,10 @@ typedef struct BufferDesc
 
 	/* state of the tag, containing flags, refcount and usagecount */
 	pg_atomic_uint32 state;
+	ConditionVariable io_cv;	/* fires when BM_IO_IN_PROGRESS is cleared */
 
 	int			wait_backend_pid;	/* backend PID of pin-count waiter */
 	int			freeNext;		/* link in freelist chain */
-
 	LWLock		content_lock;	/* to lock access to buffer contents */
 } BufferDesc;
 
@@ -221,13 +222,9 @@ typedef union BufferDescPadded
 
 #define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
 
-#define BufferDescriptorGetIOLock(bdesc) \
-	(&(BufferIOLWLockArray[(bdesc)->buf_id]).lock)
 #define BufferDescriptorGetContentLock(bdesc) \
 	((LWLock*) (&(bdesc)->content_lock))
 
-extern PGDLLIMPORT LWLockMinimallyPadded *BufferIOLWLockArray;
-
 /*
  * The freeNext field is either the index of the next freelist entry,
  * or one of these special values:
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cbf2510fbf..a8f052e484 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -48,29 +48,8 @@ typedef struct LWLock
  * even more padding so that each LWLock takes up an entire cache line; this is
  * useful, for example, in the main LWLock array, where the overall number of
  * locks is small but some are heavily contended.
- *
- * When allocating a tranche that contains data other than LWLocks, it is
- * probably best to include a bare LWLock and then pad the resulting structure
- * as necessary for performance.  For an array that contains only LWLocks,
- * LWLockMinimallyPadded can be used for cases where we just want to ensure
- * that we don't cross cache line boundaries within a single lock, while
- * LWLockPadded can be used for cases where we want each lock to be an entire
- * cache line.
- *
- * An LWLockMinimallyPadded might contain more than the absolute minimum amount
- * of padding required to keep a lock from crossing a cache line boundary,
- * because an unpadded LWLock will normally fit into 16 bytes.  We ignore that
- * possibility when determining the minimal amount of padding.  Older releases
- * had larger LWLocks, so 32 really was the minimum, and packing them in
- * tighter might hurt performance.
- *
- * LWLOCK_MINIMAL_SIZE should be 32 on basically all common platforms, but
- * because pg_atomic_uint32 is more than 4 bytes on some obscure platforms, we
- * allow for the possibility that it might be 64.  Even on those platforms,
- * we probably won't exceed 32 bytes unless LOCK_DEBUG is defined.
  */
 #define LWLOCK_PADDED_SIZE	PG_CACHE_LINE_SIZE
-#define LWLOCK_MINIMAL_SIZE (sizeof(LWLock) <= 32 ? 32 : 64)
 
 /* LWLock, padded to a full cache line size */
 typedef union LWLockPadded
@@ -79,13 +58,6 @@ typedef union LWLockPadded
 	char		pad[LWLOCK_PADDED_SIZE];
 } LWLockPadded;
 
-/* LWLock, minimally padded */
-typedef union LWLockMinimallyPadded
-{
-	LWLock		lock;
-	char		pad[LWLOCK_MINIMAL_SIZE];
-} LWLockMinimallyPadded;
-
 extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 
 /* struct for storing named tranche information */
@@ -202,7 +174,6 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SERIAL_BUFFER,
 	LWTRANCHE_WAL_INSERT,
 	LWTRANCHE_BUFFER_CONTENT,
-	LWTRANCHE_BUFFER_IO,
 	LWTRANCHE_REPLICATION_ORIGIN_STATE,
 	LWTRANCHE_REPLICATION_SLOT_IO,
 	LWTRANCHE_LOCK_FASTPATH,
-- 
2.30.0
