From 7d3012cbb1aaddfa112966f641aa129e88a42600 Mon Sep 17 00:00:00 2001
From: Greg Burd <greg@burd.me>
Date: Tue, 12 Aug 2025 09:33:38 -0400
Subject: [PATCH v13 3/4] Remove the need for a buffer_strategy_lock

Combine nextVictimBuffer and completePasses into a single uint64 called
clockSweepCounter, eliminating the need for the buffer_strategy_lock
that synchronized their progression.  The counter is incremented
atomically by 1 at each tick.  The hand's location is the counter
modulo NBuffers, and the number of complete passes is the counter
divided by NBuffers.  Overflowing clockSweepCounter would require ~10
years of continuous operation at ~59 billion ticks per second, which we
consider unlikely enough to treat as impossible.
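
As a minimal illustration (a standalone sketch, not code from this patch;
"counter" and "nbuffers" below are hypothetical stand-ins for
clockSweepCounter and NBuffers), both values are derived from one
monotonically increasing 64-bit counter:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t counter = 0;            /* stands in for clockSweepCounter */
        const uint64_t nbuffers = 16384; /* stands in for NBuffers */

        /* simulate ticks; the real code uses an atomic fetch-add of 1 */
        for (int i = 0; i < 40000; i++)
            counter++;

        unsigned hand = (unsigned) (counter % nbuffers);   /* hand position */
        unsigned passes = (unsigned) (counter / nbuffers); /* complete passes */

        printf("hand=%u passes=%u\n", hand, passes);  /* hand=7232 passes=2 */
        return 0;
    }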
---
 src/backend/storage/buffer/README     |  70 +++++++------
 src/backend/storage/buffer/bufmgr.c   |   8 ++
 src/backend/storage/buffer/freelist.c | 135 ++++++++------------------
 src/include/storage/buf_internals.h   |   2 +-
 4 files changed, 86 insertions(+), 129 deletions(-)

diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 119f31b5d65..52d5b2c4069 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -127,11 +127,10 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
-exclusion for operations that select buffers for replacement.  A spinlock is
-used here rather than a lightweight lock for efficiency; no other locks of any
-sort should be acquired while buffer_strategy_lock is held.  This is essential
-to allow buffer replacement to happen in multiple backends with reasonable
+* Operations that select buffers for replacement do not require a lock;
+instead they use atomic operations to coordinate across backends when
+accessing members of the BufferStrategyControl data structure.  This allows
+buffer replacement to happen in multiple backends with reasonable
 concurrency.
 
 * Each buffer header contains a spinlock that must be taken when examining
@@ -158,30 +157,39 @@ unset by sleeping on the buffer's condition variable.
 Normal Buffer Replacement Strategy
 ----------------------------------
 
-To choose a victim buffer to recycle we use a simple clock-sweep algorithm.  It
-works like this:
+To choose a victim buffer to recycle we use a simple clock-sweep algorithm.
+Done carefully, this can be accomplished without taking a system-wide lock.
+The strategy works like this:
+
+The "clock hand" is a buffer index that conceptually moves circularly through
+all the available buffers in the range of 0 to NBuffers-1.  Each time the hand
+returns to 0 is a "complete pass" of the buffers managed by the clock.  The
+hand progresses one "tick" at a time around the clock identifying a potential
+"victim".  These two values, the hand's position and the number of complete
+passes, must be consistent across backends.
+
+In this implementation the hand's position on the clock is the value of
+clockSweepCounter modulo NBuffers.  clockSweepCounter, a uint64, is atomically
+incremented by 1 at each tick of the clock.  The number of complete passes is
+therefore clockSweepCounter divided by NBuffers.  This reduces the
+coordination required across backends to either an atomic read or an atomic
+fetch-add.
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned.  (This requires only the
 buffer header spinlock, which would have to be taken anyway to increment the
 buffer reference count, so it's nearly free.)
 
-The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
-through all the available buffers.  nextVictimBuffer is protected by the
-buffer_strategy_lock.
-
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain buffer_strategy_lock.
-
-2. Select the buffer pointed to by nextVictimBuffer, and circularly advance
-nextVictimBuffer for next time. Release buffer_strategy_lock.
+1. Move the clock hand ahead one tick: atomically fetch-and-add 1 to
+clockSweepCounter and select the buffer at its previous value modulo NBuffers.
 
-3. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero), reacquire
-buffer_strategy_lock, and return to step 3 to examine the next buffer.
+2. If the selected buffer is pinned or has a nonzero usage count, it cannot be
+used.  Decrement its usage count (if nonzero) and return to step 1 to examine
+the next buffer.
 
-4. Pin the selected buffer, and return.
+3. Pin the selected buffer, and return.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -237,19 +245,17 @@ Background Writer's Processing
 ------------------------------
 
 The background writer is designed to write out pages that are likely to be
-recycled soon, thereby offloading the writing work from active backends.
-To do this, it scans forward circularly from the current position of
-nextVictimBuffer (which it does not change!), looking for buffers that are
-dirty and not pinned nor marked with a positive usage count.  It pins,
-writes, and releases any such buffer.
-
-If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take buffer_strategy_lock in order to look
-for buffers to write; it needs only to spinlock each buffer header for long
-enough to check the dirtybit.  Even without that assumption, the writer
-only needs to take the lock long enough to read the variable value, not
-while scanning the buffers.  (This is a very substantial improvement in
-the contention cost of the writer compared to PG 8.0.)
+recycled soon, thereby offloading the writing work from active backends. To do
+this, it scans forward circularly from the current position of the clock-sweep
+hand (read atomically and not modified), looking for buffers that are dirty and
+not pinned nor marked with a positive usage count.  It pins, writes, and
+releases any such buffer.
+
+Because the clock-sweep hand's position is read with a single atomic read, the
+writer needs no strategy-wide lock to look for buffers to write; it only needs
+to spinlock each buffer header for long enough to check the dirtybit.  (This
+is a very substantial improvement in the contention cost of the writer
+compared to PG 8.0.)
 
 The background writer takes shared content lock on a buffer while writing it
 out (and anyone else who flushes buffer contents to disk must do so too).
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 719a5bb6f97..d0c14158115 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3666,6 +3666,14 @@ BgBufferSync(WritebackContext *wb_context)
 	{
 		int32		passes_delta = strategy_passes - prev_strategy_passes;
 
+		/*
+		 * It would take ~10 years of continuous operation at ~59 billion
+		 * clock ticks per second to overflow the uint64 clockSweepCounter.
+		 * We consider that impossible and record the assumption with this
+		 * assertion.
+		 */
+		Assert(prev_strategy_passes <= strategy_passes);
+
 		strategy_delta = strategy_buf_id - prev_strategy_buf_id;
 		strategy_delta += (long) passes_delta * NBuffers;
 
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..7d68f2227b3 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,21 +29,18 @@
  */
 typedef struct
 {
-	/* Spinlock: protects the values below */
-	slock_t		buffer_strategy_lock;
-
 	/*
-	 * clock-sweep hand: index of next buffer to consider grabbing. Note that
-	 * this isn't a concrete buffer - we only ever increase the value. So, to
-	 * get an actual buffer, it needs to be used modulo NBuffers.
+	 * The clock-sweep counter is atomically incremented by 1 at every tick.
+	 * Use the macro CLOCKSWEEP_HAND() to find the location of the hand on
+	 * the clock, and CLOCKSWEEP_PASSES() to calculate the number of complete
+	 * passes the hand has made around the clock.
 	 */
-	pg_atomic_uint32 nextVictimBuffer;
+	pg_atomic_uint64 clockSweepCounter;
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
 	 * overflow during a single bgwriter cycle.
 	 */
-	uint32		completePasses; /* Complete cycles of the clock-sweep */
 	pg_atomic_uint32 numBufferAllocs;	/* Buffers allocated since last reset */
 
 	/*
@@ -83,76 +80,47 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
 }			BufferAccessStrategyData;
 
-
 /* Prototypes for internal functions */
 static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
 									 uint32 *buf_state);
 static void AddBufferToRing(BufferAccessStrategy strategy,
 							BufferDesc *buf);
 
+/*
+ * The clock-sweep counter is a uint64, but the hand position (the counter
+ * modulo NBuffers) always fits in a uint32; enforce that uniformly here.
+ */
+#define CLOCKSWEEP_HAND(counter) ((uint32) ((counter) % NBuffers))
+
+/*
+ * The number of times the clock hand has made a complete pass around the clock
+ * visiting all the available buffers is the counter divided by NBuffers.
+ */
+#define CLOCKSWEEP_PASSES(counter) ((uint32) ((counter) / NBuffers))
+
 /*
  * ClockSweepTick - Helper routine for StrategyGetBuffer()
  *
  * Move the clock hand one buffer ahead of its current position and return the
- * id of the buffer now under the hand.
+ * index of the buffer previously under the hand.
  */
 static inline uint32
 ClockSweepTick(void)
 {
-	uint32		victim;
+	uint64		counter;
+	uint32		hand;
 
 	/*
 	 * Atomically move hand ahead one buffer - if there's several processes
 	 * doing this, this can lead to buffers being returned slightly out of
 	 * apparent order.
 	 */
-	victim =
-		pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
-
-	if (victim >= NBuffers)
-	{
-		uint32		originalVictim = victim;
+	counter = pg_atomic_fetch_add_u64(&StrategyControl->clockSweepCounter, 1);
 
-		/* always wrap what we look up in BufferDescriptors */
-		victim = victim % NBuffers;
-
-		/*
-		 * If we're the one that just caused a wraparound, force
-		 * completePasses to be incremented while holding the spinlock. We
-		 * need the spinlock so StrategySyncStart() can return a consistent
-		 * value consisting of nextVictimBuffer and completePasses.
-		 */
-		if (victim == 0)
-		{
-			uint32		expected;
-			uint32		wrapped;
-			bool		success = false;
-
-			expected = originalVictim + 1;
-
-			while (!success)
-			{
-				/*
-				 * Acquire the spinlock while increasing completePasses. That
-				 * allows other readers to read nextVictimBuffer and
-				 * completePasses in a consistent manner which is required for
-				 * StrategySyncStart().  In theory delaying the increment
-				 * could lead to an overflow of nextVictimBuffers, but that's
-				 * highly unlikely and wouldn't be particularly harmful.
-				 */
-				SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+	hand = CLOCKSWEEP_HAND(counter);
+	Assert(hand < NBuffers);
 
-				wrapped = expected % NBuffers;
-
-				success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
-														 &expected, wrapped);
-				if (success)
-					StrategyControl->completePasses++;
-				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-			}
-		}
-	}
-	return victim;
+	return hand;
 }
 
 /*
@@ -177,10 +145,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 
 	*from_ring = false;
 
-	/*
-	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need buffer_strategy_lock.
-	 */
+	/* If given a strategy object, see whether it can select a buffer */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy, buf_state);
@@ -275,37 +240,25 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
  * The result is the buffer index of the best buffer to sync first.
  * BgBufferSync() will proceed circularly around the buffer array from there.
  *
- * In addition, we return the completed-pass count (which is effectively
- * the higher-order bits of nextVictimBuffer) and the count of recent buffer
- * allocs if non-NULL pointers are passed.  The alloc count is reset after
- * being read.
+ * In addition, we return the completed-pass count and the count of recent
+ * buffer allocs if non-NULL pointers are passed.  The alloc count is reset
+ * after being read.
  */
-int
+uint32
 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
-	uint32		nextVictimBuffer;
-	int			result;
+	uint64		counter;
+	uint32		result;
 
-	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-	nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
-	result = nextVictimBuffer % NBuffers;
+	counter = pg_atomic_read_u64(&StrategyControl->clockSweepCounter);
+	result = CLOCKSWEEP_HAND(counter);
 
 	if (complete_passes)
-	{
-		*complete_passes = StrategyControl->completePasses;
-
-		/*
-		 * Additionally add the number of wraparounds that happened before
-		 * completePasses could be incremented. C.f. ClockSweepTick().
-		 */
-		*complete_passes += nextVictimBuffer / NBuffers;
-	}
+		*complete_passes = CLOCKSWEEP_PASSES(counter);
 
 	if (num_buf_alloc)
-	{
 		*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
-	}
-	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
 	return result;
 }
 
@@ -320,14 +273,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 void
 StrategyNotifyBgWriter(int bgwprocno)
 {
-	/*
-	 * We acquire buffer_strategy_lock just to ensure that the store appears
-	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
-	 * infrequently, so there's no performance penalty from being safe.
-	 */
-	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	StrategyControl->bgwprocno = bgwprocno;
-	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
 
 
@@ -392,13 +338,10 @@ StrategyInitialize(bool init)
 		 */
 		Assert(init);
 
-		SpinLockInit(&StrategyControl->buffer_strategy_lock);
-
-		/* Initialize the clock-sweep pointer */
-		pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
+		/* Initialize the combined clock-sweep hand / complete-passes counter */
+		pg_atomic_init_u64(&StrategyControl->clockSweepCounter, 0);
 
 		/* Clear statistics */
-		StrategyControl->completePasses = 0;
 		pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
 
 		/* No pending notification */
@@ -714,9 +657,9 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * When a nondefault strategy is used, the buffer manager calls this function
  * when it turns out that the buffer selected by StrategyGetBuffer needs to
  * be written out and doing so would require flushing WAL too.  This gives us
- * a chance to choose a different victim.
+ * a chance to choose a different buffer.
  *
- * Returns true if buffer manager should ask for a new victim, and false
+ * Returns true if buffer manager should ask for a new buffer, and false
  * if this buffer should be written and re-used.
  */
 bool
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9fcc94ef02d..b6ff361f2dd 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -439,7 +439,7 @@ extern void StrategyFreeBuffer(BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 								 BufferDesc *buf, bool from_ring);
 
-extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern uint32 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
 extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
-- 
2.49.0

