On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <and...@anarazel.de> wrote:
> On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
> > On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <and...@anarazel.de> wrote:
> > > I was working on expanding tests for AIO and as part of that wrote
> > > a test for temp tables -- our coverage is fairly awful, there were
> > > many times during AIO development where I knew I had trivially
> > > reachable temp table specific bugs but all tests passed.
> > >
> > > The test for that does trigger the problem described above and is
> > > fixed by the patches in this thread (which I included in the other
> > > thread):

Here is a subset of those patches again:

1.  Per-backend buffer limit, take III.  Now the check is done in
read_stream_start_pending_read(), so the time of check is the time of
use (TOC == TOU).

Annoyingly, test cases like the one below still fail, despite
following the rules.  The other streams eat all the buffers, and then
one stream gets an allowance of zero but uses its right to take one
pin anyway to make progress, and no unpinned buffer is left.  I wonder
if we should use temp_buffers - 100 instead?  We could still leave the
minimum GUC value at 100, so you'd have an easy way to test with 0, 1,
... additional buffers.
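
Something like this, say (hypothetical, untested, and not in the
attached patches; just to illustrate the temp_buffers - 100 idea):

/* Hypothetical variant of patch 1's GetSoftLocalPinLimit(). */
uint32
GetSoftLocalPinLimit(void)
{
	/*
	 * Reserve the GUC minimum (100), so that the "take one pin anyway"
	 * case can still find a free buffer, and temp_buffers = 100 + N
	 * gives read streams exactly N optional buffers for look-ahead.
	 */
	return num_temp_buffers > 100 ? num_temp_buffers - 100 : 0;
}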

2.  It shouldn't give up issuing random advice immediately after a
jump, or it could stall on (say) the second 128kB of a 256kB
sequential chunk (i.e. the strace you showed on the BHS thread).  It
only makes sense to assume kernel readahead takes over once you've
actually *read* sequentially.  In practice this makes it a lot more
aggressive about advice (like the BHS code in master): it only gives
up if the whole look-ahead window is sequential.
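
To make that concrete, here is roughly what patch 3 does after a
random jump to block 1000, with io_combine_limit = 16 blocks (128kB):

  start 1000..1015:    random jump, advice issued, seq_start = 1000
  start 1016..1031:    sequential, but no sequential pread() has run
                       yet, so advice is issued again
  wait for 1000..1015: pread() runs and catches up with seq_start, so
                       advice stops
  start 1032..1047:    sequential, no advice; kernel readahead can see
                       the sequential pread() calls now

Previously the 1016..1031 read got neither advice nor kernel
readahead, hence the stall.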

3.  Change the distance algorithm to care only about hits and misses,
not sequential heuristics.  It made at least some sense before, but it
doesn't make sense for AIO, and even in synchronous mode it means that
you hit random jumps with insufficient look-ahead, so I don't think we
should keep it.
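
For example, with io_combine_limit = 16 and effective_io_concurrency =
16, a long sequential stretch would hold the distance down near 16,
and then a burst of random 128kB chunks would need 16 * 16 = 256
blocks of look-ahead to keep 16 advised I/Os in flight: you'd start
with enough window for just one, and stall while doubling your way
back up.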

I also realised that the sequential heuristics are confused by that
hidden trailing block thing, so in contrived pattern testing a
hit-miss-hit-miss... pattern would be considered sequential.  Even if
you fix that (the forwarding patches above fix it), an exact
hit-miss-hit-miss pattern still gets stuck between distances 1 and 2
(double, decrement, double, decrement, ...).  It might be worth
waiting a bit longer before decrementing, IDK.
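
Spelled out, with one-block reads under the new distance algorithm:

  distance = 1
  hit:  no I/O needed, distance stays at 1
  miss: I/O, distance doubles to 2
  hit:  distance decays to 1
  miss: distance doubles to 2
  ...

so it never ramps up far enough to hide the misses.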

I'll rebase the others and post soon.


set io_combine_limit = 32;
set temp_buffers = 100;

create temp table t1 as select generate_series(1, 10000);
create temp table t2 as select generate_series(1, 10000);
create temp table t3 as select generate_series(1, 10000);
create temp table t4 as select generate_series(1, 10000);
create temp table t5 as select generate_series(1, 10000);

do
$$
declare
  c1 cursor for select * from t1;
  c2 cursor for select * from t2;
  c3 cursor for select * from t3;
  c4 cursor for select * from t4;
  c5 cursor for select * from t5;
  x record;
begin
  open c1;
  open c2;
  open c3;
  open c4;
  open c5;
  loop
    fetch next from c1 into x;
    exit when not found;
    fetch next from c2 into x;
    exit when not found;
    fetch next from c3 into x;
    exit when not found;
    fetch next from c4 into x;
    exit when not found;
    fetch next from c5 into x;
    exit when not found;
  end loop;
end;
$$;
From dd6b334cb744e71fc042057624e4953e3c039629 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmu...@postgresql.org>
Date: Thu, 27 Feb 2025 21:03:39 +1300
Subject: [PATCH v2 1/4] Improve buffer manager API for backend pin limits.

Previously the support functions assumed that the caller needed one pin
to make progress, and could optionally use some more.  Add a couple more
functions for callers that want to know:

* what the maximum possible number could be irrespective of currently
  held pins, for space planning purposes, called the "soft pin limit"

* how many additional pins they could acquire right now, without the
  special case allowing one pin, for users that already hold pins and
  could make progress even if zero extra pins are available

These APIs are better suited to read_stream.c, which will be improved in
a follow-up patch.  Also compute MaxProportionalPins up front, to avoid
performing division whenever we check the balance.

Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 83 +++++++++++++++++++--------
 src/backend/storage/buffer/localbuf.c | 16 ++++++
 src/include/storage/bufmgr.h          |  4 ++
 3 files changed, 78 insertions(+), 25 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7915ed624c1..3f44681b1fe 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -211,6 +211,8 @@ static int32 PrivateRefCountOverflowed = 0;
 static uint32 PrivateRefCountClock = 0;
 static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
 
+static uint32 MaxProportionalPins;
+
 static void ReservePrivateRefCountEntry(void);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
 static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
@@ -2097,43 +2099,65 @@ again:
 	return buf;
 }
 
+/*
+ * Return the maximum number of buffers that this backend should try to pin at
+ * once, to avoid pinning more than its fair share.  This is the highest value
+ * that GetAdditionalPinLimit() and LimitAdditionalPins() could ever return.
+ *
+ * It's called a soft limit because nothing stops a backend from trying to
+ * acquire more pins than this if it needs them to make progress, but code that
+ * wants optional extra buffers for optimizations should respect this
+ * per-backend limit.
+ */
+uint32
+GetSoftPinLimit(void)
+{
+	return MaxProportionalPins;
+}
+
+/*
+ * Return the maximum number of additional buffers that this backend should
+ * pin if it wants to stay under the per-backend soft limit, considering the
+ * number of buffers it has already pinned.
+ */
+uint32
+GetAdditionalPinLimit(void)
+{
+	uint32		estimated_pins_held;
+
+	/*
+	 * We get the number of "overflowed" pins for free, but don't know the
+	 * number of pins in PrivateRefCountArray.  The cost of calculating that
+	 * exactly doesn't seem worth it, so just assume the max.
+	 */
+	estimated_pins_held = PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+	/* Is this backend already holding more than its fair share? */
+	if (estimated_pins_held > MaxProportionalPins)
+		return 0;
+
+	return MaxProportionalPins - estimated_pins_held;
+}
+
 /*
  * Limit the number of pins a batch operation may additionally acquire, to
  * avoid running out of pinnable buffers.
  *
- * One additional pin is always allowed, as otherwise the operation likely
- * cannot be performed at all.
- *
- * The number of allowed pins for a backend is computed based on
- * shared_buffers and the maximum number of connections possible. That's very
- * pessimistic, but outside of toy-sized shared_buffers it should allow
- * sufficient pins.
+ * One additional pin is always allowed, on the assumption that the operation
+ * requires at least one to make progress.
  */
 void
 LimitAdditionalPins(uint32 *additional_pins)
 {
-	uint32		max_backends;
-	int			max_proportional_pins;
+	uint32		limit;
 
 	if (*additional_pins <= 1)
 		return;
 
-	max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
-	max_proportional_pins = NBuffers / max_backends;
-
-	/*
-	 * Subtract the approximate number of buffers already pinned by this
-	 * backend. We get the number of "overflowed" pins for free, but don't
-	 * know the number of pins in PrivateRefCountArray. The cost of
-	 * calculating that exactly doesn't seem worth it, so just assume the max.
-	 */
-	max_proportional_pins -= PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
-
-	if (max_proportional_pins <= 0)
-		max_proportional_pins = 1;
-
-	if (*additional_pins > max_proportional_pins)
-		*additional_pins = max_proportional_pins;
+	limit = GetAdditionalPinLimit();
+	limit = Max(limit, 1);
+	if (limit < *additional_pins)
+		*additional_pins = limit;
 }
 
 /*
@@ -3575,6 +3599,15 @@ InitBufferManagerAccess(void)
 {
 	HASHCTL		hash_ctl;
 
+	/*
+	 * The soft limit on the number of pins each backend should respect, based
+	 * on shared_buffers and the maximum number of connections possible.
+	 * That's very pessimistic, but outside toy-sized shared_buffers it should
+	 * allow plenty of pins.  LimitAdditionalPins() or GetAdditionalPinLimit()
+	 * can be used to check the remaining balance.
+	 */
+	MaxProportionalPins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);
+
 	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));
 
 	hash_ctl.keysize = sizeof(int32);
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 80b83444eb2..5378ba84316 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -286,6 +286,22 @@ GetLocalVictimBuffer(void)
 	return BufferDescriptorGetBuffer(bufHdr);
 }
 
+/* see GetSoftPinLimit() */
+uint32
+GetSoftLocalPinLimit(void)
+{
+	/* Every backend has its own temporary buffers, and can pin them all. */
+	return num_temp_buffers;
+}
+
+/* see GetAdditionalPinLimit() */
+uint32
+GetAdditionalLocalPinLimit(void)
+{
+	Assert(NLocalPinnedBuffers <= num_temp_buffers);
+	return num_temp_buffers - NLocalPinnedBuffers;
+}
+
 /* see LimitAdditionalPins() */
 void
 LimitAdditionalLocalPins(uint32 *additional_pins)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..597ecb97897 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -290,6 +290,10 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern uint32 GetSoftPinLimit(void);
+extern uint32 GetSoftLocalPinLimit(void);
+extern uint32 GetAdditionalPinLimit(void);
+extern uint32 GetAdditionalLocalPinLimit(void);
 extern void LimitAdditionalPins(uint32 *additional_pins);
 extern void LimitAdditionalLocalPins(uint32 *additional_pins);
 
-- 
2.39.5

From 812bc196c977fb3610d3ea8320988d6bd4b00f29 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmu...@postgresql.org>
Date: Thu, 27 Feb 2025 21:42:05 +1300
Subject: [PATCH v2 2/4] Respect pin limits accurately in read_stream.c.

To avoid pinning too much of the buffer pool at once, we previously used
LimitAdditionalPins().  The coding was naive, and only considered the
available buffers at stream construction time.

This commit checks at the time of use with new buffer manager APIs.  The
result might change dynamically due to pins acquired outside this stream
by the same backend.  No extra CPU cycles are added to the all-buffered
fast-path code, but the I/O-starting path now considers the up-to-date
remaining buffer limit when making look-ahead decisions.

In practice it was very difficult to exceed the limit in v17, so no
back-patch, but changes due to land soon will make it easy to exceed.

Per code review from Andres, in the course of testing his AIO patches.

Reported-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 108 ++++++++++++++++++++++----
 1 file changed, 94 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 04bdb5e6d4b..6c2b4ec011d 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -115,6 +115,7 @@ struct ReadStream
 	int16		pinned_buffers;
 	int16		distance;
 	bool		advice_enabled;
+	bool		temporary;
 
 	/*
 	 * One-block buffer to support 'ungetting' a block number, to resolve flow
@@ -224,7 +225,17 @@ read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
 	stream->buffered_blocknum = blocknum;
 }
 
-static void
+/*
+ * Start as much of the current pending read as we can.  If we have to split it
+ * because of the per-backend buffer limit, or the buffer manager decides to
+ * split it, then the pending read is adjusted to hold the remaining portion.
+ *
+ * We can always start a read of at least size one if we have no progress yet.
+ * Otherwise it's possible that we can't start a read at all because of a lack
+ * of buffers, and then false is returned.  Buffer shortages also reduce the
+ * distance to a level that prevents look-ahead until buffers are released.
+ */
+static bool
 read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 {
 	bool		need_wait;
@@ -233,12 +244,13 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 	int16		io_index;
 	int16		overflow;
 	int16		buffer_index;
+	int16		buffer_limit;
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
 	Assert(stream->pending_read_nblocks <= io_combine_limit);
 
-	/* We had better not exceed the pin limit by starting this read. */
+	/* We had better not exceed the per-stream buffer limit with this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
 		   stream->max_pinned_buffers);
 
@@ -259,10 +271,39 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 	else
 		flags = 0;
 
-	/* We say how many blocks we want to read, but may be smaller on return. */
+	/* Compute the remaining portion of the per-backend buffer limit. */
+	if (stream->temporary)
+		buffer_limit = Min(GetAdditionalLocalPinLimit(), PG_INT16_MAX);
+	else
+		buffer_limit = Min(GetAdditionalPinLimit(), PG_INT16_MAX);
+	if (buffer_limit == 0 && stream->pinned_buffers == 0)
+		buffer_limit = 1;		/* guarantee progress */
+
+	/* Does the per-backend buffer limit affect this read? */
+	nblocks = stream->pending_read_nblocks;
+	if (buffer_limit < nblocks)
+	{
+		int16		new_distance;
+
+		/* Shrink distance: no more look-ahead until buffers are released. */
+		new_distance = stream->pinned_buffers + buffer_limit;
+		if (stream->distance > new_distance)
+			stream->distance = new_distance;
+
+		/* If we've already made progress, just give up and wait for buffers. */
+		if (stream->pinned_buffers > 0)
+			return false;
+
+		/* A short read is required to make progress. */
+		nblocks = buffer_limit;
+	}
+
+	/*
+	 * We say how many blocks we want to read, but it may be smaller on return
+	 * if the buffer manager decides it needs a short read at its level.
+	 */
 	buffer_index = stream->next_buffer_index;
 	io_index = stream->next_io_index;
-	nblocks = stream->pending_read_nblocks;
 	need_wait = StartReadBuffers(&stream->ios[io_index].op,
 								 &stream->buffers[buffer_index],
 								 stream->pending_read_blocknum,
@@ -312,19 +353,27 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 	/* Adjust the pending read to cover the remaining portion, if any. */
 	stream->pending_read_blocknum += nblocks;
 	stream->pending_read_nblocks -= nblocks;
+
+	return true;
 }
 
 static void
 read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 {
 	while (stream->ios_in_progress < stream->max_ios &&
-		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+		   ((stream->pinned_buffers == 0 && stream->distance > 0) ||
+			stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))
 	{
 		BlockNumber blocknum;
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		/* If we have a pending read that can't be extended, start it now. */
+		Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+			   stream->max_pinned_buffers);
+		if (stream->pending_read_nblocks == io_combine_limit ||
+			(stream->pinned_buffers == 0 &&
+			 stream->pending_read_nblocks == stream->max_pinned_buffers))
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		/* We have to start the pending read before we can build another. */
 		while (stream->pending_read_nblocks > 0)
 		{
-			read_stream_start_pending_read(stream, suppress_advice);
-			suppress_advice = false;
-			if (stream->ios_in_progress == stream->max_ios)
+			if (!read_stream_start_pending_read(stream, suppress_advice) ||
+				stream->ios_in_progress == stream->max_ios)
 			{
-				/* And we've hit the limit.  Rewind, and stop here. */
+				/* And we've hit a buffer or I/O limit.  Rewind and wait. */
 				read_stream_unget_block(stream, blocknum);
 				return;
 			}
+
+			suppress_advice = false;
 		}
 
 		/* This is the start of a new pending read. */
@@ -390,6 +440,14 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		 stream->distance == 0) &&
 		stream->ios_in_progress < stream->max_ios)
 		read_stream_start_pending_read(stream, suppress_advice);
+
+	/*
+	 * There should always be something pinned when we leave this function,
+	 * whether started by this call or not, unless we've hit the end of the
+	 * stream.  In the worst case we can always make progress one buffer at a
+	 * time.
+	 */
+	Assert(stream->pinned_buffers > 0 || stream->distance == 0);
 }
 
 /*
@@ -418,6 +476,7 @@ read_stream_begin_impl(int flags,
 	int			max_ios;
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
+	uint32		max_possible_buffer_limit;
 	Oid			tablespace_id;
 
 	/*
@@ -465,12 +524,23 @@ read_stream_begin_impl(int flags,
 	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
 	max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers);
 
-	/* Don't allow this backend to pin more than its share of buffers. */
+	/*
+	 * Also limit our queue to the maximum number of pins we could possibly
+	 * ever be allowed to acquire according to the buffer manager.  We may not
+	 * really be able to use them all due to other pins held by this backend,
+	 * but we'll check that later in read_stream_start_pending_read().
+	 */
 	if (SmgrIsTemp(smgr))
-		LimitAdditionalLocalPins(&max_pinned_buffers);
+		max_possible_buffer_limit = GetSoftLocalPinLimit();
 	else
-		LimitAdditionalPins(&max_pinned_buffers);
-	Assert(max_pinned_buffers > 0);
+		max_possible_buffer_limit = GetSoftPinLimit();
+	max_pinned_buffers = Min(max_pinned_buffers, max_possible_buffer_limit);
+
+	/*
+	 * The soft limit might be zero on a system configured with more
+	 * connections than buffers.  We need at least one to make progress.
+	 */
+	max_pinned_buffers = Max(1, max_pinned_buffers);
 
 	/*
 	 * We need one extra entry for buffers and per-buffer data, because users
@@ -530,6 +600,7 @@ read_stream_begin_impl(int flags,
 	stream->callback = callback;
 	stream->callback_private_data = callback_private_data;
 	stream->buffered_blocknum = InvalidBlockNumber;
+	stream->temporary = SmgrIsTemp(smgr);
 
 	/*
 	 * Skip the initial ramp-up phase if the caller says we're going to be
@@ -658,6 +729,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			 * arbitrary I/O entry (they're all free).  We don't have to
 			 * adjust pinned_buffers because we're transferring one to caller
 			 * but pinning one more.
+			 *
+			 * In the fast path we don't need to check the pin limit.  We're
+			 * always allowed at least one pin so that progress can be made,
+			 * and that's all we need here.  Although two pins are momentarily
+			 * held at the same time, the model used here is that the stream
+			 * holds only one, and the other now belongs to the caller.
 			 */
 			if (likely(!StartReadBuffer(&stream->ios[0].op,
 										&stream->buffers[oldest_buffer_index],
@@ -858,6 +935,9 @@ read_stream_reset(ReadStream *stream)
 	stream->buffered_blocknum = InvalidBlockNumber;
 	stream->fast_path = false;
 
+	/* There is no point in reading whatever was pending. */
+	stream->pending_read_nblocks = 0;
+
 	/* Unpin anything that wasn't consumed. */
 	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 		ReleaseBuffer(buffer);
-- 
2.39.5

From f4d0c7c81c1b960e71d7f76f1262279b329eac07 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 18 Feb 2025 15:59:13 +1300
Subject: [PATCH v2 3/4] Improve read stream advice for larger random reads.

read_stream.c tries not to issue advice when it thinks the kernel's
readahead should be active, i.e. when using buffered I/O and reading
sequential blocks.  It previously gave up a little too easily: it
should issue advice until it has started running sequential pread()
calls, not just when it's planning to.  The simpler strategy worked for
random regions of size <= io_combine_limit and large sequential regions,
but not so well when reading random regions of size > io_combine_limit.
For a 256kB chunk of data far away from recent access, it would issue
advice for the first half (assuming io_combine_limit=128kB) and then
suffer an I/O stall for the second half.

Discovered by Tomas Vondra's regression testing of many data clustering
patterns using Melanie Plageman's streaming Bitmap Heap Scan patch, with
analysis of the I/O stall-producing pattern from Andres Freund.

Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGJ3HSWciQCz8ekP1Zn7N213RfA4nbuotQawfpq23%2Bw-5Q%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 44 +++++++++++++++++++++------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 6c2b4ec011d..a028217a08e 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -132,6 +132,7 @@ struct ReadStream
 
 	/* Next expected block, for detecting sequential access. */
 	BlockNumber seq_blocknum;
+	BlockNumber seq_start;
 
 	/* The read operation we are currently preparing. */
 	BlockNumber pending_read_blocknum;
@@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 	else
 		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
 
-	/*
-	 * If advice hasn't been suppressed, this system supports it, and this
-	 * isn't a strictly sequential pattern, then we'll issue advice.
-	 */
-	if (!suppress_advice &&
-		stream->advice_enabled &&
-		stream->pending_read_blocknum != stream->seq_blocknum)
+	/* Do we need to issue read-ahead advice? */
+	flags = 0;
+	if (stream->advice_enabled)
+	{
 		flags = READ_BUFFERS_ISSUE_ADVICE;
-	else
-		flags = 0;
+
+		if (stream->pending_read_blocknum == stream->seq_blocknum)
+		{
+			/*
+			 * Suppress advice if our WaitReadBuffers() calls have caught up
+			 * with the first advice we issued for this sequential run.
+			 */
+			if (stream->seq_start == InvalidBlockNumber)
+				suppress_advice = true;
+		}
+		else
+		{
+			/* Random jump, so start a new sequential run. */
+			stream->seq_start = stream->pending_read_blocknum;
+		}
+
+		if (suppress_advice)
+			flags = 0;
+	}
 
 	/* Compute the remaining portion of the per-backend buffer limit. */
 	if (stream->temporary)
@@ -601,6 +616,8 @@ read_stream_begin_impl(int flags,
 	stream->callback_private_data = callback_private_data;
 	stream->buffered_blocknum = InvalidBlockNumber;
 	stream->temporary = SmgrIsTemp(smgr);
+	stream->seq_blocknum = InvalidBlockNumber;
+	stream->seq_start = InvalidBlockNumber;
 
 	/*
 	 * Skip the initial ramp-up phase if the caller says we're going to be
@@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			distance = stream->distance * 2;
 			distance = Min(distance, stream->max_pinned_buffers);
 			stream->distance = distance;
+
+			/*
+			 * If we've caught up with the first advice issued for the current
+			 * sequential run, cancel further advice until the next random
+			 * jump.  The kernel should be able to see the pattern now that
+			 * we're issuing sequential preadv() calls.
+			 */
+			if (stream->ios[io_index].op.blocknum == stream->seq_start)
+				stream->seq_start = InvalidBlockNumber;
 		}
 		else
 		{
-- 
2.39.5

From cdafc8a7ec64dd669ad9dff26a5e7a63fc64fee2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 19 Feb 2025 01:25:40 +1300
Subject: [PATCH v2 4/4] Look ahead more when sequential in read_stream.c.

Previously, sequential reads would cause the look-ahead distance to
fall back to io_combine_limit, on the basis that kernel read-ahead
should start helping.  It also meant that we'd have to ramp the distance
back up when a sequential region was followed by a burst of random
jumps, with little hope of avoiding a stall.  That is not a good
trade-off, and it is incompatible with AIO plans: you have to look
ahead if you are going to start real I/O.

Simplify the algorithm: now only cache hits make the look-ahead distance
drop off, and cache misses still make it grow rapidly.  Random vs
sequential heuristics are no longer taken into consideration while
making that decision.

Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 92 ++++++++++-----------------
 1 file changed, 33 insertions(+), 59 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a028217a08e..170ec805b4c 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -17,30 +17,12 @@
  * pending read.  When that isn't possible, the existing pending read is sent
  * to StartReadBuffers() so that a new one can begin to form.
  *
- * The algorithm for controlling the look-ahead distance tries to classify the
- * stream into three ideal behaviors:
+ * The algorithm for controlling the look-ahead distance is based on recent
+ * cache hits and misses:
  *
- * A) No I/O is necessary, because the requested blocks are fully cached
- * already.  There is no benefit to looking ahead more than one block, so
- * distance is 1.  This is the default initial assumption.
- *
- * B) I/O is necessary, but read-ahead advice is undesirable because the
- * access is sequential and we can rely on the kernel's read-ahead heuristics,
- * or impossible because direct I/O is enabled, or the system doesn't support
- * read-ahead advice.  There is no benefit in looking ahead more than
- * io_combine_limit, because in this case the only goal is larger read system
- * calls.  Looking further ahead would pin many buffers and perform
- * speculative work for no benefit.
- *
- * C) I/O is necessary, it appears to be random, and this system supports
- * read-ahead advice.  We'll look further ahead in order to reach the
- * configured level of I/O concurrency.
- *
- * The distance increases rapidly and decays slowly, so that it moves towards
- * those levels as different I/O patterns are discovered.  For example, a
- * sequential scan of fully cached data doesn't bother looking ahead, but a
- * sequential scan that hits a region of uncached blocks will start issuing
- * increasingly wide read calls until it plateaus at io_combine_limit.
+ * When no I/O is necessary, there is no point in looking ahead more than one
+ * block.  This is the default initial assumption.  Otherwise rapidly increase
+ * the distance to try to benefit from I/O combining and I/O concurrency.
  *
  * The main data structure is a circular queue of buffers of size
  * max_pinned_buffers plus some extra space for technical reasons, ready to be
@@ -329,7 +311,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 	/* Remember whether we need to wait before returning this buffer. */
 	if (!need_wait)
 	{
-		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		/* Look-ahead distance decays, no I/O necessary. */
 		if (stream->distance > 1)
 			stream->distance--;
 	}
@@ -516,6 +498,15 @@ read_stream_begin_impl(int flags,
 	else
 		max_ios = get_tablespace_io_concurrency(tablespace_id);
 
+	/*
+	 * XXX Since we don't have asynchronous I/O yet, if direct I/O is enabled
+	 * then just behave as though I/O concurrency is set to 0.  Otherwise we
+	 * would look ahead pinning many buffers for no benefit, for lack of
+	 * advice and AIO.
+	 */
+	if (io_direct_flags & IO_DIRECT_DATA)
+		max_ios = 0;
+
 	/* Cap to INT16_MAX to avoid overflowing below */
 	max_ios = Min(max_ios, PG_INT16_MAX);
 
@@ -622,7 +613,7 @@ read_stream_begin_impl(int flags,
 	/*
 	 * Skip the initial ramp-up phase if the caller says we're going to be
 	 * reading the whole relation.  This way we start out assuming we'll be
-	 * doing full io_combine_limit sized reads (behavior B).
+	 * doing full io_combine_limit sized reads.
 	 */
 	if (flags & READ_STREAM_FULL)
 		stream->distance = Min(max_pinned_buffers, io_combine_limit);
@@ -713,10 +704,10 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 
 	/*
-	 * A fast path for all-cached scans (behavior A).  This is the same as the
-	 * usual algorithm, but it is specialized for no I/O and no per-buffer
-	 * data, so we can skip the queue management code, stay in the same buffer
-	 * slot and use singular StartReadBuffer().
+	 * A fast path for all-cached scans.  This is the same as the usual
+	 * algorithm, but it is specialized for no I/O and no per-buffer data, so
+	 * we can skip the queue management code, stay in the same buffer slot and
+	 * use singular StartReadBuffer().
 	 */
 	if (likely(stream->fast_path))
 	{
@@ -836,37 +827,20 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		if (++stream->oldest_io_index == stream->max_ios)
 			stream->oldest_io_index = 0;
 
-		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
-		{
-			/* Distance ramps up fast (behavior C). */
-			distance = stream->distance * 2;
-			distance = Min(distance, stream->max_pinned_buffers);
-			stream->distance = distance;
+		/* Look-ahead distance ramps up quickly after we do I/O. */
+		distance = stream->distance * 2;
+		distance = Min(distance, stream->max_pinned_buffers);
+		stream->distance = distance;
 
-			/*
-			 * If we've caught up with the first advice issued for the current
-			 * sequential run, cancel further advice until the next random
-			 * jump.  The kernel should be able to see the pattern now that
-			 * we're issuing sequential preadv() calls.
-			 */
-			if (stream->ios[io_index].op.blocknum == stream->seq_start)
-				stream->seq_start = InvalidBlockNumber;
-		}
-		else
-		{
-			/* No advice; move towards io_combine_limit (behavior B). */
-			if (stream->distance > io_combine_limit)
-			{
-				stream->distance--;
-			}
-			else
-			{
-				distance = stream->distance * 2;
-				distance = Min(distance, io_combine_limit);
-				distance = Min(distance, stream->max_pinned_buffers);
-				stream->distance = distance;
-			}
-		}
+		/*
+		 * If we've caught up with the first advice issued for the current
+		 * sequential run, cancel further advice until the next random jump.
+		 * The kernel should be able to see the pattern now that we're issuing
+		 * sequential preadv() calls.
+		 */
+		if (stream->advice_enabled &&
+			stream->ios[io_index].op.blocknum == stream->seq_start)
+			stream->seq_start = InvalidBlockNumber;
 	}
 
 #ifdef CLOBBER_FREED_MEMORY
-- 
2.39.5
