On Wed, Feb 12, 2025 at 3:22 PM Andres Freund <and...@anarazel.de> wrote:
> On 2025-02-12 13:59:21 +1300, Thomas Munro wrote:
> > How about just maintaining it in a new variable
> > effective_io_combine_limit, whenever either of them is assigned?
>
> Yea, that's probably the least bad way.
>
> I wonder if we should just name that variable io_combine_limit and have the
> GUC be _raw or _guc or such? There's gonna be a fair number of references to
> the variable in code...

Alternatively, we could compute that up front as
stream->io_combine_limit and use it everywhere in read_stream.c.  That
has the advantage that it's fixed for the life of the stream, even if
you change the GUC (e.g. between fetches from a CURSOR that has
streams).  Pretty sure a mid-stream change won't break anything today,
but an increase might make us run out of queue space and limit
concurrency arbitrarily, which is a bit weird now that I focus on it.
Capturing the value we'll use up front seems better on that front.  On
the other hand, other future code might have to remember to compute
that too (write streams, ...), a tiny bit of duplication.  Something
like the attached.  Or ... I guess we could do both things?
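
In case it helps to visualise the "both" option, here's roughly what
the hook-based half could look like -- an untested sketch only, with
invented hook names (io_combine_limit has no assign hook today, and
the attached patch only does the stream-local version):

#include "postgres.h"

#include "storage/bufmgr.h"		/* io_combine_limit, max_io_combine_limit */

/* Derived value; code that combines I/O would read this instead. */
int			effective_io_combine_limit;

/* Hypothetical assign hooks keeping the derived value up to date. */
void
assign_io_combine_limit(int newval, void *extra)
{
	io_combine_limit = newval;
	effective_io_combine_limit = Min(max_io_combine_limit, io_combine_limit);
}

void
assign_max_io_combine_limit(int newval, void *extra)
{
	max_io_combine_limit = newval;
	effective_io_combine_limit = Min(max_io_combine_limit, io_combine_limit);
}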
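
And to make the interaction of the two GUCs concrete, with made-up
settings like these (assuming 8kB blocks), a new stream would compute
Min(64kB, 256kB) and combine at most 64kB per I/O:

# postgresql.conf fragment, values invented for illustration
max_io_combine_limit = 64kB	# system-wide cap, change requires restart
io_combine_limit = 256kB	# user-settable, silently capped to 64kB here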
From 8cfa23a370a4564a0369991e2b0068b48983a0f6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 12 Feb 2025 13:52:22 +1300
Subject: [PATCH v2] Introduce max_io_combine_limit.

The existing io_combine_limit parameter can be set by users.  The new
max_io_combine_limit parameter can be set only at server startup time.
Code that combines I/Os should respect both of them by taking the
smaller value.

This allows the administrator to cap the total I/O size system-wide, but
also provides a way for proposed patches to know what the maximum could
possibly be in cases where it is multiplied by other GUCs to allocate
shared memory, without having to assume that it's as high as the
compile-time MAX_IO_COMBINE_LIMIT value.

The read_stream.c code is changed to compute the minimum value up front
as stream->io_combine_limit instead of using io_combine_limit directly.
That has the extra benefit of remaining stable throughout the lifetime
of the stream even if the user changes it (e.g. while consuming from a
CURSOR).  As previously coded, a mid-stream increase could limit
concurrency artificially just because we run out of queue space too
soon.
---
 doc/src/sgml/config.sgml                      | 23 ++++++++++++++-
 src/backend/commands/variable.c               |  1 -
 src/backend/storage/aio/read_stream.c         | 29 ++++++++++++-------
 src/backend/storage/buffer/bufmgr.c           |  5 ++--
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/bufmgr.h                  |  1 +
 6 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3b557ecabfb..c6de8b9e236 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2605,6 +2605,24 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-max-io-combine-limit" xreflabel="max_io_combine_limit">
+       <term><varname>max_io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>max_io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O, and silently
+         limits the user-settable parameter <varname>io_combine_limit</varname>.
+         This parameter can only be set in
+         the <filename>postgresql.conf</filename> file or on the server
+         command line.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
        <term><varname>io_combine_limit</varname> (<type>integer</type>)
        <indexterm>
@@ -2613,7 +2631,10 @@ include_dir 'conf.d'
        </term>
        <listitem>
         <para>
-         Controls the largest I/O size in operations that combine I/O.
+         Controls the largest I/O size in operations that combine I/O.  If set
+         higher than the <varname>max_io_combine_limit</varname> parameter, the
+         smaller value will silently be used instead, so both may need to be raised
+         to increase the I/O size.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..8f77eb6ee79 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1156,7 +1156,6 @@ assign_maintenance_io_concurrency(int newval, void *extra)
 #endif
 }
 
-
 /*
  * These show hooks just exist because we want to show the values in octal.
  */
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e4414b2e915..ad3da5470c4 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -108,6 +108,7 @@ typedef struct InProgressIO
  */
 struct ReadStream
 {
+	int16		io_combine_limit;
 	int16		max_ios;
 	int16		ios_in_progress;
 	int16		queue_size;
@@ -225,7 +226,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -313,7 +314,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -373,7 +374,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == io_combine_limit ||
+		(stream->pending_read_nblocks == stream->io_combine_limit ||
 		 (stream->pending_read_nblocks == stream->distance &&
 		  stream->pinned_buffers == 0) ||
 		 stream->distance == 0) &&
@@ -402,6 +403,7 @@ read_stream_begin_impl(int flags,
 					   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
+	int			effective_io_combine_limit;
 	size_t		size;
 	int16		queue_size;
 	int			max_ios;
@@ -409,6 +411,12 @@ read_stream_begin_impl(int flags,
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
 
+	/*
+	 * Respect both the system-wide limit and the user-settable limit on I/O
+	 * combining size.
+	 */
+	effective_io_combine_limit = Min(max_io_combine_limit, io_combine_limit);
+
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
 	 * currently means advice to the kernel to tell it that we will soon read.
@@ -442,9 +450,9 @@ read_stream_begin_impl(int flags,
 	 * overflow (even though that's not possible with the current GUC range
 	 * limits), allowing also for the spare entry and the overflow space.
 	 */
-	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Max(max_ios * 4, effective_io_combine_limit);
 	max_pinned_buffers = Min(max_pinned_buffers,
-							 PG_INT16_MAX - io_combine_limit - 1);
+							 PG_INT16_MAX - effective_io_combine_limit - 1);
 
 	/* Give the strategy a chance to limit the number of buffers we pin. */
 	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
@@ -474,14 +482,14 @@ read_stream_begin_impl(int flags,
 	 * io_combine_limit - 1 elements.
 	 */
 	size = offsetof(ReadStream, buffers);
-	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(Buffer) * (queue_size + effective_io_combine_limit - 1);
 	size += sizeof(InProgressIO) * Max(1, max_ios);
 	size += per_buffer_data_size * queue_size;
 	size += MAXIMUM_ALIGNOF * 2;
 	stream = (ReadStream *) palloc(size);
 	memset(stream, 0, offsetof(ReadStream, buffers));
 	stream->ios = (InProgressIO *)
-		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+		MAXALIGN(&stream->buffers[queue_size + effective_io_combine_limit - 1]);
 	if (per_buffer_data_size > 0)
 		stream->per_buffer_data = (void *)
 			MAXALIGN(&stream->ios[Max(1, max_ios)]);
@@ -508,6 +516,7 @@ read_stream_begin_impl(int flags,
 	if (max_ios == 0)
 		max_ios = 1;
 
+	stream->io_combine_limit = effective_io_combine_limit;
 	stream->max_ios = max_ios;
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
@@ -522,7 +531,7 @@ read_stream_begin_impl(int flags,
 	 * doing full io_combine_limit sized reads (behavior B).
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+		stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
 	else
 		stream->distance = 1;
 
@@ -737,14 +746,14 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No advice; move towards io_combine_limit (behavior B). */
-			if (stream->distance > io_combine_limit)
+			if (stream->distance > stream->io_combine_limit)
 			{
 				stream->distance--;
 			}
 			else
 			{
 				distance = stream->distance * 2;
-				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->io_combine_limit);
 				distance = Min(distance, stream->max_pinned_buffers);
 				stream->distance = distance;
 			}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ee83669992b..3ccee6557d0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -159,9 +159,10 @@ int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 
 /*
  * Limit on how many blocks should be handled in single I/O operations.
- * StartReadBuffers() callers should respect it, as should other operations
- * that call smgr APIs directly.
+ * StartReadBuffers() callers should respect both of these, as should other
+ * operations that call smgr APIs directly.
  */
+int			max_io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 
 /*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d472987ed46..cae4ffbdce7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -197,6 +197,8 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#max_io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+					# (change requires restart)
 #io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 
 # - Worker Processes -
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..246192b78ad 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -164,6 +164,7 @@ extern PGDLLIMPORT int maintenance_io_concurrency;
 #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
 #define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
 extern PGDLLIMPORT int io_combine_limit;
+extern PGDLLIMPORT int max_io_combine_limit;
 
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
-- 
2.48.1
