Hi,

Tomas queried[1] the limit of 256kB (or really 32 blocks) for
io_combine_limit.  Yeah, I think we should increase it and allow
experimentation with larger numbers.  Note that real hardware and
protocols have segment and size limits that can force the kernel to
split your I/Os, so it's not at all a given that it'll help much or at
all to use very large sizes, but YMMV.  I was originally cautious
because I didn't want to make a few stack buffers too big, but arrays
of BlockNumber, struct iovec, and pointer don't seem too excessive at
say 128 (cf whole blocks on the stack, a thing we do, which would
still be many times larger that the relevant arrays).  I was also
anticipating future code that would need to multiply that number by
other terms to allocate shared memory, but after some off-list
discussion, that seems OK: such code should be able to deal with that
using GUCs instead of maximally pessimal allocation.  128 gives a nice
round number of 1M as a maximum transfer size, and comparable systems
seem to have upper limits around that mark.  Patch attached.

[1] 
https://www.postgresql.org/message-id/flat/85696b8e-f1bf-459e-ba97-5608c644c185%40vondra.me#a4c98be8b55095ce14897dab4793c255
From e8bb23b2130023ec497855c239b0299421b6833d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 21 Jan 2025 11:44:35 +1300
Subject: [PATCH] Increase io_combine_limit maximum to 1MB.

The default value of 128kB is not changed, but the fairly arbitrary
upper limit is changed from 256kB to 1MB (assuming default block size,
and assuming the operating system doesn't limit us to a smaller size).
Whether your system can really do 1MB physical transfers without
splitting them up is another matter, but at least having the option to
try is useful.  Other RDBMSes seem to have limits around that number.

The concrete change is to our definition of PG_IOV_MAX, which in turn
controls MAX_IO_COMBINE_LIMIT.  The latter limits the GUC used for
buffer pool I/O, but the former also affects a couple of other places
that work with arrays of struct iovec or smaller objects on the stack
and always use the compile-time maximum.  Therefore we still don't want
to use IOV_MAX directly, which is not under our control and likely to be
an unnecessarily high 1024.  128 blocks seems acceptable though.

The only current Unix on our target list that is known to have a very
low IOV_MAX is older Solaris, though even it changed recently.  Solaris
11.4 SRU69 added preadv/pwritev, 11.4 SRU72 bumped IOV_MAX from 16 to
1024.  Our Solaris build farm animals don't have either of those updates
yet so are currently testing the fallback code (like Windows).  Note
that its open source cousin illumos already made equivalent changes
years ago.

For Windows, we can't use real scatter/gather yet (though it's possible,
in later work), so we continue to provide an IOV_MAX value of 16 and
emulate preadv()/pwritev() with loops.  It's debatable whether we should
increase that number too: it still benefits from I/O combining sometimes
when buffers happen to be consecutive in memory.  Someone would need to
research that.

This change also makes it theoretically possible for read_stream.c's
internal cap of INT16_MAX to be hit.  With effective_io_concurrency set
to 1000 and io_combine_limit set to 1MB, you could theoretically want to
pin 128K buffers at once (= 1GB of data), but there is a cap at ~32K
buffers that stems from the choice of data type.  That could be
revisited in future, but in practice several other limits are very
likely to kick in first.

Suggested-by: Tomas Vondra <to...@vondra.me>
---
 doc/src/sgml/config.sgml                      | 2 ++
 src/backend/storage/aio/read_stream.c         | 9 ++++++---
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 src/include/port/pg_iovec.h                   | 8 ++++++--
 4 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 38244409e3c..5f7d1e41951 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2614,6 +2614,8 @@ include_dir 'conf.d'
        <listitem>
         <para>
          Controls the largest I/O size in operations that combine I/O.
+         The maximum allowed size depends on the operating system and block
+         size, but is typically 1MB on Unix and 128kB on Windows.
          The default is 128kB.
         </para>
        </listitem>
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e4414b2e915..b5a8f3ed941 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -438,9 +438,12 @@ read_stream_begin_impl(int flags,
 	 * Choose the maximum number of buffers we're prepared to pin.  We try to
 	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
 	 * so that we can have a chance to build up a full io_combine_limit sized
-	 * read, even when max_ios is zero.  Be careful not to allow int16 to
-	 * overflow (even though that's not possible with the current GUC range
-	 * limits), allowing also for the spare entry and the overflow space.
+	 * read, even when max_ios is zero.
+	 *
+	 * Be careful not to allow int16 to overflow.  That is possible with the
+	 * current GUC range limits, so this is an artificial limit of ~32k
+	 * buffers and we'd need to adjust the types to exceed that.  We also have
+	 * to allow for the spare entry and the overflow space.
 	 */
 	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
 	max_pinned_buffers = Min(max_pinned_buffers,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c40b7a3121e..bd2fca53612 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -197,7 +197,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
-#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index d9891d3805d..426ef978301 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -33,8 +33,12 @@ struct iovec
 
 #endif
 
-/* Define a reasonable maximum that is safe to use on the stack. */
-#define PG_IOV_MAX Min(IOV_MAX, 32)
+/*
+ * Define a reasonable maximum that is safe to use on the stack in arrays of
+ * struct iovec and other small types.  The operating system may limit us to a
+ * lower value, though most systems have 1024.
+ */
+#define PG_IOV_MAX Min(IOV_MAX, 128)
 
 /*
  * Like preadv(), but with a prefix to remind us of a side-effect: on Windows
-- 
2.39.5

Reply via email to