On Wed, Feb 12, 2025 at 3:24 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Wed, Feb 12, 2025 at 3:22 PM Andres Freund <and...@anarazel.de> wrote:
> > On 2025-02-12 13:59:21 +1300, Thomas Munro wrote:
> > > How about just maintaining it in a new variable
> > > effective_io_combine_limit, whenever either of them is assigned?
> >
> > Yea, that's probably the least bad way.
> >
> > I wonder if we should just name that variable io_combine_limit and have the
> > GUC be _raw or _guc or such? There's gonna be a fair number of references to
> > the variable in code...
>
> Alternatively, we could compute that as stream->io_combine_limit and
> use that.  That has the advantage that it's fixed for the life of the
> stream, even if you change it (eg between fetches from a CURSOR that
> has streams).  Pretty sure it won't break anything today,
Unfortunately that turned out to be untrue. :-(

0001 is a patch to address that, which should be back-patched.  It's
hard to come up with a repro for an actual problem, but I'm sure it's
possible; I will try...

0002 and 0003 are new versions of the patches to add
max_io_combine_limit and increase MAX_IO_COMBINE_LIMIT to 1MB.  This
time using the name io_combine_limit_guc, so that io_combine_limit
remains the name referred to in the rest of the tree.  I imagine/hope
that we'll one day figure out how to make at least the easy case (?)
of dependencies on PGC_POSTMASTER GUCs work for PGC_USERSET values,
and then io_combine_limit_guc could be deleted...
From a9d09c0140b36cd5acf797ebb998ba7018590888 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 19 Feb 2025 16:52:43 +1300
Subject: [PATCH v3 1/3] Fix read_stream.c for changing io_combine_limit.

In a couple of places, read_stream.c assumed that io_combine_limit
would be stable during the lifetime of a stream.  That is not true in
at least one unusual case: streams held by CURSORs where you could
change the GUC between FETCH commands, with unpredictable results.

Fix, by storing stream->io_combine_limit and referring only to that
after construction.  This mirrors the treatment of the other important
setting {effective,maintenance}_io_concurrency, which is stored in
stream->max_ios.

One of the cases was the queue overflow space, which was sized for
io_combine_limit and could be overrun if the GUC was increased.  Since
that coding was a little hard to follow, also introduce a variable for
better readability instead of open-coding the arithmetic.  Doing so
revealed an off-by-one thinko while clamping max_pinned_buffers to
INT16_MAX, though that wasn't a live bug due to the current limits on
GUC values.

Back-patch to 17.

Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 38 +++++++++++++++++++--------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 04bdb5e6d4b..36fb9fe152c 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -109,6 +109,7 @@ typedef struct InProgressIO
 struct ReadStream
 {
 	int16		max_ios;
+	int16		io_combine_limit;
 	int16		ios_in_progress;
 	int16		queue_size;
 	int16		max_pinned_buffers;
@@ -236,7 +237,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -324,7 +325,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -384,7 +385,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == io_combine_limit ||
+		(stream->pending_read_nblocks == stream->io_combine_limit ||
 		 (stream->pending_read_nblocks == stream->distance &&
 		  stream->pinned_buffers == 0) ||
 		 stream->distance == 0) &&
@@ -415,6 +416,7 @@ read_stream_begin_impl(int flags,
 	ReadStream *stream;
 	size_t		size;
 	int16		queue_size;
+	int16		queue_overflow;
 	int			max_ios;
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
@@ -445,6 +447,14 @@ read_stream_begin_impl(int flags,
 	/* Cap to INT16_MAX to avoid overflowing below */
 	max_ios = Min(max_ios, PG_INT16_MAX);
 
+	/*
+	 * If starting a multi-block I/O near the end of the queue, we might
+	 * temporarily need extra space for overflowing buffers before they are
+	 * moved to regular circular position.  This is the maximum extra space we
+	 * could need.
+	 */
+	queue_overflow = io_combine_limit - 1;
+
 	/*
 	 * Choose the maximum number of buffers we're prepared to pin.  We try to
 	 * pin fewer if we can, though.  We add one so that we can make progress
@@ -459,7 +469,7 @@ read_stream_begin_impl(int flags,
 	 */
 	max_pinned_buffers = (max_ios + 1) * io_combine_limit;
 	max_pinned_buffers = Min(max_pinned_buffers,
-							 PG_INT16_MAX - io_combine_limit - 1);
+							 PG_INT16_MAX - queue_overflow - 1);
 
 	/* Give the strategy a chance to limit the number of buffers we pin. */
 	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
@@ -485,18 +495,17 @@ read_stream_begin_impl(int flags,
 	 * one big chunk.  Though we have queue_size buffers, we want to be able
 	 * to assume that all the buffers for a single read are contiguous (i.e.
 	 * don't wrap around halfway through), so we allow temporary overflows of
-	 * up to the maximum possible read size by allocating an extra
-	 * io_combine_limit - 1 elements.
+	 * up to the maximum possible overflow size.
 	 */
 	size = offsetof(ReadStream, buffers);
-	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(Buffer) * (queue_size + queue_overflow);
 	size += sizeof(InProgressIO) * Max(1, max_ios);
 	size += per_buffer_data_size * queue_size;
 	size += MAXIMUM_ALIGNOF * 2;
 	stream = (ReadStream *) palloc(size);
 	memset(stream, 0, offsetof(ReadStream, buffers));
 	stream->ios = (InProgressIO *)
-		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+		MAXALIGN(&stream->buffers[queue_size + queue_overflow]);
 	if (per_buffer_data_size > 0)
 		stream->per_buffer_data = (void *)
 			MAXALIGN(&stream->ios[Max(1, max_ios)]);
@@ -523,7 +532,14 @@ read_stream_begin_impl(int flags,
 	if (max_ios == 0)
 		max_ios = 1;
 
+	/*
+	 * Capture stable values for these two GUC-derived numbers for the
+	 * lifetime of this stream, so we don't have to worry about the GUCs
+	 * changing underneath us beyond this point.
+	 */
 	stream->max_ios = max_ios;
+	stream->io_combine_limit = io_combine_limit;
+
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
 	stream->queue_size = queue_size;
@@ -537,7 +553,7 @@ read_stream_begin_impl(int flags,
 	 * doing full io_combine_limit sized reads (behavior B).
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+		stream->distance = Min(max_pinned_buffers, stream->io_combine_limit);
 	else
 		stream->distance = 1;
 
@@ -752,14 +768,14 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No advice; move towards io_combine_limit (behavior B). */
-			if (stream->distance > io_combine_limit)
+			if (stream->distance > stream->io_combine_limit)
 			{
 				stream->distance--;
 			}
 			else
 			{
 				distance = stream->distance * 2;
-				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->io_combine_limit);
 				distance = Min(distance, stream->max_pinned_buffers);
 				stream->distance = distance;
 			}
-- 
2.48.1
From b7a4bd996a0a296eeda9a3938ff965131825cb9f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 12 Feb 2025 13:52:22 +1300
Subject: [PATCH v3 2/3] Introduce max_io_combine_limit.

The existing io_combine_limit parameter can be set by users.  The new
max_io_combine_limit parameter can be set only at server startup time,
and applies as a silent limit on top of that.

Since our GUC system doesn't allow dependencies between GUCs, the
user-settable one assigns a "raw" value to io_combine_limit_guc, and
the minimum of io_combine_limit_guc and max_io_combine_limit is
maintained in io_combine_limit.  In other words, code should continue
to refer to io_combine_limit directly, and io_combine_limit_guc is an
implementation detail that could eventually be removed if the GUC
system ever learns how to deal with dependencies.

max_io_combine_limit allows the administrator to cap the total I/O
size system-wide, and also provides a way for proposed AIO patches to
know what the maximum could possibly be when allocating related memory
at server startup, without having to assume it is the compile-time
maximum MAX_IO_COMBINE_LIMIT.  This paves the way for increasing the
latter.

Reviewed-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 23 ++++++++++++++++++-
 src/backend/commands/variable.c               | 18 +++++++++++++++
 src/backend/storage/buffer/bufmgr.c           |  5 +++-
 src/backend/utils/misc/guc_tables.c           | 16 ++++++++++++-
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/bufmgr.h                  |  4 +++-
 src/include/utils/guc_hooks.h                 |  2 ++
 7 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e55700f35b8..68ba7cc7980 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2631,6 +2631,24 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-io-combine-limit" xreflabel="max_io_combine_limit">
+      <term><varname>max_io_combine_limit</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_io_combine_limit</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls the largest I/O size in operations that combine I/O, and silently
+        limits the user-settable parameter <varname>io_combine_limit</varname>.
+        This parameter can only be set in
+        the <filename>postgresql.conf</filename> file or on the server
+        command line.
+        The default is 128kB.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
       <term><varname>io_combine_limit</varname> (<type>integer</type>)
       <indexterm>
@@ -2639,7 +2657,10 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Controls the largest I/O size in operations that combine I/O.
+        Controls the largest I/O size in operations that combine I/O.  If set
+        higher than the <varname>max_io_combine_limit</varname> parameter, the
+        smaller value will silently be used instead, so both may need to be raised
+        to increase the I/O size.
         The default is 128kB.
        </para>
       </listitem>
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..fab8f2c7958 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1156,6 +1156,24 @@ assign_maintenance_io_concurrency(int newval, void *extra)
 #endif
 }
 
+/*
+ * GUC assign hooks that recompute io_combine_limit whenever
+ * io_combine_limit_guc and max_io_combine_limit are changed.  These are needed
+ * because the GUC subsystem doesn't support dependencies between GUCs, and
+ * they may be assigned in either order.
+ */
+void
+assign_max_io_combine_limit(int newval, void *extra)
+{
+	max_io_combine_limit = newval;
+	io_combine_limit = Min(max_io_combine_limit, io_combine_limit_guc);
+}
+void
+assign_io_combine_limit(int newval, void *extra)
+{
+	io_combine_limit_guc = newval;
+	io_combine_limit = Min(max_io_combine_limit, io_combine_limit_guc);
+}
 
 /*
  * These show hooks just exist because we want to show the values in octal.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7915ed624c1..af97d08b0ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -160,9 +160,12 @@ int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 /*
  * Limit on how many blocks should be handled in single I/O operations.
  * StartReadBuffers() callers should respect it, as should other operations
- * that call smgr APIs directly.
+ * that call smgr APIs directly.  It is computed as the minimum of underlying
+ * GUCs io_combine_limit_guc and max_io_combine_limit.
 */
 int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+int			io_combine_limit_guc = DEFAULT_IO_COMBINE_LIMIT;
+int			max_io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index ad25cbb39c5..d1cb93562d1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3240,6 +3240,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"max_io_combine_limit",
+			PGC_POSTMASTER,
+			RESOURCES_IO,
+			gettext_noop("System-wide limit applied to io_combine_limit."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&max_io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, assign_max_io_combine_limit, NULL
+	},
+
 	{
 		{"io_combine_limit",
 			PGC_USERSET,
@@ -3251,7 +3265,7 @@ struct config_int ConfigureNamesInt[] =
 		&io_combine_limit,
 		DEFAULT_IO_COMBINE_LIMIT,
 		1, MAX_IO_COMBINE_LIMIT,
-		NULL, NULL, NULL
+		NULL, assign_io_combine_limit, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5362ff80519..1d63f9e5f36 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -200,6 +200,8 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#max_io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+					# (change requires restart)
 #io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 
 # - Worker Processes -
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..91a2f548c3a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -163,7 +163,9 @@ extern PGDLLIMPORT int maintenance_io_concurrency;
 
 #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
 #define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
-extern PGDLLIMPORT int io_combine_limit;
+extern PGDLLIMPORT int io_combine_limit;	/* min of the two GUCs below */
+extern PGDLLIMPORT int io_combine_limit_guc;
+extern PGDLLIMPORT int max_io_combine_limit;
 
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 951451a9765..8f52af2fcee 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -82,6 +82,8 @@ extern const char *show_log_timezone(void);
 extern bool check_maintenance_io_concurrency(int *newval, void **extra,
											 GucSource source);
 extern void assign_maintenance_io_concurrency(int newval, void *extra);
+extern void assign_max_io_combine_limit(int newval, void *extra);
+extern void assign_io_combine_limit(int newval, void *extra);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
										 GucSource source);
 extern void assign_max_wal_size(int newval, void *extra);
-- 
2.48.1
From ab110f371c3fd690b193c543da6afad96988c70a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 21 Jan 2025 11:44:35 +1300
Subject: [PATCH v3 3/3] Increase io_combine_limit maximum to 1MB.

The default value of 128kB is not changed, but the arbitrarily chosen
upper limit is changed from 256kB to 1MB (assuming default block size,
and assuming the system IOV_MAX doesn't limit us to a smaller size).
Whether your system can really do 1MB physical transfers without
splitting them up is another matter, but having the option to
experiment is useful.  Other RDBMSes seem to have limits around that
number.

The concrete change is to our definition of PG_IOV_MAX, which in turn
controls MAX_IO_COMBINE_LIMIT.  The latter limits the GUCs used for
buffer pool I/O, but the former also affects a couple of other places
that work with arrays of struct iovec or smaller objects on the stack
and always use the compile-time maximum.  Therefore we still don't
want to use IOV_MAX directly, which is not under our control and
likely to be an unnecessarily high 1024.  128 blocks seems acceptable
though.  The only current Unix on our target list that is known to
have a low IOV_MAX is Solaris before 11.4 SRU72.  Our Solaris build
farm animals don't have that update, or even SRU69 for
preadv/pwritev, so are currently testing the fallback code as used on
Windows.  (illumos already made these changes many years ago.)

For Windows, we can't use real scatter/gather yet (though it's
possible, in later work), so we continue to provide an IOV_MAX value
of 16 and emulate preadv()/pwritev() with loops.  It's debatable
whether we should increase that number too: it still benefits from
I/O combining sometimes when buffers happen to be consecutive in
memory.  Someone would need to research that.

This change also makes it theoretically possible for read_stream.c's
internal cap of INT16_MAX to be hit.  With effective_io_concurrency
set to 1000 and io_combine_limit set to 1MB, you could theoretically
want to pin 128K buffers at once (= 1GB of data), but there is a cap
at ~32K buffers that stems from the choice of data type.  That could
be revisited in future, but in practice several other limits are very
likely to kick in first.

Suggested-by: Tomas Vondra <to...@vondra.me>
Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 4 ++++
 src/backend/storage/aio/read_stream.c         | 7 ++++---
 src/backend/utils/misc/postgresql.conf.sample | 4 ++--
 src/include/port/pg_iovec.h                   | 8 ++++++--
 4 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 68ba7cc7980..a668ff08ab9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2644,6 +2644,8 @@ include_dir 'conf.d'
         This parameter can only be set in
         the <filename>postgresql.conf</filename> file or on the server
         command line.
+        The maximum possible size depends on the operating system and block
+        size, but is typically 1MB on Unix and 128kB on Windows.
         The default is 128kB.
        </para>
       </listitem>
@@ -2661,6 +2663,8 @@ include_dir 'conf.d'
         higher than the <varname>max_io_combine_limit</varname> parameter, the
         smaller value will silently be used instead, so both may need to be raised
         to increase the I/O size.
+        The maximum possible size depends on the operating system and block
+        size, but is typically 1MB on Unix and 128kB on Windows.
         The default is 128kB.
        </para>
       </listitem>
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 36fb9fe152c..6939fb8ccf9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -463,9 +463,10 @@ read_stream_begin_impl(int flags,
 	 * finishes we don't want to have to wait for its buffers to be consumed
 	 * before starting a new one.
 	 *
-	 * Be careful not to allow int16 to overflow (even though that's not
-	 * possible with the current GUC range limits), allowing also for the
-	 * spare entry and the overflow space.
+	 * Be careful not to allow int16 to overflow.  That is possible with the
+	 * current GUC range limits, so this is an artificial limit of ~32k
+	 * buffers and we'd need to adjust the types to exceed that.  We also have
+	 * to allow for the spare entry and the overflow space.
 	 */
 	max_pinned_buffers = (max_ios + 1) * io_combine_limit;
 	max_pinned_buffers = Min(max_pinned_buffers,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1d63f9e5f36..1859c040b42 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -200,9 +200,9 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
-#max_io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#max_io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 					# (change requires restart)
-#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 
 # - Worker Processes -
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index d9891d3805d..df40c7208be 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -33,8 +33,12 @@ struct iovec
 
 #endif
 
-/* Define a reasonable maximum that is safe to use on the stack. */
-#define PG_IOV_MAX Min(IOV_MAX, 32)
+/*
+ * Define a reasonable maximum that is safe to use on the stack in arrays of
+ * struct iovec and other small types.  The operating system could limit us to
+ * a number as low as 16, but most systems have 1024.
+ */
+#define PG_IOV_MAX Min(IOV_MAX, 128)
 
 /*
 * Like preadv(), but with a prefix to remind us of a side-effect: on Windows
-- 
2.48.1